This project has retired. For details please refer to its Attic page.
Sample (Apache Crunch 0.7.0 API)

org.apache.crunch.lib
Class Sample

java.lang.Object
  extended by org.apache.crunch.lib.Sample

public class Sample
extends Object

Methods for performing random sampling in a distributed fashion, either by accepting each record in a PCollection with an independent probability in order to sample some fraction of the overall data set, or by using reservoir sampling in order to pull a uniform or weighted sample of fixed size from a PCollection of an unknown size. For more details on the reservoir sampling algorithms used by this library, see the A-ES algorithm described in Efraimidis (2012).


Constructor Summary
Sample()
           
 
Method Summary
static
<T,N extends Number>
PCollection<Pair<Integer,T>>
groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes)
          The most general-purpose of the weighted reservoir sampling patterns: chooses a random weighted sample of elements for each of N input groups.
static
<T,N extends Number>
PCollection<Pair<Integer,T>>
groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed)
          Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes.
static
<T> PCollection<T>
reservoirSample(PCollection<T> input, int sampleSize)
          Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample.
static
<T> PCollection<T>
reservorSample(PCollection<T> input, int sampleSize, Long seed)
          A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.
static
<S> PCollection<S>
sample(PCollection<S> input, double probability)
          Output records from the given PCollection with the given probability.
static
<S> PCollection<S>
sample(PCollection<S> input, Long seed, double probability)
          Output records from the given PCollection using a given seed.
static
<K,V> PTable<K,V>
sample(PTable<K,V> input, double probability)
          A PTable<K, V> analogue of the sample function.
static
<K,V> PTable<K,V>
sample(PTable<K,V> input, Long seed, double probability)
          A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes.
static
<T,N extends Number>
PCollection<T>
weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize)
          Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight.
static
<T,N extends Number>
PCollection<T>
weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed)
          The weighted reservoir sampling function with the seed term exposed for testing purposes.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Sample

public Sample()
Method Detail

sample

public static <S> PCollection<S> sample(PCollection<S> input,
                                        double probability)
Output records from the given PCollection with the given probability.

Parameters:
input - The PCollection to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
The output PCollection created from sampling

sample

public static <S> PCollection<S> sample(PCollection<S> input,
                                        Long seed,
                                        double probability)
Output records from the given PCollection using a given seed. Useful for unit testing.

Parameters:
input - The PCollection to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
The output PCollection created from sampling
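
The "accept each record with an independent probability" behaviour of the sample overloads above can be illustrated with a small in-memory sketch. It uses only the JDK and plain lists in place of a PCollection; the class BernoulliSampleSketch and its method are hypothetical names invented for this example, and the code shows the general technique rather than how Crunch implements it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative helper only; this class is not part of Apache Crunch.
final class BernoulliSampleSketch {

  // Keeps each element independently with the given probability (0.0 < p < 1.0).
  static <S> List<S> sample(List<S> input, long seed, double probability) {
    if (!(probability > 0.0 && probability < 1.0)) {
      throw new IllegalArgumentException("probability must satisfy 0.0 < p < 1.0");
    }
    Random random = new Random(seed); // a fixed seed makes the result repeatable
    List<S> output = new ArrayList<>();
    for (S element : input) {
      // nextDouble() is uniform on [0, 1), so this test succeeds with probability p.
      if (random.nextDouble() < probability) {
        output.add(element);
      }
    }
    return output;
  }

  public static void main(String[] args) {
    List<Integer> input = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      input.add(i);
    }
    List<Integer> sampled = sample(input, 42L, 0.1);
    // The output size is itself random: about 1,000 here, but rarely exactly 1,000.
    System.out.println("kept " + sampled.size() + " of " + input.size());
  }
}
```

Note that this style of sampling fixes the expected fraction of records kept, not the exact count. When an exact sample size is required, the reservoir methods are the better fit.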

sample

public static <K,V> PTable<K,V> sample(PTable<K,V> input,
                                       double probability)
A PTable<K, V> analogue of the sample function.

Parameters:
input - The PTable to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
The output PTable created from sampling

sample

public static <K,V> PTable<K,V> sample(PTable<K,V> input,
                                       Long seed,
                                       double probability)
A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes.

Parameters:
input - The PTable to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
The output PTable created from sampling

reservoirSample

public static <T> PCollection<T> reservoirSample(PCollection<T> input,
                                                 int sampleSize)
Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample.

Parameters:
input - The input data
sampleSize - The number of elements to select
Returns:
A PCollection made up of the sampled elements

reservorSample

public static <T> PCollection<T> reservorSample(PCollection<T> input,
                                                int sampleSize,
                                                Long seed)
A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.

Parameters:
input - The input data
sampleSize - The number of elements to select
seed - The test seed
Returns:
A PCollection made up of the sampled elements
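
As a rough illustration of fixed-size uniform sampling from an input of unknown length, the following sketch implements the classic single-pass reservoir algorithm (often called Algorithm R) over an in-memory Iterable. The class ReservoirSampleSketch is a hypothetical name, only the JDK is used, and the approach may differ from what Crunch does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Illustrative helper only; this class is not part of Apache Crunch.
final class ReservoirSampleSketch {

  // Returns up to sampleSize elements, each input element being equally likely to appear.
  static <T> List<T> reservoirSample(Iterable<T> input, int sampleSize, long seed) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be non-negative");
    }
    Random random = new Random(seed);
    List<T> reservoir = new ArrayList<>();
    long seen = 0; // number of elements read so far
    for (T element : input) {
      seen++;
      if (reservoir.size() < sampleSize) {
        // Fill phase: the first sampleSize elements are always kept.
        reservoir.add(element);
      } else {
        // Pick a slot uniformly in [0, seen); the new element replaces an old
        // one only when that slot falls inside the reservoir.
        long slot = (long) (random.nextDouble() * seen);
        if (slot < sampleSize) {
          reservoir.set((int) slot, element);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");
    // Always prints exactly three of the eight words; which three depends on the seed.
    System.out.println(reservoirSample(words, 3, 1234L));
  }
}
```

If the input holds fewer than sampleSize elements, this sketch simply returns all of them.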

weightedReservoirSample

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input,
                                                                          int sampleSize)
Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight.

Parameters:
input - The weighted observations, each a Pair of an element and its numerical weight
sampleSize - The number of elements to select
Returns:
A random sample of the given size that respects the weighting values

weightedReservoirSample

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input,
                                                                          int sampleSize,
                                                                          Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.

Parameters:
input - The weighted observations, each a Pair of an element and its numerical weight
sampleSize - The number of elements to select
seed - The test seed
Returns:
A random sample of the given size that respects the weighting values
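
The class description points to the A-ES algorithm for weighted reservoir sampling. A minimal in-memory sketch of that idea follows: each element with weight w draws a uniform random u and receives the key u^(1/w), and the sampleSize elements with the largest keys are kept. The class WeightedReservoirSketch is a hypothetical name, Map.Entry stands in for Crunch's Pair, and skipping non-positive weights is an assumption of this example rather than documented library behaviour.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;

// Illustrative helper only; this class is not part of Apache Crunch.
final class WeightedReservoirSketch {

  // Each input entry is (element, weight); returns up to sampleSize elements.
  static <T, N extends Number> List<T> sample(
      List<Map.Entry<T, N>> input, int sampleSize, long seed) {
    Random random = new Random(seed);
    // Min-heap on the key, so the weakest retained candidate sits at the head.
    PriorityQueue<Map.Entry<Double, T>> heap =
        new PriorityQueue<>((a, b) -> Double.compare(a.getKey(), b.getKey()));
    for (Map.Entry<T, N> observation : input) {
      double weight = observation.getValue().doubleValue();
      if (!(weight > 0.0)) {
        continue; // assumption: zero, negative, or NaN weights are never selected
      }
      // The A-ES key is u^(1/w); log(u)/w orders elements identically and avoids underflow.
      double key = Math.log(random.nextDouble()) / weight;
      if (heap.size() < sampleSize) {
        heap.add(new SimpleEntry<>(key, observation.getKey()));
      } else if (!heap.isEmpty() && key > heap.peek().getKey()) {
        heap.poll(); // evict the smallest key to make room
        heap.add(new SimpleEntry<>(key, observation.getKey()));
      }
    }
    List<T> sampled = new ArrayList<>();
    for (Map.Entry<Double, T> entry : heap) {
      sampled.add(entry.getValue()); // heap iteration order is arbitrary
    }
    return sampled;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> input = new ArrayList<>();
    input.add(new SimpleEntry<>("common", 90));
    input.add(new SimpleEntry<>("uncommon", 9));
    input.add(new SimpleEntry<>("rare", 1));
    // With one slot, "common" is chosen roughly 90% of the time across seeds.
    System.out.println(sample(input, 1, 2024L));
  }
}
```

Keeping a heap of size sampleSize means memory use depends on the sample size and not on the size of the input.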

groupedWeightedReservoirSample

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
                                                                                               int[] sampleSizes)
The most general-purpose of the weighted reservoir sampling patterns: chooses a random weighted sample of elements for each of N input groups.

Parameters:
input - A PTable with the key a group ID and the value a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to sample from the corresponding group
Returns:
A PCollection of the sampled elements for each of the groups

groupedWeightedReservoirSample

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
                                                                                               int[] sampleSizes,
                                                                                               Long seed)
Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes.

Parameters:
input - A PTable with the key a group ID and the value a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to sample from the corresponding group
seed - The test seed
Returns:
A PCollection of the sampled elements for each of the groups
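
The grouped variant can be pictured as one weighted reservoir per group. The sketch below is an in-memory illustration under stated assumptions: group IDs are taken to be the indices 0 to N-1 into sampleSizes, non-positive weights are skipped, and the class GroupedWeightedReservoirSketch with its nested Observation type is a hypothetical stand-in for a PTable of weighted pairs. It is not Crunch's implementation.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;

// Illustrative helper only; this class is not part of Apache Crunch.
final class GroupedWeightedReservoirSketch {

  // Hypothetical record: one weighted observation tagged with its group ID.
  static final class Observation<T> {
    final int group;
    final T value;
    final double weight;

    Observation(int group, T value, double weight) {
      this.group = group;
      this.value = value;
      this.weight = weight;
    }
  }

  // Returns (groupId, element) pairs, at most sampleSizes[g] of them for group g.
  static <T> List<Map.Entry<Integer, T>> sample(
      List<Observation<T>> input, int[] sampleSizes, long seed) {
    Random random = new Random(seed);
    // One min-heap of (key, element) per group; assumes group IDs are 0..N-1.
    List<PriorityQueue<Map.Entry<Double, T>>> heaps = new ArrayList<>();
    for (int g = 0; g < sampleSizes.length; g++) {
      PriorityQueue<Map.Entry<Double, T>> heap =
          new PriorityQueue<>((a, b) -> Double.compare(a.getKey(), b.getKey()));
      heaps.add(heap);
    }
    for (Observation<T> obs : input) {
      if (obs.group < 0 || obs.group >= sampleSizes.length) {
        throw new IllegalArgumentException("unknown group ID: " + obs.group);
      }
      if (!(obs.weight > 0.0)) {
        continue; // assumption: non-positive weights are never selected
      }
      // Same ordering as the A-ES key u^(1/w), computed in log space.
      double key = Math.log(random.nextDouble()) / obs.weight;
      PriorityQueue<Map.Entry<Double, T>> heap = heaps.get(obs.group);
      if (heap.size() < sampleSizes[obs.group]) {
        heap.add(new SimpleEntry<>(key, obs.value));
      } else if (!heap.isEmpty() && key > heap.peek().getKey()) {
        heap.poll();
        heap.add(new SimpleEntry<>(key, obs.value));
      }
    }
    List<Map.Entry<Integer, T>> sampled = new ArrayList<>();
    for (int g = 0; g < heaps.size(); g++) {
      for (Map.Entry<Double, T> entry : heaps.get(g)) {
        sampled.add(new SimpleEntry<>(g, entry.getValue()));
      }
    }
    return sampled;
  }

  public static void main(String[] args) {
    List<Observation<String>> input = new ArrayList<>();
    for (int i = 0; i < 20; i++) {
      input.add(new Observation<>(i % 2, "item" + i, 1.0 + i));
    }
    // Two elements from group 0 and three from group 1.
    System.out.println(sample(input, new int[] {2, 3}, 99L));
  }
}
```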


Copyright © 2013 The Apache Software Foundation. All Rights Reserved.