public class Sample extends Object
Methods for performing random samples in a distributed fashion, either by accepting each record in a PCollection with an independent probability in order to sample some fraction of the overall data set, or by using reservoir sampling in order to pull a uniform or weighted sample of fixed size from a PCollection of unknown size. For more details on the reservoir sampling algorithms used by this library, see the A-ES algorithm described in Efraimidis (2012).

| Constructor and Description |
|---|
| Sample() | 
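
The two sampling styles named in the class description can be illustrated without a pipeline. The following is a minimal in-memory sketch, not the library's implementation: `SamplingSketch` and its methods are hypothetical names that work on plain `java.util.List` values instead of a PCollection, the range check simply mirrors the documented 0.0 < p < 1.0 constraint, and the uniform reservoir step uses the classic Algorithm R, which may differ from what the library does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** In-memory illustration only; none of this is Crunch code. */
final class SamplingSketch {

  /** Keeps each element independently with the given probability. */
  static <S> List<S> sample(List<S> input, long seed, double probability) {
    if (!(probability > 0.0 && probability < 1.0)) {
      throw new IllegalArgumentException("probability must satisfy 0.0 < p < 1.0");
    }
    Random random = new Random(seed);
    List<S> output = new ArrayList<>();
    for (S element : input) {
      // One independent draw per element, so the output size is itself random.
      if (random.nextDouble() < probability) {
        output.add(element);
      }
    }
    return output;
  }

  /** Uniform fixed-size sample in one pass (classic Algorithm R). */
  static <T> List<T> reservoirSample(List<T> input, int sampleSize, long seed) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be non-negative");
    }
    Random random = new Random(seed);
    List<T> reservoir = new ArrayList<>();
    int seen = 0;
    for (T element : input) {
      seen++;
      if (reservoir.size() < sampleSize) {
        // Fill phase: the first sampleSize elements are always kept.
        reservoir.add(element);
      } else {
        // Replace phase: element number `seen` is kept with probability sampleSize / seen.
        int slot = random.nextInt(seen);
        if (slot < sampleSize) {
          reservoir.set(slot, element);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> data = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      data.add(i);
    }
    System.out.println("Bernoulli sample size: " + sample(data, 42L, 0.1).size());
    System.out.println("Reservoir sample: " + reservoirSample(data, 5, 42L));
  }
}
```

The first method returns a sample whose size varies from run to run unless the seed is fixed, while the second always returns exactly `sampleSize` elements when the input holds at least that many.
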
| Modifier and Type | Method and Description |
|---|---|
| static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes) The most general-purpose of the weighted reservoir sampling patterns; it chooses a random sample of elements for each of N input groups. |
| static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed) Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes. |
| static <T> PCollection<T> | reservoirSample(PCollection<T> input, int sampleSize) Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample. |
| static <T> PCollection<T> | reservoirSample(PCollection<T> input, int sampleSize, Long seed) A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes. |
| static <S> PCollection<S> | sample(PCollection<S> input, double probability) Output records from the given PCollection with the given probability. |
| static <S> PCollection<S> | sample(PCollection<S> input, Long seed, double probability) Output records from the given PCollection using a given seed. |
| static <K,V> PTable<K,V> | sample(PTable<K,V> input, double probability) A PTable<K, V> analogue of the sample function. |
| static <K,V> PTable<K,V> | sample(PTable<K,V> input, Long seed, double probability) A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes. |
| static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize) Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight. |
| static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed) The weighted reservoir sampling function with the seed term exposed for testing purposes. |

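
The weighted methods above rely on the A-ES scheme cited in the class description, in which each element with weight w draws a uniform random u, receives the key u^(1/w), and the elements with the largest keys are kept. The following in-memory sketch shows that idea under stated assumptions: `WeightedReservoirSketch`, `Weighted`, and `Keyed` are hypothetical names, plain Java collections stand in for PCollection and PTable, elements with non-positive weight are skipped, and group IDs are assumed to be indices into `sampleSizes`. It is an illustration, not the library's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;
import java.util.TreeMap;

/** In-memory illustration of A-ES keys; not the Crunch implementation. */
final class WeightedReservoirSketch {

  /** Hypothetical stand-in for a Pair of (element, weight). */
  static final class Weighted<T> {
    final T value;
    final double weight;

    Weighted(T value, double weight) {
      this.value = value;
      this.weight = weight;
    }
  }

  /** An element tagged with its random A-ES key. */
  private static final class Keyed<T> {
    final double key;
    final T value;

    Keyed(double key, T value) {
      this.key = key;
      this.value = value;
    }
  }

  static <T> List<T> weightedReservoirSample(
      List<Weighted<T>> input, int sampleSize, Random random) {
    Comparator<Keyed<T>> byKey = Comparator.comparingDouble(k -> k.key);
    // Min-heap: the smallest key is the first to be evicted.
    PriorityQueue<Keyed<T>> heap = new PriorityQueue<>(byKey);
    for (Weighted<T> obs : input) {
      if (obs.weight <= 0.0) {
        continue; // Assumption of this sketch: non-positive weights are never selected.
      }
      // A-ES key: u^(1/w), so larger weights push the key toward 1.
      double key = Math.pow(random.nextDouble(), 1.0 / obs.weight);
      heap.add(new Keyed<>(key, obs.value));
      if (heap.size() > sampleSize) {
        heap.poll();
      }
    }
    List<T> sample = new ArrayList<>();
    for (Keyed<T> keyed : heap) {
      sample.add(keyed.value);
    }
    return sample;
  }

  static <T> Map<Integer, List<T>> groupedWeightedReservoirSample(
      Map<Integer, List<Weighted<T>>> input, int[] sampleSizes, long seed) {
    Random random = new Random(seed);
    // Sorted iteration keeps the result reproducible for a given seed.
    Map<Integer, List<Weighted<T>>> ordered = new TreeMap<>(input);
    Map<Integer, List<T>> samples = new TreeMap<>();
    for (Map.Entry<Integer, List<Weighted<T>>> group : ordered.entrySet()) {
      int size = sampleSizes[group.getKey()]; // Assumption: group IDs index into sampleSizes.
      samples.put(group.getKey(), weightedReservoirSample(group.getValue(), size, random));
    }
    return samples;
  }

  public static void main(String[] args) {
    List<Weighted<String>> data = new ArrayList<>();
    data.add(new Weighted<>("rare", 1.0));
    data.add(new Weighted<>("common", 10.0));
    data.add(new Weighted<>("typical", 3.0));
    System.out.println(weightedReservoirSample(data, 2, new Random(42L)));
  }
}
```

Keeping only the top keys in a bounded heap means the input is read once and memory stays proportional to the sample size, which is what makes the approach suitable for collections of unknown size.
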
public static <S> PCollection<S> sample(PCollection<S> input, double probability)
Output records from the given PCollection with the given probability.
Parameters:
input - The PCollection to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
PCollection created from sampling

public static <S> PCollection<S> sample(PCollection<S> input, Long seed, double probability)
Output records from the given PCollection using a given seed. Useful for unit testing.
Parameters:
input - The PCollection to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
PCollection created from sampling

public static <K,V> PTable<K,V> sample(PTable<K,V> input, double probability)
A PTable<K, V> analogue of the sample function.
Parameters:
input - The PTable to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
PTable created from sampling

public static <K,V> PTable<K,V> sample(PTable<K,V> input, Long seed, double probability)
A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes.
Parameters:
input - The PTable to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
PTable created from sampling

public static <T> PCollection<T> reservoirSample(PCollection<T> input, int sampleSize)
Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample.
Parameters:
input - The input data
sampleSize - The number of elements to select
Returns:
PCollection made up of the sampled elements

public static <T> PCollection<T> reservoirSample(PCollection<T> input, int sampleSize, Long seed)
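A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.
Parameters:
input - The input data
sampleSize - The number of elements to select
seed - The test seed
Returns:
PCollection made up of the sampled elements

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize)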
Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select
seed - The test seed

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes)
The most general-purpose of the weighted reservoir sampling patterns; it chooses a random sample of elements for each of N input groups.
Parameters:
input - A PTable whose key is a group ID and whose value is a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
Returns:
PCollection of the sampled elements for each of the groups

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed)
Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes.
Parameters:
input - A PTable whose key is a group ID and whose value is a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
seed - The test seed
Returns:
PCollection of the sampled elements for each of the groups

Copyright © 2017 The Apache Software Foundation. All rights reserved.