public class Sample extends Object
Methods for performing random samples in a distributed fashion, either by accepting each record in a PCollection with an independent probability in order to sample some fraction of the overall data set, or by using reservoir sampling in order to pull a uniform or weighted sample of fixed size from a PCollection of an unknown size. For more details on the reservoir sampling algorithms used by this library, see the A-ES algorithm described in Efraimidis (2012).

Constructor and Description |
---|
Sample() |
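The A-ES scheme cited in the class description gives each weighted item a random key u^(1/w), where u is drawn uniformly from [0, 1) and w is the item's weight, and keeps the items with the largest keys. The following is a minimal single-machine sketch of that idea in plain Java, not Crunch's distributed implementation. The class name WeightedReservoirSketch, its sample signature, and the use of a List plus a weight array in place of a PCollection of Pair values are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical helper for illustration only; not part of the Crunch API.
final class WeightedReservoirSketch {

  /** An item paired with its random key u^(1/w). */
  private static final class Keyed<T> {
    final double key;
    final T item;

    Keyed(double key, T item) {
      this.key = key;
      this.item = item;
    }
  }

  /** Returns up to sampleSize items, favoring items with larger weights. */
  static <T> List<T> sample(List<T> items, double[] weights, int sampleSize, long seed) {
    Random rng = new Random(seed);
    // Min-heap on key, so the weakest retained candidate sits at the head.
    PriorityQueue<Keyed<T>> reservoir =
        new PriorityQueue<>(Comparator.comparingDouble((Keyed<T> k) -> k.key));
    for (int i = 0; i < items.size(); i++) {
      double w = weights[i];
      if (w <= 0.0) {
        continue; // assumption: non-positive weights are never selected
      }
      double key = Math.pow(rng.nextDouble(), 1.0 / w);
      if (reservoir.size() < sampleSize) {
        reservoir.add(new Keyed<>(key, items.get(i)));
      } else if (sampleSize > 0 && key > reservoir.peek().key) {
        // The new key beats the smallest retained key, so swap it in.
        reservoir.poll();
        reservoir.add(new Keyed<>(key, items.get(i)));
      }
    }
    List<T> out = new ArrayList<>();
    for (Keyed<T> k : reservoir) {
      out.add(k.item);
    }
    return out;
  }
}
```

With all weights equal, the keys are ordered exactly like the underlying uniform draws, so the same routine yields a uniform fixed-size sample. Fixing the seed makes the result repeatable, which is the role the seed parameter plays in the testing-oriented overloads.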
Modifier and Type | Method and Description |
---|---|
static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes) The most general-purpose of the weighted reservoir sampling patterns, which allows us to choose a random sample of elements for each of N input groups. |
static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed) Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes. |
static <T> PCollection<T> | reservoirSample(PCollection<T> input, int sampleSize) Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample. |
static <T> PCollection<T> | reservoirSample(PCollection<T> input, int sampleSize, Long seed) A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes. |
static <S> PCollection<S> | sample(PCollection<S> input, double probability) Output records from the given PCollection with the given probability. |
static <S> PCollection<S> | sample(PCollection<S> input, Long seed, double probability) Output records from the given PCollection using a given seed. |
static <K,V> PTable<K,V> | sample(PTable<K,V> input, double probability) A PTable<K, V> analogue of the sample function. |
static <K,V> PTable<K,V> | sample(PTable<K,V> input, Long seed, double probability) A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes. |
static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize) Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight. |
static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed) The weighted reservoir sampling function with the seed term exposed for testing purposes. |
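The sample methods summarized above accept each record independently with the given probability. A minimal plain-Java sketch of that behavior follows, working on an in-memory List rather than a PCollection. The class name BernoulliSampleSketch and the explicit range check on probability are assumptions made for illustration, and this is not Crunch's own code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of the Crunch API.
final class BernoulliSampleSketch {

  /** Keeps each record independently with the given probability. */
  static <T> List<T> sample(List<T> input, long seed, double probability) {
    // Assumption: enforce the documented range 0.0 < p < 1.0.
    if (!(probability > 0.0 && probability < 1.0)) {
      throw new IllegalArgumentException("probability must satisfy 0.0 < p < 1.0");
    }
    Random rng = new Random(seed);
    List<T> out = new ArrayList<>();
    for (T record : input) {
      // nextDouble() is uniform on [0, 1), so this accepts with probability p.
      if (rng.nextDouble() < probability) {
        out.add(record);
      }
    }
    return out;
  }
}
```

Calling the sketch twice with the same seed produces the same output, which is what makes seeded overloads convenient in unit tests. The output size varies from run to run with different seeds; a fixed-size result needs one of the reservoir methods instead.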
public static <S> PCollection<S> sample(PCollection<S> input, double probability)
Output records from the given PCollection with the given probability.
Parameters:
input - The PCollection to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
The PCollection created from sampling

public static <S> PCollection<S> sample(PCollection<S> input, Long seed, double probability)
Output records from the given PCollection using a given seed. Useful for unit testing.
Parameters:
input - The PCollection to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
The PCollection created from sampling

public static <K,V> PTable<K,V> sample(PTable<K,V> input, double probability)
A PTable<K, V> analogue of the sample function.
Parameters:
input - The PTable to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
The PTable created from sampling

public static <K,V> PTable<K,V> sample(PTable<K,V> input, Long seed, double probability)
A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes.
Parameters:
input - The PTable to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
The PTable created from sampling

public static <T> PCollection<T> reservoirSample(PCollection<T> input, int sampleSize)
Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample.
Parameters:
input - The input data
sampleSize - The number of elements to select
Returns:
A PCollection made up of the sampled elements

public static <T> PCollection<T> reservoirSample(PCollection<T> input, int sampleSize, Long seed)
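A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.
Parameters:
input - The input data
sampleSize - The number of elements to select
seed - The test seed
Returns:
A PCollection made up of the sampled elements

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize)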
Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select
seed - The test seed

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes)
The most general-purpose of the weighted reservoir sampling patterns, which allows us to choose a random sample of elements for each of N input groups.
Parameters:
input - A PTable whose key is a group ID and whose value is a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
Returns:
A PCollection of the sampled elements for each of the groups

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed)
Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes.
Parameters:
input - A PTable whose key is a group ID and whose value is a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
seed - The test seed
Returns:
A PCollection of the sampled elements for each of the groups

Copyright © 2016 The Apache Software Foundation. All rights reserved.