java.lang.Object org.apache.crunch.lib.Sample
public class Sample
Methods for performing random sampling in a distributed fashion, either by accepting each record in a PCollection with an independent probability in order to sample some fraction of the overall data set, or by using reservoir sampling in order to pull a uniform or weighted sample of fixed size from a PCollection of an unknown size. For more details on the reservoir sampling algorithms used by this library, see the A-ES algorithm described in Efraimidis (2012).
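The first approach described above, accepting each record with an independent probability, can be illustrated on a single machine. The following is only a sketch in plain Java, not the Crunch implementation: the class name BernoulliSampleSketch and the use of an Iterable in place of a PCollection are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical single-machine illustration; not part of org.apache.crunch.lib.
class BernoulliSampleSketch {

    /**
     * Keeps each record independently with the given probability.
     * A fixed seed makes the result reproducible, which is handy in tests.
     */
    static <S> List<S> sample(Iterable<S> input, long seed, double probability) {
        if (!(probability > 0.0 && probability < 1.0)) {
            throw new IllegalArgumentException("probability must satisfy 0.0 < p < 1.0");
        }
        Random rng = new Random(seed);
        List<S> out = new ArrayList<>();
        for (S record : input) {
            // nextDouble() is uniform on [0, 1), so this test passes with probability p.
            if (rng.nextDouble() < probability) {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            data.add(i);
        }
        List<Integer> picked = sample(data, 42L, 0.1);
        System.out.println("kept " + picked.size() + " of " + data.size() + " records");
    }
}
```

The size of the output is random: on average it is the probability times the input size, which is why this style suits "sample some fraction" use cases rather than fixed-size samples.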
Constructor Summary |
---|
Sample() |
Method Summary | ||
---|---|---|
static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes) | The most general-purpose of the weighted reservoir sampling patterns, which allows us to choose a random sample of elements for each of N input groups. |
static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed) | Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes. |
static <T> PCollection<T> | reservoirSample(PCollection<T> input, int sampleSize) | Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample. |
static <T> PCollection<T> | reservoirSample(PCollection<T> input, int sampleSize, Long seed) | A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes. |
static <S> PCollection<S> | sample(PCollection<S> input, double probability) | Output records from the given PCollection with the given probability. |
static <S> PCollection<S> | sample(PCollection<S> input, Long seed, double probability) | Output records from the given PCollection using a given seed. |
static <K,V> PTable<K,V> | sample(PTable<K,V> input, double probability) | A PTable<K, V> analogue of the sample function. |
static <K,V> PTable<K,V> | sample(PTable<K,V> input, Long seed, double probability) | A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes. |
static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize) | Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight. |
static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed) | The weighted reservoir sampling function with the seed term exposed for testing purposes. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public Sample()
Method Detail |
---|
public static <S> PCollection<S> sample(PCollection<S> input, double probability)
Output records from the given PCollection with the given probability.
Parameters:
input - The PCollection to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
The PCollection created from sampling

public static <S> PCollection<S> sample(PCollection<S> input, Long seed, double probability)
Output records from the given PCollection using a given seed. Useful for unit testing.
Parameters:
input - The PCollection to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
The PCollection created from sampling

public static <K,V> PTable<K,V> sample(PTable<K,V> input, double probability)
A PTable<K, V> analogue of the sample function.
Parameters:
input - The PTable to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
The PTable created from sampling

public static <K,V> PTable<K,V> sample(PTable<K,V> input, Long seed, double probability)
A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes.
Parameters:
input - The PTable to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
The PTable created from sampling

public static <T> PCollection<T> reservoirSample(PCollection<T> input, int sampleSize)
Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample.
Parameters:
input - The input data
sampleSize - The number of elements to select
Returns:
A PCollection made up of the sampled elements

public static <T> PCollection<T> reservoirSample(PCollection<T> input, int sampleSize, Long seed)
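A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.
Parameters:
input - The input data
sampleSize - The number of elements to select
seed - The test seed
Returns:
A PCollection made up of the sampled elements

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize)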
Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select

public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select
seed - The test seed
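
The weighted reservoir sampling methods above refer to the A-ES algorithm. As a rough single-machine illustration of that idea, the sketch below gives each item with weight w a random key u^(1/w), with u uniform on [0, 1), and keeps the sampleSize items with the largest keys. It is not the Crunch implementation: the class WeightedReservoirSketch, the parallel weights array, and the choice to skip non-positive weights are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical single-machine illustration of the A-ES idea; not Crunch code.
class WeightedReservoirSketch {

    /** An item together with its random A-ES key. */
    static final class Keyed<T> {
        final double key;
        final T value;

        Keyed(double key, T value) {
            this.key = key;
            this.value = value;
        }
    }

    static <T> List<T> weightedReservoirSample(List<T> items, double[] weights,
                                               int sampleSize, long seed) {
        List<T> out = new ArrayList<>();
        if (sampleSize <= 0) {
            return out;
        }
        Random rng = new Random(seed);
        // Min-heap on the key: the head is always the weakest item currently kept.
        PriorityQueue<Keyed<T>> reservoir =
            new PriorityQueue<>((a, b) -> Double.compare(a.key, b.key));
        for (int i = 0; i < items.size(); i++) {
            double w = weights[i];
            if (w <= 0.0) {
                continue; // assumption: non-positive weights are never selected
            }
            double key = Math.pow(rng.nextDouble(), 1.0 / w);
            if (reservoir.size() < sampleSize) {
                reservoir.add(new Keyed<>(key, items.get(i)));
            } else if (key > reservoir.peek().key) {
                reservoir.poll(); // evict the smallest key
                reservoir.add(new Keyed<>(key, items.get(i)));
            }
        }
        for (Keyed<T> k : reservoir) {
            out.add(k.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d", "e");
        double[] weights = {1, 2, 3, 4, 5};
        System.out.println(weightedReservoirSample(items, weights, 3, 7L));
    }
}
```

Because only the current reservoir is held in memory, the input size does not need to be known in advance. Giving every item the same weight reduces this to a uniform sample of fixed size.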

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes)
The most general-purpose of the weighted reservoir sampling patterns, which allows us to choose a random sample of elements for each of N input groups.
Parameters:
input - A PTable with the key a group ID and the value a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
Returns:
A PCollection of the sampled elements for each of the groups

public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed)
Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes.
Parameters:
input - A PTable with the key a group ID and the value a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
seed - The test seed
Returns:
A PCollection of the sampled elements for each of the groups
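
The grouped variant can be pictured as one weighted reservoir per group. The sketch below is a plain-Java illustration under stated assumptions, not the library's code: it assumes group IDs are the indices 0..N-1 into sampleSizes, uses a hypothetical Obs record type in place of the PTable entries, and represents the output pairs as Map.Entry values.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical single-machine illustration; not part of org.apache.crunch.lib.
class GroupedWeightedReservoirSketch {

    /** Hypothetical input record: group ID, payload, and weight. */
    static final class Obs<T> {
        final int group;
        final T value;
        final double weight;

        Obs(int group, T value, double weight) {
            this.group = group;
            this.value = value;
            this.weight = weight;
        }
    }

    static <T> List<Map.Entry<Integer, T>> sample(List<Obs<T>> input,
                                                  int[] sampleSizes, long seed) {
        Random rng = new Random(seed);
        // One min-heap of (key, value) entries per group, ordered by key.
        List<PriorityQueue<Map.Entry<Double, T>>> reservoirs = new ArrayList<>();
        for (int g = 0; g < sampleSizes.length; g++) {
            PriorityQueue<Map.Entry<Double, T>> heap =
                new PriorityQueue<>((a, b) -> Double.compare(a.getKey(), b.getKey()));
            reservoirs.add(heap);
        }
        for (Obs<T> o : input) {
            // Assumption: unknown groups and non-positive weights are skipped.
            if (o.group < 0 || o.group >= sampleSizes.length || o.weight <= 0.0) {
                continue;
            }
            int k = sampleSizes[o.group];
            if (k <= 0) {
                continue;
            }
            double key = Math.pow(rng.nextDouble(), 1.0 / o.weight);
            PriorityQueue<Map.Entry<Double, T>> heap = reservoirs.get(o.group);
            if (heap.size() < k) {
                heap.add(new SimpleEntry<>(key, o.value));
            } else if (key > heap.peek().getKey()) {
                heap.poll(); // evict the smallest key in this group
                heap.add(new SimpleEntry<>(key, o.value));
            }
        }
        // Emit (group ID, sampled element) pairs, group by group.
        List<Map.Entry<Integer, T>> out = new ArrayList<>();
        for (int g = 0; g < sampleSizes.length; g++) {
            for (Map.Entry<Double, T> e : reservoirs.get(g)) {
                out.add(new SimpleEntry<>(g, e.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Obs<String>> input = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            input.add(new Obs<>(0, "a" + i, 1.0 + i));
            input.add(new Obs<>(1, "b" + i, 2.0));
        }
        System.out.println(sample(input, new int[] {3, 5}, 11L));
    }
}
```

Each group is sampled independently of the others, so the sample sizes can differ per group while the input is still read in a single pass.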