java.lang.Object
  org.apache.crunch.lib.Sample
public class Sample
Methods for performing random sampling in a distributed fashion, either by accepting each
record in a PCollection with an independent probability in order to sample some
fraction of the overall data set, or by using reservoir sampling in order to pull a uniform
or weighted sample of fixed size from a PCollection of an unknown size. For more details
on the reservoir sampling algorithms used by this library, see the A-ES algorithm described in
Efraimidis (2012).
| Constructor Summary |
|---|
| Sample() |
| Method Summary | | |
|---|---|---|
| static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes) | The most general-purpose of the weighted reservoir sampling patterns, which allows us to choose a random sample of elements for each of N input groups. |
| static <T,N extends Number> PCollection<Pair<Integer,T>> | groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed) | Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes. |
| static <T> PCollection<T> | reservoirSample(PCollection<T> input, int sampleSize) | Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample. |
| static <T> PCollection<T> | reservorSample(PCollection<T> input, int sampleSize, Long seed) | A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes. |
| static <S> PCollection<S> | sample(PCollection<S> input, double probability) | Output records from the given PCollection with the given probability. |
| static <S> PCollection<S> | sample(PCollection<S> input, Long seed, double probability) | Output records from the given PCollection using a given seed. |
| static <K,V> PTable<K,V> | sample(PTable<K,V> input, double probability) | A PTable<K, V> analogue of the sample function. |
| static <K,V> PTable<K,V> | sample(PTable<K,V> input, Long seed, double probability) | A PTable<K, V> analogue of the sample function, with the seed argument exposed for testing purposes. |
| static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize) | Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight. |
| static <T,N extends Number> PCollection<T> | weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed) | The weighted reservoir sampling function with the seed term exposed for testing purposes. |
| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public Sample()
| Method Detail |
|---|
public static <S> PCollection<S> sample(PCollection<S> input,
double probability)
Output records from the given PCollection with the given probability.
Parameters:
input - The PCollection to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
A PCollection created from sampling
public static <S> PCollection<S> sample(PCollection<S> input,
Long seed,
double probability)
Output records from the given PCollection using a given seed. Useful for unit
testing.
Parameters:
input - The PCollection to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
A PCollection created from sampling
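To make the role of the seed and probability arguments concrete, here is a minimal single-machine sketch of independent (Bernoulli) sampling. It is not Crunch code: the class name `BernoulliSampleSketch`, the use of `java.util.List` in place of a PCollection, and the range check on the probability are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Single-machine illustration only; this is not the Crunch implementation.
final class BernoulliSampleSketch {

  // Keeps each record independently with the given probability.
  static <S> List<S> sample(List<S> input, long seed, double probability) {
    // Assumption: reject values outside the documented range 0.0 < p < 1.0.
    if (!(probability > 0.0 && probability < 1.0)) {
      throw new IllegalArgumentException("probability must satisfy 0.0 < p < 1.0");
    }
    Random rng = new Random(seed);
    List<S> out = new ArrayList<>();
    for (S record : input) {
      if (rng.nextDouble() < probability) {
        out.add(record);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Integer> data = new ArrayList<>();
    for (int i = 0; i < 1000; i++) {
      data.add(i);
    }
    List<Integer> first = sample(data, 42L, 0.1);
    List<Integer> second = sample(data, 42L, 0.1);
    // The same seed yields the same sample, which is what makes seeded variants testable.
    System.out.println(first.equals(second));
  }
}
```

The output size is itself random: on average it is the input size times the probability, so this style of sampling suits cases where an approximate fraction is acceptable.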
public static <K,V> PTable<K,V> sample(PTable<K,V> input,
double probability)
A PTable<K, V> analogue of the sample function.
Parameters:
input - The PTable to sample from
probability - The probability (0.0 < p < 1.0)
Returns:
A PTable created from sampling
public static <K,V> PTable<K,V> sample(PTable<K,V> input,
Long seed,
double probability)
A PTable<K, V> analogue of the sample function, with the seed argument
exposed for testing purposes.
Parameters:
input - The PTable to sample from
seed - The seed for the random number generator
probability - The probability (0.0 < p < 1.0)
Returns:
A PTable created from sampling
public static <T> PCollection<T> reservoirSample(PCollection<T> input,
int sampleSize)
Select a fixed number of elements from the given PCollection with each element
equally likely to be included in the sample.
Parameters:
input - The input data
sampleSize - The number of elements to select
Returns:
A PCollection made up of the sampled elements
public static <T> PCollection<T> reservorSample(PCollection<T> input,
int sampleSize,
Long seed)
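A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.
Parameters:
input - The input data
sampleSize - The number of elements to select
seed - The test seed
Returns:
A PCollection made up of the sampled elements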
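As an illustration of fixed-size uniform sampling from input of unknown length, the sketch below uses the classic single-pass "Algorithm R". This is a swapped-in textbook technique chosen for brevity, not necessarily what this library does internally (the class description points to A-ES instead). The class name `ReservoirSampleSketch` and the `Iterable` input are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Single-machine illustration only; this is not the Crunch implementation.
final class ReservoirSampleSketch {

  // Algorithm R: one pass, O(sampleSize) memory, input size not needed up front.
  static <T> List<T> reservoirSample(Iterable<T> input, int sampleSize, long seed) {
    if (sampleSize < 0) {
      throw new IllegalArgumentException("sampleSize must be non-negative");
    }
    Random rng = new Random(seed);
    List<T> reservoir = new ArrayList<>(sampleSize);
    int seen = 0;
    for (T item : input) {
      seen++;
      if (reservoir.size() < sampleSize) {
        // Fill the reservoir with the first sampleSize items.
        reservoir.add(item);
      } else {
        // Item number `seen` replaces a random slot with probability sampleSize / seen.
        int slot = rng.nextInt(seen);
        if (slot < sampleSize) {
          reservoir.set(slot, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> data = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      data.add(i);
    }
    System.out.println(reservoirSample(data, 5, 7L));
  }
}
```

If the input holds fewer than `sampleSize` elements, the sketch simply returns all of them.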
public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize)
Selects a weighted sample of the elements of the given PCollection, where the second term in
the input Pair is a numerical weight.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select
public static <T,N extends Number> PCollection<T> weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize,
Long seed)
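The weighted reservoir sampling function with the seed term exposed for testing purposes.
Parameters:
input - The weighted observations
sampleSize - The number of elements to select
seed - The test seed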
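The class description cites the A-ES algorithm for weighted reservoir sampling. A minimal single-machine sketch of that idea follows: each item with weight w draws a uniform random u and receives the key u^(1/w), and the sampleSize items with the largest keys are kept. The class `WeightedReservoirSketch`, the helper type `Keyed`, the parallel `weights` array (standing in for the second term of each Pair), and the choice to skip non-positive weights are all assumptions for illustration, not the library's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Single-machine illustration of A-ES; this is not the Crunch implementation.
final class WeightedReservoirSketch {

  // Hypothetical helper: an item together with its random sampling key.
  private static final class Keyed<T> implements Comparable<Keyed<T>> {
    final T item;
    final double key;

    Keyed(T item, double key) {
      this.item = item;
      this.key = key;
    }

    @Override
    public int compareTo(Keyed<T> other) {
      return Double.compare(key, other.key);
    }
  }

  static <T> List<T> weightedReservoirSample(
      List<T> items, double[] weights, int sampleSize, long seed) {
    if (items.size() != weights.length) {
      throw new IllegalArgumentException("items and weights must have equal length");
    }
    Random rng = new Random(seed);
    // Min-heap on key: the head is the weakest member of the current sample.
    PriorityQueue<Keyed<T>> heap = new PriorityQueue<Keyed<T>>();
    for (int i = 0; i < weights.length; i++) {
      double w = weights[i];
      if (w <= 0.0) {
        continue; // Assumption: non-positive weights are never sampled.
      }
      double key = Math.pow(rng.nextDouble(), 1.0 / w);
      if (heap.size() < sampleSize) {
        heap.add(new Keyed<T>(items.get(i), key));
      } else if (!heap.isEmpty() && key > heap.peek().key) {
        heap.poll();
        heap.add(new Keyed<T>(items.get(i), key));
      }
    }
    List<T> out = new ArrayList<>();
    for (Keyed<T> k : heap) {
      out.add(k.item);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> items = new ArrayList<>();
    items.add("rare");
    items.add("common");
    items.add("very common");
    double[] weights = {1.0, 10.0, 100.0};
    System.out.println(weightedReservoirSample(items, weights, 2, 3L));
  }
}
```

Larger weights push the key toward 1, so heavier items are more likely to survive in the heap, while the single pass and bounded heap keep memory proportional to the sample size.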
public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
int[] sampleSizes)
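The most general-purpose of the weighted reservoir sampling patterns, which allows us to choose a random sample of elements for each of N input groups.
Parameters:
input - A PTable whose key is a group ID and whose value is a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
Returns:
A PCollection of the sampled elements for each of the groups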
public static <T,N extends Number> PCollection<Pair<Integer,T>> groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
int[] sampleSizes,
Long seed)
Same as the other groupedWeightedReservoirSample method, but includes a seed for testing purposes.
Parameters:
input - A PTable whose key is a group ID and whose value is a weighted observation in that group
sampleSizes - An array of length N, where each entry is the number of elements to include in that group
seed - The test seed
Returns:
A PCollection of the sampled elements for each of the groups
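The grouped variant can be pictured as one weighted reservoir per group. The single-machine sketch below keeps a separate A-ES heap for each group and reads the target size from `sampleSizes`. It assumes, as a guess from the parameter description rather than a documented fact, that group IDs run from 0 to N-1 and index directly into `sampleSizes`. The names `GroupedWeightedReservoirSketch`, `Obs`, and `Keyed` are hypothetical, and observations with out-of-range group IDs or non-positive weights are simply skipped here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Random;

// Single-machine illustration only; this is not the Crunch implementation.
final class GroupedWeightedReservoirSketch {

  // Hypothetical stand-in for one table row: (group ID, (item, weight)).
  static final class Obs<T> {
    final int group;
    final T item;
    final double weight;

    Obs(int group, T item, double weight) {
      this.group = group;
      this.item = item;
      this.weight = weight;
    }
  }

  // Hypothetical helper: an item together with its random sampling key.
  private static final class Keyed<T> implements Comparable<Keyed<T>> {
    final T item;
    final double key;

    Keyed(T item, double key) {
      this.item = item;
      this.key = key;
    }

    @Override
    public int compareTo(Keyed<T> other) {
      return Double.compare(key, other.key);
    }
  }

  static <T> Map<Integer, List<T>> sample(List<Obs<T>> input, int[] sampleSizes, long seed) {
    Random rng = new Random(seed);
    // One min-heap per group; assumption: group IDs are 0..N-1.
    List<PriorityQueue<Keyed<T>>> heaps = new ArrayList<>();
    for (int g = 0; g < sampleSizes.length; g++) {
      heaps.add(new PriorityQueue<Keyed<T>>());
    }
    for (Obs<T> o : input) {
      if (o.group < 0 || o.group >= sampleSizes.length || o.weight <= 0.0) {
        continue; // Assumption: skip unknown groups and non-positive weights.
      }
      PriorityQueue<Keyed<T>> heap = heaps.get(o.group);
      double key = Math.pow(rng.nextDouble(), 1.0 / o.weight);
      if (heap.size() < sampleSizes[o.group]) {
        heap.add(new Keyed<T>(o.item, key));
      } else if (!heap.isEmpty() && key > heap.peek().key) {
        heap.poll();
        heap.add(new Keyed<T>(o.item, key));
      }
    }
    Map<Integer, List<T>> out = new HashMap<>();
    for (int g = 0; g < sampleSizes.length; g++) {
      List<T> picked = new ArrayList<>();
      for (Keyed<T> k : heaps.get(g)) {
        picked.add(k.item);
      }
      out.put(g, picked);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Obs<String>> rows = new ArrayList<>();
    rows.add(new Obs<String>(0, "a", 1.0));
    rows.add(new Obs<String>(0, "b", 5.0));
    rows.add(new Obs<String>(1, "c", 2.0));
    rows.add(new Obs<String>(1, "d", 2.0));
    rows.add(new Obs<String>(1, "e", 8.0));
    System.out.println(sample(rows, new int[] {1, 2}, 9L));
  }
}
```

Returning a map from group ID to sampled items is a convenience of this sketch; the documented method instead returns a flat PCollection of (group ID, element) pairs.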