| Package | Description |
|---|---|
| org.apache.crunch | Client-facing API and core abstractions. |
| org.apache.crunch.contrib.bloomfilter | Support for creating Bloom Filters. |
| org.apache.crunch.contrib.text | |
| org.apache.crunch.examples | Example applications demonstrating various aspects of Crunch. |
| org.apache.crunch.impl.dist | |
| org.apache.crunch.impl.dist.collect | |
| org.apache.crunch.impl.mem | In-memory Pipeline implementation for rapid prototyping and testing. |
| org.apache.crunch.impl.mr | A Pipeline implementation that runs on Hadoop MapReduce. |
| org.apache.crunch.impl.spark | |
| org.apache.crunch.impl.spark.collect | |
| org.apache.crunch.lib | Joining, sorting, aggregating, and other commonly used functionality. |
| org.apache.crunch.lib.join | Inner and outer joins on collections. |
| org.apache.crunch.util | An assorted set of utilities. |
| Modifier and Type | Interface and Description |
|---|---|
| interface | PGroupedTable<K,V>: The Crunch representation of a grouped PTable, which corresponds to the output of the shuffle phase of a MapReduce job. |
| interface | PTable<K,V>: A sub-interface of PCollection that represents an immutable, distributed multi-map of keys and values. |
| Modifier and Type | Method and Description |
|---|---|
PCollection<S> |
PCollection.aggregate(Aggregator<S> aggregator)
Returns a
PCollection that contains the result of aggregating all values in this instance. |
PCollection<S> |
PCollection.cache()
Marks this data as cached using the default
CachingOptions. |
PCollection<S> |
PCollection.cache(CachingOptions options)
Marks this data as cached using the given
CachingOptions. |
<T> PCollection<T> |
Pipeline.create(Iterable<T> contents,
PType<T> ptype)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
Pipeline.create(Iterable<T> contents,
PType<T> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
Pipeline.emptyPCollection(PType<T> ptype)
Creates an empty
PCollection of the given PType. |
PCollection<S> |
PCollection.filter(FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting
PCollection. |
PCollection<S> |
PCollection.filter(String name,
FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting
PCollection. |
PCollection<K> |
PTable.keys()
Returns a
PCollection made up of the keys in this PTable. |
<T> PCollection<T> |
PCollection.parallelDo(DoFn<S,T> doFn,
PType<T> type)
Applies the given doFn to the elements of this
PCollection and
returns a new PCollection that is the output of this processing. |
<T> PCollection<T> |
PCollection.parallelDo(String name,
DoFn<S,T> doFn,
PType<T> type)
Applies the given doFn to the elements of this
PCollection and
returns a new PCollection that is the output of this processing. |
<T> PCollection<T> |
PCollection.parallelDo(String name,
DoFn<S,T> doFn,
PType<T> type,
ParallelDoOptions options)
Applies the given doFn to the elements of this
PCollection and
returns a new PCollection that is the output of this processing. |
<T> PCollection<T> |
Pipeline.read(Source<T> source)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<T> PCollection<T> |
Pipeline.read(Source<T> source,
String named)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
PCollection<String> |
Pipeline.readTextFile(String pathName)
A convenience method for reading a text file.
|
<S> PCollection<S> |
Pipeline.union(List<PCollection<S>> collections) |
PCollection<S> |
PCollection.union(PCollection<S>... collections)
Returns a
PCollection instance that acts as the union of this
PCollection and the input PCollections. |
PCollection<S> |
PCollection.union(PCollection<S> other)
Returns a
PCollection instance that acts as the union of this
PCollection and the given PCollection. |
PCollection<V> |
PTable.values()
Returns a
PCollection made up of the values in this PTable. |
PCollection<S> |
PCollection.write(Target target)
Write the contents of this
PCollection to the given Target,
using the storage format specified by the target. |
PCollection<S> |
PCollection.write(Target target,
Target.WriteMode writeMode)
Write the contents of this
PCollection to the given Target,
using the given Target.WriteMode to handle existing
targets. |
| Modifier and Type | Method and Description |
|---|---|
Map<String,PCollection<?>> |
PipelineCallable.getAllPCollections()
Returns the mapping of labels to PCollection dependencies for this instance.
|
| Modifier and Type | Method and Description |
|---|---|
<T> void |
Pipeline.cache(PCollection<T> pcollection,
CachingOptions options)
Caches the given PCollection so that it will be processed at most once
during pipeline execution.
|
PipelineCallable<Output> |
PipelineCallable.dependsOn(String label,
PCollection<?> pcollect)
Requires that the given
PCollection be materialized to disk before this instance may be
executed. |
<T> Iterable<T> |
Pipeline.materialize(PCollection<T> pcollection)
Creates the given PCollection and reads the data it contains into the
returned Iterable for client use.
|
PCollection<S> |
PCollection.union(PCollection<S>... collections)
Returns a
PCollection instance that acts as the union of this
PCollection and the input PCollections. |
PCollection<S> |
PCollection.union(PCollection<S> other)
Returns a
PCollection instance that acts as the union of this
PCollection and the given PCollection. |
void |
Pipeline.write(PCollection<?> collection,
Target target)
Write the given collection to the given target on the next pipeline run.
|
void |
Pipeline.write(PCollection<?> collection,
Target target,
Target.WriteMode writeMode)
Write the contents of the
PCollection to the given Target,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists. |
<T> void |
Pipeline.writeTextFile(PCollection<T> collection,
String pathName)
A convenience method for writing a text file.
|
| Modifier and Type | Method and Description |
|---|---|
<S> PCollection<S> |
Pipeline.union(List<PCollection<S>> collections) |
| Modifier and Type | Method and Description |
|---|---|
static <T> PObject<org.apache.hadoop.util.bloom.BloomFilter> |
BloomFilterFactory.createFilter(PCollection<T> collection,
BloomFilterFn<T> filterFn) |
| Modifier and Type | Method and Description |
|---|---|
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T>. |
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
PTypeFamily ptf,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T> that uses the given PTypeFamily. |
| Modifier and Type | Method and Description |
|---|---|
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T>. |
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
PTypeFamily ptf,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T> that uses the given PTypeFamily. |
static <K,V> PTable<K,V> |
Parse.parseTable(String groupName,
PCollection<String> input,
Extractor<Pair<K,V>> extractor)
Parses the lines of the input
PCollection<String> and returns a PTable<K, V> using
the given Extractor<Pair<K, V>>. |
static <K,V> PTable<K,V> |
Parse.parseTable(String groupName,
PCollection<String> input,
PTypeFamily ptf,
Extractor<Pair<K,V>> extractor)
Parses the lines of the input
PCollection<String> and returns a PTable<K, V> using
the given Extractor<Pair<K, V>> that uses the given PTypeFamily. |
| Modifier and Type | Method and Description |
|---|---|
PCollection<org.apache.hadoop.hbase.client.Put> |
WordAggregationHBase.createPut(PTable<String,String> extractedText)
Creates Put objects for insertion into HBase.
|
| Modifier and Type | Method and Description |
|---|---|
<S> PCollection<S> |
DistributedPipeline.create(Iterable<S> contents,
PType<S> ptype) |
<S> PCollection<S> |
DistributedPipeline.create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options) |
<S> PCollection<S> |
DistributedPipeline.emptyPCollection(PType<S> ptype) |
<S> PCollection<S> |
DistributedPipeline.read(Source<S> source) |
<S> PCollection<S> |
DistributedPipeline.read(Source<S> source,
String named) |
PCollection<String> |
DistributedPipeline.readTextFile(String pathName) |
<S> PCollection<S> |
DistributedPipeline.union(List<PCollection<S>> collections) |
| Modifier and Type | Method and Description |
|---|---|
<T> ReadableSource<T> |
DistributedPipeline.getMaterializeSourceTarget(PCollection<T> pcollection)
Retrieve a ReadableSourceTarget that provides access to the contents of a
PCollection. |
void |
DistributedPipeline.write(PCollection<?> pcollection,
Target target) |
void |
DistributedPipeline.write(PCollection<?> pcollection,
Target target,
Target.WriteMode writeMode) |
<T> void |
DistributedPipeline.writeTextFile(PCollection<T> pcollection,
String pathName) |
| Modifier and Type | Method and Description |
|---|---|
<S> PCollection<S> |
DistributedPipeline.union(List<PCollection<S>> collections) |
| Modifier and Type | Class and Description |
|---|---|
| class | BaseDoCollection<S> |
| class | BaseDoTable<K,V> |
| class | BaseGroupedTable<K,V> |
| class | BaseInputCollection<S> |
| class | BaseInputTable<K,V> |
| class | BaseUnionCollection<S> |
| class | BaseUnionTable<K,V> |
| class | EmptyPCollection<T> |
| class | EmptyPTable<K,V> |
| class | PCollectionImpl<S> |
| class | PTableBase<K,V> |
| Modifier and Type | Method and Description |
|---|---|
PCollection<S> |
PCollectionImpl.aggregate(Aggregator<S> aggregator) |
PCollection<S> |
PCollectionImpl.cache() |
PCollection<S> |
PCollectionImpl.cache(CachingOptions options) |
PCollection<S> |
PCollectionImpl.filter(FilterFn<S> filterFn) |
PCollection<S> |
PCollectionImpl.filter(String name,
FilterFn<S> filterFn) |
PCollection<K> |
PTableBase.keys() |
<T> PCollection<T> |
PCollectionImpl.parallelDo(DoFn<S,T> fn,
PType<T> type) |
<T> PCollection<T> |
PCollectionImpl.parallelDo(String name,
DoFn<S,T> fn,
PType<T> type) |
<T> PCollection<T> |
PCollectionImpl.parallelDo(String name,
DoFn<S,T> fn,
PType<T> type,
ParallelDoOptions options) |
PCollection<S> |
PCollectionImpl.union(PCollection<S>... collections) |
PCollection<S> |
PCollectionImpl.union(PCollection<S> other) |
PCollection<V> |
PTableBase.values() |
PCollection<S> |
PCollectionImpl.write(Target target) |
PCollection<S> |
PCollectionImpl.write(Target target,
Target.WriteMode writeMode) |
| Modifier and Type | Method and Description |
|---|---|
PCollection<S> |
PCollectionImpl.union(PCollection<S>... collections) |
PCollection<S> |
PCollectionImpl.union(PCollection<S> other) |
| Modifier and Type | Method and Description |
|---|---|
static <T> PCollection<T> |
MemPipeline.collectionOf(Iterable<T> collect) |
static <T> PCollection<T> |
MemPipeline.collectionOf(T... ts) |
<T> PCollection<T> |
MemPipeline.create(Iterable<T> contents,
PType<T> ptype) |
<T> PCollection<T> |
MemPipeline.create(Iterable<T> iterable,
PType<T> ptype,
CreateOptions options) |
<T> PCollection<T> |
MemPipeline.emptyPCollection(PType<T> ptype) |
<T> PCollection<T> |
MemPipeline.read(Source<T> source) |
<T> PCollection<T> |
MemPipeline.read(Source<T> source,
String named) |
PCollection<String> |
MemPipeline.readTextFile(String pathName) |
static <T> PCollection<T> |
MemPipeline.typedCollectionOf(PType<T> ptype,
Iterable<T> collect) |
static <T> PCollection<T> |
MemPipeline.typedCollectionOf(PType<T> ptype,
T... ts) |
<S> PCollection<S> |
MemPipeline.union(List<PCollection<S>> collections) |
| Modifier and Type | Method and Description |
|---|---|
<T> void |
MemPipeline.cache(PCollection<T> pcollection,
CachingOptions options) |
<T> Iterable<T> |
MemPipeline.materialize(PCollection<T> pcollection) |
void |
MemPipeline.write(PCollection<?> collection,
Target target) |
void |
MemPipeline.write(PCollection<?> collection,
Target target,
Target.WriteMode writeMode) |
<T> void |
MemPipeline.writeTextFile(PCollection<T> collection,
String pathName) |
| Modifier and Type | Method and Description |
|---|---|
<S> PCollection<S> |
MemPipeline.union(List<PCollection<S>> collections) |
| Modifier and Type | Method and Description |
|---|---|
<T> void |
MRPipeline.cache(PCollection<T> pcollection,
CachingOptions options) |
<T> Iterable<T> |
MRPipeline.materialize(PCollection<T> pcollection) |
| Modifier and Type | Method and Description |
|---|---|
<S> PCollection<S> |
SparkPipeline.create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options) |
<S> PCollection<S> |
SparkPipeline.emptyPCollection(PType<S> ptype) |
| Modifier and Type | Method and Description |
|---|---|
<T> void |
SparkPipeline.cache(PCollection<T> pcollection,
CachingOptions options) |
org.apache.spark.storage.StorageLevel |
SparkRuntime.getStorageLevel(PCollection<?> pcollection) |
<T> Iterable<T> |
SparkPipeline.materialize(PCollection<T> pcollection) |
| Constructor and Description |
|---|
SparkRuntime(SparkPipeline pipeline,
org.apache.spark.api.java.JavaSparkContext sparkContext,
org.apache.hadoop.conf.Configuration conf,
Map<PCollectionImpl<?>,Set<Target>> outputTargets,
Map<PCollectionImpl<?>,org.apache.crunch.materialize.MaterializableIterable> toMaterialize,
Map<PCollection<?>,org.apache.spark.storage.StorageLevel> toCache,
Map<PipelineCallable<?>,Set<Target>> allPipelineCallables) |
| Modifier and Type | Class and Description |
|---|---|
| class | CreatedCollection<T>: Represents a Spark-based PCollection that was created from a Java Iterable of values. |
| class | CreatedTable<K,V>: Represents a Spark-based PTable that was created from a Java Iterable of key-value pairs. |
| class | DoCollection<S> |
| class | DoTable<K,V> |
| class | InputCollection<S> |
| class | InputTable<K,V> |
| class | PGroupedTableImpl<K,V> |
| class | UnionCollection<S> |
| class | UnionTable<K,V> |
| Modifier and Type | Method and Description |
|---|---|
static <S> PCollection<S> |
Aggregate.aggregate(PCollection<S> collect,
Aggregator<S> aggregator) |
static <T> PCollection<Tuple3<T,T,T>> |
Set.comm(PCollection<T> coll1,
PCollection<T> coll2)
Find the elements that are common to two sets, like the Unix
comm utility. |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right)
Performs a full cross join on the specified
PCollections (using the
same strategy as Pig's CROSS operator). |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right,
int parallelism)
Performs a full cross join on the specified
PCollections (using the
same strategy as Pig's CROSS operator). |
static <T> PCollection<T> |
Set.difference(PCollection<T> coll1,
PCollection<T> coll2)
Compute the set difference between two sets of elements.
|
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input)
Construct a new
PCollection that contains the unique elements of a
given input PCollection. |
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input,
int flushEvery)
A
distinct operation that gives the client more control over how frequently
elements are flushed to disk in order to allow control over performance or
memory consumption. |
static <T,N extends Number> |
Sample.groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
int[] sampleSizes)
The most general-purpose of the weighted reservoir sampling patterns: it chooses
a random sample of elements for each of N input groups.
|
static <T,N extends Number> |
Sample.groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
int[] sampleSizes,
Long seed)
Same as the other groupedWeightedReservoirSample method, but includes a seed for testing
purposes.
|
static <T> PCollection<T> |
Set.intersection(PCollection<T> coll1,
PCollection<T> coll2)
Compute the intersection of two sets of elements.
|
static <K,V> PCollection<K> |
PTables.keys(PTable<K,V> ptable)
Extract the keys from the given
PTable<K, V> as a PCollection<K>. |
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize)
Select a fixed number of elements from the given
PCollection with each element
equally likely to be included in the sample. |
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize,
Long seed)
A version of the reservoir sampling algorithm that uses a given seed, primarily for
testing purposes.
|
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
double probability)
Output records from the given
PCollection with the given probability. |
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
Long seed,
double probability)
Output records from the given
PCollection using a given seed. |
static <T> PCollection<T> |
Shard.shard(PCollection<T> pc,
int numPartitions)
Creates a
PCollection<T> that has the same contents as its input argument but will
be written to a fixed number of output files. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection)
Sorts the
PCollection using the natural ordering of its elements in ascending order. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
int numReducers,
Sort.Order order)
Sorts the
PCollection using the natural ordering of its elements in
the order specified using the given number of reducers. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
Sort.Order order)
Sorts the
PCollection using the natural order of its elements with the given Order. |
static <K,V1,V2,T> |
SecondarySort.sortAndApply(PTable<K,Pair<V1,V2>> input,
DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn,
PType<T> ptype)
Perform a secondary sort on the given
PTable instance and then apply a
DoFn to the resulting sorted data to yield an output PCollection<T>. |
static <K,V1,V2,T> |
SecondarySort.sortAndApply(PTable<K,Pair<V1,V2>> input,
DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn,
PType<T> ptype,
int numReducers)
Perform a secondary sort on the given
PTable instance and then apply a
DoFn to the resulting sorted data to yield an output PCollection<T>, using
the given number of reducers. |
static <U,V> PCollection<Pair<U,V>> |
Sort.sortPairs(PCollection<Pair<U,V>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Pairs using the specified column
ordering. |
static <V1,V2,V3,V4> |
Sort.sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple4s using the specified column
ordering. |
static <V1,V2,V3> PCollection<Tuple3<V1,V2,V3>> |
Sort.sortTriples(PCollection<Tuple3<V1,V2,V3>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple3s using the specified column
ordering. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
int numReducers,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of TupleNs using the specified column
ordering and a client-specified number of reducers. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of tuples using the specified column ordering. |
static <K,V> PCollection<V> |
PTables.values(PTable<K,V> ptable)
Extract the values from the given
PTable<K, V> as a PCollection<V>. |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize)
Selects a weighted sample of the elements of the given
PCollection, where the second term in
the input Pair is a numerical weight. |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize,
Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.
|
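The reservoirSample entries in the table above describe picking a fixed number of elements so that each element is equally likely to be included, with an optional seed for testing. As a rough, single-machine illustration of that idea only, the sketch below implements classic reservoir sampling (Algorithm R) in plain Java. It is not Crunch's implementation and does not use the Crunch API; the class and method names (`ReservoirSketch`, `sample`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Crunch: uniform reservoir sampling (Algorithm R).
final class ReservoirSketch {
    private ReservoirSketch() {}

    // Returns at most sampleSize elements of the input. Every input element has
    // the same chance of ending up in the result; a fixed seed makes runs repeatable.
    static <T> List<T> sample(Iterable<T> input, int sampleSize, long seed) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must be non-negative");
        }
        Random random = new Random(seed);
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        for (T element : input) {
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill phase: the first sampleSize elements are always kept.
                reservoir.add(element);
            } else {
                // Replacement phase: draw a slot in [0, seen) and overwrite the
                // reservoir only if the slot falls inside it.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, element);
                }
            }
        }
        return reservoir;
    }
}
```

Calling `ReservoirSketch.sample(values, 10, 42L)` twice on the same input returns the same ten elements, which is the property a seed parameter is useful for in tests.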
| Modifier and Type | Method and Description |
|---|---|
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection)
Splits a
PCollection of any Pair of objects into a Pair of
PCollections, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection,
PType<T> firstPType,
PType<U> secondPType)
Splits a
PCollection of any Pair of objects into a Pair of
PCollections, to allow for the output of a DoFn to be handled using
separate channels. |
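To make the "separate channels" idea in the Channels.split entries more concrete, here is a minimal plain-Java sketch of splitting a sequence of pairs into two sequences. It does not use the Crunch API. The `ChannelSplitSketch` class and its tiny `Pair` type are hypothetical stand-ins, and skipping null sides (so that a producer can emit to one channel only) is an assumption of this sketch rather than something stated in the table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Crunch: splits pairs into two "channels".
final class ChannelSplitSketch {
    private ChannelSplitSketch() {}

    // Minimal stand-in for a pair type; either side may be null.
    static final class Pair<A, B> {
        final A first;
        final B second;

        Pair(A first, B second) {
            this.first = first;
            this.second = second;
        }
    }

    // Collects all non-null first elements into one list and all non-null
    // second elements into another (assumption: null means "nothing emitted").
    static <T, U> Pair<List<T>, List<U>> split(Iterable<Pair<T, U>> pairs) {
        List<T> firsts = new ArrayList<>();
        List<U> seconds = new ArrayList<>();
        for (Pair<T, U> pair : pairs) {
            if (pair.first != null) {
                firsts.add(pair.first);
            }
            if (pair.second != null) {
                seconds.add(pair.second);
            }
        }
        return new Pair<>(firsts, seconds);
    }
}
```

A typical use would be a processing step that emits `(result, null)` for good records and `(null, errorMessage)` for bad ones, so that results and errors can be handled separately afterwards.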
| Modifier and Type | Method and Description |
|---|---|
static <S> PCollection<S> |
Aggregate.aggregate(PCollection<S> collect,
Aggregator<S> aggregator) |
static <K,V> PTable<K,V> |
PTables.asPTable(PCollection<Pair<K,V>> pcollect)
Convert the given
PCollection<Pair<K, V>> to a PTable<K, V>. |
static <T> PCollection<Tuple3<T,T,T>> |
Set.comm(PCollection<T> coll1,
PCollection<T> coll2)
Find the elements that are common to two sets, like the Unix
comm utility. |
static <S> PTable<S,Long> |
Aggregate.count(PCollection<S> collect)
Returns a
PTable that contains the unique elements of this collection mapped to a count
of their occurrences. |
static <S> PTable<S,Long> |
Aggregate.count(PCollection<S> collect,
int numPartitions)
Returns a
PTable that contains the unique elements of this collection mapped to a count
of their occurrences. |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right)
Performs a full cross join on the specified
PCollections (using the
same strategy as Pig's CROSS operator). |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right,
int parallelism)
Performs a full cross join on the specified
PCollections (using the
same strategy as Pig's CROSS operator). |
static <T> PCollection<T> |
Set.difference(PCollection<T> coll1,
PCollection<T> coll2)
Compute the set difference between two sets of elements.
|
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input)
Construct a new
PCollection that contains the unique elements of a
given input PCollection. |
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input,
int flushEvery)
A
distinct operation that gives the client more control over how frequently
elements are flushed to disk in order to allow control over performance or
memory consumption. |
static <X> PTable<X,Long> |
TopList.globalToplist(PCollection<X> input)
Create a list of unique items in the input collection with their count, sorted descending by their frequency.
|
static <T> PCollection<T> |
Set.intersection(PCollection<T> coll1,
PCollection<T> coll2)
Compute the intersection of two sets of elements.
|
static <S> PObject<Long> |
Aggregate.length(PCollection<S> collect)
Returns the number of elements in the provided PCollection.
|
static <S> PObject<S> |
Aggregate.max(PCollection<S> collect)
Returns the largest numerical element from the input collection.
|
static <S> PObject<S> |
Aggregate.min(PCollection<S> collect)
Returns the smallest numerical element from the input collection.
|
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize)
Select a fixed number of elements from the given
PCollection with each element
equally likely to be included in the sample. |
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize,
Long seed)
A version of the reservoir sampling algorithm that uses a given seed, primarily for
testing purposes.
|
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
double probability)
Output records from the given
PCollection with the given probability. |
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
Long seed,
double probability)
Output records from the given
PCollection using a given seed. |
static <T> PCollection<T> |
Shard.shard(PCollection<T> pc,
int numPartitions)
Creates a
PCollection<T> that has the same contents as its input argument but will
be written to a fixed number of output files. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection)
Sorts the
PCollection using the natural ordering of its elements in ascending order. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
int numReducers,
Sort.Order order)
Sorts the
PCollection using the natural ordering of its elements in
the order specified using the given number of reducers. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
Sort.Order order)
Sorts the
PCollection using the natural order of its elements with the given Order. |
static <U,V> PCollection<Pair<U,V>> |
Sort.sortPairs(PCollection<Pair<U,V>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Pairs using the specified column
ordering. |
static <V1,V2,V3,V4> |
Sort.sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple4s using the specified column
ordering. |
static <V1,V2,V3> PCollection<Tuple3<V1,V2,V3>> |
Sort.sortTriples(PCollection<Tuple3<V1,V2,V3>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple3s using the specified column
ordering. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
int numReducers,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of TupleNs using the specified column
ordering and a client-specified number of reducers. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of tuples using the specified column ordering. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection)
Splits a
PCollection of any Pair of objects into a Pair of
PCollections, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection,
PType<T> firstPType,
PType<U> secondPType)
Splits a
PCollection of any Pair of objects into a Pair of
PCollections, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize)
Selects a weighted sample of the elements of the given
PCollection, where the second term in
the input Pair is a numerical weight. |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize,
Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.
|
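The Aggregate.count and TopList.globalToplist entries in the table above both describe mapping each unique element to its number of occurrences, with the top list additionally sorted by descending frequency. The following plain-Java sketch shows what such results look like on an in-memory list. It is only an illustration of the described output, not Crunch code; `CountSketch`, `count`, and `topList` are hypothetical names, and the order of elements with equal counts is left unspecified here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Crunch: in-memory counting and top list.
final class CountSketch {
    private CountSketch() {}

    // Maps each unique element to the number of times it occurs in the input.
    static <S> Map<S, Long> count(Iterable<S> input) {
        Map<S, Long> counts = new HashMap<>();
        for (S element : input) {
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }

    // Same counts, but iterated from the most frequent element to the least.
    static <S> Map<S, Long> topList(Iterable<S> input) {
        List<Map.Entry<S, Long>> entries = new ArrayList<>(count(input).entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        Map<S, Long> ordered = new LinkedHashMap<>();
        for (Map.Entry<S, Long> entry : entries) {
            ordered.put(entry.getKey(), entry.getValue());
        }
        return ordered;
    }
}
```

For the input `a, b, a, c, a, b`, `count` yields `a=3`, `b=2`, `c=1`, and `topList` iterates those entries in that order.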
| Modifier and Type | Method and Description |
|---|---|
static <K,U,V,T> PCollection<T> |
OneToManyJoin.oneToManyJoin(PTable<K,U> left,
PTable<K,V> right,
DoFn<Pair<U,Iterable<V>>,T> postProcessFn,
PType<T> ptype)
Performs a join on two tables, where the left table only contains a single
value per key.
|
static <K,U,V,T> PCollection<T> |
OneToManyJoin.oneToManyJoin(PTable<K,U> left,
PTable<K,V> right,
DoFn<Pair<U,Iterable<V>>,T> postProcessFn,
PType<T> ptype,
int numReducers)
Supports a user-specified number of reducers for the one-to-many join.
|
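The oneToManyJoin entries above describe joining a left table that holds a single value per key with a right table, and handing each left value together with its matching right values to a post-processing function. The sketch below models that shape with ordinary Java maps. It is not Crunch's implementation: `OneToManyJoinSketch` and `join` are hypothetical names, a `BiFunction` stands in for the `DoFn<Pair<U,Iterable<V>>,T>` parameter, and the handling of unmatched keys (left keys without matches receive an empty Iterable, right-only keys are dropped) is purely an assumption of this example, so check the Crunch Javadoc for the actual behaviour.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper, not part of Crunch: in-memory one-to-many join.
final class OneToManyJoinSketch {
    private OneToManyJoinSketch() {}

    // left holds exactly one value per key; right holds any number of values
    // per key. postProcess turns (leftValue, matchingRightValues) into an output.
    static <K, U, V, T> List<T> join(Map<K, U> left,
                                     Map<K, List<V>> right,
                                     BiFunction<U, Iterable<V>, T> postProcess) {
        List<T> output = new ArrayList<>();
        for (Map.Entry<K, U> entry : left.entrySet()) {
            // Assumption: a left key with no right-side rows sees an empty Iterable.
            List<V> matches = right.getOrDefault(entry.getKey(), Collections.<V>emptyList());
            output.add(postProcess.apply(entry.getValue(), matches));
        }
        return output;
    }
}
```

Because the right-hand values are passed as an `Iterable`, the post-processing function can stream over them without requiring them all to be held in a list, which is the usual motivation for this join shape.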
| Modifier and Type | Method and Description |
|---|---|
<T> PCollection<T> |
CrunchTool.read(Source<T> source) |
PCollection<String> |
CrunchTool.readTextFile(String pathName) |
| Modifier and Type | Method and Description |
|---|---|
static <T> int |
PartitionUtils.getRecommendedPartitions(PCollection<T> pcollection) |
static <T> int |
PartitionUtils.getRecommendedPartitions(PCollection<T> pcollection,
org.apache.hadoop.conf.Configuration conf) |
<T> Iterable<T> |
CrunchTool.materialize(PCollection<T> pcollection) |
void |
CrunchTool.write(PCollection<?> pcollection,
Target target) |
void |
CrunchTool.writeTextFile(PCollection<?> pcollection,
String pathName) |
Copyright © 2015 The Apache Software Foundation. All Rights Reserved.