Package | Description |
---|---|
org.apache.crunch |
Client-facing API and core abstractions.
|
org.apache.crunch.contrib.bloomfilter |
Support for creating Bloom Filters.
|
org.apache.crunch.contrib.text | |
org.apache.crunch.examples |
Example applications demonstrating various aspects of Crunch.
|
org.apache.crunch.impl.dist | |
org.apache.crunch.impl.dist.collect | |
org.apache.crunch.impl.mem |
In-memory Pipeline implementation for rapid prototyping and testing.
|
org.apache.crunch.impl.mr |
A Pipeline implementation that runs on Hadoop MapReduce.
|
org.apache.crunch.impl.spark | |
org.apache.crunch.impl.spark.collect | |
org.apache.crunch.lambda |
Alternative Crunch API using Java 8 features to allow construction of pipelines using lambda functions and method
references.
|
org.apache.crunch.lib |
Joining, sorting, aggregating, and other commonly used functionality.
|
org.apache.crunch.lib.join |
Inner and outer joins on collections.
|
org.apache.crunch.util |
An assorted set of utilities.
|
Modifier and Type | Interface and Description |
---|---|
interface |
PGroupedTable<K,V>
The Crunch representation of a grouped
PTable , which corresponds to the output of
the shuffle phase of a MapReduce job. |
interface |
PTable<K,V>
A sub-interface of
PCollection that represents an immutable,
distributed multi-map of keys and values. |
Modifier and Type | Method and Description |
---|---|
PCollection<S> |
PCollection.aggregate(Aggregator<S> aggregator)
Returns a
PCollection that contains the result of aggregating all values in this instance. |
PCollection<S> |
PCollection.cache()
Marks this data as cached using the default
CachingOptions . |
PCollection<S> |
PCollection.cache(CachingOptions options)
Marks this data as cached using the given
CachingOptions . |
<T> PCollection<T> |
Pipeline.create(Iterable<T> contents,
PType<T> ptype)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
Pipeline.create(Iterable<T> contents,
PType<T> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
Pipeline.emptyPCollection(PType<T> ptype)
Creates an empty
PCollection of the given PType . |
PCollection<S> |
PCollection.filter(FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting
PCollection . |
PCollection<S> |
PCollection.filter(String name,
FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting
PCollection . |
PCollection<K> |
PTable.keys()
Returns a
PCollection made up of the keys in this PTable. |
<T> PCollection<T> |
PCollection.parallelDo(DoFn<S,T> doFn,
PType<T> type)
Applies the given doFn to the elements of this
PCollection and
returns a new PCollection that is the output of this processing. |
<T> PCollection<T> |
PCollection.parallelDo(String name,
DoFn<S,T> doFn,
PType<T> type)
Applies the given doFn to the elements of this
PCollection and
returns a new PCollection that is the output of this processing. |
<T> PCollection<T> |
PCollection.parallelDo(String name,
DoFn<S,T> doFn,
PType<T> type,
ParallelDoOptions options)
Applies the given doFn to the elements of this
PCollection and
returns a new PCollection that is the output of this processing. |
<T> PCollection<T> |
Pipeline.read(Source<T> source)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<T> PCollection<T> |
Pipeline.read(Source<T> source,
String named)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
PCollection<String> |
Pipeline.readTextFile(String pathName)
A convenience method for reading a text file.
|
<S> PCollection<S> |
Pipeline.union(List<PCollection<S>> collections) |
PCollection<S> |
PCollection.union(PCollection<S>... collections)
Returns a
PCollection instance that acts as the union of this
PCollection and the input PCollection s. |
PCollection<S> |
PCollection.union(PCollection<S> other)
Returns a
PCollection instance that acts as the union of this
PCollection and the given PCollection . |
PCollection<V> |
PTable.values()
Returns a
PCollection made up of the values in this PTable. |
PCollection<S> |
PCollection.write(Target target)
Write the contents of this
PCollection to the given Target ,
using the storage format specified by the target. |
PCollection<S> |
PCollection.write(Target target,
Target.WriteMode writeMode)
Write the contents of this
PCollection to the given Target ,
using the given Target.WriteMode to handle existing
targets. |
Modifier and Type | Method and Description |
---|---|
Map<String,PCollection<?>> |
PipelineCallable.getAllPCollections()
Returns the mapping of labels to PCollection dependencies for this instance.
|
Modifier and Type | Method and Description |
---|---|
<T> void |
Pipeline.cache(PCollection<T> pcollection,
CachingOptions options)
Caches the given PCollection so that it will be processed at most once
during pipeline execution.
|
PipelineCallable<Output> |
PipelineCallable.dependsOn(String label,
PCollection<?> pcollect)
Requires that the given
PCollection be materialized to disk before this instance may be
executed. |
<T> Iterable<T> |
Pipeline.materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the
returned Collection instance for client use.
|
PCollection<S> |
PCollection.union(PCollection<S>... collections)
Returns a
PCollection instance that acts as the union of this
PCollection and the input PCollection s. |
PCollection<S> |
PCollection.union(PCollection<S> other)
Returns a
PCollection instance that acts as the union of this
PCollection and the given PCollection . |
void |
Pipeline.write(PCollection<?> collection,
Target target)
Write the given collection to the given target on the next pipeline run.
|
void |
Pipeline.write(PCollection<?> collection,
Target target,
Target.WriteMode writeMode)
Write the contents of the
PCollection to the given Target ,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists. |
<T> void |
Pipeline.writeTextFile(PCollection<T> collection,
String pathName)
A convenience method for writing a text file.
|
Modifier and Type | Method and Description |
---|---|
<S> PCollection<S> |
Pipeline.union(List<PCollection<S>> collections) |
Modifier and Type | Method and Description |
---|---|
static <T> PObject<org.apache.hadoop.util.bloom.BloomFilter> |
BloomFilterFactory.createFilter(PCollection<T> collection,
BloomFilterFn<T> filterFn) |
Modifier and Type | Method and Description |
---|---|
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T> . |
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
PTypeFamily ptf,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T> that uses the given PTypeFamily . |
Modifier and Type | Method and Description |
---|---|
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T> . |
static <T> PCollection<T> |
Parse.parse(String groupName,
PCollection<String> input,
PTypeFamily ptf,
Extractor<T> extractor)
Parses the lines of the input
PCollection<String> and returns a PCollection<T> using
the given Extractor<T> that uses the given PTypeFamily . |
static <K,V> PTable<K,V> |
Parse.parseTable(String groupName,
PCollection<String> input,
Extractor<Pair<K,V>> extractor)
Parses the lines of the input
PCollection<String> and returns a PTable<K, V> using
the given Extractor<Pair<K, V>> . |
static <K,V> PTable<K,V> |
Parse.parseTable(String groupName,
PCollection<String> input,
PTypeFamily ptf,
Extractor<Pair<K,V>> extractor)
Parses the lines of the input
PCollection<String> and returns a PTable<K, V> using
the given Extractor<Pair<K, V>> that uses the given PTypeFamily . |
Modifier and Type | Method and Description |
---|---|
PCollection<org.apache.hadoop.hbase.client.Put> |
WordAggregationHBase.createPut(PTable<String,String> extractedText)
Create puts in order to insert them in hbase.
|
Modifier and Type | Method and Description |
---|---|
<S> PCollection<S> |
DistributedPipeline.create(Iterable<S> contents,
PType<S> ptype) |
<S> PCollection<S> |
DistributedPipeline.create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options) |
<S> PCollection<S> |
DistributedPipeline.emptyPCollection(PType<S> ptype) |
<S> PCollection<S> |
DistributedPipeline.read(Source<S> source) |
<S> PCollection<S> |
DistributedPipeline.read(Source<S> source,
String named) |
PCollection<String> |
DistributedPipeline.readTextFile(String pathName) |
<S> PCollection<S> |
DistributedPipeline.union(List<PCollection<S>> collections) |
Modifier and Type | Method and Description |
---|---|
<T> ReadableSource<T> |
DistributedPipeline.getMaterializeSourceTarget(PCollection<T> pcollection)
Retrieve a ReadableSourceTarget that provides access to the contents of a
PCollection . |
void |
DistributedPipeline.write(PCollection<?> pcollection,
Target target) |
void |
DistributedPipeline.write(PCollection<?> pcollection,
Target target,
Target.WriteMode writeMode) |
<T> void |
DistributedPipeline.writeTextFile(PCollection<T> pcollection,
String pathName) |
Modifier and Type | Method and Description |
---|---|
<S> PCollection<S> |
DistributedPipeline.union(List<PCollection<S>> collections) |
Modifier and Type | Class and Description |
---|---|
class |
BaseDoCollection<S> |
class |
BaseDoTable<K,V> |
class |
BaseGroupedTable<K,V> |
class |
BaseInputCollection<S> |
class |
BaseInputTable<K,V> |
class |
BaseUnionCollection<S> |
class |
BaseUnionTable<K,V> |
class |
EmptyPCollection<T> |
class |
EmptyPTable<K,V> |
class |
PCollectionImpl<S> |
class |
PTableBase<K,V> |
Modifier and Type | Method and Description |
---|---|
PCollection<S> |
PCollectionImpl.aggregate(Aggregator<S> aggregator) |
PCollection<S> |
PCollectionImpl.cache() |
PCollection<S> |
PCollectionImpl.cache(CachingOptions options) |
PCollection<S> |
PCollectionImpl.filter(FilterFn<S> filterFn) |
PCollection<S> |
PCollectionImpl.filter(String name,
FilterFn<S> filterFn) |
PCollection<K> |
PTableBase.keys() |
<T> PCollection<T> |
PCollectionImpl.parallelDo(DoFn<S,T> fn,
PType<T> type) |
<T> PCollection<T> |
PCollectionImpl.parallelDo(String name,
DoFn<S,T> fn,
PType<T> type) |
<T> PCollection<T> |
PCollectionImpl.parallelDo(String name,
DoFn<S,T> fn,
PType<T> type,
ParallelDoOptions options) |
PCollection<S> |
PCollectionImpl.union(PCollection<S>... collections) |
PCollection<S> |
PCollectionImpl.union(PCollection<S> other) |
PCollection<V> |
PTableBase.values() |
PCollection<S> |
PCollectionImpl.write(Target target) |
PCollection<S> |
PCollectionImpl.write(Target target,
Target.WriteMode writeMode) |
Modifier and Type | Method and Description |
---|---|
PCollection<S> |
PCollectionImpl.union(PCollection<S>... collections) |
PCollection<S> |
PCollectionImpl.union(PCollection<S> other) |
Modifier and Type | Method and Description |
---|---|
static <T> PCollection<T> |
MemPipeline.collectionOf(Iterable<T> collect) |
static <T> PCollection<T> |
MemPipeline.collectionOf(T... ts) |
<T> PCollection<T> |
MemPipeline.create(Iterable<T> contents,
PType<T> ptype) |
<T> PCollection<T> |
MemPipeline.create(Iterable<T> iterable,
PType<T> ptype,
CreateOptions options) |
<T> PCollection<T> |
MemPipeline.emptyPCollection(PType<T> ptype) |
<T> PCollection<T> |
MemPipeline.read(Source<T> source) |
<T> PCollection<T> |
MemPipeline.read(Source<T> source,
String named) |
PCollection<String> |
MemPipeline.readTextFile(String pathName) |
static <T> PCollection<T> |
MemPipeline.typedCollectionOf(PType<T> ptype,
Iterable<T> collect) |
static <T> PCollection<T> |
MemPipeline.typedCollectionOf(PType<T> ptype,
T... ts) |
<S> PCollection<S> |
MemPipeline.union(List<PCollection<S>> collections) |
Modifier and Type | Method and Description |
---|---|
<T> void |
MemPipeline.cache(PCollection<T> pcollection,
CachingOptions options) |
<T> Iterable<T> |
MemPipeline.materialize(PCollection<T> pcollection) |
void |
MemPipeline.write(PCollection<?> collection,
Target target) |
void |
MemPipeline.write(PCollection<?> collection,
Target target,
Target.WriteMode writeMode) |
<T> void |
MemPipeline.writeTextFile(PCollection<T> collection,
String pathName) |
Modifier and Type | Method and Description |
---|---|
<S> PCollection<S> |
MemPipeline.union(List<PCollection<S>> collections) |
Modifier and Type | Method and Description |
---|---|
<T> void |
MRPipeline.cache(PCollection<T> pcollection,
CachingOptions options) |
<T> Iterable<T> |
MRPipeline.materialize(PCollection<T> pcollection) |
Modifier and Type | Method and Description |
---|---|
<S> PCollection<S> |
SparkPipeline.create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options) |
<S> PCollection<S> |
SparkPipeline.emptyPCollection(PType<S> ptype) |
Modifier and Type | Method and Description |
---|---|
<T> void |
SparkPipeline.cache(PCollection<T> pcollection,
CachingOptions options) |
org.apache.spark.storage.StorageLevel |
SparkRuntime.getStorageLevel(PCollection<?> pcollection) |
<T> Iterable<T> |
SparkPipeline.materialize(PCollection<T> pcollection) |
Constructor and Description |
---|
SparkRuntime(SparkPipeline pipeline,
org.apache.spark.api.java.JavaSparkContext sparkContext,
org.apache.hadoop.conf.Configuration conf,
Map<PCollectionImpl<?>,Set<Target>> outputTargets,
Map<PCollectionImpl<?>,org.apache.crunch.materialize.MaterializableIterable> toMaterialize,
Map<PCollection<?>,org.apache.spark.storage.StorageLevel> toCache,
Map<PipelineCallable<?>,Set<Target>> allPipelineCallables) |
Modifier and Type | Class and Description |
---|---|
class |
CreatedCollection<T>
Represents a Spark-based PCollection that was created from a Java
Iterable of
values. |
class |
CreatedTable<K,V>
Represents a Spark-based PTable that was created from a Java
Iterable of
key-value pairs. |
class |
DoCollection<S> |
class |
DoTable<K,V> |
class |
InputCollection<S> |
class |
InputTable<K,V> |
class |
PGroupedTableImpl<K,V> |
class |
UnionCollection<S> |
class |
UnionTable<K,V> |
Modifier and Type | Method and Description |
---|---|
PCollection<S> |
LCollection.underlying()
Get the underlying
PCollection for this LCollection |
Modifier and Type | Method and Description |
---|---|
default LCollection<S> |
LCollection.union(PCollection<S> other)
Union this LCollection with a
PCollection of the same type |
<S> LCollection<S> |
LCollectionFactory.wrap(PCollection<S> collection)
Wrap a PCollection into an LCollection
|
static <S> LCollection<S> |
Lambda.wrap(PCollection<S> collection) |
Modifier and Type | Method and Description |
---|---|
static <S> PCollection<S> |
Aggregate.aggregate(PCollection<S> collect,
Aggregator<S> aggregator) |
static <T> PCollection<Tuple3<T,T,T>> |
Set.comm(PCollection<T> coll1,
PCollection<T> coll2)
Find the elements that are common to two sets, like the Unix
comm utility. |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right)
Performs a full cross join on the specified
PCollection s (using the
same strategy as Pig's CROSS operator). |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right,
int parallelism)
Performs a full cross join on the specified
PCollection s (using the
same strategy as Pig's CROSS operator). |
static <T> PCollection<T> |
Set.difference(PCollection<T> coll1,
PCollection<T> coll2)
Compute the set difference between two sets of elements.
|
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input)
Construct a new
PCollection that contains the unique elements of a
given input PCollection . |
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input,
int flushEvery)
A
distinct operation that gives the client more control over how frequently
elements are flushed to disk in order to allow control over performance or
memory consumption. |
static <T,N extends Number> |
Sample.groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
int[] sampleSizes)
The most general purpose of the weighted reservoir sampling patterns that allows us to choose
a random sample of elements for each of N input groups.
|
static <T,N extends Number> |
Sample.groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
int[] sampleSizes,
Long seed)
Same as the other groupedWeightedReservoirSample method, but include a seed for testing
purposes.
|
static <T> PCollection<T> |
Set.intersection(PCollection<T> coll1,
PCollection<T> coll2)
Compute the intersection of two sets of elements.
|
static <K,V> PCollection<K> |
PTables.keys(PTable<K,V> ptable)
Extract the keys from the given
PTable<K, V> as a PCollection<K> . |
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize)
Select a fixed number of elements from the given
PCollection with each element
equally likely to be included in the sample. |
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize,
Long seed)
A version of the reservoir sampling algorithm that uses a given seed, primarily for
testing purposes.
|
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
double probability)
Output records from the given
PCollection with the given probability. |
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
Long seed,
double probability)
Output records from the given
PCollection using a given seed. |
static <T> PCollection<T> |
Shard.shard(PCollection<T> pc,
int numPartitions)
Creates a
PCollection<T> that has the same contents as its input argument but will
be written to a fixed number of output files. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection)
Sorts the
PCollection using the natural ordering of its elements in ascending order. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
int numReducers,
Sort.Order order)
Sorts the
PCollection using the natural ordering of its elements in
the order specified using the given number of reducers. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
Sort.Order order)
Sorts the
PCollection using the natural order of its elements with the given Order . |
static <K,V1,V2,T> |
SecondarySort.sortAndApply(PTable<K,Pair<V1,V2>> input,
DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn,
PType<T> ptype)
Perform a secondary sort on the given
PTable instance and then apply a
DoFn to the resulting sorted data to yield an output PCollection<T> . |
static <K,V1,V2,T> |
SecondarySort.sortAndApply(PTable<K,Pair<V1,V2>> input,
DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn,
PType<T> ptype,
int numReducers)
Perform a secondary sort on the given
PTable instance and then apply a
DoFn to the resulting sorted data to yield an output PCollection<T> , using
the given number of reducers. |
static <U,V> PCollection<Pair<U,V>> |
Sort.sortPairs(PCollection<Pair<U,V>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Pair s using the specified column
ordering. |
static <V1,V2,V3,V4> |
Sort.sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple4 s using the specified column
ordering. |
static <V1,V2,V3> PCollection<Tuple3<V1,V2,V3>> |
Sort.sortTriples(PCollection<Tuple3<V1,V2,V3>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple3 s using the specified column
ordering. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
int numReducers,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of TupleN s using the specified column
ordering and a client-specified number of reducers. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of tuples using the specified column ordering. |
static <K,V> PCollection<V> |
PTables.values(PTable<K,V> ptable)
Extract the values from the given
PTable<K, V> as a PCollection<V> . |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize)
Selects a weighted sample of the elements of the given
PCollection , where the second term in
the input Pair is a numerical weight. |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize,
Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.
|
Modifier and Type | Method and Description |
---|---|
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection)
Splits a
PCollection of any Pair of objects into a Pair of
PCollection}, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection)
Splits a
PCollection of any Pair of objects into a Pair of
PCollection}, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection,
PType<T> firstPType,
PType<U> secondPType)
Splits a
PCollection of any Pair of objects into a Pair of
PCollection}, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection,
PType<T> firstPType,
PType<U> secondPType)
Splits a
PCollection of any Pair of objects into a Pair of
PCollection}, to allow for the output of a DoFn to be handled using
separate channels. |
Modifier and Type | Method and Description |
---|---|
static <S> PCollection<S> |
Aggregate.aggregate(PCollection<S> collect,
Aggregator<S> aggregator) |
static <K,V> PTable<K,V> |
PTables.asPTable(PCollection<Pair<K,V>> pcollect)
Convert the given
PCollection<Pair<K, V>> to a PTable<K, V> . |
static <T> PCollection<Tuple3<T,T,T>> |
Set.comm(PCollection<T> coll1,
PCollection<T> coll2)
Find the elements that are common to two sets, like the Unix
comm utility. |
static <T> PCollection<Tuple3<T,T,T>> |
Set.comm(PCollection<T> coll1,
PCollection<T> coll2)
Find the elements that are common to two sets, like the Unix
comm utility. |
static <S> PTable<S,Long> |
Aggregate.count(PCollection<S> collect)
Returns a
PTable that contains the unique elements of this collection mapped to a count
of their occurrences. |
static <S> PTable<S,Long> |
Aggregate.count(PCollection<S> collect,
int numPartitions)
Returns a
PTable that contains the unique elements of this collection mapped to a count
of their occurrences. |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right)
Performs a full cross join on the specified
PCollection s (using the
same strategy as Pig's CROSS operator). |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right)
Performs a full cross join on the specified
PCollection s (using the
same strategy as Pig's CROSS operator). |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right,
int parallelism)
Performs a full cross join on the specified
PCollection s (using the
same strategy as Pig's CROSS operator). |
static <U,V> PCollection<Pair<U,V>> |
Cartesian.cross(PCollection<U> left,
PCollection<V> right,
int parallelism)
Performs a full cross join on the specified
PCollection s (using the
same strategy as Pig's CROSS operator). |
static <T> PCollection<T> |
Set.difference(PCollection<T> coll1,
PCollection<T> coll2)
Compute the set difference between two sets of elements.
|
static <T> PCollection<T> |
Set.difference(PCollection<T> coll1,
PCollection<T> coll2)
Compute the set difference between two sets of elements.
|
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input)
Construct a new
PCollection that contains the unique elements of a
given input PCollection . |
static <S> PCollection<S> |
Distinct.distinct(PCollection<S> input,
int flushEvery)
A
distinct operation that gives the client more control over how frequently
elements are flushed to disk in order to allow control over performance or
memory consumption. |
static <X> PTable<X,Long> |
TopList.globalToplist(PCollection<X> input)
Create a list of unique items in the input collection with their count, sorted descending by their frequency.
|
static <T> PCollection<T> |
Set.intersection(PCollection<T> coll1,
PCollection<T> coll2)
Compute the intersection of two sets of elements.
|
static <T> PCollection<T> |
Set.intersection(PCollection<T> coll1,
PCollection<T> coll2)
Compute the intersection of two sets of elements.
|
static <S> PObject<Long> |
Aggregate.length(PCollection<S> collect)
Returns the number of elements in the provided PCollection.
|
static <S> PObject<S> |
Aggregate.max(PCollection<S> collect)
Returns the largest numerical element from the input collection.
|
static <S> PObject<S> |
Aggregate.min(PCollection<S> collect)
Returns the smallest numerical element from the input collection.
|
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize)
Select a fixed number of elements from the given
PCollection with each element
equally likely to be included in the sample. |
static <T> PCollection<T> |
Sample.reservoirSample(PCollection<T> input,
int sampleSize,
Long seed)
A version of the reservoir sampling algorithm that uses a given seed, primarily for
testing purposes.
|
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
double probability)
Output records from the given
PCollection with the given probability. |
static <S> PCollection<S> |
Sample.sample(PCollection<S> input,
Long seed,
double probability)
Output records from the given
PCollection using a given seed. |
static <T> PCollection<T> |
Shard.shard(PCollection<T> pc,
int numPartitions)
Creates a
PCollection<T> that has the same contents as its input argument but will
be written to a fixed number of output files. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection)
Sorts the
PCollection using the natural ordering of its elements in ascending order. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
int numReducers,
Sort.Order order)
Sorts the
PCollection using the natural ordering of its elements in
the order specified using the given number of reducers. |
static <T> PCollection<T> |
Sort.sort(PCollection<T> collection,
Sort.Order order)
Sorts the
PCollection using the natural order of its elements with the given Order . |
static <U,V> PCollection<Pair<U,V>> |
Sort.sortPairs(PCollection<Pair<U,V>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Pair s using the specified column
ordering. |
static <V1,V2,V3,V4> |
Sort.sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple4 s using the specified column
ordering. |
static <V1,V2,V3> PCollection<Tuple3<V1,V2,V3>> |
Sort.sortTriples(PCollection<Tuple3<V1,V2,V3>> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of Tuple3 s using the specified column
ordering. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
int numReducers,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of TupleN s using the specified column
ordering and a client-specified number of reducers. |
static <T extends Tuple> |
Sort.sortTuples(PCollection<T> collection,
Sort.ColumnOrder... columnOrders)
Sorts the
PCollection of tuples using the specified column ordering. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection)
Splits a
PCollection of any Pair of objects into a Pair of
PCollection}, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,U> Pair<PCollection<T>,PCollection<U>> |
Channels.split(PCollection<Pair<T,U>> pCollection,
PType<T> firstPType,
PType<U> secondPType)
Splits a
PCollection of any Pair of objects into a Pair of
PCollection}, to allow for the output of a DoFn to be handled using
separate channels. |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize)
Selects a weighted sample of the elements of the given
PCollection , where the second term in
the input Pair is a numerical weight. |
static <T,N extends Number> |
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input,
int sampleSize,
Long seed)
The weighted reservoir sampling function with the seed term exposed for testing purposes.
|
Modifier and Type | Method and Description |
---|---|
static <K,U,V,T> PCollection<T> |
OneToManyJoin.oneToManyJoin(PTable<K,U> left,
PTable<K,V> right,
DoFn<Pair<U,Iterable<V>>,T> postProcessFn,
PType<T> ptype)
Performs a join on two tables, where the left table only contains a single
value per key.
|
static <K,U,V,T> PCollection<T> |
OneToManyJoin.oneToManyJoin(PTable<K,U> left,
PTable<K,V> right,
DoFn<Pair<U,Iterable<V>>,T> postProcessFn,
PType<T> ptype,
int numReducers)
Supports a user-specified number of reducers for the one-to-many join.
|
Modifier and Type | Method and Description |
---|---|
<T> PCollection<T> |
CrunchTool.read(Source<T> source) |
PCollection<String> |
CrunchTool.readTextFile(String pathName) |
Modifier and Type | Method and Description |
---|---|
static <T> int |
PartitionUtils.getRecommendedPartitions(PCollection<T> pcollection) |
static <T> int |
PartitionUtils.getRecommendedPartitions(PCollection<T> pcollection,
org.apache.hadoop.conf.Configuration conf) |
<T> Iterable<T> |
CrunchTool.materialize(PCollection<T> pcollection) |
void |
CrunchTool.write(PCollection<?> pcollection,
Target target) |
void |
CrunchTool.writeTextFile(PCollection<?> pcollection,
String pathName) |
Copyright © 2016 The Apache Software Foundation. All rights reserved.