| Package | Description | 
|---|---|
| org.apache.crunch | Client-facing API and core abstractions. | 
| org.apache.crunch.contrib.bloomfilter | Support for creating Bloom Filters. | 
| org.apache.crunch.contrib.text | |
| org.apache.crunch.examples | Example applications demonstrating various aspects of Crunch. | 
| org.apache.crunch.impl.dist | |
| org.apache.crunch.impl.dist.collect | |
| org.apache.crunch.impl.mem | In-memory Pipeline implementation for rapid prototyping and testing. | 
| org.apache.crunch.impl.mr | A Pipeline implementation that runs on Hadoop MapReduce. | 
| org.apache.crunch.impl.spark | |
| org.apache.crunch.impl.spark.collect | |
| org.apache.crunch.lib | Joining, sorting, aggregating, and other commonly used functionality. | 
| org.apache.crunch.lib.join | Inner and outer joins on collections. | 
| org.apache.crunch.util | An assorted set of utilities. | 
| Modifier and Type | Interface and Description | 
|---|---|
| interface  | PGroupedTable<K,V>The Crunch representation of a grouped  PTable, which corresponds to the output of
 the shuffle phase of a MapReduce job. | 
| interface  | PTable<K,V>A sub-interface of  PCollectionthat represents an immutable,
 distributed multi-map of keys and values. | 
| Modifier and Type | Method and Description | 
|---|---|
| PCollection<S> | PCollection. aggregate(Aggregator<S> aggregator)Returns a  PCollectionthat contains the result of aggregating all values in this instance. | 
| PCollection<S> | PCollection. cache()Marks this data as cached using the default  CachingOptions. | 
| PCollection<S> | PCollection. cache(CachingOptions options)Marks this data as cached using the given  CachingOptions. | 
| <T> PCollection<T> | Pipeline. create(Iterable<T> contents,
      PType<T> ptype)Creates a  PCollectioncontaining the values found in the givenIterableusing an implementation-specific distribution mechanism. | 
| <T> PCollection<T> | Pipeline. create(Iterable<T> contents,
      PType<T> ptype,
      CreateOptions options)Creates a  PCollectioncontaining the values found in the givenIterableusing an implementation-specific distribution mechanism. | 
| <T> PCollection<T> | Pipeline. emptyPCollection(PType<T> ptype)Creates an empty  PCollectionof the givenPType. | 
| PCollection<S> | PCollection. filter(FilterFn<S> filterFn)Apply the given filter function to this instance and return the resulting
  PCollection. | 
| PCollection<S> | PCollection. filter(String name,
      FilterFn<S> filterFn)Apply the given filter function to this instance and return the resulting
  PCollection. | 
| PCollection<K> | PTable. keys()Returns a  PCollectionmade up of the keys in this PTable. | 
| <T> PCollection<T> | PCollection. parallelDo(DoFn<S,T> doFn,
          PType<T> type)Applies the given doFn to the elements of this  PCollectionand
 returns a newPCollectionthat is the output of this processing. | 
| <T> PCollection<T> | PCollection. parallelDo(String name,
          DoFn<S,T> doFn,
          PType<T> type)Applies the given doFn to the elements of this  PCollectionand
 returns a newPCollectionthat is the output of this processing. | 
| <T> PCollection<T> | PCollection. parallelDo(String name,
          DoFn<S,T> doFn,
          PType<T> type,
          ParallelDoOptions options)Applies the given doFn to the elements of this  PCollectionand
 returns a newPCollectionthat is the output of this processing. | 
| <T> PCollection<T> | Pipeline. read(Source<T> source)Converts the given  Sourceinto aPCollectionthat is
 available to jobs run using thisPipelineinstance. | 
| <T> PCollection<T> | Pipeline. read(Source<T> source,
    String named)Converts the given  Sourceinto aPCollectionthat is
 available to jobs run using thisPipelineinstance. | 
| PCollection<String> | Pipeline. readTextFile(String pathName)A convenience method for reading a text file. | 
| <S> PCollection<S> | Pipeline. union(List<PCollection<S>> collections) | 
| PCollection<S> | PCollection. union(PCollection<S>... collections)Returns a  PCollectioninstance that acts as the union of thisPCollectionand the inputPCollections. | 
| PCollection<S> | PCollection. union(PCollection<S> other)Returns a  PCollectioninstance that acts as the union of thisPCollectionand the givenPCollection. | 
| PCollection<V> | PTable. values()Returns a  PCollectionmade up of the values in this PTable. | 
| PCollection<S> | PCollection. write(Target target)Write the contents of this  PCollectionto the givenTarget,
 using the storage format specified by the target. | 
| PCollection<S> | PCollection. write(Target target,
     Target.WriteMode writeMode)Write the contents of this  PCollectionto the givenTarget,
 using the givenTarget.WriteModeto handle existing
 targets. | 
| Modifier and Type | Method and Description | 
|---|---|
| Map<String,PCollection<?>> | PipelineCallable. getAllPCollections()Returns the mapping of labels to PCollection dependencies for this instance. | 
| Modifier and Type | Method and Description | 
|---|---|
| <T> void | Pipeline. cache(PCollection<T> pcollection,
     CachingOptions options)Caches the given PCollection so that it will be processed at most once
 during pipeline execution. | 
| PipelineCallable<Output> | PipelineCallable. dependsOn(String label,
         PCollection<?> pcollect)Requires that the given  PCollectionbe materialized to disk before this instance may be
 executed. | 
| <T> Iterable<T> | Pipeline. materialize(PCollection<T> pcollection)Create the given PCollection and read the data it contains into the
 returned Collection instance for client use. | 
| PCollection<S> | PCollection. union(PCollection<S>... collections)Returns a  PCollectioninstance that acts as the union of thisPCollectionand the inputPCollections. | 
| PCollection<S> | PCollection. union(PCollection<S> other)Returns a  PCollectioninstance that acts as the union of thisPCollectionand the givenPCollection. | 
| void | Pipeline. write(PCollection<?> collection,
     Target target)Write the given collection to the given target on the next pipeline run. | 
| void | Pipeline. write(PCollection<?> collection,
     Target target,
     Target.WriteMode writeMode)Write the contents of the  PCollectionto the givenTarget,
 using the storage format specified by the target and the givenWriteModefor cases where the referencedTargetalready exists. | 
| <T> void | Pipeline. writeTextFile(PCollection<T> collection,
             String pathName)A convenience method for writing a text file. | 
| Modifier and Type | Method and Description | 
|---|---|
| <S> PCollection<S> | Pipeline. union(List<PCollection<S>> collections) | 
| Modifier and Type | Method and Description | 
|---|---|
| static <T> PObject<org.apache.hadoop.util.bloom.BloomFilter> | BloomFilterFactory. createFilter(PCollection<T> collection,
            BloomFilterFn<T> filterFn) | 
| Modifier and Type | Method and Description | 
|---|---|
| static <T> PCollection<T> | Parse. parse(String groupName,
     PCollection<String> input,
     Extractor<T> extractor)Parses the lines of the input  PCollection<String>and returns aPCollection<T>using
 the givenExtractor<T>. | 
| static <T> PCollection<T> | Parse. parse(String groupName,
     PCollection<String> input,
     PTypeFamily ptf,
     Extractor<T> extractor)Parses the lines of the input  PCollection<String>and returns aPCollection<T>using
 the givenExtractor<T>that uses the givenPTypeFamily. | 
| Modifier and Type | Method and Description | 
|---|---|
| static <T> PCollection<T> | Parse. parse(String groupName,
     PCollection<String> input,
     Extractor<T> extractor)Parses the lines of the input  PCollection<String>and returns aPCollection<T>using
 the givenExtractor<T>. | 
| static <T> PCollection<T> | Parse. parse(String groupName,
     PCollection<String> input,
     PTypeFamily ptf,
     Extractor<T> extractor)Parses the lines of the input  PCollection<String>and returns aPCollection<T>using
 the givenExtractor<T>that uses the givenPTypeFamily. | 
| static <K,V> PTable<K,V> | Parse. parseTable(String groupName,
          PCollection<String> input,
          Extractor<Pair<K,V>> extractor)Parses the lines of the input  PCollection<String>and returns aPTable<K, V>using
 the givenExtractor<Pair<K, V>>. | 
| static <K,V> PTable<K,V> | Parse. parseTable(String groupName,
          PCollection<String> input,
          PTypeFamily ptf,
          Extractor<Pair<K,V>> extractor)Parses the lines of the input  PCollection<String>and returns aPTable<K, V>using
 the givenExtractor<Pair<K, V>>that uses the givenPTypeFamily. | 
| Modifier and Type | Method and Description | 
|---|---|
| PCollection<org.apache.hadoop.hbase.client.Put> | WordAggregationHBase. createPut(PTable<String,String> extractedText)Create puts in order to insert them in hbase. | 
| Modifier and Type | Method and Description | 
|---|---|
| <S> PCollection<S> | DistributedPipeline. create(Iterable<S> contents,
      PType<S> ptype) | 
| <S> PCollection<S> | DistributedPipeline. create(Iterable<S> contents,
      PType<S> ptype,
      CreateOptions options) | 
| <S> PCollection<S> | DistributedPipeline. emptyPCollection(PType<S> ptype) | 
| <S> PCollection<S> | DistributedPipeline. read(Source<S> source) | 
| <S> PCollection<S> | DistributedPipeline. read(Source<S> source,
    String named) | 
| PCollection<String> | DistributedPipeline. readTextFile(String pathName) | 
| <S> PCollection<S> | DistributedPipeline. union(List<PCollection<S>> collections) | 
| Modifier and Type | Method and Description | 
|---|---|
| <T> ReadableSource<T> | DistributedPipeline. getMaterializeSourceTarget(PCollection<T> pcollection)Retrieve a ReadableSourceTarget that provides access to the contents of a  PCollection. | 
| void | DistributedPipeline. write(PCollection<?> pcollection,
     Target target) | 
| void | DistributedPipeline. write(PCollection<?> pcollection,
     Target target,
     Target.WriteMode writeMode) | 
| <T> void | DistributedPipeline. writeTextFile(PCollection<T> pcollection,
             String pathName) | 
| Modifier and Type | Method and Description | 
|---|---|
| <S> PCollection<S> | DistributedPipeline. union(List<PCollection<S>> collections) | 
| Modifier and Type | Class and Description | 
|---|---|
| class  | BaseDoCollection<S> | 
| class  | BaseDoTable<K,V> | 
| class  | BaseGroupedTable<K,V> | 
| class  | BaseInputCollection<S> | 
| class  | BaseInputTable<K,V> | 
| class  | BaseUnionCollection<S> | 
| class  | BaseUnionTable<K,V> | 
| class  | EmptyPCollection<T> | 
| class  | EmptyPTable<K,V> | 
| class  | PCollectionImpl<S> | 
| class  | PTableBase<K,V> | 
| Modifier and Type | Method and Description | 
|---|---|
| PCollection<S> | PCollectionImpl. aggregate(Aggregator<S> aggregator) | 
| PCollection<S> | PCollectionImpl. cache() | 
| PCollection<S> | PCollectionImpl. cache(CachingOptions options) | 
| PCollection<S> | PCollectionImpl. filter(FilterFn<S> filterFn) | 
| PCollection<S> | PCollectionImpl. filter(String name,
      FilterFn<S> filterFn) | 
| PCollection<K> | PTableBase. keys() | 
| <T> PCollection<T> | PCollectionImpl. parallelDo(DoFn<S,T> fn,
          PType<T> type) | 
| <T> PCollection<T> | PCollectionImpl. parallelDo(String name,
          DoFn<S,T> fn,
          PType<T> type) | 
| <T> PCollection<T> | PCollectionImpl. parallelDo(String name,
          DoFn<S,T> fn,
          PType<T> type,
          ParallelDoOptions options) | 
| PCollection<S> | PCollectionImpl. union(PCollection<S>... collections) | 
| PCollection<S> | PCollectionImpl. union(PCollection<S> other) | 
| PCollection<V> | PTableBase. values() | 
| PCollection<S> | PCollectionImpl. write(Target target) | 
| PCollection<S> | PCollectionImpl. write(Target target,
     Target.WriteMode writeMode) | 
| Modifier and Type | Method and Description | 
|---|---|
| PCollection<S> | PCollectionImpl. union(PCollection<S>... collections) | 
| PCollection<S> | PCollectionImpl. union(PCollection<S> other) | 
| Modifier and Type | Method and Description | 
|---|---|
| static <T> PCollection<T> | MemPipeline. collectionOf(Iterable<T> collect) | 
| static <T> PCollection<T> | MemPipeline. collectionOf(T... ts) | 
| <T> PCollection<T> | MemPipeline. create(Iterable<T> contents,
      PType<T> ptype) | 
| <T> PCollection<T> | MemPipeline. create(Iterable<T> iterable,
      PType<T> ptype,
      CreateOptions options) | 
| <T> PCollection<T> | MemPipeline. emptyPCollection(PType<T> ptype) | 
| <T> PCollection<T> | MemPipeline. read(Source<T> source) | 
| <T> PCollection<T> | MemPipeline. read(Source<T> source,
    String named) | 
| PCollection<String> | MemPipeline. readTextFile(String pathName) | 
| static <T> PCollection<T> | MemPipeline. typedCollectionOf(PType<T> ptype,
                 Iterable<T> collect) | 
| static <T> PCollection<T> | MemPipeline. typedCollectionOf(PType<T> ptype,
                 T... ts) | 
| <S> PCollection<S> | MemPipeline. union(List<PCollection<S>> collections) | 
| Modifier and Type | Method and Description | 
|---|---|
| <T> void | MemPipeline. cache(PCollection<T> pcollection,
     CachingOptions options) | 
| <T> Iterable<T> | MemPipeline. materialize(PCollection<T> pcollection) | 
| void | MemPipeline. write(PCollection<?> collection,
     Target target) | 
| void | MemPipeline. write(PCollection<?> collection,
     Target target,
     Target.WriteMode writeMode) | 
| <T> void | MemPipeline. writeTextFile(PCollection<T> collection,
             String pathName) | 
| Modifier and Type | Method and Description | 
|---|---|
| <S> PCollection<S> | MemPipeline. union(List<PCollection<S>> collections) | 
| Modifier and Type | Method and Description | 
|---|---|
| <T> void | MRPipeline. cache(PCollection<T> pcollection,
     CachingOptions options) | 
| <T> Iterable<T> | MRPipeline. materialize(PCollection<T> pcollection) | 
| Modifier and Type | Method and Description | 
|---|---|
| <S> PCollection<S> | SparkPipeline. create(Iterable<S> contents,
      PType<S> ptype,
      CreateOptions options) | 
| <S> PCollection<S> | SparkPipeline. emptyPCollection(PType<S> ptype) | 
| Modifier and Type | Method and Description | 
|---|---|
| <T> void | SparkPipeline. cache(PCollection<T> pcollection,
     CachingOptions options) | 
| org.apache.spark.storage.StorageLevel | SparkRuntime. getStorageLevel(PCollection<?> pcollection) | 
| <T> Iterable<T> | SparkPipeline. materialize(PCollection<T> pcollection) | 
| Constructor and Description | 
|---|
| SparkRuntime(SparkPipeline pipeline,
            org.apache.spark.api.java.JavaSparkContext sparkContext,
            org.apache.hadoop.conf.Configuration conf,
            Map<PCollectionImpl<?>,Set<Target>> outputTargets,
            Map<PCollectionImpl<?>,org.apache.crunch.materialize.MaterializableIterable> toMaterialize,
            Map<PCollection<?>,org.apache.spark.storage.StorageLevel> toCache,
            Map<PipelineCallable<?>,Set<Target>> allPipelineCallables) | 
| Modifier and Type | Class and Description | 
|---|---|
| class  | CreatedCollection<T>Represents a Spark-based PCollection that was created from a Java  Iterableof
 values. | 
| class  | CreatedTable<K,V>Represents a Spark-based PTable that was created from a Java  Iterableof
 key-value pairs. | 
| class  | DoCollection<S> | 
| class  | DoTable<K,V> | 
| class  | InputCollection<S> | 
| class  | InputTable<K,V> | 
| class  | PGroupedTableImpl<K,V> | 
| class  | UnionCollection<S> | 
| class  | UnionTable<K,V> | 
| Modifier and Type | Method and Description | 
|---|---|
| static <S> PCollection<S> | Aggregate. aggregate(PCollection<S> collect,
         Aggregator<S> aggregator) | 
| static <T> PCollection<Tuple3<T,T,T>> | Set. comm(PCollection<T> coll1,
    PCollection<T> coll2)Find the elements that are common to two sets, like the Unix
  commutility. | 
| static <U,V> PCollection<Pair<U,V>> | Cartesian. cross(PCollection<U> left,
     PCollection<V> right)Performs a full cross join on the specified  PCollections (using the
 same strategy as Pig's CROSS operator). | 
| static <U,V> PCollection<Pair<U,V>> | Cartesian. cross(PCollection<U> left,
     PCollection<V> right,
     int parallelism)Performs a full cross join on the specified  PCollections (using the
 same strategy as Pig's CROSS operator). | 
| static <T> PCollection<T> | Set. difference(PCollection<T> coll1,
          PCollection<T> coll2)Compute the set difference between two sets of elements. | 
| static <S> PCollection<S> | Distinct. distinct(PCollection<S> input)Construct a new  PCollectionthat contains the unique elements of a
 given inputPCollection. | 
| static <S> PCollection<S> | Distinct. distinct(PCollection<S> input,
        int flushEvery)A  distinctoperation that gives the client more control over how frequently
 elements are flushed to disk in order to allow control over performance or
 memory consumption. | 
| static <T,N extends Number>  | Sample. groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
                              int[] sampleSizes)The most general purpose of the weighted reservoir sampling patterns that allows us to choose
 a random sample of elements for each of N input groups. | 
| static <T,N extends Number>  | Sample. groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input,
                              int[] sampleSizes,
                              Long seed)Same as the other groupedWeightedReservoirSample method, but include a seed for testing
 purposes. | 
| static <T> PCollection<T> | Set. intersection(PCollection<T> coll1,
            PCollection<T> coll2)Compute the intersection of two sets of elements. | 
| static <K,V> PCollection<K> | PTables. keys(PTable<K,V> ptable)Extract the keys from the given  PTable<K, V>as aPCollection<K>. | 
| static <T> PCollection<T> | Sample. reservoirSample(PCollection<T> input,
               int sampleSize)Select a fixed number of elements from the given  PCollectionwith each element
 equally likely to be included in the sample. | 
| static <T> PCollection<T> | Sample. reservoirSample(PCollection<T> input,
               int sampleSize,
               Long seed)A version of the reservoir sampling algorithm that uses a given seed, primarily for
 testing purposes. | 
| static <S> PCollection<S> | Sample. sample(PCollection<S> input,
      double probability)Output records from the given  PCollectionwith the given probability. | 
| static <S> PCollection<S> | Sample. sample(PCollection<S> input,
      Long seed,
      double probability)Output records from the given  PCollectionusing a given seed. | 
| static <T> PCollection<T> | Shard. shard(PCollection<T> pc,
     int numPartitions)Creates a  PCollection<T>that has the same contents as its input argument but will
 be written to a fixed number of output files. | 
| static <T> PCollection<T> | Sort. sort(PCollection<T> collection)Sorts the  PCollectionusing the natural ordering of its elements in ascending order. | 
| static <T> PCollection<T> | Sort. sort(PCollection<T> collection,
    int numReducers,
    Sort.Order order)Sorts the  PCollectionusing the natural ordering of its elements in
 the order specified using the given number of reducers. | 
| static <T> PCollection<T> | Sort. sort(PCollection<T> collection,
    Sort.Order order)Sorts the  PCollectionusing the natural order of its elements with the givenOrder. | 
| static <K,V1,V2,T>  | SecondarySort. sortAndApply(PTable<K,Pair<V1,V2>> input,
            DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn,
            PType<T> ptype)Perform a secondary sort on the given  PTableinstance and then apply aDoFnto the resulting sorted data to yield an outputPCollection<T>. | 
| static <K,V1,V2,T>  | SecondarySort. sortAndApply(PTable<K,Pair<V1,V2>> input,
            DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn,
            PType<T> ptype,
            int numReducers)Perform a secondary sort on the given  PTableinstance and then apply aDoFnto the resulting sorted data to yield an outputPCollection<T>, using
 the given number of reducers. | 
| static <U,V> PCollection<Pair<U,V>> | Sort. sortPairs(PCollection<Pair<U,V>> collection,
         Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofPairs using the specified column
 ordering. | 
| static <V1,V2,V3,V4>  | Sort. sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection,
         Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofTuple4s using the specified column
 ordering. | 
| static <V1,V2,V3> PCollection<Tuple3<V1,V2,V3>> | Sort. sortTriples(PCollection<Tuple3<V1,V2,V3>> collection,
           Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofTuple3s using the specified column
 ordering. | 
| static <T extends Tuple>  | Sort. sortTuples(PCollection<T> collection,
          int numReducers,
          Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofTupleNs using the specified column
 ordering and a client-specified number of reducers. | 
| static <T extends Tuple>  | Sort. sortTuples(PCollection<T> collection,
          Sort.ColumnOrder... columnOrders)Sorts the  PCollectionof tuples using the specified column ordering. | 
| static <K,V> PCollection<V> | PTables. values(PTable<K,V> ptable)Extract the values from the given  PTable<K, V>as aPCollection<V>. | 
| static <T,N extends Number>  | Sample. weightedReservoirSample(PCollection<Pair<T,N>> input,
                       int sampleSize)Selects a weighted sample of the elements of the given  PCollection, where the second term in
 the inputPairis a numerical weight. | 
| static <T,N extends Number>  | Sample. weightedReservoirSample(PCollection<Pair<T,N>> input,
                       int sampleSize,
                       Long seed)The weighted reservoir sampling function with the seed term exposed for testing purposes. | 
| Modifier and Type | Method and Description | 
|---|---|
| static <T,U> Pair<PCollection<T>,PCollection<U>> | Channels. split(PCollection<Pair<T,U>> pCollection)Splits a  PCollectionof anyPairof objects into a Pair of
 PCollection}, to allow for the output of a DoFn to be handled using
 separate channels. | 
| static <T,U> Pair<PCollection<T>,PCollection<U>> | Channels. split(PCollection<Pair<T,U>> pCollection)Splits a  PCollectionof anyPairof objects into a Pair of
 PCollection}, to allow for the output of a DoFn to be handled using
 separate channels. | 
| static <T,U> Pair<PCollection<T>,PCollection<U>> | Channels. split(PCollection<Pair<T,U>> pCollection,
     PType<T> firstPType,
     PType<U> secondPType)Splits a  PCollectionof anyPairof objects into a Pair of
 PCollection}, to allow for the output of a DoFn to be handled using
 separate channels. | 
| static <T,U> Pair<PCollection<T>,PCollection<U>> | Channels. split(PCollection<Pair<T,U>> pCollection,
     PType<T> firstPType,
     PType<U> secondPType)Splits a  PCollectionof anyPairof objects into a Pair of
 PCollection}, to allow for the output of a DoFn to be handled using
 separate channels. | 
| Modifier and Type | Method and Description | 
|---|---|
| static <S> PCollection<S> | Aggregate. aggregate(PCollection<S> collect,
         Aggregator<S> aggregator) | 
| static <K,V> PTable<K,V> | PTables. asPTable(PCollection<Pair<K,V>> pcollect)Convert the given  PCollection<Pair<K, V>>to aPTable<K, V>. | 
| static <T> PCollection<Tuple3<T,T,T>> | Set. comm(PCollection<T> coll1,
    PCollection<T> coll2)Find the elements that are common to two sets, like the Unix
  commutility. | 
| static <T> PCollection<Tuple3<T,T,T>> | Set. comm(PCollection<T> coll1,
    PCollection<T> coll2)Find the elements that are common to two sets, like the Unix
  commutility. | 
| static <S> PTable<S,Long> | Aggregate. count(PCollection<S> collect)Returns a  PTablethat contains the unique elements of this collection mapped to a count
 of their occurrences. | 
| static <S> PTable<S,Long> | Aggregate. count(PCollection<S> collect,
     int numPartitions)Returns a  PTablethat contains the unique elements of this collection mapped to a count
 of their occurrences. | 
| static <U,V> PCollection<Pair<U,V>> | Cartesian. cross(PCollection<U> left,
     PCollection<V> right)Performs a full cross join on the specified  PCollections (using the
 same strategy as Pig's CROSS operator). | 
| static <U,V> PCollection<Pair<U,V>> | Cartesian. cross(PCollection<U> left,
     PCollection<V> right)Performs a full cross join on the specified  PCollections (using the
 same strategy as Pig's CROSS operator). | 
| static <U,V> PCollection<Pair<U,V>> | Cartesian. cross(PCollection<U> left,
     PCollection<V> right,
     int parallelism)Performs a full cross join on the specified  PCollections (using the
 same strategy as Pig's CROSS operator). | 
| static <U,V> PCollection<Pair<U,V>> | Cartesian. cross(PCollection<U> left,
     PCollection<V> right,
     int parallelism)Performs a full cross join on the specified  PCollections (using the
 same strategy as Pig's CROSS operator). | 
| static <T> PCollection<T> | Set. difference(PCollection<T> coll1,
          PCollection<T> coll2)Compute the set difference between two sets of elements. | 
| static <T> PCollection<T> | Set. difference(PCollection<T> coll1,
          PCollection<T> coll2)Compute the set difference between two sets of elements. | 
| static <S> PCollection<S> | Distinct. distinct(PCollection<S> input)Construct a new  PCollectionthat contains the unique elements of a
 given inputPCollection. | 
| static <S> PCollection<S> | Distinct. distinct(PCollection<S> input,
        int flushEvery)A  distinctoperation that gives the client more control over how frequently
 elements are flushed to disk in order to allow control over performance or
 memory consumption. | 
| static <X> PTable<X,Long> | TopList. globalToplist(PCollection<X> input)Create a list of unique items in the input collection with their count, sorted descending by their frequency. | 
| static <T> PCollection<T> | Set. intersection(PCollection<T> coll1,
            PCollection<T> coll2)Compute the intersection of two sets of elements. | 
| static <T> PCollection<T> | Set. intersection(PCollection<T> coll1,
            PCollection<T> coll2)Compute the intersection of two sets of elements. | 
| static <S> PObject<Long> | Aggregate. length(PCollection<S> collect)Returns the number of elements in the provided PCollection. | 
| static <S> PObject<S> | Aggregate. max(PCollection<S> collect)Returns the largest numerical element from the input collection. | 
| static <S> PObject<S> | Aggregate. min(PCollection<S> collect)Returns the smallest numerical element from the input collection. | 
| static <T> PCollection<T> | Sample. reservoirSample(PCollection<T> input,
               int sampleSize)Select a fixed number of elements from the given  PCollectionwith each element
 equally likely to be included in the sample. | 
| static <T> PCollection<T> | Sample. reservoirSample(PCollection<T> input,
               int sampleSize,
               Long seed)A version of the reservoir sampling algorithm that uses a given seed, primarily for
 testing purposes. | 
| static <S> PCollection<S> | Sample. sample(PCollection<S> input,
      double probability)Output records from the given  PCollectionwith the given probability. | 
| static <S> PCollection<S> | Sample. sample(PCollection<S> input,
      Long seed,
      double probability)Output records from the given  PCollectionusing a given seed. | 
| static <T> PCollection<T> | Shard. shard(PCollection<T> pc,
     int numPartitions)Creates a  PCollection<T>that has the same contents as its input argument but will
 be written to a fixed number of output files. | 
| static <T> PCollection<T> | Sort. sort(PCollection<T> collection)Sorts the  PCollectionusing the natural ordering of its elements in ascending order. | 
| static <T> PCollection<T> | Sort. sort(PCollection<T> collection,
    int numReducers,
    Sort.Order order)Sorts the  PCollectionusing the natural ordering of its elements in
 the order specified using the given number of reducers. | 
| static <T> PCollection<T> | Sort. sort(PCollection<T> collection,
    Sort.Order order)Sorts the  PCollectionusing the natural order of its elements with the givenOrder. | 
| static <U,V> PCollection<Pair<U,V>> | Sort. sortPairs(PCollection<Pair<U,V>> collection,
         Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofPairs using the specified column
 ordering. | 
| static <V1,V2,V3,V4>  | Sort. sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection,
         Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofTuple4s using the specified column
 ordering. | 
| static <V1,V2,V3> PCollection<Tuple3<V1,V2,V3>> | Sort. sortTriples(PCollection<Tuple3<V1,V2,V3>> collection,
           Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofTuple3s using the specified column
 ordering. | 
| static <T extends Tuple>  | Sort. sortTuples(PCollection<T> collection,
          int numReducers,
          Sort.ColumnOrder... columnOrders)Sorts the  PCollectionofTupleNs using the specified column
 ordering and a client-specified number of reducers. | 
| static <T extends Tuple>  | Sort. sortTuples(PCollection<T> collection,
          Sort.ColumnOrder... columnOrders)Sorts the  PCollectionof tuples using the specified column ordering. | 
| static <T,U> Pair<PCollection<T>,PCollection<U>> | Channels. split(PCollection<Pair<T,U>> pCollection)Splits a  PCollectionof anyPairof objects into a Pair of
 PCollection}, to allow for the output of a DoFn to be handled using
 separate channels. | 
| static <T,U> Pair<PCollection<T>,PCollection<U>> | Channels. split(PCollection<Pair<T,U>> pCollection,
     PType<T> firstPType,
     PType<U> secondPType)Splits a  PCollectionof anyPairof objects into a Pair of
 PCollection}, to allow for the output of a DoFn to be handled using
 separate channels. | 
| static <T,N extends Number>  | Sample. weightedReservoirSample(PCollection<Pair<T,N>> input,
                       int sampleSize)Selects a weighted sample of the elements of the given  PCollection, where the second term in
 the inputPairis a numerical weight. | 
| static <T,N extends Number>  | Sample. weightedReservoirSample(PCollection<Pair<T,N>> input,
                       int sampleSize,
                       Long seed)The weighted reservoir sampling function with the seed term exposed for testing purposes. | 
| Modifier and Type | Method and Description | 
|---|---|
| static <K,U,V,T> PCollection<T> | OneToManyJoin. oneToManyJoin(PTable<K,U> left,
             PTable<K,V> right,
             DoFn<Pair<U,Iterable<V>>,T> postProcessFn,
             PType<T> ptype)Performs a join on two tables, where the left table only contains a single
 value per key. | 
| static <K,U,V,T> PCollection<T> | OneToManyJoin. oneToManyJoin(PTable<K,U> left,
             PTable<K,V> right,
             DoFn<Pair<U,Iterable<V>>,T> postProcessFn,
             PType<T> ptype,
             int numReducers)Supports a user-specified number of reducers for the one-to-many join. | 
| Modifier and Type | Method and Description | 
|---|---|
| <T> PCollection<T> | CrunchTool. read(Source<T> source) | 
| PCollection<String> | CrunchTool. readTextFile(String pathName) | 
| Modifier and Type | Method and Description | 
|---|---|
| static <T> int | PartitionUtils. getRecommendedPartitions(PCollection<T> pcollection) | 
| static <T> int | PartitionUtils. getRecommendedPartitions(PCollection<T> pcollection,
                        org.apache.hadoop.conf.Configuration conf) | 
| <T> Iterable<T> | CrunchTool. materialize(PCollection<T> pcollection) | 
| void | CrunchTool. write(PCollection<?> pcollection,
     Target target) | 
| void | CrunchTool. writeTextFile(PCollection<?> pcollection,
             String pathName) | 
Copyright © 2015 The Apache Software Foundation. All Rights Reserved.