This project has retired. For details please refer to its Attic page.
Uses of Interface org.apache.crunch.PCollection (Apache Crunch 0.8.0 API)

Uses of Interface
org.apache.crunch.PCollection

Packages that use PCollection
org.apache.crunch Client-facing API and core abstractions. 
org.apache.crunch.contrib.bloomfilter Support for creating Bloom Filters. 
org.apache.crunch.contrib.text   
org.apache.crunch.examples Example applications demonstrating various aspects of Crunch. 
org.apache.crunch.impl.mem In-memory Pipeline implementation for rapid prototyping and testing. 
org.apache.crunch.impl.mr A Pipeline implementation that runs on Hadoop MapReduce. 
org.apache.crunch.lib Joining, sorting, aggregating, and other commonly used functionality. 
org.apache.crunch.lib.join Inner and outer joins on collections. 
org.apache.crunch.util An assorted set of utilities. 
 

Uses of PCollection in org.apache.crunch
 

Subinterfaces of PCollection in org.apache.crunch
 interface PGroupedTable<K,V>
          The Crunch representation of a grouped PTable, which corresponds to the output of the shuffle phase of a MapReduce job.
 interface PTable<K,V>
          A sub-interface of PCollection that represents an immutable, distributed multi-map of keys and values.
 

Methods in org.apache.crunch that return PCollection
 PCollection<S> PCollection.filter(FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
 PCollection<S> PCollection.filter(String name, FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
 PCollection<K> PTable.keys()
          Returns a PCollection made up of the keys in this PTable.
<T> PCollection<T>
PCollection.parallelDo(DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<T> PCollection<T>
PCollection.parallelDo(String name, DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<T> PCollection<T>
PCollection.parallelDo(String name, DoFn<S,T> doFn, PType<T> type, ParallelDoOptions options)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<T> PCollection<T>
Pipeline.read(Source<T> source)
          Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
 PCollection<String> Pipeline.readTextFile(String pathName)
          A convenience method for reading a text file.
 PCollection<S> PCollection.union(PCollection<S>... collections)
          Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.
 PCollection<S> PCollection.union(PCollection<S> other)
          Returns a PCollection instance that acts as the union of this PCollection and the given PCollection.
 PCollection<V> PTable.values()
          Returns a PCollection made up of the values in this PTable.
 PCollection<S> PCollection.write(Target target)
          Write the contents of this PCollection to the given Target, using the storage format specified by the target.
 PCollection<S> PCollection.write(Target target, Target.WriteMode writeMode)
          Write the contents of this PCollection to the given Target, using the given Target.WriteMode to handle existing targets.
 

Methods in org.apache.crunch with parameters of type PCollection
<T> Iterable<T>
Pipeline.materialize(PCollection<T> pcollection)
          Create the given PCollection and read the data it contains into the returned Collection instance for client use.
 PCollection<S> PCollection.union(PCollection<S>... collections)
          Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.
 PCollection<S> PCollection.union(PCollection<S> other)
          Returns a PCollection instance that acts as the union of this PCollection and the given PCollection.
 void Pipeline.write(PCollection<?> collection, Target target)
          Write the given collection to the given target on the next pipeline run.
 void Pipeline.write(PCollection<?> collection, Target target, Target.WriteMode writeMode)
          Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.
<T> void
Pipeline.writeTextFile(PCollection<T> collection, String pathName)
          A convenience method for writing a text file.
 

Uses of PCollection in org.apache.crunch.contrib.bloomfilter
 

Methods in org.apache.crunch.contrib.bloomfilter with parameters of type PCollection
static
<T> PObject<org.apache.hadoop.util.bloom.BloomFilter>
BloomFilterFactory.createFilter(PCollection<T> collection, BloomFilterFn<T> filterFn)
           
 

Uses of PCollection in org.apache.crunch.contrib.text
 

Methods in org.apache.crunch.contrib.text that return PCollection
static
<T> PCollection<T>
Parse.parse(String groupName, PCollection<String> input, Extractor<T> extractor)
          Parses the lines of the input PCollection<String> and returns a PCollection<T> using the given Extractor<T>.
static
<T> PCollection<T>
Parse.parse(String groupName, PCollection<String> input, PTypeFamily ptf, Extractor<T> extractor)
          Parses the lines of the input PCollection<String> and returns a PCollection<T> using the given Extractor<T> that uses the given PTypeFamily.
 

Methods in org.apache.crunch.contrib.text with parameters of type PCollection
static
<T> PCollection<T>
Parse.parse(String groupName, PCollection<String> input, Extractor<T> extractor)
          Parses the lines of the input PCollection<String> and returns a PCollection<T> using the given Extractor<T>.
static
<T> PCollection<T>
Parse.parse(String groupName, PCollection<String> input, PTypeFamily ptf, Extractor<T> extractor)
          Parses the lines of the input PCollection<String> and returns a PCollection<T> using the given Extractor<T> that uses the given PTypeFamily.
static
<K,V> PTable<K,V>
Parse.parseTable(String groupName, PCollection<String> input, Extractor<Pair<K,V>> extractor)
          Parses the lines of the input PCollection<String> and returns a PTable<K, V> using the given Extractor<Pair<K, V>>.
static
<K,V> PTable<K,V>
Parse.parseTable(String groupName, PCollection<String> input, PTypeFamily ptf, Extractor<Pair<K,V>> extractor)
          Parses the lines of the input PCollection<String> and returns a PTable<K, V> using the given Extractor<Pair<K, V>> that uses the given PTypeFamily.
 

Uses of PCollection in org.apache.crunch.examples
 

Methods in org.apache.crunch.examples that return PCollection
 PCollection<org.apache.hadoop.hbase.client.Put> WordAggregationHBase.createPut(PTable<String,String> extractedText)
          Create puts in order to insert them in hbase.
 

Uses of PCollection in org.apache.crunch.impl.mem
 

Methods in org.apache.crunch.impl.mem that return PCollection
static
<T> PCollection<T>
MemPipeline.collectionOf(Iterable<T> collect)
           
static
<T> PCollection<T>
MemPipeline.collectionOf(T... ts)
           
<T> PCollection<T>
MemPipeline.read(Source<T> source)
           
 PCollection<String> MemPipeline.readTextFile(String pathName)
           
static
<T> PCollection<T>
MemPipeline.typedCollectionOf(PType<T> ptype, Iterable<T> collect)
           
static
<T> PCollection<T>
MemPipeline.typedCollectionOf(PType<T> ptype, T... ts)
           
 

Methods in org.apache.crunch.impl.mem with parameters of type PCollection
<T> Iterable<T>
MemPipeline.materialize(PCollection<T> pcollection)
           
 void MemPipeline.write(PCollection<?> collection, Target target)
           
 void MemPipeline.write(PCollection<?> collection, Target target, Target.WriteMode writeMode)
           
<T> void
MemPipeline.writeTextFile(PCollection<T> collection, String pathName)
           
 

Uses of PCollection in org.apache.crunch.impl.mr
 

Methods in org.apache.crunch.impl.mr that return PCollection
<S> PCollection<S>
MRPipeline.read(Source<S> source)
           
 PCollection<String> MRPipeline.readTextFile(String pathName)
           
 

Methods in org.apache.crunch.impl.mr with parameters of type PCollection
<T> ReadableSource<T>
MRPipeline.getMaterializeSourceTarget(PCollection<T> pcollection)
          Retrieve a ReadableSourceTarget that provides access to the contents of a PCollection.
<T> Iterable<T>
MRPipeline.materialize(PCollection<T> pcollection)
           
 void MRPipeline.write(PCollection<?> pcollection, Target target)
           
 void MRPipeline.write(PCollection<?> pcollection, Target target, Target.WriteMode writeMode)
           
<T> void
MRPipeline.writeTextFile(PCollection<T> pcollection, String pathName)
           
 

Uses of PCollection in org.apache.crunch.lib
 

Methods in org.apache.crunch.lib that return PCollection
static
<T> PCollection<Tuple3<T,T,T>>
Set.comm(PCollection<T> coll1, PCollection<T> coll2)
          Find the elements that are common to two sets, like the Unix comm utility.
static
<U,V> PCollection<Pair<U,V>>
Cartesian.cross(PCollection<U> left, PCollection<V> right)
          Performs a full cross join on the specified PCollections (using the same strategy as Pig's CROSS operator).
static
<U,V> PCollection<Pair<U,V>>
Cartesian.cross(PCollection<U> left, PCollection<V> right, int parallelism)
          Performs a full cross join on the specified PCollections (using the same strategy as Pig's CROSS operator).
static
<T> PCollection<T>
Set.difference(PCollection<T> coll1, PCollection<T> coll2)
          Compute the set difference between two sets of elements.
static
<S> PCollection<S>
Distinct.distinct(PCollection<S> input)
          Construct a new PCollection that contains the unique elements of a given input PCollection.
static
<S> PCollection<S>
Distinct.distinct(PCollection<S> input, int flushEvery)
          A distinct operation that gives the client more control over how frequently elements are flushed to disk in order to allow control over performance or memory consumption.
static
<T,N extends Number>
PCollection<Pair<Integer,T>>
Sample.groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes)
          The most general purpose of the weighted reservoir sampling patterns that allows us to choose a random sample of elements for each of N input groups.
static
<T,N extends Number>
PCollection<Pair<Integer,T>>
Sample.groupedWeightedReservoirSample(PTable<Integer,Pair<T,N>> input, int[] sampleSizes, Long seed)
          Same as the other groupedWeightedReservoirSample method, but include a seed for testing purposes.
static
<T> PCollection<T>
Set.intersection(PCollection<T> coll1, PCollection<T> coll2)
          Compute the intersection of two sets of elements.
static
<K,V> PCollection<K>
PTables.keys(PTable<K,V> ptable)
          Extract the keys from the given PTable<K, V> as a PCollection<K>.
static
<T> PCollection<T>
Sample.reservoirSample(PCollection<T> input, int sampleSize)
          Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample.
static
<T> PCollection<T>
Sample.reservorSample(PCollection<T> input, int sampleSize, Long seed)
          A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.
static
<S> PCollection<S>
Sample.sample(PCollection<S> input, double probability)
          Output records from the given PCollection with the given probability.
static
<S> PCollection<S>
Sample.sample(PCollection<S> input, Long seed, double probability)
          Output records from the given PCollection using a given seed.
static
<T> PCollection<T>
Shard.shard(PCollection<T> pc, int numPartitions)
          Creates a PCollection<T> that has the same contents as its input argument but will be written to a fixed number of output files.
static
<T> PCollection<T>
Sort.sort(PCollection<T> collection)
          Sorts the PCollection using the natural ordering of its elements in ascending order.
static
<T> PCollection<T>
Sort.sort(PCollection<T> collection, int numReducers, Sort.Order order)
          Sorts the PCollection using the natural ordering of its elements in the order specified using the given number of reducers.
static
<T> PCollection<T>
Sort.sort(PCollection<T> collection, Sort.Order order)
          Sorts the PCollection using the natural order of its elements with the given Order.
static
<K,V1,V2,T>
PCollection<T>
SecondarySort.sortAndApply(PTable<K,Pair<V1,V2>> input, DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn, PType<T> ptype)
          Perform a secondary sort on the given PTable instance and then apply a DoFn to the resulting sorted data to yield an output PCollection<T>.
static
<K,V1,V2,T>
PCollection<T>
SecondarySort.sortAndApply(PTable<K,Pair<V1,V2>> input, DoFn<Pair<K,Iterable<Pair<V1,V2>>>,T> doFn, PType<T> ptype, int numReducers)
          Perform a secondary sort on the given PTable instance and then apply a DoFn to the resulting sorted data to yield an output PCollection<T>, using the given number of reducers.
static
<U,V> PCollection<Pair<U,V>>
Sort.sortPairs(PCollection<Pair<U,V>> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of Pairs using the specified column ordering.
static
<V1,V2,V3,V4>
PCollection<Tuple4<V1,V2,V3,V4>>
Sort.sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of Tuple4s using the specified column ordering.
static
<V1,V2,V3> PCollection<Tuple3<V1,V2,V3>>
Sort.sortTriples(PCollection<Tuple3<V1,V2,V3>> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of Tuple3s using the specified column ordering.
static
<T extends Tuple>
PCollection<T>
Sort.sortTuples(PCollection<T> collection, int numReducers, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of TupleNs using the specified column ordering and a client-specified number of reducers.
static
<T extends Tuple>
PCollection<T>
Sort.sortTuples(PCollection<T> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of tuples using the specified column ordering.
static
<K,V> PCollection<V>
PTables.values(PTable<K,V> ptable)
          Extract the values from the given PTable<K, V> as a PCollection<V>.
static
<T,N extends Number>
PCollection<T>
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize)
          Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight.
static
<T,N extends Number>
PCollection<T>
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed)
          The weighted reservoir sampling function with the seed term exposed for testing purposes.
 

Methods in org.apache.crunch.lib that return types with arguments of type PCollection
static
<T,U> Pair<PCollection<T>,PCollection<U>>
Channels.split(PCollection<Pair<T,U>> pCollection)
          Splits a PCollection of any Pair of objects into a Pair of PCollection}, to allow for the output of a DoFn to be handled using separate channels.
static
<T,U> Pair<PCollection<T>,PCollection<U>>
Channels.split(PCollection<Pair<T,U>> pCollection)
          Splits a PCollection of any Pair of objects into a Pair of PCollection}, to allow for the output of a DoFn to be handled using separate channels.
static
<T,U> Pair<PCollection<T>,PCollection<U>>
Channels.split(PCollection<Pair<T,U>> pCollection, PType<T> firstPType, PType<U> secondPType)
          Splits a PCollection of any Pair of objects into a Pair of PCollection}, to allow for the output of a DoFn to be handled using separate channels.
static
<T,U> Pair<PCollection<T>,PCollection<U>>
Channels.split(PCollection<Pair<T,U>> pCollection, PType<T> firstPType, PType<U> secondPType)
          Splits a PCollection of any Pair of objects into a Pair of PCollection}, to allow for the output of a DoFn to be handled using separate channels.
 

Methods in org.apache.crunch.lib with parameters of type PCollection
static
<K,V> PTable<K,V>
PTables.asPTable(PCollection<Pair<K,V>> pcollect)
          Convert the given PCollection<Pair<K, V>> to a PTable<K, V>.
static
<T> PCollection<Tuple3<T,T,T>>
Set.comm(PCollection<T> coll1, PCollection<T> coll2)
          Find the elements that are common to two sets, like the Unix comm utility.
static
<T> PCollection<Tuple3<T,T,T>>
Set.comm(PCollection<T> coll1, PCollection<T> coll2)
          Find the elements that are common to two sets, like the Unix comm utility.
static
<S> PTable<S,Long>
Aggregate.count(PCollection<S> collect)
          Returns a PTable that contains the unique elements of this collection mapped to a count of their occurrences.
static
<S> PTable<S,Long>
Aggregate.count(PCollection<S> collect, int numPartitions)
          Returns a PTable that contains the unique elements of this collection mapped to a count of their occurrences.
static
<U,V> PCollection<Pair<U,V>>
Cartesian.cross(PCollection<U> left, PCollection<V> right)
          Performs a full cross join on the specified PCollections (using the same strategy as Pig's CROSS operator).
static
<U,V> PCollection<Pair<U,V>>
Cartesian.cross(PCollection<U> left, PCollection<V> right)
          Performs a full cross join on the specified PCollections (using the same strategy as Pig's CROSS operator).
static
<U,V> PCollection<Pair<U,V>>
Cartesian.cross(PCollection<U> left, PCollection<V> right, int parallelism)
          Performs a full cross join on the specified PCollections (using the same strategy as Pig's CROSS operator).
static
<U,V> PCollection<Pair<U,V>>
Cartesian.cross(PCollection<U> left, PCollection<V> right, int parallelism)
          Performs a full cross join on the specified PCollections (using the same strategy as Pig's CROSS operator).
static
<T> PCollection<T>
Set.difference(PCollection<T> coll1, PCollection<T> coll2)
          Compute the set difference between two sets of elements.
static
<T> PCollection<T>
Set.difference(PCollection<T> coll1, PCollection<T> coll2)
          Compute the set difference between two sets of elements.
static
<S> PCollection<S>
Distinct.distinct(PCollection<S> input)
          Construct a new PCollection that contains the unique elements of a given input PCollection.
static
<S> PCollection<S>
Distinct.distinct(PCollection<S> input, int flushEvery)
          A distinct operation that gives the client more control over how frequently elements are flushed to disk in order to allow control over performance or memory consumption.
static
<T> PCollection<T>
Set.intersection(PCollection<T> coll1, PCollection<T> coll2)
          Compute the intersection of two sets of elements.
static
<T> PCollection<T>
Set.intersection(PCollection<T> coll1, PCollection<T> coll2)
          Compute the intersection of two sets of elements.
static
<S> PObject<Long>
Aggregate.length(PCollection<S> collect)
          Returns the number of elements in the provided PCollection.
static
<S> PObject<S>
Aggregate.max(PCollection<S> collect)
          Returns the largest numerical element from the input collection.
static
<S> PObject<S>
Aggregate.min(PCollection<S> collect)
          Returns the smallest numerical element from the input collection.
static
<T> PCollection<T>
Sample.reservoirSample(PCollection<T> input, int sampleSize)
          Select a fixed number of elements from the given PCollection with each element equally likely to be included in the sample.
static
<T> PCollection<T>
Sample.reservorSample(PCollection<T> input, int sampleSize, Long seed)
          A version of the reservoir sampling algorithm that uses a given seed, primarily for testing purposes.
static
<S> PCollection<S>
Sample.sample(PCollection<S> input, double probability)
          Output records from the given PCollection with the given probability.
static
<S> PCollection<S>
Sample.sample(PCollection<S> input, Long seed, double probability)
          Output records from the given PCollection using a given seed.
static
<T> PCollection<T>
Shard.shard(PCollection<T> pc, int numPartitions)
          Creates a PCollection<T> that has the same contents as its input argument but will be written to a fixed number of output files.
static
<T> PCollection<T>
Sort.sort(PCollection<T> collection)
          Sorts the PCollection using the natural ordering of its elements in ascending order.
static
<T> PCollection<T>
Sort.sort(PCollection<T> collection, int numReducers, Sort.Order order)
          Sorts the PCollection using the natural ordering of its elements in the order specified using the given number of reducers.
static
<T> PCollection<T>
Sort.sort(PCollection<T> collection, Sort.Order order)
          Sorts the PCollection using the natural order of its elements with the given Order.
static
<U,V> PCollection<Pair<U,V>>
Sort.sortPairs(PCollection<Pair<U,V>> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of Pairs using the specified column ordering.
static
<V1,V2,V3,V4>
PCollection<Tuple4<V1,V2,V3,V4>>
Sort.sortQuads(PCollection<Tuple4<V1,V2,V3,V4>> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of Tuple4s using the specified column ordering.
static
<V1,V2,V3> PCollection<Tuple3<V1,V2,V3>>
Sort.sortTriples(PCollection<Tuple3<V1,V2,V3>> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of Tuple3s using the specified column ordering.
static
<T extends Tuple>
PCollection<T>
Sort.sortTuples(PCollection<T> collection, int numReducers, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of TupleNs using the specified column ordering and a client-specified number of reducers.
static
<T extends Tuple>
PCollection<T>
Sort.sortTuples(PCollection<T> collection, Sort.ColumnOrder... columnOrders)
          Sorts the PCollection of tuples using the specified column ordering.
static
<T,U> Pair<PCollection<T>,PCollection<U>>
Channels.split(PCollection<Pair<T,U>> pCollection)
          Splits a PCollection of any Pair of objects into a Pair of PCollection}, to allow for the output of a DoFn to be handled using separate channels.
static
<T,U> Pair<PCollection<T>,PCollection<U>>
Channels.split(PCollection<Pair<T,U>> pCollection, PType<T> firstPType, PType<U> secondPType)
          Splits a PCollection of any Pair of objects into a Pair of PCollection}, to allow for the output of a DoFn to be handled using separate channels.
static
<T,N extends Number>
PCollection<T>
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize)
          Selects a weighted sample of the elements of the given PCollection, where the second term in the input Pair is a numerical weight.
static
<T,N extends Number>
PCollection<T>
Sample.weightedReservoirSample(PCollection<Pair<T,N>> input, int sampleSize, Long seed)
          The weighted reservoir sampling function with the seed term exposed for testing purposes.
 

Uses of PCollection in org.apache.crunch.lib.join
 

Methods in org.apache.crunch.lib.join that return PCollection
static
<K,U,V,T> PCollection<T>
OneToManyJoin.oneToManyJoin(PTable<K,U> left, PTable<K,V> right, DoFn<Pair<U,Iterable<V>>,T> postProcessFn, PType<T> ptype)
          Performs a join on two tables, where the left table only contains a single value per key.
static
<K,U,V,T> PCollection<T>
OneToManyJoin.oneToManyJoin(PTable<K,U> left, PTable<K,V> right, DoFn<Pair<U,Iterable<V>>,T> postProcessFn, PType<T> ptype, int numReducers)
          Supports a user-specified number of reducers for the one-to-many join.
 

Uses of PCollection in org.apache.crunch.util
 

Methods in org.apache.crunch.util that return PCollection
<T> PCollection<T>
CrunchTool.read(Source<T> source)
           
 PCollection<String> CrunchTool.readTextFile(String pathName)
           
 

Methods in org.apache.crunch.util with parameters of type PCollection
static
<T> int
PartitionUtils.getRecommendedPartitions(PCollection<T> pcollection)
           
static
<T> int
PartitionUtils.getRecommendedPartitions(PCollection<T> pcollection, org.apache.hadoop.conf.Configuration conf)
           
<T> Iterable<T>
CrunchTool.materialize(PCollection<T> pcollection)
           
 void CrunchTool.write(PCollection<?> pcollection, Target target)
           
 void CrunchTool.writeTextFile(PCollection<?> pcollection, String pathName)
           
 



Copyright © 2013 The Apache Software Foundation. All Rights Reserved.