org.apache.crunch
Interface PCollection<S>

All Known Subinterfaces:
PGroupedTable<K,V>, PTable<K,V>
All Known Implementing Classes:
BaseDoCollection, BaseDoTable, BaseGroupedTable, BaseInputCollection, BaseInputTable, BaseUnionCollection, BaseUnionTable, DoCollection, DoCollection, DoTable, DoTable, InputCollection, InputCollection, InputTable, InputTable, MemCollection, MemTable, PCollectionImpl, PGroupedTableImpl, PGroupedTableImpl, PTableBase, UnionCollection, UnionCollection, UnionTable, UnionTable

public interface PCollection<S>

A representation of an immutable, distributed collection of elements that is the fundamental target of computations in Crunch.


Method Summary
 PObject<Collection<S>> asCollection()
           
 ReadableData<S> asReadable(boolean materialize)
           
<K> PTable<K,S>
by(MapFn<S,K> extractKeyFn, PType<K> keyType)
          Apply the given map function to each element of this instance in order to create a PTable.
<K> PTable<K,S>
by(String name, MapFn<S,K> extractKeyFn, PType<K> keyType)
          Apply the given map function to each element of this instance in order to create a PTable.
 PCollection<S> cache()
          Marks this data as cached using the default CachingOptions.
 PCollection<S> cache(CachingOptions options)
          Marks this data as cached using the given CachingOptions.
 PTable<S,Long> count()
          Returns a PTable instance that contains the counts of each unique element of this PCollection.
 PCollection<S> filter(FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
 PCollection<S> filter(String name, FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
 String getName()
          Returns a shorthand name for this PCollection.
 Pipeline getPipeline()
          Returns the Pipeline associated with this PCollection.
 PType<S> getPType()
          Returns the PType of this PCollection.
 long getSize()
          Returns the size of the data represented by this PCollection in bytes.
 PTypeFamily getTypeFamily()
          Returns the PTypeFamily of this PCollection.
 PObject<Long> length()
          Returns the number of elements represented by this PCollection.
 Iterable<S> materialize()
          Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.
 PObject<S> max()
          Returns a PObject of the maximum element of this instance.
 PObject<S> min()
          Returns a PObject of the minimum element of this instance.
<K,V> PTable<K,V>
parallelDo(DoFn<S,Pair<K,V>> doFn, PTableType<K,V> type)
          Similar to the other parallelDo instance, but returns a PTable instance instead of a PCollection.
<T> PCollection<T>
parallelDo(DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<K,V> PTable<K,V>
parallelDo(String name, DoFn<S,Pair<K,V>> doFn, PTableType<K,V> type)
          Similar to the other parallelDo instance, but returns a PTable instance instead of a PCollection.
<K,V> PTable<K,V>
parallelDo(String name, DoFn<S,Pair<K,V>> doFn, PTableType<K,V> type, ParallelDoOptions options)
          Similar to the other parallelDo instance, but returns a PTable instance instead of a PCollection.
<T> PCollection<T>
parallelDo(String name, DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<T> PCollection<T>
parallelDo(String name, DoFn<S,T> doFn, PType<T> type, ParallelDoOptions options)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
 PCollection<S> union(PCollection<S>... collections)
          Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.
 PCollection<S> union(PCollection<S> other)
          Returns a PCollection instance that acts as the union of this PCollection and the given PCollection.
 PCollection<S> write(Target target)
          Write the contents of this PCollection to the given Target, using the storage format specified by the target.
 PCollection<S> write(Target target, Target.WriteMode writeMode)
          Write the contents of this PCollection to the given Target, using the given Target.WriteMode to handle existing targets.
 

Method Detail

getPipeline

Pipeline getPipeline()
Returns the Pipeline associated with this PCollection.


union

PCollection<S> union(PCollection<S> other)
Returns a PCollection instance that acts as the union of this PCollection and the given PCollection.


union

PCollection<S> union(PCollection<S>... collections)
Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.


parallelDo

<T> PCollection<T> parallelDo(DoFn<S,T> doFn,
                              PType<T> type)
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Parameters:
doFn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection

parallelDo

<T> PCollection<T> parallelDo(String name,
                              DoFn<S,T> doFn,
                              PType<T> type)
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Parameters:
name - An identifier for this processing step, useful for debugging
doFn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection

parallelDo

<T> PCollection<T> parallelDo(String name,
                              DoFn<S,T> doFn,
                              PType<T> type,
                              ParallelDoOptions options)
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Parameters:
name - An identifier for this processing step, useful for debugging
doFn - The DoFn to apply
type - The PType of the resulting PCollection
options - Optional information that is needed for certain pipeline operations
Returns:
a new PCollection

parallelDo

<K,V> PTable<K,V> parallelDo(DoFn<S,Pair<K,V>> doFn,
                             PTableType<K,V> type)
Similar to the other parallelDo instance, but returns a PTable instance instead of a PCollection.

Parameters:
doFn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

parallelDo

<K,V> PTable<K,V> parallelDo(String name,
                             DoFn<S,Pair<K,V>> doFn,
                             PTableType<K,V> type)
Similar to the other parallelDo instance, but returns a PTable instance instead of a PCollection.

Parameters:
name - An identifier for this processing step
doFn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

parallelDo

<K,V> PTable<K,V> parallelDo(String name,
                             DoFn<S,Pair<K,V>> doFn,
                             PTableType<K,V> type,
                             ParallelDoOptions options)
Similar to the other parallelDo instance, but returns a PTable instance instead of a PCollection.

Parameters:
name - An identifier for this processing step
doFn - The DoFn to apply
type - The PTableType of the resulting PTable
options - Optional information that is needed for certain pipeline operations
Returns:
a new PTable

write

PCollection<S> write(Target target)
Write the contents of this PCollection to the given Target, using the storage format specified by the target.

Parameters:
target - The target to write to

write

PCollection<S> write(Target target,
                     Target.WriteMode writeMode)
Write the contents of this PCollection to the given Target, using the given Target.WriteMode to handle existing targets.

Parameters:
target - The target
writeMode - The rule for handling existing outputs at the target location

materialize

Iterable<S> materialize()
Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.


cache

PCollection<S> cache()
Marks this data as cached using the default CachingOptions. Cached PCollections will only be processed once, and then their contents will be saved so that downstream code can process them many times.

Returns:
this PCollection instance

cache

PCollection<S> cache(CachingOptions options)
Marks this data as cached using the given CachingOptions. Cached PCollections will only be processed once and then their contents will be saved so that downstream code can process them many times.

Parameters:
options - the options that control the cache settings for the data
Returns:
this PCollection instance

asCollection

PObject<Collection<S>> asCollection()
Returns:
A PObject encapsulating an in-memory Collection containing the values of this PCollection.

asReadable

ReadableData<S> asReadable(boolean materialize)
Parameters:
materialize - If true, materialize this data before returning a reference to it
Returns:
A reference to the data in this instance that can be read from a job running on a cluster.

getPType

PType<S> getPType()
Returns the PType of this PCollection.


getTypeFamily

PTypeFamily getTypeFamily()
Returns the PTypeFamily of this PCollection.


getSize

long getSize()
Returns the size of the data represented by this PCollection in bytes.


length

PObject<Long> length()
Returns the number of elements represented by this PCollection.

Returns:
An PObject containing the number of elements in this PCollection.

getName

String getName()
Returns a shorthand name for this PCollection.


filter

PCollection<S> filter(FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting PCollection.


filter

PCollection<S> filter(String name,
                      FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting PCollection.

Parameters:
name - An identifier for this processing step
filterFn - The FilterFn to apply

by

<K> PTable<K,S> by(MapFn<S,K> extractKeyFn,
                   PType<K> keyType)
Apply the given map function to each element of this instance in order to create a PTable.


by

<K> PTable<K,S> by(String name,
                   MapFn<S,K> extractKeyFn,
                   PType<K> keyType)
Apply the given map function to each element of this instance in order to create a PTable.

Parameters:
name - An identifier for this processing step
extractKeyFn - The MapFn to apply

count

PTable<S,Long> count()
Returns a PTable instance that contains the counts of each unique element of this PCollection.


max

PObject<S> max()
Returns a PObject of the maximum element of this instance.


min

PObject<S> min()
Returns a PObject of the minimum element of this instance.



Copyright © 2014 The Apache Software Foundation. All Rights Reserved.