This project has been retired. For details, please refer to its Attic page.
PCollection (Apache Crunch 0.3.0-incubating API)

org.apache.crunch
Interface PCollection<S>

All Known Subinterfaces:
PGroupedTable<K,V>, PTable<K,V>
All Known Implementing Classes:
DoCollectionImpl, DoTableImpl, InputCollection, InputTable, MemCollection, MemTable, PCollectionImpl, PGroupedTableImpl, PTableBase, UnionCollection, UnionTable

public interface PCollection<S>

A representation of an immutable, distributed collection of elements that is the fundamental target of computations in Crunch.


Method Summary
<K> PTable<K,S> by(MapFn<S,K> extractKeyFn, PType<K> keyType)
          Applies the given map function to each element of this instance to extract its key, creating a PTable of (key, element) pairs.
<K> PTable<K,S> by(String name, MapFn<S,K> extractKeyFn, PType<K> keyType)
          Applies the given map function to each element of this instance to extract its key, creating a PTable of (key, element) pairs.
PTable<S,Long> count()
          Returns a PTable instance that contains the counts of each unique element of this PCollection.
PCollection<S> filter(FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
PCollection<S> filter(String name, FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
String getName()
          Returns a shorthand name for this PCollection.
Pipeline getPipeline()
          Returns the Pipeline associated with this PCollection.
PType<S> getPType()
          Returns the PType of this PCollection.
long getSize()
          Returns the size of the data represented by this PCollection in bytes.
PTypeFamily getTypeFamily()
          Returns the PTypeFamily of this PCollection.
Iterable<S> materialize()
          Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.
PCollection<S> max()
          Returns a PCollection made up of only the maximum element of this instance.
PCollection<S> min()
          Returns a PCollection made up of only the minimum element of this instance.
<K,V> PTable<K,V> parallelDo(DoFn<S,Pair<K,V>> doFn, PTableType<K,V> type)
          Similar to the parallelDo overload that returns a PCollection, but returns a PTable instead.
<T> PCollection<T> parallelDo(DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<K,V> PTable<K,V> parallelDo(String name, DoFn<S,Pair<K,V>> doFn, PTableType<K,V> type)
          Similar to the parallelDo overload that returns a PCollection, but returns a PTable instead.
<T> PCollection<T> parallelDo(String name, DoFn<S,T> doFn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
PCollection<S> sample(double acceptanceProbability)
          Randomly sample items from this PCollection instance with the given probability of an item being accepted.
PCollection<S> sample(double acceptanceProbability, long seed)
          Randomly sample items from this PCollection instance with the given probability of an item being accepted and using the given seed.
PCollection<S> sort(boolean ascending)
          Returns a PCollection instance that contains all of the elements of this instance in sorted order.
PCollection<S> union(PCollection<S>... collections)
          Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.
PCollection<S> write(Target target)
          Write the contents of this PCollection to the given Target, using the storage format specified by the target.
 

Method Detail

getPipeline

Pipeline getPipeline()
Returns the Pipeline associated with this PCollection.


union

PCollection<S> union(PCollection<S>... collections)
Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.


parallelDo

<T> PCollection<T> parallelDo(DoFn<S,T> doFn,
                              PType<T> type)
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Parameters:
doFn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection
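
As an illustration only, the following plain-Java sketch models what applying a doFn to every element could look like on an in-memory list. The nested Emitter and DoFn interfaces and the class name ParallelDoSketch are hypothetical stand-ins written for this example; they are not the Crunch types, and the sketch ignores the PType argument and any distributed execution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model; these are NOT the Crunch classes.
class ParallelDoSketch {
    /** Receives the values produced while processing one input element. */
    interface Emitter<T> {
        void emit(T value);
    }

    /** Processes one element and may emit zero, one, or many outputs. */
    interface DoFn<S, T> {
        void process(S input, Emitter<T> emitter);
    }

    /** Applies doFn to every element and collects everything it emits. */
    static <S, T> List<T> parallelDo(List<S> input, DoFn<S, T> doFn) {
        List<T> output = new ArrayList<>();
        for (S element : input) {
            doFn.process(element, output::add);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b", "", "c");
        // Split each line into words; the empty line emits nothing.
        List<String> words = ParallelDoSketch.<String, String>parallelDo(lines, (line, emitter) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    emitter.emit(word);
                }
            }
        });
        System.out.println(words); // prints [a, b, c]
    }
}
```

The point of the sketch is that the output need not have one element per input: a single call may emit nothing or several values.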

parallelDo

<T> PCollection<T> parallelDo(String name,
                              DoFn<S,T> doFn,
                              PType<T> type)
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Parameters:
name - An identifier for this processing step, useful for debugging
doFn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection

parallelDo

<K,V> PTable<K,V> parallelDo(DoFn<S,Pair<K,V>> doFn,
                             PTableType<K,V> type)
Similar to the parallelDo overload that returns a PCollection, but returns a PTable instead.

Parameters:
doFn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

parallelDo

<K,V> PTable<K,V> parallelDo(String name,
                             DoFn<S,Pair<K,V>> doFn,
                             PTableType<K,V> type)
Similar to the parallelDo overload that returns a PCollection, but returns a PTable instead.

Parameters:
name - An identifier for this processing step
doFn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

write

PCollection<S> write(Target target)
Write the contents of this PCollection to the given Target, using the storage format specified by the target.

Parameters:
target - The target to write to

materialize

Iterable<S> materialize()
Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.


getPType

PType<S> getPType()
Returns the PType of this PCollection.


getTypeFamily

PTypeFamily getTypeFamily()
Returns the PTypeFamily of this PCollection.


getSize

long getSize()
Returns the size of the data represented by this PCollection in bytes.


getName

String getName()
Returns a shorthand name for this PCollection.


filter

PCollection<S> filter(FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting PCollection.


filter

PCollection<S> filter(String name,
                      FilterFn<S> filterFn)
Apply the given filter function to this instance and return the resulting PCollection.

Parameters:
name - An identifier for this processing step
filterFn - The FilterFn to apply
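
A minimal plain-Java sketch of the filtering behavior described above, using an in-memory list. The nested FilterFn interface and the class name FilterSketch are hypothetical stand-ins, and the sketch assumes that a return value of true means the element is kept, which the description does not spell out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model; this is NOT the Crunch FilterFn.
class FilterSketch {
    /** Decides whether an element is kept (assumption: true keeps it). */
    interface FilterFn<S> {
        boolean accept(S input);
    }

    /** Returns a new list holding only the accepted elements, in order. */
    static <S> List<S> filter(List<S> input, FilterFn<S> filterFn) {
        List<S> kept = new ArrayList<>();
        for (S element : input) {
            if (filterFn.accept(element)) {
                kept.add(element);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> evens = FilterSketch.<Integer>filter(Arrays.asList(1, 2, 3, 4), n -> n % 2 == 0);
        System.out.println(evens); // prints [2, 4]
    }
}
```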

by

<K> PTable<K,S> by(MapFn<S,K> extractKeyFn,
                   PType<K> keyType)
Applies the given map function to each element of this instance to extract its key, creating a PTable of (key, element) pairs.


by

<K> PTable<K,S> by(String name,
                   MapFn<S,K> extractKeyFn,
                   PType<K> keyType)
Applies the given map function to each element of this instance to extract its key, creating a PTable of (key, element) pairs.

Parameters:
name - An identifier for this processing step
extractKeyFn - The MapFn that extracts a key from each element
keyType - The PType of the extracted keys
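
The signature suggests that each element becomes the value of a pair whose key comes from extractKeyFn. The plain-Java sketch below models that on an in-memory list, with java.util.function.Function standing in for MapFn and a list of Map.Entry standing in for the PTable; the class name BySketch is hypothetical and the keyType argument is left out.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory model; not Crunch code.
class BySketch {
    /** Pairs every element with the key extracted from it, keeping input order. */
    static <S, K> List<Map.Entry<K, S>> by(List<S> input, Function<S, K> extractKeyFn) {
        List<Map.Entry<K, S>> table = new ArrayList<>();
        for (S element : input) {
            table.add(new SimpleImmutableEntry<>(extractKeyFn.apply(element), element));
        }
        return table;
    }

    public static void main(String[] args) {
        // Key each word by its length.
        List<Map.Entry<Integer, String>> byLength =
            BySketch.<String, Integer>by(Arrays.asList("apple", "kiwi"), String::length);
        System.out.println(byLength); // prints [5=apple, 4=kiwi]
    }
}
```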

sort

PCollection<S> sort(boolean ascending)
Returns a PCollection instance that contains all of the elements of this instance in sorted order.
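
A small plain-Java sketch of how the ascending flag might be read, on an in-memory list. It assumes the elements have a natural ordering (Comparable), which this page does not state; the class name SortSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory model; not Crunch code.
class SortSketch {
    /** Returns a sorted copy; the input list is left unchanged. */
    static <S extends Comparable<? super S>> List<S> sort(List<S> input, boolean ascending) {
        List<S> sorted = new ArrayList<>(input);
        sorted.sort(ascending ? Comparator.<S>naturalOrder() : Comparator.<S>reverseOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 1, 2);
        System.out.println(sort(data, true));  // prints [1, 2, 3]
        System.out.println(sort(data, false)); // prints [3, 2, 1]
    }
}
```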


count

PTable<S,Long> count()
Returns a PTable instance that contains the counts of each unique element of this PCollection.
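
To make the result shape concrete, here is a plain-Java sketch that counts each unique element of an in-memory list. A Map stands in for the PTable<S,Long>, and the class name CountSketch is hypothetical; it models the documented behavior and is not the Crunch implementation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; not Crunch code.
class CountSketch {
    /** Maps each distinct element to the number of times it occurs. */
    static <S> Map<S, Long> count(List<S> input) {
        Map<S, Long> counts = new LinkedHashMap<>();
        for (S element : input) {
            // Start at 1 for a new element, otherwise add 1 to the running total.
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("a", "b", "a"))); // prints {a=2, b=1}
    }
}
```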


max

PCollection<S> max()
Returns a PCollection made up of only the maximum element of this instance.


min

PCollection<S> min()
Returns a PCollection made up of only the minimum element of this instance.
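
The max() and min() results are collections rather than bare values. The plain-Java sketch below models that with single-element lists. It assumes a natural ordering on the elements and an empty result for empty input, neither of which this page states; the class name MinMaxSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory model; not Crunch code.
class MinMaxSketch {
    /** Returns a one-element list holding the maximum, or an empty list. */
    static <S extends Comparable<? super S>> List<S> max(List<S> input) {
        return input.isEmpty()
            ? Collections.<S>emptyList()
            : Collections.singletonList(Collections.max(input));
    }

    /** Returns a one-element list holding the minimum, or an empty list. */
    static <S extends Comparable<? super S>> List<S> min(List<S> input) {
        return input.isEmpty()
            ? Collections.<S>emptyList()
            : Collections.singletonList(Collections.min(input));
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 1, 2);
        System.out.println(max(data)); // prints [3]
        System.out.println(min(data)); // prints [1]
    }
}
```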


sample

PCollection<S> sample(double acceptanceProbability)
Randomly sample items from this PCollection instance with the given probability of an item being accepted.


sample

PCollection<S> sample(double acceptanceProbability,
                      long seed)
Randomly sample items from this PCollection instance with the given probability of an item being accepted and using the given seed.
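
One way to picture seeded sampling is an independent accept-or-reject draw per element. The plain-Java sketch below does that with java.util.Random on an in-memory list. The acceptance rule (keep an element when a uniform draw is below acceptanceProbability) is an assumption for illustration, and the class name SampleSketch is hypothetical; the actual Crunch sampling logic may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical in-memory model; not Crunch code.
class SampleSketch {
    /** Keeps each element independently with the given probability. */
    static <S> List<S> sample(List<S> input, double acceptanceProbability, long seed) {
        Random random = new Random(seed);
        List<S> accepted = new ArrayList<>();
        for (S element : input) {
            // nextDouble() is uniform in [0.0, 1.0), so 1.0 keeps all and 0.0 keeps none.
            if (random.nextDouble() < acceptanceProbability) {
                accepted.add(element);
            }
        }
        return accepted;
    }

    /** Unseeded variant: picks an arbitrary seed for each call. */
    static <S> List<S> sample(List<S> input, double acceptanceProbability) {
        return sample(input, acceptanceProbability, new Random().nextLong());
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // The same seed yields the same sample on every run.
        System.out.println(sample(data, 0.5, 42L).equals(sample(data, 0.5, 42L))); // prints true
    }
}
```

Fixing the seed makes a sample reproducible, which is useful when a pipeline run needs to be repeated or debugged.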



Copyright © 2012 The Apache Software Foundation. All Rights Reserved.