This project has retired. For details please refer to its Attic page.
PCollectionImpl (Apache Crunch 0.10.0 API)

org.apache.crunch.impl.dist.collect
Class PCollectionImpl<S>

java.lang.Object
  extended by org.apache.crunch.impl.dist.collect.PCollectionImpl<S>
All Implemented Interfaces:
PCollection<S>
Direct Known Subclasses:
BaseDoCollection, BaseGroupedTable, BaseInputCollection, BaseUnionCollection, EmptyPCollection, PTableBase

public abstract class PCollectionImpl<S>
extends Object
implements PCollection<S>


Nested Class Summary
static interface PCollectionImpl.Visitor
           
 
Constructor Summary
PCollectionImpl(String name, DistributedPipeline pipeline)
           
PCollectionImpl(String name, DistributedPipeline pipeline, ParallelDoOptions doOptions)
           
 
Method Summary
 void accept(PCollectionImpl.Visitor visitor)
           
 PCollection<S> aggregate(Aggregator<S> aggregator)
          Returns a PCollection that contains the result of aggregating all values in this instance.
 PObject<Collection<S>> asCollection()
          
 ReadableData<S> asReadable(boolean materialize)
           
<K> PTable<K,S>
by(MapFn<S,K> mapFn, PType<K> keyType)
          Apply the given map function to each element of this instance in order to create a PTable.
<K> PTable<K,S>
by(String name, MapFn<S,K> mapFn, PType<K> keyType)
          Apply the given map function to each element of this instance in order to create a PTable.
 PCollection<S> cache()
          Marks this data as cached using the default CachingOptions.
 PCollection<S> cache(CachingOptions options)
          Marks this data as cached using the given CachingOptions.
 PTable<S,Long> count()
          Returns a PTable instance that contains the counts of each unique element of this PCollection.
 PCollection<S> filter(FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
 PCollection<S> filter(String name, FilterFn<S> filterFn)
          Apply the given filter function to this instance and return the resulting PCollection.
 PObject<S> first()
           
 int getDepth()
           
abstract  long getLastModifiedAt()
          The time of the most recent modification to one of the input sources to the collection.
 SourceTarget<S> getMaterializedAt()
           
 String getName()
          Returns a shorthand name for this PCollection.
 PCollectionImpl<?> getOnlyParent()
           
 ParallelDoOptions getParallelDoOptions()
           
abstract  List<PCollectionImpl<?>> getParents()
           
 DistributedPipeline getPipeline()
          Returns the Pipeline associated with this PCollection.
 long getSize()
          Returns the size of the data represented by this PCollection in bytes.
 Set<SourceTarget<?>> getTargetDependencies()
           
 PTypeFamily getTypeFamily()
          Returns the PTypeFamily of this PCollection.
 boolean isBreakpoint()
           
 PObject<Long> length()
          Returns the number of elements represented by this PCollection.
 Iterable<S> materialize()
          Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.
 void materializeAt(SourceTarget<S> sourceTarget)
           
 PObject<S> max()
          Returns a PObject of the maximum element of this instance.
 PObject<S> min()
          Returns a PObject of the minimum element of this instance.
<K,V> PTable<K,V>
parallelDo(DoFn<S,Pair<K,V>> fn, PTableType<K,V> type)
          Similar to the other parallelDo overloads, but returns a PTable instead of a PCollection.
<T> PCollection<T>
parallelDo(DoFn<S,T> fn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<K,V> PTable<K,V>
parallelDo(String name, DoFn<S,Pair<K,V>> fn, PTableType<K,V> type)
          Similar to the other parallelDo overloads, but returns a PTable instead of a PCollection.
<K,V> PTable<K,V>
parallelDo(String name, DoFn<S,Pair<K,V>> fn, PTableType<K,V> type, ParallelDoOptions options)
          Similar to the other parallelDo overloads, but returns a PTable instead of a PCollection.
<T> PCollection<T>
parallelDo(String name, DoFn<S,T> fn, PType<T> type)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
<T> PCollection<T>
parallelDo(String name, DoFn<S,T> fn, PType<T> type, ParallelDoOptions options)
          Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.
 void setBreakpoint()
           
 String toString()
           
 PCollection<S> union(PCollection<S>... collections)
          Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.
 PCollection<S> union(PCollection<S> other)
          Returns a PCollection instance that acts as the union of this PCollection and the given PCollection.
 PCollection<S> write(Target target)
          Write the contents of this PCollection to the given Target, using the storage format specified by the target.
 PCollection<S> write(Target target, Target.WriteMode writeMode)
          Write the contents of this PCollection to the given Target, using the given Target.WriteMode to handle existing targets.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface org.apache.crunch.PCollection
getPType
 

Constructor Detail

PCollectionImpl

public PCollectionImpl(String name,
                       DistributedPipeline pipeline)

PCollectionImpl

public PCollectionImpl(String name,
                       DistributedPipeline pipeline,
                       ParallelDoOptions doOptions)
Method Detail

getName

public String getName()
Description copied from interface: PCollection
Returns a shorthand name for this PCollection.

Specified by:
getName in interface PCollection<S>

getPipeline

public DistributedPipeline getPipeline()
Description copied from interface: PCollection
Returns the Pipeline associated with this PCollection.

Specified by:
getPipeline in interface PCollection<S>

getParallelDoOptions

public ParallelDoOptions getParallelDoOptions()

toString

public String toString()
Overrides:
toString in class Object

materialize

public Iterable<S> materialize()
Description copied from interface: PCollection
Returns a reference to the data set represented by this PCollection that may be used by the client to read the data locally.

Specified by:
materialize in interface PCollection<S>

cache

public PCollection<S> cache()
Description copied from interface: PCollection
Marks this data as cached using the default CachingOptions. Cached PCollections will only be processed once, and then their contents will be saved so that downstream code can process them many times.

Specified by:
cache in interface PCollection<S>
Returns:
this PCollection instance

cache

public PCollection<S> cache(CachingOptions options)
Description copied from interface: PCollection
Marks this data as cached using the given CachingOptions. Cached PCollections will only be processed once, and then their contents will be saved so that downstream code can process them many times.

Specified by:
cache in interface PCollection<S>
Parameters:
options - the options that control the cache settings for the data
Returns:
this PCollection instance
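
The "processed once, then reused" behavior described for cache can be illustrated with a small plain-Java analogy. This is only a sketch of the general idea (memoizing an expensive computation); CachedResult is a hypothetical class invented here, and it says nothing about how Crunch stores or reuses cached data.

```java
import java.util.function.Supplier;

/** Hypothetical helper: runs the wrapped computation at most once and reuses its result. */
class CachedResult<T> implements Supplier<T> {
    private final Supplier<T> computation;
    private T value;
    private boolean computed = false;

    CachedResult(Supplier<T> computation) {
        this.computation = computation;
    }

    @Override
    public synchronized T get() {
        if (!computed) {
            // First request: do the expensive work and remember the result.
            value = computation.get();
            computed = true;
        }
        // Later requests: return the saved result without recomputing.
        return value;
    }
}
```

Any number of downstream readers can call get(); the wrapped computation still runs only once.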

union

public PCollection<S> union(PCollection<S> other)
Description copied from interface: PCollection
Returns a PCollection instance that acts as the union of this PCollection and the given PCollection.

Specified by:
union in interface PCollection<S>

union

public PCollection<S> union(PCollection<S>... collections)
Description copied from interface: PCollection
Returns a PCollection instance that acts as the union of this PCollection and the input PCollections.

Specified by:
union in interface PCollection<S>

parallelDo

public <T> PCollection<T> parallelDo(DoFn<S,T> fn,
                                     PType<T> type)
Description copied from interface: PCollection
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Specified by:
parallelDo in interface PCollection<S>
Parameters:
fn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection
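
As a rough illustration of the element-wise processing described above, the following plain-Java sketch models a DoFn as a callback that may emit zero or more outputs for each input. SimpleDoFn and ParallelDoSketch are hypothetical names invented for this example; the code runs in memory on a List and does not use the Crunch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a DoFn: receives one input and emits any number of outputs. */
interface SimpleDoFn<S, T> {
    void process(S input, Consumer<T> emitter);
}

class ParallelDoSketch {
    /** Applies fn to each element and collects everything it emits, in input order. */
    static <S, T> List<T> parallelDo(List<S> input, SimpleDoFn<S, T> fn) {
        List<T> output = new ArrayList<>();
        for (S element : input) {
            fn.process(element, output::add);
        }
        return output;
    }

    /** Example function: split each line into words, dropping empty tokens. */
    static List<String> splitWords(List<String> lines) {
        return parallelDo(lines, (line, emitter) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    emitter.accept(word);
                }
            }
        });
    }
}
```

Because the function emits through a callback rather than returning a value, one input can produce several outputs (a line with many words) or none at all (an empty line).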

parallelDo

public <T> PCollection<T> parallelDo(String name,
                                     DoFn<S,T> fn,
                                     PType<T> type)
Description copied from interface: PCollection
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Specified by:
parallelDo in interface PCollection<S>
Parameters:
name - An identifier for this processing step, useful for debugging
fn - The DoFn to apply
type - The PType of the resulting PCollection
Returns:
a new PCollection

parallelDo

public <T> PCollection<T> parallelDo(String name,
                                     DoFn<S,T> fn,
                                     PType<T> type,
                                     ParallelDoOptions options)
Description copied from interface: PCollection
Applies the given doFn to the elements of this PCollection and returns a new PCollection that is the output of this processing.

Specified by:
parallelDo in interface PCollection<S>
Parameters:
name - An identifier for this processing step, useful for debugging
fn - The DoFn to apply
type - The PType of the resulting PCollection
options - Optional information that is needed for certain pipeline operations
Returns:
a new PCollection

parallelDo

public <K,V> PTable<K,V> parallelDo(DoFn<S,Pair<K,V>> fn,
                                    PTableType<K,V> type)
Description copied from interface: PCollection
Similar to the other parallelDo overloads, but returns a PTable instead of a PCollection.

Specified by:
parallelDo in interface PCollection<S>
Parameters:
fn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

parallelDo

public <K,V> PTable<K,V> parallelDo(String name,
                                    DoFn<S,Pair<K,V>> fn,
                                    PTableType<K,V> type)
Description copied from interface: PCollection
Similar to the other parallelDo overloads, but returns a PTable instead of a PCollection.

Specified by:
parallelDo in interface PCollection<S>
Parameters:
name - An identifier for this processing step
fn - The DoFn to apply
type - The PTableType of the resulting PTable
Returns:
a new PTable

parallelDo

public <K,V> PTable<K,V> parallelDo(String name,
                                    DoFn<S,Pair<K,V>> fn,
                                    PTableType<K,V> type,
                                    ParallelDoOptions options)
Description copied from interface: PCollection
Similar to the other parallelDo overloads, but returns a PTable instead of a PCollection.

Specified by:
parallelDo in interface PCollection<S>
Parameters:
name - An identifier for this processing step
fn - The DoFn to apply
type - The PTableType of the resulting PTable
options - Optional information that is needed for certain pipeline operations
Returns:
a new PTable

write

public PCollection<S> write(Target target)
Description copied from interface: PCollection
Write the contents of this PCollection to the given Target, using the storage format specified by the target.

Specified by:
write in interface PCollection<S>
Parameters:
target - The target to write to

write

public PCollection<S> write(Target target,
                            Target.WriteMode writeMode)
Description copied from interface: PCollection
Write the contents of this PCollection to the given Target, using the given Target.WriteMode to handle existing targets.

Specified by:
write in interface PCollection<S>
Parameters:
target - The target to write to
writeMode - The rule for handling existing outputs at the target location

accept

public void accept(PCollectionImpl.Visitor visitor)

setBreakpoint

public void setBreakpoint()

isBreakpoint

public boolean isBreakpoint()

asCollection

public PObject<Collection<S>> asCollection()

Specified by:
asCollection in interface PCollection<S>
Returns:
A PObject encapsulating an in-memory Collection containing the values of this PCollection.

first

public PObject<S> first()
Specified by:
first in interface PCollection<S>
Returns:
The first element of this PCollection.

getMaterializedAt

public SourceTarget<S> getMaterializedAt()

materializeAt

public void materializeAt(SourceTarget<S> sourceTarget)

filter

public PCollection<S> filter(FilterFn<S> filterFn)
Description copied from interface: PCollection
Apply the given filter function to this instance and return the resulting PCollection.

Specified by:
filter in interface PCollection<S>

filter

public PCollection<S> filter(String name,
                             FilterFn<S> filterFn)
Description copied from interface: PCollection
Apply the given filter function to this instance and return the resulting PCollection.

Specified by:
filter in interface PCollection<S>
Parameters:
name - An identifier for this processing step
filterFn - The FilterFn to apply
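
The effect of filter can be pictured with an in-memory analogy: keep the elements the function accepts and drop the rest. The sketch below uses java.util.function.Predicate as a stand-in for a FilterFn; FilterSketch is a hypothetical name and the code does not use the Crunch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FilterSketch {
    /** Returns a new list holding only the elements accepted by the predicate, in input order. */
    static <S> List<S> filter(List<S> input, Predicate<S> accept) {
        List<S> kept = new ArrayList<>();
        for (S element : input) {
            if (accept.test(element)) {
                kept.add(element);
            }
        }
        return kept;
    }
}
```

The input list is left unchanged; the result is a separate collection, mirroring the way filter returns a resulting PCollection.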

by

public <K> PTable<K,S> by(MapFn<S,K> mapFn,
                          PType<K> keyType)
Description copied from interface: PCollection
Apply the given map function to each element of this instance in order to create a PTable.

Specified by:
by in interface PCollection<S>

by

public <K> PTable<K,S> by(String name,
                          MapFn<S,K> mapFn,
                          PType<K> keyType)
Description copied from interface: PCollection
Apply the given map function to each element of this instance in order to create a PTable.

Specified by:
by in interface PCollection<S>
Parameters:
name - An identifier for this processing step
mapFn - The MapFn to apply
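
To make the shape of the result concrete: by keeps every element as the value and uses the map function's output as its key, which matches the PTable<K,S> return type. The following plain-Java sketch models that with a list of key/value entries; BySketch is a hypothetical name and java.util.function.Function stands in for MapFn.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class BySketch {
    /** Pairs each element with the key computed from it, preserving input order. */
    static <S, K> List<Map.Entry<K, S>> by(List<S> input, Function<S, K> mapFn) {
        List<Map.Entry<K, S>> table = new ArrayList<>();
        for (S element : input) {
            // The key comes from the map function; the value is the original element.
            table.add(new SimpleEntry<>(mapFn.apply(element), element));
        }
        return table;
    }
}
```

For example, keying the strings "a" and "bb" by String::length yields the entries (1, "a") and (2, "bb").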

count

public PTable<S,Long> count()
Description copied from interface: PCollection
Returns a PTable instance that contains the counts of each unique element of this PCollection.

Specified by:
count in interface PCollection<S>
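
The result of count maps each distinct element to the number of times it occurs, as the PTable<S,Long> return type suggests. A minimal in-memory sketch of that semantics, using a hypothetical CountSketch class and a java.util.Map in place of a PTable:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CountSketch {
    /** Maps each distinct element to its number of occurrences in the input. */
    static <S> Map<S, Long> count(List<S> input) {
        Map<S, Long> counts = new LinkedHashMap<>();
        for (S element : input) {
            // Start at 1 for a new element, otherwise add 1 to the running total.
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }
}
```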

length

public PObject<Long> length()
Description copied from interface: PCollection
Returns the number of elements represented by this PCollection.

Specified by:
length in interface PCollection<S>
Returns:
A PObject containing the number of elements in this PCollection.

max

public PObject<S> max()
Description copied from interface: PCollection
Returns a PObject of the maximum element of this instance.

Specified by:
max in interface PCollection<S>

min

public PObject<S> min()
Description copied from interface: PCollection
Returns a PObject of the minimum element of this instance.

Specified by:
min in interface PCollection<S>

aggregate

public PCollection<S> aggregate(Aggregator<S> aggregator)
Description copied from interface: PCollection
Returns a PCollection that contains the result of aggregating all values in this instance.

Specified by:
aggregate in interface PCollection<S>

getTypeFamily

public PTypeFamily getTypeFamily()
Description copied from interface: PCollection
Returns the PTypeFamily of this PCollection.

Specified by:
getTypeFamily in interface PCollection<S>

getParents

public abstract List<PCollectionImpl<?>> getParents()

getOnlyParent

public PCollectionImpl<?> getOnlyParent()

getTargetDependencies

public Set<SourceTarget<?>> getTargetDependencies()

getDepth

public int getDepth()

asReadable

public ReadableData<S> asReadable(boolean materialize)
Specified by:
asReadable in interface PCollection<S>
Parameters:
materialize - If true, materialize this data before returning a reference to it
Returns:
A reference to the data in this instance that can be read from a job running on a cluster.

getSize

public long getSize()
Description copied from interface: PCollection
Returns the size of the data represented by this PCollection in bytes.

Specified by:
getSize in interface PCollection<S>

getLastModifiedAt

public abstract long getLastModifiedAt()
The time of the most recent modification to one of the input sources to the collection. If the time cannot be determined, -1 should be returned.

Returns:
The time of the most recent modification to one of the input sources to the collection, or -1 if it cannot be determined.
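
Since getLastModifiedAt is abstract, each subclass decides how to combine the timestamps of its inputs. The sketch below shows one possible policy in plain Java: take the latest timestamp, and return -1 when there are no sources or when any source's time is unknown. That policy, along with the names LastModifiedSketch and UNKNOWN, is an assumption made for illustration and is not taken from the Crunch source.

```java
import java.util.List;

class LastModifiedSketch {
    /** Sentinel matching the documented "cannot be determined" value. */
    static final long UNKNOWN = -1L;

    /**
     * Returns the latest of the given source timestamps.
     * Assumed policy: no sources, or any source with an unknown (negative) time, yields UNKNOWN.
     */
    static long lastModifiedAt(List<Long> sourceTimestamps) {
        if (sourceTimestamps.isEmpty()) {
            return UNKNOWN;
        }
        long latest = UNKNOWN;
        for (long timestamp : sourceTimestamps) {
            if (timestamp < 0) {
                return UNKNOWN;
            }
            latest = Math.max(latest, timestamp);
        }
        return latest;
    }
}
```

Treating a single unknown input as making the whole result unknown is the conservative choice; a caller comparing timestamps to decide whether outputs are stale would then fall back to recomputing.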


Copyright © 2014 The Apache Software Foundation. All Rights Reserved.