This project has retired. For details please refer to its Attic page.
Pipeline (Apache Crunch 0.9.0 API)

org.apache.crunch
Interface Pipeline

All Known Implementing Classes:
DistributedPipeline, MemPipeline, MRPipeline, SparkPipeline

public interface Pipeline

Manages the state of a pipeline execution.


Method Summary
<T> void
cache(PCollection<T> pcollection, CachingOptions options)
          Caches the given PCollection so that it will be processed at most once during pipeline execution.
 void cleanup(boolean force)
          Cleans up any artifacts created as a result of running the pipeline.
 PipelineResult done()
          Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.
 void enableDebug()
          Turn on debug logging for jobs that are run from this pipeline.
 org.apache.hadoop.conf.Configuration getConfiguration()
          Returns the Configuration instance associated with this pipeline.
 String getName()
          Returns the name of this pipeline.
<T> Iterable<T>
materialize(PCollection<T> pcollection)
          Create the given PCollection and read the data it contains into the returned Collection instance for client use.
<T> PCollection<T>
read(Source<T> source)
          Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
<K,V> PTable<K,V>
read(TableSource<K,V> tableSource)
          A version of the read method for TableSource instances that map to PTables.
 PCollection<String> readTextFile(String pathName)
          A convenience method for reading a text file.
 PipelineResult run()
          Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
 PipelineExecution runAsync()
          Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.
 void setConfiguration(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration to use with this pipeline.
 void write(PCollection<?> collection, Target target)
          Write the given collection to the given target on the next pipeline run.
 void write(PCollection<?> collection, Target target, Target.WriteMode writeMode)
          Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.
<T> void
writeTextFile(PCollection<T> collection, String pathName)
          A convenience method for writing a text file.
 

Method Detail

setConfiguration

void setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the Configuration to use with this pipeline.


getName

String getName()
Returns the name of this pipeline.

Returns:
Name of the pipeline

getConfiguration

org.apache.hadoop.conf.Configuration getConfiguration()
Returns the Configuration instance associated with this pipeline.


read

<T> PCollection<T> read(Source<T> source)
Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.

Parameters:
source - The source of data
Returns:
A PCollection that references the given source

read

<K,V> PTable<K,V> read(TableSource<K,V> tableSource)
A version of the read method for TableSource instances that map to PTables.

Parameters:
tableSource - The source of the data
Returns:
A PTable that references the given source

write

void write(PCollection<?> collection,
           Target target)
Write the given collection to the given target on the next pipeline run. The system will check to see if the target's location already exists using the WriteMode.DEFAULT rule for the given Target.

Parameters:
collection - The collection
target - The output target

write

void write(PCollection<?> collection,
           Target target,
           Target.WriteMode writeMode)
Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.

Parameters:
collection - The collection
target - The target to write to
writeMode - The strategy to use for handling existing outputs

materialize

<T> Iterable<T> materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the returned Collection instance for client use.

Parameters:
pcollection - The PCollection to materialize
Returns:
the data from the PCollection as a read-only Collection

cache

<T> void cache(PCollection<T> pcollection,
               CachingOptions options)
Caches the given PCollection so that it will be processed at most once during pipeline execution.

Parameters:
pcollection - The PCollection to cache
options - The options for how the cached data is stored

run

PipelineResult run()
Constructs and executes a series of MapReduce jobs in order to write data to the output targets.


runAsync

PipelineExecution runAsync()
Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.

Returns:

done

PipelineResult done()
Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.


cleanup

void cleanup(boolean force)
Cleans up any artifacts created as a result of running the pipeline.

Parameters:
force - forces the cleanup even if all targets of the pipeline have not been completed.

readTextFile

PCollection<String> readTextFile(String pathName)
A convenience method for reading a text file.


writeTextFile

<T> void writeTextFile(PCollection<T> collection,
                       String pathName)
A convenience method for writing a text file.


enableDebug

void enableDebug()
Turn on debug logging for jobs that are run from this pipeline.



Copyright © 2014 The Apache Software Foundation. All Rights Reserved.