This project has retired. For details please refer to its Attic page.
Pipeline (Apache Crunch 0.7.0 API)

org.apache.crunch
Interface Pipeline

All Known Implementing Classes:
MemPipeline, MRPipeline

public interface Pipeline

Manages the state of a pipeline execution.


Method Summary
 PipelineResult done()
          Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.
 void enableDebug()
          Turn on debug logging for jobs that are run from this pipeline.
 org.apache.hadoop.conf.Configuration getConfiguration()
          Returns the Configuration instance associated with this pipeline.
 String getName()
          Returns the name of this pipeline.
<T> Iterable<T>
materialize(PCollection<T> pcollection)
          Create the given PCollection and read the data it contains into the returned Collection instance for client use.
<T> PCollection<T>
read(Source<T> source)
          Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
<K,V> PTable<K,V>
read(TableSource<K,V> tableSource)
          A version of the read method for TableSource instances that map to PTables.
 PCollection<String> readTextFile(String pathName)
          A convenience method for reading a text file.
 PipelineResult run()
          Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
 PipelineExecution runAsync()
          Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.
 void setConfiguration(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration to use with this pipeline.
 void write(PCollection<?> collection, Target target)
          Write the given collection to the given target on the next pipeline run.
 void write(PCollection<?> collection, Target target, Target.WriteMode writeMode)
          Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.
<T> void
writeTextFile(PCollection<T> collection, String pathName)
          A convenience method for writing a text file.
 

Method Detail

setConfiguration

void setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the Configuration to use with this pipeline.


getName

String getName()
Returns the name of this pipeline.

Returns:
Name of the pipeline

getConfiguration

org.apache.hadoop.conf.Configuration getConfiguration()
Returns the Configuration instance associated with this pipeline.


read

<T> PCollection<T> read(Source<T> source)
Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.

Parameters:
source - The source of data
Returns:
A PCollection that references the given source

read

<K,V> PTable<K,V> read(TableSource<K,V> tableSource)
A version of the read method for TableSource instances that map to PTables.

Parameters:
tableSource - The source of the data
Returns:
A PTable that references the given source

write

void write(PCollection<?> collection,
           Target target)
Write the given collection to the given target on the next pipeline run. The system will check to see if the target's location already exists using the WriteMode.DEFAULT rule for the given Target.

Parameters:
collection - The collection
target - The output target

write

void write(PCollection<?> collection,
           Target target,
           Target.WriteMode writeMode)
Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.

Parameters:
collection - The collection
target - The target to write to
writeMode - The strategy to use for handling existing outputs

materialize

<T> Iterable<T> materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the returned Collection instance for client use.

Parameters:
pcollection - The PCollection to materialize
Returns:
the data from the PCollection as a read-only Collection

run

PipelineResult run()
Constructs and executes a series of MapReduce jobs in order to write data to the output targets.


runAsync

PipelineExecution runAsync()
Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.

Returns:

done

PipelineResult done()
Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.


readTextFile

PCollection<String> readTextFile(String pathName)
A convenience method for reading a text file.


writeTextFile

<T> void writeTextFile(PCollection<T> collection,
                       String pathName)
A convenience method for writing a text file.


enableDebug

void enableDebug()
Turn on debug logging for jobs that are run from this pipeline.



Copyright © 2013 The Apache Software Foundation. All Rights Reserved.