This project has retired. For details please refer to its Attic page.
Pipeline (Apache Crunch 0.4.0-incubating API)

org.apache.crunch
Interface Pipeline

All Known Implementing Classes:
MemPipeline, MRPipeline

public interface Pipeline

Manages the state of a pipeline execution.


Method Summary
 PipelineResult done()
          Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.
 void enableDebug()
          Turn on debug logging for jobs that are run from this pipeline.
 org.apache.hadoop.conf.Configuration getConfiguration()
          Returns the Configuration instance associated with this pipeline.
 String getName()
          Returns the name of this pipeline.
<T> Iterable<T>
materialize(PCollection<T> pcollection)
          Create the given PCollection and read the data it contains into the returned Collection instance for client use.
<T> PCollection<T>
read(Source<T> source)
          Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
<K,V> PTable<K,V>
read(TableSource<K,V> tableSource)
          A version of the read method for TableSource instances that map to PTables.
 PCollection<String> readTextFile(String pathName)
          A convenience method for reading a text file.
 PipelineResult run()
          Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
 void setConfiguration(org.apache.hadoop.conf.Configuration conf)
          Set the Configuration to use with this pipeline.
 void write(PCollection<?> collection, Target target)
          Write the given collection to the given target on the next pipeline run.
<T> void
writeTextFile(PCollection<T> collection, String pathName)
          A convenience method for writing a text file.
 

Method Detail

setConfiguration

void setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the Configuration to use with this pipeline.


getName

String getName()
Returns the name of this pipeline.

Returns:
Name of the pipeline

getConfiguration

org.apache.hadoop.conf.Configuration getConfiguration()
Returns the Configuration instance associated with this pipeline.


read

<T> PCollection<T> read(Source<T> source)
Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.

Parameters:
source - The source of data
Returns:
A PCollection that references the given source

read

<K,V> PTable<K,V> read(TableSource<K,V> tableSource)
A version of the read method for TableSource instances that map to PTables.

Parameters:
tableSource - The source of the data
Returns:
A PTable that references the given source

write

void write(PCollection<?> collection,
           Target target)
Write the given collection to the given target on the next pipeline run.

Parameters:
collection - The collection
target - The output target

materialize

<T> Iterable<T> materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the returned Collection instance for client use.

Parameters:
pcollection - The PCollection to materialize
Returns:
the data from the PCollection as a read-only Collection

run

PipelineResult run()
Constructs and executes a series of MapReduce jobs in order to write data to the output targets.


done

PipelineResult done()
Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.


readTextFile

PCollection<String> readTextFile(String pathName)
A convenience method for reading a text file.


writeTextFile

<T> void writeTextFile(PCollection<T> collection,
                       String pathName)
A convenience method for writing a text file.


enableDebug

void enableDebug()
Turn on debug logging for jobs that are run from this pipeline.



Copyright © 2012 The Apache Software Foundation. All Rights Reserved.