Pipeline (Apache Crunch 0.12.0 API)

All Known Implementing Classes:

DistributedPipeline, MemPipeline, MRPipeline, SparkPipeline
```
public interface Pipeline
```
Manages the state of a pipeline execution.

Method Summary

Methods
Modifier and Type	Method and Description
`<T> void`	`cache(PCollection<T> pcollection, CachingOptions options)` Caches the given PCollection so that it will be processed at most once during pipeline execution.
`void`	`cleanup(boolean force)` Cleans up any artifacts created as a result of `running` the pipeline.
`<K,V> PTable<K,V>`	`create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype)` Creates a `PTable` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<K,V> PTable<K,V>`	`create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)` Creates a `PTable` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<T> PCollection<T>`	`create(Iterable<T> contents, PType<T> ptype)` Creates a `PCollection` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<T> PCollection<T>`	`create(Iterable<T> contents, PType<T> ptype, CreateOptions options)` Creates a `PCollection` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`PipelineResult`	`done()` Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to `run`.
`<T> PCollection<T>`	`emptyPCollection(PType<T> ptype)` Creates an empty `PCollection` of the given `PType`.
`<K,V> PTable<K,V>`	`emptyPTable(PTableType<K,V> ptype)` Creates an empty `PTable` of the given `PTable Type`.
`void`	`enableDebug()` Turn on debug logging for jobs that are run from this pipeline.
`org.apache.hadoop.conf.Configuration`	`getConfiguration()` Returns the `Configuration` instance associated with this pipeline.
`String`	`getName()` Returns the name of this pipeline.
`<T> Iterable<T>`	`materialize(PCollection<T> pcollection)` Create the given PCollection and read the data it contains into the returned Collection instance for client use.
`<T> PCollection<T>`	`read(Source<T> source)` Converts the given `Source` into a `PCollection` that is available to jobs run using this `Pipeline` instance.
`<T> PCollection<T>`	`read(Source<T> source, String named)` Converts the given `Source` into a `PCollection` that is available to jobs run using this `Pipeline` instance.
`<K,V> PTable<K,V>`	`read(TableSource<K,V> tableSource)` A version of the read method for `TableSource` instances that map to `PTable`s.
`<K,V> PTable<K,V>`	`read(TableSource<K,V> tableSource, String named)` A version of the read method for `TableSource` instances that map to `PTable`s.
`PCollection<String>`	`readTextFile(String pathName)` A convenience method for reading a text file.
`PipelineResult`	`run()` Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
`PipelineExecution`	`runAsync()` Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a `ListenableFuture` to allow clients to control job execution.
`<Output> Output`	`sequentialDo(PipelineCallable<Output> pipelineCallable)` Executes the given `PipelineCallable` on the client after the `Targets` that the PipelineCallable depends on (if any) have been created by other pipeline processing steps.
`void`	`setConfiguration(org.apache.hadoop.conf.Configuration conf)` Set the `Configuration` to use with this pipeline.
`<S> PCollection<S>`	`union(List<PCollection<S>> collections)`
`<K,V> PTable<K,V>`	`unionTables(List<PTable<K,V>> tables)`
`void`	`write(PCollection<?> collection, Target target)` Write the given collection to the given target on the next pipeline run.
`void`	`write(PCollection<?> collection, Target target, Target.WriteMode writeMode)` Write the contents of the `PCollection` to the given `Target`, using the storage format specified by the target and the given `WriteMode` for cases where the referenced `Target` already exists.
`<T> void`	`writeTextFile(PCollection<T> collection, String pathName)` A convenience method for writing a text file.

- Method Detail
  - setConfiguration
```
void setConfiguration(org.apache.hadoop.conf.Configuration conf)
```
    Set the Configuration to use with this pipeline.
  - getName
```
String getName()
```
    Returns the name of this pipeline.
    
    Returns:
    Name of the pipeline
  - getConfiguration
```
org.apache.hadoop.conf.Configuration getConfiguration()
```
    Returns the Configuration instance associated with this pipeline.
  - read
```
<T> PCollection<T> read(Source<T> source)
```
    Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
    
    Parameters:
    source - The source of data
    
    Returns:
    A PCollection that references the given source
  - read
```
<T> PCollection<T> read(Source<T> source,
                      String named)
```
    Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
    
    Parameters:
    source - The source of data
    named - A name for the returned PCollection
    
    Returns:
    A PCollection that references the given source
  - read
```
<K,V> PTable<K,V> read(TableSource<K,V> tableSource)
```
    A version of the read method for TableSource instances that map to PTables.
    
    Parameters:
    tableSource - The source of the data
    
    Returns:
    A PTable that references the given source
  - read
```
<K,V> PTable<K,V> read(TableSource<K,V> tableSource,
                     String named)
```
    A version of the read method for TableSource instances that map to PTables.
    
    Parameters:
    tableSource - The source of the data
    named - A name for the returned PTable
    
    Returns:
    A PTable that references the given source
  - write
```
void write(PCollection<?> collection,
         Target target)
```
    Write the given collection to the given target on the next pipeline run. The system will check to see if the target's location already exists using the WriteMode.DEFAULT rule for the given Target.
    
    Parameters:
    collection - The collection
    target - The output target
  - write
```
void write(PCollection<?> collection,
         Target target,
         Target.WriteMode writeMode)
```
    Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.
    
    Parameters:
    collection - The collection
    target - The target to write to
    writeMode - The strategy to use for handling existing outputs
  - materialize
```
<T> Iterable<T> materialize(PCollection<T> pcollection)
```
    Create the given PCollection and read the data it contains into the returned Collection instance for client use.
    
    Parameters:
    pcollection - The PCollection to materialize
    
    Returns:
    the data from the PCollection as a read-only Collection
  - cache
```
<T> void cache(PCollection<T> pcollection,
             CachingOptions options)
```
    Caches the given PCollection so that it will be processed at most once during pipeline execution.
    
    Parameters:
    pcollection - The PCollection to cache
    options - The options for how the cached data is stored
  - emptyPCollection
```
<T> PCollection<T> emptyPCollection(PType<T> ptype)
```
    Creates an empty PCollection of the given PType.
    
    Parameters:
    ptype - The PType of the empty PCollection
    
    Returns:
    A valid PCollection with no contents
  - emptyPTable
```
<K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
```
    Creates an empty PTable of the given PTable Type.
    
    Parameters:
    ptype - The PTableType of the empty PTable
    
    Returns:
    A valid PTable with no contents
  - create
```
<T> PCollection<T> create(Iterable<T> contents,
                        PType<T> ptype)
```
    Creates a PCollection containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Parameters:
    contents - The values the new PCollection will contain
    ptype - The PType of the PCollection
    
    Returns:
    A PCollection that contains the given values
  - create
```
<T> PCollection<T> create(Iterable<T> contents,
                        PType<T> ptype,
                        CreateOptions options)
```
    Creates a PCollection containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Parameters:
    contents - The values the new PCollection will contain
    ptype - The PType of the PCollection
    options - Additional options, such as the name or desired parallelism of the PCollection
    
    Returns:
    A PCollection that contains the given values
  - create
```
<K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents,
                       PTableType<K,V> ptype)
```
    Creates a PTable containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Parameters:
    contents - The values the new PTable will contain
    ptype - The PTableType of the PTable
    
    Returns:
    A PTable that contains the given values
  - create
```
<K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents,
                       PTableType<K,V> ptype,
                       CreateOptions options)
```
    Creates a PTable containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Parameters:
    contents - The values the new PTable will contain
    ptype - The PTableType of the PTable
    options - Additional options, such as the name or desired parallelism of the PTable
    
    Returns:
    A PTable that contains the given values
  - union
```
<S> PCollection<S> union(List<PCollection<S>> collections)
```
  - unionTables
```
<K,V> PTable<K,V> unionTables(List<PTable<K,V>> tables)
```
  - sequentialDo
```
<Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)
```
    Executes the given PipelineCallable on the client after the Targets that the PipelineCallable depends on (if any) have been created by other pipeline processing steps.
    
    Type Parameters:
    Output - The return type of the PipelineCallable
    Parameters:
    pipelineCallable - The sequential logic to execute
    
    Returns:
    The result of executing the PipelineCallable
  - run
```
PipelineResult run()
```
    Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
  - runAsync
```
PipelineExecution runAsync()
```
    Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.
    
    Returns:
  - done
```
PipelineResult done()
```
    Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.
  - cleanup
```
void cleanup(boolean force)
```
    Cleans up any artifacts created as a result of running the pipeline.
    
    Parameters:
    force - forces the cleanup even if all targets of the pipeline have not been completed.
  - readTextFile
```
PCollection<String> readTextFile(String pathName)
```
    A convenience method for reading a text file.
  - writeTextFile
```
<T> void writeTextFile(PCollection<T> collection,
                     String pathName)
```
    A convenience method for writing a text file.
  - enableDebug
```
void enableDebug()
```
    Turn on debug logging for jobs that are run from this pipeline.

Interface Pipeline

Method Summary

Method Detail

setConfiguration

getName

getConfiguration

read

read

read

read

write

write

materialize

cache

emptyPCollection

emptyPTable

create

create

create

create

union

unionTables

sequentialDo

run

runAsync

done

cleanup

readTextFile

writeTextFile

enableDebug