DistributedPipeline (Apache Crunch 0.14.0 API)

java.lang.Object
- org.apache.crunch.impl.dist.DistributedPipeline

All Implemented Interfaces:

Pipeline

Direct Known Subclasses:

MRPipeline, SparkPipeline
```
public abstract class DistributedPipeline
extends Object
implements Pipeline
```

Constructor Summary

Constructors
Constructor and Description
`DistributedPipeline(String name, org.apache.hadoop.conf.Configuration conf, PCollectionFactory factory)` Instantiate with a custom name and configuration.

Method Summary

All Methods Instance Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`cleanup(boolean force)` Cleans up any artifacts created as a result of `running` the pipeline.
`<K,V> PTable<K,V>`	`create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype)` Creates a `PTable` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<K,V> PTable<K,V>`	`create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)` Creates a `PTable` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<S> PCollection<S>`	`create(Iterable<S> contents, PType<S> ptype)` Creates a `PCollection` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<S> PCollection<S>`	`create(Iterable<S> contents, PType<S> ptype, CreateOptions options)` Creates a `PCollection` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<T> SourceTarget<T>`	`createIntermediateOutput(PType<T> ptype)`
`org.apache.hadoop.fs.Path`	`createTempPath()`
`PipelineResult`	`done()` Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to `run`.
`<S> PCollection<S>`	`emptyPCollection(PType<S> ptype)` Creates an empty `PCollection` of the given `PType`.
`<K,V> PTable<K,V>`	`emptyPTable(PTableType<K,V> ptype)` Creates an empty `PTable` of the given `PTable Type`.
`void`	`enableDebug()` Turn on debug logging for jobs that are run from this pipeline.
`org.apache.hadoop.conf.Configuration`	`getConfiguration()` Returns the `Configuration` instance associated with this pipeline.
`PCollectionFactory`	`getFactory()`
`<T> ReadableSource<T>`	`getMaterializeSourceTarget(PCollection<T> pcollection)` Retrieve a ReadableSourceTarget that provides access to the contents of a `PCollection`.
`String`	`getName()` Returns the name of this pipeline.
`int`	`getNextAnonymousStageId()`
`<S> PCollection<S>`	`read(Source<S> source)` Converts the given `Source` into a `PCollection` that is available to jobs run using this `Pipeline` instance.
`<S> PCollection<S>`	`read(Source<S> source, String named)` Converts the given `Source` into a `PCollection` that is available to jobs run using this `Pipeline` instance.
`<K,V> PTable<K,V>`	`read(TableSource<K,V> source)` A version of the read method for `TableSource` instances that map to `PTable`s.
`<K,V> PTable<K,V>`	`read(TableSource<K,V> source, String named)` A version of the read method for `TableSource` instances that map to `PTable`s.
`PCollection<String>`	`readTextFile(String pathName)` A convenience method for reading a text file.
`<Output> Output`	`sequentialDo(PipelineCallable<Output> pipelineCallable)` Executes the given `PipelineCallable` on the client after the `Targets` that the PipelineCallable depends on (if any) have been created by other pipeline processing steps.
`void`	`setConfiguration(org.apache.hadoop.conf.Configuration conf)` Set the `Configuration` to use with this pipeline.
`<S> PCollection<S>`	`union(List<PCollection<S>> collections)`
`<K,V> PTable<K,V>`	`unionTables(List<PTable<K,V>> tables)`
`void`	`write(PCollection<?> pcollection, Target target)` Write the given collection to the given target on the next pipeline run.
`void`	`write(PCollection<?> pcollection, Target target, Target.WriteMode writeMode)` Write the contents of the `PCollection` to the given `Target`, using the storage format specified by the target and the given `WriteMode` for cases where the referenced `Target` already exists.
`<T> void`	`writeTextFile(PCollection<T> pcollection, String pathName)` A convenience method for writing a text file.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.crunch.Pipeline
cache, materialize, run, runAsync

- Constructor Detail
  - DistributedPipeline
```
public DistributedPipeline(String name,
                           org.apache.hadoop.conf.Configuration conf,
                           PCollectionFactory factory)
```
    Instantiate with a custom name and configuration.
    
    Parameters:
    
    name - Display name of the pipeline
    
    conf - Configuration to be used within all MapReduce jobs run in the pipeline
- Method Detail
  - getFactory
```
public PCollectionFactory getFactory()
```
  - getConfiguration
```
public org.apache.hadoop.conf.Configuration getConfiguration()
```
    Description copied from interface: Pipeline
    
    Returns the Configuration instance associated with this pipeline.
    
    Specified by:
    
    getConfiguration in interface Pipeline
  - setConfiguration
```
public void setConfiguration(org.apache.hadoop.conf.Configuration conf)
```
    Description copied from interface: Pipeline
    
    Set the Configuration to use with this pipeline.
    
    Specified by:
    
    setConfiguration in interface Pipeline
  - done
```
public PipelineResult done()
```
    Description copied from interface: Pipeline
    
    Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.
    
    Specified by:
    
    done in interface Pipeline
  - union
```
public <S> PCollection<S> union(List<PCollection<S>> collections)
```
    Specified by:
    
    union in interface Pipeline
  - unionTables
```
public <K,V> PTable<K,V> unionTables(List<PTable<K,V>> tables)
```
    Specified by:
    
    unionTables in interface Pipeline
  - sequentialDo
```
public <Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)
```
    Description copied from interface: Pipeline
    
    Executes the given PipelineCallable on the client after the Targets that the PipelineCallable depends on (if any) have been created by other pipeline processing steps.
    
    Specified by:
    
    sequentialDo in interface Pipeline
    
    Type Parameters:
    
    Output - The return type of the PipelineCallable
    
    Parameters:
    
    pipelineCallable - The sequential logic to execute
    
    Returns:
    
    The result of executing the PipelineCallable
  - read
```
public <S> PCollection<S> read(Source<S> source)
```
    Description copied from interface: Pipeline
    
    Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
    
    Specified by:
    
    read in interface Pipeline
    
    Parameters:
    
    source - The source of data
    
    Returns:
    
    A PCollection that references the given source
  - read
```
public <S> PCollection<S> read(Source<S> source,
                               String named)
```
    Description copied from interface: Pipeline
    
    Converts the given Source into a PCollection that is available to jobs run using this Pipeline instance.
    
    Specified by:
    
    read in interface Pipeline
    
    Parameters:
    
    source - The source of data
    
    named - A name for the returned PCollection
    
    Returns:
    
    A PCollection that references the given source
  - read
```
public <K,V> PTable<K,V> read(TableSource<K,V> source)
```
    Description copied from interface: Pipeline
    
    A version of the read method for TableSource instances that map to PTables.
    
    Specified by:
    
    read in interface Pipeline
    
    Parameters:
    
    source - The source of the data
    
    Returns:
    
    A PTable that references the given source
  - read
```
public <K,V> PTable<K,V> read(TableSource<K,V> source,
                              String named)
```
    Description copied from interface: Pipeline
    
    A version of the read method for TableSource instances that map to PTables.
    
    Specified by:
    
    read in interface Pipeline
    
    Parameters:
    
    source - The source of the data
    
    named - A name for the returned PTable
    
    Returns:
    
    A PTable that references the given source
  - readTextFile
```
public PCollection<String> readTextFile(String pathName)
```
    Description copied from interface: Pipeline
    
    A convenience method for reading a text file.
    
    Specified by:
    
    readTextFile in interface Pipeline
  - write
```
public void write(PCollection<?> pcollection,
                  Target target)
```
    Description copied from interface: Pipeline
    
    Write the given collection to the given target on the next pipeline run. The system will check to see if the target's location already exists using the WriteMode.DEFAULT rule for the given Target.
    
    Specified by:
    
    write in interface Pipeline
    
    Parameters:
    
    pcollection - The collection
    
    target - The output target
  - write
```
public void write(PCollection<?> pcollection,
                  Target target,
                  Target.WriteMode writeMode)
```
    Description copied from interface: Pipeline
    
    Write the contents of the PCollection to the given Target, using the storage format specified by the target and the given WriteMode for cases where the referenced Target already exists.
    
    Specified by:
    
    write in interface Pipeline
    
    Parameters:
    
    pcollection - The collection
    
    target - The target to write to
    
    writeMode - The strategy to use for handling existing outputs
  - emptyPCollection
```
public <S> PCollection<S> emptyPCollection(PType<S> ptype)
```
    Description copied from interface: Pipeline
    
    Creates an empty PCollection of the given PType.
    
    Specified by:
    
    emptyPCollection in interface Pipeline
    
    Parameters:
    
    ptype - The PType of the empty PCollection
    
    Returns:
    
    A valid PCollection with no contents
  - emptyPTable
```
public <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
```
    Description copied from interface: Pipeline
    
    Creates an empty PTable of the given PTable Type.
    
    Specified by:
    
    emptyPTable in interface Pipeline
    
    Parameters:
    
    ptype - The PTableType of the empty PTable
    
    Returns:
    
    A valid PTable with no contents
  - create
```
public <S> PCollection<S> create(Iterable<S> contents,
                                 PType<S> ptype)
```
    Description copied from interface: Pipeline
    
    Creates a PCollection containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Specified by:
    
    create in interface Pipeline
    
    Parameters:
    
    contents - The values the new PCollection will contain
    
    ptype - The PType of the PCollection
    
    Returns:
    
    A PCollection that contains the given values
  - create
```
public <S> PCollection<S> create(Iterable<S> contents,
                                 PType<S> ptype,
                                 CreateOptions options)
```
    Description copied from interface: Pipeline
    
    Creates a PCollection containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Specified by:
    
    create in interface Pipeline
    
    Parameters:
    
    contents - The values the new PCollection will contain
    
    ptype - The PType of the PCollection
    
    options - Additional options, such as the name or desired parallelism of the PCollection
    
    Returns:
    
    A PCollection that contains the given values
  - create
```
public <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents,
                                PTableType<K,V> ptype)
```
    Description copied from interface: Pipeline
    
    Creates a PTable containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Specified by:
    
    create in interface Pipeline
    
    Parameters:
    
    contents - The values the new PTable will contain
    
    ptype - The PTableType of the PTable
    
    Returns:
    
    A PTable that contains the given values
  - create
```
public <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents,
                                PTableType<K,V> ptype,
                                CreateOptions options)
```
    Description copied from interface: Pipeline
    
    Creates a PTable containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Specified by:
    
    create in interface Pipeline
    
    Parameters:
    
    contents - The values the new PTable will contain
    
    ptype - The PTableType of the PTable
    
    options - Additional options, such as the name or desired parallelism of the PTable
    
    Returns:
    
    A PTable that contains the given values
  - getMaterializeSourceTarget
```
public <T> ReadableSource<T> getMaterializeSourceTarget(PCollection<T> pcollection)
```
    Retrieve a ReadableSourceTarget that provides access to the contents of a PCollection. This is primarily intended as a helper method to Pipeline.materialize(PCollection). The underlying data of the ReadableSourceTarget may not be actually present until the pipeline is run.
    
    Parameters:
    
    pcollection - The collection for which the ReadableSourceTarget is to be retrieved
    
    Returns:
    
    The ReadableSourceTarget
    
    Throws:
    
    IllegalArgumentException - If no ReadableSourceTarget can be retrieved for the given PCollection
  - createIntermediateOutput
```
public <T> SourceTarget<T> createIntermediateOutput(PType<T> ptype)
```
  - createTempPath
```
public org.apache.hadoop.fs.Path createTempPath()
```
  - writeTextFile
```
public <T> void writeTextFile(PCollection<T> pcollection,
                              String pathName)
```
    Description copied from interface: Pipeline
    
    A convenience method for writing a text file.
    
    Specified by:
    
    writeTextFile in interface Pipeline
  - cleanup
```
public void cleanup(boolean force)
```
    Description copied from interface: Pipeline
    
    Cleans up any artifacts created as a result of running the pipeline.
    
    Specified by:
    
    cleanup in interface Pipeline
    
    Parameters:
    
    force - forces the cleanup even if all targets of the pipeline have not been completed.
  - getNextAnonymousStageId
```
public int getNextAnonymousStageId()
```
  - enableDebug
```
public void enableDebug()
```
    Description copied from interface: Pipeline
    
    Turn on debug logging for jobs that are run from this pipeline.
    
    Specified by:
    
    enableDebug in interface Pipeline
  - getName
```
public String getName()
```
    Description copied from interface: Pipeline
    
    Returns the name of this pipeline.
    
    Specified by:
    
    getName in interface Pipeline
    
    Returns:
    
    Name of the pipeline

Class DistributedPipeline

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Methods inherited from interface org.apache.crunch.Pipeline

Constructor Detail

DistributedPipeline

Method Detail

getFactory

getConfiguration

setConfiguration

done

union

unionTables

sequentialDo

read

read

read

read

readTextFile

write

write

emptyPCollection

emptyPTable

create

create

create

create

getMaterializeSourceTarget

createIntermediateOutput

createTempPath

writeTextFile

cleanup

getNextAnonymousStageId

enableDebug

getName