public abstract class DistributedPipeline extends Object implements Pipeline
| Constructor and Description |
|---|
DistributedPipeline(String name,
org.apache.hadoop.conf.Configuration conf,
PCollectionFactory factory)
Instantiate with a custom name and configuration.
|
| Modifier and Type | Method and Description |
|---|---|
void |
cleanup(boolean force)
Cleans up any artifacts created as a result of
running the pipeline. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype,
CreateOptions options)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<S> PCollection<S> |
create(Iterable<S> contents,
PType<S> ptype)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<S> PCollection<S> |
create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> SourceTarget<T> |
createIntermediateOutput(PType<T> ptype) |
org.apache.hadoop.fs.Path |
createTempPath() |
PipelineResult |
done()
Run any remaining jobs required to generate outputs and then clean up any
intermediate data files that were created in this run or previous calls to
run. |
<S> PCollection<S> |
emptyPCollection(PType<S> ptype)
Creates an empty
PCollection of the given PType. |
<K,V> PTable<K,V> |
emptyPTable(PTableType<K,V> ptype)
Creates an empty
PTable of the given PTable Type. |
void |
enableDebug()
Turn on debug logging for jobs that are run from this pipeline.
|
org.apache.hadoop.conf.Configuration |
getConfiguration()
Returns the
Configuration instance associated with this pipeline. |
PCollectionFactory |
getFactory() |
<T> ReadableSource<T> |
getMaterializeSourceTarget(PCollection<T> pcollection)
Retrieve a ReadableSourceTarget that provides access to the contents of a
PCollection. |
String |
getName()
Returns the name of this pipeline.
|
int |
getNextAnonymousStageId() |
<S> PCollection<S> |
read(Source<S> source)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<S> PCollection<S> |
read(Source<S> source,
String named)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<K,V> PTable<K,V> |
read(TableSource<K,V> source)
A version of the read method for
TableSource instances that map to
PTables. |
<K,V> PTable<K,V> |
read(TableSource<K,V> source,
String named)
A version of the read method for
TableSource instances that map to
PTables. |
PCollection<String> |
readTextFile(String pathName)
A convenience method for reading a text file.
|
<Output> Output |
sequentialDo(PipelineCallable<Output> pipelineCallable)
Executes the given
PipelineCallable on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps. |
void |
setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the
Configuration to use with this pipeline. |
<S> PCollection<S> |
union(List<PCollection<S>> collections) |
<K,V> PTable<K,V> |
unionTables(List<PTable<K,V>> tables) |
void |
write(PCollection<?> pcollection,
Target target)
Write the given collection to the given target on the next pipeline run.
|
void |
write(PCollection<?> pcollection,
Target target,
Target.WriteMode writeMode)
Write the contents of the
PCollection to the given Target,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists. |
<T> void |
writeTextFile(PCollection<T> pcollection,
String pathName)
A convenience method for writing a text file.
|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, waitcache, materialize, run, runAsyncpublic DistributedPipeline(String name, org.apache.hadoop.conf.Configuration conf, PCollectionFactory factory)
name - Display name of the pipelineconf - Configuration to be used within all MapReduce jobs run in the pipelinepublic PCollectionFactory getFactory()
public org.apache.hadoop.conf.Configuration getConfiguration()
PipelineConfiguration instance associated with this pipeline.getConfiguration in interface Pipelinepublic void setConfiguration(org.apache.hadoop.conf.Configuration conf)
PipelineConfiguration to use with this pipeline.setConfiguration in interface Pipelinepublic PipelineResult done()
Pipelinerun.public <S> PCollection<S> union(List<PCollection<S>> collections)
public <K,V> PTable<K,V> unionTables(List<PTable<K,V>> tables)
unionTables in interface Pipelinepublic <Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)
PipelinePipelineCallable on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps.sequentialDo in interface PipelineOutput - The return type of the PipelineCallablepipelineCallable - The sequential logic to executepublic <S> PCollection<S> read(Source<S> source)
PipelineSource into a PCollection that is
available to jobs run using this Pipeline instance.public <S> PCollection<S> read(Source<S> source, String named)
PipelineSource into a PCollection that is
available to jobs run using this Pipeline instance.public <K,V> PTable<K,V> read(TableSource<K,V> source)
PipelineTableSource instances that map to
PTables.public <K,V> PTable<K,V> read(TableSource<K,V> source, String named)
PipelineTableSource instances that map to
PTables.public PCollection<String> readTextFile(String pathName)
PipelinereadTextFile in interface Pipelinepublic void write(PCollection<?> pcollection, Target target)
PipelineWriteMode.DEFAULT rule for the given Target.public void write(PCollection<?> pcollection, Target target, Target.WriteMode writeMode)
PipelinePCollection to the given Target,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists.public <S> PCollection<S> emptyPCollection(PType<S> ptype)
PipelinePCollection of the given PType.emptyPCollection in interface Pipelineptype - The PType of the empty PCollectionpublic <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
PipelinePTable of the given PTable Type.emptyPTable in interface Pipelineptype - The PTableType of the empty PTablepublic <S> PCollection<S> create(Iterable<S> contents, PType<S> ptype)
PipelinePCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <S> PCollection<S> create(Iterable<S> contents, PType<S> ptype, CreateOptions options)
PipelinePCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype)
PipelinePTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)
PipelinePTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <T> ReadableSource<T> getMaterializeSourceTarget(PCollection<T> pcollection)
PCollection.
This is primarily intended as a helper method to Pipeline.materialize(PCollection). The
underlying data of the ReadableSourceTarget may not be actually present until the pipeline is
run.pcollection - The collection for which the ReadableSourceTarget is to be retrievedIllegalArgumentException - If no ReadableSourceTarget can be retrieved for the given
PCollectionpublic <T> SourceTarget<T> createIntermediateOutput(PType<T> ptype)
public org.apache.hadoop.fs.Path createTempPath()
public <T> void writeTextFile(PCollection<T> pcollection, String pathName)
PipelinewriteTextFile in interface Pipelinepublic void cleanup(boolean force)
Pipelinerunning the pipeline.public int getNextAnonymousStageId()
public void enableDebug()
PipelineenableDebug in interface PipelineCopyright © 2015 The Apache Software Foundation. All Rights Reserved.