public abstract class DistributedPipeline extends Object implements Pipeline
Constructor and Description |
---|
DistributedPipeline(String name,
org.apache.hadoop.conf.Configuration conf,
PCollectionFactory factory)
Instantiate with a custom name and configuration.
|
Modifier and Type | Method and Description |
---|---|
void |
cleanup(boolean force)
Cleans up any artifacts created as a result of
running the pipeline. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype,
CreateOptions options)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<S> PCollection<S> |
create(Iterable<S> contents,
PType<S> ptype)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<S> PCollection<S> |
create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> SourceTarget<T> |
createIntermediateOutput(PType<T> ptype) |
org.apache.hadoop.fs.Path |
createTempPath() |
PipelineResult |
done()
Run any remaining jobs required to generate outputs and then clean up any
intermediate data files that were created in this run or previous calls to
run . |
<S> PCollection<S> |
emptyPCollection(PType<S> ptype)
Creates an empty
PCollection of the given PType . |
<K,V> PTable<K,V> |
emptyPTable(PTableType<K,V> ptype)
Creates an empty
PTable of the given PTable Type . |
void |
enableDebug()
Turn on debug logging for jobs that are run from this pipeline.
|
org.apache.hadoop.conf.Configuration |
getConfiguration()
Returns the
Configuration instance associated with this pipeline. |
PCollectionFactory |
getFactory() |
<T> ReadableSource<T> |
getMaterializeSourceTarget(PCollection<T> pcollection)
Retrieve a ReadableSourceTarget that provides access to the contents of a
PCollection . |
String |
getName()
Returns the name of this pipeline.
|
int |
getNextAnonymousStageId() |
<S> PCollection<S> |
read(Source<S> source)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<S> PCollection<S> |
read(Source<S> source,
String named)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<K,V> PTable<K,V> |
read(TableSource<K,V> source)
A version of the read method for
TableSource instances that map to
PTable s. |
<K,V> PTable<K,V> |
read(TableSource<K,V> source,
String named)
A version of the read method for
TableSource instances that map to
PTable s. |
PCollection<String> |
readTextFile(String pathName)
A convenience method for reading a text file.
|
<Output> Output |
sequentialDo(PipelineCallable<Output> pipelineCallable)
Executes the given
PipelineCallable on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps. |
void |
setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the
Configuration to use with this pipeline. |
<S> PCollection<S> |
union(List<PCollection<S>> collections) |
<K,V> PTable<K,V> |
unionTables(List<PTable<K,V>> tables) |
void |
write(PCollection<?> pcollection,
Target target)
Write the given collection to the given target on the next pipeline run.
|
void |
write(PCollection<?> pcollection,
Target target,
Target.WriteMode writeMode)
Write the contents of the
PCollection to the given Target ,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists. |
<T> void |
writeTextFile(PCollection<T> pcollection,
String pathName)
A convenience method for writing a text file.
|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
cache, materialize, run, runAsync
public DistributedPipeline(String name, org.apache.hadoop.conf.Configuration conf, PCollectionFactory factory)
name
- Display name of the pipelineconf
- Configuration to be used within all MapReduce jobs run in the pipelinepublic PCollectionFactory getFactory()
public org.apache.hadoop.conf.Configuration getConfiguration()
Pipeline
Configuration
instance associated with this pipeline.getConfiguration
in interface Pipeline
public void setConfiguration(org.apache.hadoop.conf.Configuration conf)
Pipeline
Configuration
to use with this pipeline.setConfiguration
in interface Pipeline
public PipelineResult done()
Pipeline
run
.public <S> PCollection<S> union(List<PCollection<S>> collections)
public <K,V> PTable<K,V> unionTables(List<PTable<K,V>> tables)
unionTables
in interface Pipeline
public <Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)
Pipeline
PipelineCallable
on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps.sequentialDo
in interface Pipeline
Output
- The return type of the PipelineCallablepipelineCallable
- The sequential logic to executepublic <S> PCollection<S> read(Source<S> source)
Pipeline
Source
into a PCollection
that is
available to jobs run using this Pipeline
instance.public <S> PCollection<S> read(Source<S> source, String named)
Pipeline
Source
into a PCollection
that is
available to jobs run using this Pipeline
instance.public <K,V> PTable<K,V> read(TableSource<K,V> source)
Pipeline
TableSource
instances that map to
PTable
s.public <K,V> PTable<K,V> read(TableSource<K,V> source, String named)
Pipeline
TableSource
instances that map to
PTable
s.public PCollection<String> readTextFile(String pathName)
Pipeline
readTextFile
in interface Pipeline
public void write(PCollection<?> pcollection, Target target)
Pipeline
WriteMode.DEFAULT
rule for the given Target
.public void write(PCollection<?> pcollection, Target target, Target.WriteMode writeMode)
Pipeline
PCollection
to the given Target
,
using the storage format specified by the target and the given
WriteMode
for cases where the referenced Target
already exists.public <S> PCollection<S> emptyPCollection(PType<S> ptype)
Pipeline
PCollection
of the given PType
.emptyPCollection
in interface Pipeline
ptype
- The PType of the empty PCollectionpublic <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
Pipeline
PTable
of the given PTable Type
.emptyPTable
in interface Pipeline
ptype
- The PTableType of the empty PTablepublic <S> PCollection<S> create(Iterable<S> contents, PType<S> ptype)
Pipeline
PCollection
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <S> PCollection<S> create(Iterable<S> contents, PType<S> ptype, CreateOptions options)
Pipeline
PCollection
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype)
Pipeline
PTable
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)
Pipeline
PTable
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.public <T> ReadableSource<T> getMaterializeSourceTarget(PCollection<T> pcollection)
PCollection
.
This is primarily intended as a helper method to Pipeline.materialize(PCollection)
. The
underlying data of the ReadableSourceTarget may not be actually present until the pipeline is
run.pcollection
- The collection for which the ReadableSourceTarget is to be retrievedIllegalArgumentException
- If no ReadableSourceTarget can be retrieved for the given
PCollectionpublic <T> SourceTarget<T> createIntermediateOutput(PType<T> ptype)
public org.apache.hadoop.fs.Path createTempPath()
public <T> void writeTextFile(PCollection<T> pcollection, String pathName)
Pipeline
writeTextFile
in interface Pipeline
public void cleanup(boolean force)
Pipeline
running
the pipeline.public int getNextAnonymousStageId()
public void enableDebug()
Pipeline
enableDebug
in interface Pipeline
Copyright © 2016 The Apache Software Foundation. All rights reserved.