public interface Pipeline
| Modifier and Type | Method and Description |
|---|---|
<T> void |
cache(PCollection<T> pcollection,
CachingOptions options)
Caches the given PCollection so that it will be processed at most once
during pipeline execution.
|
void |
cleanup(boolean force)
Cleans up any artifacts created as a result of
running the pipeline. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype,
CreateOptions options)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
create(Iterable<T> contents,
PType<T> ptype)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
create(Iterable<T> contents,
PType<T> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
PipelineResult |
done()
Run any remaining jobs required to generate outputs and then clean up any
intermediate data files that were created in this run or previous calls to
run. |
<T> PCollection<T> |
emptyPCollection(PType<T> ptype)
Creates an empty
PCollection of the given PType. |
<K,V> PTable<K,V> |
emptyPTable(PTableType<K,V> ptype)
Creates an empty
PTable of the given PTable Type. |
void |
enableDebug()
Turn on debug logging for jobs that are run from this pipeline.
|
org.apache.hadoop.conf.Configuration |
getConfiguration()
Returns the
Configuration instance associated with this pipeline. |
String |
getName()
Returns the name of this pipeline.
|
<T> Iterable<T> |
materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the
returned Collection instance for client use.
|
<T> PCollection<T> |
read(Source<T> source)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<T> PCollection<T> |
read(Source<T> source,
String named)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<K,V> PTable<K,V> |
read(TableSource<K,V> tableSource)
A version of the read method for
TableSource instances that map to
PTables. |
<K,V> PTable<K,V> |
read(TableSource<K,V> tableSource,
String named)
A version of the read method for
TableSource instances that map to
PTables. |
PCollection<String> |
readTextFile(String pathName)
A convenience method for reading a text file.
|
PipelineResult |
run()
Constructs and executes a series of MapReduce jobs in order to write data
to the output targets.
|
PipelineExecution |
runAsync()
Constructs and starts a series of MapReduce jobs in order ot write data to
the output targets, but returns a
ListenableFuture to allow clients to control
job execution. |
<Output> Output |
sequentialDo(PipelineCallable<Output> pipelineCallable)
Executes the given
PipelineCallable on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps. |
void |
setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the
Configuration to use with this pipeline. |
<S> PCollection<S> |
union(List<PCollection<S>> collections) |
<K,V> PTable<K,V> |
unionTables(List<PTable<K,V>> tables) |
void |
write(PCollection<?> collection,
Target target)
Write the given collection to the given target on the next pipeline run.
|
void |
write(PCollection<?> collection,
Target target,
Target.WriteMode writeMode)
Write the contents of the
PCollection to the given Target,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists. |
<T> void |
writeTextFile(PCollection<T> collection,
String pathName)
A convenience method for writing a text file.
|
void setConfiguration(org.apache.hadoop.conf.Configuration conf)
Configuration to use with this pipeline.String getName()
org.apache.hadoop.conf.Configuration getConfiguration()
Configuration instance associated with this pipeline.<T> PCollection<T> read(Source<T> source)
Source into a PCollection that is
available to jobs run using this Pipeline instance.source - The source of data<T> PCollection<T> read(Source<T> source, String named)
Source into a PCollection that is
available to jobs run using this Pipeline instance.source - The source of datanamed - A name for the returned PCollection<K,V> PTable<K,V> read(TableSource<K,V> tableSource)
TableSource instances that map to
PTables.tableSource - The source of the data<K,V> PTable<K,V> read(TableSource<K,V> tableSource, String named)
TableSource instances that map to
PTables.tableSource - The source of the datanamed - A name for the returned PTablevoid write(PCollection<?> collection, Target target)
WriteMode.DEFAULT rule for the given Target.collection - The collectiontarget - The output targetvoid write(PCollection<?> collection, Target target, Target.WriteMode writeMode)
PCollection to the given Target,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists.collection - The collectiontarget - The target to write towriteMode - The strategy to use for handling existing outputs<T> Iterable<T> materialize(PCollection<T> pcollection)
pcollection - The PCollection to materialize<T> void cache(PCollection<T> pcollection, CachingOptions options)
pcollection - The PCollection to cacheoptions - The options for how the cached data is stored<T> PCollection<T> emptyPCollection(PType<T> ptype)
PCollection of the given PType.ptype - The PType of the empty PCollection<K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
PTable of the given PTable Type.ptype - The PTableType of the empty PTable<T> PCollection<T> create(Iterable<T> contents, PType<T> ptype)
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents - The values the new PCollection will containptype - The PType of the PCollection<T> PCollection<T> create(Iterable<T> contents, PType<T> ptype, CreateOptions options)
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents - The values the new PCollection will containptype - The PType of the PCollectionoptions - Additional options, such as the name or desired parallelism of the PCollection<K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype)
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents - The values the new PTable will containptype - The PTableType of the PTable<K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents - The values the new PTable will containptype - The PTableType of the PTableoptions - Additional options, such as the name or desired parallelism of the PTable<S> PCollection<S> union(List<PCollection<S>> collections)
<Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)
PipelineCallable on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps.Output - The return type of the PipelineCallablepipelineCallable - The sequential logic to executePipelineResult run()
PipelineExecution runAsync()
ListenableFuture to allow clients to control
job execution.PipelineResult done()
run.void cleanup(boolean force)
running the pipeline.force - forces the cleanup even if all targets of the pipeline have not been completed.PCollection<String> readTextFile(String pathName)
<T> void writeTextFile(PCollection<T> collection, String pathName)
void enableDebug()
Copyright © 2015 The Apache Software Foundation. All Rights Reserved.