public interface Pipeline
Modifier and Type | Method and Description |
---|---|
<T> void |
cache(PCollection<T> pcollection,
CachingOptions options)
Caches the given PCollection so that it will be processed at most once
during pipeline execution.
|
void |
cleanup(boolean force)
Cleans up any artifacts created as a result of
running the pipeline. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype,
CreateOptions options)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
create(Iterable<T> contents,
PType<T> ptype)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<T> PCollection<T> |
create(Iterable<T> contents,
PType<T> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
PipelineResult |
done()
Run any remaining jobs required to generate outputs and then clean up any
intermediate data files that were created in this run or previous calls to
run . |
<T> PCollection<T> |
emptyPCollection(PType<T> ptype)
Creates an empty
PCollection of the given PType . |
<K,V> PTable<K,V> |
emptyPTable(PTableType<K,V> ptype)
Creates an empty
PTable of the given PTable Type . |
void |
enableDebug()
Turn on debug logging for jobs that are run from this pipeline.
|
org.apache.hadoop.conf.Configuration |
getConfiguration()
Returns the
Configuration instance associated with this pipeline. |
String |
getName()
Returns the name of this pipeline.
|
<T> Iterable<T> |
materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the
returned Collection instance for client use.
|
<T> PCollection<T> |
read(Source<T> source)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<T> PCollection<T> |
read(Source<T> source,
String named)
Converts the given
Source into a PCollection that is
available to jobs run using this Pipeline instance. |
<K,V> PTable<K,V> |
read(TableSource<K,V> tableSource)
A version of the read method for
TableSource instances that map to
PTable s. |
<K,V> PTable<K,V> |
read(TableSource<K,V> tableSource,
String named)
A version of the read method for
TableSource instances that map to
PTable s. |
PCollection<String> |
readTextFile(String pathName)
A convenience method for reading a text file.
|
PipelineResult |
run()
Constructs and executes a series of MapReduce jobs in order to write data
to the output targets.
|
PipelineExecution |
runAsync()
Constructs and starts a series of MapReduce jobs in order ot write data to
the output targets, but returns a
ListenableFuture to allow clients to control
job execution. |
<Output> Output |
sequentialDo(PipelineCallable<Output> pipelineCallable)
Executes the given
PipelineCallable on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps. |
void |
setConfiguration(org.apache.hadoop.conf.Configuration conf)
Set the
Configuration to use with this pipeline. |
<S> PCollection<S> |
union(List<PCollection<S>> collections) |
<K,V> PTable<K,V> |
unionTables(List<PTable<K,V>> tables) |
void |
write(PCollection<?> collection,
Target target)
Write the given collection to the given target on the next pipeline run.
|
void |
write(PCollection<?> collection,
Target target,
Target.WriteMode writeMode)
Write the contents of the
PCollection to the given Target ,
using the storage format specified by the target and the given
WriteMode for cases where the referenced Target
already exists. |
<T> void |
writeTextFile(PCollection<T> collection,
String pathName)
A convenience method for writing a text file.
|
void setConfiguration(org.apache.hadoop.conf.Configuration conf)
Configuration
to use with this pipeline.String getName()
org.apache.hadoop.conf.Configuration getConfiguration()
Configuration
instance associated with this pipeline.<T> PCollection<T> read(Source<T> source)
Source
into a PCollection
that is
available to jobs run using this Pipeline
instance.source
- The source of data<T> PCollection<T> read(Source<T> source, String named)
Source
into a PCollection
that is
available to jobs run using this Pipeline
instance.source
- The source of datanamed
- A name for the returned PCollection<K,V> PTable<K,V> read(TableSource<K,V> tableSource)
TableSource
instances that map to
PTable
s.tableSource
- The source of the data<K,V> PTable<K,V> read(TableSource<K,V> tableSource, String named)
TableSource
instances that map to
PTable
s.tableSource
- The source of the datanamed
- A name for the returned PTablevoid write(PCollection<?> collection, Target target)
WriteMode.DEFAULT
rule for the given Target
.collection
- The collectiontarget
- The output targetvoid write(PCollection<?> collection, Target target, Target.WriteMode writeMode)
PCollection
to the given Target
,
using the storage format specified by the target and the given
WriteMode
for cases where the referenced Target
already exists.collection
- The collectiontarget
- The target to write towriteMode
- The strategy to use for handling existing outputs<T> Iterable<T> materialize(PCollection<T> pcollection)
pcollection
- The PCollection to materialize<T> void cache(PCollection<T> pcollection, CachingOptions options)
pcollection
- The PCollection to cacheoptions
- The options for how the cached data is stored<T> PCollection<T> emptyPCollection(PType<T> ptype)
PCollection
of the given PType
.ptype
- The PType of the empty PCollection<K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
PTable
of the given PTable Type
.ptype
- The PTableType of the empty PTable<T> PCollection<T> create(Iterable<T> contents, PType<T> ptype)
PCollection
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents
- The values the new PCollection will containptype
- The PType of the PCollection<T> PCollection<T> create(Iterable<T> contents, PType<T> ptype, CreateOptions options)
PCollection
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents
- The values the new PCollection will containptype
- The PType of the PCollectionoptions
- Additional options, such as the name or desired parallelism of the PCollection<K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype)
PTable
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents
- The values the new PTable will containptype
- The PTableType of the PTable<K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)
PTable
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.contents
- The values the new PTable will containptype
- The PTableType of the PTableoptions
- Additional options, such as the name or desired parallelism of the PTable<S> PCollection<S> union(List<PCollection<S>> collections)
<Output> Output sequentialDo(PipelineCallable<Output> pipelineCallable)
PipelineCallable
on the client after the Targets
that the PipelineCallable depends on (if any) have been created by other pipeline
processing steps.Output
- The return type of the PipelineCallablepipelineCallable
- The sequential logic to executePipelineResult run()
PipelineExecution runAsync()
ListenableFuture
to allow clients to control
job execution.PipelineResult done()
run
.void cleanup(boolean force)
running
the pipeline.force
- forces the cleanup even if all targets of the pipeline have not been completed.PCollection<String> readTextFile(String pathName)
<T> void writeTextFile(PCollection<T> collection, String pathName)
void enableDebug()
Copyright © 2016 The Apache Software Foundation. All rights reserved.