public class SparkPipeline extends DistributedPipeline
| Constructor and Description |
|---|
SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
String appName) |
SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
String appName,
Class<?> jarClass,
org.apache.hadoop.conf.Configuration conf) |
SparkPipeline(String sparkConnect,
String appName) |
SparkPipeline(String sparkConnect,
String appName,
Class<?> jarClass) |
SparkPipeline(String sparkConnect,
String appName,
Class<?> jarClass,
org.apache.hadoop.conf.Configuration conf) |
| Modifier and Type | Method and Description |
|---|---|
<T> void |
cache(PCollection<T> pcollection,
CachingOptions options)
Caches the given PCollection so that it will be processed at most once
during pipeline execution.
|
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype,
CreateOptions options)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<S> PCollection<S> |
create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
PipelineResult |
done()
Run any remaining jobs required to generate outputs and then clean up any
intermediate data files that were created in this run or previous calls to
run. |
<S> PCollection<S> |
emptyPCollection(PType<S> ptype)
Creates an empty
PCollection of the given PType. |
<K,V> PTable<K,V> |
emptyPTable(PTableType<K,V> ptype)
Creates an empty
PTable of the given PTable Type. |
<T> Iterable<T> |
materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the
returned Collection instance for client use.
|
PipelineResult |
run()
Constructs and executes a series of MapReduce jobs in order to write data
to the output targets.
|
PipelineExecution |
runAsync()
Constructs and starts a series of MapReduce jobs in order ot write data to
the output targets, but returns a
ListenableFuture to allow clients to control
job execution. |
cleanup, create, create, createIntermediateOutput, createTempPath, enableDebug, getConfiguration, getFactory, getMaterializeSourceTarget, getName, getNextAnonymousStageId, read, read, read, read, readTextFile, sequentialDo, setConfiguration, union, unionTables, write, write, writeTextFilepublic SparkPipeline(String sparkConnect, String appName, Class<?> jarClass, org.apache.hadoop.conf.Configuration conf)
public SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
String appName)
public <T> Iterable<T> materialize(PCollection<T> pcollection)
Pipelinepcollection - The PCollection to materializepublic <S> PCollection<S> emptyPCollection(PType<S> ptype)
PipelinePCollection of the given PType.emptyPCollection in interface PipelineemptyPCollection in class DistributedPipelineptype - The PType of the empty PCollectionpublic <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
PipelinePTable of the given PTable Type.emptyPTable in interface PipelineemptyPTable in class DistributedPipelineptype - The PTableType of the empty PTablepublic <S> PCollection<S> create(Iterable<S> contents, PType<S> ptype, CreateOptions options)
PipelinePCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism.create in interface Pipelinecreate in class DistributedPipelinecontents - The values the new PCollection will containptype - The PType of the PCollectionoptions - Additional options, such as the name or desired parallelism of the PCollectionpublic <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)
PipelinePTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism.create in interface Pipelinecreate in class DistributedPipelinecontents - The values the new PTable will containptype - The PTableType of the PTableoptions - Additional options, such as the name or desired parallelism of the PTablepublic <T> void cache(PCollection<T> pcollection, CachingOptions options)
Pipelinepcollection - The PCollection to cacheoptions - The options for how the cached data is storedpublic PipelineResult run()
Pipelinepublic PipelineExecution runAsync()
PipelineListenableFuture to allow clients to control
job execution.public PipelineResult done()
Pipelinerun.done in interface Pipelinedone in class DistributedPipelineCopyright © 2015 The Apache Software Foundation. All Rights Reserved.