public class SparkPipeline extends DistributedPipeline
Constructor and Description |
---|
SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
String appName) |
SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
String appName,
Class<?> jarClass,
org.apache.hadoop.conf.Configuration conf) |
SparkPipeline(String sparkConnect,
String appName) |
SparkPipeline(String sparkConnect,
String appName,
Class<?> jarClass) |
SparkPipeline(String sparkConnect,
String appName,
Class<?> jarClass,
org.apache.hadoop.conf.Configuration conf) |
Modifier and Type | Method and Description |
---|---|
<T> void |
cache(PCollection<T> pcollection,
CachingOptions options)
Caches the given PCollection so that it will be processed at most once
during pipeline execution.
|
<K,V> PTable<K,V> |
create(Iterable<Pair<K,V>> contents,
PTableType<K,V> ptype,
CreateOptions options)
Creates a
PTable containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
<S> PCollection<S> |
create(Iterable<S> contents,
PType<S> ptype,
CreateOptions options)
Creates a
PCollection containing the values found in the given Iterable
using an implementation-specific distribution mechanism. |
PipelineResult |
done()
Run any remaining jobs required to generate outputs and then clean up any
intermediate data files that were created in this run or previous calls to
run . |
<S> PCollection<S> |
emptyPCollection(PType<S> ptype)
Creates an empty
PCollection of the given PType . |
<K,V> PTable<K,V> |
emptyPTable(PTableType<K,V> ptype)
Creates an empty
PTable of the given PTable Type . |
<T> Iterable<T> |
materialize(PCollection<T> pcollection)
Create the given PCollection and read the data it contains into the
returned Collection instance for client use.
|
PipelineResult |
run()
Constructs and executes a series of MapReduce jobs in order to write data
to the output targets.
|
PipelineExecution |
runAsync()
Constructs and starts a series of MapReduce jobs in order ot write data to
the output targets, but returns a
ListenableFuture to allow clients to control
job execution. |
cleanup, create, create, createIntermediateOutput, createTempPath, enableDebug, getConfiguration, getFactory, getMaterializeSourceTarget, getName, getNextAnonymousStageId, read, read, read, read, readTextFile, sequentialDo, setConfiguration, union, unionTables, write, write, writeTextFile
public SparkPipeline(String sparkConnect, String appName, Class<?> jarClass, org.apache.hadoop.conf.Configuration conf)
public SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext, String appName)
public <T> Iterable<T> materialize(PCollection<T> pcollection)
Pipeline
pcollection
- The PCollection to materializepublic <S> PCollection<S> emptyPCollection(PType<S> ptype)
Pipeline
PCollection
of the given PType
.emptyPCollection
in interface Pipeline
emptyPCollection
in class DistributedPipeline
ptype
- The PType of the empty PCollectionpublic <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
Pipeline
PTable
of the given PTable Type
.emptyPTable
in interface Pipeline
emptyPTable
in class DistributedPipeline
ptype
- The PTableType of the empty PTablepublic <S> PCollection<S> create(Iterable<S> contents, PType<S> ptype, CreateOptions options)
Pipeline
PCollection
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.create
in interface Pipeline
create
in class DistributedPipeline
contents
- The values the new PCollection will containptype
- The PType of the PCollectionoptions
- Additional options, such as the name or desired parallelism of the PCollectionpublic <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)
Pipeline
PTable
containing the values found in the given Iterable
using an implementation-specific distribution mechanism.create
in interface Pipeline
create
in class DistributedPipeline
contents
- The values the new PTable will containptype
- The PTableType of the PTableoptions
- Additional options, such as the name or desired parallelism of the PTablepublic <T> void cache(PCollection<T> pcollection, CachingOptions options)
Pipeline
pcollection
- The PCollection to cacheoptions
- The options for how the cached data is storedpublic PipelineResult run()
Pipeline
public PipelineExecution runAsync()
Pipeline
ListenableFuture
to allow clients to control
job execution.public PipelineResult done()
Pipeline
run
.done
in interface Pipeline
done
in class DistributedPipeline
Copyright © 2016 The Apache Software Foundation. All rights reserved.