SparkPipeline (Apache Crunch 0.14.0 API)

java.lang.Object
- org.apache.crunch.impl.dist.DistributedPipeline
- - org.apache.crunch.impl.spark.SparkPipeline

All Implemented Interfaces: Pipeline

public class SparkPipeline
extends DistributedPipeline

Constructor Summary

Constructors
Constructor and Description
`SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext, String appName)`
`SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext, String appName, Class<?> jarClass, org.apache.hadoop.conf.Configuration conf)`
`SparkPipeline(String sparkConnect, String appName)`
`SparkPipeline(String sparkConnect, String appName, Class<?> jarClass)`
`SparkPipeline(String sparkConnect, String appName, Class<?> jarClass, org.apache.hadoop.conf.Configuration conf)`

Method Summary

Modifier and Type	Method and Description
`<T> void`	`cache(PCollection<T> pcollection, CachingOptions options)` Caches the given PCollection so that it will be processed at most once during pipeline execution.
`<K,V> PTable<K,V>`	`create(Iterable<Pair<K,V>> contents, PTableType<K,V> ptype, CreateOptions options)` Creates a `PTable` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`<S> PCollection<S>`	`create(Iterable<S> contents, PType<S> ptype, CreateOptions options)` Creates a `PCollection` containing the values found in the given `Iterable` using an implementation-specific distribution mechanism.
`PipelineResult`	`done()` Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to `run`.
`<S> PCollection<S>`	`emptyPCollection(PType<S> ptype)` Creates an empty `PCollection` of the given `PType`.
`<K,V> PTable<K,V>`	`emptyPTable(PTableType<K,V> ptype)` Creates an empty `PTable` of the given `PTableType`.
`<T> Iterable<T>`	`materialize(PCollection<T> pcollection)` Create the given PCollection and read the data it contains into the returned Collection instance for client use.
`PipelineResult`	`run()` Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
`PipelineExecution`	`runAsync()` Constructs and starts a series of MapReduce jobs in order to write data to the output targets, but returns a `ListenableFuture` to allow clients to control job execution.

Methods inherited from class org.apache.crunch.impl.dist.DistributedPipeline
cleanup, create, create, createIntermediateOutput, createTempPath, enableDebug, getConfiguration, getFactory, getMaterializeSourceTarget, getName, getNextAnonymousStageId, read, read, read, read, readTextFile, sequentialDo, setConfiguration, union, unionTables, write, write, writeTextFile

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - SparkPipeline
```
public SparkPipeline(String sparkConnect,
                     String appName)
```
  - SparkPipeline
```
public SparkPipeline(String sparkConnect,
                     String appName,
                     Class<?> jarClass)
```
  - SparkPipeline
```
public SparkPipeline(String sparkConnect,
                     String appName,
                     Class<?> jarClass,
                     org.apache.hadoop.conf.Configuration conf)
```
  - SparkPipeline
```
public SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
                     String appName)
```
  - SparkPipeline
```
public SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
                     String appName,
                     Class<?> jarClass,
                     org.apache.hadoop.conf.Configuration conf)
```
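
The two constructor families above can be sketched as follows. This is a minimal sketch, not a definitive setup: it assumes `sparkConnect` is a Spark master URL (here `local[2]`), and the class name `SparkPipelineConstruction` and the application name `crunch-sketch` are hypothetical.

```java
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkPipelineConstruction {

    // Assumption: sparkConnect is a Spark master URL such as "local[2]".
    static Pipeline fromMaster() {
        return new SparkPipeline("local[2]", "crunch-sketch");
    }

    // Reuses a JavaSparkContext that the caller has already created,
    // passing an explicit jar class and Hadoop Configuration.
    static Pipeline fromContext(JavaSparkContext sparkContext) {
        return new SparkPipeline(
            sparkContext,
            "crunch-sketch",
            SparkPipelineConstruction.class,
            new Configuration());
    }
}
```

The `String` variants are convenient for standalone programs, while the `JavaSparkContext` variants fit applications that already manage their own Spark context.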
- Method Detail
  - materialize
```
public <T> Iterable<T> materialize(PCollection<T> pcollection)
```
    Description copied from interface: Pipeline
    
    Create the given PCollection and read the data it contains into the returned Collection instance for client use.
    
    Parameters:
    
    pcollection - The PCollection to materialize
    
    Returns:
    
    the data from the PCollection as a read-only Collection
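
As an illustration of `materialize`, the sketch below builds a small in-memory `PCollection`, transforms it, and copies the materialized values into a local list. It assumes a local Spark master (`"local"`); the class `MaterializeSketch` and the function `ToUpper` are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

public class MaterializeSketch {

    // A static nested class keeps the function serializable on its own,
    // without capturing an enclosing instance.
    static class ToUpper extends MapFn<String, String> {
        @Override
        public String map(String input) {
            return input.toUpperCase();
        }
    }

    static List<String> upperCase(List<String> words) {
        // Assumption: "local" runs Spark in-process.
        Pipeline pipeline = new SparkPipeline("local", "materialize-sketch");
        try {
            PCollection<String> input = pipeline.create(words, Writables.strings());
            PCollection<String> upper = input.parallelDo(new ToUpper(), Writables.strings());

            // Copy the values out before calling done(); the returned Iterable
            // is read-only and backed by the pipeline's output.
            List<String> out = new ArrayList<>();
            for (String value : pipeline.materialize(upper)) {
                out.add(value);
            }
            Collections.sort(out); // element order is not assumed here
            return out;
        } finally {
            pipeline.done();
        }
    }
}
```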
  - emptyPCollection
```
public <S> PCollection<S> emptyPCollection(PType<S> ptype)
```
    Description copied from interface: Pipeline
    
    Creates an empty PCollection of the given PType.
    
    Specified by:
    
    emptyPCollection in interface Pipeline
    
    Overrides:
    
    emptyPCollection in class DistributedPipeline
    
    Parameters:
    
    ptype - The PType of the empty PCollection
    
    Returns:
    
    A valid PCollection with no contents
  - emptyPTable
```
public <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)
```
    Description copied from interface: Pipeline
    
    Creates an empty PTable of the given PTableType.
    
    Specified by:
    
    emptyPTable in interface Pipeline
    
    Overrides:
    
    emptyPTable in class DistributedPipeline
    
    Parameters:
    
    ptype - The PTableType of the empty PTable
    
    Returns:
    
    A valid PTable with no contents
  - create
```
public <S> PCollection<S> create(Iterable<S> contents,
                                 PType<S> ptype,
                                 CreateOptions options)
```
    Description copied from interface: Pipeline
    
    Creates a PCollection containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Specified by:
    
    create in interface Pipeline
    
    Overrides:
    
    create in class DistributedPipeline
    
    Parameters:
    
    contents - The values the new PCollection will contain
    
    ptype - The PType of the PCollection
    
    options - Additional options, such as the name or desired parallelism of the PCollection
    
    Returns:
    
    A PCollection that contains the given values
  - create
```
public <K,V> PTable<K,V> create(Iterable<Pair<K,V>> contents,
                                PTableType<K,V> ptype,
                                CreateOptions options)
```
    Description copied from interface: Pipeline
    
    Creates a PTable containing the values found in the given Iterable using an implementation-specific distribution mechanism.
    
    Specified by:
    
    create in interface Pipeline
    
    Overrides:
    
    create in class DistributedPipeline
    
    Parameters:
    
    contents - The values the new PTable will contain
    
    ptype - The PTableType of the PTable
    
    options - Additional options, such as the name or desired parallelism of the PTable
    
    Returns:
    
    A PTable that contains the given values
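
The two `create` overloads differ only in whether they build a `PCollection` or a `PTable`. The sketch below exercises the `PTable` variant with a name and a parallelism hint passed through `CreateOptions`. It assumes a local Spark master; the class `CreateTableSketch` and the collection name `seed` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.crunch.CreateOptions;
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

public class CreateTableSketch {

    static Map<String, Integer> roundTrip() {
        // Assumption: "local" runs Spark in-process.
        Pipeline pipeline = new SparkPipeline("local", "create-sketch");
        try {
            List<Pair<String, Integer>> rows =
                Arrays.asList(Pair.of("a", 1), Pair.of("b", 2));

            // Name the created table and request a parallelism of 2.
            PTable<String, Integer> table = pipeline.create(
                rows,
                Writables.tableOf(Writables.strings(), Writables.ints()),
                CreateOptions.nameAndParallelism("seed", 2));

            // Read the table back into a sorted map for inspection.
            Map<String, Integer> out = new TreeMap<>();
            for (Pair<String, Integer> row : pipeline.materialize(table)) {
                out.put(row.first(), row.second());
            }
            return out;
        } finally {
            pipeline.done();
        }
    }
}
```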
  - cache
```
public <T> void cache(PCollection<T> pcollection,
                      CachingOptions options)
```
    Description copied from interface: Pipeline
    
    Caches the given PCollection so that it will be processed at most once during pipeline execution.
    
    Parameters:
    
    pcollection - The PCollection to cache
    
    options - The options for how the cached data is stored
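
Caching matters most when one `PCollection` feeds several downstream consumers. The following sketch caches an intermediate collection and then reads it twice, once directly and once through a further transformation. It assumes a local Spark master; `CacheSketch`, `ToUpper`, and `Length` are hypothetical names, and the chosen storage settings are illustrative only.

```java
import java.util.List;

import org.apache.crunch.CachingOptions;
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

public class CacheSketch {

    static class ToUpper extends MapFn<String, String> {
        @Override
        public String map(String input) {
            return input.toUpperCase();
        }
    }

    static class Length extends MapFn<String, Integer> {
        @Override
        public Integer map(String input) {
            return input.length();
        }
    }

    // Returns "<number of values>:<sum of their lengths>".
    static String countAndTotalLength(List<String> words) {
        Pipeline pipeline = new SparkPipeline("local", "cache-sketch");
        try {
            PCollection<String> upper = pipeline
                .create(words, Writables.strings())
                .parallelDo(new ToUpper(), Writables.strings());

            // Ask for the intermediate result to be kept in memory and on disk.
            CachingOptions options =
                CachingOptions.builder().useMemory(true).useDisk(true).build();
            pipeline.cache(upper, options);

            // Two consumers of the same cached collection.
            Iterable<String> values = pipeline.materialize(upper);
            Iterable<Integer> lengths =
                pipeline.materialize(upper.parallelDo(new Length(), Writables.ints()));

            int count = 0;
            for (String ignored : values) {
                count++;
            }
            int total = 0;
            for (Integer length : lengths) {
                total += length;
            }
            return count + ":" + total;
        } finally {
            pipeline.done();
        }
    }
}
```

`CachingOptions.DEFAULT` can be passed instead of a custom builder when no particular storage behavior is needed.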
  - run
```
public PipelineResult run()
```
    Description copied from interface: Pipeline
    
    Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
  - runAsync
```
public PipelineExecution runAsync()
```
    Description copied from interface: Pipeline
    
    Constructs and starts a series of MapReduce jobs in order to write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.
    
    Returns:
    
    A PipelineExecution that clients can use to monitor and control the running jobs
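
A hedged sketch of asynchronous execution: the pipeline is given one text output, started with `runAsync()`, and then awaited through the returned `PipelineExecution`. It assumes a local Spark master and an output directory that does not exist yet; `RunAsyncSketch` and the `outputDir` parameter are hypothetical.

```java
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineExecution;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

public class RunAsyncSketch {

    // Returns the name of the final execution status, e.g. "SUCCEEDED".
    static String runAndWait(List<String> lines, String outputDir) {
        Pipeline pipeline = new SparkPipeline("local", "run-async-sketch");
        try {
            PCollection<String> input = pipeline.create(lines, Writables.strings());
            pipeline.writeTextFile(input, outputDir);

            // Start the jobs without blocking the calling thread.
            PipelineExecution execution = pipeline.runAsync();

            // A client could poll getStatus() or call kill() here;
            // this sketch simply blocks until the jobs finish.
            execution.waitUntilDone();
            return execution.getStatus().name();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the pipeline", e);
        } finally {
            pipeline.done();
        }
    }
}
```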
  - done
```
public PipelineResult done()
```
    Description copied from interface: Pipeline
    
    Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.
    
    Specified by:
    
    done in interface Pipeline
    
    Overrides:
    
    done in class DistributedPipeline
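
For the synchronous path, `done()` can serve as the final call of a program: it runs whatever output is still pending and returns a `PipelineResult`. The sketch below assumes a local Spark master and an output directory that does not exist yet; `DoneSketch` and `outputDir` are hypothetical names.

```java
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.spark.SparkPipeline;
import org.apache.crunch.types.writable.Writables;

public class DoneSketch {

    static boolean writeAndFinish(List<String> lines, String outputDir) {
        Pipeline pipeline = new SparkPipeline("local", "done-sketch");
        PCollection<String> input = pipeline.create(lines, Writables.strings());

        // Register an output target; nothing runs until run() or done() is called.
        pipeline.writeTextFile(input, outputDir);

        // Runs the remaining work, then cleans up intermediate files.
        PipelineResult result = pipeline.done();
        return result.succeeded();
    }
}
```

Calling `run()` instead would execute the same pending work but leave the pipeline usable for further stages.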
