SparkPipeline (Apache Crunch 0.10.0 API)

This project has retired. For details please refer to its Attic page.

org.apache.crunch.impl.spark
Class SparkPipeline

java.lang.Object
  org.apache.crunch.impl.dist.DistributedPipeline
      org.apache.crunch.impl.spark.SparkPipeline

All Implemented Interfaces: Pipeline

public class SparkPipeline
extends DistributedPipeline

Constructor Summary

SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext, String appName)


SparkPipeline(String sparkConnect, String appName)


SparkPipeline(String sparkConnect, String appName, Class<?> jarClass)


Method Summary

<T> void cache(PCollection<T> pcollection, CachingOptions options)
          Caches the given PCollection so that it will be processed at most once during pipeline execution.

PipelineResult done()
          Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.

<S> PCollection<S> emptyPCollection(PType<S> ptype)


<K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)


<T> Iterable<T> materialize(PCollection<T> pcollection)
          Create the given PCollection and read the data it contains into the returned Collection instance for client use.

PipelineResult run()
          Constructs and executes a series of MapReduce jobs in order to write data to the output targets.

PipelineExecution runAsync()
          Constructs and starts a series of MapReduce jobs in order to write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.

Methods inherited from class org.apache.crunch.impl.dist.DistributedPipeline

cleanup, createIntermediateOutput, createTempPath, enableDebug, getConfiguration, getFactory, getMaterializeSourceTarget, getName, getNextAnonymousStageId, read, read, readTextFile, setConfiguration, write, write, writeTextFile

Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

SparkPipeline

public SparkPipeline(String sparkConnect,
                     String appName)

SparkPipeline

public SparkPipeline(String sparkConnect,
                     String appName,
                     Class<?> jarClass)

SparkPipeline

public SparkPipeline(org.apache.spark.api.java.JavaSparkContext sparkContext,
                     String appName)

Method Detail

materialize

public <T> Iterable<T> materialize(PCollection<T> pcollection)

Description copied from interface: Pipeline

Create the given PCollection and read the data it contains into the returned Collection instance for client use.

Parameters: pcollection - The PCollection to materialize
Returns: the data from the PCollection as a read-only Collection

emptyPCollection

public <S> PCollection<S> emptyPCollection(PType<S> ptype)

Specified by: emptyPCollection in interface Pipeline
Overrides: emptyPCollection in class DistributedPipeline

emptyPTable

public <K,V> PTable<K,V> emptyPTable(PTableType<K,V> ptype)

Specified by: emptyPTable in interface Pipeline
Overrides: emptyPTable in class DistributedPipeline

cache

public <T> void cache(PCollection<T> pcollection,
                      CachingOptions options)

Description copied from interface: Pipeline

Caches the given PCollection so that it will be processed at most once during pipeline execution.

Parameters: pcollection - The PCollection to cache; options - The options for how the cached data is stored
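
The "processed at most once" guarantee can be pictured with a small plain-JDK model. This is only a sketch of the idea, not Crunch code: the class AtMostOnceSketch and its methods are hypothetical names, and a memoizing Supplier stands in for a cached PCollection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Plain-JDK model of "processed at most once"; hypothetical, not part of Crunch. */
class AtMostOnceSketch {

    /** Wraps a supplier so its result is computed on first use and reused afterwards. */
    static <T> Supplier<T> cached(final Supplier<T> compute) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public synchronized T get() {
                if (!computed) {
                    value = compute.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    /** Reads the cached data twice and reports how often it was actually computed. */
    static int computationsForTwoReads() {
        AtomicInteger runs = new AtomicInteger();
        Supplier<List<Integer>> expensive = () -> {
            runs.incrementAndGet(); // count every real computation
            return Arrays.asList(1, 2, 3);
        };
        Supplier<List<Integer>> collection = cached(expensive);
        collection.get(); // first consumer triggers the computation
        collection.get(); // second consumer reuses the stored result
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println(computationsForTwoReads());
    }
}
```

In this model two consumers read the same data while the expensive step runs a single time, which is the effect the description above attributes to caching a PCollection that feeds several outputs.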

run

public PipelineResult run()

Description copied from interface: Pipeline

Constructs and executes a series of MapReduce jobs in order to write data to the output targets.

runAsync

public PipelineExecution runAsync()

Description copied from interface: Pipeline

Constructs and starts a series of MapReduce jobs in order to write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.

Returns: the PipelineExecution for the started jobs
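
The start-then-control pattern described for runAsync can be illustrated with JDK futures alone. The sketch below is an analogy under stated assumptions, not the Crunch API: AsyncRunSketch, startJobs and the status strings are hypothetical, and a CompletableFuture stands in for the returned execution handle.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Plain-JDK model of a call that starts work and hands back a future; not Crunch code. */
class AsyncRunSketch {

    /** Hypothetical stand-in for an asynchronous run: starts the work and returns at once. */
    static CompletableFuture<String> startJobs(final CountDownLatch release) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                release.await(); // the simulated jobs run until the latch opens
                return "SUCCEEDED";
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "INTERRUPTED";
            }
        });
    }

    /** Lets the simulated jobs finish and waits, with a timeout, for their outcome. */
    static String runToCompletion() {
        CountDownLatch release = new CountDownLatch(1);
        CompletableFuture<String> execution = startJobs(release);
        boolean finishedTooSoon = execution.isDone(); // false while the latch is closed
        release.countDown();
        try {
            String status = execution.get(5, TimeUnit.SECONDS);
            return finishedTooSoon ? "UNEXPECTED" : status;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Abandons the simulated jobs before they finish. */
    static boolean cancelEarly() {
        CountDownLatch release = new CountDownLatch(1);
        CompletableFuture<String> execution = startJobs(release);
        // Marks the future as cancelled; the worker thread itself is not interrupted.
        boolean cancelled = execution.cancel(true);
        release.countDown(); // let the worker thread exit
        return cancelled && execution.isCancelled();
    }

    public static void main(String[] args) {
        System.out.println(runToCompletion());
        System.out.println(cancelEarly());
    }
}
```

The caller keeps control after the call returns: it can check progress, block with a timeout, or cancel. A blocking call such as run gives up that control until the jobs have finished.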

done

public PipelineResult done()

Description copied from interface: Pipeline

Run any remaining jobs required to generate outputs and then clean up any intermediate data files that were created in this run or previous calls to run.

Specified by: done in interface Pipeline
Overrides: done in class DistributedPipeline
