This project has retired. For details please refer to its Attic page.
MRPipeline (Apache Crunch 0.9.0 API)

org.apache.crunch.impl.mr
Class MRPipeline

java.lang.Object
  extended by org.apache.crunch.impl.dist.DistributedPipeline
      extended by org.apache.crunch.impl.mr.MRPipeline
All Implemented Interfaces:
Pipeline

public class MRPipeline
extends DistributedPipeline

Pipeline implementation that is executed within Hadoop MapReduce.


Field Summary
 
Fields inherited from class org.apache.crunch.impl.dist.DistributedPipeline
factory, outputTargets, outputTargetsToMaterialize
 
Constructor Summary
MRPipeline(Class<?> jarClass)
          Instantiate with a default Configuration and name.
MRPipeline(Class<?> jarClass, org.apache.hadoop.conf.Configuration conf)
          Instantiate with a custom configuration and default naming.
MRPipeline(Class<?> jarClass, String name)
          Instantiate with a custom pipeline name.
MRPipeline(Class<?> jarClass, String name, org.apache.hadoop.conf.Configuration conf)
          Instantiate with a custom name and configuration.
 
Method Summary
<T> void
cache(PCollection<T> pcollection, CachingOptions options)
          Caches the given PCollection so that it will be processed at most once during pipeline execution.
<T> Iterable<T>
materialize(PCollection<T> pcollection)
          Create the given PCollection and read the data it contains into the returned Collection instance for client use.
 MRExecutor plan()
           
 PipelineResult run()
          Constructs and executes a series of MapReduce jobs in order to write data to the output targets.
 MRPipelineExecution runAsync()
          Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.
 
Methods inherited from class org.apache.crunch.impl.dist.DistributedPipeline
cleanup, createIntermediateOutput, createTempPath, done, enableDebug, getConfiguration, getFactory, getMaterializeSourceTarget, getName, getNextAnonymousStageId, read, read, readTextFile, setConfiguration, write, write, writeTextFile
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

MRPipeline

public MRPipeline(Class<?> jarClass)
Instantiate with a default Configuration and name.

Parameters:
jarClass - Class containing the main driver method for running the pipeline

MRPipeline

public MRPipeline(Class<?> jarClass,
                  String name)
Instantiate with a custom pipeline name. The name will be displayed in the Hadoop JobTracker.

Parameters:
jarClass - Class containing the main driver method for running the pipeline
name - Display name of the pipeline

MRPipeline

public MRPipeline(Class<?> jarClass,
                  org.apache.hadoop.conf.Configuration conf)
Instantiate with a custom configuration and default naming.

Parameters:
jarClass - Class containing the main driver method for running the pipeline
conf - Configuration to be used within all MapReduce jobs run in the pipeline

MRPipeline

public MRPipeline(Class<?> jarClass,
                  String name,
                  org.apache.hadoop.conf.Configuration conf)
Instantiate with a custom name and configuration. The name will be displayed in the Hadoop JobTracker.

Parameters:
jarClass - Class containing the main driver method for running the pipeline
name - Display name of the pipeline
conf - Configuration to be used within all MapReduce jobs run in the pipeline
Method Detail

plan

public MRExecutor plan()

run

public PipelineResult run()
Description copied from interface: Pipeline
Constructs and executes a series of MapReduce jobs in order to write data to the output targets.


runAsync

public MRPipelineExecution runAsync()
Description copied from interface: Pipeline
Constructs and starts a series of MapReduce jobs in order ot write data to the output targets, but returns a ListenableFuture to allow clients to control job execution.

Returns:

materialize

public <T> Iterable<T> materialize(PCollection<T> pcollection)
Description copied from interface: Pipeline
Create the given PCollection and read the data it contains into the returned Collection instance for client use.

Parameters:
pcollection - The PCollection to materialize
Returns:
the data from the PCollection as a read-only Collection

cache

public <T> void cache(PCollection<T> pcollection,
                      CachingOptions options)
Description copied from interface: Pipeline
Caches the given PCollection so that it will be processed at most once during pipeline execution.

Parameters:
pcollection - The PCollection to cache
options - The options for how the cached data is stored


Copyright © 2014 The Apache Software Foundation. All Rights Reserved.