This project has retired. For details please refer to its Attic page.
PipelineCallable (Apache Crunch 0.11.0 API)

org.apache.crunch
Class PipelineCallable<Output>

java.lang.Object
  extended by org.apache.crunch.PipelineCallable<Output>
Type Parameters:
Output - the output value returned by this instance (Void, PCollection, Pair<PCollection, PCollection>, etc.
All Implemented Interfaces:
Callable<PipelineCallable.Status>

public abstract class PipelineCallable<Output>
extends Object
implements Callable<PipelineCallable.Status>

A specialization of Callable that executes some sequential logic on the client machine as part of an overall Crunch pipeline in order to generate zero or more outputs, some of which may be PCollection instances that are processed by other jobs in the pipeline.

PipelineCallable is intended to be used to inject auxiliary logic into the control flow of a Crunch pipeline. This can be used for a number of purposes, such as importing or exporting data into a cluster using Apache Sqoop, executing a legacy MapReduce job or Pig/Hive script within a Crunch pipeline, or sending emails or status notifications about the status of a long-running pipeline during its execution.

The Crunch planner needs to know three things about a PipelineCallable instance in order to manage it:

  1. The Target and PCollection instances that must have been materialized before this instance is allowed to run. This information should be specified via the dependsOn methods of the class.
  2. What Outputs will be created after this instance is executed, if any. These outputs may be new PCollection instances that are used as inputs in other Crunch jobs. These outputs should be specified by the getOutput(Pipeline) method of the class, which will be executed immediately after this instance is registered with the Pipeline.sequentialDo(org.apache.crunch.PipelineCallable) method.
  3. The actual logic to execute when the dependent Targets and PCollections have been created in order to materialize the output data. This is defined in the call method of the class.

If a given PipelineCallable does not have any dependencies, it will be executed before any jobs are run by the planner. After that, the planner will keep track of when the dependencies of a given instance have been materialized, and then execute the instance as soon as they all exist. The Crunch planner uses a thread pool executor to run multiple PipelineCallable instances simultaneously, but you can indicate that an instance should be run by itself by overriding the boolean runSingleThreaded() method below to return true.

The call method returns a Status to indicate whether it succeeded or failed. A failed instance, or any exceptions/errors thrown by the call method, will cause the overall Crunch pipeline containing this instance to fail.

A number of helper methods for accessing the dependent Target/PCollection instances that this instance needs to exist, as well as the Configuration instance for the overall Pipeline execution, are available as protected methods in this class so that they may be accessed from implementations of PipelineCallable within the call method.


Nested Class Summary
static class PipelineCallable.Status
           
 
Constructor Summary
PipelineCallable()
           
 
Method Summary
 PipelineCallable<Output> dependsOn(String label, PCollection<?> pcollect)
          Requires that the given PCollection be materialized to disk before this instance may be executed.
 PipelineCallable<Output> dependsOn(String label, Target t)
          Requires that the given Target exists before this instance may be executed.
 Output generateOutput(Pipeline pipeline)
          Called by the Pipeline when this instance is registered with Pipeline#sequentialDo.
 Map<String,PCollection<?>> getAllPCollections()
          Returns the mapping of labels to PCollection dependencies for this instance.
 Map<String,Target> getAllTargets()
          Returns the mapping of labels to Target dependencies for this instance.
 String getMessage()
          Returns a message associated with this callable's execution, especially in case of errors.
 String getName()
          Returns the name of this instance.
 PipelineCallable<Output> named(String name)
          Use the given name to identify this instance in the logs.
 boolean runSingleThreaded()
          Override this method to indicate to the planner that this instance should not be run at the same time as any other PipelineCallable instances.
 void setMessage(String message)
          Sets a message associated with this callable's execution, especially in case of errors.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface java.util.concurrent.Callable
call
 

Constructor Detail

PipelineCallable

public PipelineCallable()
Method Detail

runSingleThreaded

public boolean runSingleThreaded()
Override this method to indicate to the planner that this instance should not be run at the same time as any other PipelineCallable instances.

Returns:
true if this instance should run by itself, false otherwise

dependsOn

public PipelineCallable<Output> dependsOn(String label,
                                          Target t)
Requires that the given Target exists before this instance may be executed.

Parameters:
label - A string that can be used to retrieve the given Target inside of the call method.
t - the Target itself
Returns:
this instance

dependsOn

public PipelineCallable<Output> dependsOn(String label,
                                          PCollection<?> pcollect)
Requires that the given PCollection be materialized to disk before this instance may be executed.

Parameters:
label - A string that can be used to retrieve the given PCollection inside of the call method.
pcollect - the PCollection itself
Returns:
this instance

generateOutput

public Output generateOutput(Pipeline pipeline)
Called by the Pipeline when this instance is registered with Pipeline#sequentialDo. In general, clients should override the protected getOutput(Pipeline) method instead of this one.


getName

public String getName()
Returns the name of this instance.


named

public PipelineCallable<Output> named(String name)
Use the given name to identify this instance in the logs.


getMessage

public String getMessage()
Returns a message associated with this callable's execution, especially in case of errors.


setMessage

public void setMessage(String message)
Sets a message associated with this callable's execution, especially in case of errors.


getAllPCollections

public Map<String,PCollection<?>> getAllPCollections()
Returns the mapping of labels to PCollection dependencies for this instance.


getAllTargets

public Map<String,Target> getAllTargets()
Returns the mapping of labels to Target dependencies for this instance.



Copyright © 2014 The Apache Software Foundation. All Rights Reserved.