Output
- the output value returned by this instance (Void, PCollection, Pair<PCollection, PCollection>,
etc.public abstract class PipelineCallable<Output> extends Object implements Callable<PipelineCallable.Status>
Callable
that executes some sequential logic on the client machine as
part of an overall Crunch pipeline in order to generate zero or more outputs, some of
which may be PCollection
instances that are processed by other jobs in the
pipeline.
PipelineCallable
is intended to be used to inject auxiliary logic into the control
flow of a Crunch pipeline. This can be used for a number of purposes, such as importing or
exporting data into a cluster using Apache Sqoop, executing a legacy MapReduce job
or Pig/Hive script within a Crunch pipeline, or sending emails or status notifications
about the status of a long-running pipeline during its execution.
The Crunch planner needs to know three things about a PipelineCallable
instance in order
to manage it:
Target
and PCollection
instances that must have been materialized
before this instance is allowed to run. This information should be specified via the dependsOn
methods of the class.PCollection
instances that are used as inputs in other Crunch jobs. These outputs should
be specified by the getOutput(Pipeline)
method of the class, which will be executed immediately
after this instance is registered with the Pipeline.sequentialDo(org.apache.crunch.PipelineCallable<Output>)
method.call
method of the class.If a given PipelineCallable does not have any dependencies, it will be executed before any jobs are run
by the planner. After that, the planner will keep track of when the dependencies of a given instance
have been materialized, and then execute the instance as soon as they all exist. The Crunch planner
uses a thread pool executor to run multiple PipelineCallable
instances simultaneously, but you can
indicate that an instance should be run by itself by overriding the boolean runSingleThreaded()
method
below to return true.
The call
method returns a Status
to indicate whether it succeeded or failed. A failed
instance, or any exceptions/errors thrown by the call method, will cause the overall Crunch pipeline containing
this instance to fail.
A number of helper methods for accessing the dependent Target/PCollection instances that this instance
needs to exist, as well as the Configuration
instance for the overall Pipeline execution, are available
as protected methods in this class so that they may be accessed from implementations of PipelineCallable
within the call
method.
Modifier and Type | Class and Description |
---|---|
static class |
PipelineCallable.Status |
Constructor and Description |
---|
PipelineCallable() |
Modifier and Type | Method and Description |
---|---|
PipelineCallable<Output> |
dependsOn(String label,
PCollection<?> pcollect)
Requires that the given
PCollection be materialized to disk before this instance may be
executed. |
PipelineCallable<Output> |
dependsOn(String label,
Target t)
Requires that the given
Target exists before this instance may be
executed. |
Output |
generateOutput(Pipeline pipeline)
Called by the
Pipeline when this instance is registered with Pipeline#sequentialDo . |
Map<String,PCollection<?>> |
getAllPCollections()
Returns the mapping of labels to PCollection dependencies for this instance.
|
Map<String,Target> |
getAllTargets()
Returns the mapping of labels to Target dependencies for this instance.
|
String |
getMessage()
Returns a message associated with this callable's execution, especially in case of errors.
|
String |
getName()
Returns the name of this instance.
|
PipelineCallable<Output> |
named(String name)
Use the given name to identify this instance in the logs.
|
boolean |
runSingleThreaded()
Override this method to indicate to the planner that this instance should not be run at the
same time as any other
PipelineCallable instances. |
void |
setMessage(String message)
Sets a message associated with this callable's execution, especially in case of errors.
|
public boolean runSingleThreaded()
PipelineCallable
instances.public PipelineCallable<Output> dependsOn(String label, Target t)
Target
exists before this instance may be
executed.label
- A string that can be used to retrieve the given Target inside of the call
method.t
- the Target
itselfpublic PipelineCallable<Output> dependsOn(String label, PCollection<?> pcollect)
PCollection
be materialized to disk before this instance may be
executed.label
- A string that can be used to retrieve the given PCollection inside of the call
method.pcollect
- the PCollection
itselfpublic Output generateOutput(Pipeline pipeline)
Pipeline
when this instance is registered with Pipeline#sequentialDo
. In general,
clients should override the protected getOutput(Pipeline)
method instead of this one.public String getName()
public PipelineCallable<Output> named(String name)
public String getMessage()
public void setMessage(String message)
public Map<String,PCollection<?>> getAllPCollections()
Copyright © 2016 The Apache Software Foundation. All rights reserved.