org.apache.crunch
Class DoFn<S,T>

java.lang.Object
  org.apache.crunch.DoFn<S,T>
public abstract class DoFn<S,T>
Base class for all data processing functions in Crunch.

Note that all DoFn instances implement Serializable, and thus all of their non-transient member variables must implement Serializable as well. If your DoFn depends on non-serializable classes for data processing, they may be declared as transient and initialized in the DoFn's initialize method.
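To make the transient-and-initialize pattern concrete, here is a minimal sketch (not part of the original Javadoc). The class name HashFn is invented for illustration; it relies only on the DoFn and Emitter APIs documented on this page plus the JDK's MessageDigest, which is not Serializable:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Emits a hex SHA-256 digest for each input string. MessageDigest is not
// Serializable, so it is held in a transient field and created in
// initialize(), which runs on each task before process() is called.
public class HashFn extends DoFn<String, String> {

  private transient MessageDigest digest;

  @Override
  public void initialize() {
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  @Override
  public void process(String input, Emitter<String> emitter) {
    digest.reset();
    byte[] raw = digest.digest(input.getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder(raw.length * 2);
    for (byte b : raw) {
      hex.append(String.format("%02x", b & 0xff));
    }
    emitter.emit(hex.toString());
  }
}
```

A function like this would typically be applied with PCollection.parallelDo, passing a PType for the output (for example, Writables.strings()).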
Constructor Summary
-------------------

| Constructor | Description |
|---|---|
| DoFn() | |
Method Summary
--------------

| Modifier and Type | Method and Description |
|---|---|
| void | cleanup(Emitter<T> emitter): Called during the cleanup of the MapReduce job this DoFn is associated with. |
| void | configure(org.apache.hadoop.conf.Configuration conf): Configure this DoFn. |
| boolean | disableDeepCopy(): By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects. |
| protected org.apache.hadoop.conf.Configuration | getConfiguration() |
| protected org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> | getContext() |
| protected org.apache.hadoop.mapreduce.Counter | getCounter(Enum<?> counterName): Deprecated. The Counter class changed incompatibly between Hadoop 1 and 2 (from a class to an interface), so user programs should avoid this method and use one of the increment methods instead, such as increment(Enum). |
| protected org.apache.hadoop.mapreduce.Counter | getCounter(String groupName, String counterName): Deprecated. The Counter class changed incompatibly between Hadoop 1 and 2 (from a class to an interface), so user programs should avoid this method and use one of the increment methods instead, such as increment(Enum). |
| protected String | getStatus() |
| protected org.apache.hadoop.mapreduce.TaskAttemptID | getTaskAttemptID() |
| protected void | increment(Enum<?> counterName) |
| protected void | increment(Enum<?> counterName, long value) |
| protected void | increment(String groupName, String counterName) |
| protected void | increment(String groupName, String counterName, long value) |
| void | initialize(): Initialize this DoFn. |
| abstract void | process(S input, Emitter<T> emitter): Processes the records from a PCollection. |
| protected void | progress() |
| float | scaleFactor(): Returns an estimate of how applying this function to a PCollection will cause it to change in size. |
| void | setConfiguration(org.apache.hadoop.conf.Configuration conf): Called during the setup of an initialized PType that relies on this instance. |
| void | setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context): Called during setup to pass the TaskInputOutputContext to this DoFn instance. |
| protected void | setStatus(String status) |
Methods inherited from class java.lang.Object
---------------------------------------------

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
------------------

public DoFn()
Method Detail
-------------
public void configure(org.apache.hadoop.conf.Configuration conf)

Configure this DoFn. Called during the job planning phase by the crunch-client.

Parameters:
conf - The Configuration instance for the Job.

public void initialize()

Initialize this DoFn. This initialization happens before process(Object, Emitter) is triggered. Subclasses may override this method to do appropriate initialization.

Called during the setup of the job instance this DoFn is associated with.
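As a rough illustration of how configure and initialize can work together (not from the original documentation), the sketch below stores a setting in the job Configuration at planning time and reads it back on the task via getConfiguration(); the class name and property key are made up, and it assumes the runtime has supplied the Configuration by the time initialize() runs, as happens during normal job execution:

```java
import org.apache.hadoop.conf.Configuration;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Hypothetical filter whose threshold travels through the job Configuration.
public class ThresholdFilterFn extends DoFn<Integer, Integer> {

  // Property name invented for this example.
  private static final String THRESHOLD_KEY = "example.threshold";

  private int threshold;

  @Override
  public void configure(Configuration conf) {
    // Runs on the client during planning: supply a default if none was set.
    if (conf.get(THRESHOLD_KEY) == null) {
      conf.setInt(THRESHOLD_KEY, 10);
    }
  }

  @Override
  public void initialize() {
    // Runs on the task: read the setting back from the runtime Configuration.
    threshold = getConfiguration().getInt(THRESHOLD_KEY, 10);
  }

  @Override
  public void process(Integer input, Emitter<Integer> emitter) {
    if (input != null && input >= threshold) {
      emitter.emit(input);
    }
  }
}
```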
public abstract void process(S input, Emitter<T> emitter)

Processes the records from a PCollection.

Note: the input record object may be reused, with its contents changing on each process(Object, Emitter) method call. This functionality is imposed by Hadoop's Reducer implementation: the framework will reuse the key and value objects that are passed into the reduce, therefore the application should clone the objects they want to keep a copy of.

Parameters:
input - The input record.
emitter - The emitter to send the output to
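A minimal sketch of the object-reuse caveat above (invented for this rewrite, not from the original Javadoc): any input that is kept across process calls must be copied first. The class name DistinctPerTaskFn is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.hadoop.io.Text;

// Emits each Text value at most once per task. Because the framework may
// hand the same Text instance to every call, the value is copied (via
// toString()) before it is stored in the seen-set.
public class DistinctPerTaskFn extends DoFn<Text, Text> {

  private transient Set<String> seen;

  @Override
  public void initialize() {
    seen = new HashSet<String>();
  }

  @Override
  public void process(Text input, Emitter<Text> emitter) {
    String copy = input.toString();   // independent copy of the reused object
    if (seen.add(copy)) {
      emitter.emit(new Text(copy));   // emit a fresh Text, not the reused input
    }
  }
}
```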
public void cleanup(Emitter<T> emitter)

Called during the cleanup of the MapReduce job this DoFn is associated with. Subclasses may override this method to do appropriate cleanup.

Parameters:
emitter - The emitter that was used for output
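One common use of cleanup is to flush per-task state that was accumulated in process, as in this hypothetical in-task counting sketch (not part of the original documentation; it assumes only the DoFn, Emitter, and Pair classes from Crunch):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.Pair;

// Counts values within a single task and emits the partial counts once,
// when cleanup() is called at the end of the task.
public class PartialCountFn extends DoFn<String, Pair<String, Long>> {

  private transient Map<String, Long> counts;

  @Override
  public void initialize() {
    counts = new HashMap<String, Long>();
  }

  @Override
  public void process(String input, Emitter<Pair<String, Long>> emitter) {
    Long current = counts.get(input);
    counts.put(input, current == null ? 1L : current + 1L);
  }

  @Override
  public void cleanup(Emitter<Pair<String, Long>> emitter) {
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      emitter.emit(Pair.of(e.getKey(), e.getValue()));
    }
  }
}
```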
public void setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)

Called during setup to pass the TaskInputOutputContext to this DoFn instance.

public void setConfiguration(org.apache.hadoop.conf.Configuration conf)

Called during the setup of an initialized PType that relies on this instance.

Parameters:
conf - The configuration for the PType being initialized
public float scaleFactor()

Returns an estimate of how applying this function to a PCollection will cause it to change in size. The optimizer uses these estimates to decide where to break up dependent MR jobs into separate Map and Reduce phases in order to minimize I/O.

Subclasses of DoFn that will substantially alter the size of the resulting PCollection should override this method.
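For example, a DoFn that filters out most of its input might report a scale factor well below 1. The sketch below is illustrative only, not from the original Javadoc; the 0.01 estimate and the class name are invented:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Passes through only a small fraction of its input, so it reports a small
// scale factor to help the planner place map/reduce boundaries.
public class RareEventFn extends DoFn<String, String> {

  @Override
  public void process(String input, Emitter<String> emitter) {
    if (input.startsWith("ERROR")) {
      emitter.emit(input);
    }
  }

  @Override
  public float scaleFactor() {
    // Rough, assumed estimate: error lines are about 1% of the input.
    return 0.01f;
  }
}
```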
public boolean disableDeepCopy()

By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects. Implementations that do not require this protection may override this method to return true, which disables the deep copy.
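A hedged sketch of when overriding this might be reasonable (not from the original Javadoc): a function whose outputs are freshly created, immutable objects leaves nothing for downstream consumers to modify concurrently, so it can opt out of the deep copy. The class name is invented:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Emits a brand-new, immutable String for every input, so the defensive
// deep copy of outputs is assumed to be unnecessary here.
public class TrimFn extends DoFn<String, String> {

  @Override
  public void process(String input, Emitter<String> emitter) {
    emitter.emit(input.trim());
  }

  @Override
  public boolean disableDeepCopy() {
    return true;
  }
}
```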
protected org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> getContext()
protected org.apache.hadoop.conf.Configuration getConfiguration()
@Deprecated
protected org.apache.hadoop.mapreduce.Counter getCounter(Enum<?> counterName)

Deprecated. The Counter class changed incompatibly between Hadoop 1 and 2 (from a class to an interface), so user programs should avoid this method and use one of the increment methods instead, such as increment(Enum).

@Deprecated
protected org.apache.hadoop.mapreduce.Counter getCounter(String groupName, String counterName)

Deprecated. The Counter class changed incompatibly between Hadoop 1 and 2 (from a class to an interface), so user programs should avoid this method and use one of the increment methods instead, such as increment(Enum).
protected void increment(String groupName, String counterName)
protected void increment(String groupName, String counterName, long value)
protected void increment(Enum<?> counterName)
protected void increment(Enum<?> counterName, long value)
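The increment helpers are the recommended replacement for the deprecated getCounter methods above. A small, hypothetical example (not from the original Javadoc) of using both the enum-based and group/name-based forms inside process:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

// Filters empty records and tracks how many were kept or dropped using the
// counter increment helpers. The enum and counter names are made up.
public class CountingFilterFn extends DoFn<String, String> {

  public enum Stats { KEPT }

  @Override
  public void process(String input, Emitter<String> emitter) {
    if (input != null && !input.isEmpty()) {
      emitter.emit(input);
      increment(Stats.KEPT);                     // enum-based counter
    } else {
      increment("CountingFilterFn", "dropped");  // group/name-based counter
    }
  }
}
```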
protected void progress()
protected org.apache.hadoop.mapreduce.TaskAttemptID getTaskAttemptID()
protected void setStatus(String status)
protected String getStatus()