public abstract class DoFn<S,T> extends Object implements Serializable
 Note that all DoFn instances implement Serializable, and thus
 all of their non-transient member variables must implement
 Serializable as well. If your DoFn depends on non-serializable
 classes for data processing, they may be declared as transient and
 initialized in the DoFn's initialize method.
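The transient-field pattern described above can be sketched as follows. This is a minimal illustration, not Crunch code: `Emitter` and `DoFn` here are simplified stand-ins declared locally so the example compiles without Crunch on the classpath, and `WordSplitter`, `SplitWordsFn`, and `TransientFieldSketch` are hypothetical names. The round trip through Java serialization mimics a function being shipped to a task, after which the transient helper is null until `initialize()` rebuilds it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins (assumption) for org.apache.crunch.Emitter and
// org.apache.crunch.DoFn, so the sketch is self-contained.
interface Emitter<T> {
  void emit(T value);
}

abstract class DoFn<S, T> implements Serializable {
  public void initialize() {}
  public abstract void process(S input, Emitter<T> emitter);
}

// Hypothetical helper that deliberately does NOT implement Serializable.
class WordSplitter {
  String[] split(String line) {
    return line.trim().split("\\s+");
  }
}

class SplitWordsFn extends DoFn<String, String> {
  // transient: skipped by Java serialization, so it is null after the
  // function is deserialized and must be rebuilt in initialize().
  private transient WordSplitter splitter;

  @Override
  public void initialize() {
    splitter = new WordSplitter();
  }

  @Override
  public void process(String input, Emitter<String> emitter) {
    for (String word : splitter.split(input)) {
      emitter.emit(word);
    }
  }
}

class TransientFieldSketch {
  // Serializes and deserializes the function in memory.
  static SplitWordsFn roundTrip(SplitWordsFn fn) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(fn);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (SplitWordsFn) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  static List<String> run(String line) {
    SplitWordsFn fn = roundTrip(new SplitWordsFn());
    fn.initialize(); // rebuild the transient helper before any process() call
    List<String> out = new ArrayList<>();
    fn.process(line, out::add);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(run("to be or not")); // prints [to, be, or, not]
  }
}
```

Without the `transient` modifier, serializing `SplitWordsFn` would fail as soon as `splitter` held a `WordSplitter`, because that class is not serializable.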
| Constructor and Description | 
|---|
| DoFn() | 
| Modifier and Type | Method and Description | 
|---|---|
| void | cleanup(Emitter<T> emitter) - Called during the cleanup of the MapReduce job this DoFn is associated with. |
| void | configure(org.apache.hadoop.conf.Configuration conf) - Configure this DoFn. |
| boolean | disableDeepCopy() - By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects. |
| void | initialize() - Initialize this DoFn. |
| abstract void | process(S input, Emitter<T> emitter) - Processes the records from a PCollection. |
| float | scaleFactor() - Returns an estimate of how applying this function to a PCollection will cause it to change in size. |
| void | setConfiguration(org.apache.hadoop.conf.Configuration conf) - Called during the setup of an initialized PType that relies on this instance. |
| void | setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context) - Called during setup to pass the TaskInputOutputContext to this DoFn instance. |

public void configure(org.apache.hadoop.conf.Configuration conf)
Configure this DoFn. Called during the job planning phase by the crunch-client.
conf - The Configuration instance for the Job.

public void initialize()
Initialize this DoFn. This initialization happens before process(Object, Emitter) is triggered. Subclasses may override this method to do appropriate initialization.

Called during the setup of the job instance this DoFn is associated with.
 
public abstract void process(S input, Emitter<T> emitter)
Processes the records from a PCollection.

Note: the input record object may be reused, with its content changing on each process(Object, Emitter) method call. This behavior is imposed by Hadoop's Reducer implementation: the framework reuses the key and value objects that are passed into the reduce, so the application should clone any objects it wants to keep a copy of.
input - The input record.
emitter - The emitter to send the output to

public void cleanup(Emitter<T> emitter)
Called during the cleanup of the MapReduce job this DoFn is associated with. Subclasses may override this method to do appropriate cleanup.
emitter - The emitter that was used for output

public void setContext(@Nonnull org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)
Called during setup to pass the TaskInputOutputContext to this DoFn instance. The specified TaskInputOutputContext must not be null.

public void setConfiguration(@Nonnull org.apache.hadoop.conf.Configuration conf)
Called during the setup of an initialized PType that relies on this instance.
conf - The non-null configuration for the PType being initialized

public float scaleFactor()
Returns an estimate of how applying this function to a PCollection will cause it to change in size. The optimizer uses these estimates to decide where to break up dependent MR jobs into separate Map and Reduce phases in order to minimize I/O.

Subclasses of DoFn that will substantially alter the size of the resulting PCollection should override this method.
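
Three of the points above (copying a reused input record in process, emitting from cleanup, and overriding scaleFactor) can be combined in one small sketch. As an assumption for illustration, `Emitter` and `DoFn` are simplified local stand-ins rather than the real Crunch classes, the stand-in's default `scaleFactor()` value is a placeholder, and `LongestRecordFn` and `ObjectReuseSketch` are hypothetical names. The sketch also assumes the estimate acts as a multiplier on the input size, so a value well below 1 signals a function that shrinks its input.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins (assumption) for org.apache.crunch.Emitter and
// org.apache.crunch.DoFn, so the sketch is self-contained.
interface Emitter<T> {
  void emit(T value);
}

abstract class DoFn<S, T> implements Serializable {
  public abstract void process(S input, Emitter<T> emitter);
  public void cleanup(Emitter<T> emitter) {}
  // Placeholder default for this stand-in only; not taken from Crunch.
  public float scaleFactor() {
    return 1.0f;
  }
}

// Keeps only the longest record seen and emits it once, in cleanup().
class LongestRecordFn extends DoFn<StringBuilder, String> {
  private String longest = "";

  @Override
  public void process(StringBuilder input, Emitter<String> emitter) {
    if (input.length() > longest.length()) {
      // Copy the content: the caller may overwrite `input` on the next call.
      longest = input.toString();
    }
  }

  @Override
  public void cleanup(Emitter<String> emitter) {
    emitter.emit(longest);
  }

  @Override
  public float scaleFactor() {
    // Illustrative value: many input records collapse to one output record.
    return 0.01f;
  }
}

class ObjectReuseSketch {
  // Hypothetical driver that reuses ONE mutable record for every call,
  // mirroring the object reuse described for Hadoop's Reducer.
  static List<String> run(String... records) {
    LongestRecordFn fn = new LongestRecordFn();
    List<String> out = new ArrayList<>();
    StringBuilder reused = new StringBuilder();
    for (String record : records) {
      reused.setLength(0);
      reused.append(record);
      fn.process(reused, out::add);
    }
    fn.cleanup(out::add);
    return out;
  }

  public static void main(String[] args) {
    System.out.println(run("ab", "abcdef", "abc")); // prints [abcdef]
  }
}
```

Had `process` stored the `StringBuilder` reference itself instead of a copy, the retained value would track whatever the driver wrote last ("abc" here) rather than the longest record.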
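
public boolean disableDeepCopy()
By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects. To disable this copy, override this method to return true.

Copyright © 2017 The Apache Software Foundation. All rights reserved.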