org.apache.crunch
Class DoFn<S,T>

java.lang.Object
  extended by org.apache.crunch.DoFn<S,T>
All Implemented Interfaces:
Serializable
Direct Known Subclasses:
Aggregate.TopKFn, BloomFilterFn, CombineFn, FilterFn, JoinFn, MapFn

public abstract class DoFn<S,T>
extends Object
implements Serializable

Base class for all data processing functions in Crunch.

Note that all DoFn instances implement Serializable, and thus all of their non-transient member variables must implement Serializable as well. If your DoFn depends on non-serializable classes for data processing, they may be declared as transient and initialized in the DoFn's initialize method.
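The transient-field pattern above can be sketched as follows. This is an illustrative example, not Crunch code: the Emitter and DoFn stand-ins below are minimal assumptions so the sketch compiles without the Crunch jars, and GrepFn is a hypothetical subclass. A java.util.regex.Matcher is not Serializable, so it is declared transient and rebuilt in initialize(), which the framework calls after the DoFn has been deserialized on each task.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal stand-ins for Crunch's Emitter and DoFn so this sketch is
// self-contained; the real classes live in org.apache.crunch.
interface Emitter<T> { void emit(T emitted); }

abstract class DoFn<S, T> implements Serializable {
  public void initialize() {}
  public abstract void process(S input, Emitter<T> emitter);
}

// Hypothetical DoFn: the Matcher helper is non-serializable, so it is
// transient and rebuilt in initialize() rather than shipped with the job.
class GrepFn extends DoFn<String, String> {
  private final String regex;          // serializable state, shipped with the job
  private transient Matcher matcher;   // non-serializable helper, rebuilt per task

  GrepFn(String regex) { this.regex = regex; }

  @Override
  public void initialize() {
    matcher = Pattern.compile(regex).matcher("");
  }

  @Override
  public void process(String input, Emitter<String> emitter) {
    if (matcher.reset(input).find()) {
      emitter.emit(input);
    }
  }
}

public class TransientInitDemo {
  public static void main(String[] args) {
    GrepFn fn = new GrepFn("err");
    fn.initialize();                       // the framework calls this before process()
    List<String> out = new ArrayList<>();
    fn.process("error: disk full", out::add);
    fn.process("all good", out::add);
    System.out.println(out);               // [error: disk full]
  }
}
```

Had the Matcher not been transient, serializing the DoFn for job submission would fail with a NotSerializableException.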

See Also:
Serialized Form

Constructor Summary
DoFn()
           
 
Method Summary
 void cleanup(Emitter<T> emitter)
          Called during the cleanup of the MapReduce job this DoFn is associated with.
 void configure(org.apache.hadoop.conf.Configuration conf)
          Configure this DoFn.
 boolean disableDeepCopy()
          By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects.
 void initialize()
          Initialize this DoFn.
abstract  void process(S input, Emitter<T> emitter)
          Processes the records from a PCollection.
 float scaleFactor()
          Returns an estimate of how applying this function to a PCollection will cause it to change in size.
 void setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)
          Called during setup to pass the TaskInputOutputContext to this DoFn instance.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DoFn

public DoFn()
Method Detail

configure

public void configure(org.apache.hadoop.conf.Configuration conf)
Configure this DoFn. Subclasses may override this method to modify the configuration of the Job that this DoFn instance belongs to.

Called during the job planning phase by the crunch-client.

Parameters:
conf - The Configuration instance for the Job.

initialize

public void initialize()
Initialize this DoFn. This initialization happens before process(Object, Emitter) is invoked. Subclasses may override this method to perform appropriate initialization.

Called during the setup of the job instance this DoFn is associated with.


process

public abstract void process(S input,
                             Emitter<T> emitter)
Processes the records from a PCollection.

Note: Crunch may reuse a single input record object, whose content changes on each process(Object, Emitter) call. This behavior stems from Hadoop's Reducer implementation: the framework reuses the key and value objects that are passed into the reduce call, so applications must clone any objects they want to keep a copy of.

Parameters:
input - The input record.
emitter - The emitter to send the output to
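The record-reuse caveat above can be demonstrated with a small sketch. The Emitter/DoFn stand-ins and the SnapshotFn subclass are assumptions for illustration, not Crunch code: because the framework may overwrite the same input object between calls, the function copies the mutable input (via toString) instead of holding a reference to it.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for Crunch's Emitter and DoFn (the real classes
// are in org.apache.crunch).
interface Emitter<T> { void emit(T emitted); }

abstract class DoFn<S, T> implements java.io.Serializable {
  public abstract void process(S input, Emitter<T> emitter);
}

// Hypothetical DoFn over a mutable input type: emit a copy, because the
// framework may reuse and overwrite the same input object on the next call.
class SnapshotFn extends DoFn<StringBuilder, String> {
  @Override
  public void process(StringBuilder input, Emitter<String> emitter) {
    emitter.emit(input.toString()); // snapshot, not a reference to the reused object
  }
}

public class RecordReuseDemo {
  public static void main(String[] args) {
    SnapshotFn fn = new SnapshotFn();
    List<String> out = new ArrayList<>();
    StringBuilder reused = new StringBuilder("first"); // simulates the reused record
    fn.process(reused, out::add);
    reused.setLength(0);
    reused.append("second");                           // framework overwrites it in place
    fn.process(reused, out::add);
    System.out.println(out);                           // [first, second]
  }
}
```

Emitting the StringBuilder itself (or keeping it in a member collection) would leave every retained "record" pointing at the same object, whose content reflects only the last input.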

cleanup

public void cleanup(Emitter<T> emitter)
Called during the cleanup of the MapReduce job this DoFn is associated with. Subclasses may override this method to do appropriate cleanup.

Parameters:
emitter - The emitter that was used for output

setContext

public void setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)
Called during setup to pass the TaskInputOutputContext to this DoFn instance.


scaleFactor

public float scaleFactor()
Returns an estimate of how applying this function to a PCollection will cause it to change in size. The optimizer uses these estimates to decide where to break up dependent MR jobs into separate Map and Reduce phases in order to minimize I/O.

Subclasses of DoFn that will substantially alter the size of the resulting PCollection should override this method.
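As a hedged sketch of such an override (the stand-in DoFn/Emitter types and the TokenizeFn subclass are assumptions, and 5.0f is an invented estimate, not a value from Crunch): a tokenizer that emits several words per input line grows its output substantially, so it advertises a scale factor above 1 to the planner.

```java
// Minimal stand-ins for Crunch's Emitter and DoFn (the real classes
// are in org.apache.crunch; the default here is a stand-in, not
// necessarily Crunch's actual default).
interface Emitter<T> { void emit(T emitted); }

abstract class DoFn<S, T> implements java.io.Serializable {
  public abstract void process(S input, Emitter<T> emitter);
  public float scaleFactor() { return 1.0f; } // stand-in default only
}

// Hypothetical DoFn that emits one output per whitespace-separated word,
// so its output is typically several times larger than its input.
class TokenizeFn extends DoFn<String, String> {
  @Override
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      emitter.emit(word);
    }
  }

  @Override
  public float scaleFactor() {
    return 5.0f; // rough guess: ~5 words per input line, on average
  }
}

public class ScaleFactorDemo {
  public static void main(String[] args) {
    TokenizeFn fn = new TokenizeFn();
    java.util.List<String> out = new java.util.ArrayList<>();
    fn.process("hello crunch world", out::add);
    System.out.println(out + " scaleFactor=" + fn.scaleFactor());
  }
}
```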


disableDeepCopy

public boolean disableDeepCopy()
By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects. This introduces extra overhead in cases where you know that the downstream code only reads the objects and does not modify them, so you can disable this feature by overriding this method to return true.
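A minimal sketch of such an override, under the same caveats as above (the DoFn/Emitter stand-ins and the UpperFn subclass are assumptions, not Crunch code): since this function emits immutable Strings that downstream consumers only read, the defensive deep copy can safely be skipped.

```java
// Minimal stand-ins for Crunch's Emitter and DoFn (the real classes
// are in org.apache.crunch).
interface Emitter<T> { void emit(T emitted); }

abstract class DoFn<S, T> implements java.io.Serializable {
  public abstract void process(S input, Emitter<T> emitter);
  public boolean disableDeepCopy() { return false; } // stand-in default
}

// Hypothetical DoFn emitting immutable Strings: no downstream consumer
// can mutate them, so the defensive deep copy is unnecessary overhead.
class UpperFn extends DoFn<String, String> {
  @Override
  public void process(String input, Emitter<String> emitter) {
    emitter.emit(input.toUpperCase());
  }

  @Override
  public boolean disableDeepCopy() {
    return true; // safe only when no consumer mutates the emitted objects
  }
}

public class DeepCopyDemo {
  public static void main(String[] args) {
    UpperFn fn = new UpperFn();
    java.util.List<String> out = new java.util.ArrayList<>();
    fn.process("crunch", out::add);
    System.out.println(out + " disableDeepCopy=" + fn.disableDeepCopy()); // [CRUNCH] disableDeepCopy=true
  }
}
```

Returning true here is a promise about downstream behavior; if any consumer later starts mutating the shared outputs, the override must be removed.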



Copyright © 2013 The Apache Software Foundation. All Rights Reserved.