DoFn (Apache Crunch 0.15.0 API)

java.lang.Object
- org.apache.crunch.DoFn<S,T>

All Implemented Interfaces:

Serializable

Direct Known Subclasses:

Aggregate.TopKFn, BloomFilterFn, CombineFn, FilterFn, JoinFn, MapFn, SDoubleFlatMapFunction, SFlatMapFunction, SFlatMapFunction2, SPairFlatMapFunction
```
public abstract class DoFn<S,T>
extends Object
implements Serializable
```
Base class for all data processing functions in Crunch.
Note that all DoFn instances implement Serializable, and thus all of their non-transient member variables must implement Serializable as well. If your DoFn depends on non-serializable classes for data processing, they may be declared as transient and initialized in the DoFn's initialize method.

See Also:

Serialized Form

Constructor Summary

Constructors
Constructor and Description

DoFn()

Constructors
Constructor and Description
`DoFn()`

Method Summary

All Methods Instance Methods Abstract Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`cleanup(Emitter<T> emitter)` Called during the cleanup of the MapReduce job this `DoFn` is associated with.
`void`	`configure(org.apache.hadoop.conf.Configuration conf)` Configure this DoFn.
`boolean`	`disableDeepCopy()` By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects.
`void`	`initialize()` Initialize this DoFn.
`abstract void`	`process(S input, Emitter<T> emitter)` Processes the records from a `PCollection`.
`float`	`scaleFactor()` Returns an estimate of how applying this function to a `PCollection` will cause it to change in side.
`void`	`setConfiguration(org.apache.hadoop.conf.Configuration conf)` Called during the setup of an initialized `PType` that relies on this instance.
`void`	`setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)` Called during setup to pass the `TaskInputOutputContext` to this `DoFn` instance.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - DoFn
```
public DoFn()
```
- Method Detail
  - configure
```
public void configure(org.apache.hadoop.conf.Configuration conf)
```
    Configure this DoFn. Subclasses may override this method to modify the configuration of the Job that this DoFn instance belongs to.
    Called during the job planning phase by the crunch-client.
    
    Parameters:
    
    conf - The Configuration instance for the Job.
  - initialize
```
public void initialize()
```
    Initialize this DoFn. This initialization will happen before the actual process(Object, Emitter) is triggered. Subclasses may override this method to do appropriate initialization.
    Called during the setup of the job instance this DoFn is associated with.
  - process
```
public abstract void process(S input,
                             Emitter<T> emitter)
```
    Processes the records from a PCollection.
    
    Note: Crunch can reuse a single input record object whose content changes on each process(Object, Emitter) method call. This functionality is imposed by Hadoop's Reducer implementation: The framework will reuse the key and value objects that are passed into the reduce, therefore the application should clone the objects they want to keep a copy of.
    
    Parameters:
    
    input - The input record.
    
    emitter - The emitter to send the output to
  - cleanup
```
public void cleanup(Emitter<T> emitter)
```
    Called during the cleanup of the MapReduce job this DoFn is associated with. Subclasses may override this method to do appropriate cleanup.
    
    Parameters:
    
    emitter - The emitter that was used for output
  - setContext
```
public void setContext(@Nonnull
                       org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)
```
    Called during setup to pass the TaskInputOutputContext to this DoFn instance. The specified TaskInputOutputContext must not be null.
  - setConfiguration
```
public void setConfiguration(@Nonnull
                             org.apache.hadoop.conf.Configuration conf)
```
    Called during the setup of an initialized PType that relies on this instance.
    
    Parameters:
    
    conf - The non-null configuration for the PType being initialized
  - scaleFactor
```
public float scaleFactor()
```
    Returns an estimate of how applying this function to a PCollection will cause it to change in side. The optimizer uses these estimates to decide where to break up dependent MR jobs into separate Map and Reduce phases in order to minimize I/O.
    Subclasses of DoFn that will substantially alter the size of the resulting PCollection should override this method.
  - disableDeepCopy
```
public boolean disableDeepCopy()
```
    By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects. This introduces some extra overhead in cases where you know that the downstream code is only reading the objects and not modifying it, so you can disable this feature by overriding this function to return true.

Class DoFn<S,T>

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Constructor Detail

DoFn

Method Detail

configure

initialize

process

cleanup

setContext

setConfiguration

scaleFactor

disableDeepCopy