DoFn (Apache Crunch 0.10.0 API)

This project has retired. For details please refer to its Attic page.

Overview

Package

Class

Use

Tree

Deprecated

Index

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.crunch
Class DoFn<S,T>

java.lang.Object
  org.apache.crunch.DoFn<S,T>

All Implemented Interfaces:: Serializable

Direct Known Subclasses:: Aggregate.TopKFn, BloomFilterFn, CombineFn, FilterFn, JoinFn, MapFn

public abstract class DoFn<S,T>
extends Object
implements Serializable
extends Object
implements Serializable

Base class for all data processing functions in Crunch.

Note that all DoFn instances implement Serializable, and thus all of their non-transient member variables must implement Serializable as well. If your DoFn depends on non-serializable classes for data processing, they may be declared as transient and initialized in the DoFn's initialize method.

See Also:: Serialized Form

Constructor Summary

DoFn()


Method Summary

void cleanup(Emitter<T> emitter)
          Called during the cleanup of the MapReduce job this DoFn is associated with.

void configure(org.apache.hadoop.conf.Configuration conf)
          Configure this DoFn.

boolean disableDeepCopy()
          By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects.

void initialize()
          Initialize this DoFn.

abstract void process(S input, Emitter<T> emitter)
          Processes the records from a PCollection.

float scaleFactor()
          Returns an estimate of how applying this function to a PCollection will cause it to change in side.

void setConfiguration(org.apache.hadoop.conf.Configuration conf)
          Called during the setup of an initialized PType that relies on this instance.

void setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)
          Called during setup to pass the TaskInputOutputContext to this DoFn instance.

Methods inherited from class java.lang.Object

equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Summary
`DoFn()`

Method Summary
`void`	`cleanup(Emitter<T> emitter)` Called during the cleanup of the MapReduce job this `DoFn` is associated with.
`void`	`configure(org.apache.hadoop.conf.Configuration conf)` Configure this DoFn.
`boolean`	`disableDeepCopy()` By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects.
`void`	`initialize()` Initialize this DoFn.
`abstract void`	`process(S input, Emitter<T> emitter)` Processes the records from a `PCollection`.
`float`	`scaleFactor()` Returns an estimate of how applying this function to a `PCollection` will cause it to change in side.
`void`	`setConfiguration(org.apache.hadoop.conf.Configuration conf)` Called during the setup of an initialized `PType` that relies on this instance.
`void`	`setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)` Called during setup to pass the `TaskInputOutputContext` to this `DoFn` instance.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

DoFn

public DoFn()

Method Detail

configure

public void configure(org.apache.hadoop.conf.Configuration conf)

Configure this DoFn. Subclasses may override this method to modify the configuration of the Job that this DoFn instance belongs to.

Called during the job planning phase by the crunch-client.

Parameters:: conf - The Configuration instance for the Job.

initialize

public void initialize()

Initialize this DoFn. This initialization will happen before the actual process(Object, Emitter) is triggered. Subclasses may override this method to do appropriate initialization.

Called during the setup of the job instance this DoFn is associated with.

process

public abstract void process(S input,
                             Emitter<T> emitter)

Processes the records from a PCollection.

Note: Crunch can reuse a single input record object whose content changes on each process(Object, Emitter) method call. This functionality is imposed by Hadoop's Reducer implementation: The framework will reuse the key and value objects that are passed into the reduce, therefore the application should clone the objects they want to keep a copy of.

Parameters:: input - The input record.; emitter - The emitter to send the output to

cleanup

public void cleanup(Emitter<T> emitter)

Called during the cleanup of the MapReduce job this DoFn is associated with. Subclasses may override this method to do appropriate cleanup.

Parameters:: emitter - The emitter that was used for output

setContext

public void setContext(org.apache.hadoop.mapreduce.TaskInputOutputContext<?,?,?,?> context)

Called during setup to pass the TaskInputOutputContext to this DoFn instance.

setConfiguration

public void setConfiguration(org.apache.hadoop.conf.Configuration conf)

Called during the setup of an initialized PType that relies on this instance.

Parameters:: conf - The configuration for the PType being initialized

scaleFactor

public float scaleFactor()

Returns an estimate of how applying this function to a PCollection will cause it to change in side. The optimizer uses these estimates to decide where to break up dependent MR jobs into separate Map and Reduce phases in order to minimize I/O.

Subclasses of DoFn that will substantially alter the size of the resulting PCollection should override this method.

disableDeepCopy

public boolean disableDeepCopy()

By default, Crunch will do a defensive deep copy of the outputs of a DoFn when there are multiple downstream consumers of that item, in order to prevent the downstream functions from making concurrent modifications to data objects. This introduces some extra overhead in cases where you know that the downstream code is only reading the objects and not modifying it, so you can disable this feature by overriding this function to return true.