Project Crunch has retired. For details please refer to its Attic page.
Aggregator (Apache Crunch 0.10.0 API)

org.apache.crunch
Interface Aggregator<T>

Type Parameters:
T - The value types to aggregate
All Superinterfaces:
Serializable
All Known Implementing Classes:
Aggregators.SimpleAggregator

public interface Aggregator<T>
extends Serializable

Aggregate a sequence of values into a possibly smaller sequence of the same type.

In most cases, an Aggregator will turn multiple values into a single value, such as a sum, a minimum, or a maximum. In some cases (e.g. finding the top K elements), an implementation may return more than one value. The Aggregators utility class contains factory methods for creating predefined Aggregators that should cover the most common cases.
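
As an illustration of an aggregation that returns more than one value, here is a minimal top-K sketch. To stay runnable without Crunch or Hadoop on the classpath, it uses a hypothetical local interface `LocalAggregator` that mirrors only reset(), update() and results(); the names `TopKSketch` and `TopK` are likewise assumptions made for this example, and the code does not reproduce any factory method of the Aggregators class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopKSketch {

    /** Hypothetical stand-in for Aggregator<T>; initialize(Configuration) is omitted. */
    interface LocalAggregator<T> {
        void reset();
        void update(T value);
        Iterable<T> results();
    }

    /** Keeps the k largest integers seen since the last reset(). */
    static final class TopK implements LocalAggregator<Integer> {
        private final int k;
        // Min-heap: the smallest retained value sits at the head.
        private final PriorityQueue<Integer> heap = new PriorityQueue<>();

        TopK(int k) {
            this.k = k;
        }

        @Override
        public void reset() {
            heap.clear();
        }

        @Override
        public void update(Integer value) {
            heap.add(value);
            if (heap.size() > k) {
                heap.poll(); // drop the smallest so at most k values remain
            }
        }

        @Override
        public Iterable<Integer> results() {
            // Return a sorted copy (largest first) so the heap itself stays untouched.
            List<Integer> sorted = new ArrayList<>(heap);
            sorted.sort(Collections.reverseOrder());
            return sorted;
        }
    }
}
```

With k = 2, the sequence of results holds up to two values per input sequence, which is why results() is declared to return an Iterable rather than a single value.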

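Aggregator implementations should usually be associative and commutative, so that their results do not depend on the order in which values arrive and are therefore deterministic. If your aggregation function isn't commutative, you can still get deterministic results by using a secondary sort to fix the order of the values.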
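
The difference can be seen in a small, self-contained sketch. It is not Crunch code: `OrderSensitivitySketch` and its methods are hypothetical names, each method inlines the reset/update/results cycle as a plain loop, and the local `Collections.sort` call only stands in for values arriving in a fixed order, as a secondary sort would arrange.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class OrderSensitivitySketch {

    /** Commutative and associative: the arrival order of values does not matter. */
    static long sum(List<Long> values) {
        long state = 0L;              // what reset() would do
        for (long v : values) {
            state += v;               // what update(v) would do
        }
        return state;                 // what results() would expose
    }

    /** Not commutative: the result depends on the arrival order. */
    static String concat(List<String> values) {
        StringBuilder state = new StringBuilder();
        for (String v : values) {
            state.append(v);
        }
        return state.toString();
    }

    /** Assumption: sorting locally imitates values arriving in a fixed order. */
    static String concatSorted(List<String> values) {
        List<String> ordered = new ArrayList<>(values);
        Collections.sort(ordered);
        return concat(ordered);
    }
}
```

Here `sum` gives the same answer for any permutation of its input, `concat` does not, and `concatSorted` becomes deterministic again because the order is fixed before aggregation.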

The lifecycle of an Aggregator always begins with you instantiating it and passing it to Crunch. When running your Pipeline, Crunch serializes the instance and deserializes it wherever it is needed on the cluster. This is how Crunch uses a deserialized instance:

  1. call initialize(Configuration) once
  2. call reset()
  3. call update(Object) multiple times until all values of a sequence have been aggregated
  4. call results() to retrieve the aggregated result
  5. go back to step 2 until all sequences have been aggregated
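
The steps above can be sketched as a small driver loop. This is an illustration under stated assumptions, not how Crunch is implemented internally: `LocalAggregator` is a hypothetical stand-in for Aggregator<T> whose initialize() takes no Hadoop Configuration (so the example compiles without Crunch or Hadoop), and `SumLongs` and `aggregateAll` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class AggregatorLifecycleSketch {

    /** Hypothetical mirror of Aggregator<T>; the real initialize takes a Configuration. */
    interface LocalAggregator<T> {
        void initialize();
        void reset();
        void update(T value);
        Iterable<T> results();
    }

    /** Sums the long values of one sequence into a single result. */
    static final class SumLongs implements LocalAggregator<Long> {
        private long sum;

        @Override
        public void initialize() {
            // A sum needs no setup.
        }

        @Override
        public void reset() {
            sum = 0L;
        }

        @Override
        public void update(Long value) {
            sum += value;
        }

        @Override
        public Iterable<Long> results() {
            return Collections.singletonList(sum);
        }
    }

    /** Drives one aggregator instance over several sequences, following steps 1 to 5. */
    static <T> List<List<T>> aggregateAll(LocalAggregator<T> agg, List<List<T>> sequences) {
        List<List<T>> out = new ArrayList<>();
        agg.initialize();                      // step 1: once per instance
        for (List<T> sequence : sequences) {   // step 5: repeat for every sequence
            agg.reset();                       // step 2
            for (T value : sequence) {
                agg.update(value);             // step 3
            }
            List<T> result = new ArrayList<>();
            for (T r : agg.results()) {        // step 4
                result.add(r);
            }
            out.add(result);
        }
        return out;
    }
}
```

Note that a single instance is reused for every sequence, which is why reset() has to restore a clean state each time.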


Method Summary
 void initialize(org.apache.hadoop.conf.Configuration conf)
          Perform any setup of this instance that is required prior to processing inputs.
 void reset()
          Clears the internal state of this Aggregator and prepares it for the values associated with the next key.
 Iterable<T> results()
          Returns the current aggregated state of this instance.
 void update(T value)
          Incorporate the given value into the aggregate state maintained by this instance.
 

Method Detail

initialize

void initialize(org.apache.hadoop.conf.Configuration conf)
Perform any setup of this instance that is required prior to processing inputs.

Parameters:
conf - Hadoop configuration

reset

void reset()
Clears the internal state of this Aggregator and prepares it for the values associated with the next key. Depending on what you aggregate, this typically means setting a variable to zero or clearing a list. Failing to do this will yield wrong results!
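
A minimal sketch of what goes wrong when reset() is skipped, assuming a simple running sum. `ResetSketch`, `RunningSum` and `sumPerKey` are hypothetical names for this example and are not part of the Crunch API; the `callReset` flag exists only to contrast the two behaviours.

```java
final class ResetSketch {

    /** Holds the aggregate state that must be cleared between keys. */
    static final class RunningSum {
        private long sum;

        void reset() {
            sum = 0L;
        }

        void update(long value) {
            sum += value;
        }

        long result() {
            return sum;
        }
    }

    /** Aggregates each key's values with one shared instance, optionally skipping reset(). */
    static long[] sumPerKey(long[][] valuesPerKey, boolean callReset) {
        RunningSum agg = new RunningSum();
        long[] out = new long[valuesPerKey.length];
        for (int i = 0; i < valuesPerKey.length; i++) {
            if (callReset) {
                agg.reset(); // clears state left over from the previous key
            }
            for (long v : valuesPerKey[i]) {
                agg.update(v);
            }
            out[i] = agg.result();
        }
        return out;
    }
}
```

For the groups {1, 2} and {10}, calling reset() gives 3 and 10, while skipping it gives 3 and 13 because the first key's sum leaks into the second.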


update

void update(T value)
Incorporate the given value into the aggregate state maintained by this instance.

Parameters:
value - The value to add to the aggregated state

results

Iterable<T> results()
Returns the current aggregated state of this instance.



Copyright © 2014 The Apache Software Foundation. All Rights Reserved.