Aggregator (Apache Crunch 0.7.0 API)

This project has retired. For details please refer to its Attic page.

Overview

Package

Class

Use

Tree

Deprecated

Index

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.crunch
Interface Aggregator<T>


Type Parameters:: T - The value types to aggregate

All Superinterfaces:: Serializable

All Known Implementing Classes:: Aggregators.SimpleAggregator

public interface Aggregator<T>
extends Serializable
extends Serializable

Aggregate a sequence of values into a possibly smaller sequence of the same type.

In most cases, an Aggregator will turn multiple values into a single value, like creating a sum, finding the minimum or maximum, etc. In some cases (ie. finding the top K elements), an implementation may return more than one value. The Aggregators utility class contains factory methods for creating all kinds of pre-defined Aggregators that should cover the most common cases.

Aggregator implementations should usually be associative and commutative, which makes their results deterministic. If your aggregation function isn't commutative, you can still use secondary sort to that effect.

The lifecycle of an Aggregator always begins with you instantiating it and passing it to Crunch. When running your Pipeline, Crunch serializes the instance and deserializes it wherever it is needed on the cluster. This is how Crunch uses a deserialized instance:

call initialize(Configuration) once
call reset()
call update(Object) multiple times until all values of a sequence have been aggregated
call results() to retrieve the aggregated result
go back to step 2 until all sequences have been aggregated

Method Summary

void initialize(org.apache.hadoop.conf.Configuration conf)
          Perform any setup of this instance that is required prior to processing inputs.

void reset()
          Clears the internal state of this Aggregator and prepares it for the values associated with the next key.

Iterable<T> results()
          Returns the current aggregated state of this instance.

void update(T value)
          Incorporate the given value into the aggregate state maintained by this instance.

Method Summary
`void`	`initialize(org.apache.hadoop.conf.Configuration conf)` Perform any setup of this instance that is required prior to processing inputs.
`void`	`reset()` Clears the internal state of this Aggregator and prepares it for the values associated with the next key.
`Iterable<T>`	`results()` Returns the current aggregated state of this instance.
`void`	`update(T value)` Incorporate the given value into the aggregate state maintained by this instance.

Method Detail

initialize

void initialize(org.apache.hadoop.conf.Configuration conf)

Perform any setup of this instance that is required prior to processing inputs.

Parameters:: conf - Hadoop configuration

reset

void reset()

Clears the internal state of this Aggregator and prepares it for the values associated with the next key. Depending on what you aggregate, this typically means setting a variable to zero or clearing a list. Failing to do this will yield wrong results!

update

void update(T value)

Incorporate the given value into the aggregate state maintained by this instance.

Parameters:: value - The value to add to the aggregated state

results

Iterable<T> results()

Returns the current aggregated state of this instance.