The Apache Crunch project has been retired. For details, please refer to its Attic page.
BloomFilterJoinStrategy (Apache Crunch 0.10.0 API)

org.apache.crunch.lib.join
Class BloomFilterJoinStrategy<K,U,V>

java.lang.Object
  extended by org.apache.crunch.lib.join.BloomFilterJoinStrategy<K,U,V>
All Implemented Interfaces:
Serializable, JoinStrategy<K,U,V>

public class BloomFilterJoinStrategy<K,U,V>
extends Object
implements JoinStrategy<K,U,V>

Join strategy that trains a Bloom filter on the keys of the left-side table and uses it to filter the key/value pairs of the right-side table before they are sent through the shuffle and reduce phases.

This strategy is useful in cases where the right-side table contains many keys that are not present in the left-side table. In this case, the use of the Bloom filter avoids a potentially costly shuffle phase for data that would never be joined to the left side.
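
The filtering idea described above can be sketched in plain Java without any Crunch or Hadoop dependency. This is a minimal illustration, not the Crunch implementation: TinyBloomFilter, BloomJoinSketch, the 1024-bit size, and the three hash probes are all hypothetical choices made for the example. It shows that the filter never drops a right-side row whose key exists on the left, and that a row admitted by a false positive is still discarded by the exact join that follows.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a tiny in-memory Bloom filter, not the one Crunch uses.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    TinyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(Object key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so successive probes differ
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(Object key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    // May return true for a key that was never added, but never false for one that was.
    boolean mightContain(Object key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}

final class BloomJoinSketch {
    // Keep only right-side rows ({key, value}) whose key might be present on the left.
    static List<String[]> filterRight(Iterable<String> leftKeys, List<String[]> right) {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3); // hypothetical sizing
        for (String key : leftKeys) {
            filter.add(key);
        }
        List<String[]> kept = new ArrayList<>();
        for (String[] row : right) {
            if (filter.mightContain(row[0])) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Exact inner join; rows let through by a false positive are discarded here.
    static List<String> innerJoin(Map<String, String> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        for (String[] row : filterRight(left.keySet(), right)) {
            String leftValue = left.get(row[0]);
            if (leftValue != null) {
                joined.add(row[0] + "=(" + leftValue + "," + row[1] + ")");
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> left = new HashMap<>();
        left.put("a", "1");
        left.put("b", "2");
        List<String[]> right = new ArrayList<>();
        right.add(new String[] {"a", "x"});
        right.add(new String[] {"z", "y"}); // no partner on the left
        System.out.println(innerJoin(left, right));
    }
}
```

In a distributed setting the benefit comes from the rows that filterRight drops, since those rows would otherwise be shuffled only to be thrown away by the join.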

See Also:
Serialized Form

Constructor Summary
BloomFilterJoinStrategy(int numElements)
          Instantiate with the expected number of unique keys in the left table.
BloomFilterJoinStrategy(int numElements, float falsePositiveRate)
          Instantiate with the expected number of unique keys in the left table and the acceptable false positive rate for the Bloom filter.
BloomFilterJoinStrategy(int numElements, float falsePositiveRate, JoinStrategy<K,U,V> delegateJoinStrategy)
          Instantiate with the expected number of unique keys in the left table, the acceptable false positive rate for the Bloom filter, and an underlying join strategy to delegate to.
 
Method Summary
 PTable<K,Pair<U,V>> join(PTable<K,U> left, PTable<K,V> right, JoinType joinType)
          Join two tables with the given join type.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BloomFilterJoinStrategy

public BloomFilterJoinStrategy(int numElements)
Instantiate with the expected number of unique keys in the left table.

The DefaultJoinStrategy will be used to perform the actual join after filtering.

Parameters:
numElements - expected number of unique keys in the left table

BloomFilterJoinStrategy

public BloomFilterJoinStrategy(int numElements,
                               float falsePositiveRate)
Instantiate with the expected number of unique keys in the left table and the acceptable false positive rate for the Bloom filter.

The DefaultJoinStrategy will be used to perform the actual join after filtering.

Parameters:
numElements - expected number of unique keys in the left table
falsePositiveRate - acceptable false positive rate for the Bloom filter
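
To give a feel for how these two parameters interact, the sketch below applies the standard textbook Bloom filter sizing formulas: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, for n expected keys and target false positive rate p. Whether BloomFilterJoinStrategy sizes its filter with exactly these formulas is an assumption, and BloomSizing and its method names are hypothetical.

```java
// Hypothetical helper illustrating textbook Bloom filter sizing; not part of Crunch.
final class BloomSizing {
    // Number of bits m that achieves the target false positive rate for n elements.
    static int optimalNumBits(int numElements, double falsePositiveRate) {
        if (numElements <= 0 || falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0) {
            throw new IllegalArgumentException("need numElements > 0 and 0 < falsePositiveRate < 1");
        }
        double ln2 = Math.log(2.0);
        return (int) Math.ceil(-numElements * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    // Number of hash functions k that minimizes false positives for m bits and n elements.
    static int optimalNumHashes(int numElements, int numBits) {
        return Math.max(1, (int) Math.round((double) numBits / numElements * Math.log(2.0)));
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int bits = optimalNumBits(n, 0.01);
        System.out.println(bits + " bits, " + optimalNumHashes(n, bits) + " hashes");
    }
}
```

Under these formulas, one million keys at a 1% false positive rate need roughly 9.6 million bits (about 1.2 MB). Lowering the rate increases the filter size, so the rate is a trade-off between filter memory and the number of non-matching rows that still reach the shuffle.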

BloomFilterJoinStrategy

public BloomFilterJoinStrategy(int numElements,
                               float falsePositiveRate,
                               JoinStrategy<K,U,V> delegateJoinStrategy)
Instantiate with the expected number of unique keys in the left table, the acceptable false positive rate for the Bloom filter, and an underlying join strategy to delegate to.

Parameters:
numElements - expected number of unique keys in the left table
falsePositiveRate - acceptable false positive rate for the Bloom filter
delegateJoinStrategy - join strategy to delegate to after filtering
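
The "filter, then delegate" arrangement can be pictured as a wrapper around another join. The following is a rough plain-Java sketch of that shape only: SimpleJoin, ExactJoin, and FilteringJoin are hypothetical names and are not Crunch types, and an exact key set stands in for the Bloom filter so the example stays short and deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a join strategy; this is NOT Crunch's JoinStrategy interface.
interface SimpleJoin {
    List<String> join(Map<String, String> left, List<String[]> right);
}

// Plays the role of the delegate: performs the exact inner join.
final class ExactJoin implements SimpleJoin {
    @Override
    public List<String> join(Map<String, String> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        for (String[] row : right) {
            String leftValue = left.get(row[0]);
            if (leftValue != null) {
                joined.add(row[0] + "=(" + leftValue + "," + row[1] + ")");
            }
        }
        return joined;
    }
}

// Filters the right side first, then hands the reduced input to the delegate.
final class FilteringJoin implements SimpleJoin {
    private final SimpleJoin delegate;

    FilteringJoin(SimpleJoin delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> join(Map<String, String> left, List<String[]> right) {
        // Assumption: an exact key set stands in for the Bloom filter.
        Set<String> leftKeys = left.keySet();
        List<String[]> filtered = new ArrayList<>();
        for (String[] row : right) {
            if (leftKeys.contains(row[0])) {
                filtered.add(row);
            }
        }
        return delegate.join(left, filtered);
    }

    public static void main(String[] args) {
        Map<String, String> left = new HashMap<>();
        left.put("a", "1");
        List<String[]> right = new ArrayList<>();
        right.add(new String[] {"a", "x"});
        right.add(new String[] {"q", "y"}); // removed before the delegate runs
        System.out.println(new FilteringJoin(new ExactJoin()).join(left, right));
    }
}
```

Because the wrapper only shrinks the right-side input, the delegate decides how the join itself is executed, and any delegate can be swapped in without changing the filtering step.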

Method Detail

join

public PTable<K,Pair<U,V>> join(PTable<K,U> left,
                                PTable<K,V> right,
                                JoinType joinType)
Description copied from interface: JoinStrategy
Join two tables with the given join type.

Specified by:
join in interface JoinStrategy<K,U,V>
Parameters:
left - left table to be joined
right - right table to be joined
joinType - type of join to perform
Returns:
the joined table


Copyright © 2014 The Apache Software Foundation. All Rights Reserved.