This project has retired. For details please refer to its Attic page.
Apache Crunch - Scrunch

Scrunch: A Scala Wrapper for the Apache Crunch Java API

Introduction

Scrunch is an experimental Scala wrapper for the Apache Crunch Java API, based on the same ideas as the Cascade project at Google, which created a Scala wrapper for FlumeJava.

Why Scala?

In many ways, Scala is the perfect language for writing MapReduce pipelines. Scala supports a mixture of functional and object-oriented programming styles and has powerful type-inference capabilities, allowing us to create complex pipelines using very few keystrokes. Here is an implementation of the classic WordCount problem using the Scrunch API:

import org.apache.crunch.io.{From => from}
import org.apache.crunch.scrunch._
import org.apache.crunch.scrunch.Conversions._  // For implicit type conversions

class WordCountExample {
  val pipeline = new Pipeline[WordCountExample]

  def wordCount(fileName: String) = {
    pipeline.read(from.textFile(fileName))
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(!_.isEmpty())
      .count
  }
}
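
To see what the flatMap, filter, and count steps compute without running a MapReduce job, here is a minimal in-memory sketch that uses only the Scala standard library. It does not use the Scrunch API: LocalWordCount is a hypothetical name, a Seq[String] stands in for the lines of the input file, and groupBy followed by size stands in for count.

```scala
object LocalWordCount {
  // In-memory stand-in for the pipeline above (an assumption for illustration,
  // not Scrunch code): the same flatMap and filter steps, with groupBy + size
  // playing the role of count.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      // Lower-case each line and split it on runs of non-word characters.
      .flatMap(_.toLowerCase.split("\\W+"))
      // split yields a leading empty string when a line starts with a
      // non-word character, so drop empty tokens.
      .filter(_.nonEmpty)
      // Group identical words together and count each group.
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }
}
```

For example, LocalWordCount.wordCount(Seq("To be, or not to be")) maps both "to" and "be" to 2, and "or" and "not" to 1.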

The Scala compiler infers that the function passed to flatMap returns an Array[String], and the Scrunch wrapper code uses this inferred type to work out how to serialize the data between the Map and Reduce stages. Here's a slightly more complex example, in which we get the word counts for two different files, compute the difference in how often each word occurs, and return only the words that occur more often in the first file than in the second:

class WordCountExample {
  def wordGt(firstFile: String, secondFile: String) = {
    wordCount(firstFile).cogroup(wordCount(secondFile))
      .map((k, v) => (k, (v._1.sum - v._2.sum)))
      .filter((k, v) => v > 0).map((k, v) => k)
  }
}
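
The cogroup, map, and filter steps above can also be sketched in memory with the Scala standard library. This is an illustration under stated assumptions, not Scrunch code: LocalWordGt and its cogroup helper are hypothetical, and the helper assumes that cogroup pairs each key with the collections of values found for it on each side, using an empty collection when a key is missing from one side.

```scala
object LocalWordGt {
  // Hypothetical in-memory cogroup: for every key seen on either side,
  // collect the values from the left input and from the right input.
  def cogroup[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Map[K, (Seq[A], Seq[B])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftValues = left.filter(_._1 == k).map(_._2)
      val rightValues = right.filter(_._1 == k).map(_._2)
      (k, (leftValues, rightValues))
    }.toMap
  }

  // Mirrors wordGt: keep the words whose count in the first input
  // exceeds their count in the second.
  def wordGt(first: Map[String, Long], second: Map[String, Long]): Set[String] =
    cogroup(first.toSeq, second.toSeq)
      // A key absent from one side has an empty group, whose sum is 0.
      .map { case (word, (left, right)) => (word, left.sum - right.sum) }
      .filter { case (_, delta) => delta > 0 }
      .keySet
}
```

With first = Map("hamlet" -> 3L, "king" -> 2L) and second = Map("king" -> 4L), wordGt returns Set("hamlet"): "hamlet" has a delta of 3, while "king" has a delta of -2 and is filtered out.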

Materializing Job Outputs

The Scrunch API also incorporates the Java library's materialize functionality, which allows us to easily read the output of a MapReduce pipeline into the client:

class WordCountExample {
  def hasHamlet = wordGt("shakespeare.txt", "maugham.txt").materialize.exists(_ == "hamlet")
}

Notes and Thanks

Scrunch emerged out of conversations with Dmitriy Ryaboy, Oscar Boykin, and Avi Bryant from Twitter. Many thanks to them for their feedback, guidance, and encouragement. We are also grateful to Matei Zaharia, whose Spark Project inspired much of the original Scrunch API implementation.