Getting Started

The Apache Crunch library is developed against version 1.0.3 of the Apache Hadoop library and is also tested against version 2.0.0-alpha. It should work with any version later than 1.0.3 or 2.0.0-alpha, and is known to work with distributions from vendors such as Cloudera, Hortonworks, and IBM. It is not compatible with Hadoop versions that predate the 1.0.x and 2.0.x lines, such as version 0.20.x.

The easiest way to get started with the library is to use the Maven archetype to generate a simple project. The archetype is available from Maven Central; just enter the following command, answer a few questions, and you're ready to go:

$ mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype
[...]
1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job with the core library.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Define value for property 'groupId': : com.example
Define value for property 'artifactId': : crunch-demo
Define value for property 'version':  1.0-SNAPSHOT: : [HIT ENTER]
Define value for property 'package':  com.example: : [HIT ENTER]
Confirm properties configuration:
groupId: com.example
artifactId: crunch-demo
version: 1.0-SNAPSHOT
package: com.example
 Y: : [HIT ENTER]
[...]
$

The generated Maven project contains an example application that counts word frequencies in text files:

$ cd crunch-demo
$ tree
.
|-- pom.xml
`-- src
    |-- main
    |   |-- assembly
    |   |   `-- hadoop-job.xml
    |   `-- java
    |       `-- com
    |           `-- example
    |               |-- StopWordFilter.java
    |               |-- Tokenizer.java
    |               `-- WordCount.java
    `-- test
        `-- java
            `-- com
                `-- example
                    |-- StopWordFilterTest.java
                    `-- TokenizerTest.java

The WordCount.java file contains the main class, which defines the pipeline application and is referenced from pom.xml.
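
To make the division of labor between the three source files concrete, here is a minimal sketch of the same three stages (tokenize, filter stop words, count) written against the JDK only. It is not the code the archetype generates and it does not use the Crunch API: the class name WordCountSketch, the method names, the splitting rule, and the stop-word list are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical, JDK-only illustration; not the code generated by the archetype.
final class WordCountSketch {

    // Assumed stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "is", "of", "the", "to"));

    // Stage 1 (the Tokenizer role): split a line into lowercase words.
    static List<String> tokenize(String line) {
        return Arrays.stream(line.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Stage 2 (the StopWordFilter role): accept only words that are not stop words.
    static boolean accept(String word) {
        return !STOP_WORDS.contains(word);
    }

    // Stage 3 (the WordCount role): chain the stages and count each remaining word.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> tokenize(line).stream())
                .filter(WordCountSketch::accept)
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "The pipeline reads the input",
                "and the pipeline counts words");
        // Print one "word<TAB>count" pair per line, sorted by word.
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

In the generated project the corresponding steps are expressed as operations on a pipeline, so they can run over files in Hadoop rather than over an in-memory list; the sketch only mirrors the logic.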

Build the code:

$ mvn package

Your packaged application is created in the target directory. The build process uses Maven's assembly plugin with some configuration in hadoop-job.xml to create a special JAR file (suffix -job.jar). Depending on your Hadoop configuration, you can run it locally or on a cluster using Hadoop's launcher script:

$ hadoop jar target/crunch-demo-1.0-SNAPSHOT-job.jar <in> <out>

The <in> parameter references a text file or a directory containing text files, while <out> is the directory to which the pipeline writes its final results.

The library also supports running applications from within an IDE, either as standalone Java applications or from unit tests. All required dependencies are on Maven's classpath so you can run the WordCount class directly without any additional setup.