The Apache Crunch library is developed against version 1.0.3 of the Apache Hadoop library and is also tested against version 2.0.0-alpha. It should work with any later release of either version and is known to work with distributions from vendors such as Cloudera, Hortonworks, and IBM. It is not compatible with Hadoop versions prior to 1.0.x or 2.0.x, such as 0.20.x.
The easiest way to get started with the library is to use the Maven archetype to generate a simple project. The archetype is available from Maven Central; just enter the following command, answer a few questions, and you're ready to go:
$ mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype
[...]
1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job with the core library.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : 1
Define value for property 'groupId': : com.example
Define value for property 'artifactId': : crunch-demo
Define value for property 'version': 1.0-SNAPSHOT: : [HIT ENTER]
Define value for property 'package': com.example: : [HIT ENTER]
Confirm properties configuration:
groupId: com.example
artifactId: crunch-demo
version: 1.0-SNAPSHOT
package: com.example
Y: : [HIT ENTER]
[...]
$
The generated Maven project contains an example application that counts word frequencies in text files:
$ cd crunch-demo
$ tree
.
|-- pom.xml
`-- src
    |-- main
    |   |-- assembly
    |   |   `-- hadoop-job.xml
    |   `-- java
    |       `-- com
    |           `-- example
    |               |-- StopWordFilter.java
    |               |-- Tokenizer.java
    |               `-- WordCount.java
    `-- test
        `-- java
            `-- com
                `-- example
                    |-- StopWordFilterTest.java
                    `-- TokenizerTest.java
The WordCount.java file contains the main class, which defines the pipeline
application and is referenced from pom.xml.
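To give a rough idea of what the sample computes, here is a minimal sketch of the same word-frequency logic in plain Java. It deliberately uses no Crunch or Hadoop classes and is not the code the archetype generates; the class name WordCountSketch, the stop word list, and the whitespace tokenization with lower-casing are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java stand-in for the sample's logic; it uses no Crunch or Hadoop classes.
class WordCountSketch {

    // Hypothetical stop word list; the generated StopWordFilter may use a different one.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "and", "the", "of", "to"));

    // Splits each line on whitespace, lower-cases the tokens, drops stop words,
    // and tallies how often each remaining word occurs.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                String word = token.toLowerCase(Locale.ROOT);
                if (word.isEmpty() || STOP_WORDS.contains(word)) {
                    continue; // skip blank tokens (from empty lines) and stop words
                }
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "The fox jumps");
        // Prints one word and its count per line, in the TreeMap's alphabetical order.
        countWords(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In the generated project these steps are split across Tokenizer.java, StopWordFilter.java, and WordCount.java and run as a pipeline rather than as a single in-memory loop.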
Build the code:
$ mvn package
Your packaged application is created in the target directory. The build
process uses Maven's assembly plugin with some configuration in
hadoop-job.xml to create a special JAR file (suffix -job.jar).
Depending on your Hadoop configuration, you can run it locally or on a
cluster using Hadoop's launcher script:
$ hadoop jar target/crunch-demo-1.0-SNAPSHOT-job.jar <in> <out>
The <in> parameter references a text file or a directory containing text
files, while <out> is the directory to which the pipeline writes the final results.
The library also supports running applications from within an IDE, either as standalone
Java applications or from unit tests. All required dependencies are on Maven's
classpath, so you can run the WordCount class directly without any additional
setup.