Cloud9

A Hadoop toolkit for working with big data

Cloud9 is designed to work with Hadoop YARN and has been tested against Cloudera CDH 5.3.0 (on both Mac and Linux). It should work with other Hadoop distributions or on other platforms with only minor modifications. In this tutorial, we'll take you through running word count on a toy collection. First, we'll use Hadoop local mode (also called standalone mode). Running in local mode, as the name suggests, does not require setting up a cluster, but of course, you won't get the benefits of distributed processing either. Next, we'll run word count on the single-node virtual Hadoop cluster provided by Cloudera.

Warning: Note that local mode doesn't work properly under Windows, even with Cygwin, so Windows users following this guide should skip ahead to "Running Cloud9 on a Single Node Virtual Cluster" below.

Download and Install Hadoop

Download the Cloudera CDH 5.3.0 tarball here (caution, it's 278 MB). Unpack the tarball onto your machine and make sure PATH_TO_HADOOP/bin is on your path. Verify that Hadoop is running with the pi example. In a shell, run the following command:

$ hadoop jar PATH_TO_HADOOP/jars/hadoop-mapreduce-examples-2.5.0-cdh5.3.0.jar pi 100 100000

After the local job finishes, you should see the computed value of pi... something reasonably close to 3.14.
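For a quicker sanity check that the Hadoop binaries are on your path, you can also run the following command, which simply prints the version of the Hadoop build you're running:

$ hadoop version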

Clone the Cloud9 Repo

Open up a shell and clone the Cloud9 GitHub repo:

$ git clone git://github.com/lintool/Cloud9.git

Go into the Cloud9/ directory and build with Maven by typing:

$ mvn clean package

The build should complete without error.
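If the build is slow because of the test phase, you can skip the tests with Maven's standard flag (this is a generic Maven option, not anything Cloud9-specific):

$ mvn clean package -DskipTests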

Let's now run the word count demo. Maven should have automatically created a "fatjar" for you located in target/cloud9-X.Y.Z-fatjar.jar (substitute in the right version number for X.Y.Z). The "fatjar" packages up all the dependent libraries in a self-contained jar. Now run the following command:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.simple.DemoWordCount \
   -input data/bible+shakes.nopunc.gz -output wc
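If you run into a ClassNotFoundException, or are just curious what went into the fatjar, you can list its contents with the standard JDK jar tool and confirm the demo class is there:

$ jar tf target/cloud9-X.Y.Z-fatjar.jar | grep DemoWordCount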

In local mode, there is no HDFS, so you can use standard shell commands to see the output. For example:

$ head wc/part-r-00000
&c	70
&c'	1
''all	1
''among	1
''and	1
''but	1
''how	1
''lo	2
''look	1
''my	1
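Since the output is plain tab-separated text (term, then count), any of the standard shell tools will work on it. For example, here's one way to peek at the most frequent terms in this part file:

$ sort -k2,2 -n -r wc/part-r-00000 | head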

And that's it!

Running Cloud9 on a Single Node Virtual Cluster

The next step is to run Cloud9 on an actual Hadoop cluster. How to set up a Hadoop cluster is beyond the scope of this tutorial, but the next best thing is to use Cloudera's virtual machine images, which come with a pre-configured single-node cluster. The images can be downloaded here.

Use the VirtualBox image, since VirtualBox is freely available across all major platforms. Download the image and unpack the tarball. VirtualBox itself can be downloaded here.

Install VirtualBox and import the appliance. Then launch the Cloudera VM. Once you're inside the VM, open up a shell and clone the Cloud9 GitHub repo:

$ git clone git://github.com/lintool/Cloud9.git

As before, go into the Cloud9/ directory and build with Maven:

$ mvn clean package

After that's done, we need to put the sample data onto HDFS (with no destination path given, the file lands in your HDFS home directory, typically /user/<username>):

$ hadoop fs -put data/bible+shakes.nopunc.gz

You can verify that the file is there:

$ hadoop fs -ls

Next, run the word count demo, as follows:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.simple.DemoWordCount \
   -input bible+shakes.nopunc.gz -output wc -numReducers 5
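Unlike the local-mode run, this job executes on the VM's YARN cluster, so you can monitor it from another shell while it runs (or through the ResourceManager web UI, typically on port 8088 of the VM):

$ yarn application -list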

After the job completes, you should be able to see the output on HDFS:

$ hadoop fs -ls wc

You should see five part-r-XXXXX files, since we specified five reducers above. Now copy the data from HDFS onto the local disk:

$ hadoop fs -get wc/part-r-* .
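Alternatively, if you'd rather end up with a single local file instead of five separate pieces, getmerge concatenates all the files in an HDFS directory into one local file (the local file name here is arbitrary):

$ hadoop fs -getmerge wc wc.txt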

From here, you should be able to examine the contents of these files using normal shell commands.

And that's it!