Data-Intensive Information Processing Applications (Spring 2010)

University of Maryland, College Park

Assignment 1-1: Getting started on Hadoop

Due: Tuesday 2/2 (2pm)

The primary purpose of this assignment is to familiarize you with running Hadoop in two different ways: in standalone mode and in the Cloudera VM. You will be asked to work through a few tutorials. The assignment does not involve actual coding, but requires a lot of activity on the command line (running Hadoop jobs, copying files around, etc.). This is the first part of a two-part assignment: in the second part, you'll run Hadoop on the Google/IBM cluster and on EC2.

The secondary purpose of this assignment is to make sure that you have sufficient background to take this course. This assignment is written in such a way that you should be able to figure out details that we have omitted (for example, on downloading, configuring, and installing software). Also note: machines are configured in slightly different ways, and as a result you may run into issues that require troubleshooting (e.g., differences in install paths, environment settings, etc.). We expect that you have sufficient familiarity with general operating system concepts to be able to solve most issues yourself. If you are having a lot of trouble completing this assignment, you might not be ready for the course.

For this class, we'll be using Cloud⁹, a Hadoop library developed at Maryland both for this course and for research in text processing. The first goal of this assignment is to get the word count demo in Cloud⁹ running in standalone mode. In standalone mode, Hadoop runs in a single thread on your local machine. Do the following:

First, follow the instructions on downloading and setting up Cloud⁹.
Next, work through the tutorial on getting started in standalone mode.

Now, answer the following questions:

Question 1. Have you successfully completed the above tutorials and run the word count demo in standalone mode? (yes or no)

Look at the output in demo/part-00000.

Question 2. What's the next term found in the collection after ''my? How many times does it appear?

Question 3. Scan down the output a few more lines: how many times does 'and appear in the collection?

Note: It is very important to understand that in standalone mode, there is no HDFS.
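
Because standalone mode has no HDFS, the output is an ordinary file on your local disk, and you can inspect it with any local tool. The following is a minimal sketch in Python, assuming the output consists of one "term<TAB>count" pair per line, sorted by term (that layout is an assumption about the demo's output format; check your own file). The helper names parse_counts and term_after are hypothetical, and the sample data is made up.

```python
def parse_counts(lines):
    """Parse 'term<TAB>count' lines into a list of (term, count) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the count is always the final field.
        term, count = line.rsplit("\t", 1)
        pairs.append((term, int(count)))
    return pairs


def term_after(pairs, term):
    """Return the (term, count) pair right after `term`, or None if there is none."""
    for i, (t, _) in enumerate(pairs):
        if t == term and i + 1 < len(pairs):
            return pairs[i + 1]
    return None


# Made-up sample lines standing in for the contents of a part file.
sample = ["apple\t3\n", "banana\t1\n", "cherry\t7\n"]
pairs = parse_counts(sample)
print(term_after(pairs, "banana"))  # ('cherry', 7)
```

To try this on real output, open demo/part-00000 as a local file and pass the file object to parse_counts in place of the sample list.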

The second goal of this assignment is to get the word count demo running inside the Cloudera VM. Follow the instructions on the page to download the image and also VMware Player (for Windows and Linux) or VMware Fusion (for Mac). Start up the VM.

Inside the VM, Hadoop is running in what's called "pseudo-distributed mode", which means that all the daemon processes (JobTracker, NameNode, TaskTracker) are running on the same machine and communicating over the loopback interface. Inside the VM, open up a browser, and you should see the Hadoop webapps.

Your task is now to run the Cloud⁹ word count demo inside the VM. This requires that you copy the data (bible+shakes.nopunc) over to the VM. Once the data is inside the VM, you'll need to put it into HDFS. You'll also need to copy the Cloud⁹ jar onto the VM. Once you've done all of this, you can submit a Hadoop job. Run the word count example on the bible+shakes.nopunc data with 10 mappers and 5 reducers. Answer the following questions:

Question 4. Have you successfully run the word count demo inside the Cloudera VM on the sample dataset? (yes or no)

Question 5. What is the first term in part-00000 and how many times does it appear?

Question 6. What is the third to last term in part-00004 and how many times does it appear?

Question 7. How long did it take you to complete this assignment?

Hints:

Copying the data onto the Cloudera VM does not mean that the data is placed into HDFS. That requires a second step.
Here is a user guide for HDFS commands.
It is your job to figure out how to get data from your local machine onto the Cloudera VM. Think about using scp and ifconfig.
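
A note on the output files: each reducer writes its own file, so a job with 5 reducers produces part-00000 through part-00004, and every term ends up in exactly one of them. The sketch below imitates this in plain Python to show why the terms are spread across several files. It is not Hadoop code: the partition function is a simple stand-in for a hash partitioner, so the file a given term lands in here will not necessarily match what you see in the VM, and the names partition and word_count are hypothetical.

```python
from collections import Counter, defaultdict


def partition(term, num_reducers):
    """Stand-in partitioner: a deterministic hash of the term, modulo the reducer count."""
    h = 0
    for b in term.encode("utf-8"):
        h = (31 * h + b) & 0x7FFFFFFF  # keep the running hash non-negative
    return h % num_reducers


def word_count(lines, num_reducers):
    """Count terms, grouping each term under the reducer chosen by partition()."""
    per_reducer = defaultdict(Counter)
    for line in lines:
        for term in line.split():
            per_reducer[partition(term, num_reducers)][term] += 1
    # One output "file" per reducer, each sorted by term; some may be empty.
    return {
        "part-%05d" % r: sorted(per_reducer[r].items())
        for r in range(num_reducers)
    }


# Made-up input; the real job reads bible+shakes.nopunc from HDFS.
parts = word_count(["to be or not to be", "that is the question"], 5)
for name in sorted(parts):
    print(name, parts[name])
```

Within each file the terms are sorted, but there is no ordering across files, which is why the questions above name a specific part file.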

Submission Instructions

This assignment is due by 2pm, Tuesday 2/2. Please send us (both Jimmy and Nitin) an email with "Cloud Computing Course: Assignment 1-1" as the subject. In the body of the email put answers to the questions above.

Important: Follow these instructions exactly as specified. This means exactly the subject line indicated above and answers in the email message itself (not as an attachment).

This page, first created: 22 Jan 2010; last updated: