University of Maryland, College Park

Data-Intensive Information Processing Applications (Spring 2010)

Assignment 1-2: Getting into the Clouds

Due: Tuesday 2/9 (2pm)

The purpose of this assignment is to familiarize you with running Hadoop in two different ways: on the Google/IBM cluster and on EC2. It is a follow-up to Assignment 1-1 and builds directly on it.

First, let's start with the Google/IBM cluster. You should have received separate instructions on account creation. Remember, when selecting a username, please prefix the username by "ccc_", so that, for example, I would be "ccc_jimmylin". This allows us to distinguish students in the class from other cluster users.

On the cluster, we've prepped a raw text dump of Wikipedia for you to play with:

hadoop fs -ls /tmp/wiki

You can check out the contents with something like this:

hadoop fs -cat /tmp/wiki/part-00000 | head

Now, run the word count demo on this dataset, with 100 reducers, as such:

hadoop jar cloud9.jar edu.umd.cloud9.demo.DemoWordCount /tmp/wiki /tmp/lin-course/cnt1-USERNAME 200 100

Substitute USERNAME with your actual username without the "ccc_" prefix. Therefore, I would put the output in /tmp/lin-course/cnt1-jimmylin. It is important that you follow these instructions exactly, because this is where we are going to look for your output.

Question 1. What is your job id? If you ran the code more than once, any job id of a successful run will do.

Question 2. How large is the input data? (Hint: look in the jobtracker webapp.)

Question 3. How many map tasks does your job contain?

Question 4. What is the 6th word in part-00042 and how many times does it appear?

You'll notice that there is a lot of "junk" in the output. Let's clean this up by throwing away terms that don't appear often. Modify the word count demo so that it retains only words that occur more than 100 times (i.e., cnt > 100).
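The essence of the change is in the reducer: sum the counts for each word as before, but only emit the word if the total clears the threshold. The sketch below extracts just that logic into a plain method for clarity (the class and method names here are made up; in the actual Hadoop reducer you would wrap the emit call, e.g. output.collect, in the same "sum > 100" check):

```java
// Illustrative sketch of the filtering logic for the modified word count.
// This is not Cloud9 code; it isolates the "keep only cnt > 100" idea.
public class CountFilter {
  static final int THRESHOLD = 100;

  // Sums the per-word counts a reducer would receive. Returns the total
  // if the word should be emitted, or -1 to signal "drop this word".
  public static int filteredSum(int[] counts) {
    int sum = 0;
    for (int c : counts) {
      sum += c;
    }
    return sum > THRESHOLD ? sum : -1;
  }
}
```

In the reducer itself, the only difference from the original demo is that the final collect/write happens inside the threshold check instead of unconditionally.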

Once you've modified the program, run it again:

hadoop jar cloud9.jar edu.umd.cloud9.demo.DemoWordCount /tmp/wiki /tmp/lin-course/cnt2-USERNAME 200 10

This time, use only 10 reducers. Note the slightly different path in which to put your results.

Question 5. How many terms appear more than 100 times in the collection?

Question 6. How many times does "life" appear in the collection?

Okay, we're done with the exercises on the Google/IBM cluster. Now let's play with EC2. You should have separately received tokens for EC2. First, go through all the steps described in this tutorial except Step 6, which asks you to terminate your cluster. Do not terminate it yet, because you still have work to do after the tutorial.

Question 7. Have you successfully completed the EC2 tutorial? (yes or no)

Transfer part-00000 of the wiki data onto EC2 and run word count there. Keep the changes to the code that retain only words appearing more than 100 times.

Question 8. What words beginning with "c" appear 160 times in this part of the collection?
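Once you've pulled the relevant part file down to your local machine, a short scan like the following can surface candidate words. This is a sketch: the class name is made up, and it assumes the default word count output format of one word and count per line, separated by a tab.

```java
import java.io.BufferedReader;
import java.io.FileReader;

// Sketch: scan a local copy of a word-count part file (assumed format:
// word<TAB>count per line) and print words starting with "c" whose
// count is exactly 160.
// Usage: java FindC160 part-00000
public class FindC160 {
  public static void main(String[] args) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      if (parts.length == 2 && parts[0].startsWith("c")
          && Integer.parseInt(parts[1]) == 160) {
        System.out.println(parts[0]);
      }
    }
    in.close();
  }
}
```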

You should now terminate your cluster (see above tutorial for details). One final question:

Question 9. How long did it take you to complete this assignment?

Submission Instructions

This assignment is due by 2pm, Tuesday 2/9. Please send us (both Jimmy and Nitin) an email with "Cloud Computing Course: Assignment 1-2" as the subject. In the body of the email, put your answers to the questions above.

Note: The Google/IBM cluster is a shared resource accessible by many. Any impropriety on the cluster will be taken very seriously. This includes tampering or attempting to tamper with another student's results, attempting to pass another student's result as one's own, etc. See the Code of Academic Integrity or the Student Honor Council for more information.


This page, first created: 22 Jan 2010. Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States.