University of Maryland, College Park


Data-Intensive Information Processing Applications (Spring 2010)

Assignment 2: Bigram Counts

Due: Tuesday 2/23 (2pm)

Complete the bigram count exercise in Cloud9. The exercise contains two parts, with four questions in Part I and three questions in Part II. Answer all these questions. You may complete this portion of the assignment in standalone mode if you wish. Alternatively, if you wish to use the Google/IBM cluster, the sample dataset (the Bible and the complete works of Shakespeare) is loaded at /tmp/sample-data.

After you've completed Part I and Part II of the above exercise, you're going to make sure that everything runs "at scale" over the Wikipedia collection stored at /tmp/wiki on the Google/IBM cluster (also used in the previous assignment).

Part III

Count the bigrams on the Wikipedia collection (as in Part I of the exercise above). Make sure your job successfully completes on the entire collection. Put your output (complete bigram counts for the entire collection) in /tmp/lin-course/bigrams-USERNAME. Substitute USERNAME with your actual username without the "ccc_" prefix. Therefore, I would put my output in /tmp/lin-course/bigrams-jimmylin. It is important that you follow these instructions exactly, because this is where we are going to look for your output.

Hint: For these large jobs, run around 100-200 reducers. You want enough reducers to get good parallelism, but not so many that you consume all cluster resources (leaving none for others).
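As a reminder of what the job computes, here is a minimal in-memory sketch of bigram counting in map/reduce style. It is not the Cloud9 implementation: the function names are hypothetical, tokenization is assumed to be a plain whitespace split, and bigrams are assumed not to cross line boundaries. The sort-and-group step only stands in for the framework's shuffle.

```python
from itertools import groupby


def map_bigrams(line):
    """Mapper: emit ((left, right), 1) for each adjacent token pair in a line."""
    tokens = line.split()  # assumption: whitespace tokenization
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right), 1


def reduce_counts(pairs):
    """Group the emitted pairs by key and sum the values for each key."""
    ordered = sorted(pairs)  # stands in for the framework's sort/shuffle
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


def count_bigrams(lines):
    """Run the mapper over every line, then reduce the emitted pairs."""
    emitted = (kv for line in lines for kv in map_bigrams(line))
    return dict(reduce_counts(emitted))


# Toy input, not the Wikipedia collection.
counts = count_bigrams(["a b a b", "b a"])
```

On the cluster, each reducer would handle only the keys routed to it, which is why the number of reducers controls the degree of parallelism.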

Question 1. What is your job id? If you ran the code more than once, any job id of a successful run will do. Save the job details page and attach it as part of your answer. That is, from the jobtracker webapp, find your job, save that page in HTML, and include it in your assignment submission. Name this file part3q1.htm.

Question 2. For the job id you identified above, how many map tasks and reduce tasks did the job contain? How long did the job take?

Question 3. How many unique bigrams are there in the entire collection?

Question 4. What are the counts of the following bigrams?

Hint: There's no reason why you couldn't write a separate program to find these counts.
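One possible shape for such a separate program is sketched below. It assumes the job wrote plain-text output with one line per bigram in the form "left right<TAB>count"; your output format may differ, so adjust the parsing accordingly. The function name and the sample bigrams are hypothetical placeholders, not the bigrams asked about above.

```python
def scan_counts(lines, wanted):
    """Return (number of distinct bigrams, counts for the wanted bigrams).

    Assumes each non-empty line is "left right<TAB>count" and that every
    bigram appears on exactly one line, as in a reducer's final output.
    """
    unique = 0
    found = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        bigram, _, count = line.rpartition("\t")
        unique += 1
        if bigram in wanted:
            found[bigram] = int(count)
    return unique, found


# Toy data standing in for the real output files.
sample = ["happy birthday\t3\n", "happy days\t1\n", "of the\t7\n"]
unique, found = scan_counts(sample, {"of the", "happy days"})
```

In practice the lines could come from standard input, for example by piping the concatenated output files of your job into the script. The same pass also yields the number of unique bigrams.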

Part IV

Now, compute bigram probabilities on the Wikipedia collection (as in Part II of the exercise above). Make sure your job successfully completes on the entire dataset. Put your output (bigram probabilities for the entire collection) in /tmp/lin-course/condprob-USERNAME. Substitute USERNAME with your actual username without the "ccc_" prefix. Therefore, I would put my output in /tmp/lin-course/condprob-jimmylin. It is important that you follow these instructions exactly, because this is where we are going to look for your output.

Hint: Depending on how you implement your algorithm, computing bigram probabilities should be relatively quick (around 10-15 minutes or less). If it's taking significantly longer, you should rethink your algorithm. Don't leave an errant job running on the cluster for too long: kill it with "hadoop job -kill". A "pairs" solution to this problem is probably the easiest; you might want to look at edu.umd.cloud9.demo.DemoWordCondProb in Cloud9 as a starting point. Despite its deficiencies, "pairs" will scale fine to this collection.
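The idea behind a "pairs" solution can be sketched in memory as follows. This is an illustration under stated assumptions, not the code in DemoWordCondProb: the names are hypothetical, tokenization is a whitespace split, and the marker "*" is assumed never to occur as a real token. For every bigram the mapper also emits a marginal pair (left, "*"); the sort order places that marginal ahead of the other pairs with the same left word, so the reducer sees the denominator before it needs it.

```python
from collections import defaultdict

MARGINAL = "*"  # assumption: this marker never occurs as a real token


def map_pairs(line):
    """Mapper: emit the bigram itself plus a marginal pair for its left word."""
    tokens = line.split()
    for left, right in zip(tokens, tokens[1:]):
        yield (left, right), 1
        yield (left, MARGINAL), 1


def sort_key(item):
    """Order by left word, with the marginal pair first within each left word."""
    (left, right), _ = item
    return (left, right != MARGINAL, right)


def cond_probs(lines):
    """Compute P(right | left) for every bigram in the input lines."""
    totals = defaultdict(int)
    for line in lines:
        for key, value in map_pairs(line):
            totals[key] += value  # summation a combiner or reducer would do

    probs = {}
    marginal = None
    for (left, right), count in sorted(totals.items(), key=sort_key):
        if right == MARGINAL:
            marginal = count  # denominator for the pairs that follow
        else:
            probs[(left, right)] = count / marginal
    return probs


# Toy input, not the Wikipedia collection.
probs = cond_probs(["a b a c", "a b"])
```

In a real job, all pairs sharing a left word must reach the same reducer, which means partitioning on the left word alone rather than on the whole pair.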

Question 1. What is your job id? If you ran the code more than once, any job id of a successful run will do. Save the job details page and attach it as part of your answer. That is, from the jobtracker webapp, find your job, save that page in HTML, and include it in your assignment submission. Name this file part4q1.htm.

Question 2. For the job id you identified above, how many map tasks and reduce tasks did the job contain? How long did the job take?

Question 3. Compare the running time of the bigram counting algorithm and the bigram probabilities algorithm (question 2 from Part III and question 2 from Part IV). Factoring away variations in cluster load (i.e., how many other jobs are running), can you explain the differences in running time? Which algorithm should take longer and why?

Question 4. What is P(birthday|happy)?

Hint: There's no reason why you couldn't write a separate program to answer this specific question.
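One way to do this is to compute the value directly from bigram counts, using P(right | left) = count(left right) / sum of count(left w) over all w. The sketch below assumes count lines of the form "left right<TAB>count", which may not match your output. The function name is hypothetical, and the numbers are toy data, not the answer for the Wikipedia collection.

```python
def conditional_probability(lines, left, right):
    """Return P(right | left) from bigram count lines, or None if left is unseen."""
    joint = 0
    marginal = 0
    for line in lines:
        bigram, _, count = line.rstrip("\n").rpartition("\t")
        words = bigram.split(" ")
        if len(words) != 2 or words[0] != left:
            continue  # skip malformed lines and other left words
        marginal += int(count)
        if words[1] == right:
            joint += int(count)
    return joint / marginal if marginal else None


# Toy data only.
sample = ["happy birthday\t3\n", "happy days\t1\n", "of the\t7\n"]
p = conditional_probability(sample, "happy", "birthday")
```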

Question 5. How long did it take you to complete this assignment (Parts I through IV)?

Submission Instructions

This assignment is due by 2pm, Tuesday 2/23. Please send us (both Jimmy and Nitin) an email with "Cloud Computing Course: Assignment 2" as the subject. In the body of the email put answers to the questions above. If you have collaborated with anyone else or have received any assistance in completing this assignment, you must tell us.

Pack up your code into a zip file named USERNAME-code.zip, and attach it to your assignment submission. So for example, I would pack up my code in a file named jimmylin-code.zip. Once again, please follow these instructions exactly.

Note: The Google/IBM cluster is a shared resource accessible by many. Any impropriety on the cluster will be taken very seriously. This includes tampering or attempting to tamper with another student's results, attempting to pass another student's result as one's own, etc. See the Code of Academic Integrity or the Student Honor Council for more information.



This page, first created: 09 Feb 2010. Creative Commons: Attribution-Noncommercial-Share Alike 3.0 United States.