Data-Intensive Information Processing Applications (Spring 2010)

University of Maryland, College Park

Data-Intensive Information Processing Applications (Spring 2010)

Assignment 3: Inverted Indexing and Boolean Retrieval

Due: Tuesday 3/2 (2pm)

Part I

Complete the inverted indexing exercise in Cloud⁹. You can implement Dean and Ghemawat's original algorithm—no need to implement the fancier algorithms we discussed in class. You may complete this portion of the assignment in standalone mode if you wish. Alternatively, if you wish to use the Google/IBM cluster, the sample collection (the Bible and the complete works of Shakespeare) is loaded at /tmp/sample-data.

Answer the three questions listed in the exercise (repeated below), plus one more:

Question 1. Look up the postings corresponding to the term "starcross'd". There should only be one line in the entire collection that contains the term. What is that line? What's its docno (i.e., byte offset)?

Question 2. Look up the postings corresponding to the term "gold". Generate a histogram of tf values. That is, in how many lines does "gold" appear once, twice, three times, etc.?

Question 3. Do the same for the terms "silver" and "bronze".

Question 4. How did you represent postings lists in your implementation?

Part II

Complete the Boolean retrieval exercise in Cloud⁹. Show results of the three queries, repeated below:

means AND deceit
(white OR red ) AND rose AND pluck
(unhappy OR outrageous OR (good AND your)) AND fortune

Extra credit will be given if you implement one or both of the "Extensions". However, we strongly advise that you complete the entire assignment before trying out these extensions.

Part III

Now make sure your inverted indexing algorithm scales to a non-trivial collection. We've prepared a Wikipedia collection for you, stored at /tmp/wiki-articles on the Google/IBM cluster. From a raw Wikipedia XML dump, we sampled approximately one sixth of all articles (ignoring disambiguation and redirect pages)—953,999 articles in total. The entire collection is contained in one large file, with an article per line.

Dean and Ghemawat's algorithm should scale to this collection, provided you do two things:

First, ignore the terms with the highest document frequencies (i.e., treat these as stopwords). We've compiled a a list here for you.

Second, increase the heap size of your map and reduce tasks. This is especially important in the reducer where you need to buffer all postings in memory before you write out the entire postings list. To increase the heap size, call the following method on the JobConf object in your main driver code:

conf.set("mapred.child.java.opts", "-Xmx2048m");

Don't increase the heap size to more than 2 GB, because there isn't enough physical memory on the machines! Remember, each node is running multiple tasks, and they all must share the same physical memory. By dropping the most common terms, this heap size should be sufficient for you to complete the assignment.

Make sure your inverted indexing algorithm is able to scale to the Wikipedia collection. Put your output (inverted index for the entire collection) in /tmp/lin-course/index-USERNAME. Substitute USERNAME with your actual username without the "ccc_" prefix. Therefore, I would put my output in /tmp/lin-course/index-jimmylin. It is important that you follow these instructions exactly, because this is where we are going to look for your output.

There are only two more questions to answer:

Question 1. The term "alien-elephant" appears exactly once in the collection. Which Wikipedia article does the term appear in?

Hint: You'll need to find the right postings list, and then find the byte offset of the posting. From there, you'll need to look up the actual Wikipedia article. There's no reason why you couldn't write one or more separate programs to answer this question.

Question 2. How long did it take you to complete this assignment (Parts I through III)?

Submission Instructions

This assignment is due by 2pm, Tuesday 3/2. Please send us (both Jimmy and Nitin) an email with "Cloud Computing Course: Assignment 3" as the subject. In the body of the email put answers to the questions above. If you have collaborated with anyone else or have received any assistance in completing this assignment, you must tell us.

Pack up your code into a zip file named USERNAME-code.zip, and attach it your assignment submission. So for example, I would pack up my code in a file named jimmylin-code.zip. Once again, please follow these instructions exactly.

Note: The Google/IBM cluster is a shared resource accessible by many. Any impropriety on the cluster will be taken very seriously. This includes tampering or attempting to tamper with another student's results, attempting to pass another student's result as one's own, etc. See the Code of Academic Integrity or the Student Honor Council for more information.

Back to main page

This page, first created: 10 Feb 2010; last updated: