Cloud9

A Hadoop toolkit for working with big data

This page presents solutions for the inverted indexing exercise.

Let's start by putting the collection into HDFS:

$ hadoop fs -put data/bible+shakes.nopunc

Note that we've uncompressed the collection first. To fetch a random document, we need to seek to arbitrary positions in the file, which is not possible with gzip. (In case you're wondering, block compression schemes such as LZO do make this possible, but only after creating a block-level index ahead of time, which is beyond the scope of this exercise.)
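To make the seeking requirement concrete, here is a minimal Python sketch (not part of Cloud9) that fetches one document by byte offset from an uncompressed file. It assumes one document per line, identified by the byte offset at which the line starts; the helper name `fetch_document` and the three-line collection are made up for illustration.

```python
import os
import tempfile

def fetch_document(path, offset):
    """Return the line that starts at byte `offset` in an uncompressed file."""
    with open(path, "rb") as f:
        # Direct jump to the offset; a plain gzip stream would instead
        # have to be decompressed from the beginning to reach this point.
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n")

# Hypothetical three-line collection, written to a temporary file.
lines = ["in the beginning", "a pair of lovers", "all that glisters is not gold"]
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(lines).encode("utf-8") + b"\n")
    path = tmp.name

# Byte offset at which each line starts.
offsets = []
pos = 0
for line in lines:
    offsets.append(pos)
    pos += len(line.encode("utf-8")) + 1  # +1 for the newline

second = fetch_document(path, offsets[1])
print(second)
os.remove(path)
```

The offset plays the role of a document id: given only that number, the document can be read without scanning the rest of the file.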

The main class for building the inverted index is edu.umd.cloud9.example.ir.BuildInvertedIndex and a sample command-line invocation is as follows:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.ir.BuildInvertedIndex \
   -input bible+shakes.nopunc -output index -numReducers 1
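For a mental model of what this job produces, the following is a small in-memory sketch in Python. It is not the Cloud9 implementation and does not use MapReduce. It assumes that each line is a document, that the document id is the byte offset of the line (consistent with the posting shown later on this page), and that terms are split on whitespace; `build_inverted_index` is a hypothetical name.

```python
from collections import Counter, defaultdict

def build_inverted_index(text):
    """Map each term to a list of (docno, tf) postings in docno order."""
    index = defaultdict(list)
    offset = 0
    for line in text.split("\n"):
        # Term frequencies within this one document (line).
        tf = Counter(line.split())
        for term, count in tf.items():
            index[term].append((offset, count))
        # The next line starts right after this line's newline.
        offset += len(line.encode("utf-8")) + 1
    return dict(index)

# Made-up three-document collection, for illustration only.
sample = "gold and silver\nall gold is gold\nno metal here"
index = build_inverted_index(sample)
print(index["gold"])            # postings for a term that occurs
print(index.get("bronze", []))  # a term that never occurs has no postings
```

Because lines are visited in file order, each postings list comes out sorted by document id without an explicit sort.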

The following command runs a simple program that performs the analysis asked for in the exercise questions:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.ir.LookupPostings \
   -index index -collection bible+shakes.nopunc

Although LookupPostings is not a MapReduce program, we can use Hadoop to launch it so that the HDFS configuration is properly set up. In this case, we read data directly from HDFS. Alternatively, you can copy the index data out of HDFS onto a local drive and run the program with Maven:

$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.ir.LookupPostings -Dexec.args="-index index -collection data/bible+shakes.nopunc"

Now, on to the answers. The posting corresponding to the term "starcross'd":

(5047738, 1)
a pair of starcross'd lovers take their life

Histogram of tf values for "gold":

1       523
2       58
3       3

Histogram of tf values for "silver":

1       314
2       39
3       1

The term "bronze" does not appear in the collection.
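The histograms above count, for each tf value, how many postings carry it. A minimal Python sketch of that computation follows; it uses a made-up postings list rather than the real index, and `tf_histogram` is a hypothetical name, not a Cloud9 method.

```python
from collections import Counter

def tf_histogram(postings):
    """Return sorted (tf, number of postings with that tf) pairs."""
    return sorted(Counter(tf for _docno, tf in postings).items())

# Hypothetical (docno, tf) postings; not taken from the actual index.
postings = [(0, 1), (16, 2), (40, 1), (95, 1), (120, 3)]
hist = tf_histogram(postings)
for tf, count in hist:
    print(f"{tf}\t{count}")

# A term with no postings, such as "bronze" here, yields an empty histogram.
empty = tf_histogram([])
```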