A Hadoop toolkit for working with big data
This page presents solutions for the inverted indexing exercise.
Let's start by putting the collection into HDFS:
$ hadoop fs -put data/bible+shakes.nopunc
Note that we've uncompressed the collection first. The reason for this is that in order to fetch a random document, we need to seek to arbitrary positions in the file, which is not possible with gzip. (In case you're wondering, with block compression schemes such as lzo this is possible, but requires creating a block-level index ahead of time. This is beyond the scope of this exercise.)
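To illustrate why seekability matters, here is a minimal Python sketch of fetching a document by byte offset from an uncompressed file. It assumes that each line of the collection is one document and that a document is identified by the byte offset at which its line starts. The three-line collection and the `fetch_document` helper are hypothetical and are not part of Cloud9.

```python
import os
import tempfile

# Hypothetical three-line collection; each line is one "document".
lines = [b"in the beginning\n", b"a pair of lovers\n", b"all that glitters\n"]

# Record the byte offset at which each line starts (assumed document id).
offsets = []
position = 0
for line in lines:
    offsets.append(position)
    position += len(line)

# Write the collection to an uncompressed temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.writelines(lines)
    path = tmp.name


def fetch_document(path, offset):
    """Jump straight to a byte offset and read the line that starts there."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n")


second = fetch_document(path, offsets[1])
os.remove(path)
print(offsets[1], second)
```

With a gzip stream, the same lookup would require decompressing everything before the requested offset.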
The main class for building the inverted index is edu.umd.cloud9.example.ir.BuildInvertedIndex, and a sample command-line invocation is as follows:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.ir.BuildInvertedIndex \
    -input bible+shakes.nopunc -output index -numReducers 1
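For readers who want a feel for what such a job produces, the following is a minimal in-memory Python sketch of inverted index construction. It is not the BuildInvertedIndex implementation; it only mimics the general map and reduce steps under the assumption that each posting is a (docno, tf) pair. The function name and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Build term -> sorted list of (docno, tf) from (docno, text) pairs."""
    index = defaultdict(list)
    for docno, text in docs:
        # "Map" step: count how often each term occurs in this document.
        for term, tf in Counter(text.split()).items():
            index[term].append((docno, tf))
    # "Reduce" step: gather each term's postings and sort them by docno.
    return {term: sorted(postings) for term, postings in index.items()}


# Hypothetical toy collection; the docnos are arbitrary illustrative values.
docs = [(0, "gold and silver"), (16, "gold gold"), (26, "silver lining")]
index = build_inverted_index(docs)
print(index["gold"])
```

A real MapReduce job distributes the map step across input splits and relies on the shuffle to group postings by term, which the dictionary stands in for here.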
The following command-line invocation runs a simple program that performs the analysis asked for in the questions:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.ir.LookupPostings \
    -index index -collection bible+shakes.nopunc
Although LookupPostings is not a MapReduce program, we can use Hadoop to launch it so that the HDFS configuration is properly set up. In this case, we read data directly from HDFS. Alternatively, you can copy the index data out of HDFS onto the local drive and use Maven:
$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.ir.LookupPostings -Dexec.args="-index index -collection data/bible+shakes.nopunc"
Now, onto the answers. Posting corresponding to the term "starcross'd":
(5047738, 1) a pair of starcross'd lovers take their life
Histogram of tf values for "gold":
1 523
2 58
3 3
Histogram of tf values for "silver":
1 314
2 39
3 1
The term "bronze" does not appear in the collection.
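The tf histograms above count how many postings of a term carry each tf value. A minimal Python sketch of that computation follows, assuming a postings list is a sequence of (docno, tf) pairs. The `tf_histogram` helper and the sample postings are hypothetical and are not taken from LookupPostings.

```python
from collections import Counter


def tf_histogram(postings):
    """Return sorted (tf, number of postings with that tf) pairs."""
    counts = Counter(tf for _docno, tf in postings)
    return sorted(counts.items())


# Hypothetical postings list for one term.
postings = [(0, 1), (16, 2), (40, 1), (77, 1), (90, 3)]
hist = tf_histogram(postings)

# A term that is absent from the index has no postings, so its
# histogram is empty.
missing = tf_histogram([])

for tf, count in hist:
    print(tf, count)
```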