A Hadoop toolkit for working with big data

Warning This exercise builds on the the inverted indexing exercise, which must be completed before this.

Write a Boolean retrieval engine that uses the index created in the inverted indexing exercise. Your retrieval engine should be able to handle complex Boolean queries of the following form:

( slings AND arrows ) OR ( mortal AND coil )
( et AND tu AND brute )

As a note, the inverted index you constructed in the previous exercise is specifically designed for ranked retrieval. For Boolean retrieval, the df and tf of terms are not used.

Your system will need to parse nested Boolean queries. To make query processing simpler, you can choose the notation in which the queries are expressed: infix notation (as above) or prefix notation, e.g., "(AND slings arrows)". In fact, you can even write a query parser that uses reverse polish notation—it has many advantages, one of which is that all expressions can be unambiguously interpreted without parentheses. As a hint, reverse polish notation actually yields the simplest implementation.

As with previous exercises, use the sample input included in Cloud9, the Bible and the complete works of Shakespeare. As a bare minimum, your Boolean retrieval engine must be able to:

For example, in response to the query "outrageous AND fortune", your retrieval engine should return something like:

4442172  the slings and arrows of outrageous fortune

To give another example, in response to the query "white AND rose", your retrieval engine should return something like:

7841087  from off this brier pluck a white rose with me
7841354  i pluck this white rose with plantagenet
7841879  giving my verdict on the white rose side
7841972  lest bleeding you do paint the white rose red
7842315  in sign whereof i pluck a white rose too
7842458  shall dye your white rose in a bloody red
7845524  shall send between the red rose and the white
8237199  until the white rose that i wear be dyed
8275306  the red rose and the white are on his face
9067070  we will unite the white rose and the red

Once you have the Boolean search engine working, try out these sample queries:

means AND deceit
(white OR red ) AND rose AND pluck
(unhappy OR outrageous OR (good AND your)) AND fortune

Practical Tips

This exercise is meant to be completed outside of Hadoop. Build your inverted index in Hadoop, as per the inverted indexing exercise. If you set the number of reducers to one, all the postings will be written to a single file. This will make the index easier to manipulate. Obviously, this isn't a scalable solution, but it will suffice for this exercise. Retrieve your index from HDFS with the hadoop fs -get command. Implement your Boolean retrieval engine to manipulate the inverted index stored on your local machine.

As another hint, MapFileOutputFormat might come in handy.


Some optional features you might want to consider implementing are:


When you're ready, the solutions to this exercise are located here.