A Hadoop toolkit for working with big data
Warning | This exercise builds on the the inverted indexing exercise, which must be completed before this. |
Write a Boolean retrieval engine that uses the index created in the inverted indexing exercise. Your retrieval engine should be able to handle complex Boolean queries of the following form:
( slings AND arrows ) OR ( mortal AND coil ) ( et AND tu AND brute )
As a note, the inverted index you constructed in the previous exercise is specifically designed for ranked retrieval. For Boolean retrieval, the df and tf of terms are not used.
Your system will need to parse nested Boolean queries. To make query processing simpler, you can choose the notation in which the queries are expressed: infix notation (as above) or prefix notation, e.g., "(AND slings arrows)". In fact, you can even write a query parser that uses reverse polish notation—it has many advantages, one of which is that all expressions can be unambiguously interpreted without parentheses. As a hint, reverse polish notation actually yields the simplest implementation.
As with previous exercises, use the sample input included in Cloud9, the Bible and the complete works of Shakespeare. As a bare minimum, your Boolean retrieval engine must be able to:
For example, in response to the query "outrageous AND fortune", your retrieval engine should return something like:
4442172 the slings and arrows of outrageous fortune
To give another example, in response to the query "white AND rose", your retrieval engine should return something like:
7841087 from off this brier pluck a white rose with me 7841354 i pluck this white rose with plantagenet 7841879 giving my verdict on the white rose side 7841972 lest bleeding you do paint the white rose red 7842315 in sign whereof i pluck a white rose too 7842458 shall dye your white rose in a bloody red 7845524 shall send between the red rose and the white 8237199 until the white rose that i wear be dyed 8275306 the red rose and the white are on his face 9067070 we will unite the white rose and the red
Once you have the Boolean search engine working, try out these sample queries:
means AND deceit (white OR red ) AND rose AND pluck (unhappy OR outrageous OR (good AND your)) AND fortune
This exercise is meant to be completed outside of Hadoop. Build
your inverted index in Hadoop, as per
the inverted indexing exercise. If you set
the number of reducers to one, all the postings will be written to a
single file. This will make the index easier to manipulate.
Obviously, this isn't a scalable solution, but it will suffice for
this exercise. Retrieve your index from HDFS with the hadoop fs
-get
command. Implement your Boolean retrieval engine to
manipulate the inverted index stored on your local machine.
As another hint, MapFileOutputFormat
might come in
handy.
Some optional features you might want to consider implementing are:
When you're ready, the solutions to this exercise are located here.