A Hadoop toolkit for web-scale information retrieval research
Ivory is designed to work with Hadoop YARN and has been tested against Cloudera CDH 4.3.0 (on both Mac and Linux). It should work with other Hadoop distributions or on other platforms with only minor modifications; however, switching to a non-YARN version of Hadoop will require recompiling the jars. In this tutorial, we'll take you through the process of building an inverted index on a toy collection and running retrieval experiments. First, we'll use Hadoop local mode (also called standalone mode). Running in local mode, as the name suggests, does not require setting up a cluster, but of course, you won't get the benefits of distributed processing either. Next, we'll run Ivory on the single-node virtual Hadoop cluster provided by Cloudera.
Download Cloudera CDH 4.3.0 here. The easiest way is to download the tarball and unpack it on your local machine. Make sure PATH_TO_HADOOP/bin is on your path. Verify that Hadoop is running with the pi example. In a shell, run the following command:
hadoop jar PATH_TO_HADOOP/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.3.0.jar pi \
  -D mapreduce.framework.name=local -D mapreduce.jobtracker.address=local -D fs.default.name=file:/// \
  -D mapreduce.cluster.local.dir=/tmp/mapred/local \
  -D mapreduce.cluster.temp.dir=/tmp/mapred/temp \
  -D mapreduce.jobtracker.staging.root.dir=/tmp/mapred/staging \
  -D mapreduce.jobtracker.system.dir=/tmp/mapred/system \
  100 100000
Note that the multitude of -D options overrides the Hadoop config and forces local mode. They aren't necessary if you've just downloaded the tarball straight from the site above; they're only there in case you already have Hadoop set up.
After the above Hadoop local job finishes, you should see the computed value of pi... something reasonably close to 3.14.
Open up a shell and clone the Ivory github repo:
git clone git://github.com/lintool/Ivory.git
Go into the Ivory/ directory and build with ant by typing ant. The build should complete without error.
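To spell out those steps as shell commands (assuming the repository was cloned into Ivory/ in your current directory):
cd Ivory/
ant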
You should then be able to run one of the integration JUnit test cases that builds an inverted index on the CACM collection, runs retrieval experiments, and verifies the results:
etc/junit.sh ivory.integration.local.VerifyLocalCACMPositionalIndexIP
After the job finishes, you should see:
OK (1 test)
This indicates that the test has passed. If you got this far, congratulations, you've gotten Ivory to work without a hitch.
Next, we'll run through the above integration test step by step to see what it's doing. Let's look at the collection first:
gunzip -c data/cacm/cacm-collection.xml.gz | less
This is the collection of documents we're going to search against, in the standard "TREC format":
<DOC>
<DOCNO>CACM-0001</DOCNO>
...
</DOC>
<DOC>
<DOCNO>CACM-0002</DOCNO>
...
</DOC>
...
The <DOC> and </DOC> tags denote documents. The <DOCNO> and </DOCNO> tags surround the unique document identifier.
This is the CACM collection, which contains abstracts from the Communications of the ACM from the late 1950's through the mid 1960's. It's too small and too old to be useful for serious research today, but it works fine as a toy test collection.
Let's build an inverted index. Ant should have automatically created a script for you located at etc/hadoop-local.sh for running Hadoop jobs in local mode. It conveniently sets up the environment, so you shouldn't have to worry about classpaths, libjars, etc. Building the index involves two separate commands:
etc/hadoop-local.sh ivory.app.PreprocessTrecCollection \
  -collectionName cacm -collection data/cacm/cacm-collection.xml.gz -index index-cacm

etc/hadoop-local.sh ivory.app.BuildIndex \
  -index index-cacm -indexPartitions 1 -positionalIndexIP
The first preprocesses the collection, while the second actually performs inverted index construction. The index should now be in index-cacm/.
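If you're curious, you can peek at what was generated; the exact files and subdirectories may vary across Ivory versions:
ls index-cacm/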
Now let's run some retrieval experiments. To do so, we need "topics" (which is IR parlance for queries) and "qrels" (which is IR parlance for relevance judgments, i.e., lists that tell us which documents are relevant with respect to which queries). For this collection, the topics are in data/cacm/queries.cacm.xml and the qrels are in data/cacm/qrels.cacm.txt.
Next, take a look at the model parameter settings in data/cacm/run.cacm.xml. The file specifies the index location and the two retrieval models that we're going to run: one based on language modeling (with the Dirichlet score function), and the second based on BM25.
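To look at these files yourself (all paths are relative to the Ivory/ directory):
less data/cacm/queries.cacm.xml
less data/cacm/qrels.cacm.txt
less data/cacm/run.cacm.xml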
Now go ahead and do the experimental runs:
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/cacm/run.cacm.xml data/cacm/queries.cacm.xml
If you see warnings about not being able to find terms, that's okay. After the program finishes, you should see two files: ranking.cacm-dir-base.txt and ranking.cacm-bm25-base.txt. These are the retrieval results, the files containing the top 1000 hits per topic.
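The files can be fed directly to trec_eval, so they follow the standard TREC run format (one retrieved document per line). To peek at the top of one:
head ranking.cacm-dir-base.txt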
Finally, let's evaluate the ranked lists to see how well our retrieval models perform. Before doing so, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package with make and place the executable at etc/trec_eval.
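One way to do that, assuming the tarball unpacks into a trec_eval.9.0/ directory (adjust the path if yours differs):
cd etc/
tar xzf trec_eval.9.0.tar.gz
cd trec_eval.9.0/
make
cp trec_eval ..
cd ../..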
You can then run trec_eval on each of the runs:
etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-dir-base.txt
etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-bm25-base.txt
For the first, you should get a map (mean average precision) of 0.3032 and P_10 (precision at 10) of 0.3038. For the second, you should get a map of 0.2863 and a P_10 of 0.2885.
The next step is to run Ivory on an actual Hadoop cluster. How to set up a Hadoop cluster is beyond the scope of this tutorial, but the next best thing is to use Cloudera's virtual machine images, which come with a pre-configured single-node cluster. The images can be downloaded here.
Use the VirtualBox image, since VirtualBox is freely available across all major platforms. Download the image and unpack the tarball. VirtualBox itself can be downloaded here.
Install VirtualBox and open up the application. To install the Cloudera Hadoop image, click "New" on the tool bar. For "Name and operating system", put in the following information:
Next, for "Memory size", put in as much as you can spare, with a minimum of 3GB. Next, "Hard drive", select "Use an existing virtual hard drive file" and select the VM image you downloaded from above. To finish, click "Create". Back in the main window, the VM should have been added. Select it and click "Start" in the toolbar. That'll boot up the image.
Info: On a Mac, if you get the error "Failed to load VMMR0.r0 (VERR_SUPLIB_WORLD_WRITABLE)" when booting up, VirtualBox is complaining because the /Applications directory is world writable. Apparently, that's bad practice, so change it: chmod 775 /Applications should do the trick.
The VM is missing a few packages that we need, so open up a shell and install from the command line:
sudo yum install git
sudo yum install ant
sudo yum install gcc
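Equivalently, you can install all three in one shot:
sudo yum install -y git ant gcc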
Open up a shell and clone the Ivory github repo:
git clone git://github.com/lintool/Ivory.git
As before, go into the Ivory/ directory and build with ant by typing ant.
After that's done, we need to put the CACM collection onto HDFS, since we're no longer indexing on the local disk:
hadoop fs -put data/cacm/cacm-collection.xml.gz
You can verify that the file is there:
hadoop fs -ls
Next, build the indexes using the etc/hadoop-cluster.sh script, as follows:
etc/hadoop-cluster.sh ivory.app.PreprocessTrecCollection \
  -collectionName cacm -collection cacm-collection.xml.gz -index index-cacm

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-cacm -indexPartitions 1 -positionalIndexIP
The script is a wrapper around hadoop that sets up the environment, handles libjars, etc. If you're curious, cat it and you'll see. Note that the paths here reference HDFS paths, not local paths.
After the indexing finishes, you should be able to see the indexes on HDFS:
hadoop fs -ls index-cacm
Now copy the indexes from HDFS onto the local disk:
hadoop fs -get index-cacm .
From here, you should be able to run the retrieval experiments above.
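For example, the same commands as before should work from the Ivory/ directory, assuming the run configuration points at the index-cacm/ directory you just copied down:
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/cacm/run.cacm.xml data/cacm/queries.cacm.xml
etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-dir-base.txt
etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-bm25-base.txt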
And that's it! Happy searching!