A Hadoop toolkit for web-scale information retrieval research

Ivory is designed to work with Hadoop YARN and has been tested against Cloudera CDH 4.3.0 (on both Mac and Linux). It should work with other Hadoop distributions or on other platforms with only minor modifications; however, switching to a non-YARN version of Hadoop will require recompiling the jars. In this tutorial, we'll take you through the process of building an inverted index on a toy collection and running retrieval experiments. First, we'll use Hadoop local mode (also called standalone mode). Running in local mode, as the name suggests, does not require setting up a cluster, but of course, you won't get the benefits of distributed processing either. Next, we'll run Ivory on the single-node virtual Hadoop cluster provided by Cloudera.

Download and Install Hadoop

Download Cloudera CDH 4.3.0 from Cloudera's website. The easiest way is to download the tarball and unpack it on your local machine. Make sure PATH_TO_HADOOP/bin is on your path. Verify that Hadoop is working with the pi example. In a shell, run the following command:

hadoop jar PATH_TO_HADOOP/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.3.0.jar pi \
  -D mapreduce.framework.name=local -D mapreduce.jobtracker.address=local -D fs.default.name=file:/// \
  -D mapreduce.cluster.local.dir=/tmp/mapred/local \
  -D mapreduce.cluster.temp.dir=/tmp/mapred/temp \
  -D mapreduce.jobtracker.staging.root.dir=/tmp/mapred/staging \
  -D mapreduce.jobtracker.system.dir=/tmp/mapred/system \
  100 100000

Note that the multitude of -D options overrides the Hadoop configuration and forces local mode. They aren't necessary if you just downloaded the tarball straight from the site above; they're there in case you already have Hadoop set up to run on a cluster.
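
If you're not sure which Hadoop installation your shell will pick up, you can check first:

hadoop version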

After the above Hadoop local job finishes, you should see the computed value of pi... something reasonably close to 3.14.

Clone the Ivory Repo

Open up a shell and clone the Ivory github repo:

git clone git://github.com/lintool/Ivory.git

Go into the Ivory/ directory and build with ant by typing ant. The build should complete without error.
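
That is:

cd Ivory/
ant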

You should then be able to run one of the integration JUnit test cases that builds an inverted index on the CACM collection, runs retrieval experiments, and verifies the results:

etc/junit.sh ivory.integration.local.VerifyLocalCACMPositionalIndexIP

After the job finishes, you should see:

OK (1 test)

This indicates that the test has passed. If you got this far, congratulations, you've gotten Ivory to work without a hitch.

Building an Index

Next, we'll run through the above integration test step by step to see what it's doing. Let's look at the collection first:

gunzip -c data/cacm/cacm-collection.xml.gz | less

This is the collection of documents we're going to search against, stored in the standard "TREC format". Schematically, each document looks something like this (the identifier below is illustrative):
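
<DOC>
<DOCNO> CACM-0001 </DOCNO>
... text of the abstract ...
</DOC>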


The <DOC> and </DOC> tags denote documents. The <DOCNO> and </DOCNO> tags surround the unique document identifier.

This is the CACM collection, which contains abstracts from the Communications of the ACM from the late 1950s through the mid-1960s. It's too small and too old to be useful for serious research today, but it works fine as a toy test collection.

Let's build an inverted index. Ant should have automatically created a script for you located at etc/hadoop-local.sh for running Hadoop jobs in local mode. It conveniently sets up the environment, so you shouldn't have to worry about classpaths, libjars, etc. Building the index involves two separate commands:

etc/hadoop-local.sh ivory.app.PreprocessTrecCollection \
  -collectionName cacm -collection data/cacm/cacm-collection.xml.gz -index index-cacm

etc/hadoop-local.sh ivory.app.BuildIndex \
  -index index-cacm -indexPartitions 1 -positionalIndexIP

The first preprocesses the collection, while the second actually performs inverted index construction. The index should now be in index-cacm/.
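
To sanity-check, you can list the contents of the index directory (the exact set of files may vary by Ivory version):

ls index-cacm/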

Retrieval Experiments

Now let's run some retrieval experiments. To do so, we need "topics" (which is IR parlance for queries) and "qrels" (which is IR parlance for relevance judgments, or lists that tell us which documents are relevant with respect to which queries). Both are included in the repo; take a peek:
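
less data/cacm/queries.cacm.xml
less data/cacm/qrels.cacm.txt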

Next, take a look at the model parameter settings in data/cacm/run.cacm.xml (the same file we pass to the retrieval program below). It specifies the index location and the two retrieval models that we're going to run: one based on language modeling (with the Dirichlet score), and the second based on BM25. You can view it with:
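
less data/cacm/run.cacm.xml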

Now go ahead and do the experimental runs:

etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/cacm/run.cacm.xml data/cacm/queries.cacm.xml

If you see warnings about not being able to find terms, that's okay. After the program finishes, you should see two files: ranking.cacm-dir-base.txt and ranking.cacm-bm25-base.txt. These are the retrieval results: files containing the top 1000 hits per topic.

Finally, let's evaluate the ranked lists to see how well our retrieval models perform. Before doing so, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package with make and place the executable at etc/trec_eval. Something like the following should work (assuming the tarball unpacks into a trec_eval.9.0/ directory):
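
cd etc/
tar xzf trec_eval.9.0.tar.gz
cd trec_eval.9.0/
make
cp trec_eval ../
cd ../..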

You can then run trec_eval:

etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-dir-base.txt
etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-bm25-base.txt

For the first, you should get a map (mean average precision) of 0.3032 and P_10 (precision at 10) of 0.3038. For the second, you should get a map of 0.2863 and a P_10 of 0.2885.

Running Ivory on a Single Node Virtual Cluster

The next step is to run Ivory on an actual Hadoop cluster. How to set up a Hadoop cluster is beyond the scope of this tutorial, but the next best thing is to use Cloudera's virtual machine images, which come with a pre-configured single-node cluster. The images can be downloaded from Cloudera's website.

Use the VirtualBox image, since VirtualBox is freely available across all major platforms. Download the image and unpack the tarball. VirtualBox itself can be downloaded from the VirtualBox website.

Install VirtualBox and open up the application. To install the Cloudera Hadoop image, click "New" on the toolbar. For "Name and operating system", put in the following information (the name is up to you; the Cloudera VM is CentOS-based, hence the Linux/Red Hat settings):
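
Name: Cloudera CDH (or anything descriptive)
Type: Linux
Version: Red Hat (64 bit)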

Next, for "Memory size", put in as much as you can spare, with a minimum of 3GB. Next, "Hard drive", select "Use an existing virtual hard drive file" and select the VM image you downloaded from above. To finish, click "Create". Back in the main window, the VM should have been added. Select it and click "Start" in the toolbar. That'll boot up the image.

Note: on Mac, if you get the error "Failed to load VMMR0.r0 (VERR_SUPLIB_WORLD_WRITABLE)" when booting up, VirtualBox is complaining because the /Applications directory is world writable. Apparently, that's bad practice, so change it: chmod 775 /Applications should do the trick.

The VM is missing a few packages that we need, so open up a shell and install from the command line:

sudo yum install git 
sudo yum install ant 
sudo yum install gcc

Open up a shell and clone the Ivory github repo:

git clone git://github.com/lintool/Ivory.git

As before, go into the Ivory/ directory and build with ant by typing ant.

After that's done, we need to put the CACM collection onto HDFS, since we're no longer indexing on the local disk:

hadoop fs -put data/cacm/cacm-collection.xml.gz

You can verify that the file is there:

hadoop fs -ls

Next, build the indexes using the etc/hadoop-cluster.sh script, as follows:

etc/hadoop-cluster.sh ivory.app.PreprocessTrecCollection \
  -collectionName cacm -collection cacm-collection.xml.gz -index index-cacm

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-cacm -indexPartitions 1 -positionalIndexIP

The script is a wrapper around hadoop that sets up the environment, handles libjars, etc. If you're curious, cat it and you'll see. Note that the paths here reference HDFS paths, not local paths.

After the indexing finishes, you should be able to see the indexes on HDFS:

hadoop fs -ls index-cacm

Now copy the indexes from HDFS onto the local disk:

hadoop fs -get index-cacm .

From here, you should be able to run the same retrieval experiments as above (assuming the index location specified in data/cacm/run.cacm.xml matches where you copied the index):
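
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/cacm/run.cacm.xml data/cacm/queries.cacm.xml
etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-dir-base.txt
etc/trec_eval data/cacm/qrels.cacm.txt ranking.cacm-bm25-base.txt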

And that's it! Happy searching!