A Hadoop toolkit for web-scale information retrieval research

This tutorial provides a guide to batch retrieval with Ivory on the venerable TREC disks 4 and 5 document collection, which is distributed by NIST and used in many Text Retrieval Conferences (TRECs). Although the collection is over a decade old, it is still used as a starting point for information retrieval research. This guide will cover both indexing the collection and performing retrieval runs with queries from the TREC 2004 robust track.

Getting the Collection

The first task is to obtain the collection from NIST; we assume you already have it in hand. A standard "view" of the disks ignores the Congressional Record (CR) and Federal Register (FR), so the collection is often written in shorthand as TREC 45 (-CR,FR) or something similar. This view contains a total of 472,525 documents, distributed across a number of files (a complete list of the files is available). Since Hadoop doesn't work well with lots of small files, the first step is to prepare the collection by concatenating all the documents into one large file. This is most easily done with a Perl or Python script. A simple Perl script is available, but it should be very easy to write your own.
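
As an illustration of the concatenation step, here is a minimal Python sketch. It assumes the collection files have already been uncompressed into a directory tree and that the Congressional Record and Federal Register live in subdirectories named CR and FR94; those directory names, the function name, and the paths in the usage comment are assumptions for illustration, so adjust them to match your copy of the disks.

```python
import os
import shutil


def concatenate_collection(root, out_path, skip_dirs=("CR", "FR94")):
    """Append every file under root to out_path, skipping the named subdirectories.

    Returns the number of files that were concatenated.
    """
    out_abs = os.path.abspath(out_path)
    count = 0
    with open(out_path, "wb") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune skipped directories in place and make the traversal order stable.
            dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Never read the output file itself if it sits under root.
                if os.path.abspath(path) == out_abs:
                    continue
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
                # Make sure the next file starts on a fresh line.
                out.write(b"\n")
                count += 1
    return count


# Hypothetical usage; both paths are placeholders:
# concatenate_collection("/data/trec45", "/data/trec4-5_noCRFR.xml")
```

Copying bytes rather than decoded text avoids any guess about the character encoding of the source files.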

Building the Index

Assuming you've completed the getting started tutorial, building the TREC 45 index is pretty straightforward. In fact, the collection is small enough that building an index in Hadoop local mode takes only a short while:

etc/hadoop-local.sh ivory.app.PreprocessTrec45 \
  -collection /shared/collections/trec/trec4-5_noCRFR.xml -index index-trec

etc/hadoop-local.sh ivory.app.BuildIndex \
  -index index-trec -indexPartitions 1 -positionalIndexIP

On a 2012 15" Retina Display MacBook Pro (2.7 GHz Intel Core i7), it takes about 25 minutes for the preprocessing and 10 minutes for the actual inverted indexing.

Alternatively, you can build the index on a real Hadoop cluster:

etc/hadoop-cluster.sh ivory.app.PreprocessTrec45 \
  -collection /shared/collections/trec/trec4-5_noCRFR.xml -index index-trec

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-trec -indexPartitions 1 -positionalIndexIP

After building the index, you should be able to run the retrieval experiments described below and replicate our results.

Basic MRF models

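To demonstrate batch retrieval, we're going to use topics from the TREC 2004 robust track. In the data/trec/ directory, you'll find the following data: the topics (queries.robust04.xml), the relevance judgments (qrels.robust04.noCRFR.txt), and the run configuration files (run.robust04.basic.xml, run.robust04.wsd.xml, run.robust04.basic.lce.xml, and run.robust04.wsd.lce.xml).

The first configuration file, run.robust04.basic.xml, specifies six different models: Dirichlet and bm25 scoring, each under full independence, sequential dependence, and full dependence.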

Info: Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package by running make and place the executable at etc/trec_eval.
Info: If you built the indexes using distributed Hadoop rather than Hadoop local mode, you must copy the indexes out of HDFS before running the following experiments. Also make sure to change the index location in the run.xml files (given in the <index> tag) to the actual index path.

Here are the command-line invocations for running and evaluating the models:

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.basic.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-base.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-fd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-base.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_Basic

description                       tag                 MAP     P10
Dirichlet, full independence      robust04-dir-base   0.3063  0.4424
Dirichlet, sequential dependence  robust04-dir-sd     0.3194  0.4485
Dirichlet, full dependence        robust04-dir-fd     0.3253  0.4576
bm25, full independence           robust04-bm25-base  0.3033  0.4283
bm25, sequential dependence       robust04-bm25-sd    0.3212  0.4505
bm25, full dependence             robust04-bm25-fd    0.3212  0.4545
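
MAP is mean average precision and P10 is precision at rank 10. To make the two columns concrete, here is a small Python sketch that computes both from qrels and ranking lines. It assumes the usual TREC formats (qrels lines of the form "topic iteration docno relevance", ranking lines of the form "topic Q0 docno rank score tag"); all function names are made up for this illustration. Treat it as an explanation of the measures rather than a substitute for trec_eval, whose handling of ties and of topics without relevant documents may differ.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map each topic to the set of docnos judged relevant (relevance > 0)."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _, docno, rel = parts
        if int(rel) > 0:
            relevant[topic].add(docno)
    return relevant


def read_run(lines):
    """Map each topic to its docnos, ordered by descending score."""
    scored = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        topic, _, docno, _, score, _ = parts
        scored[topic].append((float(score), docno))
    return {t: [d for _, d in sorted(pairs, key=lambda p: -p[0])]
            for t, pairs in scored.items()}


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_at(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def evaluate(qrels, run, k=10):
    """Return (MAP, P@k) averaged over topics that have at least one relevant document."""
    topics = [t for t in run if qrels.get(t)]
    if not topics:
        return 0.0, 0.0
    mean_ap = sum(average_precision(run[t], qrels[t]) for t in topics) / len(topics)
    mean_pk = sum(precision_at(run[t], qrels[t], k) for t in topics) / len(topics)
    return mean_ap, mean_pk
```

With real files, the two readers can be fed open file objects directly, for example `read_qrels(open("data/trec/qrels.robust04.noCRFR.txt"))`.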

WSD models

WSD refers to Bendersky et al.'s Weighted Sequential Dependence model (WSDM 2010).

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.wsd.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_WSD

description                            tag                  MAP     P10
Dirichlet, WSD, sequential dependence  robust04-dir-wsd-sd  0.3246  0.4626
Dirichlet, WSD, full dependence        robust04-dir-wsd-fd  0.3283  0.4667

Basic MRF + LCE models

LCE refers to Metzler et al.'s Latent Concept Expansion model (SIGIR 2007).

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.basic.lce.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-rm3-f.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-rm3-s.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-f.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-s.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-bigram.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_Basic_LCE

description                                       tag                         MAP     P10
Dir., full indep., LCE (unigrams) ["RM3"] (fast)  robust04-dir-rm3-f          0.3558  0.4596
Dir., full indep., LCE (unigrams) ["RM3"] (slow)  robust04-dir-rm3-s          0.3557  0.4596
Dir., SD, LCE (unigrams) (fast)                   robust04-dir-sd-lce-f       0.3789  0.4808
Dir., SD, LCE (unigrams) (slow)                   robust04-dir-sd-lce-s       0.3753  0.4657
Dir., SD, LCE (bigrams)                           robust04-dir-sd-lce-bigram  0.3510  0.4535
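
Running trec_eval by hand for each ranking file gets repetitive once a configuration produces several runs. The Python sketch below shows one way to collect the MAP and P10 figures for a list of tags. It assumes trec_eval prints summary lines of the form "measure all value", with the measures named map and P_10, and that ranking files follow the ranking.TAG.txt naming used in the commands above; the function names are hypothetical.

```python
import subprocess


def parse_trec_eval(output, measures=("map", "P_10")):
    """Pull the requested summary measures out of trec_eval's text output."""
    scores = {}
    for line in output.splitlines():
        parts = line.split()
        # Summary lines are assumed to look like: "map  all  0.3063".
        if len(parts) == 3 and parts[1] == "all" and parts[0] in measures:
            scores[parts[0]] = float(parts[2])
    return scores


def evaluate_runs(tags,
                  qrels="data/trec/qrels.robust04.noCRFR.txt",
                  trec_eval="etc/trec_eval"):
    """Run trec_eval on ranking.TAG.txt for each tag and return {tag: scores}."""
    results = {}
    for tag in tags:
        proc = subprocess.run(
            [trec_eval, qrels, "ranking.%s.txt" % tag],
            capture_output=True, text=True, check=True)
        results[tag] = parse_trec_eval(proc.stdout)
    return results


# Hypothetical usage, run from the Ivory root after the retrieval step:
# for tag, scores in evaluate_runs(["robust04-dir-rm3-f", "robust04-dir-rm3-s"]).items():
#     print(tag, scores)
```

Keeping the parsing separate from the subprocess call makes it easy to check the parser against saved trec_eval output without rerunning the experiments.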

WSD + LCE models

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.wsd.lce.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-lce.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_WSD_LCE

description                       tag                   MAP     P10
Dir., WSD, LCE (unigrams) (fast)  robust04-dir-wsd-lce  0.3941  0.4980