Ivory: Experiments

This tutorial provides a guide to batch retrieval with Ivory on the venerable TREC disks 4 and 5 document collection, which is distributed by NIST and used in many Text Retrieval Conferences (TRECs). Although the collection is over a decade old, it is still used as a starting point for information retrieval research. This guide will cover both indexing the collection and performing retrieval runs with queries from the TREC 2004 robust track.

Getting the Collection

The first task is to obtain the collection from NIST; we assume you already have it in hand. A standard "view" of the disks ignores the Congressional Record (CR) and Federal Register (FR), so the collection is often abbreviated as TREC 45 (-CR,FR) or something similar. This view contains a total of 472,525 documents, distributed across a number of files; see the complete list of all files. Since Hadoop doesn't work well with lots of small files, the first step is to prepare the collection by concatenating all the documents into one large file. This is most easily done with a Perl or Python script. See this simple Perl script, though it should be very easy to write your own.
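The concatenation step described above could be sketched in Python roughly as follows. This is a minimal sketch, not the Perl script mentioned above: the function name `concatenate_collection`, the `skip_dirs` parameter, and the assumption that the CR and FR material lives in directories named `cr` and `fr` are all hypothetical and depend on how your copy of the disks is laid out. It also assumes each file is either plain text or gzip-compressed; files compressed with other tools would need to be decompressed first.

```python
import gzip
import shutil
from pathlib import Path


def concatenate_collection(source_dir, output_path, skip_dirs=("cr", "fr")):
    """Concatenate every file under source_dir into one large file.

    Files are appended in sorted path order. Any file that sits below a
    directory whose name matches skip_dirs (case-insensitive) is ignored;
    the default names are an assumption about the disk layout.
    Returns the number of files that were appended.
    """
    source_dir = Path(source_dir)
    output_path = Path(output_path)
    skip = {name.lower() for name in skip_dirs}
    count = 0
    with open(output_path, "wb") as out:
        for path in sorted(source_dir.rglob("*")):
            # Skip directories and the output file itself.
            if not path.is_file() or path.resolve() == output_path.resolve():
                continue
            parent_dirs = path.relative_to(source_dir).parts[:-1]
            if any(part.lower() in skip for part in parent_dirs):
                continue
            # Transparently decompress gzip files; copy everything else as-is.
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)
            # Keep a newline between files so documents never run together.
            out.write(b"\n")
            count += 1
    return count
```

A call such as `concatenate_collection("/path/to/trec45", "trec4-5_noCRFR.xml")` (both paths are placeholders) would then produce the single file that the indexing commands expect.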

Building the Index

Assuming you've completed the getting started tutorial, building the TREC 45 index is pretty straightforward. In fact, the collection is small enough that building an index in Hadoop local mode takes only a short while:

etc/hadoop-local.sh ivory.app.PreprocessTrec45 \
  -collection /shared/collections/trec/trec4-5_noCRFR.xml -index index-trec

etc/hadoop-local.sh ivory.app.BuildIndex \
  -index index-trec -indexPartitions 1 -positionalIndexIP

On a 2012 15" Retina Display MacBook Pro (2.7 GHz Intel Core i7), it takes about 25 minutes for the preprocessing and 10 minutes for the actual inverted indexing.

Alternatively, you can build the index on a real Hadoop cluster:

etc/hadoop-cluster.sh ivory.app.PreprocessTrec45 \
  -collection /shared/collections/trec/trec4-5_noCRFR.xml -index index-trec

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-trec -indexPartitions 1 -positionalIndexIP

After building the index, you should be able to run the retrieval experiments described below and replicate our results.

Basic MRF models

To demonstrate batch retrieval, we're going to use topics from the TREC 2004 robust track. In the data/trec/ directory, you'll find the following data:

data/trec/run.robust04.basic.xml: retrieval models and parameters
data/trec/queries.robust04.xml: queries (TREC 2004 robust track)

The first configuration file specifies six different models:

robust04-dir-base: language modeling framework, Dirichlet prior, simple query likelihood.
robust04-dir-sd: language modeling framework, Dirichlet prior, sequential dependence model using MRFs.
robust04-dir-fd: language modeling framework, Dirichlet prior, full dependence model using MRFs.
robust04-bm25-base: bm25 term weighting, simple bag-of-words queries.
robust04-bm25-sd: sequential dependence model using MRFs, with bm25 term weighting.
robust04-bm25-fd: full dependence model using MRFs, with bm25 term weighting.
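To make the simplest of these models concrete, here is a minimal sketch of Dirichlet-smoothed query likelihood scoring, the idea behind `robust04-dir-base`. It is the textbook formulation, not Ivory's implementation: the function name, the in-memory term lists, and the default `mu=1000.0` are illustrative assumptions, and the smoothing parameter actually used is whatever the run configuration file specifies.

```python
import math
from collections import Counter


def dirichlet_query_likelihood(query_terms, doc_terms, collection_tf,
                               collection_len, mu=1000.0):
    """Log query likelihood of a document under a Dirichlet-smoothed unigram model.

    p(term | doc) = (tf(term, doc) + mu * p(term | collection)) / (|doc| + mu)
    The mu default is an illustrative value, not a setting taken from Ivory.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            # Terms never seen in the collection are skipped in this sketch.
            continue
        p_term = (doc_tf[term] + mu * p_collection) / (doc_len + mu)
        score += math.log(p_term)
    return score
```

Documents are ranked by this score in descending order. The sequential and full dependence variants add features over term pairs on top of these unigram scores.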

Info	Before running the following experiments, make sure you've built the `trec_eval` evaluation package from NIST. For your convenience, v9.0 is included in `etc/trec_eval.9.0.tar.gz`. Build the package with `make` and place the executable at `etc/trec_eval`.

Info	Before running the following experiments, you have to copy the indexes out of HDFS (if you built them on a Hadoop cluster rather than in Hadoop local mode). Also make sure to change the index location in the run configuration files (`data/trec/run.robust04.*.xml`) to the actual index path (in the `<index>` element).

Here are the command-line invocations for running and evaluating the models:

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.basic.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-base.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-fd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-base.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_Basic

description	tag	MAP	P10
Dirichlet, full independence	robust04-dir-base	0.3063	0.4424
Dirichlet, sequential dependence	robust04-dir-sd	0.3194	0.4485
Dirichlet, full dependence	robust04-dir-fd	0.3253	0.4576
bm25, full independence	robust04-bm25-base	0.3033	0.4283
bm25, sequential dependence	robust04-bm25-sd	0.3212	0.4505
bm25, full dependence	robust04-bm25-fd	0.3212	0.4545
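Running `trec_eval` six times and reading the MAP and P10 figures by hand gets tedious, so the evaluation commands above could be wrapped in a small Python helper along these lines. This is a sketch under assumptions: the function names are hypothetical, and the parser relies on `trec_eval` printing one summary line per measure with three whitespace-separated fields (measure name, the literal `all`, and the value), with MAP and P10 reported under the names `map` and `P_10`.

```python
import subprocess


def parse_trec_eval(output, measures=("map", "P_10")):
    """Extract selected summary measures from trec_eval's text output."""
    results = {}
    for line in output.splitlines():
        fields = line.split()
        # Summary lines look like: <measure> all <value>
        if len(fields) == 3 and fields[1] == "all" and fields[0] in measures:
            results[fields[0]] = float(fields[2])
    return results


def evaluate(qrels_path, ranking_path, trec_eval="etc/trec_eval"):
    """Run trec_eval on one ranking file and return its parsed measures."""
    completed = subprocess.run(
        [trec_eval, qrels_path, ranking_path],
        capture_output=True, text=True, check=True,
    )
    return parse_trec_eval(completed.stdout)


def evaluate_all(qrels_path, tags, trec_eval="etc/trec_eval"):
    """Evaluate ranking.<tag>.txt for every tag; returns {tag: measures}."""
    return {
        tag: evaluate(qrels_path, f"ranking.{tag}.txt", trec_eval)
        for tag in tags
    }
```

For example, `evaluate_all("data/trec/qrels.robust04.noCRFR.txt", ["robust04-dir-base", "robust04-bm25-base"])` would return one dictionary of measures per model tag, which can then be compared against the table above.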

WSD models

WSD refers to Bendersky et al.'s Weighted Sequential Dependence model (WSDM 2010).

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.wsd.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_WSD

description	tag	MAP	P10
Dirichlet, WSD, sequential dependence	robust04-dir-wsd-sd	0.3246	0.4626
Dirichlet, WSD, full dependence	robust04-dir-wsd-fd	0.3283	0.4667

Basic MRF + LCE models

LCE refers to Metzler et al.'s Latent Concept Expansion model (SIGIR 2007).

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.basic.lce.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-rm3-f.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-rm3-s.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-f.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-s.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-bigram.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_Basic_LCE

description	tag	MAP	P10
Dir., full indep., LCE (unigrams) ["RM3"] (fast)	robust04-dir-rm3-f	0.3558	0.4596
Dir., full indep., LCE (unigrams) ["RM3"] (slow)	robust04-dir-rm3-s	0.3557	0.4596
Dir., SD, LCE (unigrams) (fast)	robust04-dir-sd-lce-f	0.3789	0.4808
Dir., SD, LCE (unigrams) (slow)	robust04-dir-sd-lce-s	0.3753	0.4657
Dir., SD, LCE (bigrams)	robust04-dir-sd-lce-bigram	0.3510	0.4535

WSD + LCE models

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.wsd.lce.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-lce.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_WSD_LCE

description	tag	MAP	P10
Dir., WSD, LCE (unigrams) (fast)	robust04-dir-wsd-lce	0.3941	0.4980
