A Hadoop toolkit for web-scale information retrieval research
This tutorial provides a guide to batch retrieval with Ivory on the venerable TREC disks 4 and 5 document collection, which is distributed by NIST and used in many Text Retrieval Conferences (TRECs). Although the collection is over a decade old, it is still used as a starting point for information retrieval research. This guide will cover both indexing the collection and performing retrieval runs with queries from the TREC 2004 robust track.
The first task is to obtain the collection from NIST; we assume you already have it in hand. A standard "view" of the disks ignores the Congressional Record (CR) and Federal Register (FR), so the collection is often written in shorthand as TREC 45 (-CR,FR) or something similar. This view contains 472,525 documents, distributed across a number of files; see the complete list of all files. Since Hadoop doesn't work well with lots of small files, the first step is to prepare the collection by concatenating all the documents into one large file. This is most easily done with a Perl or Python script. See this simple Perl script, although it should be very easy to write your own.
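If you'd rather write your own script, here is a minimal Python sketch of the concatenation step. It rests on several assumptions that are not part of Ivory: the files have already been uncompressed, the excluded sub-collections live in directories whose names start with CR or FR (adjust `skip_prefixes` to match your copy of the disks), and the function name `concatenate_collection` is purely illustrative. It also counts `<DOC>` openings as it goes, so you can compare the total against the 472,525 documents expected.

```python
import os


def concatenate_collection(root, out_path, skip_prefixes=("CR", "FR")):
    """Concatenate every file under `root` into one file at `out_path`.

    Assumptions (hypothetical, adapt to your layout): files are uncompressed,
    and directories whose names start with one of `skip_prefixes` hold the
    excluded sub-collections. `out_path` must live outside `root`, otherwise
    the output file would be read back into itself.

    Returns the number of <DOC> openings written, as a sanity check.
    """
    doc_count = 0
    with open(out_path, "wb") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune excluded directories in place so os.walk skips them,
            # and sort so the output order is deterministic.
            dirnames[:] = sorted(
                d for d in dirnames if not d.upper().startswith(skip_prefixes)
            )
            for name in sorted(filenames):
                with open(os.path.join(dirpath, name), "rb") as src:
                    data = src.read()
                doc_count += data.count(b"<DOC>")
                out.write(data)
                # Keep a newline between files so documents never run together.
                if not data.endswith(b"\n"):
                    out.write(b"\n")
    return doc_count
```

Calling `concatenate_collection("/path/to/disks", "/path/to/trec4-5_noCRFR.xml")` should return 472525 if the directory layout matches the assumptions above; any other number suggests that something was skipped or included by mistake.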
Assuming you've completed the getting started tutorial, building the TREC 45 index is pretty straightforward. In fact, the collection is small enough that building an index in Hadoop local mode takes only a short while:
etc/hadoop-local.sh ivory.app.PreprocessTrec45 \
  -collection /shared/collections/trec/trec4-5_noCRFR.xml -index index-trec

etc/hadoop-local.sh ivory.app.BuildIndex \
  -index index-trec -indexPartitions 1 -positionalIndexIP
On a 2012 15" Retina Display MacBook Pro (2.7 GHz Intel Core i7), it takes about 25 minutes for the preprocessing and 10 minutes for the actual inverted indexing.
Alternatively, you can build the index on a real Hadoop cluster:
etc/hadoop-cluster.sh ivory.app.PreprocessTrec45 \
  -collection /shared/collections/trec/trec4-5_noCRFR.xml -index index-trec

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-trec -indexPartitions 1 -positionalIndexIP
After building the index, you should be able to run the retrieval experiments described below and replicate our results.
To demonstrate batch retrieval, we're going to use topics from the TREC 2004 robust track. In the data/trec/ directory, you'll find the following data:
data/trec/run.robust04.basic.xml: retrieval models and parameters
data/trec/queries.robust04.xml: queries (TREC 2004 robust track)
The first configuration file specifies six different models, which are listed with their results in the first table below.
Info | Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package by running make and place the executable at etc/trec_eval. |
Info | Before running the following experiments, you have to copy the indexes out of HDFS (if you built the indexes using distributed Hadoop and not Hadoop local). Also make sure to change the index location in the run.xml files to the actual index path (the value given under <index>). |
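If you have several run files to update, the index path can be changed with a short script instead of by hand. The Python sketch below makes an assumption that you should verify against your own files: that the index location is stored as the text of an `<index>` element (for example `<index>/some/path</index>`). If your run.xml files store it differently, this will not work as written. The function name `set_index_path` is hypothetical and not part of Ivory.

```python
import xml.etree.ElementTree as ET


def set_index_path(run_xml_path, new_index_path):
    """Point every <index> element in a run file at `new_index_path`.

    Assumption (verify against your files): the index location is the text
    content of an <index> element. Returns the number of elements updated.
    Note that ElementTree's default parser drops XML comments on rewrite.
    """
    tree = ET.parse(run_xml_path)
    updated = 0
    for elem in tree.getroot().iter("index"):
        elem.text = new_index_path
        updated += 1
    # Overwrite the file in place; keep a backup if the original matters.
    tree.write(run_xml_path, encoding="utf-8", xml_declaration=True)
    return updated
```

A return value of 0 means that no `<index>` element was found, which is a good hint that the file is laid out differently from what this sketch assumes.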
Here are the command-line invocations for running and evaluating the models:
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.basic.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-base.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-fd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-base.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_Basic
description | tag | MAP | P10 |
Dirichlet, full independence | robust04-dir-base | 0.3063 | 0.4424 |
Dirichlet, sequential dependence | robust04-dir-sd | 0.3194 | 0.4485 |
Dirichlet, full dependence | robust04-dir-fd | 0.3253 | 0.4576 |
bm25, full independence | robust04-bm25-base | 0.3033 | 0.4283 |
bm25, sequential dependence | robust04-bm25-sd | 0.3212 | 0.4505 |
bm25, full dependence | robust04-bm25-fd | 0.3212 | 0.4545 |
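To collect the MAP and P10 figures without reading each trec_eval report by eye, you can parse its output. The Python sketch below assumes the usual trec_eval summary layout of three whitespace-separated columns (measure name, the query id `all`, and the value), with MAP reported as `map` and P10 as `P_10`. The function name `parse_trec_eval` and the sample text are illustrative; the sample reuses the robust04-dir-base values from the table above rather than real program output.

```python
def parse_trec_eval(output, measures=("map", "P_10")):
    """Extract selected summary measures from trec_eval's text output.

    Assumes each summary line has three whitespace-separated fields:
    measure name, query id ("all" for the summary), and the value.
    """
    scores = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1] == "all" and fields[0] in measures:
            scores[fields[0]] = float(fields[2])
    return scores


# Illustrative input shaped like trec_eval output (values from the table above).
sample = "runid\tall\trobust04-dir-base\nmap\tall\t0.3063\nP_10\tall\t0.4424\n"
print(parse_trec_eval(sample))  # {'map': 0.3063, 'P_10': 0.4424}
```

In practice you could obtain `output` with `subprocess.run(["etc/trec_eval", qrels_path, ranking_path], capture_output=True, text=True).stdout`, using the same two arguments as the command lines above.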
WSD refers to Bendersky et al.'s Weighted Sequential Dependence model (WSDM 2010).
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.wsd.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-sd.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_WSD
description | tag | MAP | P10 |
Dirichlet, WSD, sequential dependence | robust04-dir-wsd-sd | 0.3246 | 0.4626 |
Dirichlet, WSD, full dependence | robust04-dir-wsd-fd | 0.3283 | 0.4667 |
LCE refers to Metzler et al.'s Latent Concept Expansion model (SIGIR 2007).
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.basic.lce.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-rm3-f.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-rm3-s.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-f.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-s.txt
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-sd-lce-bigram.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_Basic_LCE
description | tag | MAP | P10 |
Dir., full indep., LCE (unigrams) ["RM3"] (fast) | robust04-dir-rm3-f | 0.3558 | 0.4596 |
Dir., full indep., LCE (unigrams) ["RM3"] (slow) | robust04-dir-rm3-s | 0.3557 | 0.4596 |
Dir., SD, LCE (unigrams) (fast) | robust04-dir-sd-lce-f | 0.3789 | 0.4808 |
Dir., SD, LCE (unigrams) (slow) | robust04-dir-sd-lce-s | 0.3753 | 0.4657 |
Dir., SD, LCE (bigrams) | robust04-dir-sd-lce-bigram | 0.3510 | 0.4535 |
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/trec/run.robust04.wsd.lce.xml data/trec/queries.robust04.xml

# evaluating effectiveness
etc/trec_eval data/trec/qrels.robust04.noCRFR.txt ranking.robust04-dir-wsd-lce.txt

# junit
etc/junit.sh ivory.regression.basic.Robust04_WSD_LCE
description | tag | MAP | P10 |
Dir., WSD, LCE (unigrams) (fast) | robust04-dir-wsd-lce | 0.3941 | 0.4980 |
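If you want a quick check that your own runs replicate the numbers reported in the tables on this page, the following Python sketch compares a run's MAP and P10 against them. This is an informal check only and is not how the bundled JUnit regression tests work. The names `EXPECTED` and `check_run` are hypothetical, and the tolerance of 0.00005 is an assumption chosen to match four-decimal rounding.

```python
# (MAP, P10) per run tag, copied from the tables on this page.
EXPECTED = {
    "robust04-dir-base": (0.3063, 0.4424),
    "robust04-dir-sd": (0.3194, 0.4485),
    "robust04-dir-fd": (0.3253, 0.4576),
    "robust04-bm25-base": (0.3033, 0.4283),
    "robust04-bm25-sd": (0.3212, 0.4505),
    "robust04-bm25-fd": (0.3212, 0.4545),
    "robust04-dir-wsd-sd": (0.3246, 0.4626),
    "robust04-dir-wsd-fd": (0.3283, 0.4667),
    "robust04-dir-rm3-f": (0.3558, 0.4596),
    "robust04-dir-rm3-s": (0.3557, 0.4596),
    "robust04-dir-sd-lce-f": (0.3789, 0.4808),
    "robust04-dir-sd-lce-s": (0.3753, 0.4657),
    "robust04-dir-sd-lce-bigram": (0.3510, 0.4535),
    "robust04-dir-wsd-lce": (0.3941, 0.4980),
}


def check_run(tag, map_score, p10, tolerance=5e-5):
    """Return True if both scores match the table within `tolerance`.

    The default tolerance is an assumption that allows for rounding to four
    decimal places; raises KeyError for a tag that is not in the tables.
    """
    expected_map, expected_p10 = EXPECTED[tag]
    return (
        abs(map_score - expected_map) <= tolerance
        and abs(p10 - expected_p10) <= tolerance
    )
```

For example, `check_run("robust04-dir-wsd-lce", 0.3941, 0.4980)` returns True, while a run whose MAP came out at 0.3900 would return False and deserve a closer look at the index or the run configuration.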