A Hadoop toolkit for web-scale information retrieval research
This tutorial provides a guide to batch retrieval with Ivory on the first English segment of the ClueWeb09 collection, a modern web collection distributed by Carnegie Mellon University and used in many Text REtrieval Conference (TREC) evaluations. This guide covers both indexing the collection and performing retrieval runs with queries from the TREC 2009 web track.
Tip | The procedure for preparing and indexing the ClueWeb collection is similar to that for the Gov2 collection, which is described in a separate tutorial, so it might be a good idea to complete that tutorial first. |
In total, there are 503,903,810 pages in the English portion of the ClueWeb09 collection. The English data is distributed in ten parts (called segments), each corresponding to a directory. The first segment is commonly known as "category B" and corresponds to the contents of the directory ClueWeb09_English_1/. It contains a total of 1,492 files, with 50,220,423 web pages.
It's easiest to work with the collection as block-compressed SequenceFiles, so you'll want to first repack the distribution WARC files. There's a program in Cloud9 for repacking the collection:
hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.clue.RepackClueWarcRecords \
  -libjars lib/guava-X.X.X.jar \
  /shared/collections/ClueWeb09/collection.raw clueweb09catB 1 docno-mapping.dat block
Replace the X.X.X with the actual latest version of the jars. The first command-line argument is the base path of your ClueWeb09 distribution; the second is the output path; the third is the segment number; the fourth is the docno mapping data file, which is here (put it on HDFS); the fifth is "block" to specify block-level compression.
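Once the job completes, a quick sanity check is to list the repacked output on HDFS and confirm it's non-empty; the paths below simply follow the invocation above:

# list the repacked SequenceFiles and check the space they occupy
hadoop fs -ls clueweb09catB
hadoop fs -du clueweb09catB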
Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC and other collections:
etc/hadoop-cluster.sh ivory.app.PreprocessClueWebEnglish \
  -collection clueweb09catB -index index-clueweb09catB -segment 1

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-clueweb09catB -indexPartitions 200 -positionalIndexIP
Info | Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package with make and place the executable at etc/trec_eval. |
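A minimal build sequence might look like the following; the unpacked directory name is an assumption, so adjust it if your tarball extracts elsewhere:

# unpack, compile, and install the trec_eval binary where Ivory expects it
tar xzf etc/trec_eval.9.0.tar.gz
cd trec_eval.9.0    # assumed directory name inside the tarball
make
cp trec_eval ../etc/trec_eval
cd ..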
Info | Before running the following experiments, you must first copy the indexes out of HDFS. Also make sure to change the index location in data/clue/run.web09catB.xml and the other model specification files to the actual index path (under the <index> attribute). |
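For example, to pull the index built above onto the local filesystem (the destination path here is purely illustrative):

# copy the index out of HDFS, then point <index> in the model files at the destination
hadoop fs -get index-clueweb09catB /scratch/indexes/index-clueweb09catB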
To demonstrate batch retrieval, we're going to use topics from the TREC 2009 web track. The two retrieval models are bm25 and query likelihood (from language modeling) with Dirichlet smoothing. Here are the command-line invocations for running and evaluating the models:
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/clue/run.web09catB.xml data/clue/queries.web09.xml

# evaluating effectiveness
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB-bm25.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB-ql.txt

# junit
etc/junit.sh ivory.regression.basic.Web09catB_Baseline
description | tag | MAP | P10 |
--- | --- | --- | --- |
bm25 | UMHOO-BM25-catB | 0.2051 | 0.3720 |
QL | UMHOO-QL-catB | 0.1931 | 0.3380 |
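The MAP and P10 columns correspond to the map and P_10 measures in trec_eval's output; evaluating the bm25 run, for instance, should print lines along these lines among the many other measures:

map                   all   0.2051
P_10                  all   0.3720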
Dependence Models
These runs contrast baseline models with dependence models (Dirichlet vs. bm25 term weighting). SD is Metzler and Croft's Sequential Dependence model (SIGIR 2005), and WSD is Bendersky et al.'s Weighted Sequential Dependence model (WSDM 2010). Note that the SD model is not trained, since it has hard-coded parameters. On the other hand, the WSD model is trained with all queries from TREC 2009 (optimizing StatMAP), which makes the WSD figures unrealistically high, since we're testing on the training set.
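To make concrete why SD needs no training: the sequential dependence model scores a document with a fixed linear combination of unigram, ordered-window, and unordered-window features over adjacent query terms. A sketch in the usual notation, with the default weights commonly cited in the literature (not necessarily Ivory's exact values):

score(Q,D) = \lambda_T \sum_{q \in Q} f_T(q,D) + \lambda_O \sum_{i} f_O(q_i,q_{i+1},D) + \lambda_U \sum_{i} f_U(q_i,q_{i+1},D), with (\lambda_T, \lambda_O, \lambda_U) = (0.85, 0.10, 0.05)

WSD replaces the fixed weights with per-concept weights learned from training data, which is why it must be trained.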
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/clue/run.web09catB.all.xml data/clue/queries.web09.xml

# evaluating effectiveness
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.ql.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.ql.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.ql.wsd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.bm25.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.bm25.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.bm25.wsd.txt

# junit
etc/junit.sh ivory.regression.basic.Web09catB_All
description | tag | MAP | P10 |
--- | --- | --- | --- |
Dirichlet | ql-base | 0.1931 | 0.3380 |
Dirichlet + SD | ql-sd | 0.2048 | 0.3620 |
Dirichlet + WSD | ql-wsd | 0.2120 | 0.3580 |
bm25 | bm25-base | 0.2051 | 0.3720 |
bm25 + SD | bm25-sd | 0.2188 | 0.3920 |
bm25 + WSD | bm25-wsd | 0.2205 | 0.3940 |
These runs are the same as the set above, except they also take advantage of Waterloo spam scores (simple linear interpolation). The training process started with the above models, and then the spam weight was tuned. Note that these figures are all unrealistically high, since we're testing on the training set.
Running these models will require the Waterloo spam scores, packed in a manner usable by Ivory. They can be found here.
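Concretely, "simple linear interpolation" means the final document score mixes the retrieval-model score with the document's spam score; a sketch of the idea, where \beta is our label for the tuned spam weight and Ivory's exact formulation may differ:

score'(Q,D) = (1 - \beta) \cdot score(Q,D) + \beta \cdot spam(D)

Here spam(D) is the Waterloo percentile score for the document, with higher percentiles indicating content less likely to be spam.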
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/clue/run.web09catB.all.spam.xml data/clue/queries.web09.xml

# evaluating effectiveness
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.ql.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.ql.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.ql.wsd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.bm25.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.bm25.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.bm25.wsd.txt

# junit
etc/junit.sh ivory.regression.basic.Web09catB_All_Spam
description | tag | MAP | P10 |
--- | --- | --- | --- |
Dirichlet | ql-base | 0.2134 | 0.4540 |
Dirichlet + SD | ql-sd | 0.2223 | 0.4560 |
Dirichlet + WSD | ql-wsd | 0.2283 | 0.4160 |
bm25 | bm25-base | 0.2167 | 0.4220 |
bm25 + SD | bm25-sd | 0.2280 | 0.4420 |
bm25 + WSD | bm25-wsd | 0.2290 | 0.4340 |