A Hadoop toolkit for web-scale information retrieval research

This tutorial provides a guide to batch retrieval with Ivory on the first English segment of the ClueWeb09 collection, a modern web collection distributed by Carnegie Mellon University and used in many Text REtrieval Conference (TREC) evaluations. The guide covers both indexing the collection and performing retrieval runs with queries from the TREC 2009 web track.

Tip The procedure for preparing and indexing the ClueWeb collection is similar to that for the Gov2 collection, which is described in a separate tutorial, so it might be a good idea to complete that tutorial first.

Building the Index

In total, there are 503,903,810 pages in the English portion of the ClueWeb09 collection. The English data is distributed in ten parts (called segments), each corresponding to a directory. The first segment is commonly known as "category B" and corresponds to the contents of the directory ClueWeb09_English_1/. This segment contains 1,492 files, with a total of 50,220,423 web pages.

It's easiest to work with the collection as block-compressed SequenceFiles, so you'll first want to repack the distribution's WARC files. Cloud9 provides a program for repacking the collection:

hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.clue.RepackClueWarcRecords \
  -libjars lib/guava-X.X.X.jar \
  /shared/collections/ClueWeb09/collection.raw clueweb09catB 1 docno-mapping.dat block

Replace X.X.X with the actual versions of the jars. The command-line arguments are, in order: the base path of your ClueWeb09 distribution, the output path, the segment number, the docno mapping data file (available here; put it on HDFS), and "block" to specify block-level compression.

Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC and other collections:

etc/hadoop-cluster.sh ivory.app.PreprocessClueWebEnglish \
  -collection clueweb09catB -index index-clueweb09catB -segment 1

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-clueweb09catB -indexPartitions 200 -positionalIndexIP

Info Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Unpack the tarball, build the package with make, and place the resulting executable at etc/trec_eval.
Info Before running the following experiments, you also have to copy the indexes out of HDFS (e.g., with hadoop fs -get). Make sure to change the index location in data/clue/run.web09catB.xml and the other model specification files to the actual index path (under the <index> attribute).
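For reference, the index location in a model specification file might look something like the fragment below. The surrounding element layout is illustrative, not the exact schema; check the actual file, and substitute your own local path:

```xml
<parameters>
  <!-- illustrative: point this at the local copy of the index, not the HDFS path -->
  <index>/path/to/local/index-clueweb09catB</index>
</parameters>
```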

Baseline models

To demonstrate batch retrieval, we're going to use topics from the TREC 2009 web track. The two retrieval models are BM25 and query likelihood (from language modeling) with Dirichlet smoothing. Here are the command-line invocations for running and evaluating the models:

# command-line 
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/clue/run.web09catB.xml data/clue/queries.web09.xml

# evaluating effectiveness
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB-bm25.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB-ql.txt

# junit
etc/junit.sh ivory.regression.basic.Web09catB_Baseline
description   tag               MAP      P10
bm25          UMHOO-BM25-catB   0.2051   0.3720
QL            UMHOO-QL-catB     0.1931   0.3380
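To make the two baseline scoring functions concrete, here is a minimal sketch of how each model scores a single query term. This is not Ivory's implementation; the function names, parameter defaults (mu, k1, b), and the statistics passed in are illustrative.

```python
import math

def dirichlet_ql(tf, doc_len, cf, coll_len, mu=2500.0):
    """Dirichlet-smoothed query-likelihood log score for one query term.
    tf: term frequency in the document; cf: collection frequency."""
    p_coll = cf / coll_len  # background (collection) model probability
    return math.log((tf + mu * p_coll) / (doc_len + mu))

def bm25(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Okapi BM25 score for one query term.
    df: document frequency; num_docs: number of documents in the collection."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm
```

In both models, a document's score for a query is the sum of these per-term scores over the query terms.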

Dependence Models

These runs contrast the baseline models with dependence models, under both Dirichlet and BM25 term weighting. SD is Metzler and Croft's Sequential Dependence model (SIGIR 2005), and WSD is Bendersky et al.'s Weighted Sequential Dependence model (WSDM 2010). Note that the SD model is not trained, since its parameters are hard-coded. The WSD model, on the other hand, is trained on all queries from TREC 2009 (optimizing StatMAP), which makes the WSD figures unrealistically high, since we're testing on the training set.
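For intuition, the SD model scores a query using three clique types: individual terms, exact ordered bigrams, and unordered windows over adjacent query-term pairs, combined with fixed weights (0.85/0.10/0.05 in the original paper). The sketch below decomposes a query accordingly, using Indri-style operator notation purely for illustration; it is not Ivory's internal representation.

```python
def sd_cliques(query_terms, lambdas=(0.85, 0.10, 0.05)):
    """Decompose a query into SD's three feature classes: unigrams,
    exact ordered bigrams (#1), and unordered windows (#uw8)."""
    lam_t, lam_o, lam_u = lambdas
    pairs = list(zip(query_terms, query_terms[1:]))  # adjacent term pairs
    unigrams = [(lam_t, t) for t in query_terms]
    ordered = [(lam_o, "#1(%s %s)" % pair) for pair in pairs]
    windows = [(lam_u, "#uw8(%s %s)" % pair) for pair in pairs]
    return unigrams + ordered + windows
```

For the query "white house tours" this yields three unigram cliques, two ordered-bigram cliques, and two unordered-window cliques, each weighted by its class's lambda.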

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/clue/run.web09catB.all.xml data/clue/queries.web09.xml

# evaluating effectiveness
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.ql.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.ql.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.ql.wsd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.bm25.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.bm25.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.all.bm25.wsd.txt

# junit
etc/junit.sh ivory.regression.basic.Web09catB_All
description      tag         MAP      P10
Dirichlet        ql-base     0.1931   0.3380
Dirichlet + SD   ql-sd       0.2048   0.3620
Dirichlet + WSD  ql-wsd      0.2120   0.3580
bm25             bm25-base   0.2051   0.3720
bm25 + SD        bm25-sd     0.2188   0.3920
bm25 + WSD       bm25-wsd    0.2205   0.3940

Dependence Models + Waterloo spam scores

These runs are the same as the set above, except that they also take advantage of Waterloo spam scores through simple linear interpolation. Training started from the models above, and then the spam weight was tuned. Note that these figures are all unrealistically high, since we're testing on the training set.
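The interpolation step can be sketched as follows. This is a hypothetical helper, not Ivory's code; it assumes Waterloo spam scores are percentiles from 0 to 99 (higher meaning less likely to be spam), and the weight lam is the parameter that gets tuned.

```python
def interpolate(retrieval_score, spam_percentile, lam=0.9):
    """Linearly interpolate a document's retrieval score with its
    Waterloo spam percentile (0-99; higher = less spammy).
    lam weights the retrieval score; (1 - lam) weights the spam signal."""
    return lam * retrieval_score + (1.0 - lam) * (spam_percentile / 99.0)
```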

Running these models requires the Waterloo spam scores, packaged in a manner usable by Ivory. They can be found here.

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal data/clue/run.web09catB.all.spam.xml data/clue/queries.web09.xml

# evaluating effectiveness
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.ql.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.ql.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.ql.wsd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.bm25.base.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.bm25.sd.txt
etc/trec_eval data/clue/qrels.web09catB.txt ranking.web09catB.spam.bm25.wsd.txt

# junit
etc/junit.sh ivory.regression.basic.Web09catB_All_Spam
description      tag         MAP      P10
Dirichlet        ql-base     0.2134   0.4540
Dirichlet + SD   ql-sd       0.2223   0.4560
Dirichlet + WSD  ql-wsd      0.2283   0.4160
bm25             bm25-base   0.2167   0.4220
bm25 + SD        bm25-sd     0.2280   0.4420
bm25 + WSD       bm25-wsd    0.2290   0.4340