A Hadoop toolkit for web-scale information retrieval research

This tutorial provides a guide to batch retrieval with Ivory on Wt10g, a document collection distributed by the University of Glasgow and used in many Text Retrieval Conferences (TRECs). The collection is somewhat dated and small by modern web collection standards, but it is still useful for experimentation. For newer, larger Web data sets, please see the guides for getting started with Gov2 and ClueWeb. This guide covers both indexing the collection and performing retrieval runs with queries from the 2000 and 2001 TREC Web tracks.

Tip The procedure for preparing and indexing the Wt10g collection is similar to that for TREC disks 4-5, which is described in a separate tutorial, so it might be a good idea to complete that tutorial first.

Building the Index

The first task is to obtain the collection (from the University of Glasgow). There are a total of 1,692,096 documents in the collection. The data is distributed in 104 directories (WTX001/ to WTX104/). Each directory except the last contains 50 files (B01.gz to B50.gz); the last directory contains just 7 files (B01.gz to B07.gz). That gives 103 × 50 + 7 = 5,157 input files in total. Each file consists of multiple web pages stored in an SGML format known as TREC web format.
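Before uploading anything, it can help to confirm that a local copy of the collection matches the layout described above. The following is a minimal sketch in POSIX shell; the function names `wt10g_expected_files` and `wt10g_missing_files` are made up for illustration and are not part of Ivory or Cloud9, and the collection root passed to the second function is whatever local path you happen to use.

```shell
# Print the relative path of every file expected in the raw Wt10g layout:
# WTX001..WTX103 hold B01.gz..B50.gz, and WTX104 holds B01.gz..B07.gz.
wt10g_expected_files() {
  dir=1
  while [ "$dir" -le 104 ]; do
    # Only the last directory is short.
    if [ "$dir" -eq 104 ]; then last=7; else last=50; fi
    file=1
    while [ "$file" -le "$last" ]; do
      printf 'WTX%03d/B%02d.gz\n' "$dir" "$file"
      file=$((file + 1))
    done
    dir=$((dir + 1))
  done
}

# Print each expected file that is absent under the collection root.
# $1 is the root of a local copy of the collection (hypothetical path).
wt10g_missing_files() {
  wt10g_expected_files | while read -r rel; do
    [ -f "$1/$rel" ] || printf '%s\n' "$rel"
  done
}
```

Running `wt10g_expected_files | wc -l` should report 5157 lines, and `wt10g_missing_files /path/to/collection.raw` should print nothing if the copy is complete.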

The collection is distributed as a large number of relatively small files, which is not something that Hadoop handles well. It's easiest to work with the collection as block-compressed SequenceFiles, so you'll want to first repack the original Wt10g files. There's a program in Cloud9 for repacking the collection:

hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
  -libjars lib/guava-X.X.X.jar \
  -collection /shared/collections/wt10g/collection.raw/ \
  -output wt10g-repacked \
  -compressionType block

Replace X.X.X with the actual latest version of each jar. The -collection option specifies the base path of the raw collection (described above). The -output option specifies where the repacked output goes. The -compressionType option specifies the compression type, which is block compression in this case.
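One way to avoid hand-editing the X.X.X placeholders is to look up whichever jars are actually present in lib/ and print the resulting command for review. This is only a sketch: `find_jar` and `repack_command` are hypothetical helper names, and picking the last jar in sorted order is a rough stand-in for "latest" that may be wrong if several versions are present.

```shell
# Print the path of the jar in lib/ whose name starts with the given prefix.
# If several versions match, the last one in sorted order is chosen.
find_jar() {
  ls lib/"$1"-*.jar 2>/dev/null | sort | tail -n 1
}

# Print (but do not run) the repacking command.
# $1 = raw collection path, $2 = output path for the repacked collection.
repack_command() {
  cloud9_jar=$(find_jar cloud9)
  guava_jar=$(find_jar guava)
  if [ -z "$cloud9_jar" ] || [ -z "$guava_jar" ]; then
    echo "could not find a cloud9 or guava jar under lib/" >&2
    return 1
  fi
  printf 'hadoop jar %s edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection' "$cloud9_jar"
  printf ' -libjars %s' "$guava_jar"
  printf ' -collection %s -output %s -compressionType block\n' "$1" "$2"
}
```

For example, `repack_command /shared/collections/wt10g/collection.raw/ wt10g-repacked` prints a single-line version of the command shown earlier, which can be inspected and then pasted into the shell.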

Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC disks 4-5 and the other collections:

etc/hadoop-cluster.sh ivory.app.PreprocessWt10g \
  -collection wt10g-repacked -index index-wt10g

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-wt10g -indexPartitions 10 -positionalIndexIP

Retrieval Experiments

To demonstrate batch retrieval, we're going to use topics from the 2000 and 2001 TREC Web tracks. In the data/wt10g/ directory, you'll find the following data:

The first configuration file specifies six different models:

Info Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package by running make, and place the executable at etc/trec_eval.
Info Before running the following experiments, you have to copy the indexes out of HDFS. Also make sure to change the index location in data/wt10g/run.wt10g.basic.xml and the other model specification files to the actual index path (in the <index> element).
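The two preparation steps in the note above might look like the following sketch. It assumes that each model specification file stores the index location on a single line of the form `<index>...</index>`; check your copy of the file before relying on that. The helper name `set_index_path` and the local path /scratch/index-wt10g are hypothetical.

```shell
# Copy the index out of HDFS to local disk (hypothetical destination path):
#   hadoop fs -get index-wt10g /scratch/index-wt10g

# Print a copy of a model specification file with the contents of its
# <index> element replaced. Assumes the form <index>...</index> on one line.
# $1 = XML file, $2 = new index path (must not contain '|' or '&').
set_index_path() {
  sed "s|<index>[^<]*</index>|<index>$2</index>|" "$1"
}
```

The function writes to standard output rather than editing in place, so the result can be reviewed first, for example with `set_index_path data/wt10g/run.wt10g.basic.xml /scratch/index-wt10g > run.local.xml`.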

Here are the command-line invocations for running and evaluating the models:

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/wt10g/run.wt10g.basic.xml data/wt10g/queries.wt10g.451-500.xml data/wt10g/queries.wt10g.501-550.xml

# evaluating effectiveness
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-base.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-sd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-fd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-base.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-sd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Wt10g_Basic

description                      tag               MAP     P10
Dirichlet, full independence     wt10g-dir-base    0.2093  0.3131
Dirichlet, sequential dependence wt10g-dir-sd      0.2187  0.3192
Dirichlet, full dependence       wt10g-dir-fd      0.2205  0.3242
bm25, full independence          wt10g-bm25-base   0.2105  0.3202
bm25, sequential dependence      wt10g-bm25-sd     0.2248  0.3333
bm25, full dependence            wt10g-bm25-fd     0.2226  0.3394
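
The six evaluation commands differ only in the model tag, so they can be driven from a loop that also restricts trec_eval's output to the two measures reported here. This is a sketch rather than part of Ivory: the function name `evaluate_wt10g` and the `TREC_EVAL` override variable are made up for illustration, and the `-m map -m P.10` measure selection assumes the trec_eval 9.0 build described earlier.

```shell
# Tags of the six models, as they appear in the ranking file names.
WT10G_TAGS="dir-base dir-sd dir-fd bm25-base bm25-sd bm25-fd"

# Path to the trec_eval binary; can be overridden for a dry run.
TREC_EVAL="${TREC_EVAL:-etc/trec_eval}"

# Evaluate every ranking file, printing only MAP and P10 for each model.
evaluate_wt10g() {
  for tag in $WT10G_TAGS; do
    echo "== wt10g-$tag"
    "$TREC_EVAL" -m map -m P.10 data/wt10g/qrels.wt10g.all "ranking.wt10g-$tag.txt"
  done
}
```

Setting `TREC_EVAL=echo` before calling `evaluate_wt10g` prints the arguments that would be passed instead of running the evaluation, which is a cheap way to check the file names.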