A Hadoop toolkit for web-scale information retrieval research

This tutorial is a guide to batch retrieval with Ivory on Gov2, a document collection distributed by the University of Glasgow and used in many Text Retrieval Conference (TREC) evaluations. Users interested in experimenting with a newer, larger web data set should see the guide for getting started with ClueWeb09. This guide covers both indexing the collection and performing retrieval runs with queries from the 2004, 2005, and 2006 TREC terabyte tracks.

Tip: The procedure for preparing and indexing the Gov2 collection is similar to that for TREC disks 4-5, which is described in a separate tutorial, so it might be a good idea to complete that tutorial first.

Building the Index

The first task is to obtain the collection (from the University of Glasgow). The collection contains a total of 25,205,179 documents. The data is distributed in 273 directories (GX000/ to GX272/), containing a total of 27,204 input files. Each file consists of multiple web pages stored in an SGML format known as TREC web format. The entire collection is 81 GB compressed (426 GB uncompressed). See this page for additional details.
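
Before repacking, it can be worth checking that a local copy of the collection matches the directory and file counts quoted above. The following is a minimal sketch, not part of Ivory or Cloud9: the function names are hypothetical, and it assumes the raw collection sits on a local filesystem as a root directory containing GX000/ through GX272/.

```shell
#!/usr/bin/env bash
# Sanity-check a local copy of Gov2 against the counts quoted in the text.
# ASSUMPTION: the collection root directly contains GX000/ ... GX272/.
# These helper names are hypothetical; they are not part of Ivory or Cloud9.

# Print the number of GX* directories directly under the collection root.
count_gov2_dirs() {
  (cd "$1" && find . -mindepth 1 -maxdepth 1 -type d -name 'GX*' | wc -l | tr -d ' ')
}

# Print the number of regular files stored inside the GX* directories.
count_gov2_files() {
  (cd "$1" && find . -mindepth 2 -type f -path './GX*' | wc -l | tr -d ' ')
}

# Report both counts; succeed only if they match 273 directories and 27,204 files.
check_gov2() {
  local root="$1" dirs files
  dirs=$(count_gov2_dirs "$root")
  files=$(count_gov2_files "$root")
  echo "directories: $dirs (expected 273)"
  echo "input files: $files (expected 27204)"
  [ "$dirs" -eq 273 ] && [ "$files" -eq 27204 ]
}
```

After sourcing these functions, a call such as `check_gov2 /path/to/gov2-corpus` (path hypothetical) prints both counts and returns a non-zero status if either one differs.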

The collection is distributed as a large number of relatively small files, which is not something that Hadoop handles well. It's easiest to work with the collection as block-compressed SequenceFiles, so you'll want to first repack the original Gov2 files. There's a program in Cloud9 for repacking the collection:

hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
  -libjars lib/guava-X.X.X.jar \
  -collection /shared/collections/gov2/collection.raw/gov2-corpus/ \
  -output gov2-repacked \
  -compressionType block

Replace X.X.X with the actual latest version of each jar. The -collection option specifies the base path of the raw collection (described above). The -output option specifies where the repacked output goes. The -compressionType option specifies the compression type, which is block compression in this case.

Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC disks 4-5 and all other collections:

etc/hadoop-cluster.sh ivory.app.PreprocessGov2 \
  -collection gov2-repacked -index index-gov2

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-gov2 -indexPartitions 100 -positionalIndexIP

Retrieval Experiments

To demonstrate batch retrieval, we're going to use topics from the 2004, 2005, and 2006 TREC terabyte tracks. In the docs/data/gov2/ directory, you'll find the following data:

The first configuration file specifies six different models:

Info: Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package with make and place the executable at etc/trec_eval.
Info: Before running the following experiments, you have to copy the indexes out of HDFS. Also make sure to change the index location in data/gov2/run.gov2.basic.xml and the other model specification files to the actual index path (in the <index> element).
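
The two preparation steps in the second note can be sketched as follows. This is an illustration under stated assumptions, not something shipped with Ivory: `set_index_path` is a hypothetical helper, the local destination path is made up, and the sketch assumes the index location appears on a single line in the form `<index>PATH</index>`. Copying out of HDFS uses the standard `hadoop fs -get` command.

```shell
#!/usr/bin/env bash
# Point a model specification file at a local copy of the index.
# ASSUMPTION: the file stores the location on one line as <index>PATH</index>.
# set_index_path is a hypothetical helper, not part of Ivory.
set_index_path() {
  local file="$1" path="$2"
  # '|' is the sed delimiter, so the path must not contain '|' or '&'.
  sed "s|<index>[^<]*</index>|<index>${path}</index>|" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}

# Typical use (the local path is hypothetical):
#   hadoop fs -get index-gov2 /local/indexes/index-gov2
#   set_index_path data/gov2/run.gov2.basic.xml /local/indexes/index-gov2
```

The same helper can be applied to each of the other model specification files in turn.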

Here are the command-line invocations for running and evaluating the models:

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/gov2/run.gov2.basic.xml data/gov2/queries.gov2.title.701-775.xml data/gov2/queries.gov2.title.776-850.xml

# evaluating effectiveness
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-fd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Gov2_Basic

description                       tag             MAP     P10
Dirichlet, full independence      gov2-dir-base   0.3077  0.5631
Dirichlet, sequential dependence  gov2-dir-sd     0.3239  0.6007
Dirichlet, full dependence        gov2-dir-fd     0.3237  0.5933
bm25, full independence           gov2-bm25-base  0.2999  0.5846
bm25, sequential dependence       gov2-bm25-sd    0.3294  0.6081
bm25, full dependence             gov2-bm25-fd    0.3295  0.6094
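
The six evaluation commands above can also be wrapped in a loop that prints one line per model with its MAP and P10 values. This is a rough sketch rather than an Ivory script: `summarize_runs` and the `TREC_EVAL` and `QRELS` variables are hypothetical names, and it assumes trec_eval reports these measures on lines whose first field is `map` or `P_10` and whose third field is the score.

```shell
#!/usr/bin/env bash
# Summarize MAP and P10 for the six Gov2 runs.
# ASSUMPTION: trec_eval prints lines of the form "<measure> all <value>",
# with the measures named "map" and "P_10".
GOV2_TAGS="gov2-dir-base gov2-dir-sd gov2-dir-fd gov2-bm25-base gov2-bm25-sd gov2-bm25-fd"
TREC_EVAL="${TREC_EVAL:-etc/trec_eval}"
QRELS="${QRELS:-data/gov2/qrels.gov2.all}"

# Print "tag MAP P10" for each run file named ranking.<tag>.txt.
summarize_runs() {
  local tag
  for tag in $GOV2_TAGS; do
    "$TREC_EVAL" "$QRELS" "ranking.$tag.txt" |
      awk -v tag="$tag" '
        $1 == "map"  { map = $3 }   # mean average precision
        $1 == "P_10" { p10 = $3 }   # precision at rank 10
        END { print tag, map, p10 }'
  done
}
```

Run `summarize_runs` from the directory containing the ranking files, then compare its output against the table above.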