Ivory: Experiments

This tutorial provides a guide to batch retrieval with Ivory on Gov2, a document collection distributed by the University of Glasgow and used in many Text Retrieval Conference (TREC) evaluations. Users interested in experimenting with a newer, larger web data set should see the guide for getting started with ClueWeb09. This guide covers both indexing the collection and performing retrieval runs with queries from the 2004, 2005, and 2006 TREC terabyte tracks.

Tip	The procedure for preparing and indexing the Gov2 collection is similar to that for TREC disks 4-5, which is described in a separate tutorial, so it might be a good idea to complete that one first.

Building the Index

The first task is to obtain the collection (from the University of Glasgow). There are a total of 25,205,179 documents in the collection. The data is distributed in 273 directories (GX000/ to GX272/), containing a total of 27,204 input files. Each file consists of multiple web pages stored in an SGML format known as TREC web format. The entire collection is 81 GB compressed (426 GB uncompressed). See this page for additional details.
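
Before repacking, it can be worth confirming that all 273 directories are present. The following is a minimal sketch, assuming the raw collection sits on a local or mounted filesystem (a copy that already lives in HDFS would need `hadoop fs -ls` instead). The function name `missing_gov2_dirs` is made up for this example and is not part of Ivory or Cloud⁹.

```shell
# Hypothetical helper: print the name of every GXnnn directory that is
# missing under the given collection root. The GX000..GX272 range comes
# from the collection description above.
missing_gov2_dirs() {
  root="$1"
  i=0
  while [ "$i" -le 272 ]; do
    dir=$(printf 'GX%03d' "$i")
    # Report the directory only if it does not exist.
    [ -d "$root/$dir" ] || echo "$dir"
    i=$((i + 1))
  done
}
```

Calling `missing_gov2_dirs /path/to/gov2-corpus` prints nothing when every directory is in place, and one directory name per line otherwise.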

The collection is distributed as a large number of relatively small files, which is not something that Hadoop handles well. It's easiest to work with the collection as block-compressed SequenceFiles, so you'll want to first repack the original Gov2 files. There's a program in Cloud⁹ for repacking the collection:

hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
  -libjars lib/guava-X.X.X.jar \
  -collection /shared/collections/gov2/collection.raw/gov2-corpus/ \
  -output gov2-repacked \
  -compressionType block

Replace X.X.X with the actual (latest) version numbers of the jars. The -collection option specifies the base path of the raw collection (described above). The -output option specifies where the repacked output goes. The -compressionType option specifies the compression type, block compression in this case.

Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC disks 4-5 and all other collections:

etc/hadoop-cluster.sh ivory.app.PreprocessGov2 \
  -collection gov2-repacked -index index-gov2

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-gov2 -indexPartitions 100 -positionalIndexIP

Retrieval Experiments

To demonstrate batch retrieval, we're going to use topics from the 2004, 2005, and 2006 TREC terabyte tracks. In the data/gov2/ directory, you'll find the following data:

data/gov2/run.gov2.basic.xml: retrieval models and parameters
data/gov2/queries.gov2.title.701-775.xml: queries (topics 701-750 from TREC 2004 and topics 751-775 from TREC 2005)
data/gov2/queries.gov2.title.776-850.xml: queries (topics 776-800 from TREC 2005 and topics 801-850 from TREC 2006)
data/gov2/qrels.gov2.all: document relevance information (qrels).

The first configuration file specifies six different models:

gov2-dir-base: language modeling framework, simple query likelihood.
gov2-dir-sd: language modeling framework, sequential dependencies using MRFs.
gov2-dir-fd: language modeling framework, full dependencies using MRFs.
gov2-bm25-base: bm25 term weighting, simple bag-of-words queries.
gov2-bm25-sd: bm25 term weighting, sequential dependencies using MRFs.
gov2-bm25-fd: bm25 term weighting, full dependencies using MRFs.

Info	Before running the following experiments, make sure you've built the `trec_eval` evaluation package from NIST. For your convenience, v9.0 is included in `etc/trec_eval.9.0.tar.gz`. Build the package with `make` and place the executable at `etc/trec_eval`.

Info	Before running the following experiments, you have to copy the indexes out of HDFS. Also make sure to change the index location in `data/gov2/run.gov2.basic.xml` and any other model specification files to the actual index path (in the `<index>` element).
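
If you need to update the index path in several model specification files, a small `sed` wrapper can save some hand editing. This is only a sketch: it assumes the path is written as `<index>...</index>` on a single line, and the function name `set_index_path` is invented here rather than taken from Ivory. Check the result against your actual configuration file before running any experiments.

```shell
# Hypothetical helper: point every <index>...</index> element in a model
# specification file at a new index path. A backup is kept as FILE.bak.
# Assumes each <index> element sits on one line and that the new path
# contains neither '|' nor '&' (both are special in the sed expression).
set_index_path() {
  file="$1"
  path="$2"
  sed -i.bak "s|<index>[^<]*</index>|<index>${path}</index>|g" "$file"
}
```

For example, `set_index_path data/gov2/run.gov2.basic.xml /path/to/index-gov2` rewrites the file in place and leaves the original as `data/gov2/run.gov2.basic.xml.bak`.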

Here are the command-line invocations for running and evaluating the models:

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/gov2/run.gov2.basic.xml data/gov2/queries.gov2.title.701-775.xml data/gov2/queries.gov2.title.776-850.xml

# evaluating effectiveness
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-fd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Gov2_Basic

description	tag	MAP	P10
Dirichlet, full independence	gov2-dir-base	0.3077	0.5631
Dirichlet, sequential dependence	gov2-dir-sd	0.3239	0.6007
Dirichlet, full dependence	gov2-dir-fd	0.3237	0.5933
bm25, full independence	gov2-bm25-base	0.2999	0.5846
bm25, sequential dependence	gov2-bm25-sd	0.3294	0.6081
bm25, full dependence	gov2-bm25-fd	0.3295	0.6094
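
To compare your own runs against the figures in the table, it helps to pull just MAP and P10 out of the `trec_eval` output. The sketch below assumes the usual `trec_eval` summary layout of three whitespace-separated columns (measure name, `all`, value), with the measures named `map` and `P_10`. The function name `summarize_trec_eval` is hypothetical.

```shell
# Hypothetical helper: read trec_eval output on stdin and print one
# tab-separated line: tag, MAP, P10. Exact matching on the first field
# keeps measures such as gm_map or P_100 from being picked up.
summarize_trec_eval() {
  awk -v tag="$1" '
    $1 == "map"  { map = $3 }
    $1 == "P_10" { p10 = $3 }
    END { printf "%s\t%s\t%s\n", tag, map, p10 }
  '
}

# Usage sketch, assuming the six ranking files listed earlier exist:
#   for tag in gov2-dir-base gov2-dir-sd gov2-dir-fd \
#              gov2-bm25-base gov2-bm25-sd gov2-bm25-fd; do
#     etc/trec_eval data/gov2/qrels.gov2.all "ranking.$tag.txt" \
#       | summarize_trec_eval "$tag"
#   done
```

Each output line can then be checked directly against the corresponding row of the table.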
