Ivory: Experiments

This tutorial provides a guide to batch retrieval with Ivory on Wt10g, a document collection distributed by the University of Glasgow and used in many Text Retrieval Conferences (TRECs). The collection is somewhat dated and small by modern web collection standards, but it is still useful for experimentation. For newer, larger web data sets, see the guides for getting started with Gov2 and ClueWeb. This guide covers both indexing the collection and performing retrieval runs with queries from the 2000 and 2001 TREC Web tracks.

Tip	The procedure for preparing and indexing the Wt10g collection is similar to that for TREC disks 4-5, which is described in a separate tutorial, so it might be a good idea to complete that tutorial first.

Building the Index

The first task is to obtain the collection (from the University of Glasgow). There are a total of 1,692,096 documents in the collection. The data is distributed in 104 directories (WTX001/ to WTX104/). Each directory except the last contains 50 files (B01.gz to B50.gz); the last directory contains just 7 files (B01.gz to B07.gz). There are therefore 103 × 50 + 7 = 5,157 input files in total. Each file consists of multiple web pages stored in an SGML format known as TREC web format.
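Before repacking, it can be worth checking that a local copy of the raw collection is complete. The following is a minimal sketch, not part of Ivory or Cloud⁹: `count_wt10g_files` is a hypothetical helper name, and the path in the final comment is a placeholder for wherever your copy lives.

```shell
# Expected number of input files: 103 full directories of 50 files each,
# plus 7 files in the last directory.
expected=$(( 103 * 50 + 7 ))
echo "expected input files: $expected"

# Hypothetical helper: count the B*.gz files under a local copy of the
# raw collection (searches all WTX* subdirectories recursively).
count_wt10g_files() {
  find "$1" -type f -name 'B*.gz' | wc -l | tr -d ' '
}

# Example (placeholder path):
#   [ "$(count_wt10g_files /path/to/wt10g/collection.raw)" -eq "$expected" ] && echo complete
```

If the two numbers differ, the download or unpacking step most likely missed some files.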

The collection is distributed as a large number of relatively small files, which is not something that Hadoop handles well. It's easiest to work with the collection as block-compressed SequenceFiles, so you'll want to first repack the original Wt10g files. There's a program in Cloud⁹ for repacking the collection:

hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
  -libjars lib/guava-X.X.X.jar \
  -collection /shared/collections/wt10g/collection.raw/ \
  -output wt10g-repacked \
  -compressionType block

Replace X.X.X with the latest version numbers of the jars. The -collection option specifies the base path of the raw collection (described above). The -output option specifies where the repacked output goes. The -compressionType option specifies the compression type, block compression in this case.
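Instead of typing the jar versions by hand, the jar paths can be resolved from the contents of lib/. This is only a sketch under assumptions: `latest_jar` is a hypothetical helper, it assumes jars are named `<prefix>-<version>.jar`, and it relies on `sort -V` (version sort) from GNU coreutils. The script only prints the repack command so that it can be inspected before being run.

```shell
# Hypothetical helper: print the highest-versioned jar in directory $1
# whose name starts with prefix $2 (e.g. "cloud9" or "guava").
latest_jar() {
  for f in "$1"/"$2"-*.jar; do
    # Skip the unexpanded pattern when nothing matches.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done | sort -V | tail -n 1
}

CLOUD9_JAR=$(latest_jar lib cloud9)
GUAVA_JAR=$(latest_jar lib guava)

# Print (do not run) the repack command once both jars have been found.
if [ -n "$CLOUD9_JAR" ] && [ -n "$GUAVA_JAR" ]; then
  echo hadoop jar "$CLOUD9_JAR" edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
    -libjars "$GUAVA_JAR" \
    -collection /shared/collections/wt10g/collection.raw/ \
    -output wt10g-repacked \
    -compressionType block
fi
```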

Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC disks 4-5 and other collections:

etc/hadoop-cluster.sh ivory.app.PreprocessWt10g \
  -collection wt10g-repacked -index index-wt10g

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-wt10g -indexPartitions 10 -positionalIndexIP

Retrieval Experiments

To demonstrate batch retrieval, we're going to use topics from the 2000 and 2001 TREC Web tracks. In the data/wt10g/ directory, you'll find the following data:

data/wt10g/run.wt10g.basic.xml: retrieval models and parameters
data/wt10g/queries.wt10g.451-500.xml: queries (TREC 2000 Web track)
data/wt10g/queries.wt10g.501-550.xml: queries (TREC 2001 Web track)
data/wt10g/qrels.wt10g.all: document relevance information (qrels)

The first configuration file specifies six different models:

wt10g-dir-base: language modeling framework, Dirichlet prior, simple query likelihood.
wt10g-dir-sd: language modeling framework, Dirichlet prior, sequential dependence model using MRFs.
wt10g-dir-fd: language modeling framework, Dirichlet prior, full dependence model using MRFs.
wt10g-bm25-base: bm25 term weighting, simple bag-of-words queries.
wt10g-bm25-sd: sequential dependence model using MRFs, with bm25 term weighting.
wt10g-bm25-fd: full dependence model using MRFs, with bm25 term weighting.

Info	Before running the following experiments, make sure you've built the `trec_eval` evaluation package from NIST. For your convenience, v9.0 is included in `etc/trec_eval.9.0.tar.gz`. Build the package by running `make` and place the executable at `etc/trec_eval`.

Info	Before running the following experiments, you have to copy the index out of HDFS to the local file system. Also make sure to change the index location in `data/wt10g/run.wt10g.basic.xml` and any other model specification files to the actual index path (in the `<index>` element).
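Updating the index location by hand in several files is error-prone, so the edit can be scripted. The sketch below rests on assumptions: `set_index_path` is a hypothetical helper, the index location is assumed to be stored as the text of an `<index>` element on a single line, and the paths in the comments are placeholders.

```shell
# Copy the index out of HDFS first, for example (placeholder local path):
#   hadoop fs -get index-wt10g /local/indexes/index-wt10g

# Hypothetical helper: point every <index>...</index> entry in a model
# specification file at a new path. Assumes the element sits on one line
# and that the new path contains no '|' or '&' characters.
set_index_path() {
  file=$1
  new_path=$2
  sed "s|<index>[^<]*</index>|<index>${new_path}</index>|g" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}

# Example (placeholder path):
#   set_index_path data/wt10g/run.wt10g.basic.xml /local/indexes/index-wt10g
```

Writing to a temporary file and renaming it avoids `sed -i`, whose syntax differs between GNU and BSD versions of sed.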

Here are the command-line invocations for running and evaluating the models:

# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/wt10g/run.wt10g.basic.xml data/wt10g/queries.wt10g.451-500.xml data/wt10g/queries.wt10g.501-550.xml

# evaluating effectiveness
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-base.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-sd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-fd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-base.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-sd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Wt10g_Basic

description	tag	MAP	P10
Dirichlet, full independence	wt10g-dir-base	0.2093	0.3131
Dirichlet, sequential dependence	wt10g-dir-sd	0.2187	0.3192
Dirichlet, full dependence	wt10g-dir-fd	0.2205	0.3242
bm25, full independence	wt10g-bm25-base	0.2105	0.3202
bm25, sequential dependence	wt10g-bm25-sd	0.2248	0.3333
bm25, full dependence	wt10g-bm25-fd	0.2226	0.3394
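To compare your own runs against the table above, the six evaluation commands can be wrapped in a loop that keeps only MAP and P10. This is a sketch rather than part of Ivory: `extract_map_p10` is a hypothetical helper that relies on the three-column output of `trec_eval` (measure name, query id, value), where the measures of interest are named `map` and `P_10`. Runs whose ranking file is missing are skipped.

```shell
# Hypothetical helper: read trec_eval output on stdin and print the
# MAP and P10 values separated by a tab.
extract_map_p10() {
  awk '$1 == "map"  { map = $3 }
       $1 == "P_10" { p10 = $3 }
       END { printf "%s\t%s\n", map, p10 }'
}

# Evaluate each run produced by run.wt10g.basic.xml, one line per model.
for tag in dir-base dir-sd dir-fd bm25-base bm25-sd bm25-fd; do
  run="ranking.wt10g-${tag}.txt"
  if [ -x etc/trec_eval ] && [ -f "$run" ]; then
    printf 'wt10g-%s\t' "$tag"
    etc/trec_eval data/wt10g/qrels.wt10g.all "$run" | extract_map_p10
  fi
done
```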
