A Hadoop toolkit for web-scale information retrieval research
This tutorial provides a guide to batch retrieval with Ivory on Wt10g, a document collection that is distributed by the University of Glasgow and has been used in many Text Retrieval Conferences (TRECs). The collection is somewhat dated and small by modern web collection standards, but it is still useful for experimentation. For newer, larger web data sets, please see the guides for getting started with Gov2 and ClueWeb. This guide covers both indexing the collection and performing retrieval runs with queries from the 2000 and 2001 TREC Web tracks.
Tip | The procedure for preparing and indexing the Wt10g collection is similar to that for TREC disks 4-5, which is described in a separate tutorial, so it might be a good idea to complete that tutorial first. |
The first task is to obtain the collection (from the University of Glasgow). There are a total of 1,692,096 documents in the collection. The data is distributed in 104 directories (WTX001/ to WTX104/), with each directory, except for the last, consisting of 50 files (B01.gz to B50.gz). The last directory consists of just 7 files (B01.gz to B07.gz). Therefore, there are a total of 5,157 input files (103 × 50 + 7). Each file consists of multiple web pages stored in an SGML format known as TREC web format.
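Before repacking, it can be worth checking that your copy of the collection matches the layout described above. The following is a minimal sketch, not part of Ivory or Cloud9: the function name count_wt10g_files and the variable expected_wt10g_files are hypothetical, and the sketch assumes the WTXnnn/Bnn.gz naming given above and that the raw files sit on a local file system.

```shell
# Count the compressed bundle files under a Wt10g-style directory tree.
# Only files named Bnn.gz inside directories named WTXnnn are counted.
# Usage (path is illustrative): count_wt10g_files /path/to/collection.raw
count_wt10g_files() {
  find "$1" -type f -path '*/WTX[0-9][0-9][0-9]/B[0-9][0-9].gz' | wc -l | tr -d ' '
}

# Expected total from the layout described above:
# 103 full directories of 50 files each, plus 7 files in the last directory.
expected_wt10g_files=$(( 103 * 50 + 7 ))
```

Comparing the output of count_wt10g_files against expected_wt10g_files gives a quick sanity check that no directory is missing or truncated.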
The collection is distributed as a large number of relatively small files, which is not something that Hadoop handles well. It's easiest to work with the collection as block-compressed SequenceFiles, so you'll want to first repack the original Wt10g files. There's a program in Cloud9 for repacking the collection:
hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
  -libjars lib/guava-X.X.X.jar \
  -collection /shared/collections/wt10g/collection.raw/ \
  -output wt10g-repacked \
  -compressionType block
Replace X.X.X with the actual latest version of each jar. The -collection option specifies the base path of the raw collection (described above). The -output option specifies where the repacked output goes. The -compressionType option specifies the compression type, block compression in this case.
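If you would rather not hard-code the jar versions, one option is to look them up in lib/ at run time. The helper below is a hypothetical sketch (find_jar is not part of Cloud9 or Ivory); it assumes the jars follow the name-version.jar pattern shown above and simply returns the first match.

```shell
# Print the first jar in directory $1 whose name starts with "$2-".
# Returns non-zero and prints a message to stderr if none is found.
find_jar() {
  dir=$1
  prefix=$2
  for candidate in "$dir"/"$prefix"-*.jar; do
    if [ -e "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no $prefix jar found in $dir" >&2
  return 1
}

# Illustrative usage with the repacking command from this guide:
#   CLOUD9_JAR=$(find_jar lib cloud9)
#   GUAVA_JAR=$(find_jar lib guava)
#   hadoop jar "$CLOUD9_JAR" edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
#     -libjars "$GUAVA_JAR" \
#     -collection /shared/collections/wt10g/collection.raw/ \
#     -output wt10g-repacked \
#     -compressionType block
```

If several versions of the same jar are present, the shell's glob ordering decides which one is picked, so it is safer to keep only one version of each in lib/.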
Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC and all other collections:
etc/hadoop-cluster.sh ivory.app.PreprocessWt10g \
  -collection wt10g-repacked -index index-wt10g

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-wt10g -indexPartitions 10 -positionalIndexIP
To demonstrate batch retrieval, we're going to use topics from the 2000 and 2001 TREC Web tracks. In the data/wt10g/ directory, you'll find the following data:

data/wt10g/run.wt10g.basic.xml: retrieval models and parameters
data/wt10g/queries.wt10g.451-500.xml: queries (TREC 2000 Web track)
data/wt10g/queries.wt10g.501-550.xml: queries (TREC 2001 Web track)
data/wt10g/qrels.wt10g.all: document relevance information (qrels)

The first configuration file specifies six different models: Dirichlet and bm25 scoring, each under full independence, sequential dependence, and full dependence.
Info | Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package by running make and place the executable at etc/trec_eval. |
Info | Before running the following experiments, you have to copy the indexes out of HDFS. Also make sure to change the index location in data/wt10g/run.wt10g.basic.xml and other model specification files to the actual index path (under the <index> attribute). |
Here are the command-line invocations for running and evaluating the models:
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/wt10g/run.wt10g.basic.xml \
  data/wt10g/queries.wt10g.451-500.xml \
  data/wt10g/queries.wt10g.501-550.xml

# evaluating effectiveness
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-base.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-sd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-dir-fd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-base.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-sd.txt
etc/trec_eval data/wt10g/qrels.wt10g.all ranking.wt10g-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Wt10g_Basic
description | tag | MAP | P10 |
Dirichlet, full independence | wt10g-dir-base | 0.2093 | 0.3131 |
Dirichlet, sequential dependence | wt10g-dir-sd | 0.2187 | 0.3192 |
Dirichlet, full dependence | wt10g-dir-fd | 0.2205 | 0.3242 |
bm25, full independence | wt10g-bm25-base | 0.2105 | 0.3202 |
bm25, sequential dependence | wt10g-bm25-sd | 0.2248 | 0.3333 |
bm25, full dependence | wt10g-bm25-fd | 0.2226 | 0.3394 |
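To compare your own runs against the figures in the table above, it helps to reduce each trec_eval report to just MAP and P10. The function below is a rough sketch under stated assumptions: summarize_trec_eval is a hypothetical helper name, and it assumes trec_eval's usual three-column summary output (measure name, the query id "all", and the value), where the two measures of interest are labeled map and P_10.

```shell
# Read a trec_eval report on stdin and print only MAP and P10.
# Assumes whitespace-separated columns: measure, query id, value.
summarize_trec_eval() {
  awk '$1 == "map"  && $2 == "all" { map = $3 }
       $1 == "P_10" && $2 == "all" { p10 = $3 }
       END { printf "MAP=%s P10=%s\n", map, p10 }'
}

# Illustrative usage over the six runs produced in this guide:
#   for tag in dir-base dir-sd dir-fd bm25-base bm25-sd bm25-fd; do
#     printf 'wt10g-%s ' "$tag"
#     etc/trec_eval data/wt10g/qrels.wt10g.all "ranking.wt10g-$tag.txt" | summarize_trec_eval
#   done
```

Matching on the exact measure name keeps related measures such as gm_map or P_100 from being picked up by mistake.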