A Hadoop toolkit for web-scale information retrieval research
This tutorial provides a guide to batch retrieval with Ivory on Gov2, a document collection which is distributed by the University of Glasgow and used in many Text Retrieval Conferences (TRECs). Users interested in experimenting with a newer, larger web data set should see the guide for getting started with ClueWeb09. This guide will cover both indexing the collection and performing retrieval runs with queries from the 2004, 2005 and 2006 TREC terabyte tracks.
Tip | The procedure for preparing and indexing the Gov2 collection is similar to that for TREC disks 4-5, which is described in a separate tutorial, so it might be a good idea to complete that one first. |
The first task is to obtain the collection (from the University of Glasgow). There are a total of 25,205,179 documents in the collection. The data is distributed in 273 directories (GX000/ to GX272/), containing a total of 27,204 input files. Each file consists of multiple web pages stored in an SGML format known as TREC web format. The entire collection is 81 GB compressed (426 GB uncompressed). See this page for additional details.
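Once the download finishes, a quick count can confirm that the copy is complete. The following is a minimal sketch and not part of Ivory or Cloud9: `count_gov2` is a hypothetical helper, and it assumes that every input file sits somewhere inside one of the GX directories under the collection root you pass in.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ivory or Cloud9): count the GX* directories
# under a collection root and the regular files inside them.
# Assumption: all input files live inside the GX* directories.
count_gov2() {
  root="$1"
  dirs=0
  files=0
  for d in "$root"/GX*/; do
    # Skip the unexpanded pattern when no GX* directory exists.
    [ -d "$d" ] || continue
    dirs=$((dirs + 1))
    # tr strips the padding that some wc implementations emit.
    n=$(find "$d" -type f | wc -l | tr -d ' ')
    files=$((files + n))
  done
  # Prints "<directory count> <file count>".
  echo "$dirs $files"
}

# Example call; the path is the one used in this guide's repacking command
# and is only an assumption about where your copy lives.
if [ -d /shared/collections/gov2/collection.raw/gov2-corpus ]; then
  count_gov2 /shared/collections/gov2/collection.raw/gov2-corpus
fi
```

If the layout matches the assumption, the two numbers can be compared directly with the 273 directories and 27,204 input files quoted above.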
The collection is distributed as a large number of relatively small files, which is not something that Hadoop handles well. It's easiest to work with the collection as block-compressed SequenceFiles, so you'll want to first repack the original Gov2 files. There's a program in Cloud9 for repacking the collection:
hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
  -libjars lib/guava-X.X.X.jar \
  -collection /shared/collections/gov2/collection.raw/gov2-corpus/ \
  -output gov2-repacked \
  -compressionType block
Replace X.X.X with the latest version numbers of the jars. The -collection option specifies the base path of the raw collection (described above). The -output option specifies where the repacked output goes. The -compressionType option specifies the compression type, block compression in this case.
Once the collection has been repacked, building the inverted index follows a procedure very similar to TREC and all other collections:
etc/hadoop-cluster.sh ivory.app.PreprocessGov2 \
  -collection gov2-repacked -index index-gov2

etc/hadoop-cluster.sh ivory.app.BuildIndex \
  -index index-gov2 -indexPartitions 100 -positionalIndexIP
To demonstrate batch retrieval, we're going to use topics from the 2004, 2005, and 2006 TREC terabyte tracks. In the docs/data/gov2/ directory, you'll find the following data:
data/gov2/run.gov2.basic.xml: retrieval models and parameters
data/gov2/queries.gov2.title.701-775.xml: queries (topics 701-750 from TREC 2004 and topics 751-775 from TREC 2005)
data/gov2/queries.gov2.title.776-850.xml: queries (topics 776-800 from TREC 2005 and topics 801-850 from TREC 2006)
data/gov2/qrels.gov2.all: document relevance information (qrels)

The first configuration file specifies six different models; their tags appear in the results table at the end of this guide.
Info | Before running the following experiments, make sure you've built the trec_eval evaluation package from NIST. For your convenience, v9.0 is included in etc/trec_eval.9.0.tar.gz. Build the package with make and place the executable at etc/trec_eval. |
Info | Before running the following experiments, you have to copy the indexes out of HDFS. Also make sure to change the index location in data/gov2/run.gov2.basic.xml and other model specification files to the actual index path (under the <index> attribute). |
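Editing the index location by hand in several model specification files is easy to get wrong, so it can help to script it. The sketch below is an illustration only: `set_index_path` is a hypothetical helper, and it assumes the path is stored as element text in the form `<index>/some/path</index>`. Check your copy of the file first, because if the path is stored differently the pattern will not match and the file is left unchanged.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ivory): rewrite the index location in a
# model specification file.
# Assumption: the path is stored as <index>/some/path</index>.
# Limitation: the new path must not contain '|' or '&', which are special
# in the sed expression below.
set_index_path() {
  file="$1"
  new_path="$2"
  tmp="${file}.tmp.$$"
  # Write to a temporary file first so a failed sed never truncates the original.
  sed "s|<index>[^<]*</index>|<index>${new_path}</index>|" "$file" > "$tmp" \
    && mv "$tmp" "$file"
}

# Example call; /local/index-gov2 is a placeholder for wherever you copied
# the index after fetching it from HDFS (for instance with `hadoop fs -get`).
if [ -f data/gov2/run.gov2.basic.xml ]; then
  set_index_path data/gov2/run.gov2.basic.xml /local/index-gov2
fi
```

Running `grep '<index>' data/gov2/run.gov2.basic.xml` afterwards is a simple way to confirm that every model now points at the local copy.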
Here are the command-line invocations for running and evaluating the models:
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
  data/gov2/run.gov2.basic.xml data/gov2/queries.gov2.title.701-775.xml data/gov2/queries.gov2.title.776-850.xml

# evaluating effectiveness
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-fd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Gov2_Basic
description | tag | MAP | P10 |
Dirichlet, full independence | gov2-dir-base | 0.3077 | 0.5631 |
Dirichlet, sequential dependence | gov2-dir-sd | 0.3239 | 0.6007 |
Dirichlet, full dependence | gov2-dir-fd | 0.3237 | 0.5933 |
bm25, full independence | gov2-bm25-base | 0.2999 | 0.5846 |
bm25, sequential dependence | gov2-bm25-sd | 0.3294 | 0.6081 |
bm25, full dependence | gov2-bm25-fd | 0.3295 | 0.6094 |
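The six evaluation commands can be wrapped in a loop that prints only MAP and P10, the two measures reported in the table. This is a minimal sketch rather than anything shipped with Ivory: `summarize` is a hypothetical helper, and it assumes trec_eval's usual summary layout of three whitespace-separated columns (measure name, query id, value).

```shell
#!/bin/sh
# Hypothetical helper (not part of Ivory): keep only the map and P_10 rows
# of trec_eval's summary output and print "<measure> <value>".
# Assumption: each row has the form "<measure> <query id> <value>".
summarize() {
  awk '$1 == "map" || $1 == "P_10" { print $1, $3 }'
}

# Evaluate each of the six runs named in the table, skipping any that are missing.
for tag in dir-base dir-sd dir-fd bm25-base bm25-sd bm25-fd; do
  run="ranking.gov2-${tag}.txt"
  if [ -f "$run" ] && [ -x etc/trec_eval ]; then
    echo "== gov2-${tag}"
    etc/trec_eval data/gov2/qrels.gov2.all "$run" | summarize
  else
    echo "skipping gov2-${tag}: $run or etc/trec_eval not found" >&2
  fi
done
```

The exact string comparison on the first column keeps rows such as gm_map or P_100 out of the summary, which a plain grep for "map" or "P_10" would let through.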