This tutorial provides a guide to batch retrieval with Ivory on Gov2, a document collection distributed by the University of Glasgow and used in many Text REtrieval Conference (TREC) evaluations. Users interested in experimenting with a newer, larger web dataset should see the guide for getting started with ClueWeb09. This guide covers both indexing the collection and performing retrieval runs with queries from the 2004, 2005, and 2006 TREC terabyte tracks.
|Tip||The procedure for preparing and indexing the Gov2 collection is similar to that for TREC disks 4-5, which is described in a separate tutorial, so it might be a good idea to complete that guide first.|
The first task is to obtain the collection (from the University of Glasgow). There are a total of 25,205,179 documents in the collection. The data is distributed in 273 directories (GX000/ through GX272/), containing a total of 27,204 input files. Each file consists of multiple web pages stored in an SGML format known as TREC web format. The entire collection is 81 GB compressed (426 GB uncompressed). See the University of Glasgow's distribution page for additional details.
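As a quick sanity check on the figures above, the expected directory names can be generated and counted with plain shell (the `seq -f` pattern below is just an illustration, not part of the Ivory tooling):

```shell
# The raw collection ships in directories GX000/ through GX272/.
# Generate the expected names and confirm there are 273 of them.
dirs=$(seq -f "GX%03g" 0 272)
echo "$dirs" | head -n 1   # GX000
echo "$dirs" | tail -n 1   # GX272
echo "$dirs" | wc -l       # 273
```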
The collection is distributed as a large number of relatively small files, which is not something that Hadoop handles well. It's easiest to work with the collection as SequenceFiles, so you'll want to first repack the original Gov2 files. There's a program in Cloud9 for repacking the collection:
hadoop jar lib/cloud9-X.X.X.jar edu.umd.cloud9.collection.trecweb.RepackTrecWebCollection \
 -libjars lib/guava-X.X.X.jar \
 -collection /shared/collections/gov2/collection.raw/gov2-corpus/ \
 -output gov2-repacked \
 -compressionType block
Replace X.X.X with the actual versions of the jars. The -collection option specifies the base path of the raw collection (described above). The -output option specifies where the repacked output goes. The -compressionType option specifies the compression type, block compression in this case.
Once the collection has been repacked, building the inverted index follows a procedure very similar to that for TREC disks 4-5 and other collections:
etc/hadoop-cluster.sh ivory.app.PreprocessGov2 \
 -collection gov2-repacked -index index-gov2

etc/hadoop-cluster.sh ivory.app.BuildIndex \
 -index index-gov2 -indexPartitions 100 -positionalIndexIP
To demonstrate batch retrieval, we're going to use topics from the 2004, 2005, and 2006 TREC terabyte tracks. In the docs/data/gov2/ directory, you'll find the following files:
data/gov2/run.gov2.basic.xml: retrieval models and parameters
data/gov2/queries.gov2.title.701-775.xml: queries (topics 701-750 from TREC 2004 and topics 751-775 from TREC 2005)
data/gov2/queries.gov2.title.776-850.xml: queries (topics 776-800 from TREC 2005 and topics 801-850 from TREC 2006)
data/gov2/qrels.gov2.all: document relevance information (qrels).
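For reference, the query files use a simple XML layout, roughly as sketched below (the element names and topic text shown here are illustrative; consult the actual files for the exact markup):

```xml
<parameters>
  <!-- one query element per TREC topic; id is the topic number -->
  <query id="701">u.s. oil industry history</query>
  <query id="702">pearl farming</query>
</parameters>
```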
The first configuration file specifies six different retrieval models: Dirichlet and BM25 scoring, each under the full independence, sequential dependence, and full dependence assumptions. Before running the following experiments, make sure you've built the index as described above. To run the experiments locally, you have to copy the indexes out of HDFS. Also make sure to change the index location in run.gov2.basic.xml to point to the local copy.
Here are the command-line invocations for running and evaluating the models:
# command-line
etc/run.sh ivory.smrf.retrieval.RunQueryLocal \
 data/gov2/run.gov2.basic.xml \
 data/gov2/queries.gov2.title.701-775.xml \
 data/gov2/queries.gov2.title.776-850.xml

# evaluating effectiveness
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-dir-fd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-base.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-sd.txt
etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-bm25-fd.txt

# junit
etc/junit.sh ivory.regression.basic.Gov2_Basic
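Since the six evaluation calls differ only in the run tag, they can be scripted. The loop below just prints the equivalent commands (pipe its output to `sh` to execute them, assuming the ranking files are in the current directory):

```shell
# Model tags from the runs above; each corresponds to one ranking file.
for tag in dir-base dir-sd dir-fd bm25-base bm25-sd bm25-fd; do
  echo "etc/trec_eval data/gov2/qrels.gov2.all ranking.gov2-${tag}.txt"
done
```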
|Model||Tag||MAP||P10|
|Dirichlet, full independence||gov2-dir-base||0.3077||0.5631|
|Dirichlet, sequential dependence||gov2-dir-sd||0.3239||0.6007|
|Dirichlet, full dependence||gov2-dir-fd||0.3237||0.5933|
|BM25, full independence||gov2-bm25-base||0.2999||0.5846|
|BM25, sequential dependence||gov2-bm25-sd||0.3294||0.6081|
|BM25, full dependence||gov2-bm25-fd||0.3295||0.6094|