A Hadoop toolkit for web-scale information retrieval research

The following is a step-by-step guide to replicating the experiments in Asadi and Lin's paper "Document Vector Representations for Feature Extraction in Multi-Stage Document Ranking", published in the Information Retrieval Journal (2012).

Queries, Documents, and Features

Queries must be in the following XML format:

  <query id="query_id">query_text</query>

A sample query file can be found at data/ivory/ffg/queries.xml.
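
As a minimal sketch of producing entries in this format, the following Python snippet renders (id, text) pairs as query elements and escapes XML special characters. The helper name format_query and the sample queries are hypothetical. Whether the file needs an enclosing root element is not stated here, so compare the result against the sample file before use.

```python
from xml.sax.saxutils import escape, quoteattr


def format_query(query_id, query_text):
    """Render one query element (hypothetical helper, not part of Ivory)."""
    # quoteattr adds the surrounding quotes and escapes the attribute value;
    # escape handles &, < and > in the query text.
    return "<query id={}>{}</query>".format(
        quoteattr(str(query_id)), escape(query_text)
    )


# Made-up queries, for illustration only.
sample_queries = [(1, "information retrieval"), (2, "salt & pepper")]
query_lines = [format_query(qid, text) for qid, text in sample_queries]
print("\n".join(query_lines))
```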

Candidate documents from which features are extracted should be in a separate file in the following format:

<line> .=. <query-id> \t <document-id>
<query-id> .=. <integer>
<document-id> .=. <integer>

For a sample, see data/ivory/ffg/documents.txt.
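
The candidate format above can be checked before running any driver. The sketch below, in Python, parses and writes single lines of the form query-id, tab, document-id. The function names are hypothetical, and the check is only as strict as the grammar given above (two tab-separated integers).

```python
def parse_candidate_line(line):
    """Parse one candidate line into (query_id, document_id) as integers.

    Hypothetical helper; raises ValueError on malformed input.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(
            "expected 2 tab-separated fields, got {}".format(len(fields))
        )
    # int() raises ValueError if either field is not an integer.
    return int(fields[0]), int(fields[1])


def format_candidate_line(query_id, document_id):
    """Format one candidate pair as a tab-separated line (no newline)."""
    return "{:d}\t{:d}".format(query_id, document_id)
```

For example, parse_candidate_line("12\t3456\n") yields the pair (12, 3456), and a line with a missing or non-integer field raises ValueError.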

Finally, a list of feature definitions must be provided. A sample list of features is included in data/ivory/ffg/features.xml.

Preparing ClueWeb09 Indexes

The first step is to construct a Positional Inverted Index (PII) or Document Vectors (DV). A PII can be used to perform both candidate generation and feature extraction together in a single pass. To construct a PII, run the following driver, which takes an Ivory index, the queries file, and spam scores as input:

etc/run.sh ivory.ffg.preprocessing.GenerateCompressedPositionalPostings \
  -index [Ivory-index-path] -query [queries-path] \
  -spam [spam-scores-path] -output [PII-path]

To construct document vectors, you need to pack documents into one of the document vector representations provided in ivory.ffg.data; pass the chosen class as the -dvclass argument.

You can run the driver as follows:

etc/run.sh ivory.ffg.preprocessing.GenerateDocumentVectors \
  -index [Ivory-index-path] -dvclass [document-vector-class] \
  -candidate [documents-path] -output [document-vectors-path]

Computing Features

To perform candidate generation (with Small Adaptive) and feature extraction in one pass using a PII, use the following driver:

etc/run.sh ivory.ffg.driver.RankAndFeaturesSmallAdaptive \
  -index [Ivory-index-path] -posting [PII-path] \
  -query [queries-path] -candidate [documents-path] -feature [features-path] \
  (-hits [num-hits]) (-spam [spam-scores-path] -output [output-path])

Arguments in parentheses are optional. If -hits is omitted, it defaults to 10,000 documents.

To perform feature extraction using document vectors with either the sliding window technique or the "on-the-fly indexing" technique, use one of the following drivers:

etc/run.sh ivory.ffg.driver.(DocumentVectorOnTheFlyIndexing|DocumentVectorSlidingWindow) \
  -index [Ivory-index-path] -dvclass [document-vector-class] -document [document-vectors-path] \
  -query [queries-path] -candidate [documents-path] -feature [features-path] \
  (-output [output-path])
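
When scripting several runs, the command line above can be assembled programmatically. The following Python sketch builds the argument list for either driver using only the flags shown above. The function name, the DV_DRIVERS tuple, and every path in the example are hypothetical placeholders, and the document vector class is left as a placeholder on purpose.

```python
DV_DRIVERS = ("DocumentVectorOnTheFlyIndexing", "DocumentVectorSlidingWindow")


def build_dv_feature_command(driver, index, dvclass, document, query,
                             candidate, feature, output=None):
    """Build the argument list for a document-vector feature driver.

    Hypothetical helper; it only mirrors the usage line shown above.
    """
    if driver not in DV_DRIVERS:
        raise ValueError("unknown driver: {}".format(driver))
    command = [
        "etc/run.sh", "ivory.ffg.driver." + driver,
        "-index", index, "-dvclass", dvclass, "-document", document,
        "-query", query, "-candidate", candidate, "-feature", feature,
    ]
    # -output is shown in parentheses above, so it is added only when given.
    if output is not None:
        command += ["-output", output]
    return command


# Placeholder arguments, for illustration only.
example_command = build_dv_feature_command(
    "DocumentVectorSlidingWindow",
    index="[Ivory-index-path]",
    dvclass="[document-vector-class]",
    document="[document-vectors-path]",
    query="data/ivory/ffg/queries.xml",
    candidate="data/ivory/ffg/documents.txt",
    feature="data/ivory/ffg/features.xml",
)
print(" ".join(example_command))
```

The resulting list could be passed to subprocess.run(example_command, check=True) from the Ivory root directory once the placeholders are replaced with real paths.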

The output file contains the feature values for each document, one document per line, in the following format: "qid \t document number \t space-delimited-feature-values".
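
A minimal Python sketch for reading one line of this output follows. It assumes that every feature value can be parsed as a floating-point number, which is not stated above; the function name is hypothetical.

```python
def parse_feature_line(line):
    """Split one output line into (qid, document_number, feature_values).

    Hypothetical helper. Assumes the three fields are tab-separated and
    that the feature values are space-delimited numbers.
    """
    qid, docno, values = line.rstrip("\n").split("\t", 2)
    return qid, docno, [float(value) for value in values.split()]


# Made-up line, for illustration only.
qid, docno, features = parse_feature_line("1\t42\t0.5 1.25 3\n")
print(qid, docno, features)
```

The query id and document number are kept as strings here so that they can be matched directly against the candidate file.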