A Hadoop toolkit for web-scale information retrieval research
The following is a step-by-step guide to replicating the experiments in Asadi and Lin's paper "Document Vector Representations for Feature Extraction in Multi-Stage Document Ranking" in the Information Retrieval Journal (2012).
Queries must be in the following XML format:
<parameters>
  <query id="query_id">query_text</query>
</parameters>
A sample query file can be found at data/ivory/ffg/queries.xml.
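As a minimal sketch of producing a file in this format, the following Python builds the <parameters> document from a dictionary of queries. The function name build_query_xml and the two example queries are hypothetical and are not part of Ivory; only the element and attribute names come from the format above.

```python
import xml.etree.ElementTree as ET


def build_query_xml(queries):
    """Return a <parameters> document with one <query id="..."> element per entry."""
    root = ET.Element("parameters")
    for query_id, text in queries.items():
        # The id attribute carries the query id; the element text is the query itself.
        node = ET.SubElement(root, "query", id=str(query_id))
        node.text = text
    # encoding="unicode" returns a str; special characters such as & are escaped.
    return ET.tostring(root, encoding="unicode")


# Hypothetical queries, for illustration only.
xml_text = build_query_xml({1: "information retrieval", 2: "hadoop toolkit"})
print(xml_text)
```

Writing xml_text to a file then yields something that can be passed wherever a [queries-path] is expected.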
Candidate documents from which features are extracted should be in a separate file in the following format:
<line> .=. <query-id> \t <document-id>
<query-id> .=. <integer>
<document-id> .=. <integer>
For a sample, see data/ivory/ffg/documents.txt.
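Before running the drivers, it can help to check a candidate file against this grammar. The sketch below is one possible validator, not an Ivory utility: parse_candidates is a hypothetical name, and skipping blank lines is an assumption the grammar does not state.

```python
def parse_candidates(lines):
    """Group document ids by query id; each line is query-id TAB document-id."""
    candidates = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are tolerated
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated fields")
        # Both ids must be integers; int() raises ValueError otherwise.
        query_id, doc_id = int(fields[0]), int(fields[1])
        candidates.setdefault(query_id, []).append(doc_id)
    return candidates


# Hypothetical candidate lines, for illustration only.
sample = ["1\t10\n", "1\t11\n", "2\t20\n"]
print(parse_candidates(sample))
```

For a real file, the same function can be called on an open file object, for example parse_candidates(open("data/ivory/ffg/documents.txt")).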
Finally, a list of feature definitions must be provided. A sample list of features is included in data/ivory/ffg/features.xml.
The first step is to construct a Positional Inverted Index (PII) or Document Vectors (DV). A PII can be used to perform both candidate generation and feature extraction together in a single pass. To construct a PII, the driver uses an Ivory index and spam scores.
etc/run.sh ivory.ffg.preprocessing.GenerateCompressedPositionalPostings \
  -index [Ivory-index-path] -query [queries-path] \
  -spam [spam-scores-path] -output [PII-path]
To construct document vectors, you need to pack documents into one of the document vector representations in ivory.ffg.data. These include:
ivory.ffg.data.DocumentVectorPForDeltaArray
ivory.ffg.data.DocumentVectorVIntArray
ivory.ffg.data.DocumentVectorHashedArray
ivory.ffg.data.DocumentVectorMiniInvertedIndex
You can run the driver as follows:
etc/run.sh ivory.ffg.preprocessing.GenerateDocumentVectors \
  -index [Ivory-index-path] -dvclass [document-vector-class] \
  -candidate [documents-path] -output [document-vectors-path]
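When comparing several representations, the same driver has to be run once per class. The sketch below only assembles the command lines. The helper name document_vector_command and the index/ and dv/ paths are hypothetical. Passing the fully qualified class name to -dvclass is also an assumption; check the driver's usage message for the expected form.

```python
import shlex


def document_vector_command(index, dvclass, candidates, output):
    """Build the argument list for one GenerateDocumentVectors run."""
    return [
        "etc/run.sh", "ivory.ffg.preprocessing.GenerateDocumentVectors",
        "-index", index,
        "-dvclass", dvclass,  # assumption: fully qualified class name
        "-candidate", candidates,
        "-output", output,
    ]


# Hypothetical paths; one output directory per representation.
for name in ("DocumentVectorVIntArray", "DocumentVectorHashedArray"):
    cmd = document_vector_command(
        "index/", f"ivory.ffg.data.{name}",
        "data/ivory/ffg/documents.txt", f"dv/{name}",
    )
    print(shlex.join(cmd))  # print only; nothing is executed here
```

To actually launch a run, the list can be handed to subprocess.run(cmd, check=True) from the Ivory root directory.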
In order to perform candidate generation (with Small Adaptive) and feature extraction in one pass using a PII, use the following driver:
etc/run.sh ivory.ffg.driver.RankAndFeaturesSmallAdaptive \
  -index [Ivory-index-path] -posting [PII-path] \
  -query [queries-path] -candidate [documents-path] -feature [features-path] \
  (-hits [num-hits]) (-spam [spam-scores-path] -output [output-path])
Arguments in parentheses are optional. If -hits is omitted, it defaults to 10,000 documents.
To perform feature extraction using document vectors with either the sliding window technique or the "on-the-fly indexing" technique, use one of the following drivers:
etc/run.sh ivory.ffg.driver.(DocumentVectorOnTheFlyIndexing|DocumentVectorSlidingWindow) \
  -index [Ivory-index-path] -dvclass [document-vector-class] -document [document-vectors-path] \
  -query [queries-path] -candidate [documents-path] -feature [features-path] \
  (-output [output-path])
The output file contains the feature values for each document, one document per line, in the following format: "qid \t document number \t space-delimited-feature-values".
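To load the output for further analysis, each line can be split on its two tabs and the feature values on whitespace. The following is a minimal sketch rather than an Ivory utility: parse_feature_line is a hypothetical name, and it assumes that the qid and document number are integers (as in the candidate file) and that every feature value is numeric.

```python
def parse_feature_line(line):
    """Split one output line into (qid, document number, feature values)."""
    # Split on the first two tabs only; the rest is the feature list.
    qid, docno, values = line.rstrip("\n").split("\t", 2)
    # Assumption: ids are integers and feature values parse as floats.
    return int(qid), int(docno), [float(v) for v in values.split()]


# Hypothetical output line, for illustration only.
print(parse_feature_line("3\t42\t0.5 1.25 7\n"))
```

The features come back in the order in which they appear on the line.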