A Hadoop toolkit for web-scale information retrieval research

Ivory can perform cross-lingual pairwise similarity on large collections. The underlying scalable approach is implemented as a two-phase pipeline. The first phase was introduced in Ture et al.'s SIGIR'11 paper No Free Lunch: Brute Force vs Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity and the second is described in Ture & Lin's NAACL'12 paper Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences To Improve Translation Modeling.

In this two-part tutorial, we'll take you through the entire process using Wikipedia as our corpus: Starting from scratch, we first show how to find similar German-English article pairs (i.e., phase 1), and then illustrate how to extract German-English parallel text (i.e., phase 2). The output of phase 1 is used in phase 2, so you can skip to the second part of the tutorial if you have already completed the first one.

Info Our approach can be applied to any collection format that is supported by Cloud9, and any language supported by Ivory. As of 1/13/2013, Ivory supports the following languages: English, German, French, Spanish, Chinese, Arabic, Czech, and Turkish. Everything should work out of the box for these languages. See the tokenization guide on how to add support for other languages.

1. Downloading Wikipedia

Wikipedia articles can be downloaded in XML format from here. Wikipedia is offered in many languages, and you can download the content in different granularities. However, we only need the main article content to run our algorithm.

This is built around the dewiki-20121215-pages-articles.xml.bz2 dump of German Wikipedia and the enwiki-20121201-pages-articles.xml.bz2 dump of English Wikipedia. Download an uncompresse those files. A peek into these files should reveal Wikipedia's XML format:

$ less dewiki-20121215-pages-articles.xml
<mediawiki ... xml:lang="de">

Now, we are ready to transfer the XML-formatted collections to HDFS:

$ hdfs dfs -put *wiki*pages-articles.xml $wiki/raw/

2. Preprocessing Wikipedia

The standard preprocessing pipeline of Ivory is shown in detail here, where the goal is to represent each document as a vector of term weights.

For stop word removal and stemming, we need to upload data files to HDFS for tokenizers to load:

$ hdfs dfs -mkdir wikidata
$ hdfs dfs -mkdir wikidata/token
$ hdfs dfs -put data/tokenizer/*de*stop* wikidata/token/
$ hdfs dfs -put data/tokenizer/*en*stop* wikidata/token/
$ hdfs dfs -put data/tokenizer/de-token.bin wikidata/token/
$ hdfs dfs -put data/tokenizer/en-token.bin wikidata/token/
$ hdfs dfs -put data/vocab/*de-en* wikidata/
$ hdfs dfs -put data/vocab/*en-de* wikidata/
Tip For languages that do not require model files, there is no need to add the *-token.bin model file. See tokenization guide for details.

Since there are documents in two different languages in our case, we need to translate the representations into the same language space for comparison. We use the cross-language IR methods introduced in Darwish and Oard's Probabilistic Structured Query Methods (2003) to translate the German representations into English. For this, we assume that vocabularies and translation tables are available on some HDFS data directory. These can be generated using the Hooka word alignment toolkit (also supports conversion from GIZA++ or BerkeleyAligner).

Now we are ready to run the preprocessing pipeline. The first run is for the English Wikipedia collection:

$ nohup etc/hadoop-cluster.sh ivory.app.PreprocessWikipedia \
   -mode=crosslingE -xml=/shared/collections/wikipedia/raw/enwiki-20121201-pages-articles \
   -compressed=pwsim.enwiki.compressed -index=pwsim.enwiki.index -lang=en \
   -tokenizermodel=wikidata/token/en-token.bin \
   -collectionvocab=wikidata/vocab.de-en.en \
   -e_stopword=wikidata/token/en.stop >& pwsim.en.log &
Warning If the compressed Wikipedia collection already exists at pwsim.enwiki.compressed/, the above program will re-use it. If you are not sure that the existing files are compatible with what you are running, remove or change them before running the above.

Next, we preprocess and translate German Wikipedia articles:

$ nohup etc/hadoop-cluster.sh ivory.app.PreprocessWikipedia \
   -mode=crosslingF -xml=/shared/collections/wikipedia/raw/dewiki-20121215-pages-articles.xml \
   -compressed=pwsim.dewiki.compressed -index=pwsim.dewiki.index -targetindex=pwsim.enwiki.index \
   -lang=de -target_lang=en \
   -tokenizermodel=wikidata/token/de-token.bin \
   -e_tokenizermodel=wikidata/token/en-token.bin \
   -f_f2e_vocab=wikidata/vocab.de-en.de \
   -e_f2e_vocab=wikidata/vocab.de-en.en \
   -f2e_ttable=wikidata/ttable.de-en \
   -e_e2f_vocab=wikidata/vocab.en-de.en \
   -f_e2f_vocab=wikidata/vocab.en-de.de \
   -e2f_ttable=wikidata/ttable.en-de \
   -e_stopword=wikidata/token/en.stop \
   -f_stopword=wikidata/token/de.stop >& pwsim.de.log &

The log output of this program should gives various statistics, such as the number of documents, terms, tokens in the collection. Let's do a sanity check to make sure everything ran correctly:

$ grep -P 'TOTAL=|ARTICLE=|OTHER=|SumOfDocLengths=|Terms=|SHORT=|ZERO=|Total=|DISAMB' pwsim.en.log | uniq

You should see the same numbers in your log output.

$ grep -P 'TOTAL=|ARTICLE=|OTHER=|SumOfDocLengths=|Terms=|SHORT=|ZERO=|Total=|DISAMB' pwsim.de.log | uniq

3. Representing Documents as Signatures

Since we are done with preprocessing, we can now play with the document representation. It is possible to compress the content of a document vector substantially by computing a signature, a bit/int array representation. Three different signature approaches are implemented as part of Ivory (details available in Ture et al's SIGIR'11 paper): Simhash, min-hash, and randomized projections.

Each of these methods has certain benefits and drawbacks, yet we have decided to use random projections in this application mainly due to its ability to have an arbitrary length (i.e., number of bits). The number of bits can be set empirically, by looking at the trade-off between effectiveness (i.e., how good a signature approximates the document vector) and efficiency (i.e., time to compute similarity between two signatures) as we vary the number. We'll use 1000 bits in this tutorial:

$ etc/hadoop-cluster.sh ivory.lsh.driver.RunComputeSignatures -index=pwsim.dewiki.index -num_bits=1000 -type=random

Since this process is based on 1000 randomly generated unit vectors, we need to use the same vectors when running this process for the English side. That's why we copy the random vectors over first:

$ hdfs dfs -cp pwsim.dewiki.index/randomvectors_D=1000/ pwsim.enwiki.index/

And then run signature generation:

$ etc/hadoop-cluster.sh ivory.lsh.driver.RunComputeSignatures -index=pwsim.enwiki.index -num_bits=1000 -type=random
As a result of these jobs, you should see a signatures-random_D=1000 folder under each index directory.

4. Finding Similar Document Pairs

In order to find similar Wikipedia articles, we will compare signatures by calculating the Hamming distance (i.e., number of different bits), and returning any pair with less than a specified distance T. Instead of comparing all signatures of German Wikipedia to all signatures of English Wikipedia, which is called the brute-force approach, we have implemented the sliding window-based algorithm in Ravichandran et al.'s paper, Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering.

According to this algorithm, we will generate Q permutations of each signature, resulting in Q signature tables. Each table is sorted with respect to the bit values, and then we compare each signature of each table to its neighbors that are at most B positions away.

$ nohup etc/hadoop-cluster.sh ivory.lsh.pwsim.GenerateChunkedPermutedTables \
   -sourceindex=pwsim.dewiki.index -index=pwsim.enwiki.index -num_bits=1000 \
   -type=random -Q=300 -overlap=10000 -B=2000 >& chunking.log &

$ nohup etc/hadoop-cluster.sh ivory.lsh.pwsim.cl.CLSlidingWindowPwsim \
    -index=pwsim.enwiki.index -num_bits=1000 -type=random -Q=300 \
    -overlap=10000 -B=2000 -T=400 >& windowing.log &
Info We perform a special optimization as the permuted signature tables are generated. Each table is split into a number of consecutive chunks in order to increase the degree of parallelism in the sliding window algorithm. The parameter overlap defines the amount of overlap between these chunks, and needs to be at least as large as B.

These two jobs may take a while to finish (e.g., few hours on a 128-core cluster), but if everything runs smoothly, you should see the output written to pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000. The number of similar article pairs output by the algorithm will vary from run to run due to the randomized nature, but you should see tens of millions of German-English article pairs output (see counter Reduce output records) in the log output of ivory.lsh.pwsim.cl.CLSlidingWindowPwsim.

5. Evaluation

Ivory has many built-in functionalities for evaluation of pairwise similarity approaches. Let us describe the process of evaluating the sliding window algorithm's output by comparing it to the brute-force approach. First, we'll sample about 1000 German Wikipedia articles for this evaluation:

$ etc/hadoop-cluster.sh ivory.lsh.eval.SampleIntDocVectors \
   -index=pwsim.dewiki.index -size=1000 -docnos=pwsim.dewiki.index/sample-docnos

At this point, around 1000 sampled document vectors should be written to pwsim.dewiki.index/wt-int-doc-vectors_sample=1000, whereas the docnos identifying this sampled subset should be written to pwsim.dewiki.index/sample-docnos. We can do a quick sanity check on that file, making sure that the following returns a number close to 1000 (not exactly due to random sampling):

$ hdfs dfs -cat pwsim.dewiki.index/sample-docnos | wc -l

Next, we can filter the similar article pairs that do not appear in this sample:

$ etc/hadoop-cluster.sh ivory.lsh.eval.FilterResults \
   -input=pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000 \
   -output=pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered \

Notice that the number of similar pairs was filtered down from tens of millions to tens of thousands. We can compute comparable output using a brute-force approach (i.e., compare each sampled document vector from German Wikipedia to all document vectors of English Wikipedia):

$ etc/hadoop-cluster.sh ivory.lsh.eval.BruteForcePwsim \
   -input=pwsim.enwiki.index/wt-int-doc-vectors \
   -sample=pwsim.dewiki.index/wt-int-doc-vectors_sample=1000/part-00000 \
   -output=pwsim.enwiki.index/groundtruth_T=0.3 \
   -cosineT=0.3 -type=intdocvector

The cosine similarity of two document vectors is computed from the normalized inner product. This program outputs all article pairs with the cosine similarity above 0.3, which corresponds to a Hamming distance of 400 out of 1000 bits. This relation is due to the theory behind LSH and how it is used to approximate similarity. We consider this output as ground truth, so we can assess the error due to (i) the signature generation process, and (ii) the randomized sliding window algorithm (details about this evaluation is provided in Ture et al's SIGIR'11 paper).

Using the provided script etc/eval.pl, we can compute precision and recall of the filtered output with respect to the ground truth:

$ hadoop dfs -get pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered/part-00000 ../pwsim-filtered_400-1000-300-2000
$ hadoop dfs -get pwsim.enwiki.index/groundtruth_T=0.3/part-00000 ./ground_0.3
$ cd etc/
$ perl eval.pl ../ground_0.3 ../pwsim-filtered_400-1000-300-2000 

Since the output of our algorithm mixes two kinds of error (mentioned above), we can also generate an upperbound output by running brute force on signatures. This allows us to measure the error due to the sliding window algorithm in isolation. For this, we first need to sample signatures (like we did for int doc vectors):

$ etc/hadoop-cluster.sh ivory.lsh.eval.SampleSignatures \
    pwsim.dewiki.index/signatures-random_D=1000 \
    pwsim.dewiki.index/signatures-random_D=1000_sample=1000 \
    random -1 pwsim.dewiki.index/sample-docnos

Now, we can run the same brute-force program to output pairs based on Hamming distance (as opposed to true cosine similarity):

$ etc/hadoop-cluster.sh ivory.lsh.eval.BruteForcePwsim \
    -input=pwsim.enwiki.index/signatures-random_D=1000 \
    -sample=pwsim.dewiki.index/signatures-random_D=1000_sample=1000/part-00000 \
    -output=pwsim.enwiki.index/upperbound_maxdst=400 \
    -cosineT=400 -type=signature

Again, using the provided script etc/eval.pl, we can compute precision and recall of the filtered output with respect to this upperbound:

$ hadoop dfs -get pwsim.enwiki.index/upperbound_maxdst=400/part-00000 ./upper_400
$ cd etc/
$ perl eval.pl ../upper_400 ../pwsim-filtered_400-1000-300-2000

Another useful tool is to convert the output of the sliding window algorithm (i.e., [key: (German document id, English document id), value: distance]) to a more human-friendly format, such as [key: German document title, value: English document title].

First, let's combine all of the SequenceFiles under pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000 into a single file for easy loading:

$ etc/hadoop-cluster.sh edu.umd.cloud9.util.CombineSequenceFiles \
    pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000 \
    pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000.single \
    100 1 edu.umd.cloud9.io.pair.PairOfInts org.apache.hadoop.io.IntWritable sequence

Now, we can convert the docno-based output format to title pairs:

$ etc/hadoop-cluster.sh ivory.lsh.eval.Docnos2Titles \
    -pwsim_output=pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000.single/part-00000 \
    -output=pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered.titles \
    -e_collection=pwsim.enwiki.compressed -f_collection=pwsim.dewiki.compressed \
    -f_lang=de -e_lang=en -docnos=pwsim.dewiki.index/sample-docnos

Once the job completes, we can check out some of the similar Wikipedia titles:

$ hdfs dfs -cat pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered.titles/part-00000 | less

Although most of the pairs should be about relevant topics, you'll probably notice that some of the pairs look wrong. This is expected, and may be due to several reasons, analyzed in more detail in Ture et al's SIGIR'11 paper.

Congratulations! You have completed the first phase of our cross-language similarity pipeline. In the next phase, we show how to find parallel sentence pairs from the output of this phase. For any further questions and comments, feel free to contact Ferhan Ture.