A Hadoop toolkit for web-scale information retrieval research
Ivory can compute cross-lingual pairwise similarity on large collections. The underlying scalable approach is implemented as a two-phase pipeline. The first phase was introduced in Ture et al.'s SIGIR'11 paper No Free Lunch: Brute Force vs Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity, and the second is described in Ture and Lin's NAACL'12 paper Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences To Improve Translation Modeling.
In this two-part tutorial, we'll take you through the entire process using Wikipedia as our corpus: Starting from scratch, we first show how to find similar German-English article pairs (i.e., phase 1), and then illustrate how to extract German-English parallel text (i.e., phase 2). The output of phase 1 is used in phase 2, so you can skip to the second part of the tutorial if you have already completed the first one.
Info: Our approach can be applied to any collection format that is supported by Cloud9, and any language supported by Ivory. As of 1/13/2013, Ivory supports the following languages: English, German, French, Spanish, Chinese, Arabic, Czech, and Turkish. Everything should work out of the box for these languages. See the tokenization guide on how to add support for other languages.
Wikipedia articles can be downloaded in XML format from here. Wikipedia is offered in many languages, and the content is available at different levels of granularity; however, we only need the main article content to run our algorithm.
This tutorial is built around the dewiki-20121215-pages-articles.xml.bz2 dump of German Wikipedia and the enwiki-20121201-pages-articles.xml.bz2 dump of English Wikipedia. Download and uncompress those files. A peek into these files should reveal Wikipedia's XML format:
$ less dewiki-20121215-pages-articles.xml
<mediawiki ... xml:lang="de">
  <siteinfo>
  ...
  <page>
  ...
  </page>
  <page>
  ...
  </page>
</mediawiki>
Now, we are ready to transfer the XML-formatted collections to HDFS:
$ hdfs dfs -put *wiki*pages-articles.xml $wiki/raw/
Ivory's standard preprocessing pipeline is described in detail here; the goal is to represent each document as a vector of term weights.
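Conceptually, the end product of preprocessing is one sparse term-to-weight map per article. The following toy illustration (the terms and weights are made up, and this is not Ivory's internal representation) shows the kind of object the rest of the pipeline operates on:

import java.util.Map;

public class TermVectorSketch {
  public static void main(String[] args) {
    // A document as a sparse term-weight vector; Ivory computes the actual
    // weights with its own scoring model during preprocessing.
    Map<String, Float> docVector = Map.of(
        "berlin", 2.31f,
        "mauer", 1.87f,
        "1989", 1.02f);
    System.out.println(docVector);
  }
}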
For stop word removal and stemming, we need to upload data files to HDFS for tokenizers to load:
$ hdfs dfs -mkdir wikidata
$ hdfs dfs -mkdir wikidata/token
$ hdfs dfs -put data/tokenizer/*de*stop* wikidata/token/
$ hdfs dfs -put data/tokenizer/*en*stop* wikidata/token/
$ hdfs dfs -put data/tokenizer/de-token.bin wikidata/token/
$ hdfs dfs -put data/tokenizer/en-token.bin wikidata/token/
$ hdfs dfs -put data/vocab/*de-en* wikidata/
$ hdfs dfs -put data/vocab/*en-de* wikidata/
Tip: For languages that do not require model files, there is no need to add the *-token.bin model file. See the tokenization guide for details.
Since our case involves documents in two different languages, we need to translate the representations into the same language space before comparison. We use the cross-language IR methods introduced in Darwish and Oard's Probabilistic Structured Query Methods (2003) to translate the German representations into English. For this, we assume that vocabularies and translation tables are available in a data directory on HDFS. These can be generated using the Hooka word alignment toolkit (which also supports converting alignments from GIZA++ or BerkeleyAligner).
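To make the idea concrete, here is a minimal sketch of the translation step in the spirit of Darwish and Oard's probabilistic structured queries, assuming the translation table has already been loaded into memory as nested maps. This is not Ivory's API; the class, the terms, the weights, and the probabilities are all made up for illustration.

import java.util.HashMap;
import java.util.Map;

public class PsqSketch {
  // Project a German term-weight vector into the English vocabulary using a
  // probabilistic translation table P(e|f): each English term accumulates
  // weight(f) * P(e|f) over all German terms f that translate to it.
  static Map<String, Float> translate(Map<String, Float> deVector,
                                      Map<String, Map<String, Float>> ttable) {
    Map<String, Float> enVector = new HashMap<>();
    for (Map.Entry<String, Float> term : deVector.entrySet()) {
      Map<String, Float> translations = ttable.get(term.getKey());
      if (translations == null) continue;            // out-of-vocabulary term
      for (Map.Entry<String, Float> t : translations.entrySet()) {
        enVector.merge(t.getKey(), term.getValue() * t.getValue(), Float::sum);
      }
    }
    return enVector;
  }

  public static void main(String[] args) {
    Map<String, Map<String, Float>> ttable =
        Map.of("haus", Map.of("house", 0.7f, "home", 0.3f));
    Map<String, Float> deVector = Map.of("haus", 2.0f);
    System.out.println(translate(deVector, ttable));  // e.g. {home=0.6, house=1.4}
  }
}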
Now we are ready to run the preprocessing pipeline. The first run is for the English Wikipedia collection:
$ nohup etc/hadoop-cluster.sh ivory.app.PreprocessWikipedia \
    -mode=crosslingE -xml=/shared/collections/wikipedia/raw/enwiki-20121201-pages-articles \
    -compressed=pwsim.enwiki.compressed -index=pwsim.enwiki.index -lang=en \
    -tokenizermodel=wikidata/token/en-token.bin \
    -collectionvocab=wikidata/vocab.de-en.en \
    -e_stopword=wikidata/token/en.stop >& pwsim.en.log &
Warning: If the compressed Wikipedia collection already exists at pwsim.enwiki.compressed/, the above program will re-use it. If you are not sure that the existing files are compatible with what you are running, remove or change them before running the above.
Next, we preprocess and translate German Wikipedia articles:
$ nohup etc/hadoop-cluster.sh ivory.app.PreprocessWikipedia \
    -mode=crosslingF -xml=/shared/collections/wikipedia/raw/dewiki-20121215-pages-articles.xml \
    -compressed=pwsim.dewiki.compressed -index=pwsim.dewiki.index -targetindex=pwsim.enwiki.index \
    -lang=de -target_lang=en \
    -tokenizermodel=wikidata/token/de-token.bin \
    -e_tokenizermodel=wikidata/token/en-token.bin \
    -f_f2e_vocab=wikidata/vocab.de-en.de \
    -e_f2e_vocab=wikidata/vocab.de-en.en \
    -f2e_ttable=wikidata/ttable.de-en \
    -e_e2f_vocab=wikidata/vocab.en-de.en \
    -f_e2f_vocab=wikidata/vocab.en-de.de \
    -e2f_ttable=wikidata/ttable.en-de \
    -e_stopword=wikidata/token/en.stop \
    -f_stopword=wikidata/token/de.stop >& pwsim.de.log &
The log output of this program should give various statistics, such as the number of documents, terms, and tokens in the collection. Let's do a sanity check to make sure everything ran correctly:
$ grep -P 'TOTAL=|ARTICLE=|OTHER=|SumOfDocLengths=|Terms=|SHORT=|ZERO=|Total=|DISAMB' pwsim.en.log | uniq
ARTICLE=4033137
DISAMBIGUATION=123666
NON_ARTICLE=2899446
OTHER=138350
TOTAL=12961996
SumOfDocLengths=1014925528
Total=4033137
SumOfDocLengths=1014925528
Terms=56893
Total=56893
SHORT=24170
Total=4008884
ZERO=83
Total=4008884
You should see the same numbers in your log output. The German collection can be checked in the same way:
$ grep -P 'TOTAL=|ARTICLE=|OTHER=|SumOfDocLengths=|Terms=|SHORT=|ZERO=|Total=|DISAMB' pwsim.de.log | uniq
ARTICLE=1326111
DISAMBIGUATION=174678
NON_ARTICLE=450258
OTHER=35564
TOTAL=3001626
SumOfDocLengths=355374189
Total=1326111
SumOfDocLengths=355374189
Terms=168447
Total=168447
SHORT=486
Total=1325619
ZERO=6
Total=1325619
Since we are done with preprocessing, we can now play with the document representations. The content of a document vector can be compressed substantially by computing a signature, a bit/int array representation. Three signature approaches are implemented in Ivory (details available in Ture et al.'s SIGIR'11 paper): Simhash, min-hash, and random projections.
Each of these methods has certain benefits and drawbacks, but we use random projections in this application, mainly because a random-projection signature can have arbitrary length (i.e., any number of bits). The number of bits can be set empirically by examining the trade-off between effectiveness (i.e., how well a signature approximates the document vector) and efficiency (i.e., the time to compute the similarity between two signatures) as the number of bits varies. We'll use 1000 bits in this tutorial:
$ etc/hadoop-cluster.sh ivory.lsh.driver.RunComputeSignatures -index=pwsim.dewiki.index -num_bits=1000 -type=random
Since this process is based on 1000 randomly generated unit vectors, we need to use the same vectors when running this process for the English side. That's why we copy the random vectors over first:
$ hdfs dfs -cp pwsim.dewiki.index/randomvectors_D=1000/ pwsim.enwiki.index/
And then run signature generation:
$ etc/hadoop-cluster.sh ivory.lsh.driver.RunComputeSignatures -index=pwsim.enwiki.index -num_bits=1000 -type=random
As a result of these jobs, you should see a signatures-random_D=1000 folder under each index directory.
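Under the hood, a random-projection signature simply records the sign of the document vector's projection onto each random vector. The following toy sketch (not Ivory's implementation; the dense five-term document vector is hypothetical) also shows why both collections must be projected onto the same random vectors:

import java.util.BitSet;
import java.util.Random;

public class RandomProjectionSketch {
  // Bit i of the signature is set when the dot product of the document vector
  // with the i-th random vector is non-negative; similar vectors therefore
  // tend to agree on most bits.
  static BitSet signature(float[] docVector, float[][] randomVectors) {
    BitSet bits = new BitSet(randomVectors.length);
    for (int i = 0; i < randomVectors.length; i++) {
      double dot = 0.0;
      for (int j = 0; j < docVector.length; j++) {
        dot += docVector[j] * randomVectors[i][j];
      }
      bits.set(i, dot >= 0);   // keep only the sign of the projection
    }
    return bits;
  }

  public static void main(String[] args) {
    int numBits = 1000, vocabSize = 5;
    // Both collections must use the *same* random vectors, which is why the
    // tutorial copies randomvectors_D=1000 from one index to the other.
    Random rnd = new Random(42);
    float[][] randomVectors = new float[numBits][vocabSize];
    for (float[] v : randomVectors)
      for (int j = 0; j < vocabSize; j++) v[j] = (float) rnd.nextGaussian();

    float[] doc = {0.3f, 0.0f, 1.2f, 0.0f, 0.7f};
    System.out.println("first 16 bits: " + signature(doc, randomVectors).get(0, 16));
  }
}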
In order to find similar Wikipedia articles, we compare signatures by computing the Hamming distance (i.e., the number of differing bits) and returning any pair whose distance is below a specified threshold T. Instead of comparing all signatures of German Wikipedia to all signatures of English Wikipedia (the brute-force approach), we have implemented the sliding window algorithm from Ravichandran et al.'s paper, Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering. In this algorithm, we generate Q permutations of each signature, resulting in Q signature tables. Each table is sorted with respect to the bit values, and then each signature in each table is compared to its neighbors that are at most B positions away.
$ nohup etc/hadoop-cluster.sh ivory.lsh.pwsim.GenerateChunkedPermutedTables \
    -sourceindex=pwsim.dewiki.index -index=pwsim.enwiki.index -num_bits=1000 \
    -type=random -Q=300 -overlap=10000 -B=2000 >& chunking.log &

$ nohup etc/hadoop-cluster.sh ivory.lsh.pwsim.cl.CLSlidingWindowPwsim \
    -index=pwsim.enwiki.index -num_bits=1000 -type=random -Q=300 \
    -overlap=10000 -B=2000 -T=400 >& windowing.log &
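To make the permute-sort-and-slide idea concrete, here is a minimal in-memory sketch. Ivory's actual implementation is a distributed MapReduce job over chunked tables; the toy signatures, parameters, and integer document IDs below are made up for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class SlidingWindowSketch {
  // Hamming distance = number of differing bits.
  static int hamming(BitSet a, BitSet b) {
    BitSet x = (BitSet) a.clone();
    x.xor(b);
    return x.cardinality();
  }

  // Emit all pairs found within a window of B positions in any of the Q
  // permuted tables whose Hamming distance is at most T.
  static Set<String> slidingWindow(List<BitSet> sigs, int numBits,
                                   int Q, int B, int T, Random rnd) {
    Set<String> pairs = new HashSet<>();
    List<Integer> perm = new ArrayList<>();
    for (int i = 0; i < numBits; i++) perm.add(i);

    for (int q = 0; q < Q; q++) {
      Collections.shuffle(perm, rnd);          // one random bit permutation per table
      Integer[] order = new Integer[sigs.size()];
      for (int i = 0; i < order.length; i++) order[i] = i;

      // Sort signatures lexicographically under the permuted bit order.
      Arrays.sort(order, (i, j) -> {
        for (int p : perm) {
          boolean bi = sigs.get(i).get(p), bj = sigs.get(j).get(p);
          if (bi != bj) return bi ? 1 : -1;
        }
        return 0;
      });

      // Compare each signature only to neighbors at most B positions away.
      for (int i = 0; i < order.length; i++) {
        for (int j = i + 1; j < order.length && j <= i + B; j++) {
          if (hamming(sigs.get(order[i]), sigs.get(order[j])) <= T) {
            pairs.add(Math.min(order[i], order[j]) + "-" + Math.max(order[i], order[j]));
          }
        }
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    Random rnd = new Random(0);
    int numBits = 64;
    List<BitSet> sigs = new ArrayList<>();
    for (int d = 0; d < 20; d++) {
      BitSet s = new BitSet(numBits);
      for (int b = 0; b < numBits; b++) s.set(b, rnd.nextBoolean());
      sigs.add(s);
    }
    // Tiny toy parameters; the tutorial uses Q=300, B=2000, T=400 over 1000 bits.
    System.out.println(slidingWindow(sigs, numBits, 5, 3, 20, rnd));
  }
}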
Info: We perform a special optimization as the permuted signature tables are generated. Each table is split into a number of consecutive chunks in order to increase the degree of parallelism in the sliding window algorithm. The overlap parameter defines the amount of overlap between these chunks; it needs to be at least as large as B so that no pair of signatures within B positions of each other is missed at a chunk boundary.
Once these jobs complete, the output of the sliding window algorithm will be under pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000. The number of similar article pairs output by the algorithm will vary from run to run due to its randomized nature, but you should see tens of millions of German-English article pairs (see the counter Reduce output records in the log output of ivory.lsh.pwsim.cl.CLSlidingWindowPwsim).
Ivory has a number of built-in tools for evaluating pairwise similarity approaches. Let us describe how to evaluate the sliding window algorithm's output by comparing it against the brute-force approach. First, we'll sample about 1000 German Wikipedia articles for this evaluation:
$ etc/hadoop-cluster.sh ivory.lsh.eval.SampleIntDocVectors \
    -index=pwsim.dewiki.index -size=1000 -docnos=pwsim.dewiki.index/sample-docnos
At this point, around 1000 sampled document vectors should be written to pwsim.dewiki.index/wt-int-doc-vectors_sample=1000, whereas the docnos identifying this sampled subset should be written to pwsim.dewiki.index/sample-docnos. We can do a quick sanity check on that file, making sure that the following returns a number close to 1000 (it may not be exactly 1000 due to random sampling):
$ hdfs dfs -cat pwsim.dewiki.index/sample-docnos | wc -l
Next, we filter out the similar article pairs whose German side does not appear in this sample:
$ etc/hadoop-cluster.sh ivory.lsh.eval.FilterResults \
    -input=pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000 \
    -output=pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered \
    -docnos=pwsim.dewiki.index/sample-docnos
Notice that the number of similar pairs was filtered down from tens of millions to tens of thousands. We can compute comparable output using a brute-force approach (i.e., compare each sampled document vector from German Wikipedia to all document vectors of English Wikipedia):
$ etc/hadoop-cluster.sh ivory.lsh.eval.BruteForcePwsim \
    -input=pwsim.enwiki.index/wt-int-doc-vectors \
    -sample=pwsim.dewiki.index/wt-int-doc-vectors_sample=1000/part-00000 \
    -output=pwsim.enwiki.index/groundtruth_T=0.3 \
    -cosineT=0.3 -type=intdocvector
The cosine similarity of two document vectors is computed as their normalized inner product. This program outputs all article pairs with a cosine similarity above 0.3, which corresponds to a Hamming distance of 400 out of 1000 bits; this relation follows from the theory behind LSH and how it is used to approximate similarity. We treat this output as ground truth, so that we can assess the error due to (i) the signature generation process and (ii) the randomized sliding window algorithm (details about this evaluation are provided in Ture et al.'s SIGIR'11 paper).
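As a back-of-the-envelope check (not Ivory code), the following sketch computes the cosine as a normalized inner product and shows why a Hamming distance of 400 out of 1000 bits maps to a cosine threshold of roughly 0.3:

public class CosineSketch {
  // For random projections the expected fraction of differing bits equals
  // theta/pi, where theta is the angle between the two vectors, so a Hamming
  // distance H out of D bits corresponds to a cosine of roughly cos(pi * H / D).
  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  public static void main(String[] args) {
    System.out.println(cosine(new double[]{1, 2, 0}, new double[]{2, 1, 1}));  // ~0.73
    int H = 400, D = 1000;
    System.out.printf("Hamming %d/%d ~ cosine %.3f%n", H, D, Math.cos(Math.PI * H / D));
    // prints ~0.309, which is why maxdst=400 is paired with cosineT=0.3
  }
}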
Using the provided script etc/eval.pl, we can compute the precision and recall of the filtered output with respect to the ground truth:
$ hadoop dfs -get pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered/part-00000 ./pwsim-filtered_400-1000-300-2000
$ hadoop dfs -get pwsim.enwiki.index/groundtruth_T=0.3/part-00000 ./ground_0.3
$ cd etc/
$ perl eval.pl ../ground_0.3 ../pwsim-filtered_400-1000-300-2000
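Conceptually, the numbers reported here boil down to standard set-based precision and recall over the output pairs. The following toy illustration (this is not the eval.pl script itself, and the docno pairs are made up) shows the computation:

import java.util.HashSet;
import java.util.Set;

public class PrecisionRecallSketch {
  public static void main(String[] args) {
    // Treating each similar-article pair as an item:
    // precision = |output ∩ ground truth| / |output|
    // recall    = |output ∩ ground truth| / |ground truth|
    Set<String> groundTruth = Set.of("12-7", "12-9", "35-4");
    Set<String> pwsimOutput = Set.of("12-7", "35-4", "88-2");
    Set<String> hits = new HashSet<>(pwsimOutput);
    hits.retainAll(groundTruth);
    System.out.printf("precision = %.2f, recall = %.2f%n",
        (double) hits.size() / pwsimOutput.size(),
        (double) hits.size() / groundTruth.size());
    // precision = 0.67, recall = 0.67 for this toy example
  }
}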
Since the output of our algorithm mixes two kinds of error (mentioned above), we can also generate an upperbound output by running brute force on signatures. This allows us to measure the error due to the sliding window algorithm in isolation. For this, we first need to sample signatures (like we did for int doc vectors):
$ etc/hadoop-cluster.sh ivory.lsh.eval.SampleSignatures \
    pwsim.dewiki.index/signatures-random_D=1000 \
    pwsim.dewiki.index/signatures-random_D=1000_sample=1000 \
    random -1 pwsim.dewiki.index/sample-docnos
Now, we can run the same brute-force program to output pairs based on Hamming distance (as opposed to true cosine similarity):
$ etc/hadoop-cluster.sh ivory.lsh.eval.BruteForcePwsim \
    -input=pwsim.enwiki.index/signatures-random_D=1000 \
    -sample=pwsim.dewiki.index/signatures-random_D=1000_sample=1000/part-00000 \
    -output=pwsim.enwiki.index/upperbound_maxdst=400 \
    -cosineT=400 -type=signature
Again, using the provided script etc/eval.pl, we can compute the precision and recall of the filtered output with respect to this upperbound:
$ hadoop dfs -get pwsim.enwiki.index/upperbound_maxdst=400/part-00000 ./upper_400
$ cd etc/
$ perl eval.pl ../upper_400 ../pwsim-filtered_400-1000-300-2000
Another useful tool converts the output of the sliding window algorithm (i.e., [key: (German document id, English document id), value: distance]) into a more human-friendly format, such as [key: German article title, value: English article title].
First, let's combine all of the SequenceFiles under pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000 into a single file for easy loading:
$ etc/hadoop-cluster.sh edu.umd.cloud9.util.CombineSequenceFiles \
    pwsim.enwiki.index/similardocs_random_maxdst=400_D=1000_Q=300_B=2000 \
    pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000.single \
    100 1 edu.umd.cloud9.io.pair.PairOfInts org.apache.hadoop.io.IntWritable sequence
Now, we can convert the docno-based output format to title pairs:
$ etc/hadoop-cluster.sh ivory.lsh.eval.Docnos2Titles \
    -pwsim_output=pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000.single/part-00000 \
    -output=pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered.titles \
    -e_collection=pwsim.enwiki.compressed -f_collection=pwsim.dewiki.compressed \
    -f_lang=de -e_lang=en -docnos=pwsim.dewiki.index/sample-docnos
Once the job completes, we can check out some of the similar Wikipedia titles:
$ hdfs dfs -cat pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000-filtered.titles/part-00000 | less
Although most of the pairs should be about relevant topics, you'll probably notice that some of the pairs look wrong. This is expected and may happen for several reasons, which are analyzed in more detail in Ture et al.'s SIGIR'11 paper.
Congratulations! You have completed the first phase of our cross-language similarity pipeline. In the next phase, we show how to find parallel sentence pairs from the output of this phase. For any further questions and comments, feel free to contact Ferhan Ture.