A Hadoop toolkit for web-scale information retrieval research

This is the second part of a two-part tutorial, so please make sure you have completed the first part before starting this one. In this part, we will show how to extract parallel sentence pairs from the similar article pairs we found previously.

1. Splitting Documents into Sentences

To split documents into sentences, we first need to upload the sentence detection model files to HDFS:

$ hdfs dfs -mkdir wikidata/sent
$ hdfs dfs -put data/vocab/de-sent.bin wikidata/sent/
$ hdfs dfs -put data/vocab/en-sent.bin wikidata/sent/
Tip For languages that are not currently supported (i.e., no *-sent.bin file under data/tokenizer), refer to the OpenNLP documentation to train your own models.

Let's split each document into a list of sentences:

$ nohup etc/hadoop-cluster.sh ivory.lsh.bitext.Docs2Sentences \
   -data=wikidata -e_lang=en -f_lang=de \
   -e_collection=pwsim.enwiki.compressed -f_collection=pwsim.dewiki.compressed \
   -e_index=pwsim.enwiki.index -f_index=pwsim.dewiki.index \
   -sentences=pwsim.results/sentences.de-en >& sentence-split.log &
Info As a heuristic, the above program discards any sentence with fewer than 5 tokens or fewer than 3 unique terms, as well as any sentence in which more than half of the tokens are out-of-vocabulary. The latter ratio can be changed by setting the -oov_rate option to the maximum allowed ratio of out-of-vocabulary tokens.
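The filtering heuristic above can be illustrated with a short sketch (this is not the Ivory implementation; the function name and vocabulary are illustrative): a sentence is kept only if it has at least 5 tokens, at least 3 unique terms, and an out-of-vocabulary ratio no higher than the configured limit.

```python
def keep_sentence(tokens, vocab, oov_rate=0.5):
    """Return True if a tokenized sentence survives the filtering heuristic."""
    if len(tokens) < 5:          # too short
        return False
    if len(set(tokens)) < 3:     # too few unique terms
        return False
    oov = sum(1 for t in tokens if t not in vocab)
    return oov / len(tokens) <= oov_rate

# Toy vocabulary for illustration.
vocab = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
print(keep_sentence("the quick brown fox jumps".split(), vocab))  # kept
print(keep_sentence("zzz qqq xxx yyy www".split(), vocab))        # all OOV, discarded
```

Raising -oov_rate corresponds to loosening the last check, letting noisier sentences through to the classification stage.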

All sentences from the two collections are now written to pwsim.results/sentences.de-en, with identifiers that distinguish German sentences from English ones.

2. Classifying Sentence Pairs

The next step is to classify each German-English sentence pair as parallel or not, using a novel two-step classification approach described in Ture and Lin's NAACL'12 paper, Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences To Improve Translation Modeling. Trained classifier models are provided under data/classifier. Let us first upload these files to HDFS so that the program can access them:

$ hdfs dfs -put data/classifier/classifier-simple.de-en wikidata/
$ hdfs dfs -put data/classifier/classifier-complex.de-en wikidata/
The first classification step generates all candidate sentence pairs and then applies the simple classifier, emitting all pairs scored above the provided threshold (e.g., F1=0.9):

$ nohup etc/hadoop-cluster.sh ivory.lsh.bitext.FindParallelSentencePairs \
   -e_collection=pwsim.enwiki.compressed -f_collection=pwsim.dewiki.compressed \
   -sentences=pwsim.results/sentences.de-en \
   -pwsim_output=pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000.single/part-00000 \
   -bitext=pwsim.results/bitext.F1=90 \
   -e_index=pwsim.enwiki.index -f_index=pwsim.dewiki.index \
   -data=wikidata -e_lang=en -f_lang=de -threshold=0.9 -classifier_id=1 >& classify1.log &
Tip Modifying this threshold has a corresponding effect on precision and recall. One way to set this parameter is to evaluate the classifier on held-out data and pick a value that yields high recall for F1 and high precision for F2. More details about the classifier training process and our implementation are available in the online documentation and in Ture and Lin's NAACL'12 paper.
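The threshold-tuning idea in the tip can be sketched as follows (the scores and gold labels are hypothetical, and precision_recall is an illustrative helper, not part of Ivory): sweep candidate thresholds over held-out classifier scores and inspect the precision/recall trade-off at each.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when predicting 'parallel' for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical held-out scores and gold labels (1 = parallel).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]
labels = [1,    1,    0,    1,    0,    0]
for t in (0.5, 0.9):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Lower thresholds favor recall (appropriate for F1, since the second classifier can still reject the false positives), while higher thresholds favor precision (appropriate for F2, which makes the final decision).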

The second classification step applies the complex classifier to each pair output by the first step, emitting all pairs scored above the provided threshold (e.g., F2=0.5):

$ nohup etc/hadoop-cluster.sh ivory.lsh.bitext.FilterSentencePairs \
   -input=pwsim.results/bitext.F1=90 -output=pwsim.results/bitext.F1=90.F2=50 \
   -e_index=pwsim.enwiki.index -f_index=pwsim.dewiki.index \
   -data=wikidata -e_lang=en -f_lang=de \
   -threshold=0.5 -classifier_id=1 >& classify2.log &
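The overall two-step cascade can be sketched in a few lines (the classifiers here are toy stand-ins with pre-assigned scores, not the trained models uploaded above): the cheap classifier prunes the candidate space at a recall-oriented threshold, and the expensive classifier makes the final precision-oriented decision on the survivors.

```python
def cascade(pairs, simple, complex_, t1=0.9, t2=0.5):
    """Two-step classification: cheap filter first, expensive filter on survivors."""
    # Step 1: score every candidate pair with the simple classifier.
    survivors = [p for p in pairs if simple(p) >= t1]
    # Step 2: re-score only the survivors with the complex classifier.
    return [p for p in survivors if complex_(p) >= t2]

# Toy stand-ins for the trained classifiers: each maps a pair id to a score.
simple_scores = {"a": 0.95, "b": 0.92, "c": 0.30}
complex_scores = {"a": 0.80, "b": 0.20, "c": 0.90}
bitext = cascade(["a", "b", "c"], simple_scores.get, complex_scores.get)
print(bitext)  # only "a" clears both thresholds
```

Note that "c" is never scored by the complex classifier at all; this is what makes the cascade cheap, since the expensive model only runs on the small fraction of candidates that survive step one.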