A Hadoop toolkit for web-scale information retrieval research
This is the second part of a two-part tutorial, so please make sure you have completed the first part before starting this one. In this part, we will show how to extract parallel sentence pairs from the similar article pairs we found previously.
In order to split documents into sentences, we need to upload sentence detection model files to HDFS:
$ hdfs dfs -mkdir wikidata/sent
$ hdfs dfs -put data/vocab/de-sent.bin wikidata/sent/
$ hdfs dfs -put data/vocab/en-sent.bin wikidata/sent/
Tip: For languages that are not currently supported (i.e., no *-sent.bin file under data/tokenizer), refer to the OpenNLP documentation to train your own models.
Let's split each document into a list of sentences:
$ nohup etc/hadoop-cluster.sh ivory.lsh.bitext.Docs2Sentences \
    -data=wikidata -e_lang=en -f_lang=de \
    -e_collection=pwsim.enwiki.compressed -f_collection=pwsim.dewiki.compressed \
    -e_index=pwsim.enwiki.index -f_index=pwsim.dewiki.index \
    -sentences=pwsim.results/sentences.de-en >& sentence-split.log &
Info: As a heuristic, the above program discards any sentence with fewer than 5 tokens or fewer than 3 unique terms, as well as any sentence in which more than half of the tokens are out-of-vocabulary. The latter criterion can be adjusted by setting the option -oov_rate to the maximum allowed ratio of out-of-vocabulary tokens.
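To make the heuristic concrete, here is a minimal local sketch (not the actual Ivory code) that applies the same two rules to a few toy sentences, keeping only those with at least 5 tokens and at least 3 unique terms:

```shell
# Toy input: one sentence per line.
printf 'too short\nthe quick brown fox jumps again\na a a a a a\n' > /tmp/sents.txt

# Keep a sentence only if it has >= 5 tokens and >= 3 unique terms.
awk '{
  split("", seen); uniq = 0
  for (i = 1; i <= NF; i++) if (!seen[$i]++) uniq++
  if (NF >= 5 && uniq >= 3) print
}' /tmp/sents.txt
# Only "the quick brown fox jumps again" survives: the first line is too
# short, and the third has 6 tokens but only 1 unique term.
```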
All sentences in the two collections are now written to pwsim.results/sentences.de-en, with identifiers to distinguish between German and English.
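Conceptually, mixing two collections in one output while keeping them distinguishable just means tagging each sentence with its language; a hypothetical sketch (the identifier format Ivory actually writes may differ):

```shell
# Two toy single-sentence collections.
printf 'Das ist ein Beispielsatz .\n' > /tmp/de.txt
printf 'This is an example sentence .\n' > /tmp/en.txt

# Prefix each sentence with a language tag and merge into one file.
{ awk '{ print "de\t" $0 }' /tmp/de.txt
  awk '{ print "en\t" $0 }' /tmp/en.txt; } > /tmp/sentences.de-en
cat /tmp/sentences.de-en
```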
The next step is to classify each candidate sentence pair as parallel or not. Two pre-trained classifier files are included under data/classifier. Let us first upload these files to HDFS where the program can access them:
$ hdfs dfs -put data/classifier/classifier-simple.de-en wikidata/
$ hdfs dfs -put data/classifier/classifier-complex.de-en wikidata/

The first classification step will generate all candidate sentence pairs, and then apply the simple classifier, emitting all pairs scored above the provided threshold (e.g., F1=0.9):
$ nohup etc/hadoop-cluster.sh ivory.lsh.bitext.FindParallelSentencePairs \
    -e_collection=pwsim.enwiki.compressed -f_collection=pwsim.dewiki.compressed \
    -sentences=pwsim.results/sentences.de-en \
    -pwsim_output=pwsim.results/similardocs_random_maxdst=400_D=1000_Q=300_B=2000.single/part-00000 \
    -bitext=pwsim.results/bitext.F1=90 \
    -e_index=pwsim.enwiki.index -f_index=pwsim.dewiki.index \
    -data=wikidata -e_lang=en -f_lang=de -threshold=0.9 -classifier_id=1 >& classify1.log &
Tip: Modifying this threshold will have a corresponding effect on precision and recall. One way to set this parameter is to evaluate the classifier on held-out data and pick a value that yields high recall for F1, and high precision for F2. You can learn more about the classifier training process and our implementation from the online documentation, or read Ture and Lin's NAACL'12 paper.
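As a toy illustration of that tuning procedure, the following sketch computes precision and recall at a given threshold over a made-up held-out set (the labels and scores here are invented for illustration, not real classifier output):

```shell
# Held-out data: "gold_label classifier_score", where 1 = truly parallel.
printf '1 0.95\n0 0.92\n1 0.85\n0 0.40\n1 0.30\n' > /tmp/heldout.txt

# Score every pair against threshold t and tally true/false positives.
awk -v t=0.9 '{
  if ($2 >= t) { if ($1 == 1) tp++; else fp++ }
  else if ($1 == 1) fn++
} END { printf "precision=%.2f recall=%.2f\n", tp/(tp+fp), tp/(tp+fn) }' /tmp/heldout.txt
# At t=0.9 this prints: precision=0.50 recall=0.33
# Lowering t would raise recall at the expense of precision.
```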
The second classification step re-scores these sentence pairs and keeps only those scored above a second threshold (e.g., F2=0.5):
$ nohup etc/hadoop-cluster.sh ivory.lsh.bitext.FilterSentencePairs \
    -input=pwsim.results/bitext.F1=90 -output=pwsim.results/bitext.F1=90.F2=50 \
    -e_index=pwsim.enwiki.index -f_index=pwsim.dewiki.index \
    -data=wikidata -e_lang=en -f_lang=de \
    -threshold=0.5 -classifier_id=1 >& classify2.log &
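The two passes act as a cascade: a pair must clear the first threshold to be scored at all in the second step, and must then clear the second threshold to reach the final bitext. A conceptual sketch with made-up scores (column 2 standing in for the first-pass score, column 3 for the second-pass score):

```shell
# Hypothetical pair scores: "id first_pass_score second_pass_score".
printf 'pair1 0.95 0.70\npair2 0.91 0.30\npair3 0.50 0.90\n' > /tmp/pairs.txt

# Only pairs clearing both thresholds (0.9, then 0.5) survive the cascade.
awk '$2 >= 0.9 && $3 >= 0.5 { print $1 }' /tmp/pairs.txt
# Prints only: pair1
```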