A Hadoop toolkit for web-scale information retrieval research
Hooka is a MapReduce-based Expectation Maximization (EM) training framework for word alignment. It takes as input a sentence-aligned parallel corpus (each sentence appears in two languages) and aligns each word in the source-language sentence to a word in the target-language sentence. The parameters of the underlying statistical models (e.g., IBM Model 1, the HMM model) are learned with a parallelized implementation of EM. Hooka can also be used to manage vocabularies and word translation tables: specialized data structures allow efficient storage and conversion from popular word alignment tools such as GIZA++ and BerkeleyAligner.
Hooka assumes an XML format for input, as shown below:
<?xml version="1.0" encoding="UTF8"?>
<pdoc name="corpus.en">
  <pchunk name="europarl-v6.de-en_1">
    <s lang="en">this person refer to histori clear prove the good that the european union repres for all european .</s>
    <s lang="de">dies person hinweis auf die geschicht belegt ganz klar das gut , das die europa union fur all europa darstellt .</s>
  </pchunk>
  ...
</pdoc>
where a single pchunk block contains the same sentence written in both languages, English and German in this case. Parallel corpora are typically not distributed in this format, so we also provide a simple script that converts two text files (each containing one sentence per line in its respective language) into the format above:
perl $IVORYDIR/docs/content/plain2chunk.pl europarl-v6.de-en.de europarl-v6.de-en.en europarl-v6.de-en de en > europarl-v6.de-en.xml
Once the XML-formatted parallel corpus is on HDFS, we can run word alignment as follows:
$IVORYDIR/etc/hadoop-cluster.sh edu.umd.hooka.alignment.HadoopAlign \
  -input=$datadir/europarl-v6.de-en.xml -workdir=hooka.de-en -src_lang=de -trg_lang=en -model1=5 -hmm=5 -use_truncate
The first argument is the HDFS path to the XML-formatted input data described above. The second argument is the HDFS path of the working directory, to which output is written. The next two arguments give the language codes of the source and target languages. The fifth and sixth arguments specify the number of EM iterations to run with IBM Model 1 and with the HMM model, respectively. The last argument indicates whether truncation/stemming should be performed.
Info: If the input text is already tokenized and stemmed, you can skip this step by omitting the -use_truncate option.
The program starts with a Hadoop job that preprocesses the dataset and performs tokenization/truncation. It then runs the EM iterations; each iteration consists of an expectation step (E-step), which computes expected alignment counts, and a maximization step (M-step), which aggregates these counts and re-estimates the model parameters, with the two steps implemented as separate Hadoop jobs.
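Conceptually, the E-step computes expected alignment counts over the corpus and the M-step aggregates those counts and renormalizes them into new translation probabilities. The following single-machine sketch of one IBM Model 1 iteration is for illustration only: it is not Hooka's MapReduce code, it uses plain Java maps in place of Hooka's data structures, and it omits the NULL source word. Here t holds P(e|f), the probability of target word e given source word f, matching the direction of the output described below.

import java.util.*;

public class Model1Sketch {
  // One EM iteration of IBM Model 1 over an in-memory sentence-aligned corpus.
  // t.get(f).get(e) ~ P(e|f): probability of target word e given source word f.
  static Map<String, Map<String, Double>> emIteration(
      List<String[]> srcSentences, List<String[]> trgSentences,
      Map<String, Map<String, Double>> t) {

    // E-step: accumulate expected (source word, target word) co-occurrence counts.
    Map<String, Map<String, Double>> counts = new HashMap<>();
    for (int i = 0; i < srcSentences.size(); i++) {
      String[] src = srcSentences.get(i);
      String[] trg = trgSentences.get(i);
      for (String e : trg) {
        double z = 0.0;  // normalizer: total probability of generating e from any source word
        for (String f : src) {
          z += t.getOrDefault(f, Collections.emptyMap()).getOrDefault(e, 0.0);
        }
        if (z == 0.0) continue;
        for (String f : src) {
          double p = t.getOrDefault(f, Collections.emptyMap()).getOrDefault(e, 0.0);
          if (p == 0.0) continue;
          // posterior probability that target word e was generated by source word f
          counts.computeIfAbsent(f, k -> new HashMap<>()).merge(e, p / z, Double::sum);
        }
      }
    }

    // M-step: renormalize the expected counts into new conditional probabilities.
    Map<String, Map<String, Double>> updated = new HashMap<>();
    for (Map.Entry<String, Map<String, Double>> fEntry : counts.entrySet()) {
      double total = 0.0;
      for (double c : fEntry.getValue().values()) total += c;
      Map<String, Double> dist = new HashMap<>();
      for (Map.Entry<String, Double> eEntry : fEntry.getValue().entrySet()) {
        dist.put(eEntry.getKey(), eEntry.getValue() / total);
      }
      updated.put(fEntry.getKey(), dist);
    }
    return updated;
  }
}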
The output of the program consists of two vocabularies (source-side vocabulary vocab.F and target-side vocabulary vocab.E) and a lexical conditional probability table. Each vocabulary is represented by an instance of the Vocab class, which maps each term of the language to a unique integer identifier. The translation table is represented by an instance of the TTable_monolithic_IFAs class (which implements TTable) and contains, for each word in the source language, all of its possible target-language translations with their conditional probabilities (i.e., P(e|f) for all f in the source vocabulary). To generate conditional probabilities in the other direction (i.e., P(f|e)), run Hooka with the language arguments swapped:
$IVORYDIR/etc/hadoop-cluster.sh edu.umd.hooka.alignment.HadoopAlign \
  -input=$datadir/europarl-v6.de-en.xml -workdir=hooka.en-de -src_lang=en -trg_lang=de -model1=5 -hmm=5 -use_truncate
Once the vocabulary and translation table files are written to disk, they can be loaded into memory and queried directly. For instance, a Vocab object can be used to look up words and their integer identifiers as follows:
Vocab engVocab = HadoopAlign.loadVocab(new Path(vocabHDFSPath), hdfsConf);
int eId = engVocab.get("book");        // integer id of "book"
String eString = engVocab.get(eId);    // "book"
A TTable object can then be used to find conditional probabilities as follows:
TTable_monolithic_IFAs ttable_en2de =
    new TTable_monolithic_IFAs(FileSystem.get(hdfsConf), new Path(ttableHDFSPath), true);
float prob = ttable_en2de.get(eId, fId);   // conditional probability of German word fId given English word eId
// find all German translations of "book" above probability 0.1
int[] fIdArray = ttable_en2de.get(eId).getTranslations(0.1f);
We provide a convenient command-line option for querying the translation table:
$IVORYDIR/etc/hadoop-cluster.sh ivory.core.util.CLIRUtils \
  -f=buch -e=book -src_vocab=hooka.de-en/vocab.F -trg_vocab=hooka.de-en/vocab.E -ttable=hooka.de-en/tmp.ttable

$IVORYDIR/etc/hadoop-cluster.sh ivory.core.util.CLIRUtils \
  -f=buch -e=ALL -src_vocab=hooka.de-en/vocab.F -trg_vocab=hooka.de-en/vocab.E -ttable=hooka.de-en/tmp.ttable
The first command computes P(book|buch), whereas the second outputs every target-language translation of buch that has non-zero probability, i.e., {(e, P(e|buch)) : P(e|buch) > 0}.
Alignment tools use statistical smoothing techniques that spread the probability mass conservatively, which results in hundreds or even thousands of translations per vocabulary word. For many applications, however, only the few most probable translations are needed, and pruning the rest reduces the size of the TTable object and decreases noise in the word translation distributions. Our implementation supports two pruning heuristics, which can be combined: keep at most the k most probable translations of each source term, and stop even earlier once their cumulative probability (accumulated starting from the most probable translation) exceeds C. Different values of k and C can be tested by passing them as arguments:
$IVORYDIR/etc/hadoop-cluster.sh ivory.core.util.CLIRUtils \
  -hooka_src_vocab=hooka.de-en/vocab.F -hooka_trg_vocab=hooka.de-en/vocab.E -hooka_ttable=hooka.de-en/tmp.ttable \
  -src_vocab=hooka.de-en/vocab.de-en.de -trg_vocab=hooka.de-en/vocab.de-en.en -ttable=hooka.de-en/ttable.de-en -k=15 -C=0.95 -hdfs
Info: If -k and -C are omitted, all translations are kept.
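The pruning logic itself is straightforward. Below is a minimal, self-contained sketch of the combined heuristic described above (illustrative only, not Hooka's actual implementation; it assumes the translations of a single source word are given as word/probability pairs):

import java.util.*;

public class PruneSketch {
  // Keep at most the k most probable translations of one source word, stopping
  // earlier once the cumulative probability of the kept entries exceeds C.
  static LinkedHashMap<String, Float> prune(Map<String, Float> translations, int k, float C) {
    List<Map.Entry<String, Float>> sorted = new ArrayList<>(translations.entrySet());
    sorted.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));   // most probable first

    LinkedHashMap<String, Float> kept = new LinkedHashMap<>();
    float cumulative = 0f;
    for (Map.Entry<String, Float> entry : sorted) {
      if (kept.size() >= k || cumulative > C) break;
      kept.put(entry.getKey(), entry.getValue());
      cumulative += entry.getValue();
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Float> translations = new HashMap<>();
    translations.put("buch", 0.90f);
    translations.put("heft", 0.09f);
    translations.put("textbuch", 0.01f);
    System.out.println(prune(translations, 15, 0.95f));   // prints {buch=0.9, heft=0.09}
  }
}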
The output of the GIZA++ word alignment tool consists of two files: lex.f2e contains translation probabilities from the source language to the target language, and lex.e2f contains the opposite direction. Hooka requires that the GIZA++ output files be sorted by the second column, which is not the case for lex.e2f:
sort -k2 giza.de-en/lex.e2f -o giza.de-en/lex.e2f.sorted
Both files should now contain three items per line, [trg-word] [src-word] [Pr(trg-word|src-word)], with lines sorted by the source word.
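Sorting by the source word presumably lets the converter make a single streaming pass over the file, since each source word's translations then appear on consecutive lines. The following sketch of reading a file in this format is illustrative only; the actual conversion is handled by ivory.core.util.CLIRUtils, shown next:

import java.io.*;
import java.util.*;

public class LexFileSketch {
  // Reads a lexical file with lines "[trg-word] [src-word] [prob]", sorted by the
  // source word, and prints the grouped translations of each source word.
  public static void main(String[] args) throws IOException {
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String currentSrc = null;
      Map<String, Float> translations = new HashMap<>();
      String line;
      while ((line = in.readLine()) != null) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 3) continue;                  // skip malformed lines
        String trg = cols[0], src = cols[1];
        float prob = Float.parseFloat(cols[2]);
        if (currentSrc != null && !src.equals(currentSrc)) {
          System.out.println(currentSrc + " -> " + translations);   // finished one source word
          translations = new HashMap<>();
        }
        currentSrc = src;
        translations.put(trg, prob);
      }
      if (currentSrc != null) {
        System.out.println(currentSrc + " -> " + translations);
      }
    }
  }
}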
Now, Hooka can convert each of these two files into a TTable object and a pair of Vocab objects:
$IVORYDIR/etc/hadoop-cluster.sh ivory.core.util.CLIRUtils \
  -aligner_out=giza.de-en/lex.f2e -src_vocab=vocab.de -trg_vocab=vocab.en -ttable=ttable.de-en -type=giza -C=0.95 -k=15

$IVORYDIR/etc/hadoop-cluster.sh ivory.core.util.CLIRUtils \
  -aligner_out=giza.de-en/lex.e2f.sorted -src_vocab=vocab.en -trg_vocab=vocab.de -ttable=ttable.en-de -type=giza -C=0.95 -k=15
The output of BerkeleyAligner also consists of two files, named stage2.1.params.txt and stage2.2.params.txt, in the following format:
book   entropy 0.314   nTrans 20   sum 1.000000
   buch: 0.907941
   heft: 0.090876
   textbuch: 0.001183
   ...
Each source word is followed by all of its possible target-language translations. No preprocessing is required before converting to Hooka's format:
$IVORYDIR/etc/hadoop-cluster.sh ivory.core.util.CLIRUtils \
  -aligner_out=berkeley.de-en/stage2.2.params.txt -src_vocab=vocab.en -trg_vocab=vocab.de -ttable=ttable.en-de -type=berkeley -C=0.95 -k=15

$IVORYDIR/etc/hadoop-cluster.sh ivory.core.util.CLIRUtils \
  -aligner_out=berkeley.de-en/stage2.1.params.txt -src_vocab=vocab.de -trg_vocab=vocab.en -ttable=ttable.de-en -type=berkeley -C=0.95 -k=15
Info: If the GIZA++ or BerkeleyAligner files are on HDFS, include the -hdfs option when running the above commands.
We performed a preliminary evaluation of Hooka by comparing it to GIZA++ and BerkeleyAligner. We designed an intrinsic evaluation to test the quality of the translation probabilities output by each system. We experimented with the German and English portions of the Europarl corpus, which contains proceedings of the European Parliament. Documents were constructed artificially by concatenating every 10 consecutive sentences into a single document. In this manner, we sampled 505 document pairs that are mutual translations of each other (and therefore semantically similar by construction). This provides ground truth for evaluating the effectiveness of the three systems on the task of pairwise similarity.
We applied the standard CLIR approach to project document vectors from one language into the other, using the translation probabilities produced by Hooka, GIZA++, or BerkeleyAligner. For each tool, we aligned the corpus in both directions, German to English and English to German. Cosine similarities computed with this approach are compared against a translation-based baseline: an MT system (in this case, Google Translate) translates the German documents into English, which are then converted into vector form:
Aligner | Average cosine | Std. Dev. |
GIZA++ | 0.351 | 0.053 |
BerkeleyAligner | 0.397 | 0.072 |
Hooka | 0.302 | 0.055 |
Google Translate | 0.449 | 0.114 |
Using probability values from BerkeleyAligner yielded the highest similarity scores for parallel documents, whereas Hooka produced the lowest. We should point out, however, that this evaluation is one-sided: a complete comparison would also examine similarity scores for non-parallel document pairs, since the best approach is the one that distinguishes parallel from non-parallel documents most effectively.
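For reference, the projection step used in this evaluation amounts to a sparse matrix-vector product between a translation table and a document's term-weight vector. The sketch below is a simplified, in-memory version (plain Java maps rather than Ivory's document-vector and TTable classes; all names are illustrative):

import java.util.*;

public class ProjectionSketch {
  // Projects a source-language term-weight vector into the target language by
  // distributing each term's weight over its translations, weighted by P(e|f).
  static Map<String, Float> project(Map<String, Float> srcVector,
                                    Map<String, Map<String, Float>> ttable /* f -> (e -> P(e|f)) */) {
    Map<String, Float> trgVector = new HashMap<>();
    for (Map.Entry<String, Float> term : srcVector.entrySet()) {
      Map<String, Float> translations = ttable.get(term.getKey());
      if (translations == null) continue;                // untranslatable term: dropped
      for (Map.Entry<String, Float> tr : translations.entrySet()) {
        trgVector.merge(tr.getKey(), term.getValue() * tr.getValue(), Float::sum);
      }
    }
    return trgVector;
  }

  // Cosine similarity between two sparse term-weight vectors.
  static double cosine(Map<String, Float> a, Map<String, Float> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Float> e : a.entrySet()) {
      normA += e.getValue() * e.getValue();
      Float w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w;
    }
    for (float w : b.values()) normB += w * w;
    return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
}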
The greatest advantage of Hooka is its ability to parallelize computation across an arbitrary number of machines. The table below shows total running times for aligning a dataset of 1,079,017 sentence pairs: Hooka was run with 50-way parallelism, GIZA++ ran the two alignment directions simultaneously, and BerkeleyAligner ran entirely sequentially.
Aligner | Total running time (h:mm:ss) |
GIZA++ | 13:04:57 |
BerkeleyAligner | 44:41:03 |
Hooka | 0:58:00 |