A Hadoop toolkit for web-scale information retrieval research
As of 1/13/2013, Ivory supports tokenization in the following languages: English, German, Spanish, Chinese, French, Arabic, Czech, and Turkish. Tokenizer classes in Ivory are based on Lucene for Spanish, Arabic, Czech, and Turkish; OpenNLP for English, German, and French; and the Stanford Chinese Segmenter for Chinese.
Tokenizing text in Ivory is done as follows:
$IVORYDIR/etc/run.sh ivory.core.tokenize.Tokenizer \
  -input=text.$lang \
  -output=text.$lang.tok \
  -stem=false \
  -lang=$lang \
  -stopword=$IVORYDIR/data/tokenizer/$lang.stop \
  -stemmed_stopword=$IVORYDIR/data/tokenizer/$lang.stop.stemmed \
  -model=$IVORYDIR/data/tokenizer/$lang-token.bin
Info |
Stemming is on by default, so you should include -stem=false if it is not desired.
|
Info |
For stopword removal, include the additional options -stopword=$IVORYDIR/data/tokenizer/$lang.stop and
-stemmed_stopword=$IVORYDIR/data/tokenizer/$lang.stop.stemmed . Each of these is used under different
settings, so both files must be present.
|
Info |
A tokenizer model path is required for certain classes, such as OpenNLPTokenizer or StanfordChineseTokenizer .
The -model option can be omitted in other cases. If no model file is provided and the language is set to English,
a Galago-based tokenizer is used.
|
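The selection rules described in the notes above can be sketched as a small helper function. The class names follow those mentioned on this page, but the helper itself is a hypothetical illustration, not Ivory's actual API:

```java
// Hypothetical sketch of the tokenizer-selection rules described above.
// The returned class names (OpenNLPTokenizer, StanfordChineseTokenizer, etc.)
// follow this page; the selection method itself is illustrative only.
public class TokenizerSelection {
  static String selectTokenizer(String lang, String modelPath) {
    if (lang.equals("zh")) {
      return "StanfordChineseTokenizer";  // requires -model
    }
    if (lang.equals("en") && modelPath == null) {
      return "GalagoTokenizer";           // English fallback when no model file
    }
    if (lang.equals("en") || lang.equals("de") || lang.equals("fr")) {
      return "OpenNLPTokenizer";          // requires -model
    }
    // Spanish, Arabic, Czech, and Turkish use Lucene-based tokenizers
    return "LuceneTokenizer";
  }

  public static void main(String[] args) {
    System.out.println(selectTokenizer("en", null));
    System.out.println(selectTokenizer("zh", "zh-token.bin"));
  }
}
```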
In order to add support for a new language in Ivory, you can extend the ivory.core.tokenize.Tokenizer class.
A few other changes are required for full support:
- PreprocessHelper (make a guess-timate for now; it can be updated from data)
- acceptedLanguages in TokenizerFactory
- the getTokenizerClass method in TokenizerFactory
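The extension steps above can be illustrated with a simplified mock-up. The base class and factory below are hypothetical stand-ins for ivory.core.tokenize.Tokenizer and TokenizerFactory (the real classes have different signatures), and "xx" is a made-up language code:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for Ivory's Tokenizer / TokenizerFactory; the real
// classes have different signatures -- this only illustrates the three steps.
abstract class Tokenizer {
  abstract String[] processContent(String text);
}

// Step 1: extend the base tokenizer for the new language (whitespace
// splitting here is a placeholder for real language-specific logic).
class MyLangTokenizer extends Tokenizer {
  @Override
  String[] processContent(String text) {
    return text.toLowerCase().split("\\s+");
  }
}

class TokenizerFactory {
  // Step 2: add the new language code to the accepted list.
  static final List<String> acceptedLanguages =
      Arrays.asList("en", "de", "es", "zh", "fr", "ar", "cs", "tr", "xx");

  // Step 3: map the new code to its tokenizer class.
  static Tokenizer getTokenizerClass(String lang) {
    if (lang.equals("xx")) {
      return new MyLangTokenizer();
    }
    throw new IllegalArgumentException("Unsupported language: " + lang);
  }
}
```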
The following language-specific data files are assumed to be present under $IVORYDIR/data/tokenizer :
- $lang-token.bin
- $lang.stop
Also, we require the stemmed version of the stopwords file. Once the tokenizer is set up, it can be used to generate the stemmed stopwords file as follows:
$IVORYDIR/etc/run.sh ivory.core.tokenize.Tokenizer \
  -input=$IVORYDIR/data/tokenizer/$lang.stop \
  -output=$IVORYDIR/data/tokenizer/$lang.stop.stemmed \
  -stem=true \
  -lang=$lang \
  -model=$IVORYDIR/data/tokenizer/$lang-token.bin
Finally, if the language is going to be used in cross-lingual pairwise similarity, vocabulary and translation table files are required, in Hooka format, under the same folder (see the Hooka tutorial on how to create these files).