As of 1/13/2013, Ivory supports tokenization in the following languages: English, German, Spanish, Chinese, French, Arabic, Czech, and Turkish. The tokenizer classes are based on Lucene (Spanish, Arabic, Czech, Turkish), OpenNLP (English, German, French), and the Stanford Chinese Segmenter (Chinese).
Tokenizing text in Ivory is done as follows:
$IVORYDIR/etc/run.sh ivory.core.tokenize.Tokenizer \
  -input=text.$lang \
  -output=text.$lang.tok \
  -stem=false \
  -lang=$lang \
  -stopword=$IVORYDIR/data/tokenizer/$lang.stop \
  -stemmed_stopword=$IVORYDIR/data/tokenizer/$lang.stop.stemmed \
  -model=$IVORYDIR/data/tokenizer/$lang-token.bin
Stemming is on by default, so include -stem=false if you do not want tokens stemmed.
For stopword removal, include the additional options -stopword and -stemmed_stopword, each pointing to the corresponding stopword list.
A tokenizer model path (the -model option) is required for certain tokenizer classes, such as the OpenNLP-based ones, which load a binary model file.
In order to add support for a new language in Ivory, you can extend the ivory.core.tokenize.Tokenizer base class.
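The exact abstract methods depend on the Ivory version, so the sketch below is a hypothetical illustration rather than Ivory's actual API: a minimal language-specific tokenizer that lowercases, splits on whitespace, and optionally drops stopwords. The class name WhitespaceTokenizer, its constructor, and the processContent method are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: NOT Ivory's actual API. A new language-specific
// tokenizer would supply tokenization plus optional stopword removal.
public class WhitespaceTokenizer {
    private final Set<String> stopwords;
    private final boolean removeStopwords;

    public WhitespaceTokenizer(Set<String> stopwords) {
        this.stopwords = stopwords;
        this.removeStopwords = stopwords != null && !stopwords.isEmpty();
    }

    // Lowercase the text, split on whitespace, and drop stopwords.
    public String[] processContent(String text) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (removeStopwords && stopwords.contains(tok)) continue;
            tokens.add(tok);
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        WhitespaceTokenizer tok = new WhitespaceTokenizer(Set.of("the", "a", "of"));
        System.out.println(Arrays.toString(tok.processContent("The quick brown fox")));
        // prints [quick, brown, fox]
    }
}
```

A real subclass would replace the whitespace split with the language's segmenter or analyzer and add a stemming step, but the input/output shape stays the same.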
A few other changes are required for full support:
PreprocessHelper (make a guesstimate for now; it can be updated from data)
The following language-specific data are assumed to be present under $IVORYDIR/data/tokenizer/: the stopwords file ($lang.stop) and, for tokenizer classes that need one, the tokenizer model ($lang-token.bin).
Also, we require the stemmed version of the stopwords file. Once the tokenizer is set up, it can be used to generate the stemmed stopwords file as follows:
$IVORYDIR/etc/run.sh ivory.core.tokenize.Tokenizer \
  -input=$IVORYDIR/data/tokenizer/$lang.stop \
  -output=$IVORYDIR/data/tokenizer/$lang.stop.stemmed \
  -stem=true \
  -lang=$lang \
  -model=$IVORYDIR/data/tokenizer/$lang-token.bin
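Conceptually, the command above runs each stopword through the language's stemmer and writes one stemmed token per line. As a rough standalone illustration (not Ivory's actual stemmer, which comes from the language-specific analyzer behind -stem=true), here is a toy suffix-stripper applied to a small English stopword list; the class and method names are made up for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of stemming a stopword list: NOT Ivory's actual stemmer.
public class StemStopwords {
    // Crude English suffix stripping, for illustration only.
    static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
        if (word.endsWith("es") && word.length() > 4) return word.substring(0, word.length() - 2);
        if (word.endsWith("s") && word.length() > 3) return word.substring(0, word.length() - 1);
        return word;
    }

    // Stem every entry, preserving order (one stemmed token per input word).
    static List<String> stemAll(List<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (String w : stopwords) out.add(stem(w));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stemAll(List.of("during", "ourselves", "does")));
        // prints [dur, ourselv, doe]
    }
}
```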
Finally, if the language is going to be used in cross-lingual pairwise similarity, vocabulary and translation table files are required in Hooka format under the same folder (see Hooka tutorial on how to create these files).