A Hadoop toolkit for web-scale information retrieval research
As of 1/13/2013, Ivory supports tokenization in the following languages: English, German, Spanish, Chinese, French, Arabic, Czech, and Turkish. Tokenizer classes in Ivory are based on Lucene for Spanish, Arabic, Czech, and Turkish; OpenNLP for English, German, and French; and the Stanford Chinese Segmenter for Chinese.
Tokenizing text in Ivory is done as follows:
$IVORYDIR/etc/run.sh ivory.core.tokenize.Tokenizer \
  -input=text.$lang \
  -output=text.$lang.tok \
  -stem=false \
  -lang=$lang \
  -stopword=$IVORYDIR/data/tokenizer/$lang.stop \
  -stemmed_stopword=$IVORYDIR/data/tokenizer/$lang.stop.stemmed \
  -model=$IVORYDIR/data/tokenizer/$lang-token.bin
Info |
Stemming is on by default, so you should include -stem=false if it is not desired.
|
Info |
For stopword removal, include the additional options -stopword=$IVORYDIR/data/tokenizer/$lang.stop and
-stemmed_stopword=$IVORYDIR/data/tokenizer/$lang.stop.stemmed . Each of these is used under different
settings, so both files must be present.
|
Info |
A tokenizer model path is required for certain classes, such as OpenNLPTokenizer or StanfordChineseTokenizer .
The -model option can be omitted in other cases. If no model file is provided and the language is set to English,
a Galago-based tokenizer is used.
|
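The selection rules described in the notes above can be sketched as a small helper function. The class names follow those mentioned on this page, but the helper itself is a hypothetical illustration, not Ivory's actual API:

```java
// Hypothetical sketch of the tokenizer-selection rules described above.
// The returned class names (OpenNLPTokenizer, StanfordChineseTokenizer, etc.)
// follow this page; the selection method itself is illustrative only.
public class TokenizerSelection {
  static String selectTokenizer(String lang, String modelPath) {
    if (lang.equals("zh")) {
      return "StanfordChineseTokenizer";  // requires -model
    }
    if (lang.equals("en") && modelPath == null) {
      return "GalagoTokenizer";           // English fallback when no model file
    }
    if (lang.equals("en") || lang.equals("de") || lang.equals("fr")) {
      return "OpenNLPTokenizer";          // requires -model
    }
    // Spanish, Arabic, Czech, and Turkish use Lucene-based tokenizers
    return "LuceneTokenizer";
  }

  public static void main(String[] args) {
    System.out.println(selectTokenizer("en", null));
    System.out.println(selectTokenizer("zh", "zh-token.bin"));
  }
}
```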
In order to add support for a new language in Ivory, you can extend the ivory.core.tokenize.Tokenizer class.
A few other changes are required for full support:
- PreprocessHelper (make a guess-timate for now; it can be updated from data)
- acceptedLanguages in TokenizerFactory
- the getTokenizerClass method in TokenizerFactory
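The extension steps above can be illustrated with a simplified mock-up. The base class and factory below are hypothetical stand-ins for ivory.core.tokenize.Tokenizer and TokenizerFactory (the real classes have different signatures), and "xx" is a made-up language code:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins for Ivory's Tokenizer / TokenizerFactory; the real
// classes have different signatures -- this only illustrates the three steps.
abstract class Tokenizer {
  abstract String[] processContent(String text);
}

// Step 1: extend the base tokenizer for the new language (whitespace
// splitting here is a placeholder for real language-specific logic).
class MyLangTokenizer extends Tokenizer {
  @Override
  String[] processContent(String text) {
    return text.toLowerCase().split("\\s+");
  }
}

class TokenizerFactory {
  // Step 2: add the new language code to the accepted list.
  static final List<String> acceptedLanguages =
      Arrays.asList("en", "de", "es", "zh", "fr", "ar", "cs", "tr", "xx");

  // Step 3: map the new code to its tokenizer class.
  static Tokenizer getTokenizerClass(String lang) {
    if (lang.equals("xx")) {
      return new MyLangTokenizer();
    }
    throw new IllegalArgumentException("Unsupported language: " + lang);
  }
}
```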
The following language-specific data files are assumed to be present under $IVORYDIR/data/tokenizer :
- $lang-token.bin
- $lang.stop
Also, we require the stemmed version of the stopwords file. Once the tokenizer is set up, it can be used to generate the stemmed stopwords file as follows:
$IVORYDIR/etc/run.sh ivory.core.tokenize.Tokenizer \
  -input=$IVORYDIR/data/tokenizer/$lang.stop \
  -output=$IVORYDIR/data/tokenizer/$lang.stop.stemmed \
  -stem=true \
  -lang=$lang \
  -model=$IVORYDIR/data/tokenizer/$lang-token.bin
Finally, if the language is going to be used in cross-lingual pairwise similarity, vocabulary and translation table files are required, in Hooka format, under the same folder (see the Hooka tutorial on how to create these files).