Tokenizer

java.lang.Object
- ivory.core.tokenize.Tokenizer

Direct Known Subclasses:

BigramChineseTokenizer, GalagoTokenizer, LuceneAnalyzer, LuceneArabicAnalyzer, OpenNLPTokenizer, StanfordChineseTokenizer
```
public abstract class Tokenizer
extends Object
```

Constructor Summary

Constructors
Constructor and Description

`Tokenizer()`

Method Summary

Methods
Modifier and Type	Method and Description
`abstract void`	`configure(Configuration conf)`
`abstract void`	`configure(Configuration conf, FileSystem fs)`
`int`	`getNumberTokens(String text)` Method to return number of tokens in text.
`float`	`getOOVRate(String text, VocabularyWritable vocab)`
`Map<String,String>`	`getStem2NonStemMapping(String text)`
`String`	`getUTF8(String token)`
`VocabularyWritable`	`getVocab()`
`boolean`	`isDiscard(boolean isStemmed, String token)`
`boolean`	`isDiscard(String token)`
`boolean`	`isStemming()`
`boolean`	`isStopWord(boolean isStemmed, String token)` Overridden by applicable implementing classes.
`boolean`	`isStopWord(String token)` Overridden by applicable implementing classes.
`boolean`	`isStopwordRemoval()`
`static void`	`main(String[] args)`
`static String`	`normalizeFrench(String text)` Checks for an apostrophe-like character (one that looks like a reversed `) and normalizes it to the standard apostrophe.
`abstract String[]`	`processContent(String text)`
`String`	`removeBorderStopWords(String tokenizedText)` Deprecated.
`static String`	`removeNonUnicodeChars(String token)` Removes non-Unicode characters from a token to prevent errors in the preprocessing pipeline.
`void`	`setVocab(VocabularyWritable v)` Discard tokens not in the provided vocabulary.
`String`	`stem(String token)`

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - Tokenizer
```
public Tokenizer()
```
- Method Detail
  - configure
```
public abstract void configure(Configuration conf)
```
  - configure
```
public abstract void configure(Configuration conf,
             FileSystem fs)
```
  - getNumberTokens
```
public int getNumberTokens(String text)
```
    Method to return number of tokens in text. Subclasses may override for more efficient implementations.
    
    Parameters:
    text - text to be processed.
    
    Returns:
    number of tokens in text.
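
    As a rough illustration of what overriding for efficiency could look like, the sketch below uses a stand-in abstract class (`SketchTokenizer`, hypothetical, not the Ivory class). It assumes the default simply counts the array returned by `processContent`; whether Ivory's default works this way is not stated here. The subclass `WhitespaceSketchTokenizer` is likewise hypothetical.

```java
// Stand-in for illustration only; this is NOT ivory.core.tokenize.Tokenizer.
abstract class SketchTokenizer {
  abstract String[] processContent(String text);

  // Assumed default: tokenize fully, then count the resulting array.
  int getNumberTokens(String text) {
    return processContent(text).length;
  }
}

// Hypothetical whitespace tokenizer that counts tokens without allocating them.
class WhitespaceSketchTokenizer extends SketchTokenizer {
  @Override
  String[] processContent(String text) {
    String trimmed = text.trim();
    return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }

  @Override
  int getNumberTokens(String text) {
    int count = 0;
    boolean inToken = false;
    for (int i = 0; i < text.length(); i++) {
      boolean ws = Character.isWhitespace(text.charAt(i));
      if (!ws && !inToken) {
        count++; // a new token starts at the first non-whitespace character
      }
      inToken = !ws;
    }
    return count;
  }
}
```

    The override walks the string once and never builds the token array, which is the kind of saving a subclass might aim for when only the count is needed.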
  - getOOVRate
```
public float getOOVRate(String text,
               VocabularyWritable vocab)
```
  - getStem2NonStemMapping
```
public Map<String,String> getStem2NonStemMapping(String text)
```
  - getUTF8
```
public String getUTF8(String token)
```
  - getVocab
```
public VocabularyWritable getVocab()
```
  - isDiscard
```
public boolean isDiscard(boolean isStemmed,
                String token)
```
  - isDiscard
```
public boolean isDiscard(String token)
```
  - isStemming
```
public boolean isStemming()
```
  - isStopWord
```
public boolean isStopWord(boolean isStemmed,
                 String token)
```
    Overridden by applicable implementing classes.
    
    Parameters:
    isStemmed - true if token has been stemmed, false otherwise
    token - token to check
    
    Returns:
    true if token is a stopword, false otherwise
  - isStopWord
```
public boolean isStopWord(String token)
```
    Overridden by applicable implementing classes.
    
    Parameters:
    token - token to check
    
    Returns:
    true if parameter is a stopword, false otherwise
  - isStopwordRemoval
```
public boolean isStopwordRemoval()
```
  - main
```
public static void main(String[] args)
```
  - normalizeFrench
```
public static String normalizeFrench(String text)
```
    Checks for an apostrophe-like character (one that looks like a reversed `) and normalizes it to the standard apostrophe.
    
    Parameters:
    text - French text
    
    Returns:
    fixed version of the text
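
    The description does not name the exact code point, so the sketch below assumes it is U+2019 (RIGHT SINGLE QUOTATION MARK), the curly apostrophe common in French text. The class and method names are hypothetical and this is not the Ivory implementation.

```java
// Illustration only; not the Ivory implementation.
class FrenchApostropheSketch {
  // Assumption: the character being normalized is U+2019.
  static String normalizeApostrophe(String text) {
    return text.replace('\u2019', '\'');
  }

  public static void main(String[] args) {
    System.out.println(normalizeApostrophe("l\u2019homme et l\u2019enfant"));
  }
}
```

    Normalizing before tokenization means elided forms such as l' and d' are written with a single, consistent apostrophe character.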
  - processContent
```
public abstract String[] processContent(String text)
```
  - removeBorderStopWords
```
@Deprecated
public String removeBorderStopWords(String tokenizedText)
```
    Deprecated.
    
    Removes stop words from text that has already been tokenized. Useful when postprocessing the output of an MT system, which is tokenized but still contains stop words.
    
    Parameters:
    tokenizedText - input text, assumed to be tokenized.
    
    Returns:
    same text without the stop words.
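
    The method name suggests that only stop words at the start and end of the text are stripped. The sketch below takes that reading as an assumption; it is not the Ivory implementation, and the class name, method name, and explicit stop word set are hypothetical (the real method takes only the text).

```java
import java.util.Arrays;
import java.util.Set;

// Illustration only; assumes "border" means leading and trailing stop words.
class BorderStopWordSketch {
  static String trimBorderStopWords(String tokenizedText, Set<String> stopWords) {
    String[] tokens = tokenizedText.trim().split("\\s+");
    int start = 0;
    int end = tokens.length;
    // Skip stop words at the left border.
    while (start < end && stopWords.contains(tokens[start])) {
      start++;
    }
    // Skip stop words at the right border.
    while (end > start && stopWords.contains(tokens[end - 1])) {
      end--;
    }
    // Stop words in the interior are left untouched.
    return String.join(" ", Arrays.copyOfRange(tokens, start, end));
  }
}
```

    Under this reading, a stop word that sits between two content words is kept, so phrase-internal function words survive.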
  - removeNonUnicodeChars
```
public static String removeNonUnicodeChars(String token)
```
    Removes non-Unicode characters from a token to prevent errors in the preprocessing pipeline. Such characters occur in German Wikipedia.
    
    Parameters:
    token - token to check for non-Unicode characters
    
    Returns:
    token without the non-unicode characters
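
    What counts as a non-Unicode character is not defined here. As one plausible interpretation, the sketch below drops unpaired surrogates and unassigned code points; this is an assumption, not the Ivory implementation, and the class and method names are hypothetical.

```java
// Illustration only; the filtering rule is an assumption.
class UnicodeCleanupSketch {
  static String dropInvalidCodePoints(String token) {
    StringBuilder cleaned = new StringBuilder(token.length());
    for (int i = 0; i < token.length(); ) {
      // codePointAt returns the surrogate itself when it is unpaired.
      int cp = token.codePointAt(i);
      int type = Character.getType(cp);
      if (type != Character.SURROGATE && type != Character.UNASSIGNED) {
        cleaned.appendCodePoint(cp);
      }
      i += Character.charCount(cp);
    }
    return cleaned.toString();
  }
}
```

    Iterating by code point rather than by `char` keeps valid surrogate pairs (supplementary characters) intact while still catching lone surrogates.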
  - setVocab
```
public void setVocab(VocabularyWritable v)
```
    Discard tokens not in the provided vocabulary.
    
    Parameters:
    v - vocabulary for tokenizer
  - stem
```
public String stem(String token)
```
