public class OpenNLPTokenizer extends Tokenizer
| Constructor and Description |
|---|
| OpenNLPTokenizer() |
| Modifier and Type | Method and Description |
|---|---|
| void | configure(Configuration conf) |
| void | configure(Configuration conf, FileSystem fs) |
| int | getNumberTokens(String string): returns the number of tokens in the text. |
| float | getOOVRate(String text, VocabularyWritable vocab) |
| Map<String,String> | getStem2NonStemMapping(String text) |
| String[] | processContent(String text) |
| void | setLanguage(String l) |
| void | setLanguageAndStemmer(String l) |
| void | setTokenizer(FileSystem fs, Path p) |
| String | stem(String token) |
Methods inherited from class Tokenizer: getUTF8, getVocab, isDiscard, isDiscard, isStemming, isStopWord, isStopWord, isStopwordRemoval, main, normalizeFrench, removeBorderStopWords, removeNonUnicodeChars, setVocab

public void configure(Configuration conf)

public void configure(Configuration conf, FileSystem fs)

public int getNumberTokens(String string)
Returns the number of tokens in the text.
getNumberTokens in class Tokenizer
Parameters: string - text to be processed.

public float getOOVRate(String text, VocabularyWritable vocab)
getOOVRate in class Tokenizer

public Map<String,String> getStem2NonStemMapping(String text)
getStem2NonStemMapping in class Tokenizer

public String[] processContent(String text)
processContent in class Tokenizer

public void setLanguage(String l)

public void setLanguageAndStemmer(String l)

public void setTokenizer(FileSystem fs, Path p)
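
The listing above gives signatures only, so the following is a minimal, self-contained sketch of what a token count and an out-of-vocabulary (OOV) rate could look like. It does not use OpenNLPTokenizer or VocabularyWritable. The class OovRateSketch, its whitespace-based tokenize helper, and the use of a plain Set<String> as the vocabulary are all hypothetical stand-ins, and the definition of OOV rate as "tokens missing from the vocabulary divided by total tokens" is an assumption rather than documented behavior of getOOVRate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OovRateSketch {

    // Hypothetical stand-in for processContent(String):
    // lowercases the text and splits it on runs of whitespace.
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Analogue of getNumberTokens(String): the length of the token array.
    static int numberOfTokens(String text) {
        return tokenize(text).length;
    }

    // Assumed definition of OOV rate: the fraction of tokens that are
    // not present in the vocabulary. Empty input yields 0.
    static float oovRate(String text, Set<String> vocab) {
        String[] tokens = tokenize(text);
        if (tokens.length == 0) {
            return 0.0f;
        }
        int oov = 0;
        for (String token : tokens) {
            if (!vocab.contains(token)) {
                oov++;
            }
        }
        return (float) oov / tokens.length;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        String text = "The cat sat quietly";
        System.out.println(numberOfTokens(text)); // 4
        System.out.println(oovRate(text, vocab)); // 0.25
    }
}
```

With the real class, the equivalent calls would presumably go through a configured instance (for example after configure or setTokenizer has loaded a model), and the result may differ because of stemming and stopword handling.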