public abstract class Tokenizer extends Object
Constructor and Description |
---|
Tokenizer() |
Modifier and Type | Method and Description |
---|---|
abstract void | configure(Configuration conf) |
abstract void | configure(Configuration conf, FileSystem fs) |
int | getNumberTokens(String text)<br>Returns the number of tokens in text. |
float | getOOVRate(String text, VocabularyWritable vocab) |
Map<String,String> | getStem2NonStemMapping(String text) |
String | getUTF8(String token) |
VocabularyWritable | getVocab() |
boolean | isDiscard(boolean isStemmed, String token) |
boolean | isDiscard(String token) |
boolean | isStemming() |
boolean | isStopWord(boolean isStemmed, String token)<br>Overridden by applicable implementing classes. |
boolean | isStopWord(String token)<br>Overridden by applicable implementing classes. |
boolean | isStopwordRemoval() |
static void | main(String[] args) |
static String | normalizeFrench(String text)<br>Checks for the apostrophe-like character (it looks like a reversed `) and normalizes it to the standard apostrophe. |
abstract String[] | processContent(String text) |
String | removeBorderStopWords(String tokenizedText)<br>Deprecated. |
static String | removeNonUnicodeChars(String token)<br>Removes non-Unicode characters from token to prevent errors in the preprocessing pipeline. |
void | setVocab(VocabularyWritable v)<br>Discard tokens not in the provided vocabulary. |
String | stem(String token) |
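
The summary lists getNumberTokens(String) and getOOVRate(String, VocabularyWritable) without saying how the rate is computed. The sketch below assumes the usual meaning of an out-of-vocabulary rate: the fraction of tokens that are absent from the vocabulary. It is not the class's actual implementation. The class name OovRateSketch, the whitespace tokenize helper (a stand-in for processContent), and the use of a plain Set<String> in place of VocabularyWritable are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OovRateSketch {

    // Hypothetical stand-in for processContent: splits on runs of whitespace.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Analogue of getNumberTokens under the same whitespace assumption.
    static int numberTokens(String text) {
        return tokenize(text).length;
    }

    // Assumed definition: tokens missing from vocab divided by total tokens.
    // Returns 0 for empty input to avoid dividing by zero.
    static float oovRate(String text, Set<String> vocab) {
        String[] tokens = tokenize(text);
        if (tokens.length == 0) {
            return 0.0f;
        }
        int missing = 0;
        for (String token : tokens) {
            if (!vocab.contains(token)) {
                missing++;
            }
        }
        return (float) missing / tokens.length;
    }

    public static void main(String[] args) {
        // Plain Set used in place of VocabularyWritable (an assumption).
        Set<String> vocab = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        String text = "the cat sat on the mat";
        System.out.println("tokens: " + numberTokens(text));
        System.out.println("OOV rate: " + oovRate(text, vocab));
    }
}
```

In this example "on" and "mat" are the only tokens outside the vocabulary, so two of the six tokens count as out-of-vocabulary. A real subclass would tokenize through processContent, so its counts depend on that tokenizer's stemming and stop-word settings.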
public abstract void configure(Configuration conf)
public abstract void configure(Configuration conf, FileSystem fs)
public int getNumberTokens(String text)
text
- text to be processed.
public float getOOVRate(String text, VocabularyWritable vocab)
public VocabularyWritable getVocab()
public boolean isDiscard(boolean isStemmed, String token)
public boolean isDiscard(String token)
public boolean isStemming()
public boolean isStopWord(boolean isStemmed, String token)
isStemmed
- true if token has been stemmed, false otherwise
token
public boolean isStopWord(String token)
token
public boolean isStopwordRemoval()
public static void main(String[] args)
public static String normalizeFrench(String text)
text
- French text
@Deprecated public String removeBorderStopWords(String tokenizedText)
tokenizedText
- input text, assumed to be tokenized.
public static String removeNonUnicodeChars(String token)
token
- token to check for non-Unicode characters
public void setVocab(VocabularyWritable v)
v
- vocabulary for tokenizer
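
The two static helpers normalizeFrench and removeNonUnicodeChars are described only briefly, so the following is a minimal sketch of what such cleanup could look like, not the class's actual code. It assumes that the apostrophe look-alikes are U+2019 (right single quotation mark) and U+00B4 (acute accent), and that "non-Unicode characters" means unpaired surrogates and the U+FFFD replacement character. The source confirms neither assumption, and the class and method names are hypothetical.

```java
public class NormalizeSketch {

    // Assumed look-alikes: U+2019 and U+00B4 are mapped to the ASCII apostrophe.
    // The source does not name the exact character it checks for.
    static String normalizeApostrophes(String text) {
        return text.replace('\u2019', '\'').replace('\u00B4', '\'');
    }

    // Assumed reading of "non-Unicode": drop unpaired surrogates and the
    // U+FFFD replacement character, keep every other code point.
    static String dropInvalidChars(String token) {
        StringBuilder cleaned = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            // codePointAt returns the bare surrogate value only when it is unpaired.
            boolean loneSurrogate =
                    cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE;
            if (!loneSurrogate && cp != 0xFFFD) {
                cleaned.appendCodePoint(cp);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeApostrophes("l\u2019homme"));
        System.out.println(dropInvalidChars("ab\uD800c"));
    }
}
```

Iterating by code point rather than by char keeps valid surrogate pairs (for example emoji) intact while still removing halves that appear on their own.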