public class CLIRUtils extends Configured

F is the "foreign" language, the language in which non-translated documents are written. E is the "non-foreign" language, the language into which documents are translated.

Required files:
- ttable E-->F (i.e., Pr(f|e))
- ttable F-->E (i.e., Pr(e|f))
- a pair of vocabulary files for each ttable:
  - V_E and V_F for E-->F
  - V_E and V_F for F-->E
| Modifier and Type | Field and Description |
|---|---|
| `static String` | `BitextSeparator` |
| `static int` | `E` |
| `static int` | `F` |
| `static Pattern` | `isNumber` |
| `static int` | `MinSentenceLength` |
| `static int` | `MinVectorTerms` |

| Constructor and Description |
|---|
| `CLIRUtils()` |
| Modifier and Type | Method and Description |
|---|---|
| `static void` | `addToTable(int curIndex, TreeSet<PairOfFloatString> topTrans, float cumProb, TTable_monolithic_IFAs table, Vocab trgVocab, float cumProbThreshold, HookaStats stats)` |
| `static String[]` | `computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)` |
| `static String[]` | `computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob, org.apache.log4j.Logger sLogger)` |
| `static String[]` | `computeFeaturesF1(HMapSFW eVector, HMapSFW translatedFVector, float eSentLength, float fSentLength)` Bitext extraction helper functions. |
| `static String[]` | `computeFeaturesF2(HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)` |
| `static String[]` | `computeFeaturesF3(String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)` |
| `static float` | `cosine(HMapIFW vectorA, HMapIFW vectorB)` |
| `static float` | `cosine(HMapSFW vectorA, HMapSFW vectorB)` |
| `static float` | `cosineNormalized(HMapSFW vectorA, HMapSFW vectorB)` |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Given the TF and DF values, the doc length, and a scoring model, creates the term doc vector for a document. |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapIFW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Given the TF and DF values, the doc length, and a scoring model, creates the term doc vector for a document. |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapSIW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Called by BitextClassifierUtils. |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapSIW tfTable, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Given the TF and DF values, the doc length, and a scoring model, creates the term doc vector for a document. |
| `static void` | `createTTableFromBerkeleyAligner(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs)` Converts the output of BerkeleyAligner into a TTable_monolithic_IFAs object. |
| `static void` | `createTTableFromGIZA(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs)` Converts the output of GIZA into a TTable_monolithic_IFAs object. |
| `static void` | `createTTableFromHooka(String srcVocabFile, String trgVocabFile, String tableFile, String finalSrcVocabFile, String finalTrgVocabFile, String finalTableFile, float probThreshold, int numTrans, FileSystem fs)` Modifies the TTable_monolithic_IFAs object output by Hooka so that, for each source-language term, only the numTrans entries with the highest translation probability are kept, unless the top K < numTrans entries already have a cumulative probability above probThreshold. |
| `static void` | `main(String[] args)` |
| `static HMapIFW` | `readTransDfTable(Path path, FileSystem fs)` Reads a df mapping from file. |
| `static HMapIFW` | `translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, FrequencySortedDictionary dict, DfTableArray dfTable)` Given a mapping from F-terms to their df values, computes a df value for each E-term using the CLIR algorithm: df(e) = sum_f{df(f)*prob(f|e)}. |
| `static HMapIFW` | `translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, HMapSIW dfs)` Given a mapping from F-terms to their df values, computes a df value for each E-term using the CLIR algorithm: df(e) = sum_f{df(f)*prob(f|e)}. |
| `static int` | `translateTFs(HMapSIW doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)` Given a document in F and its tf mapping, computes a tf value for each term in E using the CLIR algorithm: tf(e) = sum_f{tf(f)*prob(e|f)}. |
| `static int` | `translateTFs(TermDocVector doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)` Given a document in F and its tf mapping, computes a tf value for each term in E using the CLIR algorithm: tf(e) = sum_f{tf(f)*prob(e|f)}. |
| `static HMapIFW` | `updateTFsByTerm(String fTerm, int tf, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)` Given a term in a document in F and its tf value, updates the computed tf value for each term in E using the CLIR algorithm: tf(e) = sum_f{tf(f)*prob(e|f)}. |
Methods inherited from class org.apache.hadoop.conf.Configured: getConf, setConf
public static final String BitextSeparator
public static final int E
public static final int F
public static Pattern isNumber
public static final int MinSentenceLength
public static final int MinVectorTerms
public static void addToTable(int curIndex, TreeSet<PairOfFloatString> topTrans, float cumProb, TTable_monolithic_IFAs table, Vocab trgVocab, float cumProbThreshold, HookaStats stats)
public static String[] computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)
public static String[] computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob, org.apache.log4j.Logger sLogger)
public static String[] computeFeaturesF1(HMapSFW eVector, HMapSFW translatedFVector, float eSentLength, float fSentLength)
public static String[] computeFeaturesF2(HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)
public static String[] computeFeaturesF3(String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)
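As a point of reference for the `cosine*` methods below, here is a minimal sketch of cosine similarity over sparse term-weight vectors. Plain `HashMap<String, Float>` stands in for the project's `HMapSFW` type, an assumption made purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of cosine similarity between two sparse term-weight vectors.
// HashMap<String, Float> is a stand-in for the project's HMapSFW type.
class CosineSketch {
  static float cosine(Map<String, Float> a, Map<String, Float> b) {
    float dot = 0f, normA = 0f, normB = 0f;
    for (Map.Entry<String, Float> e : a.entrySet()) {
      Float w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w;   // only shared terms contribute
      normA += e.getValue() * e.getValue();
    }
    for (float w : b.values()) normB += w * w;
    if (normA == 0f || normB == 0f) return 0f;  // guard empty/zero vectors
    return dot / (float) (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    Map<String, Float> v1 = new HashMap<>();
    v1.put("house", 1f);
    Map<String, Float> v2 = new HashMap<>(v1);
    System.out.println(cosine(v1, v2));  // identical vectors -> 1.0
  }
}
```

`cosineNormalized` differs only in assuming its inputs are already unit-length, so the division by the norms can be skipped.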
public static float cosine(HMapIFW vectorA, HMapIFW vectorB)

Parameters:
- vectorA - a term document vector
- vectorB - another term document vector

public static float cosine(HMapSFW vectorA, HMapSFW vectorB)

Parameters:
- vectorA - a term document vector
- vectorB - another term document vector

public static float cosineNormalized(HMapSFW vectorA, HMapSFW vectorB)

Parameters:
- vectorA - a normalized term document vector
- vectorB - another normalized term document vector

public static HMapSFW createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

Parameters:
- docLen - doc length
- tfTable - mapping from term id to tf values
- eVocab - vocabulary object for the final doc vector language
- scoringModel - scoring model
- dfTable - mapping from term id to df values
- isNormalize - whether to normalize the doc vector weights
- sLogger - Logger object for log output

public static HMapSFW createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapIFW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

Parameters:
- docLen - doc length
- tfTable - mapping from term id to tf values
- eVocab - vocabulary object for the final doc vector language
- scoringModel - scoring model
- dfTable - mapping from term id to df values
- isNormalize - whether to normalize the doc vector weights
- sLogger - Logger object for log output

public static HMapSFW createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapSIW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

public static HMapSFW createTermDocVector(int docLen, HMapSIW tfTable, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

Parameters:
- docLen - doc length
- tfTable - mapping from term string to tf values
- scoringModel - scoring model
- dfTable - mapping from term id to df values
- isNormalize - whether to normalize the doc vector weights
- sLogger - Logger object for log output

public static void createTTableFromBerkeleyAligner(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs) throws IOException
Parameters:
- inputFile - output of BerkeleyAligner (probability values from source language to target language). Format should be:
  [source-word] entropy ... nTrans ... sum 1.000000
  [target-word1]: [prob1]
  [target-word2]: [prob2]
  ...
- srcVocabFile - path where the created source vocabulary (VocabularyWritable) will be written
- trgVocabFile - path where the created target vocabulary (VocabularyWritable) will be written
- probsFile - path where the created probability table (TTable_monolithic_IFAs) will be written
- fs - FileSystem object

Throws: IOException
public static void createTTableFromGIZA(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs) throws IOException
Parameters:
- inputFile - output of GIZA (probability values from source language to target language). In GIZA, the format of each line should be:
  [target-word1] [source-word] [prob1]
  [target-word2] [source-word] [prob2]
  ...
- srcVocabFile - path where the created source vocabulary (VocabularyWritable) will be written
- trgVocabFile - path where the created target vocabulary (VocabularyWritable) will be written
- probsFile - path where the created probability table (TTable_monolithic_IFAs) will be written
- fs - FileSystem object

Throws: IOException
public static void createTTableFromHooka(String srcVocabFile, String trgVocabFile, String tableFile, String finalSrcVocabFile, String finalTrgVocabFile, String finalTableFile, float probThreshold, int numTrans, FileSystem fs) throws IOException
Parameters:
- srcVocabFile - path to the source vocabulary file output by Hooka
- trgVocabFile - path to the target vocabulary file output by Hooka
- tableFile - path to the ttable file output by Hooka
- finalSrcVocabFile - path where the created source vocabulary (VocabularyWritable) will be written
- finalTrgVocabFile - path where the created target vocabulary (VocabularyWritable) will be written
- finalTableFile - path where the created probability table (TTable_monolithic_IFAs) will be written
- fs - FileSystem object

Throws: IOException
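The pruning rule shared by the `createTTableFrom*` methods (keep a source term's top `numTrans` translations, stopping early once their cumulative probability reaches `probThreshold`) can be sketched as follows. This is a stand-alone sketch using plain lists sorted by descending probability; the real code walks a `TreeSet<PairOfFloatString>` and writes into a `TTable_monolithic_IFAs`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the translation-table pruning criterion: keep at most
// numTrans translations per source term, stopping early once the
// kept entries' cumulative probability reaches probThreshold.
// words/probs are assumed pre-sorted by descending probability.
class PruneSketch {
  static List<String> prune(List<String> words, List<Float> probs,
                            int numTrans, float probThreshold) {
    List<String> kept = new ArrayList<>();
    float cum = 0f;
    for (int i = 0; i < words.size() && kept.size() < numTrans; i++) {
      kept.add(words.get(i));
      cum += probs.get(i);
      if (cum >= probThreshold) break;  // enough probability mass retained
    }
    return kept;
  }
}
```

Bounding each term's translation list this way keeps the ttable small while retaining most of its probability mass, which is what makes the per-document translation methods below tractable.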
public static HMapIFW readTransDfTable(Path path, FileSystem fs)
Parameters:
- path - path to the df table
- fs - FileSystem object

public static HMapIFW translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, FrequencySortedDictionary dict, DfTableArray dfTable)

Parameters:
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2f_probs - ttable E-->F (i.e., Pr(f|e))

public static HMapIFW translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, HMapSIW dfs)

Parameters:
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2f_probs - ttable E-->F (i.e., Pr(f|e))
- dfs - mapping from F-terms to their df values

public static int translateTFs(HMapSIW doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger) throws IOException
Parameters:
- doc - mapping from F-term strings to tf values
- tfTable - to be returned, a mapping from E-term ids to tf values
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- eVocabTrg - target-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabSrc - source-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2fProbs - ttable E-->F (i.e., Pr(f|e))
- f2eProbs - ttable F-->E (i.e., Pr(e|f))
- sLogger - Logger object for log output

Throws: IOException
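The tf-translation formula, tf(e) = sum_f{tf(f)*prob(e|f)}, can be sketched with plain maps. The real methods key their output by E-term ids obtained from the vocabularies and look probabilities up in a `TTable_monolithic_IFAs`; this sketch substitutes `String` keys and a nested map for readability:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of CLIR tf translation: tf(e) = sum_f { tf(f) * Pr(e|f) }.
// fTfs maps F-terms to term frequencies; f2eProbs maps each F-term to
// its translation probabilities Pr(e|f). String keys and a nested map
// stand in for the project's vocabulary ids and ttable types.
class TfTranslateSketch {
  static Map<String, Float> translateTfs(Map<String, Integer> fTfs,
      Map<String, Map<String, Float>> f2eProbs) {
    Map<String, Float> eTfs = new HashMap<>();
    for (Map.Entry<String, Integer> f : fTfs.entrySet()) {
      Map<String, Float> trans = f2eProbs.get(f.getKey());
      if (trans == null) continue;  // OOV term: no translations known
      for (Map.Entry<String, Float> e : trans.entrySet()) {
        // accumulate this F-term's contribution to each E-term's tf
        eTfs.merge(e.getKey(), f.getValue() * e.getValue(), Float::sum);
      }
    }
    return eTfs;
  }
}
```

`updateTFsByTerm` below performs one iteration of the outer loop (a single F-term's contribution), and `translateDFTable` applies the same accumulation pattern to df values using Pr(f|e) in place of Pr(e|f).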
public static int translateTFs(TermDocVector doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger) throws IOException
Parameters:
- doc - mapping from F-term strings to tf values
- tfTable - to be returned, a mapping from E-term ids to tf values
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- eVocabTrg - target-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabSrc - source-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2fProbs - ttable E-->F (i.e., Pr(f|e))
- f2eProbs - ttable F-->E (i.e., Pr(e|f))
- sLogger - Logger object for log output

Throws: IOException
public static HMapIFW updateTFsByTerm(String fTerm, int tf, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)
Calling this method computes a single summand of the above equation.
Parameters:
- fTerm - term in a document in F
- tf - term frequency of fTerm
- tfTable - to be updated, a mapping from E-term ids to tf values
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- eVocabTrg - target-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabSrc - source-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2fProbs - ttable E-->F (i.e., Pr(f|e))
- f2eProbs - ttable F-->E (i.e., Pr(e|f))
- sLogger - Logger object for log output

Throws: IOException