public class CLIRUtils extends Configured

F is the "foreign" language, the language in which non-translated documents are written. E is the "non-foreign" language, the language into which documents are translated.

Required files:
- ttable E-->F (i.e., Pr(f|e))
- ttable F-->E (i.e., Pr(e|f))
- a pair of vocabulary files for each ttable:
  - V_E and V_F for E-->F
  - V_E and V_F for F-->E
| Modifier and Type | Field and Description |
|---|---|
| `static String` | `BitextSeparator` |
| `static int` | `E` |
| `static int` | `F` |
| `static Pattern` | `isNumber` |
| `static int` | `MinSentenceLength` |
| `static int` | `MinVectorTerms` |

| Constructor and Description |
|---|
| `CLIRUtils()` |
| Modifier and Type | Method and Description |
|---|---|
| `static void` | `addToTable(int curIndex, TreeSet<PairOfFloatString> topTrans, float cumProb, TTable_monolithic_IFAs table, Vocab trgVocab, float cumProbThreshold, HookaStats stats)` |
| `static String[]` | `computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)` |
| `static String[]` | `computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob, org.apache.log4j.Logger sLogger)` |
| `static String[]` | `computeFeaturesF1(HMapSFW eVector, HMapSFW translatedFVector, float eSentLength, float fSentLength)` Bitext extraction helper functions. |
| `static String[]` | `computeFeaturesF2(HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)` |
| `static String[]` | `computeFeaturesF3(String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)` |
| `static float` | `cosine(HMapIFW vectorA, HMapIFW vectorB)` |
| `static float` | `cosine(HMapSFW vectorA, HMapSFW vectorB)` |
| `static float` | `cosineNormalized(HMapSFW vectorA, HMapSFW vectorB)` |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Given the TF and DF values, the doc length, and a scoring model, creates the term doc vector for a document. |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapIFW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Given the TF and DF values, the doc length, and a scoring model, creates the term doc vector for a document. |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapSIW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Called by BitextClassifierUtils. |
| `static HMapSFW` | `createTermDocVector(int docLen, HMapSIW tfTable, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)` Given the TF and DF values, the doc length, and a scoring model, creates the term doc vector for a document. |
| `static void` | `createTTableFromBerkeleyAligner(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs)` Converts the output of BerkeleyAligner into a TTable_monolithic_IFAs object. |
| `static void` | `createTTableFromGIZA(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs)` Converts the output of GIZA into a TTable_monolithic_IFAs object. |
| `static void` | `createTTableFromHooka(String srcVocabFile, String trgVocabFile, String tableFile, String finalSrcVocabFile, String finalTrgVocabFile, String finalTableFile, float probThreshold, int numTrans, FileSystem fs)` Modifies the TTable_monolithic_IFAs object output by Hooka so that, for each source-language term, only the numTrans entries with the highest translation probability are kept, unless the top K < numTrans entries already have a cumulative probability above probThreshold. |
| `static void` | `main(String[] args)` |
| `static HMapIFW` | `readTransDfTable(Path path, FileSystem fs)` Reads a df mapping from file. |
| `static HMapIFW` | `translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, FrequencySortedDictionary dict, DfTableArray dfTable)` Given a mapping from F-terms to their df values, computes a df value for each E-term using the CLIR algorithm: df(e) = sum_f{df(f)*prob(f|e)}. |
| `static HMapIFW` | `translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, HMapSIW dfs)` Given a mapping from F-terms to their df values, computes a df value for each E-term using the CLIR algorithm: df(e) = sum_f{df(f)*prob(f|e)}. |
| `static int` | `translateTFs(HMapSIW doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)` Given a document in F and its tf mapping, computes a tf value for each term in E using the CLIR algorithm: tf(e) = sum_f{tf(f)*prob(e|f)}. |
| `static int` | `translateTFs(TermDocVector doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)` Given a document in F and its tf mapping, computes a tf value for each term in E using the CLIR algorithm: tf(e) = sum_f{tf(f)*prob(e|f)}. |
| `static HMapIFW` | `updateTFsByTerm(String fTerm, int tf, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)` Given a term in a document in F and its tf value, updates the computed tf value for each term in E using the CLIR algorithm: tf(e) = sum_f{tf(f)*prob(e|f)}. |
Methods inherited from class org.apache.hadoop.conf.Configured: getConf, setConf
public static final String BitextSeparator
public static final int E
public static final int F
public static Pattern isNumber
public static final int MinSentenceLength
public static final int MinVectorTerms
public static void addToTable(int curIndex, TreeSet<PairOfFloatString> topTrans, float cumProb, TTable_monolithic_IFAs table, Vocab trgVocab, float cumProbThreshold, HookaStats stats)
public static String[] computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)
public static String[] computeFeatures(int featSet, String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob, org.apache.log4j.Logger sLogger)
public static String[] computeFeaturesF1(HMapSFW eVector, HMapSFW translatedFVector, float eSentLength, float fSentLength)
public static String[] computeFeaturesF2(HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)
public static String[] computeFeaturesF3(String fSentence, String eSentence, Tokenizer fTokenizer, Tokenizer eTokenizer, HMapSIW eSrcTfs, HMapSFW eVector, HMapSIW fSrcTfs, HMapSFW translatedFVector, float eSentLength, float fSentLength, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_Probs, TTable_monolithic_IFAs f2e_Probs, float prob)
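As a point of reference for the `cosine*` methods below, here is a minimal sketch of cosine similarity over sparse term-weight vectors. Plain `HashMap<String, Float>` stands in for the project's `HMapSFW` type, an assumption made purely for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of cosine similarity between two sparse term-weight vectors.
// HashMap<String, Float> is a stand-in for the project's HMapSFW type.
class CosineSketch {
  static float cosine(Map<String, Float> a, Map<String, Float> b) {
    float dot = 0f, normA = 0f, normB = 0f;
    for (Map.Entry<String, Float> e : a.entrySet()) {
      Float w = b.get(e.getKey());
      if (w != null) dot += e.getValue() * w;   // only shared terms contribute
      normA += e.getValue() * e.getValue();
    }
    for (float w : b.values()) normB += w * w;
    if (normA == 0f || normB == 0f) return 0f;  // guard empty/zero vectors
    return dot / (float) (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    Map<String, Float> v1 = new HashMap<>();
    v1.put("house", 1f);
    Map<String, Float> v2 = new HashMap<>(v1);
    System.out.println(cosine(v1, v2));  // identical vectors -> 1.0
  }
}
```

`cosineNormalized` differs only in assuming its inputs are already unit-length, so the division by the norms can be skipped.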
public static float cosine(HMapIFW vectorA, HMapIFW vectorB)

Parameters:
- vectorA - a term document vector
- vectorB - another term document vector

public static float cosine(HMapSFW vectorA, HMapSFW vectorB)

Parameters:
- vectorA - a term document vector
- vectorB - another term document vector

public static float cosineNormalized(HMapSFW vectorA, HMapSFW vectorB)

Parameters:
- vectorA - a normalized term document vector
- vectorB - another normalized term document vector

public static HMapSFW createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

Parameters:
- docLen - doc length
- tfTable - mapping from term id to tf values
- eVocab - vocabulary object for the final doc vector language
- scoringModel - scoring model
- dfTable - mapping from term id to df values
- isNormalize - whether to normalize the doc vector weights
- sLogger - Logger object for log output

public static HMapSFW createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapIFW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

Parameters:
- docLen - doc length
- tfTable - mapping from term id to tf values
- eVocab - vocabulary object for the final doc vector language
- scoringModel - scoring model
- dfTable - mapping from term id to df values
- isNormalize - whether to normalize the doc vector weights
- sLogger - Logger object for log output

public static HMapSFW createTermDocVector(int docLen, HMapIFW tfTable, Vocab eVocab, ScoringModel scoringModel, HMapSIW dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

public static HMapSFW createTermDocVector(int docLen, HMapSIW tfTable, ScoringModel scoringModel, FrequencySortedDictionary dict, DfTableArray dfTable, boolean isNormalize, org.apache.log4j.Logger sLogger)

Parameters:
- docLen - doc length
- tfTable - mapping from term string to tf values
- scoringModel - scoring model
- dfTable - mapping from term id to df values
- isNormalize - whether to normalize the doc vector weights
- sLogger - Logger object for log output

public static void createTTableFromBerkeleyAligner(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs) throws IOException
Parameters:
- inputFile - output of BerkeleyAligner (probability values from source language to target language). Format should be:
  [source-word] entropy ... nTrans ... sum 1.000000
  [target-word1]: [prob1]
  [target-word2]: [prob2]
  ...
- srcVocabFile - path where the created source vocabulary (VocabularyWritable) will be written
- trgVocabFile - path where the created target vocabulary (VocabularyWritable) will be written
- probsFile - path where the created probability table (TTable_monolithic_IFAs) will be written
- fs - FileSystem object

Throws: IOException
public static void createTTableFromGIZA(String inputFile, String srcVocabFile, String trgVocabFile, String probsFile, float probThreshold, int numTrans, FileSystem fs) throws IOException
Parameters:
- inputFile - output of GIZA (probability values from source language to target language). In GIZA, the format of each line should be:
  [target-word1] [source-word] [prob1]
  [target-word2] [source-word] [prob2]
  ...
- srcVocabFile - path where the created source vocabulary (VocabularyWritable) will be written
- trgVocabFile - path where the created target vocabulary (VocabularyWritable) will be written
- probsFile - path where the created probability table (TTable_monolithic_IFAs) will be written
- fs - FileSystem object

Throws: IOException
public static void createTTableFromHooka(String srcVocabFile, String trgVocabFile, String tableFile, String finalSrcVocabFile, String finalTrgVocabFile, String finalTableFile, float probThreshold, int numTrans, FileSystem fs) throws IOException
Parameters:
- srcVocabFile - path to the source vocabulary file output by Hooka
- trgVocabFile - path to the target vocabulary file output by Hooka
- tableFile - path to the ttable file output by Hooka
- finalSrcVocabFile - path where the created source vocabulary (VocabularyWritable) will be written
- finalTrgVocabFile - path where the created target vocabulary (VocabularyWritable) will be written
- finalTableFile - path where the created probability table (TTable_monolithic_IFAs) will be written
- fs - FileSystem object

Throws: IOException
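The pruning rule shared by the `createTTableFrom*` methods (keep a source term's top `numTrans` translations, stopping early once their cumulative probability reaches `probThreshold`) can be sketched as follows. This is a stand-alone sketch using plain lists sorted by descending probability; the real code walks a `TreeSet<PairOfFloatString>` and writes into a `TTable_monolithic_IFAs`:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the translation-table pruning criterion: keep at most
// numTrans translations per source term, stopping early once the
// kept entries' cumulative probability reaches probThreshold.
// words/probs are assumed pre-sorted by descending probability.
class PruneSketch {
  static List<String> prune(List<String> words, List<Float> probs,
                            int numTrans, float probThreshold) {
    List<String> kept = new ArrayList<>();
    float cum = 0f;
    for (int i = 0; i < words.size() && kept.size() < numTrans; i++) {
      kept.add(words.get(i));
      cum += probs.get(i);
      if (cum >= probThreshold) break;  // enough probability mass retained
    }
    return kept;
  }
}
```

Bounding each term's translation list this way keeps the ttable small while retaining most of its probability mass, which is what makes the per-document translation methods below tractable.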
public static HMapIFW readTransDfTable(Path path, FileSystem fs)
Parameters:
- path - path to the df table
- fs - FileSystem object

public static HMapIFW translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, FrequencySortedDictionary dict, DfTableArray dfTable)

Parameters:
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2f_probs - ttable E-->F (i.e., Pr(f|e))

public static HMapIFW translateDFTable(Vocab eVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2f_probs, HMapSIW dfs)

Parameters:
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2f_probs - ttable E-->F (i.e., Pr(f|e))
- dfs - mapping from F-terms to their df values

public static int translateTFs(HMapSIW doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger) throws IOException
Parameters:
- doc - mapping from F-term strings to tf values
- tfTable - to be returned, a mapping from E-term ids to tf values
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- eVocabTrg - target-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabSrc - source-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2fProbs - ttable E-->F (i.e., Pr(f|e))
- f2eProbs - ttable F-->E (i.e., Pr(e|f))
- sLogger - Logger object for log output

Throws: IOException
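The tf-translation formula, tf(e) = sum_f{tf(f)*prob(e|f)}, can be sketched with plain maps. The real methods key their output by E-term ids obtained from the vocabularies and look probabilities up in a `TTable_monolithic_IFAs`; this sketch substitutes `String` keys and a nested map for readability:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of CLIR tf translation: tf(e) = sum_f { tf(f) * Pr(e|f) }.
// fTfs maps F-terms to term frequencies; f2eProbs maps each F-term to
// its translation probabilities Pr(e|f). String keys and a nested map
// stand in for the project's vocabulary ids and ttable types.
class TfTranslateSketch {
  static Map<String, Float> translateTfs(Map<String, Integer> fTfs,
      Map<String, Map<String, Float>> f2eProbs) {
    Map<String, Float> eTfs = new HashMap<>();
    for (Map.Entry<String, Integer> f : fTfs.entrySet()) {
      Map<String, Float> trans = f2eProbs.get(f.getKey());
      if (trans == null) continue;  // OOV term: no translations known
      for (Map.Entry<String, Float> e : trans.entrySet()) {
        // accumulate this F-term's contribution to each E-term's tf
        eTfs.merge(e.getKey(), f.getValue() * e.getValue(), Float::sum);
      }
    }
    return eTfs;
  }
}
```

`updateTFsByTerm` below performs one iteration of the outer loop (a single F-term's contribution), and `translateDFTable` applies the same accumulation pattern to df values using Pr(f|e) in place of Pr(e|f).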
public static int translateTFs(TermDocVector doc, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger) throws IOException
Parameters:
- doc - mapping from F-term strings to tf values
- tfTable - to be returned, a mapping from E-term ids to tf values
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- eVocabTrg - target-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabSrc - source-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2fProbs - ttable E-->F (i.e., Pr(f|e))
- f2eProbs - ttable F-->E (i.e., Pr(e|f))
- sLogger - Logger object for log output

Throws: IOException
public static HMapIFW updateTFsByTerm(String fTerm, int tf, HMapIFW tfTable, Vocab eVocabSrc, Vocab eVocabTrg, Vocab fVocabSrc, Vocab fVocabTrg, TTable_monolithic_IFAs e2fProbs, TTable_monolithic_IFAs f2eProbs, Tokenizer tokenizer, org.apache.log4j.Logger sLogger)
Calling this method computes a single summand of the above equation.
Parameters:
- fTerm - term in a document in F
- tf - term frequency of fTerm
- tfTable - to be updated, a mapping from E-term ids to tf values
- eVocabSrc - source-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- eVocabTrg - target-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabSrc - source-side vocabulary of the ttable F-->E (i.e., Pr(e|f))
- fVocabTrg - target-side vocabulary of the ttable E-->F (i.e., Pr(f|e))
- e2fProbs - ttable E-->F (i.e., Pr(f|e))
- f2eProbs - ttable F-->E (i.e., Pr(e|f))
- sLogger - Logger object for log output

Throws: IOException