public class Utils extends Object

| Constructor and Description |
|---|
| Utils() |
| Modifier and Type | Method and Description |
|---|---|
| static HMapSFW | combineProbMaps(float threshold, float scale, List<PairOfFloatMap> probMaps) Take a weighted average of a given list of probability distributions. |
| static com.google.gson.JsonArray | createJsonArray(String[] elements) |
| static com.google.gson.JsonArray | createJsonArrayFromProbabilities(HMapSFW probMap) Convert a probability distribution into a JSON array. |
| static String[] | extractPhrases(String[] tokens, int windowSize) |
| static void | filter(HMapSFW probMap, float lexProbThreshold) |
| static Map<String,HMapSFW> | generateTranslationTable(FileSystem fs, Configuration conf, String grammarFile, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer) Read an SCFG (synchronous context-free grammar) and convert it into a set of probability distributions, one per source token that appears on the LHS of any rule in the grammar. |
| static Set<PairOfStrings> | getPairsInSCFG(FileSystem fs, String grammarFile) |
| static String | getSetting(Configuration conf) |
| static Map<String,String> | getStemMapping(String origQuery, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer) Create a mapping between query-language stemming and document-language stemming (if both are turned on). |
| static String | ivory2Indri(String structuredQuery) |
| static void | normalize(HMapSFW probMap) L1-normalization. |
| static void | normalize(Map<String,HMapSFW> probMap, float lexProbThreshold, float cumProbThreshold, int maxNumTrans) Given a distribution of probabilities, normalize so that the sum of probabilities is exactly 1.0, or cumProbThreshold if that is lower than 1.0. |
| static void | processRule(int isOne2Many, boolean isMany2Many, float score, String rule, Set<String> bagOfTargetTokens, Map<String,HMapSFW> probDist, HMapSFW phraseDist, HMapSIW srcTokenCnt, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer, Map<String,String> stemmed2Stemmed, Set<String> unknownWords) |
| static List<String> | readOriginalQueries(FileSystem fs, String originalQueriesFile) |
| static Set<String> | readUnknowns(FileSystem fs, String unkFile) |
| static String | removeBorderStopWords(Tokenizer tokenizer, String tokenizedText) Remove stop words from text that has been tokenized. |
| static HMapSFW | scaleProbMap(float threshold, float scale, HMapSFW probMap) Scale a probability distribution (multiply each entry by scale), then filter out entries below threshold. |
public static HMapSFW combineProbMaps(float threshold, float scale, List<PairOfFloatMap> probMaps)

Take a weighted average of a given list of probability distributions.

Parameters:
- threshold - lower bound on the final probability of entries
- scale - value between 0 and 1 that determines the total probability in the final distribution (e.g., a scale of 0.2 turns [0.8, 0.1, 0.1] into [0.16, 0.02, 0.02])
- probMaps - list of probability distributions

public static com.google.gson.JsonArray createJsonArray(String[] elements)
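From its parameter descriptions, combineProbMaps computes a weighted average of several distributions, then scales and filters the result. The following is a sketch under those assumptions: a small record stands in for PairOfFloatMap (assumed to pair a weight with a distribution), plain HashMaps stand in for HMapSFW, and the example entries are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombineProbMapsSketch {
    // One (weight, distribution) pair, standing in for PairOfFloatMap.
    record WeightedMap(float weight, Map<String, Float> map) {}

    // Weighted average of the distributions, scaled; low-probability entries dropped.
    static Map<String, Float> combine(float threshold, float scale, List<WeightedMap> probMaps) {
        float totalWeight = 0f;
        for (WeightedMap wm : probMaps) totalWeight += wm.weight();
        Map<String, Float> combined = new HashMap<>();
        for (WeightedMap wm : probMaps) {
            for (Map.Entry<String, Float> e : wm.map().entrySet()) {
                combined.merge(e.getKey(), e.getValue() * wm.weight() / totalWeight, Float::sum);
            }
        }
        combined.replaceAll((k, p) -> p * scale);        // scale total probability mass
        combined.values().removeIf(p -> p < threshold);  // filter low-probability entries
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Float> fromGrammar = Map.of("haus", 0.7f, "heim", 0.3f);
        Map<String, Float> fromLexicon = Map.of("haus", 0.9f, "gebaeude", 0.1f);
        // equal weights: haus -> (0.7 + 0.9) / 2 = 0.8
        System.out.println(combine(0.05f, 1f, List.of(
                new WeightedMap(1f, fromGrammar), new WeightedMap(1f, fromLexicon))));
    }
}
```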
public static com.google.gson.JsonArray createJsonArrayFromProbabilities(HMapSFW probMap)

Convert a probability distribution into a JSON array.

Parameters:
- probMap

public static String[] extractPhrases(String[] tokens, int windowSize)
Parameters:
- tokens - tokens of the query
- windowSize - window size of each "phrase" to be extracted

public static void filter(HMapSFW probMap, float lexProbThreshold)
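The extractPhrases parameters suggest a sliding window over the query tokens. A sketch under that reading (joining window tokens with single spaces is an assumption):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractPhrasesSketch {
    // Return every contiguous run of windowSize tokens, joined by spaces.
    static String[] extractPhrases(String[] tokens, int windowSize) {
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i + windowSize <= tokens.length; i++) {
            phrases.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + windowSize)));
        }
        return phrases.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = {"cheap", "flights", "to", "rome"};
        // windowSize 2 -> "cheap flights", "flights to", "to rome"
        for (String p : extractPhrases(tokens, 2)) System.out.println(p);
    }
}
```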
public static Map<String,HMapSFW> generateTranslationTable(FileSystem fs, Configuration conf, String grammarFile, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer)

Read an SCFG (synchronous context-free grammar) and convert it into a set of probability distributions, one per source token that appears on the LHS of any rule in the grammar.

Parameters:
- conf - Configuration object from which the grammar file is read
- docLangTokenizer - used to check for stopwords on the RHS

public static Set<PairOfStrings> getPairsInSCFG(FileSystem fs, String grammarFile)
public static String getSetting(Configuration conf)

public static Map<String,String> getStemMapping(String origQuery, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer)

Create a mapping between query-language stemming and document-language stemming (if both are turned on).

Parameters:
- origQuery
- queryLangTokenizer - no stemming or stopword removal
- docLangTokenizer - no stopword removal, stemming enabled

public static void normalize(HMapSFW probMap)
L1-normalization.

Parameters:
- probMap

public static void normalize(Map<String,HMapSFW> probMap, float lexProbThreshold, float cumProbThreshold, int maxNumTrans)

Given a distribution of probabilities, normalize so that the sum of probabilities is exactly 1.0, or cumProbThreshold if that is lower than 1.0.

Parameters:
- probMap
- lexProbThreshold
- cumProbThreshold
- maxNumTrans

public static void processRule(int isOne2Many, boolean isMany2Many, float score, String rule, Set<String> bagOfTargetTokens, Map<String,HMapSFW> probDist, HMapSFW phraseDist, HMapSIW srcTokenCnt, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer, Map<String,String> stemmed2Stemmed, Set<String> unknownWords)
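The two normalize overloads above can be sketched as follows, simplified to a single distribution with plain HashMaps in place of HMapSFW. The order of the filtering steps in the thresholded variant (drop below lexProbThreshold, normalize, then keep top entries until maxNumTrans or cumProbThreshold is reached) is an assumption about Utils, not its documented behavior.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

public class NormalizeSketch {
    // L1-normalization: divide each entry by the sum, so entries sum to 1.0.
    static void normalize(Map<String, Float> probMap) {
        float sum = 0f;
        for (float p : probMap.values()) sum += p;
        final float total = sum;
        probMap.replaceAll((k, p) -> p / total);
    }

    // Thresholded variant: drop entries below lexProbThreshold, normalize,
    // then keep the highest-probability entries until either maxNumTrans
    // entries or cumProbThreshold cumulative probability is reached.
    static Map<String, Float> normalize(Map<String, Float> probMap, float lexProbThreshold,
                                        float cumProbThreshold, int maxNumTrans) {
        probMap.values().removeIf(p -> p < lexProbThreshold);
        normalize(probMap);
        Map<String, Float> kept = new LinkedHashMap<>();
        float cum = 0f;
        var sorted = probMap.entrySet().stream()
                .sorted(Map.Entry.<String, Float>comparingByValue(Comparator.reverseOrder()))
                .toList();
        for (var e : sorted) {
            if (kept.size() >= maxNumTrans || cum >= cumProbThreshold) break;
            kept.put(e.getKey(), e.getValue());
            cum += e.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Float> m = new java.util.HashMap<>(Map.of("a", 2f, "b", 1f, "c", 1f));
        normalize(m);
        System.out.println(m);  // values become a=0.5, b=0.25, c=0.25
    }
}
```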
public static List<String> readOriginalQueries(FileSystem fs, String originalQueriesFile)

public static Set<String> readUnknowns(FileSystem fs, String unkFile)

public static String removeBorderStopWords(Tokenizer tokenizer, String tokenizedText)

Remove stop words from text that has been tokenized.

Parameters:
- tokenizedText - input text, assumed to be tokenized
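The name removeBorderStopWords suggests that only stop words at the beginning and end of the token sequence are removed, while interior stop words are kept. A sketch under that assumption, with a hard-coded stopword set standing in for the Tokenizer's stopword check:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class RemoveBorderStopWordsSketch {
    // Strip stop words from the beginning and end of tokenized text,
    // leaving interior stop words untouched.
    static String removeBorderStopWords(Set<String> stopWords, String tokenizedText) {
        List<String> tokens = new ArrayList<>(Arrays.asList(tokenizedText.split(" ")));
        while (!tokens.isEmpty() && stopWords.contains(tokens.get(0))) {
            tokens.remove(0);
        }
        while (!tokens.isEmpty() && stopWords.contains(tokens.get(tokens.size() - 1))) {
            tokens.remove(tokens.size() - 1);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "of", "a");
        System.out.println(removeBorderStopWords(stop, "the wall of china the"));
        // -> "wall of china": the interior "of" is kept
    }
}
```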