public class Utils extends Object

| Constructor and Description |
|---|
| Utils() |
| Modifier and Type | Method and Description |
|---|---|
| static HMapSFW | combineProbMaps(float threshold, float scale, List<PairOfFloatMap> probMaps) Take a weighted average of a given list of probability distributions. |
| static com.google.gson.JsonArray | createJsonArray(String[] elements) |
| static com.google.gson.JsonArray | createJsonArrayFromProbabilities(HMapSFW probMap) Convert a probability distribution into a JSON array. |
| static String[] | extractPhrases(String[] tokens, int windowSize) |
| static void | filter(HMapSFW probMap, float lexProbThreshold) |
| static Map<String,HMapSFW> | generateTranslationTable(FileSystem fs, Configuration conf, String grammarFile, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer) Read an SCFG (synchronous context-free grammar) and convert it into a set of probability distributions, one per source token that appears on the LHS of any rule in the grammar. |
| static Set<PairOfStrings> | getPairsInSCFG(FileSystem fs, String grammarFile) |
| static String | getSetting(Configuration conf) |
| static Map<String,String> | getStemMapping(String origQuery, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer) Create a mapping between query-language stemming and document-language stemming (if both are turned on). |
| static String | ivory2Indri(String structuredQuery) |
| static void | normalize(HMapSFW probMap) L1-normalization. |
| static void | normalize(Map<String,HMapSFW> probMap, float lexProbThreshold, float cumProbThreshold, int maxNumTrans) Given a distribution of probabilities, normalize so that the sum of probabilities is exactly 1.0, or cumProbThreshold if that is lower than 1.0. |
| static void | processRule(int isOne2Many, boolean isMany2Many, float score, String rule, Set<String> bagOfTargetTokens, Map<String,HMapSFW> probDist, HMapSFW phraseDist, HMapSIW srcTokenCnt, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer, Map<String,String> stemmed2Stemmed, Set<String> unknownWords) |
| static List<String> | readOriginalQueries(FileSystem fs, String originalQueriesFile) |
| static Set<String> | readUnknowns(FileSystem fs, String unkFile) |
| static String | removeBorderStopWords(Tokenizer tokenizer, String tokenizedText) Remove stop words from text that has been tokenized. |
| static HMapSFW | scaleProbMap(float threshold, float scale, HMapSFW probMap) Scale a probability distribution (multiply each entry by scale), then filter out entries below threshold. |
public static HMapSFW combineProbMaps(float threshold, float scale, List<PairOfFloatMap> probMaps)

Take a weighted average of a given list of probability distributions.

Parameters:
- threshold - lower bound on the final probability of entries
- scale - value between 0 and 1 that determines the total probability in the final distribution (e.g., a scale of 0.2 turns [0.8, 0.1, 0.1] into [0.16, 0.02, 0.02])
- probMaps - list of probability distributions

public static com.google.gson.JsonArray createJsonArray(String[] elements)
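From its parameter descriptions, combineProbMaps computes a weighted average of several distributions, then scales and filters the result. The following is a sketch under those assumptions: a small record stands in for PairOfFloatMap (assumed to pair a weight with a distribution), plain HashMaps stand in for HMapSFW, and the example entries are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombineProbMapsSketch {
    // One (weight, distribution) pair, standing in for PairOfFloatMap.
    record WeightedMap(float weight, Map<String, Float> map) {}

    // Weighted average of the distributions, scaled; low-probability entries dropped.
    static Map<String, Float> combine(float threshold, float scale, List<WeightedMap> probMaps) {
        float totalWeight = 0f;
        for (WeightedMap wm : probMaps) totalWeight += wm.weight();
        Map<String, Float> combined = new HashMap<>();
        for (WeightedMap wm : probMaps) {
            for (Map.Entry<String, Float> e : wm.map().entrySet()) {
                combined.merge(e.getKey(), e.getValue() * wm.weight() / totalWeight, Float::sum);
            }
        }
        combined.replaceAll((k, p) -> p * scale);        // scale total probability mass
        combined.values().removeIf(p -> p < threshold);  // filter low-probability entries
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Float> fromGrammar = Map.of("haus", 0.7f, "heim", 0.3f);
        Map<String, Float> fromLexicon = Map.of("haus", 0.9f, "gebaeude", 0.1f);
        // equal weights: haus -> (0.7 + 0.9) / 2 = 0.8
        System.out.println(combine(0.05f, 1f, List.of(
                new WeightedMap(1f, fromGrammar), new WeightedMap(1f, fromLexicon))));
    }
}
```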
public static com.google.gson.JsonArray createJsonArrayFromProbabilities(HMapSFW probMap)

Convert a probability distribution into a JSON array.

Parameters:
- probMap

public static String[] extractPhrases(String[] tokens, int windowSize)
Parameters:
- tokens - tokens of the query
- windowSize - window size of each "phrase" to be extracted

public static void filter(HMapSFW probMap, float lexProbThreshold)
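The extractPhrases parameters suggest a sliding window over the query tokens. A sketch under that reading (joining window tokens with single spaces is an assumption):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractPhrasesSketch {
    // Return every contiguous run of windowSize tokens, joined by spaces.
    static String[] extractPhrases(String[] tokens, int windowSize) {
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i + windowSize <= tokens.length; i++) {
            phrases.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + windowSize)));
        }
        return phrases.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = {"cheap", "flights", "to", "rome"};
        // windowSize 2 -> "cheap flights", "flights to", "to rome"
        for (String p : extractPhrases(tokens, 2)) System.out.println(p);
    }
}
```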
public static Map<String,HMapSFW> generateTranslationTable(FileSystem fs, Configuration conf, String grammarFile, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer)

Read an SCFG (synchronous context-free grammar) and convert it into a set of probability distributions, one per source token that appears on the LHS of any rule in the grammar.

Parameters:
- conf - Configuration object from which the grammar file is read
- docLangTokenizer - used to check for stopwords on the RHS

public static Set<PairOfStrings> getPairsInSCFG(FileSystem fs, String grammarFile)
public static String getSetting(Configuration conf)

public static Map<String,String> getStemMapping(String origQuery, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer)

Create a mapping between query-language stemming and document-language stemming (if both are turned on).

Parameters:
- origQuery
- queryLangTokenizer - no stemming or stopword removal
- docLangTokenizer - no stopword removal, stemming enabled

public static void normalize(HMapSFW probMap)
L1-normalization.

Parameters:
- probMap

public static void normalize(Map<String,HMapSFW> probMap, float lexProbThreshold, float cumProbThreshold, int maxNumTrans)

Given a distribution of probabilities, normalize so that the sum of probabilities is exactly 1.0, or cumProbThreshold if that is lower than 1.0.

Parameters:
- probMap
- lexProbThreshold
- cumProbThreshold
- maxNumTrans

public static void processRule(int isOne2Many, boolean isMany2Many, float score, String rule, Set<String> bagOfTargetTokens, Map<String,HMapSFW> probDist, HMapSFW phraseDist, HMapSIW srcTokenCnt, Tokenizer queryLangTokenizer, Tokenizer docLangTokenizer, Map<String,String> stemmed2Stemmed, Set<String> unknownWords)
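The two normalize overloads above can be sketched as follows, simplified to a single distribution with plain HashMaps in place of HMapSFW. The order of the filtering steps in the thresholded variant (drop below lexProbThreshold, normalize, then keep top entries until maxNumTrans or cumProbThreshold is reached) is an assumption about Utils, not its documented behavior.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

public class NormalizeSketch {
    // L1-normalization: divide each entry by the sum, so entries sum to 1.0.
    static void normalize(Map<String, Float> probMap) {
        float sum = 0f;
        for (float p : probMap.values()) sum += p;
        final float total = sum;
        probMap.replaceAll((k, p) -> p / total);
    }

    // Thresholded variant: drop entries below lexProbThreshold, normalize,
    // then keep the highest-probability entries until either maxNumTrans
    // entries or cumProbThreshold cumulative probability is reached.
    static Map<String, Float> normalize(Map<String, Float> probMap, float lexProbThreshold,
                                        float cumProbThreshold, int maxNumTrans) {
        probMap.values().removeIf(p -> p < lexProbThreshold);
        normalize(probMap);
        Map<String, Float> kept = new LinkedHashMap<>();
        float cum = 0f;
        var sorted = probMap.entrySet().stream()
                .sorted(Map.Entry.<String, Float>comparingByValue(Comparator.reverseOrder()))
                .toList();
        for (var e : sorted) {
            if (kept.size() >= maxNumTrans || cum >= cumProbThreshold) break;
            kept.put(e.getKey(), e.getValue());
            cum += e.getValue();
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Float> m = new java.util.HashMap<>(Map.of("a", 2f, "b", 1f, "c", 1f));
        normalize(m);
        System.out.println(m);  // values become a=0.5, b=0.25, c=0.25
    }
}
```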
public static List<String> readOriginalQueries(FileSystem fs, String originalQueriesFile)

public static Set<String> readUnknowns(FileSystem fs, String unkFile)

public static String removeBorderStopWords(Tokenizer tokenizer, String tokenizedText)

Remove stop words from text that has been tokenized.

Parameters:
- tokenizedText - input text, assumed to be tokenized
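The name removeBorderStopWords suggests that only stop words at the beginning and end of the token sequence are removed, while interior stop words are kept. A sketch under that assumption, with a hard-coded stopword set standing in for the Tokenizer's stopword check:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class RemoveBorderStopWordsSketch {
    // Strip stop words from the beginning and end of tokenized text,
    // leaving interior stop words untouched.
    static String removeBorderStopWords(Set<String> stopWords, String tokenizedText) {
        List<String> tokens = new ArrayList<>(Arrays.asList(tokenizedText.split(" ")));
        while (!tokens.isEmpty() && stopWords.contains(tokens.get(0))) {
            tokens.remove(0);
        }
        while (!tokens.isEmpty() && stopWords.contains(tokens.get(tokens.size() - 1))) {
            tokens.remove(tokens.size() - 1);
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "of", "a");
        System.out.println(removeBorderStopWords(stop, "the wall of china the"));
        // -> "wall of china": the interior "of" is kept
    }
}
```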