At the heart of an empirical, data driven approach to text processing is "counting stuff", or, more formally, managing frequency distributions. A closely related task is managing feature vectors. As it turns out, both tasks can take advantage of similar underlying data structures. This page focuses on efficient implementations of such data structures.
Let us focus on two cases: first, keeping track of term counts (i.e., where events whose frequencies we want to track are Strings); second, keeping track of integer counts. The second case represents a common optimization in text processing applications where terms have been replaced with unique integer ids.
A first stab might be to use HashMap<String, Integer> and HashMap<Integer, Integer>, respectively, for the two cases above. The biggest issue with this implementation is that the value is Integer, rather than int, since only objects can be stuffed inside Java collections (same in the case of int keys). Autoboxing is slow, and storing Integer objects increases the memory footprint.
To get around this, Cloud9 implements type-specific versions of the standard Java HashMap class. The fastutil package (included in the distribution) provides similar replacements:
"native" | fastutil | Standard Java |
HMapKI<K> | Object2IntOpenHashMap<K> | HashMap<K, Integer> |
HMapII | Int2IntOpenHashMap | HashMap<Integer, Integer> |
Well, how much better are these alternatives? The following microbenchmarks provide some answers.