Efficient frequency distributions and feature vectors

At the heart of an empirical, data-driven approach to text processing is "counting stuff", or, more formally, managing frequency distributions. A closely related task is managing feature vectors. As it turns out, both tasks can take advantage of similar underlying data structures. This page focuses on efficient implementations of such data structures.

Let us focus on two cases: first, keeping track of term counts (i.e., where the events whose frequencies we want to track are Strings); second, keeping track of counts of events represented by ints. The second case reflects a common optimization in text processing applications, where terms have been replaced with unique integer ids.

A first stab might be to use HashMap<String, Integer> and HashMap<Integer, Integer>, respectively, for the two cases above. The biggest issue with this implementation is that the values are Integer objects rather than ints, since only objects can be stored in Java collections (the same applies to int keys). Autoboxing is slow, and storing Integer objects increases the memory footprint.
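
To make the boxing overhead concrete, here is a minimal term-counting sketch using the standard collections (the token array is just a stand-in for real tokenized text):

    import java.util.HashMap;
    import java.util.Map;

    public class TermCounts {
      public static void main(String[] args) {
        String[] tokens = { "a", "rose", "is", "a", "rose", "is", "a", "rose" };

        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String token : tokens) {
          Integer count = counts.get(token);                 // Integer object comes back out
          counts.put(token, count == null ? 1 : count + 1);  // unboxed, incremented, reboxed
        }

        System.out.println(counts.get("rose"));  // 3
      }
    }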

To get around this, Cloud9 implements type-specific versions of the standard Java HashMap class. The fastutil package (included in the distribution) provides similar replacements:

  "native"     fastutil                     Standard Java
  HMapKI<K>    Object2IntOpenHashMap<K>     HashMap<K, Integer>
  HMapII       Int2IntOpenHashMap           HashMap<Integer, Integer>
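
As a point of comparison, the same counts can be kept with the fastutil replacement, whose getInt and addTo methods operate directly on primitive ints, so no Integer objects are created (again, the token array is a stand-in):

    import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;

    public class TermCountsFastutil {
      public static void main(String[] args) {
        String[] tokens = { "a", "rose", "is", "a", "rose", "is", "a", "rose" };

        Object2IntOpenHashMap<String> counts = new Object2IntOpenHashMap<String>();
        counts.defaultReturnValue(0);  // getInt() on a missing key returns 0

        for (String token : tokens) {
          counts.addTo(token, 1);      // increments the primitive value in place
        }

        System.out.println(counts.getInt("rose"));  // 3
      }
    }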

Well, how much better are these alternatives? The following microbenchmarks provide some answers.

Raw insertion and lookup

First, a description of the setup:

  • MacBook Pro (2.66 GHz Core i7, 8 GB RAM)
  • Mac OS X 10.6.5
  • Java 1.6u22
  • All experimental results averaged over five trials

In the first benchmark, we insert 5 million random ints and then look up all 5 million (a sketch of the timing loop appears after the tables below). The following shows running times and the memory footprint with Strings as the keys:

                                 Insertion     Lookup        Memory
                                 (total ms)    (total ms)    (bytes per entry)
  HashMap<String, Integer>       6684          1058          179
  HMapKI<String>                 5467 (-18%)   1009 (-5%)    142 (-21%)
  Object2IntOpenHashMap<String>  3846 (-42%)   1424 (+34%)   114 (-36%)

And for ints as the keys:

                             Insertion     Lookup       Memory
                             (total ms)    (total ms)   (bytes per entry)
  HashMap<Integer, Integer>  3966          417          123
  HMapII                     1722 (-57%)   106 (-75%)   60 (-51%)
  Int2IntOpenHashMap         411 (-90%)    55 (-87%)    22 (-82%)
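
For reference, the following is a rough sketch of the kind of loop being timed, shown for the int-keyed case with the fastutil class; the actual benchmark harness may differ in details, and for the String-keyed runs one would convert each key with String.valueOf:

    import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap;
    import java.util.Random;

    public class InsertLookupBenchmark {
      private static final int SIZE = 5000000;

      public static void main(String[] args) {
        Random rand = new Random();
        int[] keys = new int[SIZE];
        for (int i = 0; i < SIZE; i++) {
          keys[i] = rand.nextInt();
        }

        Int2IntOpenHashMap map = new Int2IntOpenHashMap();

        long start = System.currentTimeMillis();
        for (int i = 0; i < SIZE; i++) {
          map.put(keys[i], i);
        }
        System.out.println("Insertion: " + (System.currentTimeMillis() - start) + " ms");

        long sum = 0;  // accumulate so the lookups cannot be optimized away
        start = System.currentTimeMillis();
        for (int i = 0; i < SIZE; i++) {
          sum += map.get(keys[i]);
        }
        System.out.println("Lookup: " + (System.currentTimeMillis() - start) + " ms (" + sum + ")");
      }
    }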

The fastutil classes are clear winners, right? Hang on...

Updates and removes

The second benchmark uses the same setup as before and consists of 10 million rounds. At each round, a random int between 0 and 999 is generated; this serves as the key. Then a random boolean is generated: if true, the value associated with the key is incremented; if false, it is decremented. If the value reaches zero, the key is removed from the map. One might consider this a "random walk" consisting mostly of value updates, plus some insertions and removals of keys.
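
A sketch of this random walk, shown with the fastutil int-keyed map (the actual benchmark code may differ in details):

    import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap;
    import java.util.Random;

    public class RandomWalkBenchmark {
      public static void main(String[] args) {
        Random rand = new Random();
        Int2IntOpenHashMap map = new Int2IntOpenHashMap();
        map.defaultReturnValue(0);  // a missing key behaves like a zero count

        long start = System.currentTimeMillis();
        for (int round = 0; round < 10000000; round++) {
          int key = rand.nextInt(1000);              // key in [0, 999]
          int delta = rand.nextBoolean() ? 1 : -1;   // increment or decrement
          int value = map.get(key) + delta;

          if (value == 0) {
            map.remove(key);                         // drop keys whose count hits zero
          } else {
            map.put(key, value);
          }
        }
        System.out.println("Total: " + (System.currentTimeMillis() - start) + " ms");
      }
    }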

The following shows results, where the keys are left as is (as ints):

                             Total time (ms)
  HashMap<Integer, Integer>  813
  HMapII                     366 (-55%)
  Int2IntOpenHashMap         417 (-49%)

The following shows results, where the keys are treated as Strings:

                                 Total time (ms)
  HashMap<String, Integer>       1420
  HMapKI<String>                 1325 (-7%)
  Object2IntOpenHashMap<String>  2364 (+66%)

It appears that the fastutil classes are poor at supporting removals!

Discussion

An understanding of the underlying implementations helps to explain these results:

  • The default Java HashMap backs every key-value pair with an Entry object. This goes naturally with the use of chaining for hash collisions and supports iteration by keys, values, and entries.
  • The Cloud9 versions follow this basic pattern, but get rid of autoboxing by introducing type-specific variants.
  • The fastutil classes use open addressing and therefore do not need Entry objects to back key-value pairs.

This explains why the fastutil classes are so fast and so memory efficient: there is no need for Entry objects, and thus far less object creation and garbage collection. It also explains why removals are problematic for the fastutil classes: with open addressing, removal entails marking the slot as "deleted" rather than freeing it, and these tombstones must still be traversed by later probes, so performance degrades as they accumulate.
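
To illustrate the removal issue, here is a toy linear-probing map (deliberately simplified, with no resizing, and not how fastutil is actually implemented) in which remove() leaves behind a tombstone that later probes must still step over:

    // Toy open-addressing map with linear probing, for illustration only.
    public class ToyOpenAddressingMap {
      private static final byte EMPTY = 0, OCCUPIED = 1, DELETED = 2;

      private final int[] keys = new int[1024];
      private final int[] values = new int[1024];
      private final byte[] state = new byte[1024];  // EMPTY, OCCUPIED, or DELETED

      public void put(int key, int value) {
        int slot = (key & 0x7fffffff) % keys.length;
        int firstFree = -1;
        // Probe until an EMPTY slot or the key itself; remember the first reusable slot.
        while (state[slot] != EMPTY) {
          if (state[slot] == OCCUPIED && keys[slot] == key) {
            values[slot] = value;  // key already present: overwrite
            return;
          }
          if (state[slot] == DELETED && firstFree == -1) {
            firstFree = slot;
          }
          slot = (slot + 1) % keys.length;
        }
        if (firstFree != -1) {
          slot = firstFree;        // reuse a tombstoned slot if we passed one
        }
        keys[slot] = key;
        values[slot] = value;
        state[slot] = OCCUPIED;
      }

      public int get(int key) {
        int slot = (key & 0x7fffffff) % keys.length;
        // Probing must step over DELETED slots: the key may lie beyond them.
        while (state[slot] != EMPTY) {
          if (state[slot] == OCCUPIED && keys[slot] == key) {
            return values[slot];
          }
          slot = (slot + 1) % keys.length;
        }
        return 0;  // not found
      }

      public void remove(int key) {
        int slot = (key & 0x7fffffff) % keys.length;
        while (state[slot] != EMPTY) {
          if (state[slot] == OCCUPIED && keys[slot] == key) {
            state[slot] = DELETED;  // tombstone: the slot never becomes EMPTY again
            return;
          }
          slot = (slot + 1) % keys.length;
        }
      }
    }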

We've so far only talked about Strings or ints being keys and ints as values. Of course, we'll need other type-specific classes. Here are all the ones currently implemented (with Writable variants for use in Hadoop):

  Class      Replaces                    Implemented interface   Hadoop Writable
  HMapKI<K>  HashMap<K, Integer>         MapKI<K>                HMapKIW<K>
  HMapKF<K>  HashMap<K, Float>           MapKF<K>                HMapKFW<K>
  HMapKL<K>  HashMap<K, Long>            MapKL<K>
  HMapII     HashMap<Integer, Integer>   MapII                   HMapIIW
  HMapIF     HashMap<Integer, Float>     MapIF                   HMapIFW
  HMapIV     HashMap<Integer, V>         MapIV                   HMapIVW<V>

Note that for Hadoop Writables, the key type must be a WritableComparable and the value type must be a Writable. The variants with floats as values are especially useful as feature vectors. There are additionally two specialized classes: HMapSFW, which extends HMapKF<String>, and HMapSIW, which extends HMapKI<String>. These avoid the need to wrap a String in a Hadoop Text object.
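
As a sketch of the feature-vector use case: assuming these classes mirror the java.util.HashMap API with primitive values (put, get, and containsKey working with floats rather than Float objects), accumulating term frequencies might look like the following. The import path and method signatures here are assumptions based on that correspondence, not verified against the Cloud9 source.

    import edu.umd.cloud9.util.map.HMapKF;  // package path is an assumption

    public class FeatureVectorSketch {
      public static void main(String[] args) {
        String[] tokens = { "hadoop", "counts", "hadoop", "maps" };

        // Assumed API: put(K, float) / get(K), mirroring HashMap<K, Float> without boxing.
        HMapKF<String> vector = new HMapKF<String>();
        for (String token : tokens) {
          float current = vector.containsKey(token) ? vector.get(token) : 0.0f;
          vector.put(token, current + 1.0f);
        }

        System.out.println(vector.get("hadoop"));  // 2.0
      }
    }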

Postscript

Knuth's warning about premature optimization being the root of all evil is worth considering here. There are two responses. First, since these alternative classes are close to being drop-in replacements for the standard Java classes, their use doesn't count as "premature optimization". Second, the spirit of Knuth's quote is absolutely correct: if a program spends only 5% of its time accessing and manipulating frequency distributions, then even doubling that speed would yield at most a 2.5% performance improvement overall (not a big deal). However, it is often the case that text processing applications spend much of their time manipulating counts of various events (think, for example, language modeling), in which case faster data structures make a world of difference. It all comes down to the programmer recognizing which case applies.