Cloud9

A Hadoop toolkit for working with big data

These are the solutions to the bigram counting exercise. Part II takes advantage of the "order inversion" design pattern to compute bigram relative frequencies; this design pattern is covered in Chapter 3 of Lin and Dyer's book "Data-Intensive Text Processing with MapReduce".

Part I: Count the bigrams

The solution can be found in edu.umd.cloud9.example.bigram.BigramCount. This implementation is quite simple: just like word count, except that we're counting bigrams instead of words. The keys emitted by the mapper are (a, b) pairs representing the observed bigrams, while the values are their counts. The reducer simply sums these counts.
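Very roughly, the mapper and reducer look something like the sketch below. This is not the actual source: the class and variable names (BigramMapper, BigramReducer) are made up for illustration, and the real implementation lives in edu.umd.cloud9.example.bigram.BigramCount.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only. Emits each adjacent pair of words on a line
// as a single "a b" key with count 1.
public static class BigramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text bigram = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().trim().split("\\s+");
    for (int i = 1; i < words.length; i++) {
      bigram.set(words[i - 1] + " " + words[i]);
      context.write(bigram, ONE);
    }
  }
}

// Sums the counts for each bigram, exactly as in word count.
public static class BigramReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable sum = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int total = 0;
    for (IntWritable v : values) {
      total += v.get();
    }
    sum.set(total);
    context.write(key, sum);
  }
}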

Here's the command-line invocation (this should run either in local mode or on a cluster; just make sure the paths are correct):

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramCount \
   -input data/bible+shakes.nopunc.gz -output bigram -numReducers 1

Answers to the questions: For analyzing the results, run edu.umd.cloud9.example.bigram.AnalyzeBigramCount. Sample invocation:

$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramCount \
   -Dexec.args="-input bigram"

Note that we specify the entire output directory. You'll see the following output:

total number of unique bigrams: 424462
total number of bigrams: 1578220
number of bigrams that appear only once: 296139

ten most frequent bigrams:
of the     13037
and the     7034
the lord    7017
in the      6738
to the      3799
i will      3470
and he      3020
shall be    3013
all the     2714
i have      2666

So, we see that 296,139 bigrams appear only once. The top ten bigrams together account for 13037 + 7034 + 7017 + 6738 + 3799 + 3470 + 3020 + 3013 + 2714 + 2666 = 52508 occurrences, or 52508/1578220 = 3.33% of all bigram occurrences.

Part II: From bigram counts to relative frequencies

The solution can be found in edu.umd.cloud9.example.bigram.BigramRelativeFrequency. This program takes advantage of the "order inversion" design pattern, which is covered in Chapter 3 of Lin and Dyer's book "Data-Intensive Text Processing with MapReduce". Very briefly, here's how it works: in addition to each bigram count, the mapper emits a special key (word, *) that contributes to the marginal count of the word. We define the sort order so that this special key always comes before the bigram counts; then, by preserving state across multiple reduce method calls, we can compute the relative frequency directly by simple division. We also need a custom partitioner to send all keys with the same left element to the same reducer.
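To make the moving parts concrete, here is a minimal sketch of the mapper, partitioner, and reducer. It is not the actual source: the names are illustrative, and it assumes Cloud9's PairOfStrings writable (edu.umd.cloud9.io.pair.PairOfStrings), whose natural ordering compares left elements first and then right elements, so for lowercase text the (word, *) key sorts before any (word, b) bigram because '*' precedes all letters.

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

import edu.umd.cloud9.io.pair.PairOfStrings;

// Illustrative sketch only; names do not match the actual implementation.
public static class MyMapper extends Mapper<LongWritable, Text, PairOfStrings, FloatWritable> {
  private static final FloatWritable ONE = new FloatWritable(1);
  private final PairOfStrings pair = new PairOfStrings();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().trim().split("\\s+");
    for (int i = 1; i < words.length; i++) {
      // The bigram itself...
      pair.set(words[i - 1], words[i]);
      context.write(pair, ONE);
      // ...plus the special key that contributes to the word's marginal count.
      pair.set(words[i - 1], "*");
      context.write(pair, ONE);
    }
  }
}

// Partition on the left element only, so that (word, *) and every (word, b)
// end up at the same reducer.
public static class MyPartitioner extends Partitioner<PairOfStrings, FloatWritable> {
  @Override
  public int getPartition(PairOfStrings key, FloatWritable value, int numReduceTasks) {
    return (key.getLeftElement().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

public static class MyReducer
    extends Reducer<PairOfStrings, FloatWritable, PairOfStrings, FloatWritable> {
  private final FloatWritable result = new FloatWritable();
  private float marginal = 0.0f; // state preserved across reduce() calls

  @Override
  public void reduce(PairOfStrings key, Iterable<FloatWritable> values, Context context)
      throws IOException, InterruptedException {
    float sum = 0.0f;
    for (FloatWritable v : values) {
      sum += v.get();
    }
    if (key.getRightElement().equals("*")) {
      // The (word, *) key sorts first, so we see the marginal before any bigram.
      marginal = sum;
      result.set(sum);
    } else {
      // Relative frequency by simple division against the saved marginal.
      result.set(sum / marginal);
    }
    context.write(key, result);
  }
}

Note that the division in the reducer is safe only because the partitioner guarantees that all keys sharing a left element reach the same reducer, and the sort order guarantees the marginal arrives first; neither piece alone is enough.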

Here's the command-line invocation:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramRelativeFrequency \
   -input data/bible+shakes.nopunc.gz -output bigram -numReducers 1

For analyzing the results, run edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequency. Sample invocation:

$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequency \
   -Dexec.args="-input bigram"

Question 1: What are the five most frequent words following the word "light"? What is the frequency of observing each word?

(light, *)	454.0
(light, and)	0.12555066
(light, of)	0.116740085
(light, to)	0.04185022
(light, in)	0.033039648
(light, on)	0.033039648
(light, is)	0.028634362
(light, upon)	0.026431719
(light, that)	0.02202643
(light, a)	0.019823788

Question 2: Same question, except for the word "contain".

(contain, *)            19.0
(contain, a)            0.15789473
(contain, the)          0.15789473
(contain, thee)         0.15789473
(contain, and)          0.05263158
(contain, celestial)    0.05263158
(contain, him)          0.05263158
(contain, let)          0.05263158
(contain, ourselves)    0.05263158
(contain, that)         0.05263158

Question 3: If there are a total of N words in your vocabulary, then there are a total of N^2 possible values for F(w_n | w_{n-1}); in theory, every word can follow every other word (including itself). What fraction of these values are non-zero?

The vocabulary contains 41788 unique words and we observed 424462 unique bigrams, so the fraction is:

424462 / 41788^2 = 424462 / 1746236944 = 0.0002430724

Alternative Implementations

This is a good opportunity to introduce different Hadoop data types. Here is an implementation that uses JsonWritable:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramRelativeFrequencyJson \
   -input data/bible+shakes.nopunc.gz -output bigram-json -numReducers 1

And how to look at the results:

$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequencyJson \
   -Dexec.args="-input bigram-json"

Here is an implementation that uses Pig tuples:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramRelativeFrequencyTuple \
   -input data/bible+shakes.nopunc.gz -output bigram-tuple -numReducers 1

And how to look at the results:

$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequencyTuple \
   -Dexec.args="-input bigram-tuple"