A Hadoop toolkit for working with big data
These are the solutions to the bigram counting exercise which takes advantage of the "order inversion" design pattern to compute bigram relative frequencies. This design pattern is covered in Chapter 3 of Lin and Dyer's book "Data-Intensive Text Processing with MapReduce".
The solution can be found
in edu.umd.cloud9.example.bigram.BigramCount
. This
implementation is quite simple: just like word count, except that
we're counting bigrams instead of words. The keys emitted by the
mapper are (a, b)
pairs representing the observed
bigrams, while the values are their counts. The reducer simply sums
these counts.
Here's the command-line invocation (should run in local mode or on a cluster, just make sure paths are correct):
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramCount \ -input data/bible+shakes.nopunc.gz -output bigram -numReducers 1
Answers to the questions: For analyzing the results, run
edu.umd.cloud9.example.bigram.AnalyzeBigramCounts
. Sample invocation:
$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramCount \ -Dexec.args="-input bigram"
Note we specify the entire output directory. You'll see the following output:
total number of unique bigrams: 424462 total number of bigrams: 1578220 number of bigrams that appear only once: 296139 ten most frequent bigrams: of the 13037 and the 7034 the lord 7017 in the 6738 to the 3799 i will 3470 and he 3020 shall be 3013 all the 2714 i have 2666
So, we see that the number of bigrams that appear only once is 296,139. The fraction of all bigrams occurrences the top ten bigrams account for: 52508/1578220 = 3.33%
The solution can be found
in edu.umd.cloud9.example.bigram.BigramRelativeFrequency
.
This program takes advantage of the "order inversion" design
pattern, which is covered in Chapter 3 of Lin and Dyer's book
"Data-Intensive Text Processing with
MapReduce". Here's how it works, very briefly: in addition to
each bigram count, the mapper emits an special key (word, *) which
contributes to the marginal count of the word. We define the sort
order so that this special key always comes before the bigram counts,
so that by preserving state across multiple reduce method calls,we can
compute the relative frequency directly by simple division. We also
need to define a custom partitioner to send all keys with the same
left element to the same reducer.
Here's the command-line invocation:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramRelativeFrequency \ -input data/bible+shakes.nopunc.gz -output bigram -numReducers 1
For analyzing the results, run
edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequency
.
Sample invocation:
$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequency \ -Dexec.args="-input bigram"
Question 1: What are the five most frequent words following the word "light"? What is the frequency of observing each word?
(light, *) 454.0 (light, and) 0.12555066 (light, of) 0.116740085 (light, to) 0.04185022 (light, in) 0.033039648 (light, on) 0.033039648 (light, is) 0.028634362 (light, upon) 0.026431719 (light, that) 0.02202643 (light, a) 0.019823788
Question 2: Same question, except for the word "contain".
(contain, *) 19.0 (contain, a) 0.15789473 (contain, the) 0.15789473 (contain, thee) 0.15789473 (contain, and) 0.05263158 (contain, celestial) 0.05263158 (contain, him) 0.05263158 (contain, let) 0.05263158 (contain, ourselves) 0.05263158 (contain, that) 0.05263158
If there are a total of N words in your vocabulary, then there are a total of N2 possible values for F(Wn|Wn-1)—in theory, every word can follow every other word (including itself). What fraction of these values are non-zero?
Question 3:
424462 / 417882 = 0.0002430724
This is a good opportunity to introduce different Hadoop data
types. Here is an implementation uses JsonWritable
:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramRelativeFrequencyJson \ -input data/bible+shakes.nopunc.gz -output bigram-json -numReducers 1
And how to look at the results:
$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequencyJson \ -Dexec.args="-input bigram-json"
Here is an implementation uses Pig tuples:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bigram.BigramRelativeFrequencyTuple \ -input data/bible+shakes.nopunc.gz -output bigram-tuple -numReducers 1
And how to look at the results:
$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.bigram.AnalyzeBigramRelativeFrequencyTuple \ -Dexec.args="-input bigram-tuple"