A Hadoop toolkit for working with big data
Warning: It is strongly recommended that you first complete the word count tutorial before trying this exercise.
This exercise is a simple extension of the word count demo: in the
first part of the exercise, you'll count bigrams, and in the
second part of the exercise, you'll compute bigram relative
frequencies. For both parts, feel free to use the Hadoop data types in
the lintools-datatypes package.
Take the word count example edu.umd.cloud9.example.simple.DemoWordCount
and extend it to count bigrams. Bigrams are simply sequences of two
consecutive words. For example, the previous sentence contains the
following bigrams: "Bigrams are", "are simply", "simply sequences",
"sequences of", etc.
Work with the sample collection included in Cloud9: the Bible and the
complete works of Shakespeare. Don't worry about doing anything fancy
in terms of tokenization; it's fine to continue using Java's
StringTokenizer.
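Before wiring this into a Hadoop job, it can help to see the counting logic on its own. The following is a minimal sketch in plain Java with no Hadoop dependencies; the class name BigramCountSketch and the method countBigrams are hypothetical and are not part of Cloud9. It assumes each input line is tokenized independently with StringTokenizer, so bigrams do not span line boundaries, which is roughly what a mapper sees when it processes one line at a time. The in-memory map merge stands in for the summation a reducer would perform.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not the Cloud9 solution.
class BigramCountSketch {

    // Counts adjacent token pairs within each line.
    // Assumption: bigrams never cross a line boundary.
    static Map<String, Integer> countBigrams(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            if (!itr.hasMoreTokens()) {
                continue; // empty line: nothing to emit
            }
            String prev = itr.nextToken();
            while (itr.hasMoreTokens()) {
                String cur = itr.nextToken();
                // "Map" step: emit (prev cur, 1); "reduce" step: sum the ones.
                counts.merge(prev + " " + cur, 1, Integer::sum);
                prev = cur;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countBigrams(
                Arrays.asList("Bigrams are simply sequences of two consecutive words"));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In an actual MapReduce version, the body of the inner loop would live in the mapper and the summation in the reducer (and ideally a combiner), much as in the word count demo.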
Questions to answer:
Extend your program to compute bigram relative frequencies, i.e., how likely you are to observe a word given the preceding word. The output of the code should be a table of values for F(W_n | W_{n-1}).
Hint: to compute F(B|A), count up the number of occurrences of the bigram "A B", and then divide by the number of occurrences of all the bigrams that start with "A".
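The arithmetic in the hint can be sketched in plain Java, again without Hadoop. The class name BigramRelativeFrequencySketch and the method relativeFrequencies are hypothetical, and the sketch assumes everything fits in memory and that bigrams do not span line boundaries. It keeps two tallies: the count of each bigram "A B", and the total count of all bigrams starting with "A". It then divides the first by the second.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not the Cloud9 solution.
class BigramRelativeFrequencySketch {

    // Returns a nested map: left word A -> (right word B -> F(B|A)).
    static Map<String, Map<String, Double>> relativeFrequencies(Iterable<String> lines) {
        Map<String, Map<String, Integer>> pairCounts = new TreeMap<>();
        Map<String, Integer> leftTotals = new TreeMap<>();

        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            if (!itr.hasMoreTokens()) {
                continue;
            }
            String prev = itr.nextToken();
            while (itr.hasMoreTokens()) {
                String cur = itr.nextToken();
                // Count of the bigram "prev cur".
                pairCounts.computeIfAbsent(prev, k -> new TreeMap<>())
                          .merge(cur, 1, Integer::sum);
                // Count of all bigrams that start with "prev".
                leftTotals.merge(prev, 1, Integer::sum);
                prev = cur;
            }
        }

        Map<String, Map<String, Double>> freqs = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> left : pairCounts.entrySet()) {
            double total = leftTotals.get(left.getKey());
            Map<String, Double> row = new TreeMap<>();
            for (Map.Entry<String, Integer> right : left.getValue().entrySet()) {
                row.put(right.getKey(), right.getValue() / total); // F(B|A)
            }
            freqs.put(left.getKey(), row);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> freqs =
                relativeFrequencies(Arrays.asList("a b a c a b"));
        for (Map.Entry<String, Map<String, Double>> left : freqs.entrySet()) {
            for (Map.Entry<String, Double> right : left.getValue().entrySet()) {
                System.out.println("F(" + right.getKey() + "|" + left.getKey() + ") = "
                        + right.getValue());
            }
        }
    }
}
```

For the input "a b a c a b", three bigrams start with "a" ("a b" twice, "a c" once), so F(b|a) is 2/3 and F(c|a) is 1/3. In a real MapReduce job the per-word totals cannot simply be held in one map; how to make the total for "A" available to the reducer before the individual "A B" counts is the design question this part of the exercise is asking you to work out.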
Questions to answer:
When you're ready, the solutions to this exercise are located here.