A Hadoop toolkit for working with big data

"Pairs" and "stripes" are two design patterns introduced in Chapter 3 for computing the word co-occurrence matrix of a large text collection. With pairs, each co-occurring word pair is stored separately; with stripes, all words co-occurring with a conditioning word are stored together in an associative array.

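The difference between the two patterns described above can be sketched in a few lines of single-process Python. This is only an illustration of the data layout, not the Cloud9 implementation: the function names, the whitespace tokenization, and the window size of 2 are all assumptions made for the example, and a real job would distribute the counting across mappers and reducers.

```python
from collections import Counter, defaultdict


def neighbors(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding i itself."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def cooccur_pairs(lines, window=2):
    """Pairs: one counter entry per (word, neighbor) pair."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # assumption: whitespace tokenization
        for i, word in enumerate(tokens):
            for other in neighbors(tokens, i, window):
                counts[(word, other)] += 1
    return counts


def cooccur_stripes(lines, window=2):
    """Stripes: one associative array of neighbor counts per conditioning word."""
    stripes = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i, word in enumerate(tokens):
            stripes[word].update(neighbors(tokens, i, window))
    return stripes


if __name__ == "__main__":
    sample = ["go to the ant thou sluggard consider her ways and be wise"]
    print(cooccur_pairs(sample)[("ant", "thou")])
    print(dict(cooccur_stripes(sample)["ant"]))
```

Both functions hold the same counts. The pairs layout produces many small records, while the stripes layout produces fewer, larger ones keyed by the conditioning word.
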
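These two design patterns are illustrated by the Cloud9 classes ComputeCooccurrenceMatrixPairs and ComputeCooccurrenceMatrixStripes, both in the edu.umd.cloud9.example.cooccur package.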

Here's an invocation of the pairs algorithm on the sample dataset:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.cooccur.ComputeCooccurrenceMatrixPairs \
   -input bible+shakes.nopunc.gz -output cooccur -numReducers 5

Let's find what words co-occur with the word "ant":

$ hadoop fs -cat 'cooccur/part-*' | grep "(ant, "
(ant, an)       1
(ant, and)      1
(ant, teach)    1
(ant, to)       3
(ant, thou)     1
(ant, sluggard) 1
(ant, the)      2

Are the results correct? Let's go back to the original text collection to find out:

$ gunzip -c data/bible+shakes.nopunc.gz | grep " ant "
go to the ant thou sluggard consider her ways and be wise
fool we'll set thee to school to an ant to teach thee

Seems right! And here's an invocation of the stripes algorithm on the same dataset:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.cooccur.ComputeCooccurrenceMatrixStripes \
   -input bible+shakes.nopunc.gz -output cooccur -numReducers 5

And indeed, we get the same results:

$ hadoop fs -cat 'cooccur/part-*' | egrep $'^ant\t'
ant     {thou=1, and=1, an=1, to=3, sluggard=1, the=2, teach=1}
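
The claim that both runs agree can also be checked mechanically instead of by eye. The sketch below parses the pairs lines and the stripe line shown above and compares them. The helper names are hypothetical, and the parsing assumes the textual formats exactly as printed here (a key, whitespace, then the value); it is not part of Cloud9.

```python
import re

# Output copied from the two runs shown above.
pairs_output = [
    "(ant, an)       1",
    "(ant, and)      1",
    "(ant, teach)    1",
    "(ant, to)       3",
    "(ant, thou)     1",
    "(ant, sluggard) 1",
    "(ant, the)      2",
]
stripe_output = "ant     {thou=1, and=1, an=1, to=3, sluggard=1, the=2, teach=1}"

# Assumed line shape for the pairs job: "(left, right)<whitespace>count".
PAIR_LINE = re.compile(r"^\((\S+), (\S+)\)\s+(\d+)$")


def parse_pairs(lines, word):
    """Collect {neighbor: count} for every pairs line whose left word is `word`."""
    counts = {}
    for line in lines:
        match = PAIR_LINE.match(line.strip())
        if match and match.group(1) == word:
            counts[match.group(2)] = int(match.group(3))
    return counts


def parse_stripe(line):
    """Split a stripes line into (conditioning word, {neighbor: count})."""
    key, body = line.strip().split(None, 1)
    counts = {}
    for entry in body.strip("{}").split(", "):
        if entry:  # an empty stripe "{}" yields one empty entry
            neighbor, value = entry.rsplit("=", 1)
            counts[neighbor] = int(value)
    return key, counts


if __name__ == "__main__":
    word, stripe_counts = parse_stripe(stripe_output)
    print(parse_pairs(pairs_output, word) == stripe_counts)
```

Comparing dictionaries rather than raw text matters here because the two jobs list the neighbors in different orders.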