A Hadoop toolkit for working with big data

This page describes the Cloud9 reference implementation of parallel breadth-first search, as described in Chapter 5 of Data-Intensive Text Processing with MapReduce. Note that the book does not describe more recently developed design patterns for graph algorithms; see this page for more information.

We use a very simple text format for specifying the graph structure. A graph with n nodes is represented in a text file with n arbitrarily-ordered lines. Each line begins with a nodeid (a numeric identifier for the node), followed by its adjacency list, which specifies the neighbors reachable via outgoing edges. Fields are tab-separated. Note that if a node has no outgoing edges, you still need a line in the file to represent it. Here's a simple example (tabs replaced with spaces for presentation):

1    3    4
2    1
3
4    2    3

This represents a graph with four nodes: nodeid 1 points to nodes 3 and 4; nodeid 2 points to node 1; nodeid 3 is a dangling node (no outgoing edges); and nodeid 4 points to nodes 2 and 3.
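The format is simple enough to parse in a few lines. Here's an illustrative sketch (my own code, not part of Cloud9), assuming tab-separated fields as described above:

```python
def parse_adjacency_file(lines):
    """Parse the plain-text graph format: one line per node,
    tab-separated; the first field is the nodeid, the rest are neighbors."""
    graph = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        nodeid = int(fields[0])
        # A node with no outgoing edges still gets a line: just its id.
        graph[nodeid] = [int(n) for n in fields[1:]]
    return graph

# The four-node example from above (node 3 is dangling):
example = ["1\t3\t4", "2\t1", "3", "4\t2\t3"]
print(parse_adjacency_file(example))
# {1: [3, 4], 2: [1], 3: [], 4: [2, 3]}
```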

Here, we'll be running parallel breadth-first search on Wikipedia. Refer to this page on working with Wikipedia. It contains instructions for packing Wikipedia pages from the raw XML distribution into block-compressed SequenceFiles for convenient access. Briefly, here are the steps:

First, build a docno mapping, making sure that you include all pages with the -keep_all option:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.WikipediaDocnoMappingBuilder \
   -input /shared/collections/wikipedia/raw/enwiki-20121201-pages-articles \
   -output_file enwiki-20121201-docno.dat -wiki_language en -keep_all

Next, repack the collection into block-compressed SequenceFiles:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
   -input /shared/collections/wikipedia/raw/enwiki-20121201-pages-articles -output enwiki-20121201.block \
   -mapping_file enwiki-20121201-docno.dat -wiki_language en -compression_type block

Once you've done that, extract the link graph:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.graph.ExtractWikipediaLinkGraph \
   -input enwiki-20121201.block -edges_output enwiki-20121201.edges -adjacency_list_output enwiki-20121201.adj -num_partitions 10

The job's counters provide statistics on the size of the extracted graph.

After extracting the link structure, we take the plain-text representation and pack it into appropriate Hadoop data structures. At the same time, we specify the source of the breadth-first search:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.EncodeBfsGraph \
   -input enwiki-20121201.adj -output enwiki-20121201.bfs/iter0000 -src 12

We'll start from nodeid 12, Anarchism, which is the first full article in the dataset. The number of nodes in the link graph is stored in a counter and can be read either from the command line or from the jobtracker.
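Conceptually, this encoding step attaches a distance field to each node's adjacency list: zero for the source, "infinity" (i.e., not yet reached) for everything else. A minimal sketch of that initialization (illustrative only; the names and in-memory representation are mine, not Cloud9's actual Hadoop data structures):

```python
INF = float("inf")  # stand-in for the "not yet reached" marker

def encode_bfs_graph(graph, src):
    """Attach a BFS distance to every node: 0 for the source,
    infinity for everything else."""
    return {nodeid: {"dist": 0 if nodeid == src else INF, "adj": adj}
            for nodeid, adj in graph.items()}

# Toy graph from the format example above, with node 1 as the source:
nodes = encode_bfs_graph({1: [3, 4], 2: [1], 3: [], 4: [2, 3]}, src=1)
print(nodes[1]["dist"], nodes[2]["dist"])  # 0 inf
```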

This handy program can be used to find all reachable nodes and print them in plain text:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindReachableNodes \
   -input enwiki-20121201.bfs/iter0000 -output enwiki-20121201.bfs-reachable/iter0000

As a sanity check, we see that only the source node is reachable.
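Finding reachable nodes amounts to filtering out nodes still at infinite distance. A toy sketch under the same kind of representation (illustrative names and nodeids, not the Cloud9 implementation):

```python
INF = float("inf")

def reachable(nodes):
    """Keep only nodes whose BFS distance has been set."""
    return sorted(n for n, v in nodes.items() if v["dist"] < INF)

# Right after encoding, only the source node is reachable:
nodes = {12: {"dist": 0, "adj": [34, 56]},
         34: {"dist": INF, "adj": []},
         56: {"dist": INF, "adj": [12]}}
print(reachable(nodes))  # [12]
```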

Now let's run one iteration of parallel breadth-first search:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.IterateBfs \
   -input enwiki-20121201.bfs/iter0000 -output enwiki-20121201.bfs/iter0001 -num_partitions 10

The final argument is the number of partitions (ten in our case). We can pull out and examine the reachable nodes:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindReachableNodes \
   -input enwiki-20121201.bfs/iter0001 -output enwiki-20121201.bfs-reachable/iter0001
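One iteration follows the standard MapReduce pattern for parallel BFS described in the book: mappers pass along the graph structure and emit tentative distances over outgoing edges; reducers recover the structure and keep the minimum distance per node. A self-contained, non-Hadoop sketch (function names and representation are my own, not Cloud9's):

```python
INF = float("inf")

def map_phase(nodes):
    """Emit (nodeid, structure) to preserve the graph, plus
    (neighbor, tentative distance) for every outgoing edge."""
    for nodeid, node in nodes.items():
        yield nodeid, node  # pass the graph structure along
        if node["dist"] < INF:
            for m in node["adj"]:
                yield m, node["dist"] + 1

def reduce_phase(pairs):
    """For each node, recover the structure and take the minimum distance."""
    out = {}
    for nodeid, value in pairs:
        node = out.setdefault(nodeid, {"dist": INF, "adj": []})
        if isinstance(value, dict):  # the structure record
            node["adj"] = value["adj"]
            node["dist"] = min(node["dist"], value["dist"])
        else:  # a tentative distance
            node["dist"] = min(node["dist"], value)
    return out

def iterate_bfs(nodes):
    return reduce_phase(map_phase(nodes))

# Toy graph from earlier, source = node 1:
nodes = {1: {"dist": 0, "adj": [3, 4]},
         2: {"dist": INF, "adj": [1]},
         3: {"dist": INF, "adj": []},
         4: {"dist": INF, "adj": [2, 3]}}
nodes = iterate_bfs(nodes)
# After one iteration, nodes 3 and 4 are at distance 1; node 2 is still unreached.
```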

Running more iterations, we'll get the following results:

Iteration Reachable nodes
0 1
1 573
2 37,733
3 845,452

A few observations: only about 45% of Wikipedia pages are reachable from this source, but the vast majority of reachable pages are only a few hops away. Why, then, are some pages so far away? (The algorithm still had not converged at iteration 25, when I gave up.) To investigate, we can use a handy tool to extract pages at a specific distance (e.g., 25 hops away):

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindNodeAtDistance \
   -input enwiki-20121201.bfs/iter0025 -output enwiki-20121201.bfs-d/iter0025 -distance 25

As it turns out, there are really long chains of articles in Wikipedia. Here's an example:

What?! Digging around a bit, we find that Lenart Praunsperger was the mayor of Ljubljana (Slovenia) in 1506. His short bio page links to a page for each of his successors, each of whom has a short bio page.

Another example:

This goes on until the early 13th century... So, no, there isn't a bug in the code. Wikipedia is simply idiosyncratic.