This page describes the Cloud9 reference implementation of parallel breadth-first search, as described in Chapter 5 of Data-Intensive Text Processing with MapReduce. Note that the book does not describe more recently developed design patterns for graph algorithms; see this page for more information.
We use a very simple text format for specifying the graph structure. A graph with n nodes is represented in a text file with n arbitrarily-ordered lines. Each line begins with a nodeid (a numeric identifier for the node), followed by its adjacency list, which specifies the neighbors reachable via outgoing edges. The fields are tab-separated. Note that even if a node has no outgoing edges, it still needs a line in the file to represent it. Here's a simple example (tabs replaced with spaces for presentation reasons):
1    3    4
2    1
3
4    2    3
This represents a graph with four nodes: nodeid 1 points to 3 and 4; nodeid 2 points to 1; nodeid 3 is a dangling node (no neighbors); and nodeid 4 points to nodes 2 and 3.
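Purely as an illustration (this parser is not part of Cloud9, and the class name is made up for the example), here is how a line in this format could be read:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustration: parse one line of the tab-separated
// adjacency-list format described above.
public class AdjacencyLineParser {
  public static void parse(String line) {
    String[] fields = line.split("\t");
    int nodeId = Integer.parseInt(fields[0]);        // first field: the nodeid
    List<Integer> neighbors = new ArrayList<>();
    for (int i = 1; i < fields.length; i++) {        // remaining fields: outgoing edges
      neighbors.add(Integer.parseInt(fields[i]));
    }
    System.out.println(nodeId + " -> " + neighbors); // e.g., "1 -> [3, 4]"
  }

  public static void main(String[] args) {
    parse("1\t3\t4"); // node 1 points to nodes 3 and 4
    parse("3");       // node 3 is a dangling node (no outgoing edges)
  }
}
```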
Here, we'll be running parallel breadth-first search on Wikipedia.
Refer to this page on working with Wikipedia. It contains instructions for packing Wikipedia pages from the raw XML distribution into block-compressed SequenceFiles for convenient access. Briefly, here are the steps:

First, build a docno mapping, making sure that you include all pages with the -keep_all option:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.WikipediaDocnoMappingBuilder \
    -input /shared/collections/wikipedia/raw/enwiki-20121201-pages-articles \
    -output_file enwiki-20121201-docno.dat -wiki_language en -keep_all
Next, repack the collection into block-compressed SequenceFiles:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
    -input /shared/collections/wikipedia/raw/enwiki-20121201-pages-articles -output enwiki-20121201.block \
    -mapping_file enwiki-20121201-docno.dat -wiki_language en -compression_type block
Once you've done that, extract the link graph:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.graph.ExtractWikipediaLinkGraph \
    -input enwiki-20121201.block -edges_output enwiki-20121201.edges \
    -adjacency_list_output enwiki-20121201.adj -num_partitions 10
From the counters you'll see:
edu.umd.cloud9.collection.wikipedia.graph.ExtractWikipediaLinkGraph$GraphInfo
    EDGES=121762273
    TOTAL_VERTICES=12961996
    VERTICIES_WITH_OUTLINKS=10813673
These counters give some statistics on the size of the graph: roughly 122 million edges over about 13 million vertices, of which about 10.8 million have at least one outgoing link.
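If you are driving these jobs programmatically, the same numbers can be read from the completed job's counters rather than scraped from the console. A small sketch using the standard Hadoop counters API; the group name is the fully qualified enum class name shown in the dump above:

```java
import java.io.IOException;

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

// Sketch: after the extraction job finishes, pull the graph statistics
// out of its counters.
public class GraphCounters {
  private static final String GROUP =
      "edu.umd.cloud9.collection.wikipedia.graph.ExtractWikipediaLinkGraph$GraphInfo";

  public static void printStats(Job job) throws IOException {
    Counters counters = job.getCounters();
    long edges = counters.findCounter(GROUP, "EDGES").getValue();
    long vertices = counters.findCounter(GROUP, "TOTAL_VERTICES").getValue();
    System.out.println("edges = " + edges + ", vertices = " + vertices);
  }
}
```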
After extracting the link structure, we take the plain-text representation and pack it into appropriate Hadoop data structures. At the same time, we specify the source of the breadth-first search:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.EncodeBfsGraph \
    -input enwiki-20121201.adj -output enwiki-20121201.bfs/iter0000 -src 12
We'll start from nodeid 12, Anarchism, which is the first full article in the dataset. The number of nodes in the link graph is stored in a counter and can be read either from the command line or from the jobtracker.
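What do these "appropriate Hadoop data structures" look like? Purely as an illustration (this is a hypothetical class, not the actual record type EncodeBfsGraph writes), a node record along the lines of Chapter 5 carries the node id, its current distance from the source, and its adjacency list:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Hypothetical node record for parallel BFS (not the actual Cloud9 class).
// EncodeBfsGraph effectively gives the source node (-src 12 above)
// distance 0 and every other node "infinity".
public class BfsNode implements Writable {
  public static final int INFINITY = Integer.MAX_VALUE;

  private int nodeId;
  private int distance = INFINITY;
  private int[] adjacency = new int[0];

  public int getNodeId() { return nodeId; }
  public void setNodeId(int id) { nodeId = id; }
  public int getDistance() { return distance; }
  public void setDistance(int d) { distance = d; }
  public int[] getAdjacency() { return adjacency; }
  public void setAdjacency(int[] adj) { adjacency = adj; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(nodeId);
    out.writeInt(distance);
    out.writeInt(adjacency.length);
    for (int neighbor : adjacency) {
      out.writeInt(neighbor);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    nodeId = in.readInt();
    distance = in.readInt();
    adjacency = new int[in.readInt()];
    for (int i = 0; i < adjacency.length; i++) {
      adjacency[i] = in.readInt();
    }
  }
}
```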
This handy program can be used to find all reachable nodes and print them in plain text:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindReachableNodes \
    -input enwiki-20121201.bfs/iter0000 -output enwiki-20121201.bfs-reachable/iter0000
As a sanity check, we see that only the source node is reachable.
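"Reachable" here simply means that a node's distance is finite. In terms of the hypothetical BfsNode above, FindReachableNodes behaves like the following map-only filter (a sketch, not the actual implementation):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a map-only filter: emit only nodes whose distance is finite.
// FindReachableNodes does the equivalent over the actual Cloud9 node records.
public class ReachableFilterMapper
    extends Mapper<IntWritable, BfsNode, IntWritable, Text> {
  @Override
  protected void map(IntWritable nid, BfsNode node, Context context)
      throws IOException, InterruptedException {
    if (node.getDistance() != BfsNode.INFINITY) {
      context.write(nid, new Text("distance=" + node.getDistance()));
    }
  }
}
```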
Now let's run one iteration of parallel breadth-first search:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.IterateBfs \
    -input enwiki-20121201.bfs/iter0000 -output enwiki-20121201.bfs/iter0001 -num_partitions 10
The final argument is the number of partitions (ten in our case). Conceptually, each iteration is a single MapReduce job over the node records, sketched below.
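Following the parallel BFS design in Chapter 5, the mapper sends a tentative distance of d + 1 to every neighbor of a node at distance d and passes the node's structure along, while the reducer keeps the minimum distance it sees for each node and reattaches the adjacency list. Here is a simplified sketch in terms of the hypothetical BfsNode above; the actual IterateBfs implementation differs in its details:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Simplified sketch of one parallel BFS iteration over hypothetical BfsNode
// records; not the actual IterateBfs code.
public class BfsIteration {
  // Mapper: pass the node's structure through unchanged, and send a
  // tentative distance of d + 1 to every neighbor of a node at distance d.
  public static class BfsMapper
      extends Mapper<IntWritable, BfsNode, IntWritable, BfsNode> {
    @Override
    protected void map(IntWritable nid, BfsNode node, Context context)
        throws IOException, InterruptedException {
      context.write(nid, node); // preserve the graph structure
      if (node.getDistance() != BfsNode.INFINITY) {
        for (int neighbor : node.getAdjacency()) {
          BfsNode update = new BfsNode();
          update.setNodeId(neighbor);
          update.setDistance(node.getDistance() + 1); // tentative distance
          context.write(new IntWritable(neighbor), update);
        }
      }
    }
  }

  // Reducer: keep the minimum distance seen for each node and reattach its
  // adjacency list. For simplicity, any message with a non-empty adjacency
  // list is treated as the structure message; a real implementation tags
  // message types explicitly.
  public static class BfsReducer
      extends Reducer<IntWritable, BfsNode, IntWritable, BfsNode> {
    @Override
    protected void reduce(IntWritable nid, Iterable<BfsNode> values, Context context)
        throws IOException, InterruptedException {
      int minDistance = BfsNode.INFINITY;
      int[] adjacency = new int[0];
      for (BfsNode value : values) {
        if (value.getDistance() < minDistance) {
          minDistance = value.getDistance();
        }
        if (value.getAdjacency().length > 0) {
          adjacency = value.getAdjacency().clone();
        }
      }
      BfsNode result = new BfsNode();
      result.setNodeId(nid.get());
      result.setDistance(minDistance);
      result.setAdjacency(adjacency);
      context.write(nid, result);
    }
  }
}
```

In practice, a counter also records whether any distances changed during the iteration, which is how the driver can tell whether the search has converged. After the iteration completes, we can pull out and examine the reachable nodes: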
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindReachableNodes \
    -input enwiki-20121201.bfs/iter0001 -output enwiki-20121201.bfs-reachable/iter0001
Running more iterations, we'll get the following results:
Iteration | Reachable nodes |
--------- | --------------- |
0 | 1 |
1 | 573 |
2 | 37,733 |
3 | 845,452 |
4 | 3,596,247 |
5 | 5,236,564 |
6 | 5,640,663 |
7 | 5,719,546 |
8 | 5,738,196 |
9 | 5,742,452 |
10 | 5,743,930 |
11 | 5,744,670 |
12 | 5,744,969 |
13 | 5,745,165 |
14 | 5,745,283 |
15 | 5,745,373 |
16 | 5,745,439 |
17 | 5,745,488 |
18 | 5,745,533 |
19 | 5,745,570 |
20 | 5,745,608 |
21 | 5,745,642 |
22 | 5,745,663 |
23 | 5,745,683 |
24 | 5,745,699 |
25 | 5,745,711 |
A few observations: only about 44% of Wikipedia pages are reachable (5,745,711 of the 12,961,996 total vertices), but the vast majority of reachable pages are only a few hops away from the source. So why are there pages so far away? (The search still had not converged at iteration 25, but I gave up at that point.) To investigate, we can use a handy tool to extract pages at a specific distance (e.g., 25 hops away):
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindNodeAtDistance \
    -input enwiki-20121201.bfs/iter0025 -output enwiki-20121201.bfs-d/iter0025 -distance 25
As it turns out, there are really long chains of articles in Wikipedia. Here's an example:
What?! Digging around a bit, we find that Lenart Praunsperger was the mayor of Ljubljana (Slovenia) in 1506. His short bio page links to a page for each of his successors, each of whom has a short bio page.
Another example:
This goes on until the early 13th century... So, no, there isn't a bug in the code. Wikipedia is simply idiosyncratic.