A Hadoop toolkit for working with big data

This page describes the Cloud⁹ reference
implementation of parallel breadth-first search, as described in
Chapter 5 of Data-Intensive Text
Processing with MapReduce. Note that the book does not describe
more recently developed design patterns for graph algorithms;
see this page for more information.

We use a very simple text format for specifying the graph
structure. A graph with *n* nodes is represented in a text file
with *n* arbitrarily-ordered lines. Each line begins with a
nodeid (a numeric identifier for the node), followed by its adjacency
list, which specifies the neighbors reachable via outgoing edges. The
adjacency list is tab-separated. Note that even if a node has no
outgoing edges, you still need a line in the file to represent it.
Here's a simple example (tabs replaced with spaces for presentation
reasons):

```
1 3 4
2 1
3
4 2 3
```

This represents a graph with four nodes: nodeid 1 points to 3 and 4;
nodeid 2 points to 1; nodeid 3 is a dangling node (no neighbors); and
nodeid 4 points to nodes 2 and 3.
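For concreteness, here's a tiny Python sketch that parses this format into an in-memory map (this is purely illustrative and not part of Cloud⁹; the function name is made up):

```python
def parse_adjacency_lines(lines):
    """Map each nodeid to the list of neighbors on its outgoing edges.

    Each line is a tab-separated record: nodeid, then zero or more
    neighbor ids. A dangling node is a line with just the nodeid.
    """
    graph = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        nodeid, neighbors = fields[0], fields[1:]
        graph[int(nodeid)] = [int(n) for n in neighbors]
    return graph

# The four-line example from above, with tabs restored:
example = ["1\t3\t4", "2\t1", "3", "4\t2\t3"]
graph = parse_adjacency_lines(example)
# graph == {1: [3, 4], 2: [1], 3: [], 4: [2, 3]}
```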

Here, we'll be running parallel breadth-first search on Wikipedia.
Refer to this page on working with
Wikipedia. It contains instructions for packing Wikipedia pages
from the raw XML distribution into block-compressed `SequenceFile`s for
convenient access. Briefly, here are the steps:

First, build a docno mapping, making sure that you include all
pages with the `-keep_all` option:

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.WikipediaDocnoMappingBuilder \
    -input /shared/collections/wikipedia/raw/enwiki-20121201-pages-articles \
    -output_file enwiki-20121201-docno.dat -wiki_language en -keep_all
```

Next, repack the collection into block-compressed `SequenceFile`s:

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
    -input /shared/collections/wikipedia/raw/enwiki-20121201-pages-articles -output enwiki-20121201.block \
    -mapping_file enwiki-20121201-docno.dat -wiki_language en -compression_type block
```

Once you've done that, extract the link graph:

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.graph.ExtractWikipediaLinkGraph \
    -input enwiki-20121201.block -edges_output enwiki-20121201.edges \
    -adjacency_list_output enwiki-20121201.adj -num_partitions 10
```

From the counters you'll see:

```
edu.umd.cloud9.collection.wikipedia.graph.ExtractWikipediaLinkGraph$GraphInfo
	EDGES=121762273
	TOTAL_VERTICES=12961996
	VERTICIES_WITH_OUTLINKS=10813673
```

These counters provide some statistics on the size of the graph.
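As a quick back-of-the-envelope check on these numbers (not part of the toolkit), the counters imply an average out-degree of roughly nine links per page, with about one page in six dangling:

```python
# Figures copied from the ExtractWikipediaLinkGraph counters above.
edges = 121_762_273
total_vertices = 12_961_996
vertices_with_outlinks = 10_813_673

avg_out_degree = edges / total_vertices  # ≈ 9.4 outgoing links per page
dangling_fraction = 1 - vertices_with_outlinks / total_vertices  # ≈ 0.17 have no outlinks
```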

After extracting the link structure, we take the plain-text representation and pack it into appropriate Hadoop data structures. At the same time, we specify the source of the breadth-first search:

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.EncodeBfsGraph \
    -input enwiki-20121201.adj -output enwiki-20121201.bfs/iter0000 -src 12
```

We'll start from nodeid 12, Anarchism, which is the first full article in the dataset. The number of nodes in the link graph is stored in a counter and can be read either from the command line or from the jobtracker.

This handy program can be used to find all reachable nodes and print them in plain text:

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindReachableNodes \
    -input enwiki-20121201.bfs/iter0000 -output enwiki-20121201.bfs-reachable/iter0000
```

As a sanity check, we see that only the source node is reachable.

Now let's run one iteration of parallel breadth-first search:

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.IterateBfs \
    -input enwiki-20121201.bfs/iter0000 -output enwiki-20121201.bfs/iter0001 -num_partitions 10
```

The final argument is the number of partitions (ten in our case). We can pull out and examine the reachable nodes:

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindReachableNodes \
    -input enwiki-20121201.bfs/iter0001 -output enwiki-20121201.bfs-reachable/iter0001
```
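Conceptually, each iteration follows the design pattern from Chapter 5: mappers propose tentative distances (current distance plus one) to each neighbor, and reducers keep the minimum distance seen for each node. Here's a minimal single-machine Python sketch of that idea, run on the toy four-node graph from earlier; the function and variable names are illustrative, not from Cloud⁹, and the real Hadoop implementation of course differs in its details:

```python
INF = float("inf")

def bfs_iteration(nodes):
    """One round of parallel BFS over {nodeid: (distance, adjacency_list)}.

    Map phase: every node already reached proposes distance+1 to each
    of its neighbors. Reduce phase: every node keeps the minimum of
    its current distance and the best proposal it received.
    """
    proposals = {}  # nodeid -> best distance proposed this round
    for nodeid, (dist, adj) in nodes.items():
        if dist < INF:  # only reached nodes emit proposals
            for neighbor in adj:
                proposals[neighbor] = min(proposals.get(neighbor, INF), dist + 1)
    return {nodeid: (min(dist, proposals.get(nodeid, INF)), adj)
            for nodeid, (dist, adj) in nodes.items()}

# The toy four-node graph from earlier, with node 1 as the source:
nodes = {1: (0, [3, 4]), 2: (INF, [1]), 3: (INF, []), 4: (INF, [2, 3])}
after_one = bfs_iteration(nodes)      # nodes 3 and 4 reached at distance 1
after_two = bfs_iteration(after_one)  # node 2 reached at distance 2 (via 4)
```

Each call to `bfs_iteration` corresponds to one `IterateBfs` job; the frontier expands by one hop per round, which is why the reachable-node counts below grow iteration by iteration.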

Running more iterations, we'll get the following results:

| Iteration | Reachable nodes |
| --- | --- |
| 0 | 1 |
| 1 | 573 |
| 2 | 37,733 |
| 3 | 845,452 |
| 4 | 3,596,247 |
| 5 | 5,236,564 |
| 6 | 5,640,663 |
| 7 | 5,719,546 |
| 8 | 5,738,196 |
| 9 | 5,742,452 |
| 10 | 5,743,930 |
| 11 | 5,744,670 |
| 12 | 5,744,969 |
| 13 | 5,745,165 |
| 14 | 5,745,283 |
| 15 | 5,745,373 |
| 16 | 5,745,439 |
| 17 | 5,745,488 |
| 18 | 5,745,533 |
| 19 | 5,745,570 |
| 20 | 5,745,608 |
| 21 | 5,745,642 |
| 22 | 5,745,663 |
| 23 | 5,745,683 |
| 24 | 5,745,699 |
| 25 | 5,745,711 |
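The reachable fraction implied by the final row can be checked directly against the counter totals reported earlier (a quick back-of-the-envelope calculation, not part of the toolkit):

```python
# Reachable count at iteration 25 versus total vertices from the
# ExtractWikipediaLinkGraph counters.
reachable = 5_745_711
total_vertices = 12_961_996
fraction = reachable / total_vertices  # ≈ 0.443
```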

A few observations: only about 44% of Wikipedia pages (5,745,711 of the 12,961,996 vertices) are reachable from the source, but the vast majority of reachable pages are only a few hops away. So why are some pages so far away? (The algorithm still had not converged at iteration 25, but I gave up waiting.) To investigate, we can use a handy tool to extract pages that are at a specific distance (e.g., 25 hops away):

```
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.bfs.FindNodeAtDistance \
    -input enwiki-20121201.bfs/iter0025 -output enwiki-20121201.bfs-d/iter0025 -distance 25
```

As it turns out, there are really long chains of articles in Wikipedia. Here's an example:

- Lenart Praunsperger :: Jakob Stettenfelder :: Janez Lindauer :: Volk Meditsch :: Matevz Frang :: Jurij Tazel :: Anton Lantheri :: ...

What?! Digging around a bit, we find that Lenart Praunsperger was the mayor of Ljubljana (Slovenia) in 1506. His short bio page links to a page for each of his successors, each of whom has a short bio page.

Another example:

- List of peers 1060–1069 :: List of peers 1070–1079 :: List of peers 1080–1089 :: List of peers 1090–1099 :: ...

This goes on until the early 13th century... So, no, there isn't a bug in the code. Wikipedia is simply idiosyncratic.