A Hadoop toolkit for working with big data

Warning: It is strongly recommended that you first complete the word count tutorial before trying this exercise.

In this exercise, you are going to implement PageRank in MapReduce. Here are three separate graphs to work with:

Of course, these are just toy datasets, since 1458 nodes wouldn't qualify as "large" by any stretch of the imagination...

The files are tab-delimited adjacency-list representations of the graphs. The first token on each line is the unique id of the source node, and the remaining tokens are the target nodes (i.e., outlinks from the source node). If a node has no outlinks, its line contains only a single token (the source node id).
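Parsing this format comes down to a single split on tabs. Here is a minimal sketch (the class and method names are my own, not part of the toolkit):

```java
public class AdjacencyLineParser {
  // Splits one tab-delimited line: the first token is the source node id,
  // and any remaining tokens are the target node ids (outlinks).
  public static String[] parse(String line) {
    return line.split("\t");
  }

  public static void main(String[] args) {
    String[] tokens = parse("1\t2\t3");
    System.out.println("source=" + tokens[0]);              // source=1
    System.out.println("outlinks=" + (tokens.length - 1));  // outlinks=2
    // A dangling node's line holds only the source id:
    System.out.println(parse("42").length == 1);            // true
  }
}
```

A line with a single token is exactly how you detect a dangling node, which becomes important later in the exercise.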

To make your lives easier, I've created a small demo program that computes PageRank using the JUNG package (2.0.1). The relevant class is edu.umd.cloud9.example.pagerank.SequentialPageRank, which takes two command-line arguments: -input, a file containing the graph (one of the ones above), and -jump, the random jump factor (a typical setting is 0.15).

You can launch the program from a shell. Go to the base project directory and build the necessary jar with mvn clean package. You should then be able to run the following command:

$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.pagerank.SequentialPageRank \
   -Dexec.args="-input docs/exercises/sample-small.txt -jump 0.15"

And here's the sample output:

Number of components: 1
Number of edges: 195
Number of nodes: 93
Random jump factor: 0.15

PageRank of nodes, in descending order:
0.04821884416235794	8709207
0.03471312258467498	11287582
0.03438927990335218	9650960
0.033945207898198146	12610128
0.032378178767292154	8553535

To help in your implementation, I've captured the complete output of edu.umd.cloud9.example.pagerank.SequentialPageRank for each of the above graphs:


If you're stuck, you might want to take a look at the source code of the JUNG implementation.

In the networks above, there are a relatively large number of dangling nodes, i.e., nodes with no outlinks. This happens frequently in the Web context as well; for example, pages the crawler hasn't downloaded yet would appear as "dangling". For these nodes, you'll need to figure out how to "spread" their PageRank scores to the other nodes.
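One common approach (not the only one, and not necessarily what the reference implementation does) is to sum up the PageRank mass that "falls off" the dangling nodes in one pass, then redistribute it evenly across all nodes in a second pass. A minimal sketch of just the redistribution step, outside of MapReduce (all names are mine):

```java
public class DanglingMass {
  // Spreads the total mass collected from dangling nodes evenly
  // across all n nodes of the graph.
  public static double[] redistribute(double[] rank, double danglingMass) {
    int n = rank.length;
    double[] out = new double[n];
    for (int i = 0; i < n; i++) {
      out[i] = rank[i] + danglingMass / n;
    }
    return out;
  }

  public static void main(String[] args) {
    // Suppose 0.3 units of mass sat on dangling nodes this iteration:
    double[] updated = redistribute(new double[] {0.3, 0.4, 0.0}, 0.3);
    System.out.println(updated[2]); // each node gains 0.3 / 3 = 0.1
  }
}
```

In a MapReduce setting you would typically accumulate the dangling mass with a counter (or a side file) during one pass and fold it back in during the next, since a reducer cannot see the whole rank vector at once.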

Here's a good order in which to tackle the various issues:

  1. Build an implementation that does not handle the random jump factor and does not handle dangling nodes.
  2. Add in support for dangling nodes.
  3. Add in support for the random jump factor.
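Once all three pieces are in place, the per-node update for a graph with n nodes combines the random jump factor alpha with the link-propagated mass m and the redistributed dangling mass d. A sketch of one formulation (variable names are mine; verify against the JUNG output above):

```java
public class PageRankUpdate {
  // p' = alpha * (1/n) + (1 - alpha) * (m + d/n)
  // where m is the mass a node received via its inlinks and d is the
  // total dangling mass from the previous iteration.
  public static double update(double alpha, int n, double m, double d) {
    return alpha * (1.0 / n) + (1.0 - alpha) * (m + d / n);
  }

  public static void main(String[] args) {
    // Sanity check: with no dangling mass and uniform inlink mass 1/n,
    // every node's score stays at 1/n regardless of alpha.
    System.out.println(update(0.15, 4, 0.25, 0.0));
  }
}
```

The sanity check is a useful debugging aid for your MapReduce version too: on a graph where every node has the same in- and out-degree, the ranks should stay uniform across iterations.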


If you're curious: the nodes in these graphs represent MEDLINE records (abstracts in the life sciences domain). The links are content-similarity links, i.e., pairs of abstracts that are similar in the words they contain. For example, consider pmid (unique identifier in the MEDLINE collection) 8709207: the "Related Links" panel on the right-hand side of its record page shows exactly these similar abstracts. The data provided above represent instances of graphs defined by such links.


When you're ready, the solutions to this exercise are located here.