A Hadoop toolkit for working with big data
Warning: It is strongly recommended that you first complete the word count tutorial before trying this exercise.
In this exercise, you are going to implement PageRank in MapReduce. Here are three separate graphs to work with:
Of course, these are just toy datasets, since 1458 nodes wouldn't qualify as "large" by any stretch of the imagination...
The files are tab-delimited adjacency list representations of the graphs. The first token on each line represents the unique id of the source node, and the rest of the tokens represent the target nodes (i.e., outlinks from the source node). If a node does not have any outlinks, its corresponding line will contain only one token (the source node id).
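To make the format concrete, here is a minimal sketch of parsing one line of this adjacency-list representation. The class and method names are my own for illustration, not part of the toolkit:

```java
import java.util.Arrays;
import java.util.List;

public class AdjacencyListParser {
    // Parses one tab-delimited line: the first token is the source node id,
    // and the remaining tokens (possibly none) are the target node ids.
    public static List<String> parse(String line) {
        return Arrays.asList(line.trim().split("\\t"));
    }

    public static void main(String[] args) {
        List<String> tokens = parse("1\t4\t7");
        String source = tokens.get(0);
        List<String> outlinks = tokens.subList(1, tokens.size());
        System.out.println(source + " -> " + outlinks);

        // A dangling node's line contains only the source id,
        // so the token list has exactly one element.
        System.out.println(parse("9").size() == 1);
    }
}
```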
To make your lives easier, I've created a small demo program that computes PageRank using the JUNG package (2.0.1). The relevant class is edu.umd.cloud9.example.pagerank.SequentialPageRank, which takes two command-line arguments: the first is a file containing the graph (one of the ones above), and the second is the random jump factor (a typical setting is 0.15).
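As a refresher on what the random jump factor does, here is a toy sketch of one PageRank iteration (not the JUNG or Cloud9 code; all names here are my own). With jump factor alpha, each node receives alpha/N mass from random jumps plus (1 - alpha) times the mass flowing in along links:

```java
import java.util.Arrays;

public class PageRankSketch {
    // One PageRank iteration over an adjacency list:
    //   p'(v) = alpha/N + (1 - alpha) * sum over links u->v of p(u)/out(u)
    // Dangling-node mass is NOT handled here; that issue is discussed below.
    static double[] iterate(int[][] outlinks, double[] p, double alpha) {
        int n = p.length;
        double[] next = new double[n];
        Arrays.fill(next, alpha / n);  // contribution of the random jump
        for (int u = 0; u < n; u++) {
            for (int v : outlinks[u]) {
                next[v] += (1 - alpha) * p[u] / outlinks[u].length;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Toy 3-node graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
        int[][] graph = { {1, 2}, {2}, {0} };
        double[] p = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        for (int i = 0; i < 50; i++) {
            p = iterate(graph, p, 0.15);
        }
        System.out.println(Arrays.toString(p));
    }
}
```

Since this graph has no dangling nodes, the total mass stays at exactly 1.0 across iterations.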
You can launch the program from a shell. Go to the base project directory and type mvn clean package to build the necessary jar. You should then be able to run the following command (Maven's exec plugin constructs the correct classpath for you):
$ mvn exec:java -Dexec.mainClass=edu.umd.cloud9.example.pagerank.SequentialPageRank \
    -Dexec.args="-input docs/exercises/sample-small.txt -jump 0.15"
And here's the sample output:
Number of components: 1
Number of edges: 195
Number of nodes: 93
Random jump factor: 0.15
PageRank of nodes, in descending order:
0.04821884416235794  8709207
0.03471312258467498  11287582
0.03438927990335218  9650960
0.033945207898198146 12610128
0.032378178767292154 8553535
...
To help in your implementation, I've captured the complete output of edu.umd.cloud9.example.pagerank.SequentialPageRank for each of the above graphs:
If you're stuck, you might want to take a look at the source code of the JUNG implementation.
In the networks above, there are a relatively large number of dangling nodes, i.e., nodes with no outlinks. This frequently happens in the Web context also; for example, pages the crawler hasn't downloaded yet would appear as "dangling". For these nodes, you'll need to figure out how to "spread" their PageRank scores to the other nodes.
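One common way to handle this (an assumption on my part, not a prescription from the exercise) is to sum up the mass that dangling nodes would have passed along, then redistribute it evenly over all nodes in a second pass, combined with the random jump. In MapReduce this naturally becomes a second job, since the total dangling mass is only known after the full distribution pass. A sketch, with hypothetical names:

```java
import java.util.Arrays;

public class DanglingMass {
    // Second-pass correction: given each node's partial mass from the
    // distribution pass and the total mass m "lost" at dangling nodes,
    //   p'(v) = alpha/N + (1 - alpha) * (m/N + partial(v))
    // which spreads the dangling mass evenly and conserves total mass.
    static double[] redistribute(double[] partial, double danglingMass, double alpha) {
        int n = partial.length;
        double[] result = new double[n];
        for (int v = 0; v < n; v++) {
            result[v] = alpha / n + (1 - alpha) * (danglingMass / n + partial[v]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Suppose a pass left partial mass [0.3, 0.4, 0.0] and a dangling
        // node held 0.3 of the mass. After redistribution, total mass is 1.0.
        double[] p = redistribute(new double[] { 0.3, 0.4, 0.0 }, 0.3, 0.15);
        System.out.println(Arrays.toString(p));
        System.out.println(Arrays.stream(p).sum());
    }
}
```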
Here's a good order in which to tackle the various issues:
If you're curious: the nodes in these graphs represent MEDLINE records (abstracts in the life sciences domain). The links represent content-similarity links, i.e., pairs of abstracts that are similar in the words they contain. For example, consider pmid (unique identifier in the MEDLINE collection) 8709207: its PubMed page shows a "Related Links" panel on the right-hand side listing similar abstracts. The data provided above represent instances of graphs defined by such links.
When you're ready, the solutions to this exercise are located here.