A Hadoop toolkit for working with big data
This page presents solutions for the PageRank exercise. Code is described in a separate page on design patterns for graph algorithms in MapReduce. Here, we show how the code can be applied to solve this exercise. Command-line invocations are shown for the "large" graph, but are similar for the other graphs.
Assuming that you've already loaded the adjacency list information
docs/exercises/sample-large.txt
to HDFS, the first step
is to take the plain-text graph structure and pack it into the
appropriate Hadoop data structures:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.BuildPageRankRecords \ -input sample-large.txt -output sample-large-PageRankRecords -numNodes 1458
Next, let's partition the graph:
$ hadoop fs -mkdir sample-large-PageRank $ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \ -input sample-large-PageRankRecords -output sample-large-PageRank/iter0000 -numPartitions 5 -numNodes 1458
Now we're ready to run PageRank. Let's run 10 iterations:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankBasic \ -base sample-large-PageRank -numNodes 1458 -start 0 -end 10 -useCombiner
We're basically done! Let's use this utility program to fetch the top 10 nodes by PageRank:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \ -input sample-large-PageRank/iter0010 -output sample-large-PageRank-top10 -top 10
The results:
$ hadoop fs -cat sample-large-PageRank-top10/part-r-00000 9369084 -4.3875346 8669492 -4.4548674 12486146 -4.7748857 9265639 -4.855565 10912914 -4.8680253 8614504 -4.8873844 12787320 -4.897782 11775232 -4.917762 8069300 -5.0934334 9520006 -5.094012
The PageRank values are expressed as log probabilities, and do indeed match results from the JUNG reference implementation.