A Hadoop toolkit for working with big data
This page describes the code used to run experiments in the following paper:
Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs Workshop (MLG-2010), July 2010, Washington, D.C.
There's code in Cloud9 that illustrates three different design patterns for graph algorithms in MapReduce using PageRank: in-mapper combining for faster local aggregation, range partitioning to create more opportunities for local aggregation, and Schimmy to avoid shuffling the graph structure from iteration to iteration.
We use a very simple text format for specifying the graph structure. A graph with n nodes is represented in a text file with n arbitrarily-ordered lines. Each line begins with a nodeid (a numeric identifier for the node), followed by its adjacency list, which specifies the neighbors reachable via outgoing edges. Fields are tab-separated. Note that if a node has no outgoing edges, you still need a line in the file to represent it. Here's a simple example (tabs replaced with spaces for presentation reasons):
1    3    4
2    1
3
4    2    3
This represents a graph with four nodes: nodeid 1 points to nodes 3 and 4; nodeid 2 points to node 1; nodeid 3 is a dangling node (no outgoing edges); and nodeid 4 points to nodes 2 and 3.
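For reference, here's a minimal sketch (not part of Cloud9; the method name is illustrative) of how one might parse a line of this format:

// Hypothetical sketch: parse one line of the adjacency-list format.
// A line is a tab-separated list of ints: the nodeid, then its neighbors.
static int[] parseAdjacencyLine(String line) {
  String[] fields = line.split("\t");
  int[] record = new int[fields.length];
  for (int i = 0; i < fields.length; i++) {
    record[i] = Integer.parseInt(fields[i]);
  }
  // record[0] is the nodeid; record[1..] are its neighbors.
  // A dangling node (e.g., nodeid 3 above) yields a record of length 1.
  return record;
}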
The adjacency list representation of the webgraph extracted from the first English segment of ClueWeb09 is available here. You'll want to clone the repo and uncompress the text file.
Here are the relevant driver classes for running PageRank:
edu.umd.cloud9.example.pagerank.BuildPageRankRecords
edu.umd.cloud9.example.pagerank.PartitionGraph
edu.umd.cloud9.example.pagerank.RunPageRankBasic
edu.umd.cloud9.example.pagerank.RunPageRankSchimmy
Below, we present step-by-step instructions for replicating the various experiments presented in the paper. The first step is to create PageRank records from the text-based representation using edu.umd.cloud9.example.pagerank.BuildPageRankRecords:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.BuildPageRankRecords \
   -input clueweb09en01-webgraph-adjacency.txt -output clueweb09en01-PageRankRecords -numNodes 50220423
Each PageRank record represents a graph vertex: the data structure holds the vertex's current PageRank value and its adjacency list.
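As a rough sketch of what such a record looks like (the actual Cloud9 class may differ in field layout and naming), it can be implemented as a Hadoop Writable:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical sketch of a PageRank record; Cloud9's actual class may differ.
public class PageRankRecordSketch implements Writable {
  private float pagerank;                // current PageRank value
  private int[] adjacency = new int[0];  // neighbors via outgoing edges

  public float getPageRank() { return pagerank; }
  public void setPageRank(float p) { pagerank = p; }
  public int[] getAdjacency() { return adjacency; }
  public void setAdjacency(int[] a) { adjacency = a; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeFloat(pagerank);
    out.writeInt(adjacency.length);
    for (int neighbor : adjacency) {
      out.writeInt(neighbor);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    pagerank = in.readFloat();
    adjacency = new int[in.readInt()];
    for (int i = 0; i < adjacency.length; i++) {
      adjacency[i] = in.readInt();
    }
  }
}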
Before we run PageRank, we have to partition the graph. We'll start with hash partitioning, using edu.umd.cloud9.example.pagerank.PartitionGraph:
$ hadoop fs -mkdir clueweb09en01-PageRank.hash.basic
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.hash.basic/iter0000 \
   -numPartitions 200 -numNodes 50220423
We can now run the basic version of PageRank, using edu.umd.cloud9.example.pagerank.RunPageRankBasic:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankBasic \
   -base clueweb09en01-PageRank.hash.basic -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner
The -start and -end options specify the start and end iterations. The -useCombiner option enables combiners, and the -useInMapperCombiner option enables the in-mapper combiners (as in the command above).
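For context, the in-mapper combining pattern replaces a separate combiner with an in-memory map inside the mapper: partial PageRank mass is aggregated locally and emitted only at the end of the task. Here's a simplified sketch of the pattern (names are illustrative, it reuses the record sketch above, and it is not Cloud9's actual mapper):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch of in-mapper combining: partial PageRank mass is
// accumulated in an in-memory map and emitted once in cleanup(), instead
// of emitting one (neighbor, mass) pair per outgoing edge. The basic
// version would also emit the graph structure itself; that part is
// omitted here (Schimmy avoids shuffling it altogether).
public class InMapperCombiningSketch
    extends Mapper<IntWritable, PageRankRecordSketch, IntWritable, FloatWritable> {
  private final Map<Integer, Float> buffer = new HashMap<>();

  @Override
  public void map(IntWritable nodeid, PageRankRecordSketch node, Context context) {
    int[] neighbors = node.getAdjacency();
    if (neighbors.length == 0) {
      return;  // dangling mass is handled separately in the real job
    }
    float mass = node.getPageRank() / neighbors.length;
    for (int neighbor : neighbors) {
      buffer.merge(neighbor, mass, Float::sum);  // local aggregation
    }
  }

  @Override
  public void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<Integer, Float> e : buffer.entrySet()) {
      context.write(new IntWritable(e.getKey()), new FloatWritable(e.getValue()));
    }
  }
}

In practice the buffer must be flushed periodically to bound memory use on a graph this large; the sketch omits that for brevity.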
Once the run completes, you can extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.hash.basic/iter0010 -output clueweb09en01-PageRank.hash.basic-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.hash.basic-top10/part-r-00000
To run the Schimmy implementation, use edu.umd.cloud9.example.pagerank.RunPageRankSchimmy:
$ hadoop fs -mkdir clueweb09en01-PageRank.hash.schimmy
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.hash.schimmy/iter0000 \
   -numPartitions 200 -numNodes 50220423
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankSchimmy \
   -base clueweb09en01-PageRank.hash.schimmy -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner
The options are exactly the same as for the basic version.
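What makes Schimmy different happens on the reduce side: because the graph is partitioned and sorted consistently with the shuffled PageRank mass, each reducer can stream its own graph partition directly from HDFS and merge it with the incoming mass, like a merge join, so the graph structure never goes through the shuffle. A simplified sketch of the idea (illustrative only, not Cloud9's code; the path scheme is hypothetical, damping and dangling-mass handling are omitted, and it reuses the record sketch from earlier):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Reducer;

// Conceptual sketch of the Schimmy reducer-side merge (not Cloud9's code).
public class SchimmyReducerSketch
    extends Reducer<IntWritable, FloatWritable, IntWritable, PageRankRecordSketch> {
  private SequenceFile.Reader graph;
  private final IntWritable structureNodeid = new IntWritable();
  private final PageRankRecordSketch structure = new PageRankRecordSketch();

  @Override
  public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // Open the graph partition matching this reducer's partition number;
    // the path scheme here is hypothetical.
    int partition = context.getTaskAttemptID().getTaskID().getId();
    Path path = new Path(String.format("graph/part-r-%05d", partition));
    graph = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
  }

  @Override
  public void reduce(IntWritable nodeid, Iterable<FloatWritable> masses, Context context)
      throws IOException, InterruptedException {
    float sum = 0.0f;
    for (FloatWritable mass : masses) {
      sum += mass.get();
    }
    // Advance through the graph partition until we reach this nodeid,
    // passing through nodes that received no mass this iteration.
    while (graph.next(structureNodeid, structure)) {
      if (structureNodeid.get() == nodeid.get()) {
        structure.setPageRank(sum);  // fold in the new mass (damping omitted)
        context.write(structureNodeid, structure);
        break;
      }
      context.write(structureNodeid, structure);
    }
  }

  @Override
  public void cleanup(Context context) throws IOException, InterruptedException {
    // Flush any remaining structure-only nodes, then close the reader.
    while (graph.next(structureNodeid, structure)) {
      context.write(structureNodeid, structure);
    }
    graph.close();
  }
}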
Similarly, to extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.hash.schimmy/iter0010 -output clueweb09en01-PageRank.hash.schimmy-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.hash.schimmy-top10/part-r-00000
Let's now switch to range partitioning, also using edu.umd.cloud9.example.pagerank.PartitionGraph, but with the -range option:
$ hadoop fs -mkdir clueweb09en01-PageRank.range.basic
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.range.basic/iter0000 \
   -numPartitions 200 -numNodes 50220423 -range
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankBasic \
   -base clueweb09en01-PageRank.range.basic -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner -range
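Range partitioning assigns contiguous blocks of nodeids to the same partition, which keeps densely interlinked pages together and creates more opportunities for local aggregation. As a rough illustration of the idea (this is not Cloud9's actual partitioner, which would read its parameters from the job configuration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch of range partitioning: nodeids [0, ceil(n/p)) go to
// partition 0, the next block to partition 1, and so on.
public class RangePartitionerSketch extends Partitioner<IntWritable, Writable> {
  private static final int NUM_NODES = 50220423;  // hardcoded for the sketch

  @Override
  public int getPartition(IntWritable nodeid, Writable value, int numPartitions) {
    int nodesPerPartition = (NUM_NODES + numPartitions - 1) / numPartitions;
    return Math.min(nodeid.get() / nodesPerPartition, numPartitions - 1);
  }
}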
And to extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.range.basic/iter0010 -output clueweb09en01-PageRank.range.basic-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.range.basic-top10/part-r-00000
To run the Schimmy implementation, also use the -range option:
$ hadoop fs -mkdir clueweb09en01-PageRank.range.schimmy
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.range.schimmy/iter0000 \
   -numPartitions 200 -numNodes 50220423 -range
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankSchimmy \
   -base clueweb09en01-PageRank.range.schimmy -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner -range
Similarly, to extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.range.schimmy/iter0010 -output clueweb09en01-PageRank.range.schimmy-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.range.schimmy-top10/part-r-00000
That's it!