A Hadoop toolkit for working with big data
This page describes the code used to run experiments in the following paper:
Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs Workshop (MLG-2010), July 2010, Washington, D.C.
There's code in Cloud9 that illustrates three different design patterns for graph algorithms in MapReduce using PageRank: in-mapper combining for faster local aggregation, range partitioning to create more opportunities for local aggregation, and Schimmy to avoid shuffling the graph structure from iteration to iteration.
We use a very simple text format for specifying the graph structure. A graph with n nodes is represented in a text file with n arbitrarily-ordered lines. Each line begins with a nodeid (a numeric identifier for the node), followed by its adjacency list, which specifies the neighbors reachable via outgoing edges. Fields are tab-separated. Note that if a node has no outgoing edges, you still need a line in the file to represent it. Here's a simple example (tabs replaced with spaces for presentation reasons):
1    3    4
2    1
3
4    2    3
This represents a graph with four nodes: nodeid 1 points to nodes 3 and 4; nodeid 2 points to node 1; nodeid 3 is a dangling node (no outgoing edges); and nodeid 4 points to nodes 2 and 3.
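For reference, here's a minimal sketch (not part of Cloud9; the method name is illustrative) of how one might parse a line of this format:

// Hypothetical sketch: parse one line of the adjacency-list format.
// A line is a tab-separated list of ints: the nodeid, then its neighbors.
static int[] parseAdjacencyLine(String line) {
  String[] fields = line.split("\t");
  int[] record = new int[fields.length];
  for (int i = 0; i < fields.length; i++) {
    record[i] = Integer.parseInt(fields[i]);
  }
  // record[0] is the nodeid; record[1..] are its neighbors.
  // A dangling node (e.g., nodeid 3 above) yields a record of length 1.
  return record;
}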
The adjacency list representation of the webgraph extracted from the first English segment of ClueWeb09 is available here. You'll want to clone the repo and uncompress the text file.
Here are the relevant driver classes for running PageRank:
edu.umd.cloud9.example.pagerank.BuildPageRankRecords
edu.umd.cloud9.example.pagerank.PartitionGraph
edu.umd.cloud9.example.pagerank.RunPageRankBasic
edu.umd.cloud9.example.pagerank.RunPageRankSchimmy
Below, we present step-by-step instructions for replicating the various experiments presented in the paper. The first step is to create PageRank records from the text-based representation using edu.umd.cloud9.example.pagerank.BuildPageRankRecords:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.BuildPageRankRecords \
   -input clueweb09en01-webgraph-adjacency.txt -output clueweb09en01-PageRankRecords -numNodes 50220423
Each PageRank record represents a graph vertex: the data structure holds the vertex's current PageRank value and its adjacency list.
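As a rough sketch of what such a record looks like (the actual Cloud9 class may differ in field layout and naming), it can be implemented as a Hadoop Writable:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical sketch of a PageRank record; Cloud9's actual class may differ.
public class PageRankRecordSketch implements Writable {
  private float pagerank;                // current PageRank value
  private int[] adjacency = new int[0];  // neighbors via outgoing edges

  public float getPageRank() { return pagerank; }
  public void setPageRank(float p) { pagerank = p; }
  public int[] getAdjacency() { return adjacency; }
  public void setAdjacency(int[] a) { adjacency = a; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeFloat(pagerank);
    out.writeInt(adjacency.length);
    for (int neighbor : adjacency) {
      out.writeInt(neighbor);
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    pagerank = in.readFloat();
    adjacency = new int[in.readInt()];
    for (int i = 0; i < adjacency.length; i++) {
      adjacency[i] = in.readInt();
    }
  }
}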
Before we run PageRank, we have to partition the graph. We'll start with hash partitioning, using edu.umd.cloud9.example.pagerank.PartitionGraph:
$ hadoop fs -mkdir clueweb09en01-PageRank.hash.basic
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.hash.basic/iter0000 \
   -numPartitions 200 -numNodes 50220423
We can now run the basic version of PageRank, using edu.umd.cloud9.example.pagerank.RunPageRankBasic:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankBasic \
   -base clueweb09en01-PageRank.hash.basic -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner
The -start and -end options specify the start and end iterations. The -useCombiner option enables combiners, and the -useInMapperCombiner option enables the in-mapper combiners (as in the command above).
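For context, the in-mapper combining pattern replaces a separate combiner with an in-memory map inside the mapper: partial PageRank mass is aggregated locally and emitted only at the end of the task. Here's a simplified sketch of the pattern (names are illustrative, it reuses the record sketch above, and it is not Cloud9's actual mapper):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch of in-mapper combining: partial PageRank mass is
// accumulated in an in-memory map and emitted once in cleanup(), instead
// of emitting one (neighbor, mass) pair per outgoing edge. The basic
// version would also emit the graph structure itself; that part is
// omitted here (Schimmy avoids shuffling it altogether).
public class InMapperCombiningSketch
    extends Mapper<IntWritable, PageRankRecordSketch, IntWritable, FloatWritable> {
  private final Map<Integer, Float> buffer = new HashMap<>();

  @Override
  public void map(IntWritable nodeid, PageRankRecordSketch node, Context context) {
    int[] neighbors = node.getAdjacency();
    if (neighbors.length == 0) {
      return;  // dangling mass is handled separately in the real job
    }
    float mass = node.getPageRank() / neighbors.length;
    for (int neighbor : neighbors) {
      buffer.merge(neighbor, mass, Float::sum);  // local aggregation
    }
  }

  @Override
  public void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<Integer, Float> e : buffer.entrySet()) {
      context.write(new IntWritable(e.getKey()), new FloatWritable(e.getValue()));
    }
  }
}

In practice the buffer must be flushed periodically to bound memory use on a graph this large; the sketch omits that for brevity.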
Once the run completes, you can extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.hash.basic/iter0010 -output clueweb09en01-PageRank.hash.basic-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.hash.basic-top10/part-r-00000
To run the Schimmy implementation, use edu.umd.cloud9.example.pagerank.RunPageRankSchimmy:
$ hadoop fs -mkdir clueweb09en01-PageRank.hash.schimmy
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.hash.schimmy/iter0000 \
   -numPartitions 200 -numNodes 50220423
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankSchimmy \
   -base clueweb09en01-PageRank.hash.schimmy -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner
The options are exactly the same as for the basic version.
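What makes Schimmy different happens on the reduce side: because the graph is partitioned and sorted consistently with the shuffled PageRank mass, each reducer can stream its own graph partition directly from HDFS and merge it with the incoming mass, like a merge join, so the graph structure never goes through the shuffle. A simplified sketch of the idea (illustrative only, not Cloud9's code; the path scheme is hypothetical, damping and dangling-mass handling are omitted, and it reuses the record sketch from earlier):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Reducer;

// Conceptual sketch of the Schimmy reducer-side merge (not Cloud9's code).
public class SchimmyReducerSketch
    extends Reducer<IntWritable, FloatWritable, IntWritable, PageRankRecordSketch> {
  private SequenceFile.Reader graph;
  private final IntWritable structureNodeid = new IntWritable();
  private final PageRankRecordSketch structure = new PageRankRecordSketch();

  @Override
  public void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    // Open the graph partition matching this reducer's partition number;
    // the path scheme here is hypothetical.
    int partition = context.getTaskAttemptID().getTaskID().getId();
    Path path = new Path(String.format("graph/part-r-%05d", partition));
    graph = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
  }

  @Override
  public void reduce(IntWritable nodeid, Iterable<FloatWritable> masses, Context context)
      throws IOException, InterruptedException {
    float sum = 0.0f;
    for (FloatWritable mass : masses) {
      sum += mass.get();
    }
    // Advance through the graph partition until we reach this nodeid,
    // passing through nodes that received no mass this iteration.
    while (graph.next(structureNodeid, structure)) {
      if (structureNodeid.get() == nodeid.get()) {
        structure.setPageRank(sum);  // fold in the new mass (damping omitted)
        context.write(structureNodeid, structure);
        break;
      }
      context.write(structureNodeid, structure);
    }
  }

  @Override
  public void cleanup(Context context) throws IOException, InterruptedException {
    // Flush any remaining structure-only nodes, then close the reader.
    while (graph.next(structureNodeid, structure)) {
      context.write(structureNodeid, structure);
    }
    graph.close();
  }
}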
Similarly, to extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.hash.schimmy/iter0010 -output clueweb09en01-PageRank.hash.schimmy-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.hash.schimmy-top10/part-r-00000
Let's now switch to range partitioning, also using edu.umd.cloud9.example.pagerank.PartitionGraph, but with the -range option:
$ hadoop fs -mkdir clueweb09en01-PageRank.range.basic
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.range.basic/iter0000 \
   -numPartitions 200 -numNodes 50220423 -range
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankBasic \
   -base clueweb09en01-PageRank.range.basic -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner -range
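Range partitioning assigns contiguous blocks of nodeids to the same partition, which keeps densely interlinked pages together and creates more opportunities for local aggregation. As a rough illustration of the idea (this is not Cloud9's actual partitioner, which would read its parameters from the job configuration):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch of range partitioning: nodeids [0, ceil(n/p)) go to
// partition 0, the next block to partition 1, and so on.
public class RangePartitionerSketch extends Partitioner<IntWritable, Writable> {
  private static final int NUM_NODES = 50220423;  // hardcoded for the sketch

  @Override
  public int getPartition(IntWritable nodeid, Writable value, int numPartitions) {
    int nodesPerPartition = (NUM_NODES + numPartitions - 1) / numPartitions;
    return Math.min(nodeid.get() / nodesPerPartition, numPartitions - 1);
  }
}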
And to extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.range.basic/iter0010 -output clueweb09en01-PageRank.range.basic-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.range.basic-top10/part-r-00000
To run the Schimmy implementation, also use the -range option:
$ hadoop fs -mkdir clueweb09en01-PageRank.range.schimmy
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.PartitionGraph \
   -input clueweb09en01-PageRankRecords -output clueweb09en01-PageRank.range.schimmy/iter0000 \
   -numPartitions 200 -numNodes 50220423 -range
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.RunPageRankSchimmy \
   -base clueweb09en01-PageRank.range.schimmy -numNodes 50220423 -start 0 -end 10 -useInMapperCombiner -range
Similarly, to extract the top 10 nodes with the highest PageRank values:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.example.pagerank.FindMaxPageRankNodes \
   -input clueweb09en01-PageRank.range.schimmy/iter0010 -output clueweb09en01-PageRank.range.schimmy-top10 -top 10
$ hadoop fs -cat clueweb09en01-PageRank.range.schimmy-top10/part-r-00000
That's it!