Cloud9

A Hadoop toolkit for working with big data

For several reasons, Wikipedia is among the most popular datasets to process with Hadoop: it's big, it contains a lot of text, and it has a rich link structure. And perhaps most important of all, Wikipedia can be freely downloaded by anyone. Where do you get it? The Wikimedia Foundation publishes raw Wikipedia dumps in XML; the English Wikipedia dumps are available directly from dumps.wikimedia.org.

In this guide, we'll be working with the XML dump of English Wikipedia dated May 3, 2013 (enwiki-20130503-pages-articles.xml.bz2). Note that Wikipedia is distributed as a single, large XML file compressed with bzip2. We'll want to repack the dataset as block-compressed SequenceFiles (see below), so you should uncompress the file before putting it into HDFS.
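For example, assuming the dump has been downloaded to the local filesystem and you want it to land at the HDFS path used in the commands below, the decompress-and-upload step might look like this:

$ bzip2 -d enwiki-20130503-pages-articles.xml.bz2
$ hadoop fs -put enwiki-20130503-pages-articles.xml /shared/collections/wikipedia/raw/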

As a sanity check, the first thing you might want to do is count how many pages there are in this particular version of Wikipedia:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.CountWikipediaPages \
   -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml -wiki_language en

Conveniently enough, Cloud9 provides a WikipediaPageInputFormat that automatically parses the XML dump file, so that your mapper will be presented with WikipediaPage objects to process.
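To give a feel for the API, here's a minimal sketch of such a mapper, written against the newer mapreduce API. It assumes the input format delivers (LongWritable, WikipediaPage) pairs and that WikipediaPage exposes isRedirect() and isArticle(); the class name and counter group are made up for illustration:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

import edu.umd.cloud9.collection.wikipedia.WikipediaPage;

// A counting-only mapper sketch: tallies page types in counters and emits nothing.
public class PageTypeMapper extends Mapper<LongWritable, WikipediaPage, NullWritable, NullWritable> {
  @Override
  public void map(LongWritable key, WikipediaPage page, Context context) {
    if (page.isRedirect()) {
      context.getCounter("PageTypes", "REDIRECT").increment(1);
    } else if (page.isArticle()) {
      context.getCounter("PageTypes", "ARTICLE").increment(1);
    } else {
      context.getCounter("PageTypes", "OTHER").increment(1);
    }
  }
}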

After running CountWikipediaPages, you'll see the following statistics stored in counters:

Type                                    Count
Redirect pages                      5,998,969
Disambiguation pages                  126,046
Empty pages                               566
Stub articles                       1,740,514
Total articles (including stubs)    4,293,113
Other                               3,031,844
Total                              13,450,538

So far, so good!

The remainder of this guide will provide a more principled approach to working with Wikipedia. However, for something "quick-and-dirty", we can dump all articles into a flat text file with the following command:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText \
   -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml -wiki_language en -output enwiki-20130503.txt

The output consists of text files with one article per line: the article title and contents, separated by a tab. Note that DumpWikipediaToPlainText discards all non-articles.
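Since the output is plain tab-separated text, it's easy to consume outside of Hadoop. As a quick illustration (assuming one of the output part files has been copied out of HDFS, e.g., with hadoop fs -get; the class name is made up), a few lines of Java suffice to split each line into title and contents:

import java.io.BufferedReader;
import java.io.FileReader;

public class ReadFlatDump {
  public static void main(String[] args) throws Exception {
    // args[0]: a local copy of one part file produced by DumpWikipediaToPlainText
    try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = reader.readLine()) != null) {
        int tab = line.indexOf('\t');
        String title = line.substring(0, tab);
        String contents = line.substring(tab + 1);
        System.out.println(title + ": " + contents.length() + " characters");
      }
    }
  }
}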

Sequentially-Numbered Docnos

Many information retrieval and other text processing tasks require that all documents in the collection be sequentially numbered, from 1 ... n. Typically, you'll want to start with document 1 rather than 0 because it is not possible to represent 0 with many standard compression schemes used in information retrieval (e.g., gamma codes).
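To see why, consider Elias gamma codes: the codeword for n is floor(log2 n) zeros followed by n written in binary, which is only defined for n >= 1. Here's a tiny sketch (an aside, not part of Cloud9) that makes this concrete:

// Elias gamma code: floor(log2 n) zeros, then n in binary (which always starts with 1).
// The scheme is only defined for n >= 1 -- there is simply no codeword for 0.
static String eliasGamma(int n) {
  if (n < 1) {
    throw new IllegalArgumentException("gamma codes are only defined for n >= 1");
  }
  String binary = Integer.toBinaryString(n);   // e.g., 5 -> "101"
  StringBuilder code = new StringBuilder();
  for (int i = 1; i < binary.length(); i++) {
    code.append('0');                          // floor(log2 n) leading zeros
  }
  return code.append(binary).toString();       // e.g., 1 -> "1", 2 -> "010", 5 -> "00101"
}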

The next step is to create a mapping from Wikipedia internal document ids (docids) to these sequentially-numbered ids (docnos). Although docids can, in general, be arbitrary alphanumeric strings, we'll be using the Wikipedia internal ids (which happen to be integers, but which we'll treat as strings) as the docids. This is a bit confusing, so let's illustrate. The first full article in our XML dump file is:

  ...
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
  ...

So this page gets docid "12" (the Wikipedia internal id) and docno 1 (the sequentially-numbered id). We can build the mapping between docids and docnos with the following command:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.WikipediaDocnoMappingBuilder \
   -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
   -output_file enwiki-20130503-docno.dat -wiki_language en -keep_all

The mapping file enwiki-20130503-docno.dat is used by the class WikipediaDocnoMapping, which provides a nice API for mapping back and forth between docnos and docids. Note that since we specified -keep_all above, this mapping covers all pages, not just articles (see the note below).

Aside: Why not use the article titles as docids? We could, but since we need to maintain the docid to docno mapping in memory (for fast lookup), using article titles as docids would be expensive (lots of strings to store in memory).

Note that by default, WikipediaDocnoMappingBuilder discards non-articles (e.g., disambiguation pages, redirects, etc.). To retain all pages, use the -keep_all option.
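Programmatically, the mapping might be used along the following lines. This is a sketch that assumes WikipediaDocnoMapping follows Cloud9's DocnoMapping conventions (loadMapping, getDocno, getDocid); consult the class itself for the exact signatures:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import edu.umd.cloud9.collection.wikipedia.WikipediaDocnoMapping;

Configuration conf = new Configuration();
WikipediaDocnoMapping mapping = new WikipediaDocnoMapping();
mapping.loadMapping(new Path("enwiki-20130503-docno.dat"), FileSystem.get(conf));

int docno = mapping.getDocno("12");       // docid "12" is the article "Anarchism"
String docid = mapping.getDocid(docno);   // maps back to "12"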

Repacking the Records

So far, we've been directly processing the raw uncompressed Wikipedia data dump in XML. To gain the benefits of compression (within the Hadoop framework), we'll now repack the XML file into block-compressed SequenceFiles, where the keys are the docnos, and the values are WikipediaPage objects. This will make subsequent processing much easier.

Here's the command-line invocation:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
   -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
   -output enwiki-20130503.block -wiki_language en \
   -mapping_file enwiki-20130503-docno.dat -compression_type block

In addition to block compression, RepackWikipedia also supports record-level compression and no compression:

# Record-level compression
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
   -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
   -output enwiki-20130503.record -wiki_language en \
   -mapping_file enwiki-20130503-docno.dat -compression_type record

# No compression
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
   -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
   -output enwiki-20130503.seq -wiki_language en \
   -mapping_file enwiki-20130503-docno.dat -compression_type none

Here's how the various methods stack up:

Format                                  Size (MB)
Original XML dump (bzip2-compressed)        9,814
Original XML dump (uncompressed)           43,777
SequenceFile (block-compressed)            12,463
SequenceFile (record-compressed)           17,523
SequenceFile (no compression)              44,120

Block-compressed SequenceFiles are fairly space efficient. Record-compressed SequenceFiles are less space-efficient, but retain the ability to directly seek to any WikipediaPage object (whereas with block compression you can only seek to block boundaries). In most usage scenarios, block-compressed SequenceFiles are preferred.
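As a sketch of what downstream processing looks like, the repacked data can be read with a stock SequenceFileInputFormat, with no custom XML parsing needed. The example below assumes the keys were written as IntWritable docnos and the values as WikipediaPage objects; the class names and output path are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import edu.umd.cloud9.collection.wikipedia.WikipediaPage;

public class ReadRepackedWikipedia {
  // Emits (docno, title) pairs, assuming IntWritable keys and WikipediaPage values.
  public static class TitleMapper extends Mapper<IntWritable, WikipediaPage, IntWritable, Text> {
    private final Text title = new Text();

    @Override
    public void map(IntWritable docno, WikipediaPage page, Context context)
        throws java.io.IOException, InterruptedException {
      title.set(page.getTitle());
      context.write(docno, title);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "read repacked wikipedia");
    job.setJarByClass(ReadRepackedWikipedia.class);
    job.setMapperClass(TitleMapper.class);
    job.setNumReduceTasks(0);  // map-only job

    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    SequenceFileInputFormat.addInputPath(job, new Path("enwiki-20130503.block"));
    TextOutputFormat.setOutputPath(job, new Path("enwiki-20130503-titles"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}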

Supporting Random Access

Cloud9 provides utilities for random access to Wikipedia articles. First, we need to build a forward index:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.WikipediaForwardIndexBuilder \
   -input /user/jimmylin/enwiki-20130503.block -index_file enwiki-20130503.findex.dat

Note: Elsewhere relative paths are generally fine, but here -input must be given as an absolute path.

With this forward index, we can now programmatically access Wikipedia articles given either a docno or a docid. Here's a snippet of code that does this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import edu.umd.cloud9.collection.wikipedia.WikipediaForwardIndex;
import edu.umd.cloud9.collection.wikipedia.WikipediaPage;

// getConf() assumes this snippet lives in a class that extends Configured (e.g., a Tool).
Configuration conf = getConf();
WikipediaForwardIndex f = new WikipediaForwardIndex(conf);
f.loadIndex(new Path("enwiki-20130503.findex.dat"), new Path("enwiki-20130503-docno.dat"), FileSystem.get(conf));

WikipediaPage page;

// fetch by docno (the sequentially-numbered id)
page = f.getDocument(1000);
System.out.println(page.getDocid() + ": " + page.getTitle());

// fetch by docid (the Wikipedia internal id)
page = f.getDocument("1875");
System.out.println(page.getDocid() + ": " + page.getTitle());

This is illustrated by a simple command-line program that allows you to find out the title of a page given either the docno or the docid:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.LookupWikipediaArticle \
   enwiki-20130503.findex.dat enwiki-20130503-docno.dat

Finally, there's also a web interface for browsing, which can be started with the following invocation:

$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.DocumentForwardIndexHttpServer \
   -docnoMapping enwiki-20130503-docno.dat -index enwiki-20130503.findex.dat

When the server starts up, it'll report the hostname and port of the node it's running on. You can then point your browser at that address to access the webapp.