A Hadoop toolkit for working with big data
For several reasons, Wikipedia is among the most popular datasets to process with Hadoop: it's big, it contains a lot of text, and it has a rich link structure. And perhaps most important of all, Wikipedia can be freely downloaded by anyone. Where do you get it? See the Wikimedia Downloads site (https://dumps.wikimedia.org/) for information about raw Wikipedia dumps in XML; the English Wikipedia dumps are available at https://dumps.wikimedia.org/enwiki/.
In this guide, we'll be working with the XML dump of English Wikipedia dated May 3, 2013 (enwiki-20130503-pages-articles.xml.bz2). Note that Wikipedia is distributed as a single, large XML file compressed with bzip2. We'll want to repack the dataset as block-compressed SequenceFiles (see below), so you should uncompress the file before putting it into HDFS.
As a sanity check, the first thing you might want to do is count how many pages there are in this particular version of Wikipedia:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.CountWikipediaPages \
    -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml -wiki_language en
Conveniently enough, Cloud9 provides a WikipediaPageInputFormat that automatically parses the XML dump file, so that your mapper is presented with WikipediaPage objects to process.
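For example, here is a minimal sketch (not part of Cloud9) of a mapper for a map-only job that emits the title of every article; the LongWritable key type and the WikipediaPage accessors used (isArticle(), getTitle()) are assumptions based on the description above, so check the Cloud9 javadoc before relying on them:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import edu.umd.cloud9.collection.wikipedia.WikipediaPage;

// Hypothetical mapper: wire it into a job with WikipediaPageInputFormat as the
// input format, and each call to map() then receives one parsed Wikipedia page.
public class ArticleTitleMapper extends Mapper<LongWritable, WikipediaPage, Text, NullWritable> {
  private final Text title = new Text();

  @Override
  public void map(LongWritable key, WikipediaPage page, Context context)
      throws IOException, InterruptedException {
    // Keep only proper articles; skip redirects, disambiguation pages, etc.
    // (isArticle() is assumed here -- compare the counter types below.)
    if (page.isArticle()) {
      title.set(page.getTitle());
      context.write(title, NullWritable.get());
    }
  }
}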
After running the above program, you'll see the following statistics stored in counters:
Type                             | Count
---------------------------------+-----------
Redirect pages                   |  5,998,969
Disambiguation pages             |    126,046
Empty pages                      |        566
Stub articles                    |  1,740,514
Total articles (including stubs) |  4,293,113
Other                            |  3,031,844
Total                            | 13,450,538
So far, so good!
The remainder of this guide will provide a more principled approach to working with Wikipedia. However, for something "quick-and-dirty", we can dump all articles into a flat text file with the following command:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.DumpWikipediaToPlainText \
    -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml -wiki_language en -output enwiki-20130503.txt
The output consists of text files with one article per line (article title and contents, separated by a tab). Note that DumpWikipediaToPlainText discards all non-articles.
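If you copy the output to the local filesystem (e.g., with hadoop fs -getmerge), each line can be split on the first tab to recover the title and the article text. A minimal sketch, assuming a local file path as the only argument:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadPlainTextDump {
  public static void main(String[] args) throws IOException {
    // args[0]: local copy of the flat-text dump (one "title<TAB>text" record per line)
    try (BufferedReader reader =
        Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        int tab = line.indexOf('\t');
        if (tab < 0) continue; // skip malformed lines
        String title = line.substring(0, tab);
        String text = line.substring(tab + 1);
        System.out.println(title + " (" + text.length() + " chars)");
      }
    }
  }
}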
Many information retrieval and other text processing tasks require that all documents in the collection be sequentially numbered, from 1 ... n. Typically, you'll want to start with document 1 rather than 0, because many standard compression schemes used in information retrieval (e.g., gamma codes) cannot represent 0.
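For example, Elias gamma codes map 1 -> 1, 2 -> 010, 3 -> 011, 4 -> 00100, and so on; the scheme only covers the positive integers, so there is simply no codeword for 0.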
The next step is to create a mapping from Wikipedia internal document ids (docids) to these sequentially-numbered ids (docnos). Although docids can, in general, be arbitrary alphanumeric strings, here we'll use the Wikipedia internal ids (which happen to be integers, but which we'll treat as strings) as the docids. This is a bit confusing, so let's illustrate. The first full article in our XML dump file is:
...
<page>
  <title>Anarchism</title>
  <ns>0</ns>
  <id>12</id>
  ...
So this page gets docid "12" (the Wikipedia internal id) and docno 1 (the sequentially-numbered id). We can build the mapping between docids and docnos with the following command:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.WikipediaDocnoMappingBuilder \
    -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
    -output_file enwiki-20130503-docno.dat -wiki_language en -keep_all
The mapping file enwiki-20130503-docno.dat is used by the class WikipediaDocnoMapping, which provides a nice API for mapping back and forth between docnos and docids.
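Here's a minimal sketch of what that looks like in code; the method names (loadMapping, getDocno, getDocid) follow Cloud9's DocnoMapping interface as best I recall, so treat them as assumptions and check the javadoc:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import edu.umd.cloud9.collection.wikipedia.WikipediaDocnoMapping;

public class DocnoMappingDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Load the mapping data built above (method name assumed).
    WikipediaDocnoMapping mapping = new WikipediaDocnoMapping();
    mapping.loadMapping(new Path("enwiki-20130503-docno.dat"), fs);

    // Look up the docno assigned to Wikipedia internal id "12" (the Anarchism
    // page above), then map a docno back to its docid.
    System.out.println("docid 12 -> docno " + mapping.getDocno("12"));
    System.out.println("docno 1  -> docid " + mapping.getDocid(1));
  }
}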
Aside: Why not use the article titles as docids? We could, but since we need to maintain the docid-to-docno mapping in memory (for fast lookup), using article titles as docids would be expensive (lots of strings to store in memory).
Note that by default, WikipediaDocnoMappingBuilder discards non-articles (e.g., disambiguation pages, redirects, etc.). To retain all pages, use the -keep_all option, as we did in the command above.
So far, we've been directly processing the raw uncompressed Wikipedia data dump in XML. To gain the benefits of compression (within the Hadoop framework), we'll now repack the XML file into block-compressed SequenceFiles, where the keys are the docnos and the values are WikipediaPage objects. This will make subsequent processing much easier.
Here's the command-line invocation:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
    -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
    -output enwiki-20130503.block -wiki_language en \
    -mapping_file enwiki-20130503-docno.dat -compression_type block
In addition to block compression, RepackWikipedia also supports record-level compression and no compression at all:
# Record-level compression
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
    -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
    -output enwiki-20130503.record -wiki_language en \
    -mapping_file enwiki-20130503-docno.dat -compression_type record

# No compression
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.RepackWikipedia \
    -input /shared/collections/wikipedia/raw/enwiki-20130503-pages-articles.xml \
    -output enwiki-20130503.seq -wiki_language en \
    -mapping_file enwiki-20130503-docno.dat -compression_type none
Here's how the various methods stack up:
Format                               | Size (MB)
-------------------------------------+----------
Original XML dump (bzip2-compressed) |     9,814
Original XML dump (uncompressed)     |    43,777
SequenceFile (block-compressed)      |    12,463
SequenceFile (record-compressed)     |    17,523
SequenceFile (no compression)        |    44,120
Block-compressed SequenceFiles are fairly space efficient. Record-compressed SequenceFiles are less space-efficient, but retain the ability to seek directly to any individual WikipediaPage object (whereas with block compression you can only seek to block boundaries). In most usage scenarios, block-compressed SequenceFiles are preferred.
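Once repacked, the collection can also be scanned outside of MapReduce with a plain SequenceFile.Reader. The sketch below assumes the keys are docno Writables and the values are WikipediaPage objects, as described above, and that args[0] points at one of the part-* files produced by RepackWikipedia:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

import edu.umd.cloud9.collection.wikipedia.WikipediaPage;

public class ScanRepackedWikipedia {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g., a part-* file under enwiki-20130503.block

    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    try {
      // Instantiate key/value objects from the classes recorded in the file header,
      // so we don't have to hard-code the concrete Writable types.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

      int n = 0;
      while (reader.next(key, value) && n++ < 10) { // peek at the first ten records
        WikipediaPage page = (WikipediaPage) value;
        System.out.println(key + "\t" + page.getTitle());
      }
    } finally {
      reader.close();
    }
  }
}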
Cloud9 provides utilities for random access to Wikipedia articles. First, we need to build a forward index:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.WikipediaForwardIndexBuilder \
    -input /user/jimmylin/enwiki-20130503.block -index_file enwiki-20130503.findex.dat
Note: It mostly doesn't matter elsewhere, but here -input must take an absolute path.
With this forward index, we can now programmatically access Wikipedia articles given a docno or a docid. Here's a snippet of code that does this:
Configuration conf = getConf();
WikipediaForwardIndex f = new WikipediaForwardIndex(conf);
f.loadIndex(new Path("enwiki-20130503.findex.dat"), new Path("enwiki-20130503-docno.dat"),
    FileSystem.get(conf));

WikipediaPage page;

// fetch by docno
page = f.getDocument(1000);
System.out.println(page.getDocid() + ": " + page.getTitle());

// fetch by docid
page = f.getDocument("1875");
System.out.println(page.getDocid() + ": " + page.getTitle());
This is illustrated by a simple command-line program that allows you to find out the title of a page given either the docno or the docid:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.wikipedia.LookupWikipediaArticle \
    enwiki-20130503.findex.dat enwiki-20130503-docno.dat
Finally, there's also a web interface for browsing, which can be started with the following invocation:
$ hadoop jar target/cloud9-X.Y.Z-fatjar.jar edu.umd.cloud9.collection.DocumentForwardIndexHttpServer \
    -docnoMapping enwiki-20130503-docno.dat -index enwiki-20130503.findex.dat
When the server starts up, it'll report back the host information of the node it's running on (along with the port). You can then direct your browser at the relevant URL and gain access to the webapp.