Random access to ClueWeb09 WARC records

This page discusses the problem of random access to ClueWeb09 WARC records, i.e., how to fetch individual web pages from the collection quickly. See my guide on working with the ClueWeb09 collection for a general introduction. The collection is distributed as a number of gzipped WARC files, each containing approximately 40k pages (about a GB uncompressed). The well-known problem with gzip is that the format provides no efficient way of seeking to an arbitrary point in the middle of a compressed stream. Therefore, if the web page being retrieved resides near the end of a gzipped WARC file, you basically have to read through the entire file to get to it.

For information retrieval experiments, there are several workarounds. Often, the documents of interest are known in advance, in which case a reasonable solution is to perform a sequential scan through the entire collection and pull out the documents of interest. Obviously, this isn't going to work for interactive retrieval.

Another solution is to simply uncompress the collection, which works well if you have lots of disk space—in fact, that's how we've previously worked with the collection. However, this seems like an awful waste of space, especially since the WARC files achieve about a five to one compression ratio.

Compression Schemes

The solution is fairly obvious: let's repack the gzipped WARC files in a block-compressed format. The tradeoff is between space and latency: with smaller block sizes, we achieve lower latencies, at the cost of more space (lower compression ratio). In the limit, we arrive at record-level compression, where each web page is separately compressed—the lowest possible latency (essentially, just a seek), but space-inefficient.

As it turns out, SequenceFiles in Hadoop already support both block- and record-level compression. We simply need to run some experiments to determine the tradeoffs empirically. The following experiments focus on the first WARC file in the first English segment of ClueWeb09 (i.e., ClueWeb09_English_1/en0000/00.warc.gz). The SequenceFile contains IntWritable keys and ClueWarcRecord values. Here is the disk usage under different compression schemes:

compression scheme                              bytes           difference
original gzipped WARC file                      168,967,171
uncompressed SequenceFile                       1,047,281,773   +520%
record-compressed SequenceFile                  244,575,860     +44.7%
block-compressed SequenceFile (block=1000000)   171,185,069     +1.3%
block-compressed SequenceFile (block=500000)    172,885,152     +2.3%
block-compressed SequenceFile (block=100000)    185,094,067     +9.5%

We see that web pages compress very well; here, the uncompressed SequenceFile is roughly six times the size of the original gzipped WARC file. As expected, record-level compression isn't very space-efficient. It took a while to figure out, but the compression block size is controlled by the obscure Hadoop parameter "io.seqfile.compress.blocksize" (measured in bytes), with a default of 1000000. The default value works well—barely any loss in compression efficiency (and this includes the SequenceFile space overhead).
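To make the setup concrete, here is a minimal sketch of writing a block-compressed SequenceFile with IntWritable keys and ClueWarcRecord values using the 0.20-era Hadoop API. This is not the actual RepackClueWarcRecords code (which does the repacking inside a MapReduce job); the class name and output path are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

import edu.umd.cloud9.collection.clue.ClueWarcRecord;

public class BlockCompressedWriterSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Compression block size in bytes; 1000000 is the default.
    conf.setInt("io.seqfile.compress.blocksize", 1000000);

    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path("repacked.seq"),                  // output path (illustrative)
        IntWritable.class, ClueWarcRecord.class,
        CompressionType.BLOCK);                    // block-level compression

    IntWritable docno = new IntWritable();
    ClueWarcRecord record = new ClueWarcRecord();
    // ... for each page in the WARC file: populate docno and record, then:
    writer.append(docno, record);

    writer.close();
  }
}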

What about latency? These experiments were done on a local machine, so latency measurements here aren't particularly meaningful, since in the end we'll be fetching from SequenceFiles stored in HDFS (actual end-to-end latency results will be presented later). For now, computing the number of records per block at least gives a sense of how much sequential reading to expect under different conditions. This particular WARC file contains 35,582 web pages. Results are shown below:

compression scheme                              # blocks   avg. pages/block
block-compressed SequenceFile (block=1000000)   1024       35
block-compressed SequenceFile (block=500000)    2004       18
block-compressed SequenceFile (block=100000)    8673       4

These experiments were performed with ScanBlockCompressedSequenceFile in edu.umd.cloud9.collection.clue, which scans a block-compressed SequenceFile and outputs the block boundaries.
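For reference, here is a minimal sketch of how such a scan might look; it is not the actual ScanBlockCompressedSequenceFile code. The idea it relies on: for block-compressed SequenceFiles, the file position reported by getPosition() only advances when a new compressed block is loaded, so watching it change between successive reads reveals the block boundaries.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

import edu.umd.cloud9.collection.clue.ClueWarcRecord;

public class ScanBlockBoundariesSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    IntWritable key = new IntWritable();
    ClueWarcRecord value = new ClueWarcRecord();

    int blocks = 0;
    long prevPos = reader.getPosition();  // start of the first block
    while (reader.next(key, value)) {
      long curPos = reader.getPosition();
      if (curPos != prevPos) {
        // The underlying position advanced, so a new block was just loaded:
        // prevPos is the byte offset where that block begins, and key holds
        // the first docno stored in it.
        System.out.println("block " + blocks + " at offset " + prevPos
            + ", first docno " + key.get());
        blocks++;
        prevPos = curPos;
      }
    }
    reader.close();
    System.out.println("total blocks: " + blocks);
  }
}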

Note that the Hadoop SequenceFile block-compression scheme is parameterized by block size (which is the correct design decision), not by number of records, so the number of pages per block depends on the sizes of the individual pages. These results show that, with the default block size, accessing a random page requires sequentially reading through about 17 other pages on average. There is an additional tradeoff to consider: for random access, it is necessary to hold all block pointers (offsets) in memory. The smaller the block size, the more blocks there are, and hence the larger the memory footprint of the structures needed to support random access. All things considered, the default block size appears to be a good choice.

Repacking the WARC Files

Cloud9 comes with a program for repacking the original ClueWeb09 gzipped WARC files into block-compressed SequenceFiles. Sample invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.clue.RepackClueWarcRecords \
/shared/ClueWeb09/collection.raw /shared/ClueWeb09/collection.compressed.block/en.01 1 \
/shared/ClueWeb09/docno-mapping.dat block

The first argument is the path of the collection, the second is the output directory, the third is the segment number, the fourth is the docno mapping data file (which is required since the keys in the SequenceFiles contain docnos), and the final argument is the string "block" (for block-level compression), "record" (for record-level compression), or "none" (for no compression). Since we have the luxury of a large cluster (a few hundred machines), repacking the first English segment of ClueWeb09 takes about twenty minutes (with built-in Java compression; with native libraries this should be even faster).

A sample result: the first English segment of ClueWeb09 weighs in at 247,363,677,391 bytes in its original distribution. After repacking, the size expands ever so slightly (about 1.1%) to 250,062,048,064 bytes. For a tiny cost in space, we get random access...

Supporting Random Access

Building a forward index that supports random access to ClueWeb09 web pages is as simple as noting where all the block boundaries are. Cloud9 includes an indexer that does exactly that. Sample invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.clue.BuildClueWarcForwardIndex \
/shared/ClueWeb09/collection.compressed.block/en.01 /tmp/findex/ \
/shared/ClueWeb09/collection.compressed.block/findex.en.01.dat

The first argument is the location of the repacked block-compressed SequenceFiles, the second is a temporary path for MapReduce output, and the third is the location of the index file to be written.

The index file contains all the block locations in a binary-encoded format. Each block location is a triple of docno, block byte offset, and file number. To support random access, all block locations are loaded into memory in an array. Given the docno of a page, fetching it involves performing binary search over the block locations to find the correct block, opening up the proper file, seeking to the block byte offset, and reading through records until the desired docno is encountered. This is handled by the class ClueWarcForwardIndex in edu.umd.cloud9.collection.clue, which implements exactly the algorithm sketched above and provides an abstraction for fetching ClueWeb09 web pages.
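As a concrete illustration, here is a minimal sketch of that lookup. This is not the actual ClueWarcForwardIndex code; the array fields, base path, and part-file naming are illustrative, and the sketch assumes the block offsets were recorded with getPosition() as above.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

import edu.umd.cloud9.collection.clue.ClueWarcRecord;

public class SimpleForwardIndexSketch {
  // Block locations loaded from the index file, sorted by docno.
  private int[] docnos;    // docno of the first record in each block
  private long[] offsets;  // byte offset of each block within its file
  private int[] filenos;   // which part file each block lives in

  private String basePath; // directory of the repacked SequenceFiles (illustrative)
  private Configuration conf = new Configuration();

  public ClueWarcRecord fetch(int docno) throws IOException {
    // Binary search for the last block whose first docno is <= the target.
    int idx = Arrays.binarySearch(docnos, docno);
    if (idx < 0) {
      idx = -idx - 2;  // (insertion point) - 1
    }
    if (idx < 0) {
      return null;     // docno precedes everything in the index
    }

    // Open the proper file and seek directly to the block boundary.
    Path part = new Path(String.format("%s/part-%05d", basePath, filenos[idx]));
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), part, conf);
    try {
      reader.seek(offsets[idx]);

      // Read sequentially within the block until we hit the requested docno.
      IntWritable key = new IntWritable();
      ClueWarcRecord value = new ClueWarcRecord();
      while (reader.next(key, value)) {
        if (key.get() == docno) {
          return value;
        }
      }
    } finally {
      reader.close();
    }
    return null;  // docno not found
  }
}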

Finally, some empirical results: on our cluster running Hadoop 0.20.1, with block-compressed SequenceFiles stored in HDFS, we get random access latencies in the 100-150ms range. At the time of the experiment, the cluster had 99 nodes, 198 cores, and two 400 GB disks per node. In a bandwidth-saturated scenario (e.g., while the cluster is running a distcp), latency rises to the 150-200ms range, which is still acceptable. Note that these experiments used built-in Java compression; once again, with native libraries this should be substantially faster. These latencies appear acceptable for interactive retrieval, especially considering that end-to-end latency is dominated by other factors, such as fetching page images remotely.

Random Access Webapp

Finally, as a "cute" hack, we've developed a webapp for accessing ClueWeb09 pages within Hadoop itself. The lightweight HTTP server Jetty is already included in the Hadoop distribution (it's what runs the jobtracker, namenode, and other webapps). What we've done is folded a Jetty server into a mapper, so you can fire up the webapp in the same way you start a Hadoop job. In this case, the Hadoop job has only one mapper, and the mapper starts up a Jetty server. The webapp support fetching pages by both docno and docid. Here's a sample invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.DocumentForwardIndexHttpServer \
/shared/ClueWeb09/collection.compressed.block/findex.en.01.dat \
/shared/ClueWeb09/docno-mapping.dat

The first argument is the location of the forward index file, and the second argument is the location of the docno mapping file. This webapp has been tested on Hadoop 0.20.1 and there are known issues with earlier Hadoop versions (since they used a different version of Jetty with an incompatible API).
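Here is the promised sketch of the "Jetty inside a mapper" idea. This is not the actual DocumentForwardIndexHttpServer code: it assumes the Jetty 6 API (org.mortbay.*) bundled with Hadoop 0.20, and the servlet, port, and fetch logic are placeholders.

import java.io.IOException;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.mortbay.jetty.Server;
import org.mortbay.jetty.servlet.ServletHolder;

public class HttpServerMapperSketch
    extends Mapper<LongWritable, Text, Text, Text> {

  // Placeholder servlet: a real version would use ClueWarcForwardIndex to
  // look up the requested docno and write the page contents back.
  public static class FetchServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException {
      resp.getWriter().println("requested docno: " + req.getParameter("docno"));
    }
  }

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    try {
      Server server = new Server(8888);  // port chosen arbitrarily for this sketch
      org.mortbay.jetty.servlet.Context root =
          new org.mortbay.jetty.servlet.Context(server, "/",
              org.mortbay.jetty.servlet.Context.SESSIONS);
      root.addServlet(new ServletHolder(new FetchServlet()), "/fetch");
      server.start();

      // Block indefinitely so the single map task keeps the server alive;
      // in this sketch, the job is simply killed when the webapp is no
      // longer needed.
      server.join();
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}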

When the server starts up, it'll report back the host information of the node it's running on (along with the port). You can then direct your browser at the relevant URL. The webapp looks something like this:

[Screenshot of the interface for accessing documents in ClueWeb09]

And that's it! Have fun! Please give us feedback if you run into any issues, find any bugs, etc.