The solution is fairly obvious: let's repack the gzipped WARC files
in a block-compressed format. The tradeoff is between space and
latency: with smaller block sizes, we achieve lower latencies, at the
cost of more space (lower compression ratio). In the limit, we arrive
at record-level compression, where each web page is separately
compressed—the lowest possible latency (essentially, just a
seek), but space-inefficient.
As it turns out, SequenceFiles in Hadoop already support block- and
record-level compression. We simply need to run some experiments to
empirically determine what the tradeoffs are. The following
experiments specifically focus on the first WARC file in the first
English segment of ClueWeb09 (i.e., ClueWeb09_English_1/en0000/00.warc.gz). The SequenceFile contains IntWritable keys and ClueWarcRecord values.
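For concreteness, here's a minimal sketch of how such a SequenceFile might be written with block compression. This is not the actual repacking tool; the parseWarc helper is hypothetical and stands in for Cloud9's WARC-reading code:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import edu.umd.cloud9.collection.clue.ClueWarcRecord;

    public class RepackSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // BLOCK compression buffers many records and compresses them together;
        // RECORD would compress each ClueWarcRecord individually.
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            new Path(args[1]), IntWritable.class, ClueWarcRecord.class,
            SequenceFile.CompressionType.BLOCK);

        IntWritable key = new IntWritable();
        int docno = 0;
        for (ClueWarcRecord record : parseWarc(args[0])) {
          key.set(docno++);
          writer.append(key, record);
        }
        writer.close();
      }

      // Hypothetical stand-in; the real repacking code reads the gzipped WARC
      // file directly using Cloud9's WARC record readers.
      private static Iterable<ClueWarcRecord> parseWarc(String path) {
        return java.util.Collections.emptyList();
      }
    }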
Here is the disk usage based on different compression schemes:
    compression scheme                            | size (bytes) | vs. original gzipped WARC
    original gzipped WARC file                    |    168967171 | (baseline)
    uncompressed SequenceFile                     |   1047281773 | +520%
    record-compressed SequenceFile                |    244575860 | +44.7%
    block-compressed SequenceFile (block=1000000) |    171185069 | +1.3%
    block-compressed SequenceFile (block=500000)  |    172885152 | +2.3%
    block-compressed SequenceFile (block=100000)  |    185094067 | +9.5%
We see that web pages compress very well: the uncompressed SequenceFile is roughly six times the size of the gzipped original. As expected, record-level compression isn't very space-efficient. It took a while to figure out, but the compression block size is controlled by the obscure Hadoop parameter "io.seqfile.compress.blocksize" (measured in bytes of uncompressed data), with a default of 1000000. The default value seems to work well: only a 1.3% loss in compression efficiency relative to the gzipped original (and this includes SequenceFile space overhead).
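For reference, here's roughly how the parameter can be overridden when building the SequenceFiles; the 500000 value below just matches one of the settings tested above:

    // Fragment: the block size can be set programmatically on the job
    // configuration, or passed as -Dio.seqfile.compress.blocksize=... on the
    // command line if the tool uses GenericOptionsParser.
    Configuration conf = new Configuration();
    conf.setInt("io.seqfile.compress.blocksize", 500000); // bytes, default 1000000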
What about latency? These experiments were done on a local
machine, so latency measurements aren't going to be particularly
meaningful, since in the end we're going to be fetching from
SequenceFiles stored in HDFS (actual end-to-end latency results will
be presented later). For now, computing the average block size (in number of records) will at least give us a sense of how much sequential reading to expect under different settings. This particular WARC file contains 35582 web pages. Results are shown below:
    compression scheme                            | blocks | avg. pages per block
    block-compressed SequenceFile (block=1000000) |   1024 | 35
    block-compressed SequenceFile (block=500000)  |   2004 | 18
    block-compressed SequenceFile (block=100000)  |   8673 | 4
These experiments were performed with
ScanBlockCompressedSequenceFile in edu.umd.cloud9.collection.clue,
which scans a block-compressed SequenceFile and outputs the block
boundaries.
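The core of that scan is small; here's a rough sketch of the idea (not the actual Cloud9 source), using the reader's sync markers to detect block boundaries:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import edu.umd.cloud9.collection.clue.ClueWarcRecord;

    public class ScanBlocksSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);

        IntWritable key = new IntWritable();
        ClueWarcRecord value = new ClueWarcRecord();

        long pos = reader.getPosition();
        int blocks = 0, records = 0;
        while (reader.next(key, value)) {
          records++;
          // syncSeen() is true when the last call to next() crossed a sync
          // marker, which precedes each compressed block.
          if (reader.syncSeen()) {
            blocks++;
            System.out.println("block boundary near byte offset " + pos);
          }
          pos = reader.getPosition();
        }
        reader.close();

        System.out.println(records + " records in " + blocks + " blocks");
      }
    }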
Note that the Hadoop SequenceFile block-compression scheme is parameterized by block size (which is the correct design decision), not by number of records, so the number of pages per block depends on the sizes of the individual pages. These results show that, with the default block size, accessing a random page requires sequentially reading through about 17 other pages on average (half of a 35-page block). There is an additional tradeoff to consider: for random access, it is necessary to hold all block pointers (byte offsets) in memory. The smaller the block size, the more blocks there are, and hence the larger the memory footprint of the structures needed to support random access. All things considered, the default block size appears to be a good choice.
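To make the random-access path concrete: given an in-memory table mapping each block's starting docno to its byte offset (built, say, by a one-time scan like the one above), fetching a page amounts to a seek followed by a short sequential scan within the block. A sketch, with hypothetical arguments blockOffset and targetDocno:

    // Sketch: fetch one page, given the byte offset of the block that contains
    // it. blockOffset would come from the in-memory table of block pointers;
    // both arguments here are illustrative, not part of Cloud9.
    public static ClueWarcRecord fetch(FileSystem fs, Path path, Configuration conf,
        long blockOffset, int targetDocno) throws IOException {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      try {
        reader.seek(blockOffset);          // jump to the recorded block boundary
        IntWritable key = new IntWritable();
        ClueWarcRecord value = new ClueWarcRecord();
        while (reader.next(key, value)) {  // sequential scan within the block
          if (key.get() == targetDocno) {
            return value;                  // found the requested page
          }
        }
        return null;                       // docno not in this block
      } finally {
        reader.close();
      }
    }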