Working with the ClueWeb09 Collection

The ClueWeb09 collection consists of one billion web pages (5 TB compressed, 25 TB uncompressed) in ten languages, crawled in January and February 2009. Its creation, supported by the U.S. National Science Foundation (NSF), was led by Jamie Callan of the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The entire collection is available for research purposes. This guide provides instructions on processing the collection using Hadoop with Cloud9, building on code developed by Mark Hoy at CMU. For now, this page deals exclusively with the English portion of the collection.

Topics that this page covers:

  • Counting the Records
  • Repacking the Records
  • Sequentially-Numbered Docnos
  • Malformed Records

See the "official" dataset information page for the definitive description of the collection. Some content on this page simply mirrors information on that page for convenience.

In total, there are 503,903,810 pages in the English portion of the ClueWeb09 collection (2.08 TB compressed, 13.4 TB uncompressed). The English data is distributed in ten parts (called segments), each corresponding to a directory. Here are the page counts for each segment:

ClueWeb09_English_1    50,220,423 pages
ClueWeb09_English_2    51,577,077 pages
ClueWeb09_English_3    50,547,493 pages
ClueWeb09_English_4    52,311,060 pages
ClueWeb09_English_5    50,756,858 pages
ClueWeb09_English_6    50,559,093 pages
ClueWeb09_English_7    52,472,358 pages
ClueWeb09_English_8    49,545,346 pages
ClueWeb09_English_9    50,738,874 pages
ClueWeb09_English_10   45,175,228 pages

Each segment contains a number of sub-directories, each of which contains a number of compressed WARC files. There is no official name for these sub-directories, so to be precise I'll call them sections. They are numbered from en0000 all the way through en0133. For example, en0000 to en0011 belong to segment 1. That is, the first segment contains the following sub-directories:

ClueWeb09_English_1/en0000
ClueWeb09_English_1/en0001
...
ClueWeb09_English_1/en0011

In addition, the first segment contains four special sections, enwp00, enwp01, enwp02, and enwp03. These sections hold a version of English Wikipedia. Also, sections en0083, en0084, and en0097 do not exist (or are empty, depending on the distribution).

Counting the Records

Once you've gotten your hands on the collection, the first thing you might want to do is run some sanity checks. The simplest sanity check is to read through all the records and count them. This functionality is provided by a demo program in Cloud9. Here's the command-line invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.clue.DemoCountClueWarcRecords \
original /shared/ClueWeb09/collection.raw 1 /shared/ClueWeb09/docno-mapping.dat

The first command-line argument indicates whether you're counting records in the original distribution ("original") or in repacked SequenceFiles ("repacked"); for the latter, see the section on repacking below. The second argument is the base path of your ClueWeb09 distribution. The third argument is the segment number (1 through 10). The final argument is the location of the docno mapping file (see details below). If you run the demo program on all ten segments, you should get the following results:

segment   # files   bytes off disk       # of records   # of pages    total size
1           1,492   246,838,508,311        50,221,915    50,220,423    1,527,155,667,036
2           1,416   224,505,694,289        51,578,493    51,577,077    1,435,415,062,235
3           1,375   217,428,760,570        50,548,868    50,547,493    1,392,234,129,944
4           1,363   213,615,715,952        52,312,423    52,311,060    1,379,063,022,766
5           1,322   205,092,621,204        50,758,180    50,756,858    1,333,142,147,191
6           1,302   203,616,661,324        50,560,395    50,559,093    1,314,228,067,242
7           1,358   213,335,482,896        52,473,716    52,472,358    1,366,774,469,429
8           1,295   199,607,688,405        49,546,641    49,545,346    1,308,432,844,339
9           1,306   204,295,706,812        50,740,180    50,738,874    1,331,922,112,879
10            988   150,042,155,900        45,176,216    45,175,228      983,120,555,934
all        13,217   2,078,378,995,663     503,917,027   503,903,810   13,371,488,078,995

Description of the columns:

  • segment: the segment in question.
  • # files: number of files (compressed WARC files) in the segment.
  • bytes off disk: size on disk of the compressed WARC files.
  • # of records: number of WARC records.
  • # of pages: number of actual HTML pages.
  • total size: uncompressed size of all records.

As a note, each compressed WARC file has a header record followed by the actual HTML pages, so the number of records should equal the number of files plus the number of pages (for example, in segment 1: 1,492 + 50,220,423 = 50,221,915).

Repacking the Records

In some ways, the original WARC files are awkward to work with. There is, for example, no simple way to quickly access an individual record that lies in the middle of a gzipped file. A good solution is to repack the collection into block-compressed SequenceFiles. This is described in a separate page on providing random access to the WARC records.
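
This page doesn't reproduce the repacking tool itself, but to give a flavor of the target format, here is a minimal sketch that writes ClueWarcRecord objects into a block-compressed SequenceFile using the standard Hadoop API. The output path and the choice of IntWritable docnos as keys are illustrative assumptions; consult the page linked above for the actual tool.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

import edu.umd.cloud9.collection.clue.ClueWarcRecord;

public class RepackSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative output path; the real tool takes this as an argument.
    Path out = new Path("/shared/ClueWeb09/collection.compressed.block/part-00000");

    // BLOCK compression groups many records per compressed block, which is
    // what makes random access (seek to a block, then scan) practical.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        IntWritable.class, ClueWarcRecord.class, CompressionType.BLOCK);

    // In the actual tool, records come from reading the original WARC files;
    // here we just show the write call with an empty record.
    writer.append(new IntWritable(1), new ClueWarcRecord());

    writer.close();
  }
}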

Sequentially-Numbered Docnos

Many information retrieval and other text processing tasks require that all documents in the collection be sequentially numbered, from 1 ... n. Typically, you'll want to start with document 1 rather than 0 because 0 cannot be represented in some standard compression schemes used in information retrieval (e.g., Golomb codes). For clarity, I call these sequentially-numbered document ids docnos, whereas I call the original ids docids. (This is a bit confusing, since in previous TREC collections the alphanumeric document ids are tagged as DOCNOs.)

The format of a docid (WARC-TREC-ID) in the collection is clueweb09-enXXXX-YY-ZZZZZ. Due to this regular format, it is easy to map algorithmically between docnos and docids. In Cloud9, the ClueWarcDocnoMapping class in the edu.umd.cloud9.collection.clue package provides an API for this.
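
If you do use Cloud9, a minimal usage sketch looks like the following (method names per Cloud9's DocnoMapping interface; the mapping-file path matches the invocation shown earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import edu.umd.cloud9.collection.clue.ClueWarcDocnoMapping;

public class DocnoMappingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Load the mappings data file.
    ClueWarcDocnoMapping mapping = new ClueWarcDocnoMapping();
    mapping.loadMapping(new Path("/shared/ClueWeb09/docno-mapping.dat"), fs);

    // Convert in both directions.
    int docno = mapping.getDocno("clueweb09-en0000-00-00000");
    String docid = mapping.getDocid(docno);
  }
}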

Even if you don't want to use the Cloud9 API, the mappings data file itself should be useful. Here are the first few lines:

en0000,0,35582,1
en0000,1,28413,35583
en0000,2,36053,63996
en0000,3,36260,100049
en0000,4,34786,136309
en0000,5,33015,171095
...

It's a CSV file: the first column is the enXXXX portion of the docid, the second column is the YY portion, and the third column is the number of pages with the enXXXX-YY prefix. Since ZZZZZ starts at zero, the ZZZZZ portion of the last docid with a given prefix is the third-column value minus one. The fourth column is the docno of the first page with that prefix, i.e., one plus the cumulative number of pages in all preceding files (note, for example, that 1 + 35,582 = 35,583 going from the first line to the second). With this information, mapping between docnos and docids is a simple matter of arithmetic, as the sketch below illustrates.
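
Here is a self-contained sketch of that arithmetic, independent of the Cloud9 API. The class and method names are hypothetical; Cloud9's ClueWarcDocnoMapping implements the same logic (with faster lookups than the linear scans used here).

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the docno/docid arithmetic.
public class DocnoArithmeticSketch {
  static List<String> prefixes = new ArrayList<String>();   // "en0000-00", "en0000-01", ...
  static List<Integer> offsets = new ArrayList<Integer>();  // fourth column: docno of first page

  public static void load(String path) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split(",");
      prefixes.add(String.format("%s-%02d", cols[0], Integer.parseInt(cols[1])));
      offsets.add(Integer.parseInt(cols[3]));
    }
    in.close();
  }

  // "clueweb09-en0000-00-00000" -> 1, per the first line of the CSV above.
  public static int getDocno(String docid) {
    String prefix = docid.substring(10, 19);           // the "enXXXX-YY" portion
    int zzzzz = Integer.parseInt(docid.substring(20)); // the "ZZZZZ" portion
    return offsets.get(prefixes.indexOf(prefix)) + zzzzz;
  }

  // Inverse mapping: find the last file whose starting docno is <= docno.
  public static String getDocid(int docno) {
    int i = 0;
    while (i + 1 < offsets.size() && offsets.get(i + 1) <= docno) i++;
    return String.format("clueweb09-%s-%05d", prefixes.get(i), docno - offsets.get(i));
  }
}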

Malformed Records

There are a number of malformed WARC records in the English portion of the collection (there may be malformed records in the other languages also, but I haven't analyzed them yet). The most prevalent problem is an extra newline in the WARC header. There are a few cases of other malformed headers also. See this list of docids: each docid refers to a WARC record that immediately precedes a malformed WARC record. For example, the first docid in the list is clueweb09-en0001-41-14941, which means that clueweb09-en0001-41-14942 is malformed.

All of the malformed WARC records referenced in the file above have an extra newline in the WARC header, except for the following docids, which are malformed in other ways (all cases of a garbled URL):

clueweb09-en0044-01-04501
clueweb09-en0059-46-06368
clueweb09-en0117-48-12547
clueweb09-en0112-59-06118
clueweb09-en0126-33-37391
clueweb09-en0126-88-10049

These errata are provided primarily for reference: the API in Cloud9 transparently handles these malformed WARC records (thanks to code originally written by Mark Hoy at CMU). A short sketch follows.
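
To see this in practice, here is a sketch of a minimal map-only job that reads a section of the original collection through Cloud9's ClueWarcInputFormat; the record reader patches up the malformed records before the mapper ever sees them. This assumes the old "mapred" API that Cloud9 uses and the getHeaderMetadataItem accessor on ClueWarcRecord; the input path is illustrative.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

import edu.umd.cloud9.collection.clue.ClueWarcInputFormat;
import edu.umd.cloud9.collection.clue.ClueWarcRecord;

public class CountPagesSketch {
  // Counts actual pages (records with a WARC-TREC-ID) using a counter.
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, ClueWarcRecord, Text, Text> {
    public void map(LongWritable key, ClueWarcRecord doc,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      if (doc.getHeaderMetadataItem("WARC-TREC-ID") != null) {
        reporter.incrCounter("ClueWeb09", "pages", 1);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(CountPagesSketch.class);
    conf.setJobName("CountPagesSketch");

    FileInputFormat.setInputPaths(conf,
        "/shared/ClueWeb09/collection.raw/ClueWeb09_English_1/en0000");
    conf.setInputFormat(ClueWarcInputFormat.class);
    conf.setOutputFormat(NullOutputFormat.class);
    conf.setMapperClass(MyMapper.class);
    conf.setNumReduceTasks(0);  // map-only job

    JobClient.runJob(conf);
  }
}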