Working with the ClueWeb09 Collection

The ClueWeb09 collection consists of one billion web pages (5 TB compressed, 25 TB uncompressed) in ten languages, crawled in January and February 2009. Its creation, supported by the U.S. National Science Foundation (NSF), was led by Jamie Callan of the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The entire collection is available for research purposes. This guide provides instructions on processing the collection using Hadoop with Cloud9, building on code developed by Mark Hoy at CMU. For now, this page deals exclusively with the English portion of the collection.

Topics that this page covers:

  • Counting the Records
  • Repacking the Records
  • Sequentially-Numbered Docnos
  • Malformed Records

See the "official" dataset information page for the definitive description of the collection. Some content on this page simply mirrors information on that page for convenience.

In total, there are 503,903,810 pages in the English portion of the ClueWeb09 collection (2.08 TB compressed, 13.4 TB uncompressed). The English data is distributed in ten parts (called segments), each corresponding to a directory. Here are the page counts for each segment:

ClueWeb09_English_1    50,220,423 pages
ClueWeb09_English_2    51,577,077 pages
ClueWeb09_English_3    50,547,493 pages
ClueWeb09_English_4    52,311,060 pages
ClueWeb09_English_5    50,756,858 pages
ClueWeb09_English_6    50,559,093 pages
ClueWeb09_English_7    52,472,358 pages
ClueWeb09_English_8    49,545,346 pages
ClueWeb09_English_9    50,738,874 pages
ClueWeb09_English_10   45,175,228 pages

Each segment contains a number of sub-directories, each of which contains a number of compressed WARC files. There is no official name for these sub-directories, so to be precise I'll call them sections. They are numbered from en0000 all the way through en0133. For example, en0000 to en0011 belong to segment 1. That is, the first segment contains the following sub-directories:

ClueWeb09_English_1/en0000
ClueWeb09_English_1/en0001
...
ClueWeb09_English_1/en0011

In addition, the first segment contains four special sections, enwp00, enwp01, enwp02, and enwp03. These sections hold a version of English Wikipedia. Also, sections en0083, en0084, and en0097 do not exist (or are empty, depending on the distribution).

Counting the Records

Once you've gotten your hands on the collection, the first thing you might want to do is run some sanity checks. The simplest sanity check is to read through all the records and count them. This functionality is provided by a demo program in Cloud9. Here's the command-line invocation:

hadoop jar cloud9.jar edu.umd.cloud9.collection.clue.DemoCountClueWarcRecords \
original /shared/ClueWeb09/collection.raw 1 /shared/ClueWeb09/docno-mapping.dat

The first command-line argument indicates whether you're counting records in the original distribution ("original") or in repacked SequenceFiles ("repacked"); for the latter, see the section on repacking below. The second argument is the base path of your ClueWeb09 distribution. The third argument is the segment number (1 through 10). The final argument is the location of the docno mapping file (see details below). If you run the demo program on all ten segments, you should get the following results:

segment   # files   bytes off disk       # of records   # of pages    total size
1           1,492   246,838,508,311        50,221,915    50,220,423    1,527,155,667,036
2           1,416   224,505,694,289        51,578,493    51,577,077    1,435,415,062,235
3           1,375   217,428,760,570        50,548,868    50,547,493    1,392,234,129,944
4           1,363   213,615,715,952        52,312,423    52,311,060    1,379,063,022,766
5           1,322   205,092,621,204        50,758,180    50,756,858    1,333,142,147,191
6           1,302   203,616,661,324        50,560,395    50,559,093    1,314,228,067,242
7           1,358   213,335,482,896        52,473,716    52,472,358    1,366,774,469,429
8           1,295   199,607,688,405        49,546,641    49,545,346    1,308,432,844,339
9           1,306   204,295,706,812        50,740,180    50,738,874    1,331,922,112,879
10            988   150,042,155,900        45,176,216    45,175,228      983,120,555,934
all        13,217   2,078,378,995,663     503,917,027   503,903,810   13,371,488,078,995

Description of the columns:

  • segment: the segment in question.
  • # files: number of files (compressed WARC files) in the segment.
  • bytes off disk: size on disk of the compressed WARC files.
  • # of records: number of WARC records.
  • # of pages: number of actual HTML pages.
  • total size: uncompressed size of all records.

As a note, each compressed WARC file has a header record followed by the actual HTML pages, so the number of records should equal the number of files plus the number of pages (for example, in segment 1: 1,492 + 50,220,423 = 50,221,915).

Repacking the Records

In some ways, the original WARC files are awkward to work with. There is, for example, no simple way to quickly access an individual record that lies in the middle of a gzipped file. A good solution is to repack the collection into block-compressed SequenceFiles. This is described in a separate page on providing random access to the WARC records.
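
This page doesn't reproduce the repacking tool itself, but to give a flavor of the target format, here is a minimal sketch that writes ClueWarcRecord objects into a block-compressed SequenceFile using the standard Hadoop API. The output path and the choice of IntWritable docnos as keys are illustrative assumptions; consult the page linked above for the actual tool.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

import edu.umd.cloud9.collection.clue.ClueWarcRecord;

public class RepackSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative output path; the real tool takes this as an argument.
    Path out = new Path("/shared/ClueWeb09/collection.compressed.block/part-00000");

    // BLOCK compression groups many records per compressed block, which is
    // what makes random access (seek to a block, then scan) practical.
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        IntWritable.class, ClueWarcRecord.class, CompressionType.BLOCK);

    // In the actual tool, records come from reading the original WARC files;
    // here we just show the write call with an empty record.
    writer.append(new IntWritable(1), new ClueWarcRecord());

    writer.close();
  }
}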

Sequentially-Numbered Docnos

Many information retrieval and other text processing tasks require that all documents in the collection be sequentially numbered, from 1 ... n. Typically, you'll want to start with document 1 rather than 0 because 0 cannot be represented in some standard compression schemes used in information retrieval (e.g., Golomb codes). For clarity, I call these sequentially-numbered document ids docnos, whereas I call the original ids docids. (This is a bit confusing, since in previous TREC collections the alphanumeric document ids are tagged as DOCNOs.)

The format of a docid (WARC-TREC-ID) in the collection is clueweb09-enXXXX-YY-ZZZZZ. Due to this regular format, it is easy to map algorithmically between docnos and docids. In Cloud9, the ClueWarcDocnoMapping class in the edu.umd.cloud9.collection.clue package provides an API for this.
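
If you do use Cloud9, a minimal usage sketch looks like the following (method names per Cloud9's DocnoMapping interface; the mapping-file path matches the invocation shown earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import edu.umd.cloud9.collection.clue.ClueWarcDocnoMapping;

public class DocnoMappingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Load the mappings data file.
    ClueWarcDocnoMapping mapping = new ClueWarcDocnoMapping();
    mapping.loadMapping(new Path("/shared/ClueWeb09/docno-mapping.dat"), fs);

    // Convert in both directions.
    int docno = mapping.getDocno("clueweb09-en0000-00-00000");
    String docid = mapping.getDocid(docno);
  }
}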

Even if you don't want to use the Cloud9 API, the mappings data file itself should be useful. Here are the first few lines:

en0000,0,35582,1
en0000,1,28413,35583
en0000,2,36053,63996
en0000,3,36260,100049
en0000,4,34786,136309
en0000,5,33015,171095
...

It's a CSV file: the first column is the enXXXX portion of the docid, the second column is the YY portion, and the third column is the number of pages with the enXXXX-YY prefix. Since ZZZZZ starts at zero, the ZZZZZ portion of the last docid with a given prefix is the third-column value minus one. The fourth column is the docno of the first page with that prefix, i.e., one plus the cumulative number of pages in all preceding files (note, for example, that 1 + 35,582 = 35,583 going from the first line to the second). With this information, mapping between docnos and docids is a simple matter of arithmetic, as the sketch below illustrates.
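
Here is a self-contained sketch of that arithmetic, independent of the Cloud9 API. The class and method names are hypothetical; Cloud9's ClueWarcDocnoMapping implements the same logic (with faster lookups than the linear scans used here).

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the docno/docid arithmetic.
public class DocnoArithmeticSketch {
  static List<String> prefixes = new ArrayList<String>();   // "en0000-00", "en0000-01", ...
  static List<Integer> offsets = new ArrayList<Integer>();  // fourth column: docno of first page

  public static void load(String path) throws Exception {
    BufferedReader in = new BufferedReader(new FileReader(path));
    String line;
    while ((line = in.readLine()) != null) {
      String[] cols = line.split(",");
      prefixes.add(String.format("%s-%02d", cols[0], Integer.parseInt(cols[1])));
      offsets.add(Integer.parseInt(cols[3]));
    }
    in.close();
  }

  // "clueweb09-en0000-00-00000" -> 1, per the first line of the CSV above.
  public static int getDocno(String docid) {
    String prefix = docid.substring(10, 19);           // the "enXXXX-YY" portion
    int zzzzz = Integer.parseInt(docid.substring(20)); // the "ZZZZZ" portion
    return offsets.get(prefixes.indexOf(prefix)) + zzzzz;
  }

  // Inverse mapping: find the last file whose starting docno is <= docno.
  public static String getDocid(int docno) {
    int i = 0;
    while (i + 1 < offsets.size() && offsets.get(i + 1) <= docno) i++;
    return String.format("clueweb09-%s-%05d", prefixes.get(i), docno - offsets.get(i));
  }
}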

Malformed Records

There are a number of malformed WARC records in the English portion of the collection (there may be malformed records in the other languages also, but I haven't analyzed them yet). The most prevalent problem is an extra newline in the WARC header. There are a few cases of other malformed headers also. See this list of docids: each docid refers to a WARC record that immediately precedes a malformed WARC record. For example, the first docid in the list is clueweb09-en0001-41-14941, which means that clueweb09-en0001-41-14942 is malformed.

All of the malformed WARC records referenced in the file above have an extra newline in the WARC header, except for the following docids, which are malformed in other ways (all cases of a garbled URL):

clueweb09-en0044-01-04501
clueweb09-en0059-46-06368
clueweb09-en0117-48-12547
clueweb09-en0112-59-06118
clueweb09-en0126-33-37391
clueweb09-en0126-88-10049

These errata are provided primarily for reference: the API in Cloud9 transparently handles these malformed WARC records (thanks to code originally written by Mark Hoy at CMU). A short sketch follows.
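
To see this in practice, here is a sketch of a minimal map-only job that reads a section of the original collection through Cloud9's ClueWarcInputFormat; the record reader patches up the malformed records before the mapper ever sees them. This assumes the old "mapred" API that Cloud9 uses and the getHeaderMetadataItem accessor on ClueWarcRecord; the input path is illustrative.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

import edu.umd.cloud9.collection.clue.ClueWarcInputFormat;
import edu.umd.cloud9.collection.clue.ClueWarcRecord;

public class CountPagesSketch {
  // Counts actual pages (records with a WARC-TREC-ID) using a counter.
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, ClueWarcRecord, Text, Text> {
    public void map(LongWritable key, ClueWarcRecord doc,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      if (doc.getHeaderMetadataItem("WARC-TREC-ID") != null) {
        reporter.incrCounter("ClueWeb09", "pages", 1);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(CountPagesSketch.class);
    conf.setJobName("CountPagesSketch");

    FileInputFormat.setInputPaths(conf,
        "/shared/ClueWeb09/collection.raw/ClueWeb09_English_1/en0000");
    conf.setInputFormat(ClueWarcInputFormat.class);
    conf.setOutputFormat(NullOutputFormat.class);
    conf.setMapperClass(MyMapper.class);
    conf.setNumReduceTasks(0);  // map-only job

    JobClient.runJob(conf);
  }
}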