The ClueWeb09 collection consists of one billion web pages (5 TB compressed, 25 TB uncompressed) in ten languages, crawled in January and February 2009. Its creation, supported by the U.S. National Science Foundation (NSF), was led by Jamie Callan of the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The entire collection is available for research purposes. This guide provides instructions on processing the ClueWeb09 collection using Hadoop with Cloud9, building on code developed by Mark Hoy at CMU. For now, this page deals exclusively with the English portion of the collection.
See the "official" dataset information page for the definitive description of the collection. Some content on this page simply mirrors information on that page for convenience.
In total, there are 503,903,810 pages in the English portion of the ClueWeb09 collection (2.08 TB compressed, 13.4 TB uncompressed). The English data is distributed in ten parts (called segments), each corresponding to a directory. Here are the page counts for each segment:
ClueWeb09_English_1    50,220,423 pages
ClueWeb09_English_2    51,577,077 pages
ClueWeb09_English_3    50,547,493 pages
ClueWeb09_English_4    52,311,060 pages
ClueWeb09_English_5    50,756,858 pages
ClueWeb09_English_6    50,559,093 pages
ClueWeb09_English_7    52,472,358 pages
ClueWeb09_English_8    49,545,346 pages
ClueWeb09_English_9    50,738,874 pages
ClueWeb09_English_10   45,175,228 pages
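As a quick sanity check on the figures above, the per-segment counts can be summed and compared against the stated total. This is only an illustrative sketch: the dictionary name `SEGMENT_PAGE_COUNTS` is made up for this example and is not part of Cloud9.

```python
# Page counts per segment, copied from the list above.
# SEGMENT_PAGE_COUNTS is a hypothetical name used only in this sketch.
SEGMENT_PAGE_COUNTS = {
    "ClueWeb09_English_1": 50_220_423,
    "ClueWeb09_English_2": 51_577_077,
    "ClueWeb09_English_3": 50_547_493,
    "ClueWeb09_English_4": 52_311_060,
    "ClueWeb09_English_5": 50_756_858,
    "ClueWeb09_English_6": 50_559_093,
    "ClueWeb09_English_7": 52_472_358,
    "ClueWeb09_English_8": 49_545_346,
    "ClueWeb09_English_9": 50_738_874,
    "ClueWeb09_English_10": 45_175_228,
}

# Sum the ten segments; this should match the total quoted for the English portion.
total = sum(SEGMENT_PAGE_COUNTS.values())
print(f"{total:,}")  # prints 503,903,810
```

A check like this is handy after a processing job: if the number of records emitted per segment differs from these counts, some WARC files were probably skipped or truncated.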
Each segment contains a number of sub-directories, each of which contains a number of compressed WARC files. There is no official name for these sub-directories, so to be precise I'll call them sections. They are numbered from en0000 all the way through en0133. For example, en0000 to en0011 belong to segment 1. That is, the first segment contains the following sub-directories:
ClueWeb09_English_1/en0000
ClueWeb09_English_1/en0001
...
ClueWeb09_English_1/en0011
In addition, the first segment contains four special sections: enwp00, enwp01, enwp02, and enwp03. These sections hold a version of English Wikipedia. Also, sections en0083, en0084, and en0097 do not exist (or are empty, depending on the distribution).
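The naming scheme described above is regular enough to generate programmatically, which can be useful when building a list of input paths for a job. The following is a minimal sketch, assuming only what this page states: sections run from en0000 through en0133, three of them are missing, and segment 1 holds en0000 to en0011 plus the four Wikipedia sections. The function names are hypothetical and are not part of Cloud9. The mapping of sections to the other nine segments is not given here, so the sketch does not attempt it.

```python
# Sections that do not exist (or are empty), as noted above.
MISSING_SECTIONS = {"en0083", "en0084", "en0097"}


def english_sections(first=0, last=133, missing=MISSING_SECTIONS):
    """Return the regular section names en0000..en0133, skipping missing ones."""
    names = [f"en{i:04d}" for i in range(first, last + 1)]
    return [name for name in names if name not in missing]


def segment1_paths():
    """Return the sub-directory paths of the first segment, Wikipedia included."""
    regular = [f"en{i:04d}" for i in range(0, 12)]    # en0000 .. en0011
    wikipedia = [f"enwp{i:02d}" for i in range(0, 4)]  # enwp00 .. enwp03
    return [f"ClueWeb09_English_1/{name}" for name in regular + wikipedia]


sections = english_sections()
print(len(sections))          # 134 names minus 3 missing = 131
print(segment1_paths()[0])    # ClueWeb09_English_1/en0000
print(segment1_paths()[-1])   # ClueWeb09_English_1/enwp03
```

Skipping the missing sections up front avoids feeding nonexistent paths to a job, which would otherwise typically fail at input-split time.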