Assignment 4 due 10am October 30

As we discussed in lecture, from a data engineering perspective, most of data science and machine learning in industry consists of data cleaning, data manipulation, and data munging. In fact, most of "applied machine learning" in industry isn't about "machine learning" per se. This assignment reflects that reality.

In this assignment, you will build a web spam classifier based on the work described in Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets by Cormack, Smucker, and Clarke. You don't need to read the paper, as we'll walk you through the implementation; nevertheless, it may be interesting as background.

At a high level, you'll start with feature files for training and test data. These files contain the feature vectors for the instances that you'll classify, as well as the "ground truth" label (either spam or ham). You'll munge these feature vectors into SciPy's csr_matrix representation, detailed here, and then train different models. Literally:

model.fit(X, y)

That's the "actual machine learning". You'll then explore model variants and perform a few common analyses. The main implementation part of the assignment is, basically, data munging. In fact, the assignment doesn't even use Spark. That is, your implementation will read files directly from the file system.

Why? The datasets we provide to you, while relatively small, already lie at the end of a data cleaning and feature extraction pipeline that starts from datasets hundreds of GBs in size. In this assignment, we're skipping that part and handing you the feature vectors directly, but those aspects of the data transformation likely require the distributed processing capabilities offered by Spark. So keep that in mind as you're doing this assignment; we're already simplifying.

Downloading Data

Download the following datasets and unpack somewhere:

File                 MD5                               Size
spam.train.txt.gz    73a823d6d3a2d9c7d2756047c0fbf925  10 MB
spam.test.txt.gz     9d7c0f177a7a71b7c495711bb2335fe4  485 MB

MD5 checksums are provided to confirm the integrity of what you've downloaded.
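One way to verify the checksums is with a short Python snippet; a minimal sketch (the helper name `md5sum` is illustrative, not part of the assignment):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

For example, `md5sum("spam.train.txt.gz")` should match the checksum in the table above; on Linux, the `md5sum` command-line tool does the same job.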

Understanding the Features

The features used to represent the web pages for the spam classifier are hashed byte 4-grams. Pages are sliced into overlapping 4-grams, and then hashed. From the paper:

That is, if the page consisted of “pq xyzzy” the features would be simply “pq x”, “q xy”, “ xyz”, “xyzz”, and “yzzy”. Each feature was represented as a binary quantity indicating its presence or absence in the page.

To reiterate: each feature is an integer (a hash value), and each feature is binary, indicating whether or not the page contains the corresponding hashed 4-gram.
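The slicing from the paper's example can be sketched in a few lines (illustrative only; in the actual pipeline, each 4-gram is then hashed into an integer feature id):

```python
def byte_4grams(text):
    """Return the overlapping 4-grams of a string."""
    return [text[i:i + 4] for i in range(len(text) - 3)]

byte_4grams("pq xyzzy")
# -> ['pq x', 'q xy', ' xyz', 'xyzz', 'yzzy']
```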

Take a look at spam.train.txt. The first line begins as follows:

clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171 ...

In the file, each training instance appears on its own line. Each line begins with a document id, followed by the string "spam" or "ham" (the label), followed by a list of integers (the features). Throughout this assignment, assume spam = 1 and ham = 0.
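Parsing one such line might look like the following sketch (`parse_line` is an illustrative helper name, not something the assignment prescribes):

```python
def parse_line(line):
    """Split one line of the feature file into (docid, label, features)."""
    parts = line.strip().split()
    docid = parts[0]
    label = 1 if parts[1] == "spam" else 0  # spam = 1, ham = 0
    features = [int(f) for f in parts[2:]]
    return docid, label, features

docid, label, features = parse_line(
    "clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171")
# docid == 'clueweb09-en0094-20-13546', label == 1, features[0] == 387908
```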

Before you can train a model, you'll need to build X and y:

model.fit(X, y)

The variable X will have type csr_matrix and the variable y will have type ndarray; these are exactly the variables we discussed in lecture and that appear in the slides. The shape of the csr_matrix will be (num_examples, dim), where num_examples is the number of instances (for example, in the training set). You'll have to figure out what dim should be: the number of dimensions in each (sparse) feature vector. You'll also have to figure out what the shape of y should be.

What's shape? If you don't already know, you'll have to figure this out yourself. Consider it Q0 of the assignment.

What's dim? Here's a hint: the features are hash values, right? Say you look through all the feature vectors and observe the max of the hash values to be N. How many unique features are there? (Keep in mind that 0 is a possible hash value.) The number of unique features is the number of dimensions in each (sparse) feature vector, which also determines the shape of X. Another hint: the training and test sets must have the same dim; otherwise, prediction just wouldn't work. (However, num_examples can differ.)
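Once you've parsed the instances and worked out dim, the munging step can be sketched as follows (`build_matrix` is an illustrative helper, not prescribed by the assignment; it assembles the csr_matrix directly from the (data, indices, indptr) arrays):

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_matrix(instances, dim):
    """Build a binary csr_matrix of shape (num_examples, dim) from
    lists of integer features, one list per instance."""
    indptr = [0]
    indices = []
    for feats in instances:
        indices.extend(sorted(set(feats)))  # dedupe: features are binary
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=np.int8)
    return csr_matrix((data, indices, indptr), shape=(len(instances), dim))

# Toy example: two instances, with dim = 5 chosen to cover all feature ids.
X = build_matrix([[0, 2, 2], [1, 4]], dim=5)
y = np.array([1, 0])  # labels: spam = 1, ham = 0
# X.shape == (2, 5) and y.shape == (2,)
```

Building the matrix from (data, indices, indptr) avoids materializing a dense array, which matters once dim runs into the hundreds of thousands.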

The Assignment Notebook and Submission

The Jupyter notebook that contains the actual assignment is available here. Given the setup provided above, the notebook should be self-explanatory.

As before, a GitHub invite link for this assignment will be posted soon.

In addition, please explicitly add the following two files to your repo (at the top level):

Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.

Grading Scheme

Subtotal = 16 points

Subtotal = 25 points

Subtotal = 20 points

What does "following instructions" mean? These are "free points" if you follow the instructions provided in this assignment. They exist to handle the scenario where all your answers are correct, but you did not follow the instructions, forcing us to go out of our way to fix your submission so that it conforms. (For example, you removed the ids that we use for tracking, which would make grading much more difficult.) In these and other related cases, we will dock points from this category.

Total: 70 points
