As we discussed in lecture, from a data engineering perspective, most of data science and machine learning in industry comprises data cleaning, data manipulation, and data munging. In fact, most of "applied machine learning" in industry isn't about "machine learning" per se. This reality is reflected in the assignment.
In this assignment, you will build a web spam classifier based on the work described in Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets by Cormack, Smucker, and Clarke. There's no need to read the paper in detail, as we'll walk you through the implementation; nevertheless, it may be interesting as background reading.
At a high level, you'll start with feature files for the training and test data. These files contain the feature vectors for the instances you'll classify, as well as the "ground truth" label (either spam or ham). You'll munge these feature vectors into SciPy's `csr_matrix` representation, detailed here, and then train different models. Literally:

`model.fit(X, y)`

That's the "actual machine learning". You'll then explore model variants and perform a few common analyses. The main implementation part of the assignment is, basically, data munging. In fact, the assignment doesn't even use Spark. That is, your implementation will read files directly from the file system.
Why? The datasets we provide to you, while relatively small, already lie at the end of a data cleaning and feature extraction pipeline that starts from datasets hundreds of GBs in size. In this assignment, we're skipping that part and handing you the feature vectors directly, but those aspects of the data transformation likely require the distributed processing capabilities offered by Spark. So keep that in mind as you're doing this assignment; we're already simplifying.
Download the following datasets and unpack somewhere:
| File | MD5 | Size |
|------|-----|------|
| `spam.train.txt.gz` | `73a823d6d3a2d9c7d2756047c0fbf925` | 10 MB |
| `spam.test.txt.gz` | `9d7c0f177a7a71b7c495711bb2335fe4` | 485 MB |
MD5 checksums are provided to confirm the integrity of what you've downloaded.
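If you'd like to check the checksums programmatically rather than with a command-line tool like `md5sum`, a small sketch (assuming Python's standard `hashlib` module) is:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5_of("spam.train.txt.gz"))  # expect 73a823d6d3a2d9c7d2756047c0fbf925
```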
The features used to represent the web pages for the spam classifier are hashed byte 4-grams. Pages are sliced into overlapping 4-grams, and then hashed. From the paper:
That is, if the page consisted of “pq xyzzy” the features would be simply “pq x”, “q xy”, “ xyz”, “xyzz”, and “yzzy”. Each feature was represented as a binary quantity indicating its presence or absence in the page.
To reiterate: the features are integers (hash values), and each feature is binary, indicating whether the page contains the corresponding hashed 4-gram or not.
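The slicing step is easy to see in code. Here's an illustrative snippet reproducing the paper's example (the slicing only; you don't need to reimplement the hash function that produced these datasets):

```python
# Illustrative only: overlapping byte 4-grams of the paper's example string.
page = "pq xyzzy"
grams = [page[i:i + 4] for i in range(len(page) - 3)]
print(grams)  # ['pq x', 'q xy', ' xyz', 'xyzz', 'yzzy']
# Each 4-gram is then hashed to an integer feature id; a feature is "on"
# if the page contains that 4-gram at least once (presence, not counts).
```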
Take a look at `spam.train.txt`. The first line begins as follows:

clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171 ...

In the file, each line holds one training instance. Each line begins with a document id, followed by the string "spam" or "ham" (the label), followed by a list of integers (the features). Throughout this assignment, assume spam = 1 and ham = 0.
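As a starting point, parsing one such line might look like the sketch below. The function name and return format are hypothetical, for illustration only; they are not the required interface of `A4_load_data.py`:

```python
def parse_line(line):
    """Split one line into (docid, label, features).

    Assumes whitespace-separated fields: a document id, the string
    "spam" or "ham", then integer feature ids. Illustrative only.
    """
    parts = line.split()
    docid = parts[0]
    label = 1 if parts[1] == "spam" else 0  # spam = 1, ham = 0
    features = [int(f) for f in parts[2:]]
    return docid, label, features
```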
Before you can train a model, you'll need to build `X` and `y`:

`model.fit(X, y)`

The variable `X` will have type `csr_matrix` and the variable `y` will have type `ndarray`; these are exactly the variables we discussed in lecture, the same ones that show up in the slides.
The `shape` of the `csr_matrix` will be (`num_examples`, `dim`), where `num_examples` is the number of instances (for example, in the training set). You'll have to figure out what `dim` should be, which is the number of dimensions in each (sparse) feature vector. You'll also have to figure out what the `shape` of `y` should be.
What's `shape`? If you don't already know, you'll have to figure this out yourself. Consider it Q0 of the assignment.
What's `dim`? Here's a hint: the features are hash values, right? Say you look through all the feature vectors and observe the max of the hash values to be N. How many unique features are there? (Keep in mind that 0 is a possible hash value.) The number of unique features is the number of dimensions in each (sparse) feature vector, which will also determine the `shape` of `X` and `y`.
Another hint: the training and test sets must have the same `dim`; otherwise, prediction just wouldn't work. (However, `num_examples` can be different.)
The Jupyter notebook that contains the actual assignment is available here. Given the setup provided above, the notebook should be self-explanatory.

As before, a GitHub invite link for this assignment will be posted soon!
In addition, please explicitly add the following files to your repo (at top level):

- `A4_load_data.py`
- `4e.png`: ROC curve of LR
- `4f.png`: ROC curve of SGD
- `4g.png`: ROC curve of mystery
- `4h.png`: ROC curve of all three classifiers
- `6a.png`: exploration of dataset size

Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.
- `codecell_2a`: computing `dim`: 3 points
- `codecell_2b`: setting `dim`: 1 point
- `A4_load_data.py`: implementation: 10 points
- `codecell_2c`: shape of `X_train` and `y_train`: 2 points

Subtotal = 16 points
- `codecell_3a`: train LR: 1 point
- `codecell_3b`: train SGD: 1 point
- `codecell_3c`: train mystery: 2 points
- `codecell_4a`: evaluate LR: 1 point
- `codecell_4b`: evaluate SGD: 1 point
- `codecell_4c`: evaluate mystery: 2 points
- `qcell_4qx1`: Q4.1: 2 points
- `codecell_4d`: implementation of `plot_roc_curve`: 6 points
- `codecell_4e`: ROC curve of LR: 1 point
- `codecell_4f`: ROC curve of SGD: 1 point
- `codecell_4g`: ROC curve of mystery: 1 point
- `codecell_4h`: ROC curve of all three classifiers: 4 points
- `qcell_4qx2`: Q4.2: 2 points

Subtotal = 25 points
- `codecell_5a`: train ensemble: 2 points
- `codecell_5b`: evaluate ensemble: 2 points
- `codecell_5c`: score averaging: 2 points
- `qcell_5qx`: questions: 6 points total (3 each)
- `codecell_6a`: dataset size exploration: 5 points
- `qcell_6qx`: questions: 3 points

Subtotal = 20 points
Following instructions (including `README.md`): 9 points

What does "following instructions" mean? These are "free points" if you follow the instructions provided in this assignment. They handle the scenario where all your answers are correct, but you did not follow the instructions, causing us to go out of our way to fix your submission so that it conforms. (For example, you removed the ids that we use for tracking, which would make grading much more difficult.) In these and other related cases, we will dock points from this category.
Total: 70 points