As we discussed in lecture, from a data engineering perspective, most of data science and machine learning in industry comprises data cleaning, data manipulation, and data munging. In fact, most of "applied machine learning" in industry isn't about "machine learning" per se. This reality is reflected in the assignment.
In this assignment, you will build a web spam classifier based on the work described in Efficient and Effective Spam Filtering and Re-ranking for Large Web Datasets by Cormack, Smucker, and Clarke. There's no need to read the paper in detail, as we'll walk you through the implementation; nevertheless, it may be interesting as background reading.
At a high level, you'll start with feature files for the training and test data. These files contain the feature vectors for the instances you'll classify, as well as the "ground truth" label (either spam or ham). You'll munge these feature vectors into SciPy's `csr_matrix` representation, detailed here, and then train different models. Literally:

`model.fit(X, y)`

That's the "actual machine learning". You'll then explore model variants and perform a few common analyses. The main implementation part of the assignment is, basically, data munging. In fact, the assignment doesn't even use Spark. That is, your implementation will read files directly from the file system.
Why? The datasets we provide to you, while relatively small, already lie at the end of a data cleaning and feature extraction pipeline that starts from datasets hundreds of GBs in size. In this assignment, we're skipping that part and handing you the feature vectors directly, but those aspects of the data transformation likely require the distributed processing capabilities offered by Spark. So keep that in mind as you're doing this assignment; we're already simplifying.
Download the following datasets and unpack somewhere:
| File | MD5 | Size |
|------|-----|------|
| `spam.train.txt.gz` | `73a823d6d3a2d9c7d2756047c0fbf925` | 10 MB |
| `spam.test.txt.gz` | `9d7c0f177a7a71b7c495711bb2335fe4` | 485 MB |
MD5 checksums are provided to confirm the integrity of what you've downloaded.
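If you'd like to check the checksums programmatically rather than with a command-line tool like `md5sum`, a small sketch (assuming Python's standard `hashlib` module) is:

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5_of("spam.train.txt.gz"))  # expect 73a823d6d3a2d9c7d2756047c0fbf925
```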
The features used to represent the web pages for the spam classifier are hashed byte 4-grams. Pages are sliced into overlapping 4-grams, and then hashed. From the paper:
That is, if the page consisted of “pq xyzzy” the features would be simply “pq x”, “q xy”, “ xyz”, “xyzz”, and “yzzy”. Each feature was represented as a binary quantity indicating its presence or absence in the page.
To reiterate: the features are integers (hash values), and each feature is binary, indicating whether the page contains the corresponding hashed 4-gram or not.
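The slicing step is easy to see in code. Here's an illustrative snippet reproducing the paper's example (the slicing only; you don't need to reimplement the hash function that produced these datasets):

```python
# Illustrative only: overlapping byte 4-grams of the paper's example string.
page = "pq xyzzy"
grams = [page[i:i + 4] for i in range(len(page) - 3)]
print(grams)  # ['pq x', 'q xy', ' xyz', 'xyzz', 'yzzy']
# Each 4-gram is then hashed to an integer feature id; a feature is "on"
# if the page contains that 4-gram at least once (presence, not counts).
```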
Take a look at `spam.train.txt`. The first line begins as follows:

clueweb09-en0094-20-13546 spam 387908 697162 426572 161118 688171 ...

In the file, each line holds one training instance. Each line begins with a document id, followed by the string "spam" or "ham" (the label), followed by a list of integers (the features). Throughout this assignment, assume spam = 1 and ham = 0.
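As a starting point, parsing one such line might look like the sketch below. The function name and return format are hypothetical, for illustration only; they are not the required interface of `A4_load_data.py`:

```python
def parse_line(line):
    """Split one line into (docid, label, features).

    Assumes whitespace-separated fields: a document id, the string
    "spam" or "ham", then integer feature ids. Illustrative only.
    """
    parts = line.split()
    docid = parts[0]
    label = 1 if parts[1] == "spam" else 0  # spam = 1, ham = 0
    features = [int(f) for f in parts[2:]]
    return docid, label, features
```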
Before you can train a model, you'll need to build `X` and `y`:

`model.fit(X, y)`

The variable `X` will have type `csr_matrix` and the variable `y` will have type `ndarray`; these are exactly the variables we discussed in lecture, the same ones that show up in the slides.
The `shape` of the `csr_matrix` will be (`num_examples`, `dim`), where `num_examples` is the number of instances (for example, in the training set). You'll have to figure out what `dim` should be, which is the number of dimensions in each (sparse) feature vector. You'll also have to figure out what the `shape` of `y` should be.
What's `shape`? If you don't already know, you'll have to figure this out yourself. Consider it Q0 of the assignment.
What's `dim`? Here's a hint: the features are hash values, right? Say you look through all the feature vectors and observe the max of the hash values to be N. How many unique features are there? (Keep in mind that 0 is a possible hash value.) The number of unique features is the number of dimensions in each (sparse) feature vector, which will also determine the `shape` of `X` and `y`.
Another hint: the training and test sets must have the same `dim`; otherwise, prediction just wouldn't work. (However, `num_examples` can be different.)
The Jupyter notebook that contains the actual assignment is available here. Given the setup provided above, the notebook should be self-explanatory.

As before, a GitHub invite link for this assignment will be posted soon!
In addition, please explicitly add the following files to your repo (at top level):

- `A4_load_data.py`
- `4e.png`: ROC curve of LR
- `4f.png`: ROC curve of SGD
- `4g.png`: ROC curve of mystery
- `4h.png`: ROC curve of all three classifiers
- `6a.png`: exploration of dataset size

Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.
- `codecell_2a`: computing `dim`: 3 points
- `codecell_2b`: setting `dim`: 1 point
- `A4_load_data.py`: implementation: 10 points
- `codecell_2c`: shape of `X_train` and `y_train`: 2 points

Subtotal = 16 points
- `codecell_3a`: train LR: 1 point
- `codecell_3b`: train SGD: 1 point
- `codecell_3c`: train mystery: 2 points
- `codecell_4a`: evaluate LR: 1 point
- `codecell_4b`: evaluate SGD: 1 point
- `codecell_4c`: evaluate mystery: 2 points
- `qcell_4qx1`: Q4.1: 2 points
- `codecell_4d`: implementation of `plot_roc_curve`: 6 points
- `codecell_4e`: ROC curve of LR: 1 point
- `codecell_4f`: ROC curve of SGD: 1 point
- `codecell_4g`: ROC curve of mystery: 1 point
- `codecell_4h`: ROC curve of all three classifiers: 4 points
- `qcell_4qx2`: Q4.2: 2 points

Subtotal = 25 points
- `codecell_5a`: train ensemble: 2 points
- `codecell_5b`: evaluate ensemble: 2 points
- `codecell_5c`: score averaging: 2 points
- `qcell_5qx`: questions: 6 points total (3 each)
- `codecell_6a`: dataset size exploration: 5 points
- `qcell_6qx`: questions: 3 points

Subtotal = 20 points
Following instructions (including `README.md`): 9 points

What does "following instructions" mean? These are "free points" if you follow the instructions provided in this assignment. They handle the scenario where all your answers are correct, but you did not follow the instructions, causing us to go out of our way to fix your submission so that it conforms. (For example, you removed the ids that we use for tracking, which would make grading much more difficult.) In these and other related cases, we will dock points from this category.
Total: 70 points