
The final project is a requirement only for graduate students taking CS 651. There is no final project for CS 451.
At the highest level, I would like you to:
I'll try to articulate what I'm looking for with a running example:
(1) This is a dataset that I find interesting: some information on ACM Fellows and ACM Turing Award winners.
(2) Why? I'm interested in the science of science, a scientific exploration of how science "gets done". In particular, being a computer scientist, I am interested in an introspective examination of my own field. One way to approach this is to "start from the end", i.e., the honors that the field bestows on accomplished individuals. For computer science, looking at ACM Fellows and ACM Turing Award winners captures (some aspects of) this. And, as it turns out, there's already a dataset (albeit out of date) that captures this data. (A previous student of mine built it!)
With this dataset, one might be able to answer a number of interesting questions. For example:
An initial attempt at answering some of these questions can be found here.
Answers to these questions would reveal insights, thus addressing point (2) above. This requires data science, along the lines of everything we've discussed in this course: data cleaning, data munging, analytics over various aspects of the data, some method of presenting insights (e.g., graphs).
You'll deliver these insights in a notebook.
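As a rough illustration of the cleaning, munging, and analytics steps mentioned above, here is a minimal sketch using only the Python standard library. The column names (`name`, `year`, `affiliation`) and every row are made-up placeholders, not the actual ACM Fellows dataset, and the helper names `load_clean_rows` and `count_by` are hypothetical. A real notebook would more likely use pandas and a plotting library.

```python
import csv
import io
from collections import Counter

# Placeholder data: the columns and rows are invented for illustration
# and do not come from the real ACM Fellows dataset.
RAW_CSV = """name,year,affiliation
Person A,2001, University X
Person B,2001,university x
Person C,2003,Company Y
Person D,,Company Y
"""


def load_clean_rows(text):
    """Parse CSV text, normalize whitespace/case, and drop rows without a year."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        year = row["year"].strip()
        if not year.isdigit():
            continue  # data cleaning: skip rows with a missing or malformed year
        rows.append({
            "name": row["name"].strip(),
            "year": int(year),
            # data munging: collapse case and whitespace variants
            "affiliation": row["affiliation"].strip().title(),
        })
    return rows


def count_by(rows, key):
    """Analytics: count how many rows share each value of `key`."""
    return Counter(row[key] for row in rows)


rows = load_clean_rows(RAW_CSV)
per_year = count_by(rows, "year")
per_affiliation = count_by(rows, "affiliation")

# A crude text "graph"; a notebook would use a real plotting library here.
for year in sorted(per_year):
    print(f"{year} {'#' * per_year[year]}")
```

Keeping cleaning and counting in separate functions makes it easier to reuse the same logic later, for instance behind an MCP tool.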
(3) The downside of delivering insights in a notebook is that a notebook is (mostly) static. Follow-up interactions are limited to whatever's already provided in the notebook; for anything else, I'll have to write additional code. Wouldn't it be nice if I could "talk to my dataset" via an LLM? This is what MCP (potentially) solves.
For this part, write an MCP server that exposes tools for accessing your dataset in interesting ways. These "interesting ways" might correspond to answering the questions you've posed above. Once you expose a tool via MCP, an LLM (Claude, for example) can call the server, fetch data, and compose it further as part of normal LLM responses and interactions.
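To make "exposing a tool" more concrete, here is a minimal sketch that assumes the official MCP Python SDK (the `mcp` package) and its `FastMCP` helper. The records, the server name `fellows-demo`, and the functions `fellows_in_year` and `top_affiliations` are hypothetical placeholders, not part of any real dataset or required design. Treat it as one possible starting point.

```python
from collections import Counter

# Placeholder records: invented for illustration, not real dataset rows.
FELLOWS = [
    {"name": "Person A", "year": 2001, "affiliation": "University X"},
    {"name": "Person B", "year": 2001, "affiliation": "University X"},
    {"name": "Person C", "year": 2003, "affiliation": "Company Y"},
]


def fellows_in_year(year: int) -> list[str]:
    """Return the names of fellows inducted in the given year."""
    return [r["name"] for r in FELLOWS if r["year"] == year]


def top_affiliations(limit: int = 5) -> list[dict]:
    """Return the affiliations with the most fellows, most frequent first."""
    counts = Counter(r["affiliation"] for r in FELLOWS)
    return [{"affiliation": a, "count": c} for a, c in counts.most_common(limit)]


def build_server():
    """Wrap the query functions above as MCP tools."""
    # Imported here so the query functions stay usable without the SDK installed.
    from mcp.server.fastmcp import FastMCP

    server = FastMCP("fellows-demo")  # hypothetical server name
    # Each registered function becomes a tool; its docstring and type hints
    # describe the tool to the LLM client.
    server.tool()(fellows_in_year)
    server.tool()(top_affiliations)
    return server


# To serve over stdio (the SDK's default transport), run:
#     build_server().run()
```

Keeping the query logic in plain functions means the same code can back both the notebook and the server, and it can be tested without an LLM client attached.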
As an example, we've exposed the search functionality in my group's Pyserini IR toolkit in MCP and connected it to Claude; see additional details here. Claude can call Pyserini via MCP, search a collection of documents, and then "do interesting things with the results". One example is to rewrite all the results in the style of Shakespeare's Sonnets.
I've left this intentionally vague to provide you the freedom to explore MCP.
This final project can be done individually or in pairs. For calibration, the amount of effort I expect for the project is roughly two assignments (per person). So if you're working with a partner, I would expect something like four assignments' worth of effort.
In terms of the "interesting dataset", it's completely up to you. I actually wouldn't mind someone working on exactly the ACM Fellows dataset described above. Other potentially interesting datasets:
In terms of grading, half of the grade will be based on (2) and half on (3). Given the open-ended nature of the project, it is not possible to produce a detailed rubric, but, as part of the project checkpoint (below), we can negotiate expectations in more detail.
When you are ready, send me (jimmylin@uwaterloo.ca) an email describing what you'd like to work on.
That is: (1) what dataset, (2) what types of insights, and (3) what the MCP server will expose.
I will provide you with feedback on appropriateness and scope of your proposed project.
The "soft" deadline for this proposal is November 11.
There is no penalty if you miss this deadline, but it is in your best interest to not leave this "proposal" to the last minute.
The final project is due 10am on December 15.
The actual project deliverable is a GitHub repo. The repo may be public (if you wish) or private. In the repo, I expect at least the following:
In the process of performing data science, it is likely you'll have additional scripts for data cleaning, data munging, etc. Include these scripts in your repo as well.