CS 451/651: Data-Intensive Distributed Computing (Fall 2025)

Final Project due 10am December 15

The final project is a requirement only for graduate students taking CS 651. There is no final project for CS 451.

Project Requirements

At the highest level, I would like you to:

Pick a dataset that you find interesting.
Do some interesting data science to reveal some insights.
Build an MCP server so I can "talk to the dataset".

I'll try to articulate what I'm looking for with a running example:

(1) This is a dataset that I find interesting: some information on ACM Fellows and ACM Turing Award winners.

(2) Why? I'm interested in the science of science, a scientific exploration of how science "gets done". In particular, being a computer scientist, I am interested in an introspective examination of my own field. One way to approach this is to "start from the end", i.e., the honors that the field bestows on accomplished individuals. For computer science, looking at ACM Fellows and ACM Turing Award winners captures (some aspects of) this. And, as it turns out, there's already a dataset (albeit out of date) that captures this data. (A previous student of mine built it!)

With this dataset, one might be able to answer a number of interesting questions. For example:

What sub-disciplines of computer science are represented by the ACM Fellows?
What topics are they recognized for?
How did this evolve over time?
What are the demographic characteristics of the fellows?

An initial attempt at answering some of these questions can be found here.

Answers to these questions would reveal insights, thus addressing point (2) above. This requires data science, along the lines of everything we've discussed in this course: data cleaning, data munging, analytics over various aspects of the data, some method of presenting insights (e.g., graphs).

You'll deliver these insights in a notebook.
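As a rough illustration of the kind of cleaning and analytics a notebook cell might contain, here is a minimal sketch that counts fellows per research area per year. The sample rows, the column names (`name`, `year`, `area`), and the helper functions are all hypothetical; the real dataset's schema may differ.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample data; real column names and values may differ.
SAMPLE_CSV = """name,year,area
Ada Example,1994,Databases
Bob Example,1994,Theory
Cy Example,1995, databases
Di Example,1995,Databases
"""


def load_rows(text):
    """Parse CSV text and normalize each field (basic data cleaning)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "name": row["name"].strip(),
            "year": int(row["year"]),
            # Strip stray whitespace and unify capitalization.
            "area": row["area"].strip().title(),
        })
    return rows


def area_counts_by_year(rows):
    """Map each year to a Counter of areas recognized in that year."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["year"]][row["area"]] += 1
    return counts


rows = load_rows(SAMPLE_CSV)
counts = area_counts_by_year(rows)
for year in sorted(counts):
    print(year, dict(counts[year]))
```

In a real notebook, the per-year counts would feed a plot (e.g., one line per area) to address the "how did this evolve over time" question.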

(3) The downside of delivering insights in a notebook is that a notebook is (mostly) static. Follow-up questions are limited to whatever the notebook already provides; anything beyond that means I'll have to write additional code. Wouldn't it be nice if I could "talk to my dataset" via an LLM? This is what MCP (potentially) solves.

For this part, write an MCP server that exposes tools for accessing your dataset in interesting ways. These "interesting ways" might correspond to answering the questions you've posed above. Once you expose a tool via MCP, an LLM (Claude, for example) can call the server, fetch data, and compose it further as part of normal LLM responses and interactions.
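To make this concrete, here is a minimal sketch of what exposing dataset queries as MCP tools might look like. It assumes the official MCP Python SDK (the `mcp` package) and its `FastMCP` class; the in-memory records, the tool names (`fellows_in_year`, `top_areas`), and the server name are hypothetical and only for illustration.

```python
from collections import Counter

# Hypothetical in-memory records standing in for a cleaned dataset.
FELLOWS = [
    {"name": "Ada Example", "year": 1994, "area": "Databases"},
    {"name": "Bob Example", "year": 1994, "area": "Theory"},
    {"name": "Cy Example", "year": 1995, "area": "Databases"},
    {"name": "Di Example", "year": 1995, "area": "Databases"},
]


def fellows_in_year(year: int) -> list[str]:
    """Return the names of fellows recognized in the given year."""
    return sorted(f["name"] for f in FELLOWS if f["year"] == year)


def top_areas(limit: int = 3) -> dict[str, int]:
    """Return the most common research areas with their counts."""
    return dict(Counter(f["area"] for f in FELLOWS).most_common(limit))


def build_server():
    """Register the query functions above as MCP tools."""
    # Imported here so the query functions stay usable without the SDK.
    from mcp.server.fastmcp import FastMCP

    server = FastMCP("fellows-demo")
    # The docstrings and type hints become the tool descriptions and
    # input schemas that the LLM sees.
    server.tool()(fellows_in_year)
    server.tool()(top_areas)
    return server


# To serve over stdio (the SDK's default transport): build_server().run()
```

Keeping the query logic in plain functions, separate from the MCP registration, means the same functions can be called from the notebook and tested without a running server.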

As an example, we've exposed the search functionality in my group's Pyserini IR toolkit in MCP and connected it to Claude; see additional details here. Claude can call Pyserini via MCP, search a collection of documents, and then "do interesting things with the results". One example is to rewrite all the results in the style of Shakespeare's Sonnets.

I've left this intentionally vague to provide you the freedom to explore MCP.

Additional Project Details

This final project can be done individually or in pairs. For calibration, the amount of effort I expect for the project is roughly two assignments (per person). So if you're working with a partner, I would expect something like four assignments' worth of effort.

In terms of the "interesting dataset", it's completely up to you. I actually wouldn't mind someone working on exactly the ACM Fellows dataset described above. Other potentially interesting datasets:

In terms of grading, half of the grade will be for (2) and half for (3). Given the open-ended nature of the project, it is not possible to produce a detailed rubric, but, as part of the project checkpoint (below), we can negotiate expectations in more detail.

Project Checkpoint

When you are ready, send me (jimmylin@uwaterloo.ca) an email describing what you'd like to work on. That is: (1) what dataset, (2) what types of insights, and (3) what the MCP server will expose. I will provide you with feedback on appropriateness and scope of your proposed project. The "soft" deadline for this proposal is November 11. There is no penalty if you miss this deadline, but it is in your best interest to not leave this "proposal" to the last minute.

Final Project Delivery

The final project is due 10am on December 15.

The actual project deliverable is a GitHub repo. The repo may be public (if you wish) or private. In the repo, I expect at least the following:

One or more notebooks addressing point (2).
The implementation of the MCP server addressing point (3).
One or more screen recordings showing examples of an LLM interacting with your MCP server (something like this but more detailed). This is needed because I may not be able to get your actual MCP server running.
A README that ties everything together, e.g., explaining what is what, providing pointers to relevant components, etc.

In the process of performing data science, it is likely you'll have additional scripts for data cleaning, data munging, etc. Include these scripts in your repo as well.
