This assignment covers the entire ETL process, from the operational database to the analytical database. We hand you a PostgreSQL dump of the operational database of a hypothetical e-commerce site, built on this dataset. At a high level, we ask you to extract data from the operational database, transform it, and load it into an analytical database.

Specifically, in the data warehouse, you will build a set of dimension tables (`dim_user`, `dim_brand`, `dim_category`, `dim_product`, `dim_date`) and join everything together into the final analytical table; the grading rubric below breaks down the individual pieces.
Your first step is to install PostgreSQL on your machine. There are many ways to accomplish this, but you are expected to figure out how to do so yourself. Once you have done so, use `pg_ctl` to start a PostgreSQL server on your local machine.
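The details depend on how you installed PostgreSQL, but as a minimal sketch (assuming you manage your own database cluster, and using `~/pgdata` as a purely illustrative data directory; packaged installs often handle this for you):

```sh
# One-time setup: create a fresh database cluster in ~/pgdata.
initdb -D ~/pgdata

# Start the server, writing logs to a file.
# Later, `pg_ctl -D ~/pgdata stop` shuts it down.
pg_ctl -D ~/pgdata -l ~/pgdata/server.log start
```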
Grab the database dump `retail_schema_20250826.dump` from here. With the dump file, you can restore the database `cs451` as follows:

```sh
dropdb --if-exists cs451
createdb cs451
pg_restore \
  -d cs451 \
  --jobs 4 --clean --if-exists \
  --no-owner --no-acl \
  retail_schema_20250826.dump
```
Launch the `psql` shell to connect to the `cs451` database that you've restored. With the following command:

```
\dt retail.*
```

you will see all the tables under the schema `retail`, as follows:
```
cs451=# \dt retail.*
            List of relations
 Schema |     Name     | Type  |  Owner
--------+--------------+-------+----------
 retail | brand        | table | jimmylin
 retail | category     | table | jimmylin
 retail | events       | table | jimmylin
 retail | product      | table | jimmylin
 retail | product_name | table | jimmylin
 retail | session      | table | jimmylin
 retail | user         | table | jimmylin
(7 rows)
```
As a sanity check, issue a simple SQL query:
```
cs451=# select * from retail.events limit 10;
       event_time       | event_type |              session_id              | product_id |   price
------------------------+------------+--------------------------------------+------------+-----------
 2019-09-30 20:00:00-04 | view       | 72d76fde-8bb3-4e00-8c23-a032dfed738c |   44600062 |     35.79
 2019-09-30 20:00:00-04 | view       | 9333dfbd-b87a-4708-9857-6336556b0fcc |    3900821 |     33.20
 2019-09-30 20:00:01-04 | view       | 566511c2-e2e3-422b-b695-cf8e6e792ca8 |   17200506 |    543.10
 2019-09-30 20:00:01-04 | view       | 7c90fc70-0e80-4590-96f3-13c02c18c713 |    1307067 |    251.74
 2019-09-30 20:00:04-04 | view       | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |    1004237 | 108198.00
 2019-09-30 20:00:05-04 | view       | 0d0d91c2-c9c2-4e81-90a5-86594dec0db9 |    1480613 |    908.62
 2019-09-30 20:00:08-04 | view       | 4fe811e9-91de-46da-90c3-bbd87ed3a65d |   17300353 |    380.96
 2019-09-30 20:00:08-04 | view       | 6280d577-25c8-4147-99a7-abc6048498d6 |   31500053 |     41.16
 2019-09-30 20:00:10-04 | view       | ac1cd4e5-a3ce-4224-a2d7-ff660a105880 |   28719074 |    102.71
 2019-09-30 20:00:11-04 | view       | 406c46ed-90a4-4787-a43b-59a410c1a5fb |    1004545 |    566.01
(10 rows)
```
If you've gotten this far, it appears that you've successfully restored the operational database! 🎉
Take some time to understand the operational database. There are no specific assignment questions associated with this task, but the remainder of the assignment will be a lot easier if you actually understand what you're doing...
A starting point would be the dataset this assignment was built on. For reference, these are the tables and their schemas in the PostgreSQL dump you've been given:
- `user`: (`user_id`, `gender`, `birthdate`)
- `session`: (`session_id`, `user_id`)
- `product`: (`product_id`, `brand`, `category`, `product_name`)
- `product_name`: (`category`, `product_name`, `description`)
- `events`: (`event_time`, `event_type`, `session_id`, `product_id`, `price`)
- `category`: (`category`, `description`)
- `brand`: (`brand`, `description`)

You want to understand the real-world entities that each table models, the schema of each table, and how the tables interact (i.e., primary key/foreign key relationships). Do some exploratory data analysis, i.e., "poking around". For example, join two tables to trace a PK/FK relationship. Join multiple tables to reconstruct a session: be able to identify what was purchased, by whom, for how much, etc.
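As one illustrative sketch of this kind of query (the join keys follow from the schemas above, and the session id comes from the sample output earlier; confirming that these are the actual PK/FK pairs is part of the exercise):

```sql
-- Reconstruct a single session: what happened, to which products,
-- and by which user. Note that "user" is a reserved word in
-- PostgreSQL, hence the quoting.
SELECT e.event_time,
       e.event_type,
       u.user_id,
       p.product_id,
       p.brand,
       e.price
FROM retail.events e
JOIN retail.session s ON s.session_id = e.session_id
JOIN retail."user" u  ON u.user_id    = s.user_id
JOIN retail.product p ON p.product_id = e.product_id
WHERE e.session_id = '72d76fde-8bb3-4e00-8c23-a032dfed738c'
ORDER BY e.event_time;
```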
The Jupyter notebook that contains the actual assignment is available here.
Use this link to create an assignment repo for submission.
In the assignment repo, enter your answers in `assignment2.ipynb`.
Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.
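If the git workflow is new to you, the submission amounts to something like the following (the branch name `main` is an assumption; use your repo's default branch):

```sh
# Stage the notebook, record the change, and push it back to origin.
git add assignment2.ipynb
git commit -m "Assignment 2 answers"
git push origin main
```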
Already stated on the software page, but worth repeating: while the assignments are to be turned in as Jupyter notebooks, notebooks are not necessarily the most convenient development environment. Different software capabilities can be accessed via JupyterLab, from a shell (terminal), from VS Code, etc. You might want to experiment with different approaches for different usage scenarios.
Points are allocated as follows:

- `qcell_1b76x2` SQL queries: 14 points total (3/3/3/5 for Q1, Q2, Q3, and Q4, respectively)
- `codecell_30z8le` load the CSV files into DataFrames: 3 points
- `codecell_41ax14` build `dim_user`: 5 points
- `codecell_43k3n9` build `dim_brand`, `dim_category`, `dim_product`: 13 points (4/4/5 for the three tables, respectively)
- `codecell_44qm5c` build `dim_date`: 5 points
- `codecell_51ep7v` cleaning: 6 points
- `codecell_52hg6x` and `codecell_52bf5d` cap silly prices: 4 points
- `codecell_53l2kp` lookup tables: 4 points
- `codecell_54aaaa` joining everything together: 6 points
- `qcell_6a9876` size: 3 points
- `qcell_6b1234` answers: 8 points
- Following instructions (including `README.md`): 9 points

What does "following instructions" mean? These are "free points" awarded if you follow the instructions provided in this assignment. They exist to handle the scenario where all your answers are correct, but you did not follow the instructions, forcing us to go out of our way to fix your submission so that it conforms. (For example, you removed the ids that we use for tracking, which would make your submission much more difficult to grade.) In these and other related cases, we will dock points from this category.
Total: 80 points