Assignment 2: ETL due 10am September 25

‼️ This assignment isn't "ready" until the release date specified in the course syllabus. Anything (questions, grading scheme, etc.) might arbitrarily change before the release date. Start this assignment before the release date at your own risk. After the release date, the assignment can be considered "final" and will only be updated with clarifications.

This assignment covers the entire ETL process, from the operational database to the analytical database. We hand you a PostgreSQL dump of the operational database of a hypothetical e-commerce site, built on this dataset. At a high level, we ask you to:

Specifically, in the data warehouse, you will build:

Install PostgreSQL

Your first step is to install PostgreSQL on your machine. There are many ways to accomplish this, but you are expected to figure out how to do so yourself. Once you have done so, use pg_ctl to start a PostgreSQL server on your local machine.

Grab the database dump retail_schema_20250826.dump from here.

With the dump file, you can restore the database cs451 as follows:

dropdb --if-exists cs451
createdb cs451

pg_restore \
 -d cs451 \
 --jobs 4 --clean --if-exists \
 --no-owner --no-acl \
 retail_schema_20250826.dump

Launch the psql shell to connect to the cs451 database that you've restored. With the following command:

\dt retail.*

You will see all the tables under the schema retail, as follows:

cs451=# \dt retail.*
            List of relations
 Schema |     Name     | Type  |  Owner   
--------+--------------+-------+----------
 retail | brand        | table | jimmylin
 retail | category     | table | jimmylin
 retail | events       | table | jimmylin
 retail | product      | table | jimmylin
 retail | product_name | table | jimmylin
 retail | session      | table | jimmylin
 retail | user         | table | jimmylin
(7 rows)

As a sanity check, issue a simple SQL query:

cs451=# select * from retail.events limit 10;
       event_time       | event_type |              session_id              | product_id |   price   
------------------------+------------+--------------------------------------+------------+-----------
 2019-09-30 20:00:00-04 | view       | 72d76fde-8bb3-4e00-8c23-a032dfed738c |   44600062 |     35.79
 2019-09-30 20:00:00-04 | view       | 9333dfbd-b87a-4708-9857-6336556b0fcc |    3900821 |     33.20
 2019-09-30 20:00:01-04 | view       | 566511c2-e2e3-422b-b695-cf8e6e792ca8 |   17200506 |    543.10
 2019-09-30 20:00:01-04 | view       | 7c90fc70-0e80-4590-96f3-13c02c18c713 |    1307067 |    251.74
 2019-09-30 20:00:04-04 | view       | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |    1004237 | 108198.00
 2019-09-30 20:00:05-04 | view       | 0d0d91c2-c9c2-4e81-90a5-86594dec0db9 |    1480613 |    908.62
 2019-09-30 20:00:08-04 | view       | 4fe811e9-91de-46da-90c3-bbd87ed3a65d |   17300353 |    380.96
 2019-09-30 20:00:08-04 | view       | 6280d577-25c8-4147-99a7-abc6048498d6 |   31500053 |     41.16
 2019-09-30 20:00:10-04 | view       | ac1cd4e5-a3ce-4224-a2d7-ff660a105880 |   28719074 |    102.71
 2019-09-30 20:00:11-04 | view       | 406c46ed-90a4-4787-a43b-59a410c1a5fb |    1004545 |    566.01
(10 rows)

If you've gotten this far, it appears that you've successfully restored the operational database! 🎉

Understand the Operational Database

Take some time to understand the operational database. There are no specific assignment questions associated with this task, but the remainder of the assignment will be a lot easier if you actually understand what you're doing...

A starting point would be the dataset this assignment was built on. For reference, these are the tables and their schemas in the PostgreSQL dump you've been given:

You want to understand the real-world entities that each table is modeling, the schema of each table and how they interact (i.e., primary key—foreign key relationships). Do some exploratory data analysis, i.e., "poking around". For example, join two tables to trace the PK/FK. Join multiple tables to reconstruct a session: be able to identify what was purchased, by whom, for how much, etc.

Let's Acutually Get Started

The Jupyter notebook that contains the actual assignment is available here.

Back to top