‼️ This assignment isn't "ready" until the release date specified in the course syllabus. Anything (questions, grading scheme, etc.) might arbitrarily change before the release date. Start this assignment before the release date at your own risk. After the release date, the assignment can be considered "final" and will only be updated with clarifications.
This assignment covers the entire ETL process, from the operational database to the analytical database. We hand you a PostgreSQL dump of the operational database of a hypothetical e-commerce site, built on this dataset. At a high level, we ask you to:
Specifically, in the data warehouse, you will build:
Your first step is to install PostgreSQL on your machine.
There are many ways to accomplish this, but you are expected to figure out how to do so yourself.
Once you have done so, use pg_ctl
to start a PostgreSQL server on your local machine.
Grab the database dump retail_schema_20250826.dump
from here.
With the dump file, you can restore the database cs451
as follows:
dropdb --if-exists cs451 createdb cs451 pg_restore \ -d cs451 \ --jobs 4 --clean --if-exists \ --no-owner --no-acl \ retail_schema_20250826.dump
Launch the psql
shell to connect to the cs451
database that you've restored.
With the following command:
\dt retail.*
You will see all the tables under the schema retail
, as follows:
cs451=# \dt retail.* List of relations Schema | Name | Type | Owner --------+--------------+-------+---------- retail | brand | table | jimmylin retail | category | table | jimmylin retail | events | table | jimmylin retail | product | table | jimmylin retail | product_name | table | jimmylin retail | session | table | jimmylin retail | user | table | jimmylin (7 rows)
As a sanity check, issue a simple SQL query:
cs451=# select * from retail.events limit 10; event_time | event_type | session_id | product_id | price ------------------------+------------+--------------------------------------+------------+----------- 2019-09-30 20:00:00-04 | view | 72d76fde-8bb3-4e00-8c23-a032dfed738c | 44600062 | 35.79 2019-09-30 20:00:00-04 | view | 9333dfbd-b87a-4708-9857-6336556b0fcc | 3900821 | 33.20 2019-09-30 20:00:01-04 | view | 566511c2-e2e3-422b-b695-cf8e6e792ca8 | 17200506 | 543.10 2019-09-30 20:00:01-04 | view | 7c90fc70-0e80-4590-96f3-13c02c18c713 | 1307067 | 251.74 2019-09-30 20:00:04-04 | view | c6bd7419-2748-4c56-95b4-8cec9ff8b80d | 1004237 | 108198.00 2019-09-30 20:00:05-04 | view | 0d0d91c2-c9c2-4e81-90a5-86594dec0db9 | 1480613 | 908.62 2019-09-30 20:00:08-04 | view | 4fe811e9-91de-46da-90c3-bbd87ed3a65d | 17300353 | 380.96 2019-09-30 20:00:08-04 | view | 6280d577-25c8-4147-99a7-abc6048498d6 | 31500053 | 41.16 2019-09-30 20:00:10-04 | view | ac1cd4e5-a3ce-4224-a2d7-ff660a105880 | 28719074 | 102.71 2019-09-30 20:00:11-04 | view | 406c46ed-90a4-4787-a43b-59a410c1a5fb | 1004545 | 566.01 (10 rows)
If you've gotten this far, it appears that you've successfully restored the operational database! 🎉
Take some time to understand the operational database. There are no specific assignment questions associated with this task, but the remainder of the assignment will be a lot easier if you actually understand what you're doing...
A starting point would be the dataset this assignment was built on. For reference, these are the tables and their schemas in the PostgreSQL dump you've been given:
user
: (user_id
, gender
, birthdate
)session
: (session_id
, user_id
)product
: (product_id
, brand
, category
, product_name
)product_name
: (category
, product_name
, description
)events
: (event_time
, event_type
, session_id
, product_id
, price
)category
: (category
, description
)brand
: (brand
, description
)You want to understand the real-world entities that each table is modeling, the schema of each table and how they interact (i.e., primary key—foreign key relationships). Do some exploratory data analysis, i.e., "poking around". For example, join two tables to trace the PK/FK. Join multiple tables to reconstruct a session: be able to identify what was purchased, by whom, for how much, etc.
The Jupyter notebook that contains the actual assignment is available here.