This assignment covers the entire ETL process, from the operational database to the analytical database. We hand you a PostgreSQL dump of the operational database of a hypothetical e-commerce site, built on this dataset. At a high level, we ask you to extract data from the operational database, transform it, and load it into an analytical database.

Specifically, in the data warehouse, you will build a set of dimension tables (`dim_user`, `dim_brand`, `dim_category`, `dim_product`, `dim_date`) and join everything together into the final analytical table; the grading rubric below breaks down the individual pieces.
Your first step is to install PostgreSQL on your machine. There are many ways to accomplish this, but you are expected to figure out how to do so yourself. Once you have done so, use `pg_ctl` to start a PostgreSQL server on your local machine.
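The details depend on how you installed PostgreSQL, but as a minimal sketch (assuming you manage your own database cluster, and using `~/pgdata` as a purely illustrative data directory; packaged installs often handle this for you):

```sh
# One-time setup: create a fresh database cluster in ~/pgdata.
initdb -D ~/pgdata

# Start the server, writing logs to a file.
# Later, `pg_ctl -D ~/pgdata stop` shuts it down.
pg_ctl -D ~/pgdata -l ~/pgdata/server.log start
```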
Grab the database dump `retail_schema_20250826.dump` from here. With the dump file, you can restore the database `cs451` as follows:

```sh
dropdb --if-exists cs451
createdb cs451
pg_restore \
  -d cs451 \
  --jobs 4 --clean --if-exists \
  --no-owner --no-acl \
  retail_schema_20250826.dump
```
Launch the `psql` shell to connect to the `cs451` database that you've restored. With the following command:

```
\dt retail.*
```

you will see all the tables under the schema `retail`, as follows:
```
cs451=# \dt retail.*
            List of relations
 Schema |     Name     | Type  |  Owner
--------+--------------+-------+----------
 retail | brand        | table | jimmylin
 retail | category     | table | jimmylin
 retail | events       | table | jimmylin
 retail | product      | table | jimmylin
 retail | product_name | table | jimmylin
 retail | session      | table | jimmylin
 retail | user         | table | jimmylin
(7 rows)
```
As a sanity check, issue a simple SQL query:
```
cs451=# select * from retail.events limit 10;
       event_time       | event_type |              session_id              | product_id |   price
------------------------+------------+--------------------------------------+------------+-----------
 2019-09-30 20:00:00-04 | view       | 72d76fde-8bb3-4e00-8c23-a032dfed738c |   44600062 |     35.79
 2019-09-30 20:00:00-04 | view       | 9333dfbd-b87a-4708-9857-6336556b0fcc |    3900821 |     33.20
 2019-09-30 20:00:01-04 | view       | 566511c2-e2e3-422b-b695-cf8e6e792ca8 |   17200506 |    543.10
 2019-09-30 20:00:01-04 | view       | 7c90fc70-0e80-4590-96f3-13c02c18c713 |    1307067 |    251.74
 2019-09-30 20:00:04-04 | view       | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |    1004237 | 108198.00
 2019-09-30 20:00:05-04 | view       | 0d0d91c2-c9c2-4e81-90a5-86594dec0db9 |    1480613 |    908.62
 2019-09-30 20:00:08-04 | view       | 4fe811e9-91de-46da-90c3-bbd87ed3a65d |   17300353 |    380.96
 2019-09-30 20:00:08-04 | view       | 6280d577-25c8-4147-99a7-abc6048498d6 |   31500053 |     41.16
 2019-09-30 20:00:10-04 | view       | ac1cd4e5-a3ce-4224-a2d7-ff660a105880 |   28719074 |    102.71
 2019-09-30 20:00:11-04 | view       | 406c46ed-90a4-4787-a43b-59a410c1a5fb |    1004545 |    566.01
(10 rows)
```
If you've gotten this far, it appears that you've successfully restored the operational database! 🎉
Take some time to understand the operational database. There are no specific assignment questions associated with this task, but the remainder of the assignment will be a lot easier if you actually understand what you're doing...
A starting point would be the dataset this assignment was built on. For reference, these are the tables and their schemas in the PostgreSQL dump you've been given:
- `user`: (`user_id`, `gender`, `birthdate`)
- `session`: (`session_id`, `user_id`)
- `product`: (`product_id`, `brand`, `category`, `product_name`)
- `product_name`: (`category`, `product_name`, `description`)
- `events`: (`event_time`, `event_type`, `session_id`, `product_id`, `price`)
- `category`: (`category`, `description`)
- `brand`: (`brand`, `description`)

You want to understand the real-world entities that each table models, the schema of each table, and how the tables interact (i.e., primary key/foreign key relationships). Do some exploratory data analysis, i.e., "poking around". For example, join two tables to trace a PK/FK relationship. Join multiple tables to reconstruct a session: be able to identify what was purchased, by whom, for how much, etc.
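As one illustrative sketch of this kind of query (the join keys follow from the schemas above, and the session id comes from the sample output earlier; confirming that these are the actual PK/FK pairs is part of the exercise):

```sql
-- Reconstruct a single session: what happened, to which products,
-- and by which user. Note that "user" is a reserved word in
-- PostgreSQL, hence the quoting.
SELECT e.event_time,
       e.event_type,
       u.user_id,
       p.product_id,
       p.brand,
       e.price
FROM retail.events e
JOIN retail.session s ON s.session_id = e.session_id
JOIN retail."user" u  ON u.user_id    = s.user_id
JOIN retail.product p ON p.product_id = e.product_id
WHERE e.session_id = '72d76fde-8bb3-4e00-8c23-a032dfed738c'
ORDER BY e.event_time;
```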
The Jupyter notebook that contains the actual assignment is available here.
Use this link to create an assignment repo for submission.
In the assignment repo, enter your answers in `assignment2.ipynb`.
Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.
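If the git workflow is new to you, the submission amounts to something like the following (the branch name `main` is an assumption; use your repo's default branch):

```sh
# Stage the notebook, record the change, and push it back to origin.
git add assignment2.ipynb
git commit -m "Assignment 2 answers"
git push origin main
```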
Already stated on the software page, but worth repeating: while the assignments are to be turned in as Jupyter notebooks, notebooks are not necessarily the most convenient development environment. Different software capabilities can be accessed via JupyterLab, from a shell (terminal), from VS Code, etc. You might want to experiment with different approaches for different usage scenarios.
Points are allocated as follows:

- `qcell_1b76x2` SQL queries: 14 points total (3/3/3/5 for Q1, Q2, Q3, and Q4, respectively)
- `codecell_30z8le` load the CSV files into DataFrames: 3 points
- `codecell_41ax14` build `dim_user`: 5 points
- `codecell_43k3n9` build `dim_brand`, `dim_category`, `dim_product`: 13 points (4/4/5 for the three tables, respectively)
- `codecell_44qm5c` build `dim_date`: 5 points
- `codecell_51ep7v` cleaning: 6 points
- `codecell_52hg6x` and `codecell_52bf5d` cap silly prices: 4 points
- `codecell_53l2kp` lookup tables: 4 points
- `codecell_54aaaa` joining everything together: 6 points
- `qcell_6a9876` size: 3 points
- `qcell_6b1234` answers: 8 points
- Following instructions (including `README.md`): 9 points

What does "following instructions" mean? These are "free points" awarded if you follow the instructions provided in this assignment. They exist to handle the scenario where all your answers are correct, but you did not follow the instructions, forcing us to go out of our way to fix your submission so that it conforms. (For example, you removed the ids that we use for tracking, which would make your submission much more difficult to grade.) In these and other related cases, we will dock points from this category.
Total: 80 points