Assignment 2: ETL due 10am September 25

This assignment covers the entire ETL process, from the operational database to the analytical database. We hand you a PostgreSQL dump of the operational database of a hypothetical e-commerce site, built on this dataset. At a high level, we ask you to:

Specifically, in the data warehouse, you will build:

Install PostgreSQL

Your first step is to install PostgreSQL on your machine. There are many ways to accomplish this, but you are expected to figure out how to do so yourself. Once you have done so, use pg_ctl to start a PostgreSQL server on your local machine.
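For example, once the command-line tools are on your PATH, something along these lines typically works. The data directory path below is just a placeholder, and note that package managers such as Homebrew or apt often manage the server through their own service commands instead:

# one-time: initialize a data directory (any path you like)
initdb -D ~/pgdata

# start the server, writing logs to a file
pg_ctl -D ~/pgdata -l ~/pgdata/server.log start

# check status, and stop the server when you're done
pg_ctl -D ~/pgdata status
pg_ctl -D ~/pgdata stop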

Grab the database dump retail_schema_20250826.dump from here.

With the dump file, you can restore the database cs451 as follows:

dropdb --if-exists cs451
createdb cs451

pg_restore \
 -d cs451 \
 --jobs 4 --clean --if-exists \
 --no-owner --no-acl \
 retail_schema_20250826.dump

Launch the psql shell and connect to the cs451 database that you've restored.
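Assuming the server is running locally on the default port and your OS user has access to the database, something like this should get you to a prompt:

psql -d cs451

Once connected, issue the following command: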

\dt retail.*

You will see all the tables under the schema retail, as follows:

cs451=# \dt retail.*
            List of relations
 Schema |     Name     | Type  |  Owner   
--------+--------------+-------+----------
 retail | brand        | table | jimmylin
 retail | category     | table | jimmylin
 retail | events       | table | jimmylin
 retail | product      | table | jimmylin
 retail | product_name | table | jimmylin
 retail | session      | table | jimmylin
 retail | user         | table | jimmylin
(7 rows)

As a sanity check, issue a simple SQL query:

cs451=# select * from retail.events limit 10;
       event_time       | event_type |              session_id              | product_id |   price   
------------------------+------------+--------------------------------------+------------+-----------
 2019-09-30 20:00:00-04 | view       | 72d76fde-8bb3-4e00-8c23-a032dfed738c |   44600062 |     35.79
 2019-09-30 20:00:00-04 | view       | 9333dfbd-b87a-4708-9857-6336556b0fcc |    3900821 |     33.20
 2019-09-30 20:00:01-04 | view       | 566511c2-e2e3-422b-b695-cf8e6e792ca8 |   17200506 |    543.10
 2019-09-30 20:00:01-04 | view       | 7c90fc70-0e80-4590-96f3-13c02c18c713 |    1307067 |    251.74
 2019-09-30 20:00:04-04 | view       | c6bd7419-2748-4c56-95b4-8cec9ff8b80d |    1004237 | 108198.00
 2019-09-30 20:00:05-04 | view       | 0d0d91c2-c9c2-4e81-90a5-86594dec0db9 |    1480613 |    908.62
 2019-09-30 20:00:08-04 | view       | 4fe811e9-91de-46da-90c3-bbd87ed3a65d |   17300353 |    380.96
 2019-09-30 20:00:08-04 | view       | 6280d577-25c8-4147-99a7-abc6048498d6 |   31500053 |     41.16
 2019-09-30 20:00:10-04 | view       | ac1cd4e5-a3ce-4224-a2d7-ff660a105880 |   28719074 |    102.71
 2019-09-30 20:00:11-04 | view       | 406c46ed-90a4-4787-a43b-59a410c1a5fb |    1004545 |    566.01
(10 rows)
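A couple of additional quick checks that also start to build intuition about the data (these scans may take a moment if the events table is large):

select count(*) from retail.events;
select event_type, count(*) from retail.events group by event_type order by count(*) desc;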

If you've gotten this far, it appears that you've successfully restored the operational database! 🎉

Understand the Operational Database

Take some time to understand the operational database. There are no specific assignment questions associated with this task, but the remainder of the assignment will be a lot easier if you actually understand what you're doing...

A starting point would be the dataset this assignment was built on. For reference, the PostgreSQL dump you've been given contains the seven tables under the retail schema listed above; make sure you know the columns, types, and keys of each one.
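In psql, the \d meta-command pulls up a table's definition (columns, types, indexes, and constraints); for example:

\d retail.events
\d retail.product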

You want to understand the real-world entities that each table is modeling, the schema of each table, and how the tables relate to one another (i.e., primary key-foreign key relationships). Do some exploratory data analysis, i.e., "poking around". For example, join two tables to trace a PK/FK relationship, or join multiple tables to reconstruct a session: be able to identify what was purchased, by whom, for how much, etc.
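As a concrete sketch of this kind of poking around, the query below joins events to product to trace one PK/FK pair. The product-side column name and the 'purchase' value for event_type are assumptions; verify both against the actual schemas and data first:

select e.event_time,
       e.session_id,
       e.price,
       p.*
  from retail.events e
  join retail.product p on p.product_id = e.product_id  -- assumed FK -> PK
 where e.event_type = 'purchase'                        -- assumed event_type value
 order by e.event_time
 limit 10;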

The Actual Assignment

The Jupyter notebook that contains the actual assignment is available here.

Use this link to create an assignment repo for submission. In the assignment repo, enter your answers in assignment2.ipynb. Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.
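For reference, a typical submission flow looks something like this (the branch name main is an assumption; use whatever default branch your assignment repo was created with):

git add assignment2.ipynb
git commit -m "Assignment 2 answers"
git push origin main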

Already stated on the software page, but worth repeating: while the assignments are turned in as Jupyter notebooks, notebooks are not necessarily the most convenient development environment. Different software capabilities can be accessed via JupyterLab, from a shell (terminal), from VS Code, etc. You might want to experiment with different approaches for different usage scenarios.

Grading Scheme

What does "following instructions" mean? These are "free points" as long as you follow the instructions provided in this assignment. They exist to handle the scenario where all your answers are correct, but you didn't follow the instructions, which forced us to go out of our way to fix your submission so that it conforms to them. (For example, you removed the ids that we use for tracking, which would make your notebook much more difficult to grade.) In these and other related cases, we will dock points from this category.

Total: 80 points
