Schedule

Week | Description | Dates | Assignments
1 | The Data Flywheel | Sep 4 | A1 Released: 9/4
2 | Data Warehouses, Data Lakes, and Lakehouses | Sep 9, 11 | A2 Released: 9/11; A1 Due: 9/11
3 | Batch Processing I | Sep 16, 18 |
4 | Batch Processing II | Sep 23, 25 | A3 Released: 9/25; A2 Due: 9/25
5 | Rubber, Meet Road | Sep 30, Oct 1 |
6 | Data Infrastructure for Machine Learning | Oct 7, 9 | A4 Released: 10/9; A3 Due: 10/9
7 | Reading Week: No Classes! | |
8 | Midterm Exam | Oct 21, 23 |
9 | Text Processing I | Oct 28, 30 | A5 Released: 10/30; A4 Due: 10/30
10 | Text Processing II | Nov 4, 6 |
11 | Clustering | Nov 11, 13 | A6 Released: 11/13; A5 Due: 11/13
12 | Graph Processing | Nov 18, 20 |
13 | Stream Processing | Nov 25, 27 | A6 Due: 11/27
14 | LLMs | Dec 2 |
Final Exam TBD

Week 1: The Data Flywheel Sep 4

Key Questions

  • What does it mean to be an AI-first or data-driven company?
  • What's the data flywheel?
  • What's data engineering?
  • What are data platforms?
  • What are the 4 V's of data?
  • What is this course about? And what is it not about?

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo-proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

Slides

PDF slides for Sept 4 (v1.00)

Week 2: Data Warehouses, Data Lakes, and Lakehouses Sep 9, 11

Key Questions

  • What are the main differences between operational and analytical infrastructure?
  • What are data warehouses? What problems did they evolve to solve?
  • What are data lakes and lakehouses? What problems did they evolve to solve?
  • What are the components of modern data platforms?
  • How do operational and analytical data models differ?
  • What goes on in ETL/ELT?
  • How do different physical representations of data affect storage, compute, and other tradeoffs within data platforms?

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo-proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

Slides

  • PDF slides for Sept 9 (v1.01)
  • PDF slides for Sept 11 (v1.00)

Week 3: Batch Processing I Sep 16, 18

Key Questions

  • What's the difference between scaling up and scaling out?
  • What are the implications of distributed processing across many machines?
  • What are the challenges for a divide-and-conquer strategy?
  • What challenges does partitioning address? What challenges does it exacerbate?
  • What challenges does replication address? What challenges does it exacerbate?
  • What's MapReduce and how does it work with HDFS?
  • What challenges do communication and skew present in scaling out?
  • Why is local aggregation important?
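
As a concrete handle on the last question, here is a minimal single-process sketch of local aggregation in a MapReduce-style word count. It is not Hadoop code: the function names and the two input splits are hypothetical, chosen only for illustration, and the "shuffle" is simulated by counting how many key-value pairs would leave the mappers.

```python
from collections import Counter, defaultdict

def map_without_aggregation(split):
    """Mapper that emits one (word, 1) pair per token."""
    return [(word, 1) for word in split.split()]

def map_with_local_aggregation(split):
    """Mapper that pre-sums counts within its own split (combiner-style)."""
    return list(Counter(split.split()).items())

def reduce_counts(pairs):
    """Reducer: sum all values that share a key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input: two splits, each handled by a separate mapper.
splits = ["a b a a", "b a b"]

naive_pairs = [p for s in splits for p in map_without_aggregation(s)]
combined_pairs = [p for s in splits for p in map_with_local_aggregation(s)]

# Same totals either way, but fewer pairs cross the (simulated) network.
assert reduce_counts(naive_pairs) == reduce_counts(combined_pairs)
print(len(naive_pairs), len(combined_pairs))  # prints: 7 4
```

Pre-summing is safe here only because addition is associative and commutative, so partial sums can be merged in any order.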

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo-proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

The following are optional. They comprise some of the primary sources from which lecture content is drawn, and they can enrich your understanding and provide broader context.

Slides

  • PDF slides for Sept 16 (v1.01)
  • PDF slides for Sept 18 (v1.01)

Week 4: Batch Processing II Sep 23, 25

Key Questions

  • In what ways does Spark improve over MapReduce?
  • How is distributed group by implemented efficiently at scale?
  • How do commutative and associative operations contribute to efficient distributed execution?
  • How does partitioning contribute to efficient distributed execution?
  • How do these concepts come together in efficient joins at scale?
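
To make the last two questions more tangible, here is a minimal single-process sketch of a hash-partitioned join. It is an illustration under stated assumptions, not how Spark implements joins: the function names, the `users` and `orders` records, and the partition count are all hypothetical. The idea it shows is that once both inputs are partitioned by the same function of the join key, matching records are guaranteed to land in the same partition, so each partition can be joined independently.

```python
from collections import defaultdict

def hash_partition(records, num_partitions):
    """Route each (key, value) record to a partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def local_join(left, right):
    """Hash join within one partition: build a table on left, probe with right."""
    table = defaultdict(list)
    for key, value in left:
        table[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]

def partitioned_join(left, right, num_partitions=3):
    """Partition both sides identically, then join partition by partition."""
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(local_join(lp, rp))  # each pair could run on its own worker
    return sorted(joined)

# Hypothetical data: (user_id, name) and (user_id, item).
users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (3, "pen"), (1, "lamp"), (4, "ink")]
print(partitioned_join(users, orders))
```

In a real cluster the partitioning step is the shuffle, and a heavily skewed key would overload whichever partition it hashes to.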

Readings

Reread the assigned readings from last week (or read them for the first time if you haven't yet). They will make a lot more sense in light of the lectures. In addition:

The above readings are available for free online through the university's library. The links above point directly to Waterloo-proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

The following are optional. They comprise some of the primary sources from which lecture content is drawn, and they can enrich your understanding and provide broader context.

Slides

  • PDF slides for Sept 23 (v1.00)
  • PDF slides for Sept 25 (v1.00)

Week 5: Rubber, Meet Road Sep 30, Oct 1

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo-proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

Slides

Guest Lectures by Khaled Ammar:

  • PDF slides for Sept 30 (v1.00)
  • PDF slides for Oct 1 (v1.00)

Week 6: Data Infrastructure for Machine Learning Oct 7, 9

Key Questions

  • What are the key components of an ML solution?
  • How is the supervised machine learning problem formulated?
  • What roles do data platforms and data engineering play?

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo-proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

The following are optional. They provide a deeper dive into lecture content that can enrich your understanding and provide broader context.

Slides

  • PDF slides for Oct 7 (v1.00)
  • PDF slides for Oct 9 (v1.01)

Week 8: Midterm Exam Oct 21, 23

Week 9: Text Processing I Oct 28, 30

Key Questions

Readings

Slides

Week 10: Text Processing II Nov 4, 6

Key Questions

Readings

Slides

Week 11: Clustering Nov 11, 13

Key Questions

Readings

Slides

Week 12: Graph Processing Nov 18, 20

Key Questions

Readings

Slides

Week 13: Stream Processing Nov 25, 27

Key Questions

Readings

Slides

Week 14: LLMs Dec 2

Key Questions

Readings

Slides

Final Exam TBD
