Schedule

Week  Description                                  Dates           Assignments
1     The Data Flywheel                            Sep 4           A1 Released: 9/4
2     Data Warehouses, Data Lakes, and Lakehouses  Sep 9, 11       A2 Released: 9/11; A1 Due: 9/11
3     Batch Processing I                           Sep 16, 18
4     Batch Processing II                          Sep 23, 25      A3 Released: 9/25; A2 Due: 9/25
5     Rubber, Meet Road                            Sept 30, Oct 1
6     Data Infrastructure for Machine Learning     Oct 7, 9        A4 Released: 10/9; A3 Due: 10/9
7     Reading Week: No Classes!
8     Midterm Exam                                 Oct 21, 23
9     Text Processing I                            Oct 28, 30      A4 Due: 10/30
10    Text Processing II                           Nov 4, 6        A5 Released: 11/4
11    Finding Similar Items                        Nov 11, 13
12    Graph Processing                             Nov 18, 20      A6 Released: 11/18; A5 Due: 11/18
13    Stream Processing                            Nov 25, 27
14    LLMs                                         Dec 2           A6 Due: 12/2
      Final Exam                                   TBD

Week 1: The Data Flywheel Sep 4

Key Questions

  • What does it mean to be an AI-first or data-driven company?
  • What's the data flywheel?
  • What's data engineering?
  • What are data platforms?
  • What are the 4 V's of data?
  • What is this course about? And what is it not about?

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

Slides

PDF slides for Sept 4 (v1.00)

Week 2: Data Warehouses, Data Lakes, and Lakehouses Sep 9, 11

Key Questions

  • What are the main differences between operational and analytical infrastructure?
  • What are data warehouses? What problems did they evolve to solve?
  • What are data lakes and lakehouses? What problems did they evolve to solve?
  • What are the components of modern data platforms?
  • How do operational and analytical data models differ?
  • What goes on in ETL/ELT?
  • How do different physical representations of data affect storage, compute, and other tradeoffs within data platforms?

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

Slides

PDF slides for Sept 9 (v1.01) PDF slides for Sept 11 (v1.01)

Week 3: Batch Processing I Sep 16, 18

Key Questions

  • What's the difference between scaling up and scaling out?
  • What are the implications of distributed processing across many machines?
  • What are the challenges for a divide-and-conquer strategy?
  • What challenges does partitioning address? What challenges does it exacerbate?
  • What challenges does replication address? What challenges does it exacerbate?
  • What's MapReduce and how does it work with HDFS?
  • What challenges do communication and skew present in scaling out?
  • Why is local aggregation important?

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

The following are optional. They comprise some of the primary sources from which lecture content is drawn, and can enrich your understanding and provide broader context.

Slides

PDF slides for Sept 16 (v1.01) PDF slides for Sept 18 (v1.01)

Week 4: Batch Processing II Sep 23, 25

Key Questions

  • In what ways does Spark improve over MapReduce?
  • How is distributed group by implemented efficiently at scale?
  • How do commutative and associative operations contribute to efficient distributed execution?
  • How does partitioning contribute to efficient distributed execution?
  • How do these concepts come together in efficient joins at scale?

Readings

Reread the assigned readings from last week (or read them for the first time if you haven't yet). They will make a lot more sense in light of the lecture material. In addition:

The above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

The following are optional. They comprise some of the primary sources from which lecture content is drawn, and can enrich your understanding and provide broader context.

Slides

PDF slides for Sept 23 (v1.00) PDF slides for Sept 25 (v1.00)

Week 5: Rubber, Meet Road Sept 30, Oct 1

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

Slides

Guest Lectures by Khaled Ammar:

PDF slides for Sept 30 (v1.00) PDF slides for Oct 1 (v1.00)

Week 6: Data Infrastructure for Machine Learning Oct 7, 9

Key Questions

  • What are the key components of an ML solution?
  • How is the supervised machine learning problem formulated?
  • What roles do data platforms and data engineering play?

Readings

The above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

The following are optional. They provide a deeper dive into lecture content that can enrich your understanding and provide broader context.

Slides

PDF slides for Oct 7 (v1.00) PDF slides for Oct 9 (v1.01)

Week 8: Midterm Exam Oct 21, 23

Week 9: Text Processing I Oct 28, 30

Key Questions

  • How do data products with operational requirements complicate lakehouse architectures?
  • Why does retrieval remain important in the era of LLMs?
  • How is retrieval formulated as the problem of computing vector similarity?
  • What's the intuition behind sparse and dense vector representations?
  • For sparse vector representations, how do we assign weights?
  • For sparse retrieval, how do we perform top-k retrieval efficiently?

Readings

Some of the above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

Slides

PDF slides for Oct 28 (v1.02) PDF slides for Oct 30 (v1.02)

Week 10: Text Processing II Nov 4, 6

Key Questions

  • For sparse vector representations, how do we assign weights?
  • For sparse retrieval, how do we perform top-k retrieval efficiently?
  • For dense vector representations, how do we assign weights?
  • For dense retrieval, how do we perform top-k retrieval efficiently?
  • Do we need specialized databases for sparse and dense vectors?
  • Can we also learn sparse representations from data?
  • Why do we need rerankers?
  • What's the relationship between RAG and LLM tool use?
  • What challenges does MCP solve?

Readings

The following are optional. They comprise some of the primary sources from which lecture content is drawn, and can enrich your understanding and provide broader context.

  • Pretrained Transformers for Text Ranking:
    • Chapter 5. Learned Dense Representations for Ranking (This chapter is a good starting point if you're interested in a deeper dive into how dense retrieval works. The focus is on generating dense vector representations using encoder-only BERT models. Although most work today uses decoder-only transformer models, many of the core ideas remain the same.)
  • What is the Model Context Protocol (MCP)?

Slides

PDF slides for Nov 4 (v1.01) PDF slides for Nov 6 (v1.00)

Week 11: Finding Similar Items Nov 11, 13

Key Questions

  • What do hash collisions have to do with finding similar items?
  • When do you use minhash? And random projections?
  • What are the knobs you control and what do they do?
  • What makes certain types of clustering algorithms amenable to scale-out distributed processing?
  • What are the parallels between k-means clustering and Gaussian Mixture Models?

Readings

The following are from online textbooks that explain what's covered in lecture. They are required to the extent that they help you understand what's covered in class. If you're confused about anything, consult these sources.

Note that the readings refer to "banding" (i.e., b bands of r rows per band). This is the same idea as in the slides, which refer to k minhash signatures (corresponding to the r rows) repeated n times (corresponding to the b bands).
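
To make the correspondence between the two notations concrete, here is a minimal Python sketch of the standard banding analysis: two items with Jaccard similarity s become a candidate pair if they agree on all r rows in at least one of the b bands, which happens with probability 1 - (1 - s^r)^b. The function names are hypothetical and are not taken from the slides or the readings; in the slides' notation, r plays the role of k and b plays the role of n.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two items with Jaccard similarity s collide in
    at least one of b bands, where each band has r rows.

    In the slides' notation: r = k (signatures per group), b = n (repetitions).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity s must be in [0, 1]")
    if r < 1 or b < 1:
        raise ValueError("r and b must be positive integers")
    # s ** r: all r rows in one band agree.
    # (1 - s ** r) ** b: no band agrees on all of its rows.
    return 1.0 - (1.0 - s ** r) ** b


def approximate_threshold(r: int, b: int) -> float:
    """Rough similarity at which the S-curve rises most steeply,
    using the common approximation (1/b) ** (1/r)."""
    if r < 1 or b < 1:
        raise ValueError("r and b must be positive integers")
    return (1.0 / b) ** (1.0 / r)


if __name__ == "__main__":
    # Illustrative (assumed) settings: 20 bands of 5 rows each.
    r, b = 5, 20
    print(f"approximate threshold: {approximate_threshold(r, b):.3f}")
    for s in (0.2, 0.4, 0.6, 0.8):
        print(f"s = {s:.1f} -> P(candidate) = {candidate_probability(s, r, b):.4f}")
```

These are the two "knobs": increasing r makes each band harder to match (fewer false positives), while increasing b gives more chances to match (fewer false negatives).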

Slides

PDF slides for Nov 11 (v1.01) PDF slides for Nov 13 (v1.00)

Week 12: Graph Processing Nov 18, 20

Key Questions

  • What are alternative approaches to representing graphs?
  • Why are graph algorithms challenging in MapReduce/Spark?
  • What is the general structure of graph traversals in MapReduce/Spark?
  • Why (even) MapReduce/Spark (for graph processing)?

Readings

If you have further interest in the discussion on scale up vs. scale out for graph processing, you might want to read the sources cited in the paper, in particular the Twitter WTF paper, the survey on graph processing systems, and the paper on COST by McSherry et al.

Slides

PDF slides for Nov 18 (v1.00) PDF slides for Nov 20 (v1.00)

Week 13: Stream Processing Nov 25, 27

Key Questions

  • What are the challenges associated with stream processing?
  • What are the common patterns in connecting data producers and data consumers?
  • What are some common data structures and algorithms for processing unbounded streams?
  • What's the distinction between event time and processing time?
  • How do stream processing platforms fit into the lakehouse?

Readings

  • Apache Beam: The world beyond batch: Streaming 101, Streaming 102. (These are long blog posts but they cover a lot of important concepts.)
  • Kafka Streams in Action:
    • Chapter 1. Welcome to the Kafka event streaming platform (A nice overview of streaming platforms in general.)

The above readings are available for free online through the university's library. The links above point directly to Waterloo proxied content, but if you're having trouble accessing the content (e.g., due to VPN settings), you might have to go through the library's portal (i.e., search for the book title and follow the appropriate link).

The following are optional, primarily provided to enrich your understanding.

Slides

PDF slides for Nov 25 (v1.00) PDF slides for Nov 27 (v1.00)

Week 14: LLMs Dec 2

Key Questions

Readings

Slides

Final Exam TBD
