Data-Intensive Distributed Computing

Schedule

Part	Description	Dates	Assignments
1	MapReduce Algorithm Design	Sep 6, 11, 13, 18	A0 (Warmup): Sep 18
2	From MapReduce to Spark	Sep 20, 25	A1 (Counting in MR): Sep 25
3	Analyzing Text	Sep 27, Oct 2	A2 (Counting in Spark): Oct 2
4	Analyzing Graphs	Oct 4, 11	A3 (Indexing): Oct 12
5	Analyzing Relational Data	Oct 16, 18, 23
6	Data Mining and Machine Learning	Oct 25, 30, Nov 1, 6	A4 (PageRank): Oct 25
7	Mutable State	Nov 8, 13	A5 (SQL): Nov 8
8	Analyzing Graphs, Redux	Nov 15, 20
9	Real-Time Analytics	Nov 22, 27	A6 (ML): Nov 22
10	Looking Ahead	Nov 29	A7 (Streaming): Nov 29

Part 1: MapReduce Algorithm Design Sep 6, 11, 13, 18

Topics

What's this course about?
Why big data?
The datacenter is the computer and other "big ideas"
MapReduce programming model
Cloud computing and datacenters
Hadoop API
Hadoop physical execution
MapReduce design patterns
Intermediate aggregation and combiners
Partitioning, grouping, sorting, and monoids

Readings

Data-Intensive Text Processing with MapReduce
Hadoop: The Definitive Guide (4th Edition):
- Chapter 1: Meet Hadoop
- Chapter 2: MapReduce
- Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
- Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
- Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
- Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
- Chapter 8: MapReduce Types and Formats
- Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX (Mac) PDF Part 1a: September 6

PPTX (Mac) PDF Part 1b: September 11

PPTX (Mac) PDF Part 1c: September 13

PPTX (Mac) PDF Part 1d: September 18

Part 2: From MapReduce to Spark Sep 20, 25

Topics

Evolution of dataflow abstractions
MapReduce, Pig, Dryad, Spark, Flink, etc.

Readings

Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
Learning Spark (Optional):
- Chapter 1: Introduction to Data Analysis with Spark
- Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
- Chapter 3: Programming with RDDs
- Chapter 4: Working with Key/Value Pairs
- Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Note that the Spark book is a bit outdated since it covers Spark 1.3; we're using Spark 2.1. All the material in the book can be found in a multitude of sources online, but you'll have to hunt around for resources. The book is useful primarily as a single reference that gathers everything together.

Slides

PPTX (Mac) PDF Part 2a: September 20

PPTX (Mac) PDF Part 2b: September 25

Part 3: Analyzing Text Sep 27, Oct 2

Topics

Language models and machine translation
Inverted indexing and search

Readings

Data-Intensive Text Processing with MapReduce — Chapter 4: Inverted Indexing for Text Retrieval

Slides

PPTX (Mac) PDF Part 3a: September 27

PPTX (Mac) PDF Part 3b: October 2

Part 4: Analyzing Graphs Oct 4, 11

Topics

Graph representations
Parallel breadth-first search
PageRank and random walks
Issues and challenges with dataflow abstractions

Readings

Data-Intensive Text Processing with MapReduce — Chapter 5: Graph Algorithms

Slides

PPTX (Mac) PDF Part 4a: October 4

PPTX (Mac) PDF Part 4b: October 11

Part 5: Analyzing Relational Data Oct 16, 18, 23

Topics

OLTP vs. OLAP
Data warehousing and data lakes, ETL
SQL-on-Hadoop: relational data processing with MapReduce and Spark
Optimizations for relational processing: row vs. column stores, vectorized processing
Semistructured data and record reconstruction (Parquet)

Readings

Data-Intensive Text Processing with MapReduce — Chapter 6: Processing Relational Data
MapReduce: A major step backwards
Chaudhuri et al. (2011) An overview of business intelligence technology, CACM, 54(8):88-98.

Slides

PPTX (Mac) PDF Part 5a: October 16

PPTX (Mac) PDF Part 5b: October 18

PPTX (Mac) PDF Part 5c: October 23

Part 6: Data Mining and Machine Learning Oct 25, 30, Nov 1, 6

Topics

Supervised machine learning: binary classification
Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
Production machine learning pipelines
Hashing: minhash, random projections, etc.
Clustering: k-means, Gaussian mixture models

Readings

Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)
Jimmy Lin and Dmitriy Ryaboy. Scaling Big Data Mining Infrastructure: The Twitter Experience, SIGKDD Explorations, 14(2):6-19, 2012.

Slides

PPTX (Mac) PDF Part 6a: October 25

PPTX (Mac) PDF Part 6b: October 30

PPTX (Mac) PDF Part 6c: November 1

PPTX (Mac) PDF Part 6d: November 6

Part 7: Mutable State Nov 8, 13

Topics

Bigtable/HBase: Log-structured merge trees
Distributed hash tables
Consistency, latency, and availability tradeoffs

Readings

The original Bigtable paper.
The original DHT paper.
Daniel Abadi. Consistency Tradeoffs in Modern Distributed Database System Design, Computer, 45(2):37-42, 2012.

Slides

PPTX (Mac) PDF Part 7a: November 8

PPTX (Mac) PDF Part 7b: November 13

Part 8: Analyzing Graphs, Redux Nov 15, 20

Topics

Bulk synchronous parallel: "think like a vertex" (Giraph)
Alternative approaches: GraphX

Readings

Sherif Sakr. Large-Scale Graph Processing Systems, 2016.

Slides

PPTX (Mac) PDF Part 8a: November 15

PPTX (Mac) PDF Part 8b: November 20

Part 9: Real-Time Analytics Nov 22, 27

Topics

Stream processing semantics, issues, and frameworks
Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
Integrating batch and stream processing

Readings

Zaharia et al. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013.
Kulkarni et al. Twitter Heron: Stream Processing at Scale, SIGMOD 2015.
Apache Beam: The world beyond batch: Streaming 101, Streaming 102.
If you're interested, here's my rant about the Lambda and Kappa architectures.

Slides

PPTX (Mac) PDF Part 9a: November 22

PPTX (Mac) PDF Part 9b: November 27

Part 10: Looking Ahead Nov 29

Slides

PPTX (Mac) PDF Bonus: November 29

Syllabus Data-Intensive Distributed Computing (Fall 2018)
