Data-Intensive Distributed Computing

Schedule

Part	Description	Dates	CS 451/651 Assignments	CS 431/631 Assignments
1	MapReduce Algorithm Design	Jan 4, 9, 11, 16	A0: Jan 16
2	From MapReduce to Spark	Jan 18, 23	A1: Jan 23	A0: Jan 18
3	Analyzing Text	Jan 25, 30	A2: Jan 30	A1: Jan 25
4	Analyzing Graphs	Feb 1, 6	A3: Feb 6	A2: Feb 6
5	Analyzing Relational Data	Feb 8, 13, 15		A3: Feb 15
No classes!
6	Data Mining and Machine Learning	Feb 27, Mar 1, 6, 8	A4: Feb 27
7	Mutable State	Mar 13, 15	A5: Mar 13	A4: Mar 13
8	Analyzing Graphs, Redux	Mar 20, 22
9	Real-Time Analytics	Mar 27, 29	A6: Mar 27	A5: Mar 29
10	Looking Ahead	Apr 3	A7: Apr 3

Part 1: MapReduce Algorithm Design January 4, 9, 11, 16

Topics

What's this course about?
Why big data?
The datacenter is the computer and other "big ideas"
MapReduce programming model
Cloud computing and datacenters
Hadoop API
Hadoop physical execution
MapReduce design patterns
Intermediate aggregation and combiners
Partitioning, grouping, sorting, and monoids
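
As a rough illustration of the combiner and monoid topics above, the following is a minimal sketch in plain Python (it does not use the Hadoop API, and all function names are hypothetical). It computes a per-key mean. A mean cannot be merged directly across partitions, but (sum, count) pairs can, because pairwise addition is associative and has the identity (0, 0). That is why the same combine function can serve as both combiner and reducer.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, (sum, count)) so that values form a monoid under pairwise addition.
    for key, value in records:
        yield key, (value, 1)

def combine(a, b):
    # Associative and commutative; the identity element is (0, 0).
    return (a[0] + b[0], a[1] + b[1])

def reduce_by_key(pairs):
    # Fold all values for each key with combine().
    acc = defaultdict(lambda: (0, 0))
    for key, value in pairs:
        acc[key] = combine(acc[key], value)
    return dict(acc)

def mean_per_key(partitions):
    # Run the combiner locally on each partition, then reduce the partial results.
    partials = []
    for part in partitions:
        partials.extend(reduce_by_key(map_phase(part)).items())
    totals = reduce_by_key(partials)
    return {k: s / c for k, (s, c) in totals.items()}
```

Because combine is associative, the result does not depend on how the records are split into partitions.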

Readings

Data-Intensive Text Processing with MapReduce
Hadoop: The Definitive Guide (4th Edition) (Optional for CS 431/631, recommended for CS 451/651):
- Chapter 1: Meet Hadoop
- Chapter 2: MapReduce
- Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
- Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
- Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
- Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
- Chapter 8: MapReduce Types and Formats
- Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data distribution")

Slides

PPTX (Mac) PDF Part 1a: January 4

PPTX (Mac) PDF Part 1b: January 9

PPTX (Mac) PDF Part 1c: January 11

PPTX (Mac) PDF Part 1d: January 16

Part 2: From MapReduce to Spark January 18, 23

Topics

Evolution of dataflow abstractions
MapReduce, Pig, Dryad, Spark, Flink, etc.

Readings

Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
Learning Spark (Optional):
- Chapter 1: Introduction to Data Analysis with Spark
- Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
- Chapter 3: Programming with RDDs
- Chapter 4: Working with Key/Value Pairs
- Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)
- In all the readings above, CS 451/651 students should focus on the Scala examples since they will only be working with Spark's Scala API. CS 431/631 students should focus on the Python examples, for a similar reason.

Note that the Spark book is a bit outdated since it covers Spark 1.3; we're using Spark 2.1. All the material in the book can be found in a multitude of sources online, but you'll have to hunt around for resources; the book is useful primarily as a single reference that gathers everything together.

Slides

PPTX (Mac) PDF Part 2a: January 18

PPTX (Mac) PDF Part 2b: January 23

Part 3: Analyzing Text January 25, 30

Topics

Language models and machine translation
Inverted indexing and search

Readings

Data-Intensive Text Processing with MapReduce — Chapter 4: Inverted Indexing for Text Retrieval

Slides

PPTX (Mac) PDF Part 3a: January 25

PPTX (Mac) PDF Part 3b: January 30

Part 4: Analyzing Graphs February 1, 6

Topics

Graph representations
Parallel breadth-first search
PageRank and random walks
Issues and challenges with dataflow abstractions
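
To make the PageRank topic above concrete, here is a minimal single-machine sketch in plain Python, assuming an adjacency-list graph in which every neighbor also appears as a key. The function name, the damping value of 0.85, and the fixed iteration count are illustrative assumptions, not taken from the course materials. Each iteration distributes a node's rank evenly over its out-links, which is the step a MapReduce or Spark implementation would parallelize.

```python
def pagerank(graph, damping=0.85, iterations=50):
    # graph: dict mapping each node to a list of its out-neighbors.
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts the round with the random-jump mass.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        dangling = 0.0
        for v in nodes:
            out = graph[v]
            if out:
                # Split this node's damped rank evenly over its out-links.
                share = damping * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
            else:
                # Nodes without out-links would otherwise lose their mass.
                dangling += damping * rank[v]
        # Spread the dangling mass uniformly so ranks keep summing to 1.
        for v in nodes:
            new_rank[v] += dangling / n
        rank = new_rank
    return rank
```

Each iteration needs the complete rank vector from the previous one, so iterative graph algorithms like this chain many dependent passes over the data.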

Readings

Data-Intensive Text Processing with MapReduce — Chapter 5: Graph Algorithms

Slides

PPTX (Mac) PDF Part 4a: February 1

PPTX (Mac) PDF Part 4b: February 6

Part 5: Analyzing Relational Data February 8, 13, 15

Topics

OLTP vs. OLAP
Data warehousing and data lakes, ETL
SQL-on-Hadoop: relational data processing with MapReduce and Spark
Optimizations for relational processing: row vs. column stores, vectorized processing
Semistructured data and record reconstruction (Parquet)

Readings

Data-Intensive Text Processing with MapReduce — Chapter 6: Processing Relational Data
MapReduce: A major step backwards
Chaudhuri et al. (2011) An overview of business intelligence technology, CACM, 54(8):88-98.

Slides

PPTX (Mac) PDF Part 5a: February 8

PPTX (Mac) PDF Part 5b: February 13

PPTX (Mac) PDF Part 5c: February 15

Part 6: Data Mining and Machine Learning February 27, March 1, 6, 8

Topics

Supervised machine learning: binary classification
Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
Production machine learning pipelines
Hashing: minhash, random projections, etc.
Clustering: k-means, Gaussian mixture models

Readings

Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
Jimmy Lin and Dmitriy Ryaboy. Scaling Big Data Mining Infrastructure: The Twitter Experience, SIGKDD Explorations, 14(2):6-19, 2012.

Slides

PPTX (Mac) PDF Part 6a: February 27

PPTX (Mac) PDF Part 6b: March 1

PPTX (Mac) PDF Part 6c: March 6

PPTX (Mac) PDF Part 6d: March 8

Part 7: Mutable State March 13, 15

Topics

Bigtable/HBase: log-structured merge trees
Distributed hash tables
Consistency, latency, and availability tradeoffs

Slides

PPTX (Mac) PDF Part 7a: March 13

PPTX (Mac) PDF Part 7b: March 15

Part 8: Analyzing Graphs, Redux March 20, 22

Topics

Bulk synchronous parallel: "think like a vertex" (Giraph)
Alternative approaches: GraphX

Slides

PPTX (Mac) PDF Part 8a: March 20

PPTX (Mac) PDF Part 8b: March 22

Part 9: Real-Time Analytics March 27, 29

Topics

Stream processing semantics, issues, and frameworks
Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
Integrating batch and stream processing
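
As one example of the probabilistic data structures listed above, the following is a minimal Bloom filter sketch in plain Python. The class name, the default sizes, and the use of salted SHA-256 digests as hash functions are assumptions chosen for illustration, not a reference implementation. A Bloom filter answers "possibly present" or "definitely absent": it can return false positives but never false negatives.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

Memory use is fixed by num_bits regardless of how many items are added, at the cost of a false-positive rate that grows as the filter fills up.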

Slides

PPTX (Mac) PDF Part 9a: March 27

PPTX (Mac) PDF Part 9b: March 29

Part 10: Looking Ahead April 3

Slides

PPTX (Mac) PDF Part 10: April 3

Syllabus Data-Intensive Distributed Computing (Winter 2018)
