Schedule

Part  Description                        Dates                   Assignments
1     MapReduce Algorithm Design         Sep 6, 11, 13, 18       A0 (Warmup): Sep 18
2     From MapReduce to Spark            Sep 20, 25              A1 (Counting in MR): Sep 25
3     Analyzing Text                     Sep 27, Oct 2           A2 (Counting in Spark): Oct 2
4     Analyzing Graphs                   Oct 4, 11               A3 (Indexing): Oct 12
5     Analyzing Relational Data          Oct 16, 18, 23
6     Data Mining and Machine Learning   Oct 25, 30, Nov 1, 6    A4 (PageRank): Oct 25
7     Mutable State                      Nov 8, 13               A5 (SQL): Nov 8
8     Analyzing Graphs, Redux            Nov 15, 20
9     Real-Time Analytics                Nov 22, 27              A6 (ML): Nov 22
10    Looking Ahead                      Nov 29                  A7 (Streaming): Nov 29

Part 1: MapReduce Algorithm Design Sep 6, 11, 13, 18

Topics

  • What's this course about?
  • Why big data?
  • The datacenter is the computer and other "big ideas"
  • MapReduce programming model
  • Cloud computing and datacenters
  • Hadoop API
  • Hadoop physical execution
  • MapReduce design patterns
  • Intermediate aggregation and combiners
  • Partitioning, grouping, sorting, and monoids

Readings

  • Data-Intensive Text Processing with MapReduce
  • Hadoop: The Definitive Guide (4th Edition):
    • Chapter 1: Meet Hadoop
    • Chapter 2: MapReduce
    • Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
    • Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
    • Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
    • Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
    • Chapter 8: MapReduce Types and Formats
    • Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX (Mac) PDF   Part 1a: September 6

PPTX (Mac) PDF   Part 1b: September 11

PPTX (Mac) PDF   Part 1c: September 13

PPTX (Mac) PDF   Part 1d: September 18

Part 2: From MapReduce to Spark Sep 20, 25

Topics

  • Evolution of dataflow abstractions
  • MapReduce, Pig, Dryad, Spark, Flink, etc.

Readings

  • Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
  • Learning Spark (Optional):
    • Chapter 1: Introduction to Data Analysis with Spark
    • Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
    • Chapter 3: Programming with RDDs
    • Chapter 4: Working with Key/Value Pairs
    • Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Note that the Spark book is somewhat outdated: it covers Spark 1.3, while we're using Spark 2.1. All the material in the book can be found in various sources online, but you'll have to hunt around for them; the book is useful primarily as a single reference that gathers everything together.

Slides

PPTX (Mac) PDF   Part 2a: September 20

PPTX (Mac) PDF   Part 2b: September 25

Part 3: Analyzing Text Sep 27, Oct 2

Topics

  • Language models and machine translation
  • Inverted indexing and search

Readings

Slides

PPTX (Mac) PDF   Part 3a: September 27

PPTX (Mac) PDF   Part 3b: October 2

Part 4: Analyzing Graphs Oct 4, 11

Topics

  • Graph representations
  • Parallel breadth-first search
  • PageRank and random walks
  • Issues and challenges with dataflow abstractions

Readings

Slides

PPTX (Mac) PDF   Part 4a: October 4

PPTX (Mac) PDF   Part 4b: October 11

Part 5: Analyzing Relational Data Oct 16, 18, 23

Topics

  • OLTP vs. OLAP
  • Data warehousing and data lakes, ETL
  • SQL-on-Hadoop: relational data processing with MapReduce and Spark
  • Optimizations for relational processing: row vs. column stores, vectorized processing
  • Semistructured data and record reconstruction (Parquet)

Readings

Slides

PPTX (Mac) PDF   Part 5a: October 16

PPTX (Mac) PDF   Part 5b: October 18

PPTX (Mac) PDF   Part 5c: October 23

Part 6: Data Mining and Machine Learning Oct 25, 30, Nov 1, 6

Topics

  • Supervised machine learning: binary classification
  • Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
  • Production machine learning pipelines
  • Hashing: minhash, random projections, etc.
  • Clustering: k-means, Gaussian mixture models

Readings

Slides

PPTX (Mac) PDF   Part 6a: October 25

PPTX (Mac) PDF   Part 6b: October 30

PPTX (Mac) PDF   Part 6c: November 1

PPTX (Mac) PDF   Part 6d: November 6

Part 7: Mutable State Nov 8, 13

Topics

  • Bigtable/HBase: Log-structured merge trees
  • Distributed hash tables
  • Consistency, latency, and availability tradeoffs

Readings

Slides

PPTX (Mac) PDF   Part 7a: November 8

PPTX (Mac) PDF   Part 7b: November 13

Part 8: Analyzing Graphs, Redux Nov 15, 20

Topics

  • Bulk synchronous parallel: "think like a vertex" (Giraph)
  • Alternative approaches: GraphX

Readings

Slides

PPTX (Mac) PDF   Part 8a: November 15

PPTX (Mac) PDF   Part 8b: November 20

Part 9: Real-Time Analytics Nov 22, 27

Topics

  • Stream processing semantics, issues, and frameworks
  • Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
  • Integrating batch and stream processing

Readings

Slides

PPTX (Mac) PDF   Part 9a: November 22

PPTX (Mac) PDF   Part 9b: November 27

Part 10: Looking Ahead Nov 29

Slides

PPTX (Mac) PDF   Bonus: November 29
