Schedule

Part | Description                      | Dates                | CS 451/651 Assignments | CS 431/631 Assignments
-----+----------------------------------+----------------------+------------------------+-----------------------
1    | MapReduce Algorithm Design       | Jan 4, 9, 11, 16     | A0: Jan 16             |
2    | From MapReduce to Spark          | Jan 18, 23           | A1: Jan 23             | A0: Jan 18
3    | Analyzing Text                   | Jan 25, 30           | A2: Jan 30             | A1: Jan 25
4    | Analyzing Graphs                 | Feb 1, 6             | A3: Feb 6              | A2: Feb 6
5    | Analyzing Relational Data        | Feb 8, 13, 15        |                        | A3: Feb 15
     | No classes!                      |                      |                        |
6    | Data Mining and Machine Learning | Feb 27, Mar 1, 6, 8  | A4: Feb 27             |
7    | Mutable State                    | Mar 13, 15           | A5: Mar 13             | A4: Mar 13
8    | Analyzing Graphs, Redux          | Mar 20, 22           |                        |
9    | Real-Time Analytics              | Mar 27, 29           | A6: Mar 27             | A5: Mar 29
10   | Looking Ahead                    | Apr 3                | A7: Apr 3              |

Part 1: MapReduce Algorithm Design January 4, 9, 11, 16

Topics

  • What's this course about?
  • Why big data?
  • The datacenter is the computer and other "big ideas"
  • MapReduce programming model
  • Cloud computing and datacenters
  • Hadoop API
  • Hadoop physical execution
  • MapReduce design patterns
  • Intermediate aggregation and combiners
  • Partitioning, grouping, sorting, and monoids

Readings

  • Data-Intensive Text Processing with MapReduce
  • Hadoop: The Definitive Guide (4th Edition) (Optional for CS 431/631, recommended for CS 451/651):
    • Chapter 1: Meet Hadoop
    • Chapter 2: MapReduce
    • Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
    • Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
    • Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
    • Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
    • Chapter 8: MapReduce Types and Formats
    • Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX (Mac) PDF   Part 1a: January 4

PPTX (Mac) PDF   Part 1b: January 9

PPTX (Mac) PDF   Part 1c: January 11

PPTX (Mac) PDF   Part 1d: January 16

Part 2: From MapReduce to Spark January 18, 23

Topics

  • Evolution of dataflow abstractions
  • MapReduce, Pig, Dryad, Spark, Flink, etc.

Readings

  • Jimmy Lin. Monoidify! Monoids as a Design Principle for Efficient MapReduce Algorithms. arXiv:1304.7544.
  • Learning Spark (Optional):
    • Chapter 1: Introduction to Data Analysis with Spark
    • Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
    • Chapter 3: Programming with RDDs
    • Chapter 4: Working with Key/Value Pairs
    • Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)
    • In all the readings above, CS 451/651 students should focus on the Scala examples, since they will only be working with Spark's Scala API. CS 431/631 students should likewise focus on the Python examples, since they will be working with Spark's Python API.

Note that the Spark book is a bit outdated: it covers Spark 1.3, and we're using Spark 2.1. All the material in the book can be found in a multitude of sources online, but you'll have to hunt around for it. The book is useful primarily as a single reference that gathers everything together.

Slides

PPTX (Mac) PDF   Part 2a: January 18

PPTX (Mac) PDF   Part 2b: January 23

Part 3: Analyzing Text January 25, 30

Topics

  • Language models and machine translation
  • Inverted indexing and search

Readings

Slides

PPTX (Mac) PDF   Part 3a: January 25

PPTX (Mac) PDF   Part 3b: January 30

Part 4: Analyzing Graphs February 1, 6

Topics

  • Graph representations
  • Parallel breadth-first search
  • PageRank and random walks
  • Issues and challenges with dataflow abstractions

Readings

Slides

PPTX (Mac) PDF   Part 4a: February 1

PPTX (Mac) PDF   Part 4b: February 6

Part 5: Analyzing Relational Data February 8, 13, 15

Topics

  • OLTP vs. OLAP
  • Data warehousing and data lakes, ETL
  • SQL-on-Hadoop: relational data processing with MapReduce and Spark
  • Optimizations for relational processing: row vs. column stores, vectorized processing
  • Semistructured data and record reconstruction (Parquet)

Readings

Slides

PPTX (Mac) PDF   Part 5a: February 8

PPTX (Mac) PDF   Part 5b: February 13

PPTX (Mac) PDF   Part 5c: February 15

Part 6: Data Mining and Machine Learning February 27, March 1, 6, 8

Topics

  • Supervised machine learning: binary classification
  • Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
  • Production machine learning pipelines
  • Hashing: minhash, random projections, etc.
  • Clustering: k-means, Gaussian mixture models

Readings

Slides

PPTX (Mac) PDF   Part 6a: February 27

PPTX (Mac) PDF   Part 6b: March 1

PPTX (Mac) PDF   Part 6c: March 6

PPTX (Mac) PDF   Part 6d: March 8

Part 7: Mutable State March 13, 15

Topics

  • Bigtable/HBase: Log-structured merge trees
  • Distributed hash tables
  • Consistency, latency, and availability tradeoffs

Slides

PPTX (Mac) PDF   Part 7a: March 13

PPTX (Mac) PDF   Part 7b: March 15

Part 8: Analyzing Graphs, Redux March 20, 22

Topics

  • Bulk synchronous parallel: "think like a vertex" (Giraph)
  • Alternative approaches: GraphX

Slides

PPTX (Mac) PDF   Part 8a: March 20

PPTX (Mac) PDF   Part 8b: March 22

Part 9: Real-Time Analytics March 27, 29

Topics

  • Stream processing semantics, issues, and frameworks
  • Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
  • Integrating batch and stream processing

Slides

PPTX (Mac) PDF   Part 9a: March 27

PPTX (Mac) PDF   Part 9b: March 29

Part 10: Looking Ahead April 3

Slides

PPTX (Mac) PDF   Part 10: April 3
