Schedule of Classes

Week | Date | Topic | Assignment Due (Tuesdays before class)
1 | January 5/7 | Introduction |
2 | January 12/14 | MapReduce Algorithm Design | Assignment #0: Warmup
3 | January 19/21 | From MapReduce to Spark | Assignment #1: Counting in MapReduce
4 | January 26/28 | Analyzing Text | Assignment #2: Counting in Spark
5 | February 2/4 | Analyzing Graphs | Assignment #3: Inverted Indexing
6 | February 9/11 | Analyzing Relational Data I | Assignment #4: Multi-Source Personalized PageRank
  |  | No classes! |
7 | February 23/25 | Analyzing Relational Data II |
8 | March 1/3 | Data Mining I | Assignment #5: Data Warehousing
9 | March 8/10 | Data Mining II |
10 | March 15/17 | Mutable State |
11 | March 22/24 | Analyzing Graphs, Redux | Assignment #6: Duplicate Sentence Detection
12 | March 29/31 | Real-Time Data Analytics | Assignment #7: Inverted Indexing (Redux)

Week 1: Introduction January 5/7

Topics

  • What's this course about?
  • Why big data?
  • The datacenter is the computer and other "big ideas"
  • The MapReduce programming model

Readings

  • Data-Intensive Text Processing with MapReduce
    • Chapter 1: Introduction
    • Chapter 2: MapReduce Basics
  • (Optional, but recommended) Hadoop: The Definitive Guide (4th Edition):
    • Chapter 1: Meet Hadoop
    • Chapter 2: MapReduce
    • Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)

Slides

PPTX (Mac) PDF   Part 1: January 5

PPTX (Mac) PDF   Part 2: January 7


Week 2: MapReduce Algorithm Design January 12/14

Topics

  • MapReduce physical execution
  • MapReduce design patterns
  • Intermediate aggregation and combiners
  • Partitioning, grouping, sorting, and monoids

Readings

  • Data-Intensive Text Processing with MapReduce
    • Chapter 3: MapReduce Algorithm Design
  • (Optional, but recommended) Hadoop: The Definitive Guide (4th Edition):
    • Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
    • Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
    • Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
    • Chapter 8: MapReduce Types and Formats
    • Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PPTX (Mac) PDF   Part 1: January 12

PPTX (Mac) PDF   Part 2: January 14


Week 3: From MapReduce to Spark January 19/21

Topics

  • Evolution of dataflow abstractions
  • MapReduce, Pig, Dryad, Spark, Flink, etc.

Readings

  • (Optional, but recommended) Learning Spark:
    • Chapter 1: Introduction to Data Analysis with Spark
    • Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
    • Chapter 3: Programming with RDDs
    • Chapter 4: Working with Key/Value Pairs
    • Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)
    • In all the readings above, don't worry about Python and Java since we're only going to be working with Spark's Scala API.

Slides

PPTX (Mac) PDF   Part 1: January 19

PPTX (Mac) PDF   Part 2: January 21


Week 4: Analyzing Text January 26/28

Topics

  • Language models and machine translation
  • Inverted indexing and search

Readings

Slides

PPTX (Mac) PDF   Part 1: January 26

PPTX (Mac) PDF   Part 2: January 28


Week 5: Analyzing Graphs February 2/4

Topics

  • Graph representations
  • Parallel breadth-first search
  • PageRank and random walks
  • Issues and challenges with dataflow abstractions

Readings

Slides

PPTX (Mac) PDF   Part 1: February 2

PPTX (Mac) PDF   Part 2: February 4


Week 6: Analyzing Relational Data I February 9/11

Topics

  • OLTP vs. OLAP
  • Data warehousing, ETL, data cubes

Readings

Slides

PPTX (Mac) PDF   Part 1: February 11


Week 7: Analyzing Relational Data II February 23/25

Topics

  • SQL-on-Hadoop
  • Relational data processing with MapReduce and Spark
  • Row vs. column stores
  • Semistructured data and record reconstruction (Parquet)
  • Optimizations for relational processing

Readings

Slides

PPTX (Mac) PDF   Part 1: February 23

PPTX (Mac) PDF   Part 2: February 25


Week 8: Data Mining I March 1/3

Topics

  • Supervised machine learning: binary classification
  • Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
  • Production machine learning pipelines

Readings

Slides

PPTX (Mac) PDF   Part 1: March 1

PPTX (Mac) PDF   Part 2: March 3


Week 9: Data Mining II March 8/10

Topics

  • Hashing: minhash, random projections, etc.
  • Clustering: k-means, Gaussian mixture models

Slides

PPTX (Mac) PDF   Part 1: March 8

PPTX (Mac) PDF   Part 2: March 10


Week 10: Mutable State March 15/17

Topics

  • Bigtable/HBase: log-structured merge trees
  • Distributed hash tables
  • Consistency, latency, and availability tradeoffs

Slides

PPTX (Mac) PDF   Part 1: March 15

PPTX (Mac) PDF   Part 2: March 17


Week 11: Analyzing Graphs, Redux March 22/24

Topics

  • Bulk synchronous parallel: "think like a vertex" (Giraph)
  • Alternative approaches: GraphX

Slides

PPTX (Mac) PDF   Part 1: March 22

PPTX (Mac) PDF   Part 2: March 24


Week 12: Real-Time Data Analytics March 29/31

Topics

  • Stream processing issues and models
  • Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
  • Integrating batch and stream processing

Slides

PPTX (Mac) PDF   Part 1: March 29

PPTX (Mac) PDF   Part 2: March 31
