| # |
Date |
Topic |
Assignment Due |
Details |
| 1 | 1/26 |
Introduction to MapReduce |
|
Readings (complete before class)
Topics
- Administrivia
- Overview of cloud computing
- Overview of MapReduce and the distributed file system
Material
|
| 2 | 2/2 |
Hadoop: Nuts and Bolts |
Assignment 1-1 |
Readings (complete before class)
- White, Chapter 1, "Meet Hadoop"
- White, Chapter 2, "MapReduce", up until page 32
- White, Chapter 3, "The Hadoop Distributed File System", up until
page 63
- White, Chapter 4, "Hadoop I/O", starting from "Serialization"
until the end of chapter
- White, Chapter 5, "Developing a MapReduce Application", up until
page 144
Note that many chapters are assigned from the White book.
The purpose is to get you acquainted with Hadoop; we
don't expect you to digest all the material the first time through,
since you will no doubt refer back to the book frequently
throughout the semester.
Topics
- Writing, running, debugging Hadoop programs
- Hadoop behind the scenes
Material
|
| 2/9 |
Class canceled due to snowstorm |
Assignment 1-2 |
|
| 3 | 2/16 |
MapReduce: the programming environment |
|
Readings (complete before class)
- Lin & Dyer, Chapter 3, "MapReduce Algorithm Design"
- White, Chapter 3, "The Hadoop Distributed File System", page 63
until end of chapter
- White, Chapter 6, "How MapReduce Works"
- White, Chapter 7, "MapReduce Types and Formats"
- White, Chapter 8, "MapReduce Features"
Topics
- "Warehouse-size" computers and the datacenter environment
- MapReduce algorithm design and design patterns
Material
|
| 4 | 2/23 |
Text retrieval algorithms |
Assignment 2 |
Readings (complete before class)
Topics
- Introduction to information retrieval
- Basics of indexing and retrieval
- Inverted indexing in MapReduce
- Retrieval at scale
Material
|
| 5 | 3/2 |
Graph algorithms |
Assignment 3 |
Readings (complete before class)
Topics
- Graph problems and representations
- Parallel breadth-first search
- PageRank
Material
|
| 6 | 3/9 |
Midterm |
|
|
| 3/16 |
Spring break: no class! |
| 7 | 3/23 |
MapReduce and databases |
Assignment 4 |
Readings (complete before class)
Topics
- Relational databases vs. MapReduce
- MapReduce algorithms for processing relational data
- OLTP vs. OLAP (data warehousing and business intelligence)
Material
|
| 8 | 3/30 |
Hidden Markov models |
Assignment 5 |
Readings (complete before class)
Topics
- Hidden Markov models
- Expectation maximization
Material
|
| 9 | 4/6 |
Language models |
|
Readings (complete before class)
- Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. (2007) Large Language Models in Machine Translation. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 858–867.
Topics
- N-gram language models
- Parameter estimation for web-scale language models
Material
|
| 10 | 4/13 |
Large-scale graphs |
|
Readings (complete before class)
Topics
- Scalable identity resolution in email collections: Slides in PDF (622 KB)
- DNA sequence assembly: Slides in PDF (5.65 MB)
|
| 11 | 4/20 |
Dryad and DryadLINQ |
|
Readings (complete before class)
- Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. (2007) Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (EuroSys 2007), pages 59-72.
- Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. (2008) DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI 2008), pages 1-14.
Topics
|
| 12 | 4/27 |
Bigtable, Hive, and Pig |
Assignment 6 |
Readings (complete before class)
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. (2006) Bigtable: A Distributed Storage System for Structured Data. Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI 2006), pages 205-218.
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. (2008) Pig Latin: A Not-So-Foreign Language for Data Processing. Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099-1110.
Topics
Material
|
| 13 | 5/4 |
Project |
|
|
| 5/6 |
Project (optional session as makeup for snowstorm) |
|
|
| 14 | 5/11 |
Project Presentations |
|
|
| 15 | TBA |
Final Exam |
|
|