Over the past few years, we have seen the emergence of "big data": disruptive technologies that have transformed commerce, science, and many aspects of society. These developments are enabled by infrastructure that allows us to distribute computations across hundreds or even thousands of commodity servers. One key breakthrough that makes this all possible is the development of abstractions for data-intensive computing that allow programmers to reason about computations at a massive scale, hiding low-level details such as synchronization, data movement, and fault tolerance.
This course provides an introduction to big data infrastructure, starting with MapReduce, the first of these datacenter-scale programming abstractions. The Hadoop implementation of MapReduce lies at the core of an application stack that is gaining widespread adoption in both industry and academia. A major focus of this course is algorithm design and "thinking at scale", applied to a variety of domains: text, graphs, relational data, etc. We will also cover a number of next-generation systems that are vying to replace MapReduce as the de facto big data processing platform of tomorrow.
If you're interested in my research, here is my homepage. The best way to get in touch with me is via email, at jimmylin@umd.edu. I am available by appointment to discuss material from class, the readings, homework assignments, the project, etc. Email is also a good way to get a quick answer to a simple question.
Note that this course requires that you have access to a reasonably recent computer with at least 4 GB of memory and plenty of hard disk space.
The most recent version of all materials for the course will be posted on this website, including the syllabus, readings, slides used in class, and homework assignments. Please check the site frequently for updates.
The principal textbooks for this course are:
Data-Intensive Text Processing with MapReduce | Book website |
Hadoop: The Definitive Guide | Online version (3rd Ed.) |
Readings from other sources will be assigned as appropriate.
University of Maryland students have free access to the second book via Safari Books Online (click the "Online version" link above), but you may wish to purchase paper copies for convenience. Any online bookseller will have these books.
You're encouraged to use the course mailing list to share information that would be of general interest or for any other purpose that seems reasonable. Mail sent to that address will reach me and all students. If you have not received a message from the mailing list yet, please contact me to make sure that your correct address is included.
Components of the final grade are as follows:
Component | Weight |
Assignment 1 | 5% |
Assignment 2 | 12% |
Assignment 3 | 12% |
Assignment 4 | 12% |
Assignment 5 | 12% |
Assignment 7 | 12% |
Final Project | 35% |
Total | 100% |
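As a rough illustration of how the weights above combine, here is a minimal Python sketch. The weights are copied from the table; the function name `final_grade` and the assumption that each component is scored on a 0-100 scale are mine, not part of the course policy.

```python
# Weights (in percent) copied from the grade table above.
WEIGHTS = {
    "Assignment 1": 5,
    "Assignment 2": 12,
    "Assignment 3": 12,
    "Assignment 4": 12,
    "Assignment 5": 12,
    "Assignment 7": 12,
    "Final Project": 35,
}


def final_grade(scores):
    """Weighted average of component scores.

    `scores` maps each component name to a score on a 0-100 scale
    (an assumed scale, used here only for illustration).
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Example: 90 on every assignment and 80 on the final project.
example = {name: 90 for name in WEIGHTS}
example["Final Project"] = 80
print(final_grade(example))
```

Because the final project carries 35% of the grade, a 10-point drop on it lowers the total by 3.5 points in this sketch.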
The homework assignments are designed to provide an opportunity for you to explore specific topics in a structured way. You may work together on the homework assignments, but all of the material that is turned in for grading must be produced individually. For example, you may form study groups and work out homework solutions together on a whiteboard or by each working separately on different computers and then sharing what you've learned, but it would not be permissible for someone to prepare an answer set and then for others to copy those answers and submit them as their own work. Turning in copied files is specifically prohibited; you must individually write (type) any material that is submitted for grading, including code.
Assignments are due before the class indicated on the syllabus.
Late policy: For assignments turned in up to 24 hours late, I will take the grade you would have gotten and multiply it by 0.75; for assignments turned in more than 24 hours late but less than 48 hours late, I will take the grade you would have gotten and multiply it by 0.5, and so on.
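The arithmetic of the late policy can be sketched in a few lines of Python. Only the first two tiers (0.75 and 0.5) are stated in the policy; this sketch assumes that "and so on" means the multiplier keeps dropping by 0.25 for each further 24-hour period until it reaches zero, and that a submission exactly on a 24-hour boundary falls into the earlier tier. The function names are hypothetical.

```python
import math


def late_multiplier(hours_late):
    """Multiplier applied to the grade an on-time submission would earn.

    Assumption (not stated in the policy): the multiplier drops by 0.25
    per started 24-hour period and never goes below zero.
    """
    if hours_late <= 0:
        return 1.0  # on time: no penalty
    periods = math.ceil(hours_late / 24)  # number of started 24-hour periods
    return max(0.0, 1.0 - 0.25 * periods)


def penalized_grade(raw_grade, hours_late):
    """Grade after applying the late multiplier."""
    return raw_grade * late_multiplier(hours_late)


# An assignment worth 80 points, turned in 30 hours late, falls in the 0.5 tier.
print(penalized_grade(80, 30))
```

Under these assumptions, an assignment more than 72 hours late earns no credit; check with the instructor if the exact cutoffs matter for your situation.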
The course will include a final group project. More details can be found here. In addition, there will be a midterm and a final exam for the course.
The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit this site.
This is a graduate course in which you are responsible for making your own decisions regarding how best to master the material. You will be held accountable for all content covered in class, in the assigned readings, and in assignments. Experience strongly suggests that you should attend classes. Class attendance for the midterm, final, and final project presentations (May 2 and May 9) is required.
Accommodations for Religious Holidays and Other Special Circumstances. Students wishing to discuss accommodations for religious holidays on dates that assignments are due, or other circumstances not addressed in this course information page, should discuss those circumstances with me before the third class session in order to permit adequate time for planning. Only accommodations for unforeseeable circumstances will be considered after that date.
Accommodations for Disabilities. The University is legally obligated to provide appropriate accommodations for students with documented disabilities. Accommodations will be made only in accordance with University policy. Students who are entitled to accommodations due to disabilities must first set up an appointment with the Disability Support Services (DSS). To permit adequate planning, this process must be completed and I must be notified by DSS at least two weeks before the session in which the accommodation is required.