What's Spark?

Big data and data science are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. MapReduce (especially the Hadoop open-source implementation) is the first, and perhaps most famous, of these frameworks. What's next? Well, Spark is (one) answer.

What's this tutorial about?

This is a two-and-a-half day tutorial on the distributed programming framework Apache Spark.

The class will include introductions to the many Spark features, case studies from current users, best practices for deployment and tuning, future development plans, and hands-on exercises.

In addition, there will be ample time to mingle and network with other big data and data science enthusiasts in the metro DC area.

When? Where?

This tutorial is being organized by Jimmy Lin and jointly hosted by the iSchool and Institute for Advanced Computer Studies at the University of Maryland. The tutorial will be led by Paco Nathan and Reza Zadeh.

The event will take place from October 20 (Monday) to 22 (Wednesday) in the Special Events Room in the McKeldin Library on the University of Maryland campus (actual room number is 6137). The tutorial will run all day Monday, all day Tuesday, and end at noon on Wednesday. The event is free for University of Maryland students and open to the general public for a nominal registration fee.

Update: we've filled up and registration is closed!

If you have any questions, feel free to contact Jimmy Lin at .

Logistics

Maps of the University of Maryland can be found here. McKeldin Library is at one end of the mall that runs across the center of campus; it looks like this and it's pretty hard to miss. Take the elevators up to the 6th floor to room 6137.

To find parking on campus, check out this link. A warning: allow ample time to get onto campus in the morning, especially if you arrive on the hour. Students getting to classes can clog up traffic, and it's not rare to sit at an intersection for more than ten minutes waiting for students to stream by.

Yes, we will be providing wireless access and coffee, probably the two most important ingredients of a successful technology tutorial. The power outlet situation, however, is a bit iffy. The room we are in does not have outlets at the seats, although there are outlets along the walls. Make sure your laptop is charged! Also, if you have a power strip conveniently lying around, please bring it so we can share.

The tutorial will start at 10am sharp, but doors open at 9am... we'll be around, and you're welcome to stop by and mingle.

The hashtag for the event is #fearthespark.

Preparations

The first two days of the tutorial will be presented at the level of a CS freshman. We expect attendees to have some programming experience in Python, Java, or Scala.

Throughout the class, there will be hands-on exercises. You are expected to bring your own laptop for those, with the minimum system requirements:

If you're eager to get started, look through resources here.

The third day (a half day) of the tutorial will be presented at the level of a CS graduate student, focusing specifically on research on or with Spark.

Schedule

Day 1 (10am-4pm, lunch break 12:30-1:30pm)

  • An introduction to Distributed Computing and Spark (Reza Zadeh) [PDF Slides]
  • Hands-on exercises (Paco Nathan) [PDF Slides]:
    • Installing Spark
    • Your first application
    • Spark deconstructed
  • Crash course in Scala (Holden Karau) [PDF Slides]
  • Historical background (Paco Nathan) [PDF Slides]

Lunch break

Day 2 (10am-4pm, lunch break 12:30-1:30pm)

  • Software development lifecycle: build, deploy, monitor (Paco Nathan)
  • Databricks cloud demo (Hossein Falaki)
  • A brief tour of Spark and examples (Holden Karau)

Lunch break

Day 3 (10am-1pm)

  • MLlib and Distributing the Singular Value Decomposition (Reza Zadeh) [PDF Slides]
  • Apache Spark + Elasticsearch (Holden Karau)
  • Graph Processing examples with the GraphX library (Paco Nathan)
  • Additional topics (Paco Nathan):
    • Integrations: Spark + other frameworks
    • Other resources for learning Spark
    • Spark on Mesos on GCP


Full house at the UMD Spark Tutorial!

Additional Resources