Assignment 3: Data Science and Spark Algorithms (due 10am October 9)

In this assignment, you'll continue to work with the analytical database from assignment 2. The data simulates a hypothetical e-commerce site, built on this dataset.

At a high level, you'll do three things:

  1. Analyze the dataset by answering a number of questions about the data, using both SQL queries and DataFrame manipulations.
  2. Implement a naive version and a more efficient version of an algorithm for computing means using RDDs.
  3. Implement two different join algorithms using RDDs.
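
To make the naive/efficient distinction in item 2 concrete, here is a minimal sketch in plain Python (no Spark required). It assumes the means are computed per key, which the notebook may or may not require, and it is not the assignment's reference solution. The sample data and the function names `naive_means` and `combined_means` are hypothetical. The naive version mirrors the shape of `groupByKey` followed by `mapValues`, where every value is shuffled to its key. The second version mirrors `aggregateByKey` (or `reduceByKey` on `(sum, count)` pairs), where each partition pre-aggregates before anything is merged.

```python
from collections import defaultdict

# Hypothetical sample data: (key, value) records split across two lists that
# stand in for the partitions of an RDD.
partitions = [
    [("a", 1.0), ("b", 4.0), ("a", 3.0)],
    [("a", 5.0), ("b", 6.0)],
]


def naive_means(partitions):
    """Collect every value under its key, then average (groupByKey-style)."""
    grouped = defaultdict(list)
    for part in partitions:
        for key, value in part:
            # Every single record is moved to the place that owns its key.
            grouped[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


def combined_means(partitions):
    """Pre-aggregate (sum, count) per partition, then merge (aggregateByKey-style)."""
    partials = []
    for part in partitions:
        local = defaultdict(lambda: (0.0, 0))
        for key, value in part:
            total, count = local[key]
            local[key] = (total + value, count + 1)
        # Only one (sum, count) pair per key leaves each partition.
        partials.append(dict(local))

    merged = defaultdict(lambda: (0.0, 0))
    for local in partials:
        for key, (total, count) in local.items():
            m_total, m_count = merged[key]
            merged[key] = (m_total + total, m_count + count)
    return {key: total / count for key, (total, count) in merged.items()}


naive = naive_means(partitions)
combined = combined_means(partitions)
```

Both functions return the same means. The difference is how much data has to move: the first moves every record, while the second moves at most one pair per key per partition.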

The Jupyter notebook that contains the actual assignment is available here. Everything should be self-explanatory.

Implementation Note: This does not affect the structure of the assignment or the correctness of your answers, but you may notice that the RDD algorithms you are asked to implement run slower than expected, especially in absolute terms. The reason is that the implementation forces the code to cross the Python/JVM boundary unnecessarily. A more efficient implementation would push as much work as possible over to the JVM, e.g., by writing the algorithms in Scala or Java. We're not asking you to do that, so the RDD transformations will take a performance hit.

Assignment Submission

Use this link to create an assignment repo for submission. In the assignment repo, enter your answers in assignment3.ipynb.

In addition, please explicitly add the following two files in your repo (at top level):

Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.

Grading Scheme

What does "following instructions" mean? These are "free points" that you earn by following the instructions provided in this assignment. They cover the scenario where all your answers are correct, but you did not follow the instructions, so we had to go out of our way to fix your submission to make it conform. (For example, you removed the ids that we use for tracking, which makes grading much more difficult.) In these and other related cases, we will dock points from this category.

Total: 90 points
