In this assignment, you'll continue to work with the analytical database from assignment 2. The data simulates a hypothetical e-commerce site, built on this dataset.
At a high level, you'll do three things:
The Jupyter notebook that contains the actual assignment is available here. Everything should be self-explanatory.
Implementation Note: This is not directly relevant to the structure of the assignment or its correctness, but you might have noticed that the RDD algorithms you have been asked to implement are slower than expected (especially in absolute terms). This is because the implementation forces the code to cross the Python/JVM boundary unnecessarily. A more efficient implementation would push as much work as possible over to the JVM, e.g., with implementations in Scala or Java. We're not asking you to do that, so the RDD transformations will take a performance hit.
Use this link to create an assignment repo for submission.
In the assignment repo, enter your answers in assignment3.ipynb.
In addition, please explicitly add the following two files in your repo (at top level):
36c.png: the plot for Q6
37c.png: the plot for Q7
Submit the assignment by committing your edits and pushing your repo (with the answers filled out in the notebook) back to origin.
codecell_31a and codecell_31b: Q1 using SQL/DataFrames: 3 points each, 6 points total
codecell_32a and codecell_32b: Q2 using SQL/DataFrames: 3 points each, 6 points total
codecell_33a and codecell_33b: Q3 using SQL/DataFrames: 3 points each, 6 points total
codecell_34a and codecell_34b: Q4 using SQL/DataFrames: 3 points each, 6 points total
codecell_35a and codecell_35b: Q5 using SQL/DataFrames: 3 points each, 6 points total
codecell_36a, codecell_36b, codecell_36c: Q6 using SQL/DataFrames (3 points each) + plot (2 points): 8 points total
codecell_37a, codecell_37b, codecell_37c: Q7 using SQL/DataFrames (3 points each) + plot (2 points): 8 points total
codecell_5x1 and codecell_5x2: algorithms for computing averages: 6 points each, 12 points total
codecell_61a: shuffle join implementation: 8 points
codecell_62a: hash join implementation: 8 points
qcell_7x1290: performance comparisons: 7 points
README.md)
What does "following instructions" mean? These are "free points" if you follow the instructions provided in this assignment. These points are to handle the scenario where all your answers are correct, but you did not follow the instructions and that caused us to go out of our way to fix your submission so that it conforms to the instructions. (For example, you removed the ids that we used for tracking, which would make it much more difficult to grade.) In these and other related cases, we will dock points from this category.
Total: 90 points