Note that there are separate sets of assignments for CS 451/651 and CS 431/631. Make sure you work on the correct assignments!
The purpose of this assignment is to serve as a warmup exercise and a practice "dry run" for the submission procedures of subsequent assignments. You'll have to write a bit of code but this assignment is mostly about the "mechanics" of setting up your Hadoop development environment. In addition to running Hadoop locally in either the Linux student CS environment or on your own machine, you'll also try running jobs on the Altiscale cluster.
The general setup is as follows: you will complete your assignments and check everything into a private GitHub repo. Shortly after the assignment deadline, we'll pull your repo for marking.
I'm assuming you already have a GitHub account. If not, create one as soon as possible. Once you've signed up for an account, go and request an educational account. This will allow you to create private repos for free. Please do this as soon as possible since there may be delays in the request verification process.
Hadoop and Spark are already installed in the linux.student.cs.uwaterloo.ca environment (you just need to do some simple config). Alternatively, you may wish to install everything locally on your own machine. For both, see the software page for more details.
Bespin is a library that contains reference implementations of "big data" algorithms in MapReduce and Spark. We'll be using it throughout this course. Go and run the Word Count in MapReduce and Spark example as shown in the Bespin README (clone and build the repo, download the data files, run word count in both MapReduce and Spark, and verify output). Assuming you are using linux.student.cs.uwaterloo.ca (or if you have properly set up your local environment), this task should be as simple as copying and pasting commands from the Bespin README.
Create a private repo called bigdata2018w. I'm assuming that you're already familiar with Git and GitHub, but just in case, here is how you create a repo on GitHub. For "Who has access to this repository?", make sure you click "Only the people I specify". If you've successfully gotten an educational account (per above), you should be able to create private repos for free. If you're not already familiar with Git, there are plenty of good tutorials online: do a simple web search and find one you like.
What you're going to do now is to copy the MapReduce word count example into your own private repo. Start with this pom.xml: copy it into your bigdata2018w repo.
Next, copy:
bespin/src/main/java/io/bespin/java/mapreduce/wordcount/WordCount.java
over to:
bigdata2018w/src/main/java/ca/uwaterloo/cs451/a0/WordCount.java
Tip: mkdir -p creates parent directories as needed.
Open up this new version of WordCount.java using a text editor (or your IDE of choice) and change the Java package to ca.uwaterloo.cs451.a0.
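If you'd rather script the copy and the package change, here's a minimal sketch. The function name copy_wordcount and its two arguments (the paths to your Bespin clone and to your bigdata2018w repo) are made up for illustration, and it assumes the source file has a single package declaration that starts at the beginning of a line.

```shell
# copy_wordcount BESPIN_DIR REPO_DIR  (hypothetical helper, not part of Bespin)
# Copies WordCount.java into the assignment repo and rewrites its package line.
copy_wordcount() {
  bespin_dir=$1
  repo_dir=$2
  src="$bespin_dir/src/main/java/io/bespin/java/mapreduce/wordcount/WordCount.java"
  dst_dir="$repo_dir/src/main/java/ca/uwaterloo/cs451/a0"

  # -p creates src/main/java/ca/uwaterloo/cs451/a0 and any missing parents.
  mkdir -p "$dst_dir" || return 1

  # Copy through sed so only the package declaration changes;
  # imports of other Bespin classes are left untouched.
  sed 's/^package .*;/package ca.uwaterloo.cs451.a0;/' "$src" > "$dst_dir/WordCount.java"
}
```

From the directory that holds both checkouts, you would call it as: copy_wordcount bespin bigdata2018w. It's still worth opening the file afterwards to confirm the package line looks right.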
Now, in the bigdata2018w/ base directory, you should be able to run Maven to build your package:
$ mvn clean package
Once the build succeeds, you should be able to run the word count demo program in your own repository:
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a0.WordCount \
   -input data/Shakespeare.txt -output wc
You should be running this in the Linux student CS environment or on your own machine. Note that you'll need to copy over the Shakespeare collection in data/. The output should be exactly the same as the output of the same program in Bespin, but the difference here is that the code is now in a repository under your control.
Let's make a simple modification to word count: I would like to know the distribution (counts) of all words that follow the word "perfect". That is, for the phrase "perfect x", I want to know how many times each word appears as the x, where x is any non-zero-length word. To reduce noise, I am not interested in x's that appear only once.
Important: To be clear, use the "words" as generated by the io.bespin.java.util.Tokenizer class in Bespin. This will ensure consistent handling of corner cases involving empty strings, words containing punctuation, etc.
Create a program called PerfectX in the package ca.uwaterloo.cs451.a0 that implements the specifications above.
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a0.PerfectX \
   -input data/Shakespeare.txt -output cs451-lintool-a0-shakespeare
You shouldn't need to write more than a couple of lines of code (beyond changing class names and other boilerplate). We'll go over the Hadoop API in more detail in class, but the changes should be straightforward.
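To make the "perfect x" specification concrete, here's a rough shell sketch of the same counting logic. It is only an approximation for sanity checking: it assumes lowercased, whitespace-separated words, which is not the Bespin Tokenizer, so punctuation is handled differently and the counts will not exactly match your MapReduce output. The function name perfect_x_approx is made up for illustration.

```shell
# perfect_x_approx FILE  (hypothetical helper; approximate tokenization only)
# Prints "x<TAB>count" for every word x that follows "perfect" more than once.
perfect_x_approx() {
  tr 'A-Z' 'a-z' < "$1" |
    awk '
      {
        # Pairs are formed within a line only, like records in the word count job.
        for (i = 1; i < NF; i++)
          if ($i == "perfect") count[$(i + 1)]++
      }
      END {
        # Drop every x that appears only once.
        for (w in count)
          if (count[w] > 1) printf "%s\t%d\n", w, count[w]
      }' |
    sort
}
```

Calling perfect_x_approx data/Shakespeare.txt should give numbers in the same ballpark as your job, but treat any difference as a prompt to look closer, not as proof of a bug.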
Answer the following questions:
Question 1. In the Shakespeare collection, what is the most frequent x and how many times does it appear? (Answer this question with command-line tools.)
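As a hint on what "command-line tools" can look like here, the sketch below sorts job output by count. It assumes the output directory is on the local filesystem and holds part-* files with one "word<TAB>count" pair per line (the default Hadoop text output format); the function name top_counts is made up for illustration.

```shell
# top_counts OUTPUT_DIR N  (hypothetical helper)
# Prints the N lines with the largest counts from a job output directory.
top_counts() {
  tab=$(printf '\t')
  # Field 2 is the count: sort it numerically, largest first.
  cat "$1"/part-* | sort -t "$tab" -k2,2nr | head -n "$2"
}
```

For example, top_counts cs451-lintool-a0-shakespeare 1 prints the single most frequent x. When the output lives in HDFS instead, the same sort and head can be fed from hadoop fs -cat on the part files.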
You can run the above instructions using check_assignment0_public_linux.py as follows:
$ wget http://lintool.github.io/bigdata-2018w/assignments/check_assignment0_public_linux.py
$ chmod +x check_assignment0_public_linux.py
$ ./check_assignment0_public_linux.py lintool
We'll be using exactly this script to check your assignment in the Linux Student CS environment. Important: Make sure that your code runs there even if you do development on your own machine.
The software page has details on getting started with the Altiscale cluster. Register your account and follow instructions to set up ssh into the "workspace". Make sure you've properly set up the proxy to view the cluster Resource Manager (RM) webapp at http://rm-ia.s3s.altiscale.com:8088/cluster/. Getting access to the RM webapp is important: you'll need it to track your job status and for debugging purposes.
Once you've ssh'ed into the workspace, you should be able to run your copy of the word count program (copied over from Bespin) from your assignments repo bigdata2018w/:
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a0.WordCount \
   -input /shared/uwaterloo/cs451/data/enwiki-20161220-sentences-0.1sample.txt -output wc-jmr-combiner
Note that we're running word count over a larger collection here: a 10% sample of English Wikipedia totaling 1.6 GB (here's a chance to exercise your newly-acquired HDFS skills to confirm for yourself).
Question 2. Run word count on the Altiscale cluster and make sure you can access the Resource Manager webapp. What is your application id? It looks something like application_XXXXXXXXXXXXX_XXXX and can be found in the Resource Manager webapp. If you ran word count multiple times, any id will do.
Question 3. For this word count job, how many mappers ran in parallel?
Question 4. From the word count program, how many times does "waterloo" appear in the sample Wikipedia collection?
Now run your PerfectX program on the sample Wikipedia data:
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a0.PerfectX \
   -input /shared/uwaterloo/cs451/data/enwiki-20161220-sentences-0.1sample.txt -output cs451-lintool-a0-wiki
Question 5. In the sample Wikipedia collection, what are the 10 most frequent x's and how many times does each appear? (Answer this question with command-line tools.)
Note that the Altiscale cluster is a shared resource, and how fast your jobs complete will depend on how busy it is. You're advised to begin the assignment early so as to avoid long job queues. "I wasn't able to complete the assignment because there were too many jobs running on the cluster" will not be accepted as an excuse if your assignment is late.
You can run the above instructions using check_assignment0_public_altiscale.py as follows:
$ wget http://lintool.github.io/bigdata-2018w/assignments/check_assignment0_public_altiscale.py
$ chmod +x check_assignment0_public_altiscale.py
$ ./check_assignment0_public_altiscale.py lintool
We'll be using exactly this script to check your assignment on the Altiscale cluster.
At this point, you should have a GitHub repo bigdata2018w/ and inside the repo, you should have the word count program copied over from Bespin, and the new perfect x count implementation, along with your pom.xml. Commit these files. Next, create a file called assignment0.md inside bigdata2018w/. In that file, put your answers to the above questions (1-5). Use the Markdown annotation format: here's a simple guide.
Note: there is no need to commit data/ or target/ (or any results that you may have generated), so your repo should be very compact: it should only have four files: two Java source files, pom.xml, and assignment0.md. You can add a .gitignore file if you wish.
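If you do add a .gitignore, here's one possible sketch. The data/ and target/ entries come straight from the note above; the wc/ and cs451-*/ entries are my assumption, based on the output directory names used in the commands in this assignment, so adjust them to whatever you actually generate. The function name write_gitignore is made up for illustration.

```shell
# write_gitignore REPO_DIR  (hypothetical helper)
# Writes a .gitignore that keeps data, build products, and job output out of Git.
write_gitignore() {
  # A leading slash anchors each pattern to the repo root, so the
  # src/main/java/ca/uwaterloo/cs451/ source directory is never matched.
  printf '%s\n' \
    '/data/' \
    '/target/' \
    '/wc/' \
    '/cs451-*/' > "$1/.gitignore"
}
```

Run it once as write_gitignore bigdata2018w and commit the resulting file along with your sources.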
For this and all subsequent assignments, make sure everything is on the master branch. Push your repo to GitHub. You can verify that it's there by logging into your GitHub account in a web browser: your assignment should be viewable in the web interface.
This and subsequent assignments contain two parts, one that can be completed locally, and another that requires the Altiscale cluster. For the first, make sure that your code runs in the Linux Student CS environment (even if you do development on your own machine), which is where we will be doing the marking. "But it runs on my laptop!" will not be accepted as an excuse if we can't get your code to run.
Almost there! Add the user teachtool as a collaborator to your repo so that we can access it (under settings in the main web interface on your repo). Note: do not add my primary GitHub account lintool as a collaborator.
Finally, you need to tell us your GitHub account so we can link it to you. Submit your information here.
And that's it!
To give you an idea of how we'll be marking this and future assignments: we will clone your repo and use the above check scripts:
check_assignment0_public_linux.py in the Linux Student CS environment.
check_assignment0_public_altiscale.py on the Altiscale cluster.
We'll make sure the data files are in the right place, and once the code completes, we will verify the output. It is highly recommended that you run these check scripts: if they don't work for you, they won't work for us either.
This assignment is worth a total of 20 points, broken down as follows:
PerfectX
in the
Linux student CS environment and on Altiscale). We will make a
minimal effort to fix trivial issues with your code (e.g., a
typo)—and deduct points—but will not spend time
debugging your code. It is your responsibility to make sure your
code runs: we have taken care to specify exactly how we will run
your code—if anything is unclear, it is your responsibility to
seek clarification. In order to get a perfect score of 8 for this
portion of the grade, we should be able to run the two public check
scripts above successfully without any errors.