Bespin is a software library that contains reference implementations of "big data" algorithms in MapReduce and Spark. It provides sample code for many of the algorithms we'll be discussing in class and also provides starting points for the assignments.
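If you want to grab the reference implementations right away, here is a minimal sketch, assuming the Bespin repository is hosted at github.com/lintool/bespin and builds with Maven (verify both against the course materials):
git clone https://github.com/lintool/bespin.git
cd bespin
mvn clean package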
Software needed for the course can be found in the linux.student.cs.uwaterloo.ca environment. We will ensure that everything works correctly in this environment.
TL;DR. Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):
export PATH=/u0/cs489/packages/spark/bin:/u0/cs489/packages/hadoop/bin:/u0/cs489/packages/maven/bin:/u0/cs489/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
You'll want to add the above lines to your shell config file, e.g., .bashrc, .bash_profile, etc.
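To sanity-check the setup, you can ask each tool for its version; this is just a quick smoke test, and the exact version strings will depend on what the symlinks described below currently point to:
java -version
scala -version
hadoop version
mvn -version
spark-submit --version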
Gory Details. For the course we need Java, Scala, Hadoop, Spark, and Maven. Java is already available in the default user environment. The rest of the packages are installed in /u0/cs489/packages/. The directories scala, hadoop, spark, and maven are actually symlinks to specific versions. This is so that we can transparently change the links to point to different versions if necessary without affecting downstream users. Currently, the links point to the package versions listed below (under the local installation instructions).
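You can see what the symlinks currently resolve to directly from the shell (the output will vary as the links are updated):
ls -l /u0/cs489/packages/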
You may wish to install everything you need locally on your own machine. Both Hadoop and Spark work fine on Mac OS X and Linux, but may be difficult to get working on Windows. Note that to run Hadoop and Spark on your local machine comfortably, you'll need at least 4 GB of memory and plenty of disk space (tens of GB, at least).
You'll also need Java (JDK 1.7 or 1.8 should work), Scala (use Scala 2.10), and Maven (any reasonably recent version).
The versions of the packages installed on linux.student.cs.uwaterloo.ca are as follows:
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz
http://mirror.cogentco.com/pub/apache/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz
Download the above packages (e.g., using wget), unpack the tarballs, add their respective bin/ directories to your path (and your shell config), and you should be good to go.
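For example, here's a minimal local-install sketch using the links above (the install directory ~/packages is just an illustrative choice, and the unpacked directory names should match the tarball names, but check after extracting):
mkdir -p ~/packages && cd ~/packages
wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz
wget http://mirror.cogentco.com/pub/apache/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz
tar xzf hadoop-2.6.0-cdh5.5.1.tar.gz
tar xzf spark-1.4.1-bin-hadoop2.4.tgz
export PATH=$HOME/packages/hadoop-2.6.0-cdh5.5.1/bin:$HOME/packages/spark-1.4.1-bin-hadoop2.4/bin:$PATH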
Alternatively, you can also install the various packages using a package manager, e.g., apt-get, MacPorts, etc. However, make sure you get the right versions.
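As a rough sketch on a Debian/Ubuntu system, something along these lines installs Java, Maven, and Scala from the distribution repositories (package names and versions vary by release, so check them against the versions listed above; Hadoop and Spark themselves are usually easier to install from the tarballs):
sudo apt-get update
sudo apt-get install openjdk-8-jdk maven scala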
Note that we can provide basic installation instructions (per above), but course staff cannot provide detailed technical support due to the size of the class and the idiosyncrasies of individual systems. However, we will make sure everything works properly in the Linux Student CS Environment.
In addition to running "toy" Hadoop on a single machine (which obviously defeats the point of a distributed framework), we're going to be playing with a modest cluster thanks to the generous support of Altiscale, which is a "Hadoop-as-a-service" provider. You'll be getting an email directly from Altiscale with account information.
Follow the instructions from the email: http://rm-ia.s3s.altiscale.com:8088/cluster/.
The TL;DR version. Configure your ~/.ssh/config file as follows:
Host altiscale
  User YOUR_USERNAME
  Hostname waterloo.z43.altiscale.com
  Port 1450
  IdentityFile ~/.ssh/id_rsa
  Compression yes
  ServerAliveInterval 15
  DynamicForward localhost:1080
  TCPKeepAlive yes
  Protocol 2,1
And you should be able to ssh into the workspace:
ssh altiscale
Note: the workspace host and port from your web profile (on the Altiscale Portal) may not be correct, but the above information is.
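Note that the DynamicForward line in the config above opens a SOCKS proxy on localhost:1080 while you're connected; pointing your browser at that proxy is presumably how you reach internal cluster web pages such as the resource manager URL above. A quick connectivity check from your own machine:
ssh altiscale hostname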
Once you ssh into the workspace, to properly set up your environment, add the following lines to your .bash_profile:
PATH=$PATH:$HOME/bin
export PATH
export SCALA_HOME=/opt/scala
export YARN_CONF_DIR=/etc/hadoop/
export SPARK_HOME=/opt/spark/
cd $SPARK_HOME/test_spark && ./init_spark.sh
cd
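After sourcing .bash_profile (or logging in again), a quick sanity check from inside the workspace (this assumes the Hadoop client binaries are on the default path there, which they should be):
echo $SPARK_HOME $YARN_CONF_DIR
hadoop version
hdfs dfs -ls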
Running Spark on Altiscale. Running Spark on Altiscale requires a bit more setup; for the gory details, check out the documentation. This is the TL;DR version:
In your workspace home directory, you should have a bin/ directory. Create a script there called my-spark-submit with the following:
#!/bin/bash
/opt/spark/bin/spark-submit --queue waterloo --master yarn --deploy-mode cluster \
  --driver-class-path $(find /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-* | head -n 1) "$@"
Then chmod it so that it's executable. Now you can use my-spark-submit instead of spark-submit, and everything should work. The main issue is that running Spark on the Altiscale cluster requires a host of command-line parameters to direct Spark to the right cluster configs. You can add those parameters every time, but the my-spark-submit script simplifies the process for you. It takes whatever Spark command-line parameters you specify, prepends all the "boilerplate" ones, and actually runs spark-submit.
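For example, assuming the script lives in ~/bin (the class name and jar below are hypothetical placeholders for whatever Spark application you're submitting):
chmod +x ~/bin/my-spark-submit
my-spark-submit --class MySparkApp --num-executors 4 target/myapp.jar arg1 arg2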