Bespin is a software library that contains reference implementations of "big data" algorithms in MapReduce and Spark. It provides sample code for many of the algorithms we'll be discussing in class and also provides starting points for the assignments. You'll want to familiarize yourself with the library.
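If you want to poke around right away, a typical way to grab and build the library looks like this (this assumes the repository is hosted on GitHub at lintool/bespin and uses a standard Maven build; verify against the project's README):

git clone https://github.com/lintool/bespin.git
cd bespin
mvn clean package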
Software needed for the course can be found in the linux.student.cs.uwaterloo.ca environment. We will ensure that everything works correctly in this environment.
TL;DR. Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):
export PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/bin:/u3/cs451/packages/spark/bin:/u3/cs451/packages/hadoop/bin:/u3/cs451/packages/maven/bin:/u3/cs451/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
You'll want to add the above lines to your shell config file (e.g., .bash_profile).
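To confirm the environment is set up correctly, check that each tool resolves and reports a version (these are the standard version flags for each tool; Java should report 1.8, per the JAVA_HOME above):

java -version
hadoop version
spark-shell --version
mvn -version
scala -version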
Gory Details. For the course we need Java, Scala, Hadoop, Spark, and Maven. Java is already available in the default user environment (but we need to point to the right version). The rest of the packages are installed in /u3/cs451/packages/. The directories scala, hadoop, spark, and maven are actually symlinks to specific versions. This is so that we can transparently change the links to point to different versions if necessary without affecting downstream users. To see which versions are current, check the symlinks directly, as shown below.
You may wish to install all necessary software packages locally on your own machine. We provide basic installation instructions here, but the course staff cannot provide technical support due to the size of the class and the idiosyncrasies of individual systems. We will be responsible for making sure everything works properly in the Linux Student CS Environment (above), but if you want to install everything on your own machine for convenience, you're on your own.
Both Hadoop and Spark work fine on Mac OS X and Linux, but may be difficult to get working on Windows. Note that to run Hadoop and Spark on your local machine comfortably, you'll need at least 4 GB memory and plenty of disk space (at least 10 GB).
You'll also need Java (JDK 1.8), Scala (use Scala 2.11.x), and Maven (any reasonably recent version).
Match the versions of the packages installed on linux.student.cs.uwaterloo.ca.
Download the above packages, unpack the tarballs, add their respective bin/ directories to your path (and your shell config), and you should be good to go.
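As a concrete sketch for one package (the Hadoop tarball name is illustrative; substitute the actual file you downloaded, and repeat for the other packages):

# unpack a downloaded tarball and put its bin/ on the PATH
mkdir -p ~/packages
tar -xzf hadoop-X.Y.Z.tar.gz -C ~/packages
export PATH=$HOME/packages/hadoop-X.Y.Z/bin:$PATH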
Alternatively, you can also install the various packages using a package manager, e.g., apt-get, MacPorts, etc. However, make sure you get the right version.
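For example, on a Debian/Ubuntu system, something along these lines (package names vary by distribution and release, so verify you end up with JDK 1.8, Scala 2.11.x, and a recent Maven):

sudo apt-get install openjdk-8-jdk maven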
In addition to running "toy" Hadoop on a single machine (which obviously defeats the point of a distributed framework), we're going to be playing with a modest cluster thanks to the generous support of Altiscale, which is a "Hadoop-as-a-service" provider. You'll be getting an email directly from Altiscale with account information.
Follow the instructions from the email: http://rm-ia.s3s.altiscale.com:8088/cluster/

The TL;DR version: configure your ~/.ssh/config file as follows:
Host altiscale
    User YOUR_USERNAME
    Hostname ia.z42.altiscale.com
    Port 1763
    IdentityFile ~/.ssh/id_rsa
    Compression yes
    ServerAliveInterval 15
    DynamicForward localhost:1080
    TCPKeepAlive yes
    Protocol 2,1
And you should be able to ssh into the workspace:
ssh altiscale
That should do it!
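One extra note: the DynamicForward line in the config above opens a SOCKS proxy on localhost:1080 whenever you're connected, which you can use to reach the cluster webapps (e.g., the resource manager URL above) from your local machine. One way to test this is with curl's SOCKS support (this assumes an active ssh altiscale session in another terminal):

curl --socks5-hostname localhost:1080 http://rm-ia.s3s.altiscale.com:8088/cluster/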
Running Spark on Altiscale — the TL;DR version: Add the following lines to your ~/.bash_profile to point at the correct version of Spark:
export SPARK_HOME=/opt/spark-beta
export SPARK_CONF_DIR=/etc/spark-beta
export PATH=$PATH:/opt/spark-beta/bin
For additional details, consult the Altiscale Spark documentation.
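A quick sanity check after reconnecting (or sourcing the file): confirm the shell resolves Spark from the paths above:

source ~/.bash_profile
which spark-shell   # should print /opt/spark-beta/bin/spark-shell
echo $SPARK_HOME    # should print /opt/spark-beta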