Spark is the core software used in this course. More specifically, we'll be using PySpark extensively, which provides Python bindings to Spark. (Py)Spark can be invoked from the command line, but the assignments are written as Jupyter notebooks.
Although (Py)Spark is intended to run on a cluster, in this course you'll be running the complete software stack locally. The assignments are designed so that they can be completed on any reasonably modern laptop. The datasets we'll be giving you are at most a few gigabytes, and processing shouldn't take more than a couple of minutes. If your experience diverges substantially from this, it's likely that you're doing something wrong...
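To give a concrete sense of what this looks like in practice, here is a minimal sketch of the kind of local PySpark session the assignments build on (the app name and toy data are purely illustrative, not part of any assignment):

from pyspark.sql import SparkSession

# Start a Spark session that runs locally, using all available cores
spark = (SparkSession.builder
         .master("local[*]")
         .appName("cs451-demo")
         .getOrCreate())

# A toy DataFrame; assignment datasets are larger, but still laptop-sized
df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "word"])
print(df.count())

spark.stop()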
We will strive to be helpful, but it is our expectation that you will be able to install and configure all software necessary for this course on your own. We provide basic installation instructions here, but the course staff cannot provide detailed technical support and individual hand-holding for everyone, due to the number of students and the idiosyncrasies of individual systems.
Conda is recommended for managing your environment. Here's how you might get started by creating an environment for this course:
conda create -n cs451 python=3.10 -y
conda activate cs451
Once you've activated the environment, install the following packages:
conda install -c conda-forge openjdk=21 maven -y
pip install findspark
pip install numpy
pip install jupyterlab
We'll be using Python 3.10 and Java 21.
As of September 2025, jupyterlab is at version 4.4.6.
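If you want to double-check the versions from within Python, something like the following should work (this assumes the cs451 environment is active and that the conda openjdk package has put java on your PATH):

import subprocess
import sys

print(sys.version)  # expect 3.10.x

# java prints its version information to stderr
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr)  # expect openjdk 21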
Next, grab the Spark tarball from here. You want v4.0.0, "Pre-built for Apache Hadoop 3.4 and later". Note that Spark 4 is pre-built with Scala 2.13.
Unpack the tarball somewhere convenient.
Set your SPARK_HOME environment variable to point to the location where you've unpacked the tarball, and add its bin/ sub-directory to your path.
Something like this:
export SPARK_HOME="/path/to/spark-4.0.0-bin-hadoop3"
export PATH="$PATH:$SPARK_HOME/bin"
Once this is all done, you should be able to launch JupyterLab in a shell (terminal), as follows:
jupyter lab
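Inside a new notebook, a quick sanity check along the following lines should confirm that everything is wired up correctly (a sketch; findspark locates Spark via the SPARK_HOME you set above, and the app name is illustrative):

# First cell of a scratch notebook
import findspark
findspark.init()  # reads SPARK_HOME to locate the Spark installation

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("sanity-check")
         .getOrCreate())

print(spark.version)  # should print 4.0.0
spark.stop()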
The above instructions provide a starting point. There are many variations, especially for different operating systems, so YMMV.
For anyone interested in using the student Linux servers to complete course assignments, please follow the steps outlined here to get access. This option is highly recommended for Windows users, since assignment notebooks may not run seamlessly on Windows, even if using WSL.
Once you're logged into the server, follow these installation steps:
# Download Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh

# Run the installer
bash miniconda.sh -b -p $HOME/miniconda

# Initialize conda in your shell
eval "$($HOME/miniconda/bin/conda shell.bash hook)"
conda init

# Create course environment
conda create -n cs451 python=3.10 -y
conda activate cs451

# Java 21 + Maven
conda install -c conda-forge openjdk=21 maven -y

# Python libraries
pip install findspark numpy jupyterlab

# Spark 4.0.0 prebuilt with Hadoop 3.4+
wget https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
tar -xvzf spark-4.0.0-bin-hadoop3.tgz -C $HOME

# Set environment variables. Add these two lines to the bottom of your ~/.bashrc:
export SPARK_HOME="/path/to/spark-4.0.0-bin-hadoop3"
export PATH="$PATH:$SPARK_HOME/bin"

# Reload shell
source ~/.bashrc
Here's how to verify everything works:
# Ensure the conda course environment is activated, then run the following commands
spark-shell --version   # should print Spark info
jupyter lab             # launch JupyterLab
When you start JupyterLab from the terminal, it will display a few links. Copy any one of them into your browser on your local machine to open and work with the notebook.
The Remote - SSH extension lets you use any remote machine with an SSH server as your development environment.
After connecting to the student Linux environment using your SSH credentials, open a directory for your assignments; this will relaunch VS Code and require you to provide your SSH password again.
Opening a .ipynb notebook should render the notebook, allowing it to be run and edited. Clicking the ⊳ next to an executable cell should prompt you to install the Jupyter extension pack. Accepting this will allow you to pick the Python interpreter to be used; one of the options should be the cs451 conda environment you created earlier, so select that one.
To verify everything works, try editing the first executable cell to include print("VS Code + Jupyter is working!") and check that this message is printed after clicking the ⊳ next to the cell.
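For a slightly fuller check that the selected kernel really is the cs451 environment and can see Spark, a cell along these lines should also work (a sketch, assuming SPARK_HOME is set in the environment the kernel inherits):

import sys
print(sys.executable)  # should point inside your cs451 conda environment

import findspark
findspark.init()       # locates Spark via SPARK_HOME

import pyspark
print(pyspark.__version__)  # should print 4.0.0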
While the assignments are to be turned in as Jupyter notebooks, notebooks are not necessarily the most convenient development environment. Different software capabilities can be accessed via JupyterLab, from a shell (terminal), from VS Code, etc. You might want to experiment with different approaches for different usage scenarios. Remember, though, to follow the submission instructions for each assignment.