Spark and PySpark

Spark is the core software used in this course. More specifically, we'll be extensively using PySpark, which provides Python bindings to Spark. (Py)Spark can be invoked from the command line, but the assignments are written as Jupyter notebooks.

Although (Py)Spark is intended to run on a cluster, in this course you'll be running the complete software stack locally. The assignments are designed so that they can be completed on any reasonably modern laptop. The datasets we'll be giving you are at most a few gigabytes, and processing time should not exceed a couple of minutes. If your experience diverges substantially from this, you're probably doing something wrong.

We will strive to be helpful, but it is our expectation that you will be able to install and configure all software necessary for this course on your own. We provide basic installation instructions here, but the course staff cannot provide detailed technical support and individual hand-holding for everyone, due to the number of students and the idiosyncrasies of individual systems.

Conda

Conda is recommended for managing your environment. Here's how you might get started by creating an environment for this course:

conda create -n cs451 python=3.10 -y
conda activate cs451
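To confirm that activation worked, you can check from Python: conda exports the CONDA_DEFAULT_ENV environment variable in activated environments, so a sketch like this should print cs451 after the commands above.

```python
# A sketch: conda sets CONDA_DEFAULT_ENV in activated environments,
# so this should print "cs451" after running the commands above.
import os

print(os.environ.get("CONDA_DEFAULT_ENV", "(no conda environment active)"))
```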

Once you've activated the environment, install the following packages:

conda install -c conda-forge openjdk=21 maven -y
pip install findspark
pip install numpy
pip install jupyterlab
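A quick way to confirm that the tools and packages above are all visible from your environment is a short Python check. This is just a sketch: it looks up the executables installed by conda and the packages installed by pip, and reports anything missing.

```python
# A sanity-check sketch: confirm the executables and packages installed above
# are visible from the active environment.
import importlib.util
import shutil

# executables installed via conda (openjdk, maven) and pip (jupyterlab)
for exe in ("java", "mvn", "jupyter"):
    print(f"{exe}: {shutil.which(exe) or 'NOT FOUND'}")

# packages installed via pip
for pkg in ("findspark", "numpy"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'NOT FOUND'}")
```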

We'll be using Python 3.10 and Java 21. As of September 2025, jupyterlab is at version 4.4.6.

Next, grab the Spark tarball from the Apache Spark downloads page. You want v4.0.0, "Pre-built for Apache Hadoop 3.4 and later". Note that Spark 4 is pre-built with Scala 2.13.

Unpack the tarball somewhere convenient. Set your SPARK_HOME environment variable to point to the location where you've unpacked the tarball, and add its bin/ sub-directory to your PATH. Something like this:

export SPARK_HOME="/path/to/spark-4.0.0-bin-hadoop3"
export PATH="$PATH:$SPARK_HOME/bin"
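If you want to double-check this step from Python (for example, from inside a notebook), a sketch like the following verifies that SPARK_HOME is set and points at a directory containing the expected bin/ sub-directory:

```python
# A sketch: check that SPARK_HOME is set and looks like an unpacked Spark tarball.
import os

spark_home = os.environ.get("SPARK_HOME")
print("SPARK_HOME =", spark_home)
if spark_home:
    # the unpacked tarball should contain a bin/ directory (pyspark, spark-submit, ...)
    print("bin/ exists:", os.path.isdir(os.path.join(spark_home, "bin")))
```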

Once this is all done, you should be able to launch JupyterLab in a shell (terminal), as follows:

jupyter lab

The above instructions provide a starting point. There are many variations, especially for different operating systems, so YMMV.

Different Entry Points

While the assignments are to be turned in as Jupyter notebooks, notebooks are not necessarily the most convenient development environment. Different software capabilities can be accessed via JupyterLab, from a shell (terminal), from VS Code, etc. You might want to experiment with different approaches for different usage scenarios. Remember, though, to follow the submission instructions for each assignment.