Installing Spark Notebook on a Cloud Computer
You might want a bit more power than you can get on your local computer. If so, read on to learn how to deploy warcbase on a Linux or cloud machine. If you are already on a Linux machine, you may want to skip ahead to Step Four below.
This is a walkthrough for installing Warcbase and Spark on an Ubuntu_14.04_Trusty x86_64 (QCOW2) image provided by Compute Canada. Amazon EC2 provides similar images. For more information on Warcbase, check out the repository here.
It is a bit bare-bones, as it assumes some familiarity with a command-line environment.
Step One: SSH to the server (if applicable)
- For me on a Compute Canada instance, it's:
ssh -i macpro.key ubuntu@IPADDRESS
- If you're running locally on Linux, skip to the next step.
Step Two: Install dependencies (if applicable)
sudo apt-get update
sudo apt-get install htop
sudo apt-get install git
sudo apt-get install maven
sudo apt-get install scala
sudo apt-get install openjdk-7-jdk
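For the lazier among us, the same packages can be installed in one shot (a provisioning fragment for a Debian/Ubuntu box with sudo, so run it on the server rather than copying blindly):

```shell
# Install all of the dependencies above in a single command (Debian/Ubuntu).
sudo apt-get update
sudo apt-get install -y htop git maven scala openjdk-7-jdk
```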
Step Three: Set up the server properly (if on a cloud machine)
- run ping $HOSTNAME: if it responds with something like ping: unknown host milligan-wahr-05, you need to add an entry to your hosts file:
sudo vim /etc/hosts
- replace the localhost entry with your hostname (in my case, milligan-wahr-05).
- now try ping $HOSTNAME again: if you begin to see packet transmission/receipt information, you're golden.
Step Four: Install Spark
- You can download it from here. The pre-built version for Hadoop 2.6 is ideal for all systems.
- extract it:
tar -xvf spark-1.6.1-bin-hadoop2.6.tgz
- remove the tar file:
rm spark-1.6.1-bin-hadoop2.6.tgz
Step Five: Install Warcbase
- bring yourself back to your home directory:
cd ~
- download warcbase:
git clone https://github.com/lintool/warcbase.git
- change to the warcbase directory:
cd warcbase
- build warcbase-core:
sudo mvn clean package -pl warcbase-core -DskipTests
Step Six: Test that Warcbase and Spark are working
- verify that the shell works by navigating to your Spark directory and running:
./bin/spark-shell
- if you don't get a bunch of errors, try the following to initialize warcbase:
./bin/spark-shell --jars /home/ubuntu/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
- try the following script. To paste code into the shell, type :paste, paste the block, and then press Ctrl+D when you finish. Depending on your directory, you might need to change /home/ubuntu/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz to the path where warcbase is.
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("/home/ubuntu/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
If you receive the following output:
r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))
then warcbase and Spark are working.
Step Seven: Getting the Spark Notebook working
- download it. The build must match your Scala, Spark, and Hadoop versions; the file below is for Scala 2.10.4, Spark 1.5.1, and Hadoop 2.6.0-cdh5.4.2.
- unzip it:
tar -xvf spark-notebook-master-scala-2.10.4-spark-1.5.1-hadoop-2.6.0-cdh5.4.2.tgz
- test that it works by running, from the spark-notebook directory:
./bin/spark-notebook
The catch is that you'll want to view it in a browser, but you're working on a remote server.
Step Eight: Deployment
While it is easy to deploy, at least for quick testing purposes, with
sudo ./bin/spark-notebook -Dhttp.port=80
this will leave your Spark Notebook wide open to the world. Since the notebook has read/write privileges, unless you really don't care about your machine or your data, this isn't the best way.
Instead, open an SSH tunnel to your instance. You'll need to reconnect using ssh. The following command establishes a tunnel from the remote localhost:9000 to your local localhost:9000.
ssh -i macpro.key ubuntu@MYIPADDRESS -L 9000:127.0.0.1:9000
Once in, deploy your Spark Notebook by running, from your spark-notebook directory:
sudo ./bin/spark-notebook -Dhttp.port=9000
Point your local browser to localhost:9000. You should see your Spark Notebook!
Step Nine: Tweaking
You might find that your jobs are taking too long. This might be because you don't have enough executors set.
Find how many CPU cores you have free by running htop. On a lightweight cloud machine, you might see four cores in action. In that case, when you run your spark-shell as in Step Six above, you might want to pass --num-executors 4. Tweak and refine as needed.
Step Ten: Have Fun
You've now got warcbase running in the cloud. What more could a person want?