Installing Spark Notebook on a Cloud Computer

You might want a bit more power than you can get on your local computer. If so, read on to learn how to deploy Warcbase on a Linux server or cloud machine. If you are already on a Linux machine, you may want to skip ahead to Step Four below.

This is a walkthrough for installing Warcbase and Spark on an Ubuntu 14.04 (Trusty) x86_64 QCOW2 image provided by Compute Canada. Amazon EC2 provides similar images. For more information on Warcbase, check out the repository here.

It is a bit bare-bones, as it assumes some familiarity with a command line environment.

Step One: SSH to the server (if applicable)
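The exact command depends on your provider and key pair. As a sketch, assuming the key file and the default ubuntu user used later in this walkthrough (MYIPADDRESS is a placeholder, not a real address):

```shell
# Connect to the instance. The key file name and MYIPADDRESS are
# placeholders from this walkthrough -- substitute your own.
ssh -i macpro.key ubuntu@MYIPADDRESS
```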

Step Two: Install dependencies (if applicable)
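Warcbase is a Java/Maven project, so at minimum you will need a JDK, git, and Maven. A minimal sketch for Ubuntu 14.04; the package names are assumptions, so check the Warcbase repository for the definitive list:

```shell
# Assumed dependency set for building Warcbase on Ubuntu 14.04:
# a JDK for Spark and Warcbase, git to fetch the source, Maven to build it.
sudo apt-get update
sudo apt-get install -y openjdk-7-jdk git maven
```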

Step Three: Set up the server properly (if on a cloud machine)

Step Four: Install Spark
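One way to do this is to grab a prebuilt binary from the Apache archive and unpack it. The version below is an assumption based on the Spark Notebook build named in Step Eight (Spark 1.5.1 / Hadoop 2.6); match it to whatever notebook build you use:

```shell
# Download and unpack a prebuilt Spark distribution (version assumed
# from the spark-notebook directory name in Step Eight).
wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
tar -xzf spark-1.5.1-bin-hadoop2.6.tgz
```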

Step Five: Install Warcbase
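A sketch of cloning and building Warcbase with Maven, assuming the dependencies from Step Two are in place. The clone lands in ~/warcbase, which matches the paths used in Step Six:

```shell
# Clone the Warcbase repository and build it with Maven,
# skipping the test suite to save time on a small instance.
git clone https://github.com/lintool/warcbase.git
cd warcbase
mvn clean package -DskipTests
```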

Step Six: Test that Warcbase and Spark are working
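The Scala snippet below is meant to be run inside spark-shell with the Warcbase jar on the classpath. One way to launch it; the jar path and name are assumptions based on the build in Step Five, so adjust them to whatever your build produced:

```shell
# Launch spark-shell with the Warcbase jar on the classpath.
# The jar name is hypothetical -- check your target/ directory.
~/spark-1.5.1-bin-hadoop2.6/bin/spark-shell --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
```

Then, at the Scala prompt: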

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val r = RecordLoader.loadArchives("/home/ubuntu/warcbase/warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

If you receive the following output:

r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))

Then you're working.

Step Seven: Getting the Spark Notebook working

The catch is that you'll want to view it in a browser, but the notebook is running on a remote server.
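Before worrying about that, you need a Spark Notebook build on the server. A hypothetical sketch, assuming you have downloaded from spark-notebook.io a tarball matching the build named in Step Eight:

```shell
# Unpack a Spark Notebook distribution (build name taken from Step Eight;
# substitute whichever tarball you actually downloaded).
tar -xzf spark-notebook-0.6.2-SNAPSHOT-scala-2.10.4-spark-1.5.1-hadoop-2.6.0-cdh5.4.2.tgz
cd spark-notebook-0.6.2-SNAPSHOT-scala-2.10.4-spark-1.5.1-hadoop-2.6.0-cdh5.4.2
```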

Step Eight: Deployment

While it is easy to deploy for quick testing with sudo ./bin/spark-notebook -Dhttp.port=80, this leaves your Spark Notebook wide open to the world. Since anyone who connects gets read/write privileges, this isn't the best approach unless you really don't care about your machine or your data.

Instead, open an SSH tunnel to your instance. You'll need to reconnect using ssh. The following command establishes a tunnel from localhost:9000 on the remote machine to localhost:9000 on your own.

ssh -i macpro.key ubuntu@MYIPADDRESS -L 9000:127.0.0.1:9000

Once in, deploy your Spark Notebook from your spark-notebook directory (in my case, that is ~/spark-notebook-0.6.2-SNAPSHOT-scala-2.10.4-spark-1.5.1-hadoop-2.6.0-cdh5.4.2) by running:

sudo ./bin/spark-notebook -Dhttp.port=9000

Point your local browser at localhost:9000. You should see your Spark Notebook!

the spark notebook in action

Step Nine: Tweaking

You might find that your jobs are taking too long. This might be because you don't have enough executors set.

Find how many CPU cores you have free by running htop. On a lightweight cloud machine, you might see:

four cores in action

This shows four cores in action. So when you launch spark-shell as in Step Six above, you might want to pass --num-executors 4. Tweak and refine as needed.
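htop is interactive; if you just want the number of cores for a script or a quick check, nproc (part of GNU coreutils, preinstalled on Ubuntu) prints it directly:

```shell
# Print the number of processing units available to this machine;
# useful for picking a value for --num-executors.
nproc
```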

Step Ten: Have Fun

You've now got Warcbase running in the cloud. What more could a person want?