Analyzing Web Archives with Spark

Setup

If you need assistance with setting up Spark, please visit this walkthrough.

Scripts are written in Scala. Programming in Scala by Martin Odersky, Lex Spoon, and Bill Venners is a good introduction to the language. For the most part, however, you should be able to adapt and extend the scripts that we provide in this guide.

Collection Level Analytics

Collection Level Analytics: Use Spark Notebook to generate an interactive visualization of what's in your collections.

Textual Analysis

Extracting Domain Level Plain Text: This command extracts plain text from a collection of ARC or WARC files, filtered by date, by domain, or by other criteria you specify.
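
To make the idea of filtered plain-text extraction concrete, here is a minimal sketch in plain Scala collections. It does not use Spark or this toolkit's own API: `PageRecord`, `PlainTextSketch`, `domainOf`, `stripTags`, and `extract` are hypothetical names introduced only for illustration, and the regex-based tag stripping is a crude stand-in for a real HTML-to-text step. The actual scripts linked from this guide operate on ARC and WARC files directly.

```scala
// Hypothetical stand-in for one archived page; not the toolkit's record class.
final case class PageRecord(crawlDate: String, url: String, html: String)

object PlainTextSketch {
  // Host part of a URL, or "" when the URL cannot be parsed or has no host.
  def domainOf(url: String): String =
    scala.util.Try(new java.net.URI(url).getHost).toOption
      .flatMap(Option(_))
      .getOrElse("")

  // Crude tag removal for illustration only: drop anything between '<' and '>',
  // then collapse runs of whitespace.
  def stripTags(html: String): String =
    html.replaceAll("(?s)<[^>]*>", " ").replaceAll("\\s+", " ").trim

  // Keep records matching the optional domain and date-prefix filters and
  // return (crawl date, domain, URL, plain text) tuples.
  def extract(
      records: Seq[PageRecord],
      domain: Option[String] = None,
      datePrefix: Option[String] = None
  ): Seq[(String, String, String, String)] =
    records
      .filter(r => domain.forall(d => d == domainOf(r.url)))
      .filter(r => datePrefix.forall(p => r.crawlDate.startsWith(p)))
      .map(r => (r.crawlDate, domainOf(r.url), r.url, stripTags(r.html)))
}
```

Calling `PlainTextSketch.extract(records, domain = Some("www.archive.org"))` keeps only pages from that host, while `datePrefix = Some("2009")` keeps only pages whose crawl date string starts with 2009. Both filters are optional, so other criteria can be added as further `filter` steps.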

Network Analysis

Analysis of Site Link Structure: This command generates an aggregated site link structure from a collection of ARC or WARC files.
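
As a rough illustration of what an aggregated site link structure means, the sketch below reduces page-to-page links to counts of domain-to-domain links. It uses plain Scala collections instead of Spark, and `LinkStructureSketch`, `domainOf`, and `aggregate` are hypothetical names, not part of this toolkit. It assumes the links have already been extracted as (source URL, target URL) pairs.

```scala
object LinkStructureSketch {
  // Host part of a URL, or "" when the URL cannot be parsed or has no host.
  def domainOf(url: String): String =
    scala.util.Try(new java.net.URI(url).getHost).toOption
      .flatMap(Option(_))
      .getOrElse("")

  // Map each (source URL, target URL) pair to (source domain, target domain),
  // drop pairs where either side has no usable host, and count how often
  // each domain pair occurs.
  def aggregate(links: Seq[(String, String)]): Map[(String, String), Int] =
    links
      .map { case (src, dst) => (domainOf(src), domainOf(dst)) }
      .filter { case (s, d) => s.nonEmpty && d.nonEmpty }
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size }
}
```

The resulting map can be read as a weighted edge list: each key is a (source domain, target domain) edge and each value is the number of page-level links behind it, which is the kind of input a network analysis tool typically expects.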