Analyzing Web Archives with Spark
If you need assistance with setting up Spark, please visit this walkthrough.
Scripts are written in Scala. For a quick introduction to the language, Programming in Scala by Martin Odersky, Lex Spoon, and Bill Venners is a good starting point. For the most part, however, you should be able to adapt and extend the scripts that we provide in this guide.
Collection Level Analytics
Collection Level Analytics: Use Spark Notebook to generate an interactive visualization of what's in your collections.
Extracting Domain Level Plain Text: This command extracts plain text from a collection of ARC or WARC files, filtered by date, by domain, or by other criteria you specify.
Analysis of Site Link Structure: This command generates an aggregated site-level link structure from a collection of ARC or WARC files.
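To make the last two items concrete, here is a minimal sketch in plain Scala collections of what "filtering by date and domain" and "aggregating link structure" amount to. It is not the guide's actual scripts and does not use Spark or any archive-loading library: ArchiveRecord, domainOf, plainText, textFor, and siteLinks are hypothetical names introduced for illustration, crawl dates are assumed to be yyyyMMdd strings, and the tag stripping and link extraction are deliberately crude regular expressions.

```scala
// Hypothetical stand-in for one record read from an ARC or WARC file.
// crawlDate is assumed to be a "yyyyMMdd" string.
final case class ArchiveRecord(crawlDate: String, url: String, html: String)

object ArchiveSketch {
  // Host part of a URL without a leading "www."; empty string if it cannot be parsed.
  def domainOf(url: String): String =
    scala.util.Try(new java.net.URI(url).getHost).toOption
      .flatMap(host => Option(host)) // getHost returns null for relative URLs
      .map(_.stripPrefix("www."))
      .getOrElse("")

  // Crude plain-text extraction: drop anything that looks like a tag, collapse whitespace.
  def plainText(html: String): String =
    html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim

  // Keep records whose crawl date starts with datePrefix and whose domain matches,
  // then emit (crawl date, URL, plain text).
  def textFor(records: Seq[ArchiveRecord],
              datePrefix: String,
              domain: String): Seq[(String, String, String)] =
    records
      .filter(r => r.crawlDate.startsWith(datePrefix) && domainOf(r.url) == domain)
      .map(r => (r.crawlDate, r.url, plainText(r.html)))

  // Matches double-quoted href attributes only; a real HTML parser would be more robust.
  private val Href = "href=\"([^\"]+)\"".r

  // Count links between domains: (source domain, target domain) -> number of links.
  def siteLinks(records: Seq[ArchiveRecord]): Map[(String, String), Int] =
    records
      .flatMap { r =>
        val source = domainOf(r.url)
        Href.findAllMatchIn(r.html).map(m => (source, domainOf(m.group(1)))).toList
      }
      .filter { case (source, target) => source.nonEmpty && target.nonEmpty }
      .groupBy(identity)
      .map { case (pair, hits) => pair -> hits.size }
}
```

Calling ArchiveSketch.textFor(records, "200902", "example.org") would keep only February 2009 captures from example.org, and ArchiveSketch.siteLinks(records) would return one count per pair of linked domains. In a Spark job the same filter, map, and group-and-count steps would run over a distributed collection rather than a local Seq.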