Warcbase

[Figure: network of the Canadian Political Parties collection]

Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark. For more information on the project and the team behind it, visit our about page.

Getting Started

You can download Warcbase here. The easiest way to get started is to follow our Getting Started tutorial. For a conceptual and practical introduction to the command line, please see Ian Milligan and James Baker's "Introduction to the Bash Command Line" at the Programming Historian.

Using Warcbase

If you've just arrived, you're probably interested in using Spark to analyze your web archive collections: gathering collection statistics, textual analysis, network analysis, etc.
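To give a feel for what a simple collection statistic looks like, here is a minimal sketch that counts captured URLs per domain. It is plain Python, not Warcbase's Spark API; the function name `domain_counts` and the sample URLs are hypothetical and stand in for records that would normally be read from an archive.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Count how many captured URLs fall under each host name."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # None when the string has no host part
        if host is None:
            continue  # skip malformed entries instead of failing
        # Treat "www.example.org" and "example.org" as the same domain.
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts


# Hypothetical captured URLs (assumption: not taken from a real collection).
sample_urls = [
    "http://www.party-a.example/",
    "http://party-a.example/platform",
    "https://www.party-a.example/news/2015",
    "http://party-b.example/",
    "http://blog.party-b.example/post/1",
    "not a url",
]

counts = domain_counts(sample_urls)
for domain, n in counts.most_common():
    print(domain, n)
```

On a real collection the same idea (extract a domain from each record, then count) would run over the archive records in Spark rather than over an in-memory list.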

If you want to explore web archives by other means, we have walkthroughs for using the SHINE front end on Solr indexes generated with Warcbase. See this SHINE walkthrough and this building Lucene indexes walkthrough.

About Warcbase

Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.

There are two main ways of using Warcbase:

You can use Warcbase without HBase. Because HBase requires more extensive setup, we recommend that if you're just starting out, you play with the Spark analytics and don't worry about HBase.

Warcbase is built against CDH 5.4.1.

The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions.

You are currently in our documentation.

Supporting files can be found in the warcbase-resources repository.

Project Team

Warcbase is brought to you by a team of researchers at the University of Waterloo, including:

License

Licensed under the Apache License, Version 2.0.

Acknowledgments

This work is supported in part by the National Science Foundation and by the Mellon Foundation (via Columbia University). Additional support has been provided by the Social Sciences and Humanities Research Council of Canada and the Ontario Ministry of Research and Innovation's Early Researcher Award program. Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.