Building and Running Warcbase Under OS X
Warcbase is a web archive platform, not a single program. Its capabilities fall into two main categories:
- Analysis of web archives using the Spark programming language, and assorted helper scripts and utilities
- Web archive database management, with support for the HBase distributed data store, and OpenWayback integration providing a friendly web interface to view stored websites
One can take advantage of the analysis tools without bothering with the database management aspect of Warcbase -- in fact, most digital humanities researchers will probably find the former more useful. Users who are only interested in the analysis tools need only be concerned with the first two sections of this document (Prerequisites and Building Warcbase).
(This document was written because installing Warcbase under OS X requires a number of minor changes to the official project instructions.)
Prerequisites

- OS X Developer Tools
- Maven (brew install maven)
- Hadoop (brew install hadoop)
- HBase (brew install hbase)
Configure HBase by making the following changes to files located in the HBase installation directory, which will be something like /usr/local/Cellar/hbase/0.98.6.1/libexec/ (depending on the version number).
conf/hbase-site.xml: Insert the following within the <configuration> tags:

<property>
  <name>hbase.rootdir</name>
  <value>file:///Users/yourname/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/Users/yourname/zookeeper</value>
</property>
where yourname is your username. Feel free to choose other directories to store these files, used by HBase and its ZooKeeper instance, if you like. HBase will create these directories itself; if they already exist, they will cause problems later on.
conf/hbase-env.sh: Look for the following line:

export HBASE_OPTS="-XX:+UseConcMarkSweepGC"

and change it to:

export HBASE_OPTS="-XX:+UseConcMarkSweepGC -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Verify that HBase is installed correctly by running the HBase shell:
$ hbase shell
hbase(main):001:0> list
// Some lines of log messages
0 row(s) in 1.3060 seconds

=> []
hbase(main):002:0> exit
- Tomcat (brew install tomcat) (only necessary for OpenWayback integration; skip otherwise)
- Spark (brew install spark)
N.B. If you run an automatic Homebrew system update (brew update && brew upgrade), it is possible a new version of Hadoop, HBase, or Tomcat will be installed. The previous version will remain on your system, but the symbolic links in /usr/local/bin/ will point to the new version; i.e., it is the new version that will be executed when you run any of the software's components, unless you specify the full pathname. There are two solutions:

1. Re-configure the new version of the updated software according to the instructions above.
2. Make the symbolic links point to the older version of the updated software, with the command brew switch <formula> <version>. E.g., brew switch hbase 0.98.6.1.
Building Warcbase

To start, you will need to clone the Warcbase Git repository:
$ git clone http://github.com/lintool/warcbase.git
From inside the root directory
warcbase, build the project:
$ mvn clean package appassembler:assemble -DskipTests
If you leave off
-DskipTests, the build may fail when it runs tests due to a shortage of memory. If you try the build with the tests and this happens, don't worry about it.
Because the default OS X filesystem is case-insensitive (it does not allow two files or directories whose names differ only in case), you must remove one file from the JAR package:
$ zip -d target/warcbase-0.1.0-SNAPSHOT-fatjar.jar META-INF/LICENSE
To ingest a directory of web archive files (which may be GZip-compressed, e.g., webcollection.arc.gz), run the following from inside the warcbase directory:
$ start-hbase.sh
$ export CLASSPATH_PREFIX="/usr/local/Cellar/hbase/0.98.6.1/libexec/conf/"
$ sh target/appassembler/bin/IngestFiles -dir /path/to/webarchive/files/ -name archive_name -create -gz
Change as appropriate the HBase configuration path (version number), the directory of web archive files, and the archive name. Use the option -append instead of -create to add to an existing database table. Note the -gz flag: it switches the compression method from the default, Snappy, which is unavailable as a native Hadoop library on OS X, to Gzip. (The above commands assume you are using a shell in the warcbase root directory.)
Tip: To avoid repeatedly setting the CLASSPATH_PREFIX variable, add the export line to your shell startup file (e.g., ~/.bash_profile).
If you wish to shut down HBase, the command is
stop-hbase.sh. You can check if HBase is running with the command
jps; if it is running you will see the process
HMaster listed. You can also view detailed server status information at http://localhost:60010/.
Run and test the WarcBrowser
You may now view your archived websites through the WarcBrowser interface.
# Start HBase first, if it isn't already running:
$ start-hbase.sh
# Set CLASSPATH_PREFIX, if it hasn't been done this terminal session:
$ export CLASSPATH_PREFIX="/usr/local/Cellar/hbase/0.98.6.1/libexec/conf/"
# Start the browser:
$ sh target/appassembler/bin/WarcBrowser -port 8079
You can now use http://localhost:8079/ to browse the archive. For example:

- http://localhost:8079/archive_name/*/http://mysite.com/ will give you a list of the available versions of http://mysite.com/.
- http://localhost:8079/archive_name/19991231235959/http://mysite.com/ will give you the record of http://mysite.com/ just before Y2K.
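The 14-digit segment in the second URL is a timestamp in YYYYMMDDHHMMSS form, and "*" requests all versions. As an illustration, here is a small Python sketch that builds such query URLs (the helper name and example hosts are made up for this sketch):

```python
from datetime import datetime

def wayback_url(host, archive, target, when=None):
    """Build a WarcBrowser-style query URL; '*' requests all versions."""
    ts = when.strftime("%Y%m%d%H%M%S") if when else "*"
    return f"{host}/{archive}/{ts}/{target}"

print(wayback_url("http://localhost:8079", "archive_name",
                  "http://mysite.com/",
                  datetime(1999, 12, 31, 23, 59, 59)))
# → http://localhost:8079/archive_name/19991231235959/http://mysite.com/
```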
For a more functional, visually appealing interface, you may install a customized version of OpenWayback.
Assuming Tomcat is installed, start it by running:
$ catalina start
Install OpenWayback by downloading the latest binary release from the OpenWayback releases page. Extract the .tar.gz; inside it there will be a web application file openwayback-(version).war. Copy this file into the webapps folder of Tomcat, something like /usr/local/Cellar/tomcat/8.0.17/libexec/webapps/, and rename it ROOT.war. Tomcat will immediately unpack this file into the webapps/ROOT directory.
If you are running a current version of Tomcat (i.e., version 8) in combination with OpenWayback 2.0.0, edit the file webapps/ROOT/WEB-INF/web.xml and insert a slash ("/") in front of the paths given for the config-path parameters. Future releases of OpenWayback should already include this configuration change.
Add the Warcbase jar file to the Wayback installation by copying target/appassembler/repo/org/warcbase/warcbase/0.1.0-SNAPSHOT/warcbase-0.1.0-SNAPSHOT.jar from the Warcbase build directory into Tomcat's webapps/ROOT/WEB-INF/lib directory. Then copy Warcbase's Wayback configuration file into webapps/ROOT/WEB-INF, editing it so that the table name is archive_name (or whatever the archive table in HBase is called).
Restart Tomcat:

$ catalina stop
$ catalina start
Now, navigate to http://localhost:8080/wayback/ and access one of your archived web pages through the Wayback interface.
Warcbase is useful for managing web archives, but its real power is as a platform for processing and analyzing the archives in its database. Its analysis tools are still under development, but at the moment you can use the tools described below, including the Pig scripting interface.
Building the URL mapping
Most of the tools that follow require a URL mapping file, which maps every URL in a set of ARC/WARC files to a unique integer ID. There are two ways of doing this; the first is simpler:
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.data.UrlMappingMapReduceBuilder -input /path/to/webarc/files -output fst.dat
If this does not work due to a lack of memory, try the following steps:
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.ExtractUniqueUrls -input /path/to/webarchive/files -output urls
(If you have configured HBase to run in distributed mode rather than in standalone mode, which is the configuration provided above, you must now copy the
urls directory out of HDFS into the local filesystem.)
$ sh target/appassembler/bin/UrlMappingBuilder -input /path/to/urls -output fst.dat
We can examine the FST data with the following utility program:
# Lookup by URL, fetches the integer id
$ sh target/appassembler/bin/UrlMapping -data fst.dat -getId http://www.foo.com/
# Lookup by id, fetches the URL
$ sh target/appassembler/bin/UrlMapping -data fst.dat -getUrl 42
# Fetches all URLs with the prefix
$ sh target/appassembler/bin/UrlMapping -data fst.dat -getPrefix http://www.foo.com/
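Conceptually, the FST acts as a compact, sorted, bidirectional mapping between URLs and dense integer IDs. The following toy Python class mimics the three lookups above with a plain sorted list; it is only an illustration of the behavior, not the actual FST implementation:

```python
import bisect

class UrlMapping:
    """Toy stand-in for the FST: sorted URLs get dense integer IDs."""
    def __init__(self, urls):
        self.urls = sorted(set(urls))

    def get_id(self, url):
        i = bisect.bisect_left(self.urls, url)
        return i if i < len(self.urls) and self.urls[i] == url else None

    def get_url(self, i):
        return self.urls[i]

    def get_prefix(self, prefix):
        # Prefix matches are contiguous in a sorted list.
        lo = bisect.bisect_left(self.urls, prefix)
        return [u for u in self.urls[lo:] if u.startswith(prefix)]

m = UrlMapping(["http://www.foo.com/", "http://www.foo.com/a",
                "http://www.bar.com/"])
print(m.get_id("http://www.foo.com/"))      # 1
print(m.get_prefix("http://www.foo.com/"))  # both foo URLs
```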
(If you are running in distributed mode, now copy the
fst.dat file into the HDFS, so it is accessible to the cluster:
$ hadoop fs -put fst.dat /hdfs/path/
You might have noticed that we are working here with ARC/WARC files rather than with tables in HBase; the same is true below. This is because most of the tools described here do not yet have HBase support.
Extracting the webgraph
We can use the mapping data from above to extract the webgraph, with a Hadoop program:
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.graph.ExtractLinksWac -hdfs /path/to/webarchive/files -output output -urlMapping fst.dat
The -hdfs flag is misleading: if HBase is running in standalone mode, this flag should specify a local path.
The resulting webgraph will appear in the output directory, in one or more part-* files.
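In essence, the extraction resolves each link's source and target through the URL mapping and counts the resulting integer edges. A toy Python sketch of that idea (the URLs, IDs, and edge layout here are invented for illustration; Warcbase's actual output format may differ):

```python
from collections import Counter

# (source URL, target URL) pairs harvested from pages
links = [("http://a.com/", "http://b.com/"),
         ("http://a.com/", "http://b.com/"),
         ("http://b.com/", "http://a.com/")]

# pretend FST lookup: URL -> integer id
ids = {"http://a.com/": 0, "http://b.com/": 1}

# aggregate duplicate links into weighted integer edges
edges = Counter((ids[s], ids[t]) for s, t in links)
for (src, dst), n in sorted(edges.items()):
    print(src, dst, n)
# → 0 1 2
#   1 0 1
```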
Extracting a site-level webgraph
Instead of extracting links between individual URLs, we can extract the site-level webgraph by aggregating all URLs sharing a common prefix into a "supernode". Link counts between supernodes represent the total number of links between their individual URLs. In order to do this, the following input files are needed:

- a CSV prefix file providing a URL prefix for each supernode (comma-delimited: ID, URL prefix). The first line of this file is ignored (reserved for headers). The URL prefix is a simple string representing a site, e.g., http://cnn.com/. Each ID must be unique, so take note of the total number of unique URLs extracted when building the URL mapping above, and make sure your IDs are larger than that number.
- an FST mapping file to map individual URLs to unique integer ids (from above)
Then run this MapReduce program:
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.data.ExtractSiteLinks -hdfs /path/to/webarchive/files -output output -numReducers 1 -urlMapping fst.dat -prefixFile prefix.csv
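Conceptually, the aggregation works like the following toy Python sketch (the prefix table, IDs, and links are made-up examples; the real program resolves URLs through the FST mapping):

```python
from collections import Counter

# supernode prefix table, as in the CSV: (id, URL prefix);
# ids are chosen above the range of individual-URL ids
prefixes = [(1000, "http://cnn.com/"), (1001, "http://bbc.co.uk/")]

def supernode(url):
    """Map a URL to the id of the first supernode whose prefix matches."""
    for sid, p in prefixes:
        if url.startswith(p):
            return sid
    return None

links = [("http://cnn.com/world", "http://bbc.co.uk/news"),
         ("http://cnn.com/us", "http://bbc.co.uk/news")]

# collapse URL-level links into weighted supernode edges
site_links = Counter((supernode(s), supernode(t)) for s, t in links)
print(site_links)  # the two URL-level links collapse into one edge
```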
Other analysis tools
The tools described in this section are relatively simple. Some can process ARC and/or WARC files, while others exist in ARC and WARC versions.
There are three counting tools, which count content types, crawl dates, and URLs respectively:

$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.CountArcContentTypes -input /arc/files/ -output contentTypes
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.CountArcCrawlDates -input /arc/files/ -output crawlDates
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.CountArcUrls -input /arc/files/ -output urls
For WARC files, replace "Arc" with "Warc" in the class names (e.g., CountWarcContentTypes).
There is a tool to extract unique URLs:
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.ExtractUniqueUrls -input /arc/or/warc/files -output uniqueUrls
There is a pair of tools to find URLs matching a regex pattern:

$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.FindArcUrls -input /arc/files/ -output foundUrls -pattern "http://.*org/.*"
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.FindWarcUrls -input /warc/files/ -output foundUrls -pattern "http://.*org/.*"
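Conceptually, each record's URL is tested against the supplied regular expression; in Python terms, the filtering step looks roughly like this (the sample URLs are invented, and the exact matching semantics of the tools may differ):

```python
import re

urls = ["http://example.org/page", "http://example.com/page",
        "http://another.org/x"]

# the same pattern passed via the -pattern flag above
pattern = re.compile(r"http://.*org/.*")

matches = [u for u in urls if pattern.match(u)]
print(matches)
# → ['http://example.org/page', 'http://another.org/x']
```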
There is a tool to detect duplicates in the HBase table:
$ sh target/appassembler/bin/DetectDuplicates -name table
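The underlying idea is to group captures by a digest of their content and report any digest shared by more than one row. A toy Python sketch of that approach (the row keys and contents are invented; the real tool reads from HBase):

```python
import hashlib
from collections import defaultdict

# toy rows: (row key, content bytes) standing in for HBase cells
rows = [("http://a.com/ 20050101", b"<html>same</html>"),
        ("http://a.com/ 20060101", b"<html>same</html>"),
        ("http://b.com/ 20050101", b"<html>other</html>")]

# bucket row keys by a digest of their content
by_digest = defaultdict(list)
for key, content in rows:
    by_digest[hashlib.md5(content).hexdigest()].append(key)

# any digest with more than one row key is a duplicate group
dupes = {d: ks for d, ks in by_digest.items() if len(ks) > 1}
print(dupes)  # the two identical captures of http://a.com/
```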
There is also a web graph tool that pulls out link anchor text (background: https://github.com/lintool/warcbase/issues/8).
$ hadoop jar target/warcbase-0.1.0-SNAPSHOT-fatjar.jar org.warcbase.analysis.graph.InvertAnchorText -hdfs /arc/or/warc/files -output output -numReducers 1 -urlMapping fst.dat
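Inverting anchor text means collecting, for each target page, the text of all links that point to it. A minimal Python illustration with made-up (source, anchor text, target) triples:

```python
from collections import defaultdict

# (source page, anchor text, target page) triples pulled from links
links = [("http://a.com/", "News", "http://b.com/"),
         ("http://c.com/", "Breaking news", "http://b.com/"),
         ("http://b.com/", "Home", "http://a.com/")]

# group anchor texts by the page they point to
inverted = defaultdict(list)
for src, anchor, dst in links:
    inverted[dst].append(anchor)

print(inverted["http://b.com/"])  # ['News', 'Breaking news']
```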
For all of these tools that employ a URL mapping, you must use the FST mapping generated for the set of data files you are analyzing.