Ingesting Content into HBase

To use some of Warcbase's other functions, such as quickly serving pages from a Wayback instance, you can load your data into HBase. You do not need to do this to run analytics on collections, however; for that, a directory of ARC or WARC files is enough.

You can find some sample data here. Ingesting data into Warcbase is fairly straightforward:

$ export CLASSPATH_PREFIX="/etc/hbase/conf/"
$ sh target/appassembler/bin/IngestFiles \
    -dir /path/to/warc/dir/ -name archive_name -create

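Command-line options used above: -dir specifies the directory containing the ARC or WARC files, -name specifies the name of the archive, and -create creates the archive.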

An example on one OS X machine:

$ export CLASSPATH_PREFIX="/usr/local/Cellar/hbase/0.98.6.1/libexec/conf/"
$ sh target/appassembler/bin/IngestFiles -dir ~/desktop/WARC-directory/ -name webarchives1 -create -gz
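
Before starting an ingest, it can be useful to confirm that the directory passed to -dir actually holds archive files. The following is a minimal sketch, not part of Warcbase: the function names count_archive_files and check_ingest_dir are hypothetical, and it assumes the files sit directly inside the directory and use the usual .arc, .arc.gz, .warc, or .warc.gz extensions.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Warcbase).

# Print the number of ARC/WARC files found directly inside a directory.
# Assumption: files use the conventional extensions listed below.
count_archive_files() {
    dir="$1"
    find "$dir" -maxdepth 1 -type f \
        \( -name '*.arc' -o -name '*.arc.gz' \
           -o -name '*.warc' -o -name '*.warc.gz' \) \
        | wc -l | tr -d ' '
}

# Return 0 if the directory exists and holds at least one archive file,
# otherwise print a message to stderr and return 1.
check_ingest_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    n=$(count_archive_files "$dir")
    if [ "$n" -eq 0 ]; then
        echo "no ARC or WARC files found in $dir" >&2
        return 1
    fi
    echo "$n archive file(s) found in $dir"
}
```

With these functions defined in the current shell, check_ingest_dir ~/desktop/WARC-directory/ can be run first, and the IngestFiles command above is then run unchanged if the check succeeds.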

That should do it. The data should now be in Warcbase.
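
One way to check the result is to ask the HBase shell for its list of tables. The sketch below is an illustration under stated assumptions rather than a Warcbase feature: hbase_table_listed is a hypothetical function name, the hbase command is assumed to be on the PATH, the archive name given to -name is assumed to be used as the HBase table name, and the list command is assumed to print each table name on its own line.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Warcbase).
# Returns 0 if a table with the given name appears in the output of the
# HBase shell `list` command, 1 if it does not, and 2 if `hbase` is not
# available on the PATH.
# Assumption: the archive name passed to -name is the HBase table name.
hbase_table_listed() {
    table="$1"
    if ! command -v hbase >/dev/null 2>&1; then
        echo "hbase command not found on PATH" >&2
        return 2
    fi
    # Feed `list` to the HBase shell, strip surrounding whitespace from
    # each output line, and look for a line that is exactly the table name.
    if echo "list" | hbase shell 2>/dev/null \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -qxF -- "$table"; then
        return 0
    fi
    return 1
}
```

For the OS X example above, hbase_table_listed webarchives1 && echo "table found" would report whether a table with that name is visible to the HBase shell.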