Several Basic Spark Commands
We have several basic commands that you will see again and again in these scripts.
.keepValidPages(): This keeps only pages that are encoded as text/html, end with htm or html file extensions, do not have a null crawldate, and are not a robots.txt file. When doing text or link analysis, you want to work with the HTML pages themselves.
.keepMimeTypes(): This allows you to specify file types that you're interested in keeping. The opposite command is
.keepDate(): This command allows you to specify the dates that you are interested in keeping. If you were dealing with a large number of WARCs and only wanted to keep files from October 10th 2005, you would pass
.keepDate("20051010"). The opposite command is
.keepDomains(): This command allows you to specify the domains that you are interested in keeping. If you were dealing with a large number of WARCs and only wanted to keep pages from the Green Party of Canada's domain, you would pass
.keepDomains(Set("greenparty.ca")). The opposite command is
.keepUrls(): This works like .keepDomains(), but it matches on full URLs rather than just domains. The opposite command is
.keepUrlPatterns(): This command allows you to specify URL patterns for records you wish to keep. The patterns must be regular expression objects. You can generate a regular expression object by appending
.r to the end of a string. E.g.,
keepUrlPatterns(Set("http://www.archive.org/about/.*".r)) will keep all records with URLs beginning with
http://www.archive.org/about/. The opposite command is
.discardUrlPatterns(). (Remember that the dot has special meaning in a regular expression, so if you wished to keep all URLs beginning with
http://www. you would need to escape the dot by specifying
keepUrlPatterns(Set("http://www\\..*".r)).) If you want to make this case insensitive, use
.keepLanguages(): This allows you to keep only pages that are written in a specified language. It uses the two-letter ISO 639-1 language codes; currently it supports the following languages:
da, de, et, el, en, es, fi, fr, hu, is, it, lt, nl, no, pl, pt, ru, sv, th. If you wanted to keep only pages in French and German, you would do
.keepLanguages(Set("fr", "de")). Language detection is somewhat resource-intensive on a large collection, so run your other filters first.
.keepContent(): This command allows you to keep only pages that contain a given keyword. The opposite command is
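
The regular expression behaviour described under .keepUrlPatterns() can be tried out in plain Scala, without loading any WARCs. The sketch below uses only the Scala standard library. The helper matchesAny is hypothetical and is not part of the toolkit; it assumes that a pattern has to match the entire URL, which is consistent with the trailing .* in the examples above. The inline (?i) flag is a standard Scala/Java regex feature for case-insensitive matching, shown here as one possible approach rather than as the toolkit's documented recommendation.

```scala
import scala.util.matching.Regex

// Hypothetical helper (not a toolkit function): true when at least one
// pattern matches the whole URL.
def matchesAny(url: String, patterns: Set[Regex]): Boolean =
  patterns.exists(re => re.pattern.matcher(url).matches())

// Unescaped dot: "." stands for any single character, so "wwwX" is accepted.
val loose: Regex = "http://www..*".r

// Escaped dot: only a literal "." is accepted after "www".
val strict: Regex = "http://www\\..*".r

// Inline (?i) flag: the whole pattern is matched case-insensitively.
val anyCase: Regex = "(?i)http://www\\.archive\\.org/about/.*".r

val looseHit  = matchesAny("http://wwwXexample.org/", Set(loose))
val strictHit = matchesAny("http://wwwXexample.org/", Set(strict))
val caseHit   = matchesAny("HTTP://WWW.ARCHIVE.ORG/About/faq", Set(anyCase))
```

Here looseHit is true because the unescaped dot accepts the X, strictHit is false because the escaped dot requires a literal dot, and caseHit is true because of the (?i) flag. The same Set of Regex objects is the kind of value that the .keepUrlPatterns() examples above pass in.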