Several Basic Spark Commands

We have several basic commands that you will see again and again in these scripts.

Filtering

.keepValidPages(): This keeps only pages that are encoded as text/html, end with htm or html file extensions, do not have a null crawldate, and are not a robots.txt file. When doing text or link analysis, you want to work with the HTML pages themselves.

.keepMimeTypes(): This allows you to specify file types that you're interested in keeping. The opposite command is .discardMimeTypes.

.keepDate(): This command allows you to specify specific dates that you are interest in keeping. If you were dealing with a large number of WARCs and only wanted to keep files from October 10th 2005, you would pass .keepDate("20051010"). The opposite command is .discardDate().

.keepDomains(): This command allows you to specify specific domains that you are interested in keeping. If you were dealing with a large number of WARCs and only wanted to keep domains from the Green Party of Canada, you would pass .keepDomains(Set("greenparty.ca")). The opposite command is .discardDomains().

.keepUrls(): This is a similar command to above but on URLs not just domains. The opposite command is .discardUrls().

.keepUrlPatterns(): This command allows you to specify URL patterns for records you wish to keep. The patterns must be regular expression objects. You can generate a regular expression object by appending .r to the end of a string. E.g., keepUrlPatterns(Set("http://www.archive.org/about/.*".r)) will keep all records with URLs beginning with http://www.archive.org/about/. The opposite command is .discardUrlPatterns(). (Remember that the dot has special meaning in a regular expression, so if you wished to keep all URLs beginning with http://www. you would need to escape the dot by specifying keepUrlPatterns(Set("http://www\\..*".r)).) If you want to make this case insensitive, use (?i), e.g. keepUrlPatterns(Set("(?i)http://www.archive.org/about/.*".r)).

.keepLanguages(): This allows you to keep only pages that are written in a specified language. It uses the ISO 639.2 language codes; currently it supports the following langauges: da, de, et, el, en, es, fi, fr, hu, is, it, lt, nl, no, pl, pt, ru, sv, th. If you wanted to keep only pages in French and German, you would do .keepLanguages(Set("fr", "de")). Language detection is somewhat resource-intensive on a large collection, so run your other filters first.

.keepContent(): This command allows you to keep only pages that contain a given keyword. The opposite command is .discardContent().