The visualizer can currently produce the following:
* a list view, with frequency of the selected entity type represented by font size (inspired by the Trading Consequences Location Cloud)
* a standard word cloud for the selected entity type
* a bubble chart, representing all entity types at once.
Generating NER Data
The matchbox contains a function in NERCombinedJson that will extract NER entities from plain text records, summarize them by crawl date, and save the results as a single JSON file. The following script calls the function. Modify the file names in (1) and (2) as appropriate.
import org.warcbase.spark.matchbox.NERCombinedJson
sc.addFile("/path/to/english.all.3class.distsim.crf.ser.gz") // (1)
val ner = new NERCombinedJson
ner.classify("english.all.3class.distsim.crf.ser.gz", "hdfs:///path/to/plaintext/", "results.json", sc) // (2)
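Before handing the output to the visualizer, it can help to confirm that the file parses as JSON at all. The following is a minimal sketch in Python, not part of warcbase; the helper name `summarize_json` is hypothetical, and it makes no assumption about the structure of the file beyond it being valid JSON.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Parse the file at `path` as JSON; return (top-level type name, item count).

    Hypothetical helper for a quick sanity check. It does not know or
    validate the layout the visualizer expects.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)  # raises json.JSONDecodeError if malformed
    # Only lists and objects have a meaningful number of entries.
    count = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, count


# Example use, assuming results.json has been copied to the working directory:
#   kind, count = summarize_json("results.json")
#   print(kind, count)
```

If the call raises `json.JSONDecodeError`, the file is not well-formed JSON and the visualizer will not be able to load it.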
Setting Up the Visualizer
To use the visualizer with your own data, place the files from warcbase/vis/ner in a folder on a web server. If you wish to serve files locally, Python's SimpleHTTPServer is easy to use. Because of cross-domain restrictions, your web browser will only allow the visualizer to load JSON files from the same server, so place the data file somewhere that web server can serve it.
For example, navigate to warcbase/vis/ner, place your results.json file into that directory, and start the server:
python -m SimpleHTTPServer 4321
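The command above uses the Python 2 module name. In Python 3 the same functionality lives in the standard `http.server` module. As a rough sketch, assuming Python 3.7 or later, a directory can be served from a script like this; the function name `start_static_server` is made up for illustration, and the default port simply mirrors the 4321 used above.

```python
import functools
import http.server
import threading


def start_static_server(directory, port=4321, host="127.0.0.1"):
    """Serve `directory` over HTTP in a background thread; return the server.

    Hypothetical helper. Passing port=0 lets the OS pick a free port, which
    can then be read from server.server_address[1].
    """
    # Bind the handler to the chosen directory instead of the current one.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# Example use, assuming the visualizer files and results.json are in warcbase/vis/ner:
#   server = start_static_server("warcbase/vis/ner")
#   ... open the visualizer in a browser ...
#   server.shutdown()
#   server.server_close()
```

Because the visualizer files and the data file are served from the same origin, the browser's cross-domain restrictions do not get in the way.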
Navigate to localhost:4321. You then need to pass the URL to bring up your results. For example, the following URL will display the visualization: