NER Visualization

The Warcbase Spark matchbox functions in ExtractEntities list the named entities contained in each page of an archive, but we are often more interested in what a collection contains as a whole. Visualization can help. We have provided a JavaScript visualizer, built with D3.js, that produces views of NER data. You can try the visualizer here.

The visualizer can currently produce the following:

* a list view, with frequency of the selected entity type represented by font size (inspired by the Trading Consequences Location Cloud)
* a standard word cloud for the selected entity type
* a bubble chart, representing all entity types at once

Generating NER Data

The matchbox contains a function in NERCombinedJson that extracts named entities from plain-text records, summarizes them by crawl date, and saves the results as a single JSON file. The following script calls the function. Modify the file names in (1) and (2) as appropriate.

import org.warcbase.spark.matchbox.NERCombinedJson

sc.addFile("/path/to/english.all.3class.distsim.crf.ser.gz") // (1)

val ner = new NERCombinedJson

ner.classify("english.all.3class.distsim.crf.ser.gz", "hdfs:///path/to/plaintext/", "results.json", sc) // (2)

Setting Up the Visualizer

To use the visualizer with your own data, you must place the files from warcbase/vis/ner into a folder on a web server. If you wish to serve files locally, the Python SimpleHTTPServer module is easy to use. Because of cross-domain restrictions, your web browser will only allow the visualizer to load JSON files from the same server that serves the visualizer, so place the data file in a location that this web server can serve.

For example, navigate to warcbase/vis/ner and place your results.json file into that directory.

Then run:

python -m SimpleHTTPServer 4321
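Note that SimpleHTTPServer is the Python 2 module name. In Python 3 the same functionality lives in the standard-library module http.server, so the equivalent command is `python3 -m http.server 4321`. As a minimal sketch, the script below does the same thing programmatically. The function name `make_server` and its defaults are illustrative choices and not part of Warcbase. It assumes Python 3.7 or later and that it is run from the directory holding the visualizer files.

```python
import functools
import http.server


def make_server(directory=".", port=4321):
    """Return an HTTP server that serves static files from `directory`."""
    # SimpleHTTPRequestHandler accepts a `directory` argument since Python 3.7.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to the loopback address only, since this is meant for local viewing.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


if __name__ == "__main__":
    server = make_server()
    print("Serving on http://localhost:%d/" % server.server_address[1])
    try:
        server.serve_forever()
    except KeyboardInterrupt:
        server.server_close()
```

Stop the server with Ctrl-C when you are done.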

Navigate to localhost:4321. To bring up your results, pass the name of your data file in the json parameter of the URL. For example, the following URL will display the visualization:

http://localhost:4321/index.html?json=results.json
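If the page loads but shows nothing, a malformed data file is one possible cause. The following sketch checks that the file parses as JSON and then builds the URL in the form shown above. The helper name `visualizer_url` is hypothetical. It assumes the data file sits in the same directory as index.html and that the script is run from that directory. It does not check the structure the visualizer expects, only that the file is well-formed JSON.

```python
import json
from urllib.parse import urlencode


def visualizer_url(json_name, host="localhost", port=4321):
    """Check that `json_name` parses as JSON and return the URL that loads it."""
    with open(json_name, encoding="utf-8") as f:
        json.load(f)  # raises json.JSONDecodeError (a ValueError) if malformed
    # The data file name is passed in the `json` query parameter.
    query = urlencode({"json": json_name})
    return "http://%s:%d/index.html?%s" % (host, port, query)


if __name__ == "__main__":
    print(visualizer_url("results.json"))
```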