Extended the Baleen export functionality to dump either an HTML or JSON corpus to disk in a format suitable for NLP analysis, particularly with NLTK. The new exporter is still single-process, but it is now query and memory optimized, which reduces both the time an export takes and the memory it requires. Additionally, we have improved the web application's visual interface, making status messages more noticeable as we monitor continued data ingestion.
The app can be found online at http://baleen.districtdatalabs.com.
Deployed: Monday, April 18, 2016
Contributors: Benjamin Bengfort, Sasan Bahadaran
Changes
- Updated the exporter to use no_dereferencing and no_cache
- Updated the exporter to write out a JSON meta file of feeds
- Exporter can now export in either JSON (default) or HTML
- Exporter is now memory and query optimized as well as we can manage
- No HTML sanitization occurs in the exporter any more
- Added a bit of colorization to the web app status page for quick duration identification
- Added iconography to the feeds and status page for better visualization
- Better datetime formatting that shows the timezone and is easier to read
- Included the humanize package for readable timesince and intcomma values
- Made the status page responsive
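
To illustrate the exporter changes listed above, here is a minimal sketch of a streaming export. It is not Baleen's actual implementation: the function name export_corpus, the post fields (id, category, content), the directory layout, and the feeds.json file name are all assumptions made for the example. It shows only the general idea of writing one document at a time, in either JSON or HTML, so that the whole corpus never has to sit in memory.

```python
import json
import os
import tempfile


def export_corpus(posts, feeds, root, fmt="json"):
    """Hypothetical helper: stream posts to disk and return the count written.

    `posts` is any iterable of dicts with "id", "category" and "content"
    keys (assumed field names); `feeds` is an iterable of feed dicts.
    """
    if fmt not in ("json", "html"):
        raise ValueError("fmt must be 'json' or 'html'")

    count = 0
    # Iterate lazily: if `posts` is a generator, only one post is held at a time.
    for post in posts:
        category_dir = os.path.join(root, post["category"])
        os.makedirs(category_dir, exist_ok=True)
        path = os.path.join(category_dir, "{}.{}".format(post["id"], fmt))
        with open(path, "w", encoding="utf-8") as f:
            if fmt == "json":
                json.dump(post, f)
            else:
                # Raw HTML is written as-is, with no sanitization step.
                f.write(post["content"])
        count += 1

    # Write a JSON meta file describing the feeds next to the documents.
    with open(os.path.join(root, "feeds.json"), "w", encoding="utf-8") as f:
        json.dump(list(feeds), f, indent=2)
    return count


if __name__ == "__main__":
    demo_posts = (
        {"id": i, "category": "news", "content": "<p>post %d</p>" % i}
        for i in range(3)
    )
    demo_root = tempfile.mkdtemp()
    written = export_corpus(demo_posts, [{"title": "Example Feed"}], demo_root, fmt="html")
    print("wrote", written, "documents to", demo_root)
```

One file per document, grouped into a directory per category, is a layout that corpus readers for categorized text can typically consume, which is why the sketch uses it; the real exporter may organize its output differently.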