Architecture
The entire idea behind this application is to scrape data from government websites and store that data as JSON in order to serve it over a modern, clean, and RESTful API. This will enable developers to build data visualizations and data search solutions easily.
So there are two aspects of this application:
- Jobs that go out and scrape websites and store that data
- RESTful endpoints built out in Express
While it may be a good idea to keep those two separate for scalability purposes (microservices that are coordinated via a queuing system to fetch and process data from various sites), I'm much too lazy for that and for now am building out modules that can be easily ported into separate services.
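A minimal sketch of what that could look like (file names and module layout are hypothetical, not taken from the repo): the scraping jobs and the Express routes share one process for now, with each processor behind its own module boundary so it can be lifted out later.

```js
// app.js -- hypothetical wiring; for now, processors and the REST API
// share one process, but each processor stays behind its own module.
const express = require('express');
const reservoirs = require('./processors/reservoirs'); // hypothetical processor module

const app = express();
app.use('/reservoirs', require('./routes/reservoirs')); // hypothetical router

// Kick the processor off in-process for now; a queue-driven worker could
// own this call if/when the project splits into microservices.
reservoirs.run().catch(console.error);

app.listen(3000);
```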
A basic workflow for a processor module:
- A 'data source' is recognized. Ex: a website or endpoint that will serve data relevant to the California drought.
- Scraping and crawling functions will go out and traverse the website, collecting any pertinent, structured data and storing that data in RethinkDB.
- Express routes will serve that data in a way that makes sense for that data source. Ex: `/reservoirs/:id` (see the sketch below)
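A minimal sketch of that kind of route, assuming a RethinkDB `reservoirs` table keyed by station id (connection settings, table, and field names are illustrative):

```js
// routes/reservoirs.js (illustrative)
const express = require('express');
const r = require('rethinkdb');

const router = express.Router();

// GET /reservoirs/:id -- return the stored document for one reservoir
router.get('/:id', async (req, res) => {
  let conn;
  try {
    conn = await r.connect({ host: 'localhost', port: 28015, db: 'drought' });
    const doc = await r.table('reservoirs').get(req.params.id).run(conn);
    if (!doc) return res.status(404).json({ error: 'not found' });
    res.json(doc);
  } catch (err) {
    res.status(500).json({ error: err.message });
  } finally {
    if (conn) conn.close();
  }
});

module.exports = router;
```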
Because the drought is an issue that progresses slowly, processors will probably only need to be run once every morning or even once per week.
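One way to schedule that, assuming the node-cron package (the repo may well use something else, such as a plain system cron entry or a queue):

```js
// scheduler.js -- a sketch; assumes the node-cron package
const cron = require('node-cron');
const reservoirs = require('./processors/reservoirs'); // hypothetical module

// Run the reservoir processor at 6:00 AM every day; slower-moving sources
// could use a weekly pattern like '0 6 * * 1' (Mondays) instead.
cron.schedule('0 6 * * *', () => {
  reservoirs.run().catch(console.error);
});
```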
What I mean by 'the api should be an api' is that, although data will enter the platform primarily through scraping and web crawling, I like the idea of the platform also being able to receive data through the api.
Keep this in mind when you're building out a processor module so that it isn't too isolated. A processor module can be split into two parts: scraping and data storage. That way the storage portion can be kept in the Express application for use in a controller -> model interaction, but we also give ourselves room to move to a microservice-based architecture (read: we get big and it becomes a real platform).
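Concretely, the storage half can be a plain function that both the scraper and an Express controller call; the file, table, and field names below are placeholders:

```js
// processors/reservoirs/store.js (illustrative) -- the storage half is a
// plain function, so it doubles as the "model" in a controller -> model setup.
const r = require('rethinkdb');

async function store(records) {
  const conn = await r.connect({ host: 'localhost', port: 28015, db: 'drought' });
  try {
    // Upsert so re-running a processor refreshes existing documents
    await r.table('reservoirs').insert(records, { conflict: 'update' }).run(conn);
  } finally {
    conn.close();
  }
}

module.exports = store;
```

The scraping half then just calls `store()` with whatever it parsed, and a `POST /reservoirs` controller can call the same function, which is how data gets to enter the platform through the api without duplicating any persistence logic.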