This folder contains all the code necessary to update the data tables. It is a single Python package with the following modules:
- `data`: auxiliary and helper data files used during processing of input
- `earthengine`: submodule containing code used to perform computations in Google Earth Engine
- `lib`: core functions and utilities used across the package
- `pipelines`: contains all the individual pipelines that produce the data tables
- `scripts`: collection of miscellaneous scripts, used only for manual processing
- `test`: unit testing of core functions
Data is automatically updated by the CI server on a schedule. To manually update the data for all
the data pipelines, execute the `update.py` script:

```sh
python update.py
```

To update a specific data pipeline, use the option `--only <comma_separated_names>`; for example:

```sh
python update.py --only index,demographics,geography
```

The update command will use the `output/tables` and `output/snapshot` folders at the root of the
project path. The raw data sources downloaded to produce outputs are placed under the `snapshot`
folder, and the processed data is placed in the `tables` folder.
Files in the `tables` folder are not meant to be used as-is. They are intended to be used for the
purpose of auditing changes in the dataset and are tracked by `git`. The files in the `tables`
folder are made available for general use only after a publishing step.
To execute the unit tests, run the following command from this directory:

```sh
python -m unittest
```
To add new sources to be ingested into an existing table, please take a look at the epidemiology table's README for step-by-step instructions. Even though it is written in the context of adding new sources to the epidemiology table, the same advice applies to any other table.
To create a new table output, it is recommended to start by making a copy of the `_template`
folder. The name of the folder will determine the file name of the output table, which will be
placed under the output folder. Because of limitations with Python imports, the folder name cannot
use dashes (`-`). Instead, use underscores (`_`), which will be replaced by dashes in the output
table name.
Individual data sources are encoded as a `DataSource`. Each data source goes through the following
steps, executed in order:
- Fetch: download resources into raw data
- Parse: convert raw data to intermediate structured format
- Merge: associate each record with a known `key`
- Filter: filter out unneeded data and keep only desired output columns
The majority of the processing in a data source will likely take place in the `parse` step. All
individual records output by the data source have to follow these guidelines:
- Each record must be matched with a known `key` present in the auxiliary metadata table;
  otherwise, it will be dropped from the output.
- Each record may include a `date` column, which must be in ISO 8601 format (i.e. `YYYY-MM-DD`).
  See the example record sketched after this list.
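For illustration, a record satisfying these guidelines might look like the following; the
`new_confirmed` column name is hypothetical and only stands in for whatever columns the pipeline
schema actually defines:

```python
# Hypothetical record output by a data source's parse step.
record = {
    "key": "AA",           # must exist in the auxiliary metadata table
    "date": "2020-03-15",  # optional, ISO 8601 (YYYY-MM-DD)
    "new_confirmed": 10,   # illustrative column name, defined by the pipeline schema
}
```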
To make writing pipelines easier, a default `DataSource` implementation
already includes a lot of the functionality that is likely to be used by a standard data parsing
routine, including downloading and conversion of raw resources into a pandas `DataFrame`. See the
template source for a trivial example which subclasses the `DataSource` implementation.
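As a rough sketch of what such a subclass could look like, the example below overrides only the
parse step; the import path, method name, arguments, and column names are assumptions based on the
description above, so defer to the template source for the actual interface:

```python
# Hypothetical sketch only: the module path, method signature, and column names
# below are assumptions; see the _template folder for the real interface.
import pandas as pd

from lib.data_source import DataSource  # assumed import path


class ExampleDataSource(DataSource):
    def parse(self, sources, aux, **parse_opts):
        # `sources` is assumed to reference the raw files downloaded during the fetch step.
        data = pd.read_csv(sources[0])
        # Rename columns so they match the pipeline schema.
        data = data.rename(columns={"region_code": "key", "day": "date"})
        # Records whose `key` is not in the auxiliary metadata table will be dropped
        # downstream, and columns outside the schema will be filtered out.
        return data[["key", "date", "new_confirmed"]]
```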
A `DataPipeline` object wraps a list of individual `DataSource`s and is defined in a `config.yaml`
file. Each data source is processed in order and the output is combined into a single data table.
When values for the same `key` (or, if present, `date`-`key` pair) overlap, the value present in the
last data source in the list is chosen. For example, if `source1` outputs `{key: AA, value: 1}` and
`source2` outputs `{key: AA, value: 2}`, then the combined output will be `{key: AA, value: 2}`,
assuming that `source2` comes after `source1` in the list. For this reason, you should order your
data sources from least trustworthy to most trustworthy, so that less trustworthy values can be
overwritten when more trusted data is available.
`DataPipeline` objects define the schema of the output table. Individual `DataSource` objects that
are part of the pipeline may output some or all columns defined in the schema, but columns not in
the schema will be filtered out in the final output.
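As a rough illustration of this precedence rule (not the actual pipeline code), combining two
sources so that the later one wins for overlapping `key`/`date` pairs could be done like this:

```python
import pandas as pd

# Hypothetical outputs of two data sources in the same pipeline.
source1 = pd.DataFrame([{"key": "AA", "date": "2020-03-15", "value": 1}])
source2 = pd.DataFrame([{"key": "AA", "date": "2020-03-15", "value": 2}])

# Later sources take precedence: concatenate in list order and keep only the
# last record seen for each (key, date) pair.
combined = (
    pd.concat([source1, source2])
    .drop_duplicates(subset=["key", "date"], keep="last")
    .reset_index(drop=True)
)
print(combined)  # value is 2, taken from source2
```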
The following diagram summarizes the data flow architecture:
This project can be run locally and is therefore agnostic to any specific infrastructure, but it makes use of several Google Cloud products to ensure robustness and continuous updates. Some of the configuration files are checked in with the code, such as AppEngine's `app.yaml` file. This serves as a reference implementation for users of this code who might want to reproduce our infrastructure setup. Here is a brief overview of the different Google Cloud products used as part of this project:
Some data is computed using Google Earth Engine. The code used is stored in a git submodule under `earthengine`; you may need to sign in using your Google account and request access via this link to be able to clone the submodule.
Data tables are made available for use via the `publish.py` script, which uploads the files to a file
server and creates the different versions of the datasets (like date subsets, the main table, JSON
formatted files, etc.). Data is published automatically by the CI server with every change to the
main branch; to run the publish step locally, simply execute this script from the `src` directory:

```sh
python publish.py
```
Some data sources are unreliable or only provide daily data (instead of historical data). To improve the resiliency of the data pipeline, you may also use a cache layer that creates hourly snapshots of data sources, which can then be aggregated into historical data. URLs defined in the `cache.yaml` file are downloaded automatically.
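As a rough sketch of how such a cache layer could work (the structure of `cache.yaml` shown here, a
simple mapping of entry names to URLs, is an assumption rather than the project's actual schema):

```python
# Hypothetical sketch of an hourly snapshot job; the cache.yaml structure
# (a mapping of entry names to URLs) is an assumption.
import datetime
import pathlib

import requests
import yaml

with open("cache.yaml") as fh:
    cache_config = yaml.safe_load(fh)

# One folder per hour, so snapshots can later be aggregated into historical data.
timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H")
for name, url in cache_config.items():
    output = pathlib.Path("cache") / timestamp / name
    output.parent.mkdir(parents=True, exist_ok=True)
    output.write_bytes(requests.get(url).content)
```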
There are no strict guidelines regarding the development environment. Here are the recommended settings, which help ensure consistency across the codebase.
Any Python version above 3.6 should be able to update and publish the data pipelines, provided the
required dependencies are met. The recommended approach is to use a virtual environment, like conda,
and then install the dependencies using the following command from the `src` folder:

```sh
conda install --file requirements.txt
```
Pre-commit Git hooks have a common configuration managed by pre-commit, which can be installed like this:

```sh
conda install -c conda-forge pre-commit
pre-commit install
```
This will ensure that lint checks run automatically prior to a git commit. You can also run the lint checks manually:

```sh
pre-commit run --all-files
```