This product loads input data containing latitude and longitude values and, optionally, interpolates it into a dataset mapped to H3 cells at a desired resolution. H3 (from Uber) cells form a uniform mesh of hexagons that evenly divides the globe at a variety of resolutions.
Once cell attributes are loaded, they are stored in a DuckDB database.
Datasets that do not require interpolation may be loaded as point datasets, which skip the interpolation step and store the exact locations of the input data points in the database.
Several environment variables are used by the code and scripts. Set up your environment as follows (note that "source" is used):
source ./bin/environment.sh
It is recommended that a Python virtual environment be created. Several convenience scripts are available to create and activate a virtual environment.
To create a new virtual environment run the below command (it will create a directory called "venv" in your current working directory):
$PROJECT_DIR/bin/venv.sh
Once your virtual environment has been created, it can be activated as follows (note: you must activate the virtual environment for it to be used, and the command requires "source" so that the environment variables supporting venv are established correctly):
source $PROJECT_DIR/bin/vactivate.sh
Install the required libraries as follows:
pip install -r requirements.txt
A CLI is available that makes it easy to interact with the service:
python ./src/cli/cli_load.py $VERBOSE --help
usage: cli_load.py [-h] [--verbose] {load,initialize,load-pipeline} ...
Data Mesh Agent Command Line Interface (CLI)
positional arguments:
{load,initialize,load-pipeline}
Available commands
load load a dataset into the geospatial dataset
initialize create source db from giss temperature data
load-pipeline run a loading pipeline for customizable data loading
options:
-h, --help show this help message and exit
--verbose Enable verbose output
To load a dataset through the command line, supply the relevant configuration file and run the below command. The output of a successful loader run is always a DuckDB database stored at <database_dir>/<dataset_name>.duckdb, where database_dir and dataset_name are parameters specified in the configuration file.
Example datasets and configuration files are available in the ./examples/loading directory.
In order to load data, a configuration file is needed to specify what is to be done. Parameters common to all loaders are shown below:
Parameter | Type | Description |
---|---|---|
loader_type | str | The type of loader to use. Available loaders: CSVLoader, ParquetLoader |
dataset_name | str | The name of the dataset to be created. |
dataset_type | str | The type of dataset that is to be created. Available options: [ h3, point ] |
database_dir | str | The directory where databases containing processed data are stored. The directory must exist. |
interval | str | Determines the time periods by which data is differentiated. Available options: [ one-time, yearly, monthly, daily ] |
max_resolution | str | Determines the maximum h3 resolution for which data will be calculated. |
data_columns | list[str] | A list of columns in the dataset that are to be interpreted as data. Columns not listed here (other than the latitude, longitude, and date/time columns) will be ignored. |
year_column | str | An optional parameter naming the column that indicates what year a data element is from. Mandatory if interval is yearly, monthly, or daily. |
month_column | str | An optional parameter naming the column that indicates what month a data element is from. Mandatory if interval is monthly or daily. |
day_column | str | An optional parameter naming the column that indicates what day a data element is from. Mandatory if interval is daily. |
shapefile | str | An optional parameter indicating a shapefile used to limit interpolation for h3 datasets. Only cells within the shapefile's boundaries will be interpolated. Has no effect on point datasets. |
region | str | An optional parameter naming a region within the specified shapefile, used to limit interpolation for h3 datasets. Only cells within the region's boundaries will be interpolated. Requires the shapefile parameter. Has no effect on point datasets. |
mode | str | Determines what loading mode the loader will use. Available options: [ "create", "insert" ] |
max_parallelism | int | Determines the maximum number of simultaneous threads to use when interpolating data. |
The below parameters are specific to the CSV loader:
Parameter | Type | Description |
---|---|---|
file_path | str | The path to the csv file to be loaded |
has_header_row | str | Indicates whether this CSV file has a header row. |
columns | Dict[str] | A dictionary mapping column names to the data type they contain. Supported types: [ str, float, int ] |
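As an illustration, a CSV loader configuration combines the common parameters above with the CSV-specific ones. The sketch below is hypothetical (the file name, column names, and values are made up, and a flat top-level YAML layout is assumed); the configuration files under ./examples/loading show the exact format used by the shipped examples.
loader_type: CSVLoader
dataset_name: example_temperatures            # hypothetical dataset name
dataset_type: h3
database_dir: ./tmp                           # directory must already exist
interval: monthly
max_resolution: 7
mode: create
max_parallelism: 4
data_columns:
  - temperature                               # hypothetical data column
year_column: year
month_column: month
file_path: ./data/example_temperatures.csv    # hypothetical input file
has_header_row: "true"                        # the parameter table above lists this as a str
columns:                                      # hypothetical column names and types
  latitude: float
  longitude: float
  temperature: float
  year: int
  month: int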
The below parameters are specific to the parquet loader:
Parameter | Type | Description |
---|---|---|
file_path | str | The path to the parquet file to be loaded |
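A parquet loader configuration differs only in its loader-specific parameters. A minimal, equally hypothetical sketch:
loader_type: ParquetLoader
dataset_name: example_points                  # hypothetical dataset name
dataset_type: point
database_dir: ./tmp
interval: one-time
max_resolution: 7
mode: create
data_columns:
  - value                                     # hypothetical data column
file_path: ./data/example_points.parquet      # hypothetical input file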
There are examples of data loading within the examples/loading directory.
The examples/loading/basic directory contains two examples, both of which use the data in the no_date_no_header file. This file contains a very small amount of data that will demonstrate the most basic abilities of the loader.
This example demonstrates the ability to create a very minimal h3 dataset. This will create an output file at ./tmp/h3_no_header.duckdb containing the loaded dataset.
CONFIG_PATH="./examples/loading/basic/h3_no_header_conf.yml" ;
python ./src/cli/cli_load.py load \
--config_path $CONFIG_PATH
This example demonstrates the ability to create a very minimal point dataset. This will create an output file at ./tmp/point_no_header.duckdb containing the loaded dataset.
CONFIG_PATH="./examples/loading/basic/point_no_header_conf.yml" ;
python ./src/cli/cli_load.py load \
--config_path $CONFIG_PATH
This example provides a slightly more detailed point dataset. It will create an output file at ./tmp/jamaica_buildings.duckdb containing the loaded dataset.
CONFIG_PATH="./examples/loading/jamaica_buildings/jamaica_building_conf.yml" ;
python ./src/cli/cli_load.py load \
--config_path $CONFIG_PATH
This example shows more features of the loader, using the GISS temperature dataset and a shapefile of countries. It contains more data than the prior examples and also uses a shapefile to limit the area for which interpolation is performed. In this example, data is interpolated only within the borders of Jamaica.
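In configuration terms, that restriction comes from the shapefile and region parameters described above. The fragment below is a rough sketch (the path and region name are assumptions based on the setup steps that follow); the shipped examples/loading/giss_temperature/giss_2022_12.yml file is the authoritative version.
shapefile: ./data/shapefiles/WORLD/world-administrative-boundaries.shp   # assumed path; see the shapefile setup below
region: Jamaica                                                          # assumed region name within the shapefile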
Shapefiles are files that define a geographic region. They are used in this example to ensure that processing only happens within a target region. In order to run the below examples, shapefiles will need to be downloaded from the following link:
Shapefiles source:
Retrieved from parent site: https://public.opendatasoft.com/explore/dataset/world-administrative-boundaries/export/
- retrieved as a dataset from the "Geographic file formats" section, "Shapefile" element, by clicking the "Whole dataset" link
Create the data/shapefiles/WORLD directory as below (if it does not already exist):
mkdir -p ./data/shapefiles/WORLD
Unzip the world-administrative-boundaries.zip file into the data/shapefiles/WORLD directory. This should result in a directory structure that looks like the below:
data
|-- shapefiles
    |-- WORLD
        |-- world-administrative-boundaries.prj
        |-- world-administrative-boundaries.cpg
        |-- world-administrative-boundaries.dbf
        |-- world-administrative-boundaries.shp
        |-- world-administrative-boundaries.shx
The GISS temperature dataset contains data on global temperatures and is used as the raw data for some of the below examples. It can be retrieved from the below links:
GISS Temperature:
These were retrieved from this parent site: https://data.giss.nasa.gov/gistemp/station_data_v4_globe/
Create the data/geo_data/temperatures directory using the below command (if it does not already exist):
mkdir -p data/geo_data/temperatures
Copy both the v4.mean_GISS_homogenized.txt and stations.txt files to the data/geo_data/temperatures directory.
Once the raw temperature data is retrieved, it must be converted into a CSV that the loader can process. To do this, run the below command, which will produce a CSV for the loader representing data from December 2022:
STATIONS="./data/geo_data/temperatures/station_list.txt" ;
TEMPERATURE="./data/geo_data/temperatures/v4.mean_GISS_homogenized.txt" ;
OUTPUT="./data/geo_data/temperatures/giss_2022_12.csv" ;
python ./examples/loading/common/temp_giss_to_csv.py \
--stations $STATIONS \
--temperature $TEMPERATURE \
--output $OUTPUT
Running the below command will generate the tmp/giss_temperature_2022_12_example.duckdb file, which contains the output of this example.
CONFIG_PATH="./examples/loading/giss_temperature/giss_2022_12.yml" ;
python ./src/cli/cli_load.py load \
--config_path $CONFIG_PATH
This example shows another available dataset, which contains flood information across Europe. The scripts included with this example will generate a dataset for Germany and a smaller, higher-resolution dataset for a section of northwestern Germany.
For this example the raw data is first converted into a parquet format, which is then loaded.
Shapefiles are files that define a geographic region. They are used in this example to ensure that processing only happens within a target region. These are the same shapefiles used in the GISS temperature example; if you have already retrieved them, you can skip this step. In order to run the below examples, shapefiles will need to be downloaded from the following link:
Shapefiles source:
Retrieved from parent site: https://public.opendatasoft.com/explore/dataset/world-administrative-boundaries/export/
- retrieved as a dataset from the "Geographic file formats" section, "Shapefile" element, by clicking the "Whole dataset" link
Create the data/shapefiles/WORLD directory as below (if it does not already exist):
mkdir -p ./data/shapefiles/WORLD
Unzip the world-administrative-boundaries.zip file into the data/shapefiles/WORLD directory. This should result in a directory structure that looks like the below:
data
|-- shapefiles
    |-- WORLD
        |-- world-administrative-boundaries.prj
        |-- world-administrative-boundaries.cpg
        |-- world-administrative-boundaries.dbf
        |-- world-administrative-boundaries.shp
        |-- world-administrative-boundaries.shx
In addition, a simple shapefile that selects a small rectangle of northwestern Germany is available in the examples/loading/flood_data/nw_germany_shapefile directory.
Create the ./data/shapefiles/custom/nw_germany_shapefile directory to hold these files:
mkdir -p ./data/shapefiles/custom/nw_germany_shapefile
Then copy the files from the examples/loading/flood_data/nw_germany_shapefile directory to the ./data/shapefiles/custom/nw_germany_shapefile directory.
cp ./examples/loading/flood_data/nw_germany_shapefile/* ./data/shapefiles/custom/nw_germany_shapefile
Additionally, the flood data that will be used as the raw data for this example will need to be retrieved. Note that this data is 5GB in size.
It can be retrieved from the below link:
The link was retrieved from this parent site:
Create the data/geo_data/flood/europe_flood_data directory as below:
mkdir -p ./data/geo_data/flood/europe_flood_data
Unzip the Pan-European data sets of river flood probability of occurrence under present and future climate_1_all.zip file into the data/geo_data/flood/europe_flood_data directory.
This should result in a directory structure that looks like the below:
data
|-- geo_data
    |-- flood
        |-- europe_flood_data
            |-- data.zip
            |-- readme_river_floods_v1.1.pdf
Create the data/geo_data/flood/europe_flood_data/data directory as below:
mkdir -p ./data/geo_data/flood/europe_flood_data/data
Unzip the data.zip file into the ./data/geo_data/flood/europe_flood_data/data directory. This should result in a file structure like the below:
data
|-- geo_data
    |-- flood
        |-- europe_flood_data
            |-- data.zip
            |-- readme_river_floods_v1.1.pdf
            |-- data
                |-- River_discharge_1971_2000_hist.dbf
                |-- River_discharge_1971_2000_hist.prj
                ...
Once the information is retrieved, it must be converted into a parquet file that the loader can process. To do so, run the below commands:
For Germany:
RAW="./data/geo_data/flood/europe_flood_data/data/River_flood_depth_1971_2000_hist_0010y.tif" ;
OUT="./tmp/flood_germany.parquet" ;
FILTER="Germany" ;
python ./examples/common/flood_to_parquet.py \
--raw $RAW \
--output $OUT \
--filter $FILTER
For northwestern Germany:
RAW="./data/geo_data/flood/europe_flood_data/data/River_flood_depth_1971_2000_hist_0010y.tif" ;
OUT="./tmp/flood_nw_germany.parquet" ;
FILTER="NW_Germany" ;
python ./examples/common/flood_to_parquet.py \
--raw $RAW \
--output $OUT \
--filter $FILTER
Running this example will generate the tmp/flood_data.duckdb file as its output. It will contain flood information about Germany, up to resolution 7.
CONFIG_PATH="./examples/loading/flood_data/flood_data.yml" ;
python ./src/cli/cli_load.py load \
--config_path $CONFIG_PATH
Running this example will generate the tmp/flood_nw_germany.duckdb file as its output. It will contain flood information about northwestern Germany, up to resolution 9.
CONFIG_PATH="./examples/loading/flood_data/flood_data_nw_germany.yml" ;
python ./src/cli/cli_load.py load \
--config_path $CONFIG_PATH
While this example uses the same raw data as the h3 Germany example, it loads the data as a collection of points, without interpolating.
CONFIG_PATH="./examples/loading/flood_data/flood_data_point.yml" ;
python ./src/cli/cli_load.py load \
--config_path $CONFIG_PATH
In addition to the simpler loaders mentioned above, a loading pipeline is available that allows multiple operations to be chained together, for more precise control over what exactly is present in the dataset.
Type | Description |
---|---|
Reading Step | A reading step loads the initial source data into the pipeline as a DataFrame, allowing further processing. Only a single reading step is allowed. |
Preprocessing Step | A preprocessing step is a step that will be performed on each individual data point before aggregation is performed. If multiple preprocessing steps are present, they are processed in the order they are mentioned in the configuration. |
Aggregation Step | During the processing of aggregation steps, data points will be grouped based on which H3 cell they are located in. Each aggregation step will be run on this grouped data, generating a single output per cell. |
Postprocessing Step | A postprocessing step will run after the aggregation. If multiple postprocessing steps are present, they are processed in the order they are mentioned in the configuration. |
Output Step | An output step will take the dataset created by the preceding steps and put it into a specified output location for storage. |
Parameter | Type | Mandatory | Description |
---|---|---|---|
reading_step | Dict[str,Any] | True | The parameters for the reading step in the pipeline. Parameters must contain the "class_name" key, with a corresponding str value which is the module and class name of the class of the reading step to run. The specified class must extend the loading.reading_step.ReadingStep abstract class. All other entries in the dictionary will be passed to the constructor of this class as arguments. |
preprocessing_steps | List[Dict[str,Any]] | False | A list of preprocessing steps to run in this pipeline. Each entry in the list must contain the "class_name" key, with a corresponding str value which is the module and class name of the class of the preprocessing step to run. This class must extend the loading.preprocessing_step.PreprocessingStep abstract class. All other entries in the dictionary will be passed to the constructor of this class as arguments. |
aggregation_steps | List[Dict[str,Any]] | False | A list of aggregation steps to run in this pipeline. Each entry in the list must contain the "class_name" key, with a corresponding str value which is the module and class name of the class of the aggregation step to run. This class must extend the loading.aggregation_step.AggregationStep abstract class. All other entries in the dictionary will be passed to the constructor of this class as arguments. If any aggregation steps are present, the aggregation_resolution parameter must be set. |
postprocessing_step | List[Dict[str,Any]] | False | A list of postprocessing steps to run in this pipeline. Each entry in the list must contain the "class_name" key, with a corresponding str value which is the module and class name of the class of the postprocessing step to run. This class must extend the loading.postprocessing_step.PostprocessingStep abstract class. All other entries in the dictionary will be passed to the constructor of this class as arguments. |
output_step | Dict[str,Any] | True | The parameters for the output step to be executed. Parameters must contain the "class_name" key, with a corresponding str value which is the module and class name of the class of the output step to run. The specified class must extend the loading.output_step.OutputStep abstract class. All other entries in the dictionary will be passed to the constructor of this class as arguments. |
aggregation_resolution | int | False | The h3 resolution level at which data will be aggregated. Mandatory if any aggregation steps are present. Ignored if no aggregation steps are present. |
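Putting these parameters together, a pipeline configuration might look roughly like the sketch below. All class names and their constructor arguments are hypothetical placeholders used only to show the structure; the configurations under examples/loading/loading_pipeline show the real classes and parameters.
reading_step:
  class_name: loading.reading_step.ExampleCSVReadingStep             # hypothetical reading step class
  file_path: ./data/example_points.csv                               # passed to the class constructor
preprocessing_steps:
  - class_name: loading.preprocessing_step.ExampleRegionFilterStep   # hypothetical preprocessing step
    shapefile: ./data/shapefiles/WORLD/world-administrative-boundaries.shp
aggregation_resolution: 7                                            # required because aggregation steps are present
aggregation_steps:
  - class_name: loading.aggregation_step.ExampleMeanAggregationStep  # hypothetical aggregation step
postprocessing_step:
  - class_name: loading.postprocessing_step.ExampleScaleStep         # hypothetical postprocessing step
    factor: 2
output_step:
  class_name: loading.output_step.ExampleDuckDBOutputStep            # hypothetical output step class
  database_dir: ./tmp
  dataset_name: pipeline_example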
The shapefiles mentioned here are the same as in previous examples. If you have already acquired them, you can skip this step.
Shapefiles are files that define a geographic region. They are used in this example to ensure that processing only happens within a target region. In order to run the below examples, shapefiles will need to be downloaded from the following link:
Shapefiles source:
Retrieved from parent site: https://public.opendatasoft.com/explore/dataset/world-administrative-boundaries/export/
- retrieved as a dataset from the "Geographic file formats" section, "Shapefile" element, by clicking the "Whole dataset" link
Create the data/shapefiles/WORLD directory as below (if it does not already exist):
mkdir -p ./data/shapefiles/WORLD
Unzip the world-administrative-boundaries.zip file into the data/shapefiles/WORLD directory. This should result in a directory structure that looks like the below:
data
|-- shapefiles
    |-- WORLD
        |-- world-administrative-boundaries.prj
        |-- world-administrative-boundaries.cpg
        |-- world-administrative-boundaries.dbf
        |-- world-administrative-boundaries.shp
        |-- world-administrative-boundaries.shx
This loading pipeline contains only a reading and an output step, demonstrating the smallest and simplest pipeline possible. It will load a dataset that consists of 6 data points, and put this dataset into a database with no changes.
CONFIG_PATH="./examples/loading/loading_pipeline/minimal_pipeline.yml" ;
python ./src/cli/cli_load.py load-pipeline \
--config_path $CONFIG_PATH
This loading example contains every available type of step. It uses the same dataset as the minimal pipeline and will a) filter for points located in Cuba, b) aggregate them down to a single cell, and c) multiply the values by 2.
CONFIG_PATH="./examples/loading/loading_pipeline/all_steps.yml" ;
python ./src/cli/cli_load.py load-pipeline \
--config_path $CONFIG_PATH