From 685344378f59e136b746c771c0ebd0e48a1a9697 Mon Sep 17 00:00:00 2001
From: Bo Xu
Date: Tue, 21 Nov 2023 17:58:50 +0000
Subject: [PATCH] Update documentation site to refer to the RSI approach for
 custom DC (#378)

---
 custom_dc/customize_ui.md | 104 ------------------
 custom_dc/index.md        |  13 ++-
 custom_dc/prepare_data.md | 122 ---------------------
 custom_dc/setup_gcp.md    |  94 ----------------
 custom_dc/upload_data.md  | 218 --------------------------------------
 5 files changed, 10 insertions(+), 541 deletions(-)
 delete mode 100644 custom_dc/customize_ui.md
 delete mode 100644 custom_dc/prepare_data.md
 delete mode 100644 custom_dc/setup_gcp.md
 delete mode 100644 custom_dc/upload_data.md

diff --git a/custom_dc/customize_ui.md b/custom_dc/customize_ui.md
deleted file mode 100644
index ef3226a13..000000000
--- a/custom_dc/customize_ui.md
+++ /dev/null
@@ -1,104 +0,0 @@
----
-layout: default
-title: Customize UI
-nav_order: 4
-parent: Custom Data Commons
-published: true
----
-
-## Overview
-
-Custom Data Commons allows customization of the web pages on top of
-[datacommons.org](https://datacommons.org). The customization includes the
-overall color scheme, the home page content, and the landing pages of the
-timeline/scatter/map tools.
-
-## Environment Setup
-
-Fork the [datacommonsorg/website](https://github.com/datacommonsorg/website)
-GitHub repo following [these
-instructions](https://github.com/datacommonsorg/website#github-workflow) into a
-new repo, which will be used as the custom Data Commons codebase. Custom Data
-Commons development and deployment will be based on this forked repo.
-
-To run the website in a local environment (Mac, Linux), follow this
-[guide](https://github.com/datacommonsorg/website/blob/master/docs/developer_guide.md#local-development-with-flask).
-Use the `-e custom` flag when starting the local Flask server:
-
-```bash
-./run_server.sh -e custom
-```
-
-## Update UI Code
-
-### Update Header, Footer and Page Content
-
-The page header and footer can be customized in
-[base.html](https://github.com/datacommonsorg/website/blob/master/server/templates/custom_dc/custom/base.html)
-by updating the corresponding HTML elements in that template.
-
-The homepage can be customized in
-[homepage.html](https://github.com/datacommonsorg/website/blob/master/server/templates/custom_dc/custom/homepage.html).
-
-### Update CSS and Javascript
-
-Custom Data Commons provides an
-[overrides.css](https://github.com/datacommonsorg/website/tree/master/static/custom_dc/custom/overrides.css)
-to override CSS styles. It contains a default color override; more style
-changes can be added in that file.
-
-If there are already existing CSS and Javascript files, put them under the
-[/static/custom_dc/custom](https://github.com/datacommonsorg/website/tree/master/static/custom_dc/custom)
-folder. Then include these files in the `<head>` section of the corresponding
-html files as
-
-```html
-<link href="/custom_dc/custom/<file_name>.css" rel="stylesheet" />
-```
-
-or
-
-```html
-<script src="/custom_dc/custom/<file_name>.js"></script>
-```
-
-## Deploy to GCP
-
-### One Time Setup
-
-- Install the following tools:
-
-  - [`gcloud`](https://cloud.google.com/sdk/docs/install)
-  - [`kubectl`](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
-  - [`kustomize`](https://kustomize.io/)
-  - [`yq` 4.x](https://github.com/mikefarah/yq#install)
-
-- Install gke-gcloud-auth-plugin:
-
-  - `gcloud components install gke-gcloud-auth-plugin`
-
-### Deploy Local Change
-
-After testing locally, follow the instructions below to deploy to GCP.
-`project_id` refers to the GCP project where custom Data Commons is installed.
-
-- Git commit all local changes (no need to push to the GitHub repo). Later
-  steps will build a docker image based on the hash of this commit.
-
-- Run the following command to build and push docker images to the Container
-  Registry:
-
-  ```bash
-  ./scripts/push_image.sh
-  ```
-
-  Follow the link from the log to check the status until the push is complete.
-
-- Deploy the website to GKE:
-
-  ```bash
-  ./scripts/deploy_gke.sh -p <project_id>
-  ```
-
-  Check the deployment from the GKE console. Once it is done, check the UI
-  changes on the website.
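[Editorial note on the deleted customize_ui.md content above] Taken together, the local-change deploy steps reduce to a short command sequence. A hedged sketch, using only the `push_image.sh` and `deploy_gke.sh` scripts named in the doc, with `<project_id>` as a placeholder for your GCP project:

```shell
# Sketch of the deploy flow described above; requires an authenticated
# gcloud session and a checkout of the forked website repo.
git commit -am "Customize UI"              # the docker image tag derives from this commit hash
./scripts/push_image.sh                    # build and push docker images to the Container Registry
./scripts/deploy_gke.sh -p <project_id>   # roll the new images out to GKE
```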
diff --git a/custom_dc/index.md b/custom_dc/index.md
index ad9b8bca5..f1fc7282c 100644
--- a/custom_dc/index.md
+++ b/custom_dc/index.md
@@ -2,7 +2,7 @@
 layout: default
 title: Custom Data Commons
 nav_order: 90
-has_children: true
+has_children: false
 ---
 
 ## Overview
@@ -21,9 +21,16 @@ full control over data, computing resources and access control. It can be
 accessible by the general public or can be access controlled to limited
 principals.
 
+## System Setup and Custom Data Import
+
+See this
+[documentation](https://github.com/datacommonsorg/website/blob/master/custom_dc/README.md),
+which covers the system diagram, data storage options, deployment instructions,
+private data preparation and UI customization.
+
 ## Case Study
 
-#### Feeding America Data Commons
+### Feeding America Data Commons
 [Feeding America Data Commons](https://datacommons.feedingamerica.org/) provides
 access to data from [Map the Meal Gap](https://map.feedingamerica.org/),
 overlaid with data from a wide range of additional sources into a single
@@ -32,7 +39,7 @@ heart health and food insecurity can be retrieved with a few clicks.
 
 ![fa](/assets/images/custom_dc/home-heart-food.png){: height="450" }
 
-#### India Data Commons
+### India Data Commons
 [India Data Commons](https://datacommons.iitm.ac.in/) is an effort by Robert
 Bosch Center for Data Science and Artificial Intelligence, IIT Madras to
 highlight India-specific data in Data
diff --git a/custom_dc/prepare_data.md b/custom_dc/prepare_data.md
deleted file mode 100644
index 1a051554c..000000000
--- a/custom_dc/prepare_data.md
+++ /dev/null
@@ -1,122 +0,0 @@
----
-layout: default
-title: Prepare Data
-nav_order: 2
-parent: Custom Data Commons
-published: true
----
-
-## Overview
-
-Preparing data involves cleaning and formatting the raw data into compatible
-CSV files. Each CSV file is expected to have columns corresponding to the
-Values (numeric) about a Variable, Place and Date.
The format of a CSV file is
-specified by a [Template
-MCF](https://github.com/datacommonsorg/data/blob/master/docs/mcf_format.md#template-mcf).
-The ready-to-use artifacts consist of one TMCF file (.tmcf) and a few
-compatible CSV files (.csv).
-
-## File Format
-
-### General Format
-
-In the table shown below, there are separate columns for Variable (Variable),
-Place (Country), Date (Year) and Value (Value), and each row of the CSV
-corresponds to one observation of the Variable about a Place at the specified
-Date.
-
-| Year | Country | Variable        | Value       | Extra Column [Optional] |
-| ---- | ------- | --------------- | ----------- | ----------------------- |
-| 2017 | UK      | Life_Expectancy | 81.25609756 | 1                       |
-| 2017 | UK      | Population      | 65844142    | 2                       |
-
-The TMCF for this CSV looks like:
-
-```txt
-Node: E:data->E0
-typeOf: dcs:StatVarObservation
-observationAbout: C:data->Country
-observationDate: C:data->Year
-variableMeasured: C:data->Variable
-value: C:data->Value
-```
-
-Note: If all observations in the CSV are about the same Date, the Date does not
-need to be specified as a column; it can be given as a constant. The same
-applies to Variable and Place. For the example above, if the CSV has data only
-for 2017, then the CSV and TMCF look like:
-
-| Country | Variable        | Value    | Extra Column [Optional] |
-| ------- | --------------- | -------- | ----------------------- |
-| UK      | Life_Expectancy | 81.2     | 1                       |
-| UK      | Population      | 65844142 | 2                       |
-
-```txt
-Node: E:data->E0
-typeOf: dcs:StatVarObservation
-observationAbout: C:data->Country
-observationDate: 2017
-variableMeasured: C:data->Variable
-value: C:data->Value
-```
-
-### Date as Column Header
-
-It is possible to specify Dates as column headers.
-
-| Country | Variable        | 2017     | 2018     |
-| ------- | --------------- | -------- | -------- |
-| UK      | Life_Expectancy | 81.2     | 81.3     |
-| KR      | Population      | 51361911 | 51606633 |
-
-```txt
-Node: E:data->E0
-typeOf: dcs:StatVarObservation
-observationAbout: C:data->Country
-observationDate: 2017
-variableMeasured: C:data->Variable
-value: C:data->2017
-
-Node: E:data->E1
-typeOf: dcs:StatVarObservation
-observationAbout: C:data->Country
-observationDate: 2018
-variableMeasured: C:data->Variable
-value: C:data->2018
-```
-
-### Variable as Column Header
-
-It is possible to specify Variables as column headers.
-
-| Year | Country | Life_Expectancy | Population |
-| ---- | ------- | --------------- | ---------- |
-| 2017 | UK      | 81.2            | 65844142   |
-| 2018 | KR      | 82              | 51361911   |
-
-```txt
-Node: E:data->E0
-typeOf: dcs:StatVarObservation
-observationAbout: C:data->Country
-observationDate: C:data->Year
-variableMeasured: Life_Expectancy
-value: C:data->Life_Expectancy
-
-Node: E:data->E1
-typeOf: dcs:StatVarObservation
-observationAbout: C:data->Country
-observationDate: C:data->Year
-variableMeasured: Population
-value: C:data->Population
-```
-
-### Date and Place Formats
-
-Please check the [Supported Date and Place
-Formats](https://datacommons.org/import/#supported-formats).
-
-## Testing Data
-
-Before uploading the data to the custom instance, run the [Import
-Checker](https://github.com/datacommonsorg/import#using-import-tool) to make
-sure there are no formatting or other issues.
diff --git a/custom_dc/setup_gcp.md b/custom_dc/setup_gcp.md
deleted file mode 100644
index 842c34b81..000000000
--- a/custom_dc/setup_gcp.md
+++ /dev/null
@@ -1,94 +0,0 @@
----
-layout: default
-title: System Setup
-nav_order: 1
-parent: Custom Data Commons
-published: true
----
-
-## Overview
-
-Custom Data Commons is deployed on Google Cloud Platform (GCP). This manual
-describes how to install a custom Data Commons instance in an existing GCP
-project (with id `PROJECT_ID`).
-
-### Steps
-
-1. From [Google Cloud Console](https://console.cloud.google.com/), open Cloud
-   Shell by clicking the icon shown below:
-
-   ![fa](/assets/images/custom_dc/install_step_1.png){: width="600" }
-
-1. Set the environment variables `PROJECT_ID` and `CONTACT_EMAIL` in the
-   terminal:
-
-   ```bash
-   export PROJECT_ID=<project_id>
-   export CONTACT_EMAIL=<contact_email>
-   ```
-
-   ![fa](/assets/images/custom_dc/install_step_2.png){: width="600" }
-
-   Note: If this step fails, please [contact us via this form](https://docs.google.com/forms/d/e/1FAIpQLSeVCR95YOZ56ABsPwdH1tPAjjIeVDtisLF-8oDYlOxYmNZ7LQ/viewform) with the errors.
-
-1. [Optional] The default domain of the instance is
-   `<project_id>-datacommons.com`. If you want to use an existing custom
-   domain, set the environment variable:
-
-   ```bash
-   export CUSTOM_DC_DOMAIN=<custom_domain>
-   ```
-
-   Later, you will need to create a DNS record with your domain provider that
-   links the domain to the IP address allocated in the GCP project.
-
-1. Run the following installation command in the terminal. This may take up to
-   20 minutes to complete.
-
-   ```bash
-   curl -fsSL https://raw.githubusercontent.com/datacommonsorg/website/custom-dc-v0.3.2/scripts/install_custom_dc.sh -o install_custom_dc.sh && \
-   chmod u+x install_custom_dc.sh && \
-   ./install_custom_dc.sh
-   ```
-
-1. Please [fill out this form](https://docs.google.com/forms/d/e/1FAIpQLSeVCR95YOZ56ABsPwdH1tPAjjIeVDtisLF-8oDYlOxYmNZ7LQ/viewform) to get an API key
-   for data access. Store the API key in [Cloud Secret Manager](https://console.cloud.google.com/security/secret-manager) with the name `mixer-api-key`.
-
-1. [Optional] Get a Google Maps API key
-   ([instructions](https://developers.google.com/maps/documentation/javascript/get-api-key)).
-   Store the API key in [Cloud Secret
-   Manager](https://console.cloud.google.com/security/secret-manager) with the
-   name `maps-api-key`. This is used for place search in the visualization
-   tools.
-
-1.
You should receive an email from Google Domains containing the section
-   pictured below. Please click "Verify email now".
-
-   ![fa](/assets/images/custom_dc/install_step_3.png){: width="400" }
-
-   Note: You may not get the verification email if you have verified Cloud
-   Domains or Google Domains in the past. If you do not get the verification
-   email within 10 minutes, check the GCP UI to see if the Cloud Domain is
-   active. If it is active, then please skip this step. Below is what an
-   active Cloud Domain looks like.
-
-   ![fa](/assets/images/custom_dc/install_step_4.png){: width="400" }
-
-1. Deploy a default Data Commons instance:
-
-   First, clone the GitHub repo:
-
-   ```bash
-   git clone https://github.com/datacommonsorg/website.git
-   ```
-
-   Then update the `project` field in
-   [custom_dc_template.yaml](https://github.com/datacommonsorg/website/blob/master/deploy/helm_charts/envs/custom_dc_template.yaml)
-   with the actual GCP project ID.
-
-   Deploy to GKE by running:
-
-   ```bash
-   ./scripts/deploy_gke_helm.sh -e custom_dc_template -l us-central1-a
-   ```
-
-   Go to the [GCP console](https://console.cloud.google.com/kubernetes/workload/overview) to make sure the pods are running successfully.
diff --git a/custom_dc/upload_data.md b/custom_dc/upload_data.md
deleted file mode 100644
index 755747d96..000000000
--- a/custom_dc/upload_data.md
+++ /dev/null
@@ -1,218 +0,0 @@
----
-layout: default
-title: Upload Data
-nav_order: 3
-parent: Custom Data Commons
-published: true
----
-
-## Overview
-
-Schema files (MCF), data files (CSV) and data specification files (TMCF) are
-stored in Google Cloud Storage (GCS) in the custom Data Commons GCP project.
-These files must follow the expected layout so the data can be processed and
-displayed correctly. A few terms are worth understanding before looking at
-the layout.
-
-### Data Source
-
-Data source refers to a data agency such as "Census" or "World Bank".
-
-### Dataset
-
-Dataset does not have a standard definition.
The granularity of a
-dataset varies depending on the source. For example, one dataset can contain
-public parks information for all the states in the USA if they are published
-together. Or, if each state publishes this information individually, then
-there are multiple datasets for this topic.
-
-### Import
-
-Import is the smallest unit of data upload in Data Commons. It usually (but
-not necessarily) corresponds to a dataset.
-
-### Import Group
-
-A group of related imports that have similar topics. This is also the unit of
-raw data processing.
-
-### Table
-
-Table corresponds to one TMCF file and a set of CSV files that have the same
-shape. One import can have one or multiple tables.
-
-## Example Layout
-
-Consider the following two datasets:
-
-1. State-level public park general information in 50 CSV files (collected by
-   each state in a different format, with a total size of 5 GB).
-2. State-level public park expenditure with one CSV file per year (collected
-   by an agency, with a total size of 5 MB).
-
-They can be arranged in multiple ways.
-
-### Single Import Group
-
-Since these data are all about public parks, they can be put under one import
-group, with two imports:
-
-- general info import
-
-  - With one schema file describing public park properties
-  - With 50 tables, one for each state
-  - Each table has one TMCF and one CSV file
-
-- expenditure import
-
-  - With one schema file describing expenditure
-  - With one table containing one TMCF and multiple CSV files
-
-### Multiple Import Groups
-
-If the two datasets are managed by different departments and are updated at
-different frequencies, they can each be an import group. This way, when the
-expenditure data is updated, only its data is processed and the larger general
-information import is untouched.
-
-## Storage Layout
-
-All custom Data Commons data are stored under one GCS folder. A typical layout
-is shown below.
-
-Note: create a root folder under the desired GCS bucket; it will be used to
-hold all the data.
-
-```txt
-<root_folder>
-├── import_group1/
-│   ├── data/
-│   │   ├── import1/
-│   │   │   ├── table1/
-│   │   │   │   ├── bar.tmcf
-│   │   │   │   ├── bar1.csv
-│   │   │   │   └── bar2.csv
-│   │   │   ├── table2/
-│   │   │   │   ├── foo.tmcf
-│   │   │   │   └── foo.csv
-│   │   │   ├── schema.mcf
-│   │   │   └── provenance.json
-│   │   └── import2/
-│   │       ├── table1/
-│   │       │   ├── baz.tmcf
-│   │       │   └── baz.csv
-│   │       ├── schema.mcf
-│   │       └── provenance.json
-│   ├── internal/
-│   └── provenance.json
-└── import_group2/
-    ├── data/
-    └── internal/
-```
-
-Raw data should be uploaded under
-`<root_folder>/<import_group>/data/<import>/`. Each `table` folder can contain
-only one TMCF file, and all the CSV files in it must have the same format.
-
-Note: the `internal/` folder holds computed data and config files and should
-not be touched.
-
-The data source and other meta info can be specified in a `provenance.json`
-file with the following fields:
-
-```json
-{
-  "name": "Name of the source (dataset)",
-  "url": "Url of the source (dataset)"
-}
-```
-
-`provenance.json` can be at the import group level or the import level,
-usually indicating data source and dataset provenance respectively.
-
-## Add a Custom Variable Hierarchy
-
-When using a custom DC instance with new statistical variables, it can be
-useful to define a custom hierarchy for the variables. The hierarchy is used
-in the Explorer tools to navigate the variables in a structured manner. For
-example, a sample custom hierarchy with two layers of groups and three
-variables is provided below.
-
-```txt
-.
-└── Example Root Node/
-    ├── Group 1A/
-    │   ├── Variable X
-    │   └── Group 2A/
-    │       └── Variable Y
-    └── Group 1B/
-        └── Variable Z
-```
-
-To auto-generate the nodes in a custom hierarchy from a spec like the above,
-use [this
-notebook](https://colab.sandbox.google.com/github/datacommonsorg/api-python/blob/master/notebooks/Custom_Hierarchy_Generator.ipynb).
-Alternatively, to hand-write the MCF nodes involved, please read on.
-
-To define the hierarchy, each group needs a `StatVarGroup` definition.
The
-`StatVarGroup` nodes are linked to each other and to a custom root node via
-`specializationOf` properties. The example below can be used as a template;
-please replace all {} with custom identifiers:
-
-    Node: dcid:dc/g/Custom_Root
-    typeOf: dcs:StatVarGroup
-    specializationOf: dcid:dc/g/Root
-    name: "{Example Root Node}"
-
-    Node: dcid:dc/g/Custom_{1A}
-    typeOf: dcs:StatVarGroup
-    name: "{Group 1A}"
-    specializationOf: dcid:dc/g/Custom_Root
-    displayRank: {1}
-
-    Node: dcid:dc/g/Custom_{1B}
-    typeOf: dcs:StatVarGroup
-    name: "{Group 1B}"
-    specializationOf: dcid:dc/g/Custom_Root
-    displayRank: {2}
-
-    Node: dcid:dc/g/Custom_{2A}
-    typeOf: dcs:StatVarGroup
-    name: "{Group 2A}"
-    specializationOf: dcid:dc/g/Custom_{1A}
-    displayRank: {1}
-
-Next, each new variable needs a `StatisticalVariable` node definition, which
-specifies which group in the hierarchy it belongs to. The example below can be
-used as a template; please replace all {} with custom identifiers:
-
-    Node: dcid:{Variable_X}
-    name: "{Variable X}"
-    typeOf: dcs:StatisticalVariable
-    populationType: dcs:Thing
-    measuredProperty: dcs:{Variable_X}
-    statType: dcs:measuredValue
-    memberOf: dcid:dc/g/Custom_{1A}
-
-    Node: dcid:{Variable_Y}
-    name: "{Variable Y}"
-    typeOf: dcs:StatisticalVariable
-    populationType: dcs:Thing
-    measuredProperty: dcs:{Variable_Y}
-    statType: dcs:measuredValue
-    memberOf: dcid:dc/g/Custom_{2A}
-
-    Node: dcid:{Variable_Z}
-    name: "{Variable Z}"
-    typeOf: dcs:StatisticalVariable
-    populationType: dcs:Thing
-    measuredProperty: dcs:{Variable_Z}
-    statType: dcs:measuredValue
-    memberOf: dcid:dc/g/Custom_{1B}
-
-The `StatVarGroup` and `StatisticalVariable` nodes that make up the hierarchy
-can be included in a `.mcf` file and added to the GCS bucket associated with
-the custom DC instance.
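[Editorial note on the deleted upload_data.md content above] For illustration, template nodes like the ones above can also be generated with a small shell function rather than written by hand. A minimal sketch (a hypothetical helper, not part of the official Data Commons tooling) that renders `StatVarGroup` nodes in MCF:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit a StatVarGroup MCF node from arguments.
# Not part of the official Data Commons tooling.
stat_var_group() {  # args: dcid_suffix name parent_dcid display_rank
  printf 'Node: dcid:dc/g/Custom_%s\n' "$1"
  printf 'typeOf: dcs:StatVarGroup\n'
  printf 'name: "%s"\n' "$2"
  printf 'specializationOf: dcid:%s\n' "$3"
  printf 'displayRank: %s\n\n' "$4"
}

# Write two of the example groups to an .mcf file.
stat_var_group "1A" "Group 1A" "dc/g/Custom_Root" 1 >  custom_hierarchy.mcf
stat_var_group "2A" "Group 2A" "dc/g/Custom_1A"   1 >> custom_hierarchy.mcf
```

The resulting `custom_hierarchy.mcf` can then be uploaded to the GCS bucket as described above; a `statistical_variable` helper could be written the same way.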