AD.model.* (csv | jsonld): this is the current, "live" version of the AD Portal data model. It is being used by both the staging and production versions of the multitenant Data Curator App.
Do not edit AD.model.csv or AD.model.jsonld by hand!
The main branch of this repo is protected, so you cannot push changes to main. To make changes to the data model:
- Create a new branch in this repo and give it an informative name. The schema-convert workflow will not work from a private fork.
- On that branch, make and commit any changes. You can do this by cloning the repo locally or by using a Github codespace. Please write informative commit messages in case we need to track down data model inconsistencies or introduced bugs.
- Open a pull request and request review from someone else on the AD DCC team. The Github Action described in Automation will run as soon as you open the PR. If this action fails, something about the data model csv could not be converted to a json-ld and should be investigated. If this action passes, the PR can be merged with one approving review.
- After the PR is merged, delete your branch.
The full AD.model.csv file has over 1400 attributes and is unwieldy to edit and hard to review changes for. For ease of editing, the full data model is divided into "module" subfolders, like so:
data-models/
├── AD.model.csv (do not edit!)
├── AD.model.jsonld (do not edit!)
└── modules/
    ├── biospecimen/
    │   ├── specimenID.csv
    │   ├── organ.csv
    │   └── tissue.csv
    └── sequencing/
        ├── readLength.csv
        └── platform.csv
Within each module, every attribute in the data model where Parent = DataProperty has its own csv, named after that attribute (example: organ.csv). Any valid values of the attribute "organ" have Parent = organ and are listed as rows in the file organ.csv. Attributes with Parent = DataProperty are used as columns in metadata and annotation manifest templates. Attributes with Parent = DataType describe the templates themselves. At this time, any other value for Parent means the attribute is a valid value of some other enumerated attribute.
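The Parent rules above can be sketched as a small lookup. This is only an illustration: the function name `classify_attribute` and the sample rows are hypothetical, and the excerpt keeps just the Attribute and Parent columns.

```python
import csv
import io


def classify_attribute(parent: str) -> str:
    """Map a row's Parent value to the role described in the text."""
    parent = parent.strip()
    if parent == "DataProperty":
        return "column"      # used as a column in manifest templates
    if parent == "DataType":
        return "template"    # describes a template itself
    # At this time, anything else is a valid value of another attribute.
    return f"valid value of {parent}"


# Hypothetical excerpt of a module csv, trimmed to two columns.
SAMPLE = """Attribute,Parent
organ,DataProperty
brain,organ
liver,organ
"""

roles = {
    row["Attribute"]: classify_attribute(row["Parent"])
    for row in csv.DictReader(io.StringIO(SAMPLE))
}
for attribute, role in roles.items():
    print(f"{attribute}: {role}")
```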
Some common data model editing scenarios are:
- If you wanted to add a new valid value "eyeball" to our existing column attribute "organ", after making a new branch and opening the repo either locally or within a codespace, you would go to modules/biospecimen/organ.csv.
  - Then, create a new row for an attribute named "eyeball", with a description and source (preferably an ontology URI). In the Parent column, make sure the value is "organ".
  - Next, find the row for the attribute "organ" (should be the first row), and within the Valid Values column, add "eyeball" to the comma-separated list of valid values.
  - Save your changes and write an informative commit. Please try to add valid values alphabetically!
- If you wanted to add the column "furColor" to the "model-ad_individual_animal_metadata" template, first decide which module the new column should belong to. In this case, "MODEL-AD" makes the most sense.
  - Within the MODEL-AD subfolder, create a new csv called furColor.csv with the required schematic column headers. Describe the attribute "furColor" as necessary and make sure Parent = DataProperty. Add any valid values for "furColor" as new rows to this csv as described in the previous scenario.
  - Find the manifest template attributes in modules/template/templates.csv. In the "model-ad_individual_animal_metadata" row, add your new column "furColor" to the comma-separated list of attributes in the DependsOn column.
  - Save your changes and write an informative commit.
For more advanced data modeling scenarios like adding conditional logic, creating validation rules, or creating new manifests, please consult the #ad-dcc-team slack channel.
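For anyone who prefers to script the two common scenarios rather than edit by hand, here is a minimal sketch using only the standard library. The helper names, the sample rows, the exact set of column headers, and the ", " separator are assumptions for illustration; check them against the real module csvs before relying on this.

```python
import csv
import io


def add_valid_value(rows, attribute, new_value, description="", source=""):
    """Add new_value as a valid value of attribute, keeping the list alphabetical."""
    parent_row = next(r for r in rows if r["Attribute"] == attribute)
    values = [v.strip() for v in parent_row["Valid Values"].split(",") if v.strip()]
    if new_value not in values:
        values.append(new_value)
    parent_row["Valid Values"] = ", ".join(sorted(values, key=str.lower))
    if not any(r["Attribute"] == new_value for r in rows):
        new_row = dict.fromkeys(parent_row, "")  # same columns, all blank
        new_row.update(Attribute=new_value, Description=description,
                       Parent=attribute, Source=source)
        rows.append(new_row)
    return rows


def add_template_column(rows, template, column):
    """Append column to the DependsOn list of the given template row."""
    template_row = next(r for r in rows if r["Attribute"] == template)
    columns = [c.strip() for c in template_row["DependsOn"].split(",") if c.strip()]
    if column not in columns:
        columns.append(column)
    template_row["DependsOn"] = ", ".join(columns)
    return rows


# Hypothetical, heavily trimmed stand-ins for organ.csv and templates.csv.
ORGAN_CSV = '''Attribute,Description,Valid Values,Parent,Source
organ,Organ the specimen came from,"brain, liver",DataProperty,
brain,,,organ,
liver,,,organ,
'''
TEMPLATES_CSV = '''Attribute,DependsOn,Parent
model-ad_individual_animal_metadata,"individualID, species",DataType
'''

organ_rows = list(csv.DictReader(io.StringIO(ORGAN_CSV)))
# Source is left blank here; in practice supply an ontology URI if one exists.
add_valid_value(organ_rows, "organ", "eyeball", description="placeholder description")

template_rows = list(csv.DictReader(io.StringIO(TEMPLATES_CSV)))
add_template_column(template_rows, "model-ad_individual_animal_metadata", "furColor")

print(organ_rows[0]["Valid Values"])
print(template_rows[0]["DependsOn"])
```

Both helpers are idempotent, so running them twice does not duplicate a value or a row.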
A persistent issue is that manually editing csvs is challenging. Some columns in our modules are very short, and others are veeeeery long (Description, Valid Values). Some options for working on csvs, and their pros and cons:
- Editing in the Github UI : convenient, but challenging to keep track of columns in plain text format.
- Cloning the repo, making a branch, and opening csvs locally in Excel or another spreadsheet program 🖥️ : probably the best UI experience, but involves a few extra steps with git.
- Using a Github codespace to launch VSCode in the browser, and editing with the pre-installed RainbowCSV extension 🌈 : Still difficult to edit csvs as plain text, but the color formatting and ability to use a soft word wrap makes it much easier to distinguish columns. RainbowCSV lets you designate "sticky" rows and columns for easier scrolling, and also has a nice "CSVLint" function that will check for formatting errors after you make changes.
We are exploring better solutions to this problem -- if you have ideas, tell us!
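Whichever editor you use, a quick column-count check can catch the most common plain-text slip: an unquoted comma that shifts a row's columns. The sketch below is a hypothetical helper, not part of this repo and not a replacement for RainbowCSV's CSVLint; it only compares each row's width to the header's.

```python
import csv
import io


def find_ragged_rows(text):
    """Return (line_number, field_count) for rows whose width differs from the header's.

    line_number is the line on which the row ends, as reported by csv.reader.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != len(header)
    ]


# Hypothetical rows: the second sample forgets to quote the Valid Values list.
GOOD = 'Attribute,Description,Valid Values,Parent\norgan,Organ of origin,"brain, liver",DataProperty\n'
BAD = 'Attribute,Description,Valid Values,Parent\norgan,Organ of origin,brain, liver,DataProperty\n'

print(find_ragged_rows(GOOD))
print(find_ragged_rows(BAD))
```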
When you open a PR that includes any changes to files in the modules/ directory, a Github Action will automatically run before merging is allowed. This action:
- Runs the assemble_csv_data_model.py script to concatenate the modular attribute csvs into one data frame, sort alphabetically by Parent and then Attribute, and write the combined dataframe to AD.model.csv. The action then commits the changes to the master data model csv.
- Installs schematic from the develop branch and runs schema convert on the newly-concatenated data model csv to generate a new version of the jsonld file AD.model.jsonld. The action also commits the changes to the jsonld.
If this automated workflow fails, then the data model may be invalid and further investigation is needed.
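To make the concatenate-and-sort step concrete, here is a rough sketch of the idea. It is not the actual assemble_csv_data_model.py script: it uses the standard library instead of a data frame, the function name `assemble_model` is hypothetical, and it assumes a plain case-sensitive string sort on Parent and then Attribute.

```python
import csv
import tempfile
from pathlib import Path


def assemble_model(modules_dir: Path, output_csv: Path) -> int:
    """Concatenate every modules/<module>/<attribute>.csv, sort, and write one csv."""
    rows, fieldnames = [], []
    for module_csv in sorted(modules_dir.glob("*/*.csv")):
        with module_csv.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)  # union of headers, first-seen order
            rows.extend(reader)
    # Sort by Parent, then Attribute (plain string comparison).
    rows.sort(key=lambda r: (r.get("Parent") or "", r.get("Attribute") or ""))
    with output_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo on a throwaway directory with two hypothetical, trimmed module files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "modules" / "biospecimen").mkdir(parents=True)
    (root / "modules" / "sequencing").mkdir(parents=True)
    (root / "modules" / "biospecimen" / "organ.csv").write_text(
        "Attribute,Parent\norgan,DataProperty\nliver,organ\nbrain,organ\n",
        encoding="utf-8",
    )
    (root / "modules" / "sequencing" / "platform.csv").write_text(
        "Attribute,Parent\nplatform,DataProperty\n", encoding="utf-8"
    )
    count = assemble_model(root / "modules", root / "AD.model.csv")
    assembled = (root / "AD.model.csv").read_text(encoding="utf-8").splitlines()

print(count)
print("\n".join(assembled))
```

Running something like this locally before opening a PR gives a preview of how module edits will land in the combined file, but the Github Action remains the source of truth.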
If you want to make changes to the data model and test them out by generating manifests with schematic, you can use the devcontainer in this repo with a Github Codespace. This will open a container in a remote instance of VSCode and install the latest version of schematic.
Codespace secrets:
- SYNAPSE_PAT: scoped to view and download permissions on the sysbio-dcc-tasks-01 Synapse service account
- SERVICE_ACCOUNT_CREDS: these are creds for using the Google sheets api with schematic
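Before running schematic in a codespace, it can help to confirm that both secrets are actually available. The sketch below assumes the secrets are exposed to the container as environment variables with the names listed above; the helper `missing_secrets` is hypothetical and not part of this repo.

```python
import os

REQUIRED_SECRETS = ("SYNAPSE_PAT", "SERVICE_ACCOUNT_CREDS")


def missing_secrets(environ=os.environ):
    """Return the names of required secrets that are unset or empty."""
    return [name for name in REQUIRED_SECRETS if not environ.get(name)]


missing = missing_secrets()
if missing:
    print("Missing Codespace secrets:", ", ".join(missing))
else:
    print("All expected Codespace secrets are set.")
```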
Previous versions of the data model live in the legacy-data-models/ folder. These include the Diverse Cohorts pilot model and the initial "legacy" model representing the AD Portal Synapse project metadata dictionary and metadata templates from August 2023. These are not being used by DCA.