A data validation tool.
The abacus repository includes scripts and tools that facilitate various forms of validation between datasets and their data dictionaries (data expectations).
- Run me:
    pip install git+https://github.com/NIH-NCPI/abacus.git
- Commands: see "Run a command/action" below.
- Create and activate a virtual environment (SKIP if installing as a package):
  If you want to run the scripts locally, it is recommended that you use a virtual environment to keep the imports isolated. This can reduce future import issues.
  Here for more on virtual environments.
    # Step 1: cd into the directory where the venv should be stored
    # Step 2: run this code. It will create the virtual env named abacus_venv in the current directory.
    python3 -m venv abacus_venv
    # Step 3: run this code. It will activate the abacus_venv environment.
    source abacus_venv/bin/activate  # On Windows: abacus_venv\Scripts\activate
    # You are ready for installations!
    # If you want to deactivate the venv, run:
    deactivate
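To confirm that the activation step took effect, one option is a quick check from inside Python. This is a minimal sketch and not part of abacus; the function name `in_virtualenv` is illustrative. It relies on the standard-library behavior that `sys.prefix` differs from `sys.base_prefix` inside a venv.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this reports False after activation, the shell is probably still using a Python interpreter from outside abacus_venv.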
- Install the package and dependencies:
  - If you have the repo cloned and are attempting to run locally, run this command from the root of the repository.
    pip install git+https://github.com/NIH-NCPI/abacus.git
- Run a command/action
  - NOTE: If you have the repo cloned and are attempting to run locally, run these commands from abacus/src/abacus.
validate_csv
Runs cerberus validation on a data dictionary/dataset pair and returns the results of the validation in the terminal.
See the data expectations below.
    validate_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset; choose one (e.g. NA, na, null, ...)}
    # example
    validate_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA
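When the validation needs to run from a script rather than by hand, the command can be wrapped with the standard-library `subprocess` module. The sketch below is an assumption-laden illustration, not part of abacus: the helper names are hypothetical, only the `-dd`, `-dt`, and `-m` flags shown above are used, and it assumes abacus is installed so that `validate_csv` is on the PATH.

```python
import subprocess


def build_validate_csv_command(data_dictionary: str, dataset: str, missing: str) -> list[str]:
    """Assemble the validate_csv call from the documented flags."""
    return ["validate_csv", "-dd", data_dictionary, "-dt", dataset, "-m", missing]


def run_validate_csv(data_dictionary: str, dataset: str, missing: str = "NA") -> int:
    """Run validate_csv; results are printed to the terminal by the tool itself."""
    command = build_validate_csv_command(data_dictionary, dataset, missing)
    # check=False: return the exit code instead of raising on a non-zero status.
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    run_validate_csv("data/input/data_dictionary.csv", "data/input/dataset.csv", "NA")
```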
summarize_csv
Returns aggregates and attributes of the provided dataset, exported as a yaml file.
See the data expectations below.
    summarize_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset; choose one (e.g. NA, na, null, ...)} -e {export/filepath/summary.yaml}
    # example
    summarize_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA -e data/output/summary.yaml
validate_linkml
Runs linkml validation on a data dictionary/dataset pair and returns the results of the validation in the terminal. Run it from the directory that contains the data files (data dictionary, dataset, AND imports, i.e. the adjoining data dictionaries).
See the data expectations below.
    validate_linkml -dd {path/to/datadictionary.yaml} -dt {path/to/dataset.csv} -dc {data class - linkml tree_root}
    # example
    validate_linkml -dd data/input/assay.yaml -dt data/input/assay_data.yaml -dc Assay
Visit this link for more in-depth specs.
Datasets should be csv files, follow the format described by the data dictionary, and use one consistent notation for missing data [NULL, NA, etc.].
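A quick way to check the "consistent notation" expectation before running a validation is to scan the dataset for more than one missing-value token. The following is only a sketch using the standard library: the candidate token set and the function name are assumptions made for illustration, not something abacus defines.

```python
import csv
import io

# Hypothetical set of tokens that commonly stand for "missing".
CANDIDATE_MISSING = {"NA", "na", "N/A", "NULL", "null", "None", ""}


def missing_notations_used(csv_text: str, candidates=CANDIDATE_MISSING) -> set[str]:
    """Return every candidate missing-value token that occurs in the data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    found = set()
    for row in reader:
        for cell in row:
            if cell.strip() in candidates:
                found.add(cell.strip())
    return found


sample = "id,age,sex\n1,34,NA\n2,null,F\n3,41,M\n"
notations = missing_notations_used(sample)
print(sorted(notations))  # prints ['NA', 'null']
```

More than one token in the result, as in this sample, suggests the dataset mixes notations and should be cleaned so that a single value can be passed to `-m`.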
Data dictionaries should be a yaml file formatted for linkml and contain all dataset expectations for validation. Validation requires that all data dictionaries referenced in the imports section be present in the same file location. Imports beginning with linkml: can be ignored.
Example seen below.
    id: https://w3id.org/include/assay
    imports:
      - linkml:types
      - include_core
      - include_participant
      - include_study
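The import requirement can be checked ahead of time with a few lines of Python. This sketch is not part of abacus; it assumes each import name corresponds to a file called `<name>.yaml` in the same directory as the data dictionary, and the function name is hypothetical. The imports list is passed in directly so that no yaml parser is needed.

```python
from pathlib import Path


def missing_import_files(schema_dir, imports):
    """List imports that have no matching <name>.yaml file in schema_dir."""
    missing = []
    for name in imports:
        if name.startswith("linkml:"):
            continue  # imports beginning with linkml: can be ignored
        # Assumption: an import named include_core lives in include_core.yaml.
        if not (Path(schema_dir) / f"{name}.yaml").is_file():
            missing.append(name)
    return missing


imports = ["linkml:types", "include_core", "include_participant", "include_study"]
print(missing_import_files("data/input", imports))
```

An empty list means every referenced data dictionary was found next to the one being validated.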
Datasets should be a yaml, json, or csv file formatted for linkml, follow the format described by the data dictionary, and use one consistent notation for missing data [NULL, NA, etc.].
If the dataset is a csv, multivalue fields should have pipe separators.
See examples below.
    # Yaml file representation
    # Instances of Biospecimen class
    - studyCode: "Study1"
      participantGlobalId: "PID123"
    ...
    ...
    ...
    - studyCode: "Study1"
      participantGlobalId: "PID123"
CSV representation
    studyCode,studyTitle,program
    study_code,Study of Cancer,program1|program2
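To illustrate the pipe-separator convention, the sketch below reads the CSV sample above and splits a multivalue column into a list. It is a standard-library illustration only, not how abacus or linkml load data internally; the function name is hypothetical, and which columns count as multivalued is passed in by the caller as an assumption.

```python
import csv
import io


def split_multivalue(csv_text, multivalue_fields, separator="|"):
    """Parse CSV text and turn the named multivalue columns into lists."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in multivalue_fields:
            # An empty cell becomes an empty list rather than [""].
            row[field] = row[field].split(separator) if row[field] else []
        rows.append(row)
    return rows


sample = "studyCode,studyTitle,program\nstudy_code,Study of Cancer,program1|program2\n"
rows = split_multivalue(sample, ["program"])
print(rows[0]["program"])  # prints ['program1', 'program2']
```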
If working on a new feature, it is possible to install the package version from a remote or local branch. These commands should be run from the project root.
    # remote
    pip install git+https://github.com/NIH-NCPI/abacus.git@{branch_name}
    # local
    pip install -e .
    # handy troubleshooting commands when unsure of version
    pip install --upgrade abacus
    pip install --upgrade abacus==2.0.0
    pip uninstall abacus -y