Abacus

A data validation tool.

Overview

The abacus repository includes scripts and tools that facilitate various forms of validation between datasets and their data dictionaries (data expectations).

TLDR/Quick start:

  • Install: pip install git+https://github.com/NIH-NCPI/abacus.git
  • Commands: see the Commands section below

Installation

  1. Create and activate a virtual environment (SKIP if installing as a package):
    If you want to run the scripts locally, it is recommended that you use a virtual environment to keep the project's dependencies isolated. This helps avoid future import conflicts.
    See the Python documentation on virtual environments for more.

    # Step 1: cd into the directory to store the venv
    
    # Step 2: run this code. It will create the virtual env named abacus_venv in the current directory.
    python3 -m venv abacus_venv
    
    # Step 3: run this code. It will activate the abacus_venv environment
source abacus_venv/bin/activate # On Windows: abacus_venv\Scripts\activate
    
    # You are ready for installations! 
    # If you want to deactivate the venv run:
    deactivate
  2. Install the package and dependencies:

  • If you have the repo cloned and are attempting to run locally, run this command from the root of the repository.
    pip install git+https://github.com/NIH-NCPI/abacus.git
  3. Run a command/action

    Available actions:

    Commands

  • NOTE: If you have the repo cloned and are attempting to run locally, run these commands from abacus/src/abacus.

    validate_csv

    validate_csv runs Cerberus validation on a data dictionary/dataset pair and prints the validation results in the terminal.
    See the Data Expectations section below.

    validate_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset, e.g. NA, na, null}
    
    # example
    validate_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA 
    
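    To illustrate the kind of check this performs, here is a minimal sketch of Cerberus validation in Python. The schema and row below are invented for the example; abacus builds its actual validation schema from your data dictionary.

    # Hypothetical sketch of a Cerberus check like the one validate_csv runs.
    # The schema here is illustrative; abacus derives its schema from the
    # data dictionary rather than hard-coding it.
    from cerberus import Validator

    schema = {
        "age": {"type": "integer", "min": 0},
        "sex": {"type": "string", "allowed": ["F", "M"]},
    }
    v = Validator(schema)

    row = {"age": -3, "sex": "F"}
    if not v.validate(row):
        print(v.errors)  # {'age': ['min value is 0']}
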

    summarize_csv

    summarize_csv computes aggregates and attributes of the provided dataset and exports them as a YAML file.
    See the Data Expectations section below.

    summarize_csv -dd {path/to/datadictionary.csv} -dt {path/to/dataset.csv} -m {format of missing values in the dataset, e.g. NA, na, null} -e {export/filepath/summary.yaml}
    
    # example 
    summarize_csv -dd data/input/data_dictionary.csv -dt data/input/dataset.csv -m NA -e data/output/summary.yaml
    
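    The exact contents of the exported summary are defined by abacus. As a rough sketch of the kind of per-column aggregates involved, computed with pandas and dumped as YAML (the field names here are assumptions, not abacus's output schema):

    # Hypothetical sketch of per-column aggregates written out as YAML.
    import pandas as pd
    import yaml

    # Treat "NA" as missing, mirroring the -m flag in the example above.
    df = pd.read_csv("data/input/dataset.csv", na_values=["NA"])

    summary = {
        col: {
            "dtype": str(df[col].dtype),
            "n_missing": int(df[col].isna().sum()),
            "n_unique": int(df[col].nunique()),
        }
        for col in df.columns
    }

    with open("data/output/summary.yaml", "w") as fh:
        yaml.safe_dump(summary, fh)
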

    validate_linkml

    validate_linkml runs LinkML validation on a data dictionary/dataset pair and prints the validation results in the terminal. Run it from the directory that contains the data files (the data dictionary, the dataset, AND any imported, adjoining data dictionaries).
    See the Data Expectations section below.

    validate_linkml -dd {path/to/datadictionary.yaml} -dt {path/to/dataset.yaml} -dc {data class, i.e. the LinkML tree_root}
    
    # example 
    validate_linkml -dd data/input/assay.yaml -dt data/input/assay_data.yaml -dc Assay
    
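    Under the hood this is standard LinkML validation. A minimal sketch, assuming a recent linkml release that provides the linkml.validator.validate helper; it illustrates the kind of check performed, not abacus's internals:

    # Sketch: validate one instance of the Assay class against the schema.
    # Paths reuse the example above; the schema's imports must sit alongside it.
    from linkml.validator import validate

    instance = {"studyCode": "Study1", "participantGlobalId": "PID123"}
    report = validate(instance, "data/input/assay.yaml", target_class="Assay")

    for result in report.results:
        print(result.message)  # no results means the instance is valid
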

    Data Expectations

    csv - validation (cerberus) and summary

    data dictionary format:

    Visit this link for more in-depth specs

    dataset format:

    Datasets should be CSVs, follow the format described by the data dictionary, and use consistent notation for missing data [NULL, NA, etc.].

    yaml/json - validation (linkml)

    data dictionary format:

    Data dictionaries should be YAML files formatted for LinkML and contain all dataset expectations for validation. Validation requires that every data dictionary referenced in the imports section be present in the same file location. Imports beginning with linkml: can be ignored.
    An example is shown below.

    id: https://w3id.org/include/assay
    imports:
    - linkml:types
    - include_core
    - include_participant
    - include_study
    
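    As a quick sanity check that the imports resolve, and to find the class to pass as -dc, the schema can be inspected with linkml_runtime's SchemaView. A sketch, reusing the file names from the earlier example:

    # Sketch: load the data dictionary and list its tree_root classes.
    # The imports (include_core, etc.) must sit alongside the schema file.
    from linkml_runtime import SchemaView

    view = SchemaView("data/input/assay.yaml")
    tree_roots = [c.name for c in view.all_classes().values() if c.tree_root]
    print(tree_roots)  # e.g. ['Assay'], the value to pass as -dc
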

    dataset format:

    Datasets should be a YAML, JSON, or CSV file formatted for LinkML, follow the format described by the data dictionary, and use consistent notation for missing data [NULL, NA, etc.].

    If the dataset is a CSV, multivalue fields should use pipe separators.
    See the examples below.

    # Yaml file representation
    # Instances of Biospecimen class
    - studyCode: "Study1"
      participantGlobalId: "PID123"
      ...
      ...
      ...
    - studyCode: "Study1"
      participantGlobalId: "PID123"
    

    CSV representation

    studyCode,studyTitle,program
    study_code,Study of Cancer,program1|program2
    
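    When a CSV like this is read back, the pipe-separated column splits into a list. A small illustrative sketch (the file name is hypothetical, containing the rows above):

    # Sketch: split the pipe-separated multivalue "program" column.
    import csv

    with open("study.csv") as fh:  # hypothetical file holding the rows above
        for row in csv.DictReader(fh):
            programs = row["program"].split("|")
            print(row["studyCode"], programs)  # study_code ['program1', 'program2']
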

    Working on a branch?

    If you are working on a new feature, it is possible to install the package version from a remote or local branch. These commands should be run from the project root.

    # remote
    pip install git+https://github.com/NIH-NCPI/abacus.git@{branch_name}
    
    # local
    pip install -e .
    
    # Handy troubleshooting commands when unsure of the installed version.
    pip install --upgrade abacus
    pip install --upgrade abacus==2.0.0
    pip uninstall abacus -y
    
    
