[IBCDPE-688] Great Expectations Implementation for Metabolomics Data #96

Merged 74 commits on Nov 22, 2023
Changes from 31 commits
Commits
bc729d4
adds great expectations to repo
BWMac Nov 8, 2023
28c0168
adds great expectations workspace
BWMac Nov 8, 2023
62ab669
adds great expectations metabolomics example and data
BWMac Nov 8, 2023
4fcad25
remove python 3.10 and up
BWMac Nov 9, 2023
6e0449d
removes jupyter notebooks from gitignore
BWMac Nov 9, 2023
5de025f
adds metabolomics JSON gx suite
BWMac Nov 9, 2023
5987944
adds new custom expecations
BWMac Nov 9, 2023
4530a14
updates metabolomics expectation suite
BWMac Nov 9, 2023
50538a4
hard-pins gx version
BWMac Nov 10, 2023
eee3f96
removes metabolomics script and sample data
BWMac Nov 10, 2023
1451b3f
updates metabolomics Jupyter Notebook
BWMac Nov 10, 2023
11b7e2b
adds great expectations class
BWMac Nov 10, 2023
4cdee4d
implement great exepctations in processing
BWMac Nov 10, 2023
379814c
adds missing docstrings
BWMac Nov 10, 2023
46d1997
updates GX execution
BWMac Nov 10, 2023
8ae53ce
adds missing type hints
BWMac Nov 10, 2023
8c209d8
clean jupyter notebook
BWMac Nov 10, 2023
73f128d
updates runner class
BWMac Nov 10, 2023
19e26db
fix relative path
BWMac Nov 10, 2023
50a4c5c
adds dummy data for testing
BWMac Nov 10, 2023
66f66d5
updates gx class
BWMac Nov 10, 2023
445d9dc
adds tests for class methods
BWMac Nov 10, 2023
8d2dcbf
breaks up stacked logic in _get_results_path
BWMac Nov 13, 2023
46169b9
adds test_get_results_path
BWMac Nov 13, 2023
e47b834
adds gx upload folder to configs
BWMac Nov 14, 2023
cdf5ef7
updates gx class to take dataset_name and upload_folder
BWMac Nov 14, 2023
2b8ffec
updates process to check if the upload folder is in config
BWMac Nov 14, 2023
d30a081
updates tests for gx class and methods
BWMac Nov 14, 2023
2113d07
reorganize tests into class
BWMac Nov 14, 2023
2be7e60
updates contributing docs
BWMac Nov 14, 2023
4fb2e45
Update CONTRIBUTING.md
BWMac Nov 15, 2023
c15137b
remove commented code from custom expectations
BWMac Nov 15, 2023
20845bd
adds data docs to notebook
BWMac Nov 15, 2023
3d5b777
adds provenance to report upload
BWMac Nov 15, 2023
9f78e8c
add absolute path logic and test
BWMac Nov 15, 2023
ad19174
updates gx path handling
BWMac Nov 16, 2023
896af76
run tests with print
BWMac Nov 16, 2023
d099b3f
move print statements for debugging
BWMac Nov 16, 2023
083fc8d
fixes typo
BWMac Nov 16, 2023
86957b5
changes plus to path join
BWMac Nov 16, 2023
b3fc034
see directory in action
BWMac Nov 16, 2023
d0e17c8
move great_expectations folder to src
BWMac Nov 16, 2023
f6b5a63
adjust jupyter notebook path
BWMac Nov 16, 2023
50311ac
adjusts paths in class and tests
BWMac Nov 16, 2023
ba0ffc4
move great_expectations to agoradatatools
BWMac Nov 16, 2023
0d89003
adjusts notebook path
BWMac Nov 16, 2023
752d69c
adjusts class and tests paths
BWMac Nov 16, 2023
cafd5cd
lists dir before pytest
BWMac Nov 16, 2023
714c67e
look for path directory
BWMac Nov 16, 2023
1140c23
change to directory path
BWMac Nov 16, 2023
1e9b7e2
updates pipenv lock
BWMac Nov 16, 2023
9645b05
adds include to setup
BWMac Nov 16, 2023
e1c40d0
explicitly add great_expectations
BWMac Nov 16, 2023
9535744
try as package data
BWMac Nov 16, 2023
39e455a
make it a submodule
BWMac Nov 16, 2023
d28c424
try manifest.in
BWMac Nov 16, 2023
5f045ce
change path
BWMac Nov 16, 2023
8943c94
try less specific path test
BWMac Nov 16, 2023
d88d726
removes ls's
BWMac Nov 16, 2023
131688e
remove src
BWMac Nov 16, 2023
808ad40
remove -s from pytest in CI
BWMac Nov 16, 2023
7e7e9dc
updates notebook instructions
BWMac Nov 16, 2023
c3cc759
removes print statements
BWMac Nov 16, 2023
5cd68d4
updates contributing guide
BWMac Nov 16, 2023
999d8ce
adds comment to MANIFEST.in
BWMac Nov 16, 2023
47dc9c9
bumps synapse python client version
BWMac Nov 16, 2023
fb2e69e
updates process and tests
BWMac Nov 22, 2023
fb7b5fa
updates extract
BWMac Nov 22, 2023
bf1b583
updates process docstring
BWMac Nov 22, 2023
56179b3
updates load and tests
BWMac Nov 22, 2023
57df208
updates docstring
BWMac Nov 22, 2023
db265af
updates docstring
BWMac Nov 22, 2023
8112d98
Merge pull request #97 from Sage-Bionetworks/bwmac/IBCDPE-527/syn_req…
BWMac Nov 22, 2023
0c680cb
updates min_values to strict_mins
BWMac Nov 22, 2023
5 changes: 3 additions & 2 deletions .github/workflows/dev.yml
@@ -21,8 +21,9 @@ jobs:
python-version:
- "3.8"
- "3.9"
- "3.10"
- "3.11"
# Support for Python 3.10 and 3.11 is temporarily disabled
# - "3.10"
# - "3.11"
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
4 changes: 0 additions & 4 deletions .gitignore
@@ -75,10 +75,6 @@ docs/_build/
# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
*.ipynb

# IPython
profile_default/
ipython_config.py
19 changes: 19 additions & 0 deletions CONTRIBUTING.md
@@ -139,6 +139,25 @@ This package has a `src/agoradatatools/etl/transform` submodule. This folder ho
- Use `pytest.mark.parametrize` to loop through multiple datasets in a single test.
- The class `TestTransformGenesBiodomains` can serve as an example for future contributed tests.

### Great Expectations

This package uses [Great Expectations](https://greatexpectations.io/) to validate output data. The `great_expectations` folder houses our file system data context and Great Expectations-specific configuration files. Eventually, our goal is for each `agora-data-tools` dataset to be covered by an expectation suite. To add data validation for more datasets, follow these steps:

1. Create a new expectation suite by defining the expectations for the new dataset in a Jupyter Notebook inside the `gx_suite_definitions` folder. Use `metabolomics.ipynb` as an example.
1. Run the notebook to generate the new expectation suite. It should populate as a JSON file in the `great_expectations/expectations` folder.
1. Add support for running Great Expectations on a dataset by adding the `gx_folder` key to the configuration for your dataset in both `test_config.yaml` and `config.yaml`. The `gx_folder` should be a Synapse ID pointing to the folder where we will upload generated HTML reports from Great Expectations. If a folder specific to your dataset does not yet exist in the proper locations ([Prod](https://www.synapse.org/#!Synapse:syn52948668), [Testing](https://www.synapse.org/#!Synapse:syn52948670)), create a folder named after the dataset in each location and copy the new folders' Synapse IDs to the config files.
1. Test data processing by running `adt test_config.yaml` and ensure that HTML reports with all expectations are generated and uploaded to the proper folder in Synapse.
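
As a rough illustration of step 3, the sketch below treats `gx_folder` as an opt-in key in a dataset's configuration. This is a minimal sketch, not the package's actual code: the helper name `get_gx_folder`, the sample dictionaries, and the placeholder Synapse IDs are all hypothetical, and it assumes each dataset's configuration is available as a plain dictionary.

```python
from typing import Optional


def get_gx_folder(dataset_config: dict) -> Optional[str]:
    """Return the Synapse folder ID for GX reports, or None if not configured.

    Hypothetical helper: only the `gx_folder` key name comes from the guide above.
    """
    return dataset_config.get("gx_folder")


# Illustrative configurations with placeholder Synapse IDs (not real folders).
with_gx = {"destination": "syn00000000", "gx_folder": "syn11111111"}
without_gx = {"destination": "syn00000000"}

for name, cfg in {"with_gx": with_gx, "without_gx": without_gx}.items():
    folder = get_gx_folder(cfg)
    if folder is None:
        print(f"{name}: no gx_folder configured, skipping validation")
    else:
        print(f"{name}: validation report would be uploaded to {folder}")
```

Keeping the key optional means datasets without an expectation suite can continue to be processed unchanged while coverage is added one dataset at a time.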

#### Custom Expectations

This repository is currently home to three custom expectations that were created for use on `agora-data-tools` datasets:

1. `ExpectColumnValuesToHaveListLength`: checks to see if the lists in a particular column are the length that we expect.
1. `ExpectColumnValuesToHaveListMembers`: checks to see if the lists in a particular column contain only values that we expect.
1. `ExpectColumnValuesToHaveListMembersOfType`: checks to see if the lists in a particular column contain members of the type we expect.

These expectations are defined in the `great_expectations/gx/plugins/expectations` folder. To add more custom expectations, follow the instructions provided in the Great Expectations [documentation](https://docs.greatexpectations.io/docs/guides/expectations/custom_expectations_lp).
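
To make the intent of the three checks concrete, here is a plain-Python sketch of the per-value logic each one describes. It does not use the Great Expectations API and is not the code in the plugins folder; the function names, parameters, and sample data are hypothetical and chosen only for illustration.

```python
from typing import Any, Iterable, List


def has_list_length(value: Any, length: int) -> bool:
    """True if `value` is a list with exactly `length` items."""
    return isinstance(value, list) and len(value) == length


def has_list_members(value: Any, allowed: Iterable[Any]) -> bool:
    """True if `value` is a list whose items all appear in `allowed`."""
    allowed_set = set(allowed)
    return isinstance(value, list) and all(m in allowed_set for m in value)


def has_list_members_of_type(value: Any, member_type: type) -> bool:
    """True if `value` is a list whose items are all instances of `member_type`."""
    return isinstance(value, list) and all(isinstance(m, member_type) for m in value)


# Hypothetical column values: one list per row.
measurements: List[Any] = [[1.0, 2.5], [0.3, 4.1], [7.2]]
labels: List[Any] = [["AD", "CT"], ["AD", "XX"]]

length_ok = [has_list_length(v, 2) for v in measurements]
members_ok = [has_list_members(v, ["AD", "CT"]) for v in labels]
type_ok = [has_list_members_of_type(v, float) for v in measurements]

print(length_ok)   # [True, True, False]
print(members_ok)  # [True, False]
print(type_ok)     # [True, True, True]
```

In a real custom expectation, logic of this kind would be applied per row of the column, with the framework reporting which rows failed.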

### DockerHub

Rather than using GitHub actions to build and push Docker images to DockerHub, the Docker images are automatically built in DockerHub. This requires the `sagebiodockerhub` GitHub user to be an Admin of this repo. You can view the docker build [here](https://hub.docker.com/r/sagebionetworks/agora-data-tools).
3,935 changes: 3,324 additions & 611 deletions Pipfile.lock

Large diffs are not rendered by default.

3 changes: 2 additions & 1 deletion config.yaml
@@ -13,7 +13,7 @@
column_rename:
biodomain: name
destination: *dest

- genes_biodomains:
files:
- name: genes_biodomains
@@ -102,6 +102,7 @@
provenance:
- syn26064497.1
destination: *dest
gx_folder: syn52948669

- gene_info:
files:
3 changes: 3 additions & 0 deletions great_expectations/gx/.gitignore
@@ -0,0 +1,3 @@

uncommitted/
.ge_store_backend_id
30 changes: 30 additions & 0 deletions great_expectations/gx/checkpoints/metabolomics.yml
@@ -0,0 +1,30 @@
name: metabolomics
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template:
expectation_suite_name:
batch_request: {}
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
evaluation_parameters: {}
runtime_configuration: {}
validations:
  - batch_request:
      datasource_name: default_pandas_datasource
      data_asset_name: '#ephemeral_pandas_asset'
      options: {}
      batch_slice:
    expectation_suite_name: metabolomics
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id: