[IBCDPE-688] Great Expectations Implementation for Metabolomics Data #96

Merged
merged 74 commits into from
Nov 22, 2023
Changes from 3 commits
bc729d4
adds great expectations to repo
BWMac Nov 8, 2023
28c0168
adds great expectations workspace
BWMac Nov 8, 2023
62ab669
adds great expectations metabolomics example and data
BWMac Nov 8, 2023
4fcad25
remove python 3.10 and up
BWMac Nov 9, 2023
6e0449d
removes jupyter notebooks from gitignore
BWMac Nov 9, 2023
5de025f
adds metabolomics JSON gx suite
BWMac Nov 9, 2023
5987944
adds new custom expectations
BWMac Nov 9, 2023
4530a14
updates metabolomics expectation suite
BWMac Nov 9, 2023
50538a4
hard-pins gx version
BWMac Nov 10, 2023
eee3f96
removes metabolomics script and sample data
BWMac Nov 10, 2023
1451b3f
updates metabolomics Jupyter Notebook
BWMac Nov 10, 2023
11b7e2b
adds great expectations class
BWMac Nov 10, 2023
4cdee4d
implement great expectations in processing
BWMac Nov 10, 2023
379814c
adds missing docstrings
BWMac Nov 10, 2023
46d1997
updates GX execution
BWMac Nov 10, 2023
8ae53ce
adds missing type hints
BWMac Nov 10, 2023
8c209d8
clean jupyter notebook
BWMac Nov 10, 2023
73f128d
updates runner class
BWMac Nov 10, 2023
19e26db
fix relative path
BWMac Nov 10, 2023
50a4c5c
adds dummy data for testing
BWMac Nov 10, 2023
66f66d5
updates gx class
BWMac Nov 10, 2023
445d9dc
adds tests for class methods
BWMac Nov 10, 2023
8d2dcbf
breaks up stacked logic in _get_results_path
BWMac Nov 13, 2023
46169b9
adds test_get_results_path
BWMac Nov 13, 2023
e47b834
adds gx upload folder to configs
BWMac Nov 14, 2023
cdf5ef7
updates gx class to take dataset_name and upload_folder
BWMac Nov 14, 2023
2b8ffec
updates process to check if the upload folder is in config
BWMac Nov 14, 2023
d30a081
updates tests for gx class and methods
BWMac Nov 14, 2023
2113d07
reorganize tests into class
BWMac Nov 14, 2023
2be7e60
updates contributing docs
BWMac Nov 14, 2023
4fb2e45
Update CONTRIBUTING.md
BWMac Nov 15, 2023
c15137b
remove commented code from custom expectations
BWMac Nov 15, 2023
20845bd
adds data docs to notebook
BWMac Nov 15, 2023
3d5b777
adds provenance to report upload
BWMac Nov 15, 2023
9f78e8c
add absolute path logic and test
BWMac Nov 15, 2023
ad19174
updates gx path handling
BWMac Nov 16, 2023
896af76
run tests with print
BWMac Nov 16, 2023
d099b3f
move print statements for debugging
BWMac Nov 16, 2023
083fc8d
fixes typo
BWMac Nov 16, 2023
86957b5
changes plus to path join
BWMac Nov 16, 2023
b3fc034
see directory in action
BWMac Nov 16, 2023
d0e17c8
move great_expectations folder to src
BWMac Nov 16, 2023
f6b5a63
adjust jupyter notebook path
BWMac Nov 16, 2023
50311ac
adjusts paths in class and tests
BWMac Nov 16, 2023
ba0ffc4
move great_expectations to agoradatatools
BWMac Nov 16, 2023
0d89003
adjusts notebook path
BWMac Nov 16, 2023
752d69c
adjusts class and tests paths
BWMac Nov 16, 2023
cafd5cd
lists dir before pytest
BWMac Nov 16, 2023
714c67e
look for path directory
BWMac Nov 16, 2023
1140c23
change to directory path
BWMac Nov 16, 2023
1e9b7e2
updates pipenv lock
BWMac Nov 16, 2023
9645b05
adds include to setup
BWMac Nov 16, 2023
e1c40d0
explicitly add great_expectations
BWMac Nov 16, 2023
9535744
try as package data
BWMac Nov 16, 2023
39e455a
make it a submodule
BWMac Nov 16, 2023
d28c424
try manifest.in
BWMac Nov 16, 2023
5f045ce
change path
BWMac Nov 16, 2023
8943c94
try less specific path test
BWMac Nov 16, 2023
d88d726
removes ls's
BWMac Nov 16, 2023
131688e
remove src
BWMac Nov 16, 2023
808ad40
remove -s from pytest in CI
BWMac Nov 16, 2023
7e7e9dc
updates notebook instructions
BWMac Nov 16, 2023
c3cc759
removes print statements
BWMac Nov 16, 2023
5cd68d4
updates contributing guide
BWMac Nov 16, 2023
999d8ce
adds comment to MANIFEST.in
BWMac Nov 16, 2023
47dc9c9
bumps synapse python client version
BWMac Nov 16, 2023
fb2e69e
updates process and tests
BWMac Nov 22, 2023
fb7b5fa
updates extract
BWMac Nov 22, 2023
bf1b583
updates process docstring
BWMac Nov 22, 2023
56179b3
updates load and tests
BWMac Nov 22, 2023
57df208
updates docstring
BWMac Nov 22, 2023
db265af
updates docstring
BWMac Nov 22, 2023
8112d98
Merge pull request #97 from Sage-Bionetworks/bwmac/IBCDPE-527/syn_req…
BWMac Nov 22, 2023
0c680cb
updates min_values to strict_mins
BWMac Nov 22, 2023
3,935 changes: 3,324 additions & 611 deletions Pipfile.lock

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions great_expectations/gx/.gitignore
@@ -0,0 +1,3 @@

uncommitted/
.ge_store_backend_id
102 changes: 102 additions & 0 deletions great_expectations/gx/great_expectations.yml
@@ -0,0 +1,102 @@

# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
# - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
# - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3

# Datasources tell Great Expectations where your data lives and how to get it.
# Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources: {}

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GX encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials


config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    # Evaluation Parameters enable dynamic expectations. Read more here:
    # https://docs.greatexpectations.io/docs/reference/evaluation_parameters/
    class_name: EvaluationParameterStore

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    # set to false to hide how-to buttons in Data Docs
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  enabled: True
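
Note on the variable substitution described in the comments of great_expectations.yml: the precedence rule (environment variables win over entries from the config variables file) can be illustrated with a small standard-library sketch. This is not how Great Expectations implements it; `substitute`, `config_variables`, and `fake_environ` are hypothetical names used only for illustration, and `string.Template` is a stand-in that happens to understand both the `${name}` and `$name` forms.

```python
import os
from collections import ChainMap
from string import Template


def substitute(value: str, config_variables: dict, environ=os.environ) -> str:
    """Replace ${name} / $name placeholders in a config value.

    Hypothetical helper for illustration; not part of Great Expectations.
    """
    # ChainMap searches its maps left to right, so environment variables
    # shadow entries with the same name from the config variables file.
    lookup = ChainMap(dict(environ), config_variables)
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return Template(value).safe_substitute(lookup)


# Assumed contents of a config variables file and of the environment.
config_variables = {"my_value": "from_config_file", "db_host": "localhost"}
fake_environ = {"my_value": "from_environment"}

resolved_braces = substitute("${my_value}", config_variables, fake_environ)
resolved_plain = substitute("$db_host", config_variables, fake_environ)
```

Here `resolved_braces` takes the environment value because both sources define `my_value`, while `resolved_plain` falls back to the config variables entry.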
@@ -0,0 +1,22 @@
/*index page*/
.ge-index-page-site-name-title {}
.ge-index-page-table-container {}
.ge-index-page-table {}
.ge-index-page-table-profiling-links-header {}
.ge-index-page-table-expectations-links-header {}
.ge-index-page-table-validations-links-header {}
.ge-index-page-table-profiling-links-list {}
.ge-index-page-table-profiling-links-item {}
.ge-index-page-table-expectation-suite-link {}
.ge-index-page-table-validation-links-list {}
.ge-index-page-table-validation-links-item {}

/*breadcrumbs*/
.ge-breadcrumbs {}
.ge-breadcrumbs-item {}

/*navigation sidebar*/
.ge-navigation-sidebar-container {}
.ge-navigation-sidebar-content {}
.ge-navigation-sidebar-title {}
.ge-navigation-sidebar-link {}
@@ -0,0 +1,142 @@
import pandas as pd
from typing import Optional, Any

from great_expectations.core.expectation_configuration import ExpectationConfiguration
from great_expectations.execution_engine import PandasExecutionEngine
from great_expectations.expectations.expectation import ColumnMapExpectation
from great_expectations.expectations.metrics import (
    ColumnMapMetricProvider,
    column_condition_partial,
)


# This class defines a Metric to support your Expectation.
# For most ColumnMapExpectations, the main business logic for calculation will live in this class.
class ColumnValuesListLength(ColumnMapMetricProvider):
    """Class definition for list length checking metric."""

    # This is the id string that will be used to reference your metric.
    condition_metric_name = "column_values.list_length"
    condition_value_keys = ("list_length",)

    # This method implements the core logic for the PandasExecutionEngine
    @column_condition_partial(engine=PandasExecutionEngine)
    def _pandas(
        cls, column: pd.core.series.Series, list_length: int, **kwargs
    ) -> pd.core.series.Series:
        """Core logic for list length checking metric on a
        pandas execution engine.

        Args:
            column (pd.core.series.Series): Pandas column to be evaluated.
            list_length (int): Expected list length.

        Returns:
            pd.core.series.Series: Boolean series marking which column values
                have the expected list length.
        """
        return column.apply(lambda x: cls._check_list_length(x, list_length))

    @staticmethod
    def _check_list_length(cell: Any, list_length: int) -> bool:
        """Check if a cell is a list, and if it has the expected length.

        Args:
            cell (Any): Individual cell to be evaluated.
            list_length (int): Expected list length.

        Returns:
            bool: Whether or not the cell is a list with the expected length.
        """
        if not isinstance(cell, list):
            return False
        if len(cell) != list_length:
            return False
        return True

    # This method defines the business logic for evaluating your metric when using a SqlAlchemyExecutionEngine
    # @column_condition_partial(engine=SqlAlchemyExecutionEngine)
    # def _sqlalchemy(cls, column, _dialect, **kwargs):
    #     raise NotImplementedError

    # This method defines the business logic for evaluating your metric when using a SparkDFExecutionEngine
    # @column_condition_partial(engine=SparkDFExecutionEngine)
    # def _spark(cls, column, **kwargs):
    #     raise NotImplementedError


# This class defines the Expectation itself
class ExpectColumnValuesToHaveListLength(ColumnMapExpectation):
    """Expect the list in column values to have a certain length."""

    # These examples will be shown in the public gallery.
    # They will also be executed as unit tests for your Expectation.
    examples = [
        {
            "data": {
                "a": [[1, 2, 3, 4, 5]],
            },
            "tests": [
                {
                    "title": "positive_test_with_list_length_5",
                    "exact_match_out": False,
                    "include_in_gallery": True,
                    "in": {"column": "a", "list_length": 5},
                    "out": {"success": True},
                },
                {
                    "title": "negative_test_with_list_length_5",
                    "exact_match_out": False,
                    "include_in_gallery": True,
                    "in": {"column": "a", "list_length": 4},
                    "out": {"success": False},
                },
            ],
        }
    ]

    # This is the id string of the Metric used by this Expectation.
    # For most Expectations, it will be the same as the `condition_metric_name` defined in your Metric class above.
    map_metric = "column_values.list_length"

    # This is a list of parameter names that can affect whether the Expectation evaluates to True or False
    success_keys = ("list_length",)

    # This dictionary contains default values for any parameters that should have default values
    default_kwarg_values = {}

    def validate_configuration(
        self, configuration: Optional[ExpectationConfiguration] = None
    ) -> None:
        """
        Validates that a configuration has been set, and sets a configuration if it has yet to be set. Ensures that
        necessary configuration arguments have been provided for the validation of the expectation.

        Args:
            configuration (OPTIONAL[ExpectationConfiguration]): \
                An optional Expectation Configuration entry that will be used to configure the expectation
        Returns:
            None. Raises InvalidExpectationConfigurationError if the config is not validated successfully
        """

        super().validate_configuration(configuration)
        configuration = configuration or self.configuration

        # # Check other things in configuration.kwargs and raise Exceptions if needed
        # try:
        #     assert (
        #         ...
        #     ), "message"
        #     assert (
        #         ...
        #     ), "message"
        # except AssertionError as e:
        #     raise InvalidExpectationConfigurationError(str(e))

    # This object contains metadata for display in the public Gallery
    library_metadata = {
        "tags": [],  # Tags for this Expectation in the Gallery
        "contributors": [  # Github handles for all contributors to this Expectation.
            "@BWMac",  # Don't forget to add your github handle here!
        ],
    }


if __name__ == "__main__":
    ExpectColumnValuesToHaveListLength().print_diagnostic_checklist()
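
Note on the list-length check in the file above: the per-cell rule can be tried without Great Expectations or pandas. The following is a minimal sketch using only the standard library; `check_list_length` mirrors the logic of `_check_list_length`, while `unexpected_values` and the sample `column` are hypothetical additions for illustration.

```python
from typing import Any, List


def check_list_length(cell: Any, list_length: int) -> bool:
    """Return True only if `cell` is a list with exactly `list_length` items."""
    return isinstance(cell, list) and len(cell) == list_length


def unexpected_values(cells: List[Any], list_length: int) -> List[Any]:
    """Collect the cells that would fail the check (hypothetical helper)."""
    return [cell for cell in cells if not check_list_length(cell, list_length)]


# Made-up column: two valid cells, one short list, one string, one null.
column = [[1, 2], [3, 4], [5], "ab", None]
failures = unexpected_values(column, list_length=2)
```

The string `"ab"` has length 2 but still fails, because the check requires the cell to be a list before it compares lengths.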
94 changes: 94 additions & 0 deletions gx_metabolomics.py
@@ -0,0 +1,94 @@
import great_expectations as gx
from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir="great_expectations")

from expectations.expect_column_values_to_have_list_length import (
    ExpectColumnValuesToHaveListLength,
)

test_dataset = "./metabolomics.json"
context = gx.get_context()
validator = context.sources.pandas_default.read_json(test_dataset)

# ad_diagnosis_p_value
validator.expect_column_values_to_be_of_type("ad_diagnosis_p_value", "list")
validator.expect_column_values_to_not_be_null("ad_diagnosis_p_value")
# for custom and experimental expectations you have to pass args as kwargs
validator.expect_column_values_to_have_list_length(
    column="ad_diagnosis_p_value", list_length=1
)

# associated gene name
validator.expect_column_values_to_be_of_type("associated_gene_name", "str")
validator.expect_column_values_to_not_be_null("associated_gene_name")
validator.expect_column_value_lengths_to_be_between(
    "associated_gene_name", min_value=1, max_value=25
)
# allows all alphanumeric characters, underscores, periods, and dashes
validator.expect_column_values_to_match_regex(
    "associated_gene_name", "^[A-Za-z0-9_.-]+$"
)

# association p
validator.expect_column_values_to_be_of_type("association_p", "float")
validator.expect_column_values_to_not_be_null("association_p")
validator.expect_column_values_to_be_between("association_p", min_value=0, max_value=1)

# ensembl gene id
validator.expect_column_values_to_be_of_type("ensembl_gene_id", "str")
validator.expect_column_values_to_not_be_null("ensembl_gene_id")
validator.expect_column_value_lengths_to_equal("ensembl_gene_id", 15)
# checks format and allowed characters
validator.expect_column_values_to_match_regex("ensembl_gene_id", r"^ENSG\d{11}$")
validator.expect_column_values_to_be_unique("ensembl_gene_id")

# gene_wide_p_threshold_1kgp
validator.expect_column_values_to_be_of_type("gene_wide_p_threshold_1kgp", "float")
validator.expect_column_values_to_not_be_null("gene_wide_p_threshold_1kgp")
validator.expect_column_values_to_be_between(
    "gene_wide_p_threshold_1kgp", min_value=0, max_value=0.05
)

# metabolite full name
validator.expect_column_values_to_be_of_type("metabolite_full_name", "str")
validator.expect_column_values_to_not_be_null("metabolite_full_name")
validator.expect_column_value_lengths_to_be_between(
    "metabolite_full_name", min_value=1, max_value=25
)
# allows alphanumeric characters, whitespace, hyphens, colons, periods, parentheses, and plus signs
validator.expect_column_values_to_match_regex(
    "metabolite_full_name", r"^[A-Za-z0-9\s\-:.()+]+$"
)

# metabolite ID
validator.expect_column_values_to_be_of_type("metabolite_id", "str")
validator.expect_column_values_to_not_be_null("metabolite_id")
validator.expect_column_value_lengths_to_be_between(
    "metabolite_id", min_value=1, max_value=15
)
# allows all alphanumeric characters and periods
validator.expect_column_values_to_match_regex("metabolite_id", "^[A-Za-z0-9.]+$")

# n_per_group
validator.expect_column_values_to_be_of_type("n_per_group", "list")
validator.expect_column_values_to_not_be_null("n_per_group")
validator.expect_column_values_to_have_list_length(column="n_per_group", list_length=2)

# transposed_boxplot_stats
validator.expect_column_values_to_be_of_type("transposed_boxplot_stats", "list")
validator.expect_column_values_to_not_be_null("transposed_boxplot_stats")
validator.expect_column_values_to_have_list_length(
    column="transposed_boxplot_stats", list_length=2
)

# save expectation suite and run checkpoint
validator.save_expectation_suite()
checkpoint = context.add_or_update_checkpoint(
    name="agora-test-checkpoint",
    validator=validator,
)
checkpoint_result = checkpoint.run()

# generate and open report
context.view_validation_result(checkpoint_result)
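
Note on the regex expectations in gx_metabolomics.py: the patterns can be sanity-checked with Python's `re` module before they go into a suite. This is only a rough sketch. The sample values are made up for illustration and do not come from the metabolomics dataset, `conforms` is a hypothetical helper, and the result is assumed (not confirmed here) to line up with how the expectation applies the pattern to each value.

```python
import re

# Patterns copied from the expectation calls above, written as raw strings.
ENSEMBL_GENE_ID = re.compile(r"^ENSG\d{11}$")
GENE_NAME = re.compile(r"^[A-Za-z0-9_.-]+$")
METABOLITE_FULL_NAME = re.compile(r"^[A-Za-z0-9\s\-:.()+]+$")


def conforms(pattern: re.Pattern, value: str) -> bool:
    """Return True if `value` matches `pattern` (hypothetical helper)."""
    return pattern.match(value) is not None


# Made-up sample values, one passing and one failing per pattern.
checks = {
    "gene_id_15_chars": conforms(ENSEMBL_GENE_ID, "ENSG00000139618"),
    "gene_id_too_short": conforms(ENSEMBL_GENE_ID, "ENSG0000013961"),
    "gene_name_plain": conforms(GENE_NAME, "HLA-DRB1"),
    "gene_name_with_space": conforms(GENE_NAME, "APOE 4"),
    "metabolite_with_colon": conforms(METABOLITE_FULL_NAME, "PC aa C36:3"),
    "metabolite_with_underscore": conforms(METABOLITE_FULL_NAME, "PC_aa"),
}
```

The `ENSG` prefix plus 11 digits gives 15 characters, which is consistent with the separate length check on `ensembl_gene_id`.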