Issue Using condition_parser in Great Expectations within Kedro Pipeline #10644

Open
fabio-scarel opened this issue Nov 8, 2024 · 1 comment

Comments

@fabio-scarel

Describe the bug
When integrating Great Expectations with Kedro, an expectation that sets condition_parser and row_condition fails inside the pipeline: instead of returning a validation result, it raises an InvalidCacheValueError (see the traceback below). I've used other expectations successfully, but this is the first one that uses a row condition.

Steps to reproduce the issue:

1. Add an expectation to the suite with ExpectColumnPairValuesToBeInSet.
2. Configure the value_pairs_set, column_A, and column_B.
3. Set condition_parser to 'pandas' and add a row_condition.

suite.add_expectation(
    gx.expectations.ExpectColumnPairValuesToBeInSet(
        value_pairs_set=<EXPECTED_SET>,
        column_A="colA",
        column_B="colB",
        condition_parser="pandas",
        row_condition="<ROW_CONDITION>",
    )
)
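
For reference, here is a pandas-only sketch of what the conditioned check is expected to do. It assumes that with condition_parser set to 'pandas' the row_condition string is applied roughly like DataFrame.query, so that only matching rows are counted. The data, the `region` column, the condition string, and the helper `check_pairs_with_condition` are all hypothetical, since the real values were redacted. It does not call Great Expectations.

```python
import pandas as pd


def check_pairs_with_condition(df, column_a, column_b, value_pairs_set, row_condition):
    """Filter rows with a pandas query string, then check (A, B) pairs on what remains."""
    conditioned = df.query(row_condition)  # keep only rows matching the condition
    pairs = zip(conditioned[column_a], conditioned[column_b])
    unexpected = [pair for pair in pairs if pair not in value_pairs_set]
    return {
        "element_count": len(conditioned),
        "unexpected_count": len(unexpected),
    }


# Hypothetical data: the actual columns and condition were not shared.
df = pd.DataFrame(
    {
        "colA": ["a", "a", "b", "b"],
        "colB": ["x", "y", "x", "z"],
        "region": ["eu", "eu", "us", "us"],
    }
)

result = check_pairs_with_condition(
    df,
    column_a="colA",
    column_b="colB",
    value_pairs_set={("a", "x"), ("a", "y")},
    row_condition='region == "eu"',
)

# Both counts come from the same filtered frame, so unexpected_count
# can never exceed element_count.
assert result["unexpected_count"] <= result["element_count"]
```

In this sketch the element count and the unexpected count are derived from the same filtered rows, which is the consistency the reported result lacks.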

Expected behavior
Expectation should execute successfully within the Kedro pipeline with the row condition applied as specified.

Environment:

Operating System: Windows
Great Expectations Version: 1.1.3
Data Source: Pandas
Additional context: This issue arises only when running within a Kedro pipeline.

Data Docs expectation result:

expect_column_pair_values_to_be_in_set raised an exception:
Traceback (most recent call last):
  File ".../site-packages/great_expectations/validator/validator.py", line 650, in graph_validate
    result = expectation.metrics_validate(
  File ".../site-packages/great_expectations/expectations/expectation.py", line 1113, in metrics_validate
    evr: ExpectationValidationResult = self._build_evr(
  File ".../site-packages/great_expectations/expectations/expectation.py", line 1130, in _build_evr
    evr = ExpectationValidationResult(**raw_response)
  File ".../site-packages/great_expectations/core/expectation_validation_result.py", line 96, in __init__
    raise gx_exceptions.InvalidCacheValueError(result)

great_expectations.exceptions.exceptions.InvalidCacheValueError: Invalid result values were found when trying to instantiate an ExpectationValidationResult.
- Invalid result values are likely caused by inconsistent cache values.
- Great Expectations enables caching by default.
- Please ensure that caching behavior is consistent between the underlying Dataset (e.g. Spark) and Great Expectations.
Result: {
  "element_count": 760,
  "unexpected_count": 98321,
  "unexpected_percent": 100.0,
  "missing_count": -97561,
  "missing_percent": -12836.973684210525,
  "unexpected_percent_total": 12936.973684210525,
  "unexpected_percent_nonmissing": 100.0,

I've removed some sensitive information, but if anything else is needed I can try to help.
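
The negative and over-100% figures in the Result block follow arithmetically from the two raw counts. The sketch below reproduces them from the reported values. The definitions of the derived fields are assumptions about how they are computed (missing count as element count minus non-missing count, percentages relative to element count), not something confirmed in this thread.

```python
import math

# Values copied from the Result block in the error message.
element_count = 760
unexpected_count = 98321
reported_missing_count = -97561
reported_missing_percent = -12836.973684210525
reported_unexpected_percent_total = 12936.973684210525

# Assumption: unexpected_percent_nonmissing == 100.0 means every non-missing
# row was unexpected, so the non-missing count equals unexpected_count.
nonmissing_count = unexpected_count

# Assumed definitions of the derived fields.
missing_count = element_count - nonmissing_count
missing_percent = 100 * missing_count / element_count
unexpected_percent_total = 100 * unexpected_count / element_count

# The derived values match what the error message reported.
assert missing_count == reported_missing_count
assert math.isclose(missing_percent, reported_missing_percent)
assert math.isclose(unexpected_percent_total, reported_unexpected_percent_total)
```

If those assumptions hold, the two counts were taken over different row sets: element_count over 760 rows and the non-missing and unexpected counts over 98,321 rows. That would be consistent with the row condition being applied to one metric but not the other, though this is only one reading of the numbers.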

@adeola-ak
Contributor

Hi @fabio-scarel, thanks for reaching out!

Could you share more details about your workflow with GX and Kedro? Specifically, I'm trying to determine if you're transforming the DataFrame in a pipeline step without providing GX a reference to the updated DataFrame, which might result in applying the row condition to the untransformed data.

Additionally, are you properly updating your batch with each change to the DataFrame?

It would also be helpful if you could share an anonymized version of the row_condition for context.

An engineer I spoke with mentioned that the CachedDataset abstraction might be relevant here, so we may need to explore that as well.

Ultimately, please provide as much detail as possible about your workflow so we can investigate further. Thank you!
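
The first hypothesis above, a validation step that still holds the untransformed DataFrame, can be illustrated with plain pandas. This is only a sketch: `clean_rows` stands in for a hypothetical Kedro node, the `keep` column and data are made up, and no Great Expectations or Kedro APIs are involved.

```python
import pandas as pd


def clean_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical pipeline node: returns a NEW DataFrame, leaving the input untouched."""
    return df[df["keep"]].reset_index(drop=True)


raw = pd.DataFrame(
    {"colA": ["a", "b", "c", "d"], "keep": [True, False, True, False]}
)
transformed = clean_rows(raw)

# If the validation step was handed `raw` before the transformation,
# it still sees all four rows, not the two that survived.
stale_reference = raw
assert len(stale_reference) == 4
assert len(transformed) == 2
assert transformed is not raw
```

Any row counts computed partly from the stale frame and partly from the transformed one would disagree in the way the reported result does.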
