
[BUG] Exception during validation of ExpectColumnValuesToNotBeNull #10410

Open

Utkarsh-Krishna opened this issue Sep 17, 2024 · 13 comments

Labels: bug

@Utkarsh-Krishna

Describe the bug
I am using a Spark/pandas DataFrame with multiple columns and passing one of them as the parameter for this expectation. For one column with no null values there is no exception and I get the expected result. But when I pass certain other columns, whether they contain nulls or not, I see exceptions.

To Reproduce
Traceback (from "exception_info" in the validation result):

Traceback (most recent call last):
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-6e39b63e-ade0-4e51-94c2-99c6cf2319a5/lib/python3.9/site-packages/great_expectations/validator/validator.py", line 648, in graph_validate
    result = expectation.metrics_validate(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-6e39b63e-ade0-4e51-94c2-99c6cf2319a5/lib/python3.9/site-packages/great_expectations/expectations/expectation.py", line 1081, in metrics_validate
    _validate_dependencies_against_available_metrics(
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-6e39b63e-ade0-4e51-94c2-99c6cf2319a5/lib/python3.9/site-packages/great_expectations/expectations/expectation.py", line 2773, in _validate_dependencies_against_available_metrics
    raise InvalidExpectationConfigurationError(  # noqa: TRY003
great_expectations.exceptions.exceptions.InvalidExpectationConfigurationError: Metric ('column_values.nonnull.unexpected_count', '657e384d8614677fff7d7be97ee019fe', ()) is not available for validation of configuration. Please check your configuration.

exception_message: Metric ('column_values.nonnull.unexpected_count', '657e384d8614677fff7d7be97ee019fe', ()) is not available for validation of configuration. Please check your configuration.
raised_exception: true

Environment (please complete the following information):

  • Databricks runtime 12.2 LTS
  • GX version 1.0.4
Utkarsh-Krishna changed the title from "Exception during validation of ExpectColumnValuesToNotBeNull" to "[BUG] Exception during validation of ExpectColumnValuesToNotBeNull" Sep 17, 2024
@adeola-ak
Contributor

Please share the expectations you have written.

@adeola-ak
Contributor

adeola-ak commented Sep 18, 2024

Just to clarify: are you passing columns with both non-null and null values to ExpectColumnValuesToNotBeNull, and only encountering exceptions when the columns have null values?

Could you share as much as possible about your suite and expectation configuration? The exception message suggests that the metric for non-null values isn't being resolved as expected. If your configuration looks fine, I'll escalate this to the team to investigate why the metric can't be computed.

@Utkarsh-Krishna
Author

Utkarsh-Krishna commented Sep 19, 2024

Irrespective of the column values (null or not null), this works for some columns and raises exceptions for others in the validation_results (see the code below).

I am sharing an example of the code that I have written.

CODE:

import great_expectations as gx

context = gx.get_context()
data_source_name = "my_data_source"
data_source = context.data_sources.add_spark(name=data_source_name)
data_asset_name = "my_dataframe_data_asset"
data_asset = data_source.add_dataframe_asset(name=data_asset_name)

batch_definition_name = "my_batch_definition"
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)

suite = context.suites.add(
    gx.core.expectation_suite.ExpectationSuite(name="my_expectations")
)

suite.add_expectation(
    expectation=gx.expectations.ExpectColumnValuesToNotBeNull(column="id")
)

dataframe = spark.table("table1")
batch_parameters = {"dataframe": dataframe}

# Validation
validation_definition = context.validation_definitions.add(
    gx.core.validation_definition.ValidationDefinition(
        name="my_validation_definition",
        data=batch_definition,
        suite=suite,
    )
)

validation_results = validation_definition.run(batch_parameters=batch_parameters)
print(validation_results)
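
If it helps triage, here is a hedged snippet for pulling the per-expectation exception out of the result object, assuming the flat exception_info shape shown in the traceback above:

# Print the exception message for any expectation that raised during validation
for r in validation_results.results:
    info = r.exception_info or {}
    if info.get("raised_exception"):
        print(info.get("exception_message"))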

@adeola-ak
Contributor

For the columns that are working and not working: are they all within the same table, "table1"? Also, for the columns that are not working, what are their data types? I am not yet able to reproduce this in Databricks.

@Utkarsh-Krishna
Author

Data types for columns that are working: date, string
Data types for columns that are NOT working: string
NOT-working column data: 40-character alphanumeric strings

Just FYI: the same data works well with SparkDFDataset (GX version 0.18.17).
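
For reference, a minimal sketch of the legacy 0.x usage being compared here (assuming a Spark DataFrame df; SparkDFDataset was the pre-1.0 API):

from great_expectations.dataset import SparkDFDataset

# Legacy 0.x pattern: wrap the Spark DataFrame and call the expectation directly
dataset = SparkDFDataset(df)
result = dataset.expect_column_values_to_not_be_null("id")
print(result.success)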

@adeola-ak
Contributor

hi there, are you still running into this issue?

@Utkarsh-Krishna
Author

hi, yes I am.

@adeola-ak
Contributor

Can you run your code with the NYC taxi sample data from Databricks (more info here)? Let's try using the same data to see if we can narrow down what's going on, because I am still unable to reproduce this on varying types of data on my end. Your data itself could be the issue.

@Utkarsh-Krishna
Author

I found the issue: the code works for my data if the column name is passed in UPPER CASE, e.g. "ID", but the same code doesn't work if the column name is passed in lower case, e.g. "id".

So it looks like column names are now case sensitive, which was not the case with SparkDFDataset. Let me know if this is expected.
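
In other words, against the code above (the "ID"/"id" casing is illustrative of my schema):

print(dataframe.columns)  # the Spark schema reports the column as 'ID'

# Works: casing matches the schema exactly
gx.expectations.ExpectColumnValuesToNotBeNull(column="ID")

# Raises the "Metric ... is not available" error during validation
gx.expectations.ExpectColumnValuesToNotBeNull(column="id")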

@JerryLeeD3d

I have the same issue: ExpectColumnValuesToNotBeNull on a Spark DataFrame.

@Utkarsh-Krishna
Author

any updates?

@adeola-ak
Contributor

Ensuring that column names are passed in the exact case in which they are defined in the DataFrame schema is a temporary workaround while this is investigated further.
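
A minimal sketch of that workaround, resolving the exact casing from the DataFrame schema before adding the expectation (the schema_cased helper is hypothetical, not a GX API):

def schema_cased(df, name: str) -> str:
    """Return the column name with the exact casing used in the Spark schema."""
    matches = [c for c in df.columns if c.lower() == name.lower()]
    if not matches:
        raise KeyError(f"Column {name!r} not found among {df.columns}")
    return matches[0]

suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(
        column=schema_cased(dataframe, "id")  # resolves to 'ID' if that is the schema casing
    )
)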

adeola-ak added the bug label Oct 22, 2024
@jwalant-dattani

For those still facing this issue, there could be another reason for this.

In my case the column names were in the expected case. However, I was getting this error when the input Spark DataFrame was created from a pandas DataFrame. I switched to creating it directly from the dataset using the spark.read.* API, and then it worked fine. Possibly some implicit conversion of schema or data underneath was causing this.

See if this workaround helps.
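
For anyone comparing, a sketch of the two construction paths (the CSV path is illustrative):

import pandas as pd

# Path that triggered the error: Spark DataFrame built from a pandas DataFrame
pdf = pd.read_csv("/dbfs/tmp/table1.csv")
df_from_pandas = spark.createDataFrame(pdf)

# Path that worked: read the dataset directly with the Spark reader API
df_direct = spark.read.csv("/dbfs/tmp/table1.csv", header=True, inferSchema=True)

batch_parameters = {"dataframe": df_direct}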
