Subclassed checks with an aliased Pydantic Field cannot be validated. #10455

Open
vovavili opened this issue Oct 1, 2024 · 1 comment

@vovavili
Contributor

vovavili commented Oct 1, 2024

Describe the bug
If I try to subclass a check to define some customizable behavior, everything works so long as I keep the original name for the input field. However, if I add an alias for a Field, the expectation can be instantiated, but it cannot be used for validation.

To Reproduce
Code:

import great_expectations as gx

from pydantic import v1 as pydantic_v1
from pyspark.sql import SparkSession
from pyspark.sql import types as st
from great_expectations import expectations as gxe
from great_expectations.core.suite_parameters import SuiteParameterDict


class ExpectColumnValuesToStartWith(gxe.ExpectColumnValuesToMatchRegex):
    """Pre-fill a regex expression expectation with a caret, and escape all special characters."""

    regex: str | SuiteParameterDict = pydantic_v1.Field(
        default="(?s).*",
        alias="startswith",
        description="Expect rows in a given column to start with some particular value.",
    )

    @pydantic_v1.validator("regex", pre=True)
    def validate_regex(cls, v: str):
        return (
            "^"
            + "".join(
                char if char not in set(r"[@_!#$%^&*()<>?/\|}{~:]") else "\\" + char
                for char in v
            )
            + ".*"
        )

    class Config(gxe.ExpectColumnValuesToMatchRegex.Config):
        populate_by_name = True


df = (
    SparkSession.builder.appName("spark")
    .getOrCreate()
    .createDataFrame(
        [("aaa", "bcc"), ("abb", "bdd"), ("acc", "abc")],
        st.StructType(
            [
                st.StructField("col1", st.StringType(), True),
                st.StructField("col2", st.StringType(), True),
            ]
        ),
    )
)


(
    gx.get_context()
    .data_sources.add_spark(name="spark_source")
    .add_dataframe_asset(name="dataframe_asset")
    .add_batch_definition_whole_dataframe(name="whole_dataframe")
    .get_batch(batch_parameters={"dataframe": df})
    .validate(ExpectColumnValuesToStartWith(column="col2", startwith="a"))
)

Stack trace:

ValidationError                           Traceback (most recent call last)
Cell In[25], line 16
     13 exp = ExpectColumnValuesToStartWith(column="col2", startwith="^a.*")
     14 print("Expectation instantiated")
---> 16 batch_definition.get_batch(batch_parameters={"dataframe": df}).validate(exp)
     17 print("DF validated")

File ~/cluster-env/trident_env/lib/python3.10/site-packages/great_expectations/datasource/fluent/interfaces.py:1146, in Batch.validate(self, expect, result_format)
   1143 from great_expectations.expectations.expectation import Expectation
   1145 if isinstance(expect, Expectation):
-> 1146     return self._validate_expectation(expect, result_format=result_format)
   1147 elif isinstance(expect, ExpectationSuite):
   1148     return self._validate_expectation_suite(expect, result_format=result_format)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/great_expectations/datasource/fluent/interfaces.py:1163, in Batch._validate_expectation(self, expect, result_format)
   1156 def _validate_expectation(
   1157     self,
   1158     expect: Expectation,
   1159     result_format: ResultFormatUnion,
   1160 ) -> ExpectationValidationResult:
   1161     return self._create_validator(
   1162         result_format=result_format,
-> 1163     ).validate_expectation(expect)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/great_expectations/validator/v1_validator.py:55, in Validator.validate_expectation(self, expectation, expectation_parameters)
     49 def validate_expectation(
     50     self,
     51     expectation: Expectation,
     52     expectation_parameters: Optional[dict[str, Any]] = None,
     53 ) -> ExpectationValidationResult:
     54     """Run a single expectation against the batch definition"""
---> 55     results = self._validate_expectation_configs([expectation.configuration])
     57     assert len(results) == 1
     58     return results[0]

File ~/cluster-env/trident_env/lib/python3.10/site-packages/great_expectations/validator/v1_validator.py:128, in Validator._validate_expectation_configs(self, expectation_configs, expectation_parameters)
    125 else:
    126     runtime_configuration = {"result_format": self.result_format}
--> 128 results = self._wrapped_validator.graph_validate(
    129     configurations=processed_expectation_configs,
    130     runtime_configuration=runtime_configuration,
    131 )
    133 if self._include_rendered_content:
    134     for result in results:

File ~/cluster-env/trident_env/lib/python3.10/site-packages/great_expectations/validator/validator.py:602, in Validator.graph_validate(self, configurations, runtime_configuration)
    594 evrs: List[ExpectationValidationResult]
    596 processed_configurations: List[ExpectationConfiguration] = []
    598 (
    599     expectation_validation_graphs,
    600     evrs,
    601     processed_configurations,
--> 602 ) = self._generate_metric_dependency_subgraphs_for_each_expectation_configuration(
    603     expectation_configurations=configurations,
    604     processed_configurations=processed_configurations,
    605     catch_exceptions=catch_exceptions,
    606     runtime_configuration=runtime_configuration,
    607 )
    609 graph: ValidationGraph = self._generate_suite_level_graph_from_expectation_level_sub_graphs(
    610     expectation_validation_graphs=expectation_validation_graphs
    611 )
    613 resolved_metrics: _MetricsDict

File ~/cluster-env/trident_env/lib/python3.10/site-packages/great_expectations/validator/validator.py:703, in Validator._generate_metric_dependency_subgraphs_for_each_expectation_configuration(self, expectation_configurations, processed_configurations, catch_exceptions, runtime_configuration)
    700 if self.active_batch_id:
    701     evaluated_config.kwargs.update({"batch_id": self.active_batch_id})
--> 703 expectation = evaluated_config.to_domain_obj()
    704 validation_dependencies: ValidationDependencies = (
    705     expectation.get_validation_dependencies(
    706         execution_engine=self._execution_engine,
    707         runtime_configuration=runtime_configuration,
    708     )
    709 )
    711 try:

File ~/cluster-env/trident_env/lib/python3.10/site-packages/great_expectations/expectations/expectation_configuration.py:459, in ExpectationConfiguration.to_domain_obj(self)
    457     kwargs.update({"description": self.description})
    458 kwargs.update(self.kwargs)
--> 459 return expectation_impl(**kwargs)

File /nfs4/pyenv-74aa3c6d-785e-495b-ba94-32a358134de2/lib/python3.10/site-packages/pydantic/v1/main.py:341, in BaseModel.__init__(__pydantic_self__, **data)
    339 values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
    340 if validation_error:
--> 341     raise validation_error
    342 try:
    343     object_setattr(__pydantic_self__, '__dict__', values)

ValidationError: 1 validation error for ExpectColumnValuesToStartWith
regex
  extra fields not permitted (type=value_error.extra)

Please note that this fails even if I allow extra fields via the Config class attribute (extra = "allow").
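
Roughly, that Config override looks like this (a sketch using the pydantic v1 spellings of the options; the regex field and validator are unchanged from the reproduction above), and validation still fails:

class ExpectColumnValuesToStartWith(gxe.ExpectColumnValuesToMatchRegex):
    # ... same regex field and validate_regex validator as above ...

    class Config(gxe.ExpectColumnValuesToMatchRegex.Config):
        populate_by_name = True
        extra = pydantic_v1.Extra.allow  # pydantic v1 also accepts the string "allow"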

Expected behavior
I expected this check to fail because of a bad row value, but instead I got a ValidationError, because Field aliases are not accounted for in the batch_definition.get_batch(...).validate method.
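
The stack trace points at where this happens: to_domain_obj rebuilds the expectation from kwargs keyed by the field name ("regex"), and once an alias is declared, pydantic v1 treats that key as an extra field. A minimal standalone sketch of this behavior, with no Great Expectations involved (the extra = forbid setting is an assumption, mirroring the "extra fields not permitted" error):

from pydantic import v1 as pydantic_v1


class AliasedModel(pydantic_v1.BaseModel):
    # Same shape as the subclassed expectation: a defaulted field with an alias.
    regex: str = pydantic_v1.Field(default="(?s).*", alias="startwith")

    class Config:
        # Assumption: mirrors the "extra fields not permitted" error in the stack trace.
        extra = pydantic_v1.Extra.forbid


AliasedModel(startwith="^a.*")  # instantiating via the alias works
AliasedModel(regex="^a.*")      # ValidationError: regex -> extra fields not permitted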

Environment (please complete the following information):

  • Operating System: Linux
  • Great Expectations Version: 1.0.5
  • Data Source: Spark
  • Cloud environment: Databricks

Additional context
See my discussion with Tyler Hoffman here.

@vovavili
Contributor Author

vovavili commented Oct 1, 2024

A workaround for this is something like the function below, but the documentation seems to implicitly suggest subclassing for customized checks, so I think this bug is still worth pursuing, at the very least because fixing it would also allow useful things like modifying the description Field parameter in one go:

def ExpectColumnValuesToStartWith(
    startwith: str, column: str
) -> gxe.ExpectColumnValuesToMatchRegex:
    """Pre-fill a regex expression expectation with a caret, and escape all special characters."""
    regex = (
        "^"
        + "".join(
            char if char not in set(r"[@_!#$%^&*()<>?/\|}{~:]") else "\\" + char
            for char in startwith
        )
        + ".*"
    )
    return gxe.ExpectColumnValuesToMatchRegex(regex=regex, column=column)
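
With this factory function, the reproduction above changes only at the call site (same context, data source, and dataframe):

(
    gx.get_context()
    .data_sources.add_spark(name="spark_source")
    .add_dataframe_asset(name="dataframe_asset")
    .add_batch_definition_whole_dataframe(name="whole_dataframe")
    .get_batch(batch_parameters={"dataframe": df})
    .validate(ExpectColumnValuesToStartWith(startwith="a", column="col2"))
)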
