unexpected_index_query #10701

Open
niatos99 opened this issue Nov 24, 2024 · 1 comment
Labels
feature-request

Comments


niatos99 commented Nov 24, 2024

Describe the bug
I want to execute the unexpected_index_query so that I can find the broken records at runtime.

But the unexpected_index_query does not work as-is.

  1. The output for gxe.ExpectColumnValuesToBeUnique(column="Navn")
    appears to be broken because of the internal token "unspecifiedframe$".
    The validation result returns this query: df.filter(F.expr(NOT (count(1) OVER (PARTITION BY Navn unspecifiedframe$()) <= 1)))
    and even after editing it to dataframe.filter(F.expr("NOT (count(1) OVER (PARTITION BY Navn) <= 1)"))
    Spark raises "AnalysisException: It is not allowed to use window functions inside WHERE clause".

It would be nice to use it at runtime, so that we can find and quarantine bad records.
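For what it's worth, Spark's restriction here is real: window functions are not allowed in a WHERE clause, so even a cleaned-up version of the uniqueness query cannot run as a filter directly. The usual workaround is to compute the per-key count first and filter on it afterwards. As a minimal sketch of that logic in plain Python (not Spark, and not part of GX; just to illustrate the two-step shape):

```python
from collections import Counter

def duplicate_rows(rows, key):
    """Return the rows whose value in `key` occurs more than once.

    Equivalent in spirit to `NOT (count(1) OVER (PARTITION BY key) <= 1)`,
    but computed as "count first, filter second" instead of using a
    window function inside a WHERE clause.
    """
    counts = Counter(row[key] for row in rows)       # step 1: count per key
    return [row for row in rows if counts[row[key]] > 1]  # step 2: filter
```

In PySpark the same two-step shape avoids the error: attach the windowed count as a column with withColumn (where window functions are allowed) and then filter on that column, e.g. df.withColumn("n", F.count("*").over(Window.partitionBy("Navn"))).filter("n > 1").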

  2. The same applies to the expectation gxe.ExpectColumnValuesToNotBeNull(column="ID"),
    which gives the following index query: df.filter(F.expr(NOT (ID IS NOT NULL)))
    We have to add quotation marks manually, df.filter(F.expr("NOT (ID IS NOT NULL)")), to make it work properly.

It would be very nice to have the quotation marks ("") around the expression by default. That way we could filter out bad records and build many downstream flows.
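As a stopgap on the caller's side, the quotation marks can be added programmatically before the query is used. The helper below is purely hypothetical (not part of Great Expectations); it is a minimal sketch that assumes the query string always has the shape df.filter(F.expr(<expression>)) shown above:

```python
import re

def quote_unexpected_index_query(raw: str) -> str:
    """Wrap the bare SQL expression inside F.expr(...) in quotation marks.

    `raw` is the string GX returns, e.g.
        df.filter(F.expr(NOT (ID IS NOT NULL)))
    The expression is emitted without quotes, so it cannot be pasted or
    eval'd directly; this adds the quotes back.
    """
    match = re.match(r"^(?P<head>.*F\.expr\()(?P<expr>.*)(?P<tail>\)\))$", raw)
    if match is None:
        raise ValueError(f"unrecognized query format: {raw!r}")
    return f'{match.group("head")}"{match.group("expr")}"{match.group("tail")}'
```

With that in place, the string for the null check becomes df.filter(F.expr("NOT (ID IS NOT NULL)")), which can be evaluated against the DataFrame as-is.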

To Reproduce

from pyspark.sql import SparkSession
import great_expectations as gx
import great_expectations.expectations as gxe
import uuid                    # used below for unique configuration names
from datetime import datetime  # used below for the suite name

# Start a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create data and a DataFrame
data = [
    (2, "Brian", 25, "Aarhus"),
    (3, "Cecilie", 28, "Odense"),
    (4, "David", 22, "Aalborg"),  # Duplicate row
    (4, "David", 22, "Aalborg"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (6, "Thomas", 30, "København"),  # Duplicate name
    (5, "Thomas", 30, "København"),  # Duplicate name
]

dataframe = spark.createDataFrame(data, ["ID", "Navn", "Alder", "By"])

dataframe.show()

spark_config = str(uuid.uuid4()) 

# I know it is not meant to be used like this, but this is just to reproduce it for you
context = gx.get_context()
create_spark_configuration = context.data_sources.add_or_update_spark(name=spark_config).add_dataframe_asset(name=spark_config).add_batch_definition_whole_dataframe(spark_config)  # create the configurations
get_spark_configuration = context.data_sources.get(spark_config).get_asset(spark_config).get_batch_definition(spark_config)  # fetch the configurations
gx_suite = context.suites.add(gx.ExpectationSuite(name=datetime.now().strftime("%Y-%m-%d %H:%M:%S")))  # create a suite (a template) to hold the rules for our data
expectation = gx_suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="ID"))  # rule 1: ExpectColumnValuesToNotBeNull

expectation = gx_suite.add_expectation(gxe.ExpectColumnValuesToBeUnique(column="Navn"))  # rule 2: ExpectColumnValuesToBeUnique


validator = get_spark_configuration.get_batch(batch_parameters={"dataframe": dataframe}).validate(gx_suite, result_format={
        "result_format": "COMPLETE"
    }) 
validator

Expected behavior
I expect it to return valid, usable queries that can be executed immediately, without having to edit them manually.

Environment (please complete the following information):

  • Operating System: [e.g. Linux, MacOS, Windows]
  • Great Expectations Version: [e.g. 1.2.2]
  • Data Source: [e.g. Pandas, Snowflake]
  • Cloud environment: [e.g. Airflow, AWS, Azure, Databricks, GCP]

Additional context
This is the key to being able to remove bad records, and it would make the library much more than a data-testing and data-quality tool. I have seen from others that this is a wished-for feature.
Our current alternative is to generate the filter code dynamically at runtime for EACH expectation, which creates overhead. All the queries simply need quotation marks ("") around the expression and then they work.

This would be a big help and would make the library more enterprise-scale :-)

@adeola-ak (Contributor) commented:
Hi @niatos99, thank you for your detailed description. I'll share this with the team—please stay tuned for updates!

@adeola-ak added the feature-request label Nov 27, 2024