unexpected_index_query #10701

Open
niatos99 opened this issue Nov 24, 2024 · 1 comment
Labels
feature-request

Comments


niatos99 commented Nov 24, 2024

Describe the bug
I want to execute the unexpected_index_query so that I can find the broken records at runtime.

But the unexpected_index_query does not work as-is.

  1. The output for gxe.ExpectColumnValuesToBeUnique(column="Navn")
    appears to be broken because of the internal token "unspecifiedframe$".
    The validation result returns this query: df.filter(F.expr(NOT (count(1) OVER (PARTITION BY Navn unspecifiedframe$()) <= 1)))
    and even after editing it to dataframe.filter(F.expr("NOT (count(1) OVER (PARTITION BY Navn) <= 1)"))
    Spark raises "AnalysisException: It is not allowed to use window functions inside WHERE clause".

It would be nice to use it at runtime, so that we can find and quarantine bad records.
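For what it's worth, Spark's restriction here is real: window functions are not allowed in a WHERE clause, so even a cleaned-up version of the uniqueness query cannot run as a filter directly. The usual workaround is to compute the per-key count first and filter on it afterwards. As a minimal sketch of that logic in plain Python (not Spark, and not part of GX; just to illustrate the two-step shape):

```python
from collections import Counter

def duplicate_rows(rows, key):
    """Return the rows whose value in `key` occurs more than once.

    Equivalent in spirit to `NOT (count(1) OVER (PARTITION BY key) <= 1)`,
    but computed as "count first, filter second" instead of using a
    window function inside a WHERE clause.
    """
    counts = Counter(row[key] for row in rows)       # step 1: count per key
    return [row for row in rows if counts[row[key]] > 1]  # step 2: filter
```

In PySpark the same two-step shape avoids the error: attach the windowed count as a column with withColumn (where window functions are allowed) and then filter on that column, e.g. df.withColumn("n", F.count("*").over(Window.partitionBy("Navn"))).filter("n > 1").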

  2. The same applies to the expectation gxe.ExpectColumnValuesToNotBeNull(column="ID"),
    which gives the following index query: df.filter(F.expr(NOT (ID IS NOT NULL)))
    We have to add quotation marks manually, df.filter(F.expr("NOT (ID IS NOT NULL)")), to make it work properly.

It would be very nice to have the quotation marks ("") around the expression by default. That way we could filter out bad records and build many downstream flows.
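As a stopgap on the caller's side, the quotation marks can be added programmatically before the query is used. The helper below is purely hypothetical (not part of Great Expectations); it is a minimal sketch that assumes the query string always has the shape df.filter(F.expr(<expression>)) shown above:

```python
import re

def quote_unexpected_index_query(raw: str) -> str:
    """Wrap the bare SQL expression inside F.expr(...) in quotation marks.

    `raw` is the string GX returns, e.g.
        df.filter(F.expr(NOT (ID IS NOT NULL)))
    The expression is emitted without quotes, so it cannot be pasted or
    eval'd directly; this adds the quotes back.
    """
    match = re.match(r"^(?P<head>.*F\.expr\()(?P<expr>.*)(?P<tail>\)\))$", raw)
    if match is None:
        raise ValueError(f"unrecognized query format: {raw!r}")
    return f'{match.group("head")}"{match.group("expr")}"{match.group("tail")}'
```

With that in place, the string for the null check becomes df.filter(F.expr("NOT (ID IS NOT NULL)")), which can be evaluated against the DataFrame as-is.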

To Reproduce

from pyspark.sql import SparkSession
import great_expectations as gx
import great_expectations.expectations as gxe
import uuid                    # used below for unique configuration names
from datetime import datetime  # used below for the suite name

# Start a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create data and a DataFrame
data = [
    (2, "Brian", 25, "Aarhus"),
    (3, "Cecilie", 28, "Odense"),
    (4, "David", 22, "Aalborg"),  # Duplicate row
    (4, "David", 22, "Aalborg"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (1, "Anna", 30, "København"),  # Duplicate row
    (6, "Thomas", 30, "København"),  # Duplicate name
    (5, "Thomas", 30, "København"),  # Duplicate name
]

dataframe = spark.createDataFrame(data, ["ID", "Navn", "Alder", "By"])

dataframe.show()

spark_config = str(uuid.uuid4()) 

# I know it is not meant to be used like this, but this is just to reproduce it for you
context = gx.get_context()
create_spark_configuration = context.data_sources.add_or_update_spark(name=spark_config).add_dataframe_asset(name=spark_config).add_batch_definition_whole_dataframe(spark_config)  # create the configurations
get_spark_configuration = context.data_sources.get(spark_config).get_asset(spark_config).get_batch_definition(spark_config)  # fetch the configurations
gx_suite = context.suites.add(gx.ExpectationSuite(name=datetime.now().strftime("%Y-%m-%d %H:%M:%S")))  # create a suite (a template) to hold the rules for our data
expectation = gx_suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="ID"))  # rule 1: ExpectColumnValuesToNotBeNull

expectation = gx_suite.add_expectation(gxe.ExpectColumnValuesToBeUnique(column="Navn"))  # rule 2: ExpectColumnValuesToBeUnique


validator = get_spark_configuration.get_batch(batch_parameters={"dataframe": dataframe}).validate(gx_suite, result_format={
        "result_format": "COMPLETE"
    }) 
validator

Expected behavior
I expect it to return valid, usable queries that can be executed immediately, without having to edit them manually.

Environment (please complete the following information):

  • Operating System: [e.g. Linux, MacOS, Windows]
  • Great Expectations Version: [e.g. 1.2.2]
  • Data Source: [e.g. Pandas, Snowflake]
  • Cloud environment: [e.g. Airflow, AWS, Azure, Databricks, GCP]

Additional context
This is the key to being able to remove bad records, and it would make the library much more than a data-testing and data-quality tool. I have seen from others that this is a wished-for feature.
Our current alternative is to generate the filter code dynamically at runtime for EACH expectation, which creates overhead. All the queries simply need quotation marks ("") around the expression and then they work.

This would be a big help and would make the library more enterprise-scale :-)

@adeola-ak (Contributor) commented:
Hi @niatos99, thank you for your detailed description. I'll share this with the team—please stay tuned for updates!

@adeola-ak added the feature-request label Nov 27, 2024