Describe the bug
I want to execute the unexpected_index_query so that I can find the broken records at run time.
But the unexpected_index_query does not work as-is.
For gxe.ExpectColumnValuesToBeUnique(column="Navn"), the output appears to be broken by the "unspecifiedframe$" fragment.
The result format/output of the validation gives this query: df.filter(F.expr(NOT (count(1) OVER (PARTITION BY Navn unspecifiedframe$()) <= 1)))
and when I edit it to dataframe.filter(F.expr("NOT (count(1) OVER (PARTITION BY Navn) <= 1)")),
Spark raises "AnalysisException: It is not allowed to use window functions inside WHERE clause".
It would be nice to use this at run time, so that we can find and quarantine bad records; a workaround sketch follows below.
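For what it's worth, a plain-PySpark workaround (a minimal sketch, not the query GX emits) is to materialize the window count as a regular column and filter on that, since Spark only rejects window functions inside the WHERE clause:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("Navn")
bad_rows = (
    dataframe
    .withColumn("navn_count", F.count(F.lit(1)).over(w))  # window count as an ordinary column
    .filter("navn_count > 1")                              # same as NOT (count(1) OVER (...) <= 1)
    .drop("navn_count")
)
bad_rows.show()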
The same goes for the expectation gxe.ExpectColumnValuesToNotBeNull(column="ID"),
which gives the following unexpected_index_query: df.filter(F.expr(NOT (ID IS NOT NULL)))
We have to add quotation marks manually, df.filter(F.expr("NOT (ID IS NOT NULL)")), to make it work properly.
It would be very nice to have the quotation marks around the expression by default; that way we can filter out bad records and build many flows on top of it.
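With the quotation marks in place, the quarantine flow I am after is essentially just the following (a sketch, using the not-null query from above against the dataframe created in the reproduction below):
from pyspark.sql import functions as F

condition = F.expr("NOT (ID IS NOT NULL)")    # the emitted query, quoted by hand
bad_records = dataframe.filter(condition)     # rows that violate the expectation
good_records = dataframe.filter(~condition)   # rows that pass the expectation
bad_records.show()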
To Reproduce
import uuid
from datetime import datetime
from pyspark.sql import SparkSession
import great_expectations as gx
import great_expectations.expectations as gxe

# Start a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create data and a DataFrame
data = [
    (2, "Brian", 25, "Aarhus"),
    (3, "Cecilie", 28, "Odense"),
    (4, "David", 22, "Aalborg"),  # duplicate row
    (4, "David", 22, "Aalborg"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (1, "Anna", 30, "København"),  # duplicate row
    (6, "Thomas", 30, "København"),  # duplicate on Navn only
    (5, "Thomas", 30, "København"),  # duplicate on Navn only
]
dataframe = spark.createDataFrame(data, ["ID", "Navn", "Alder", "By"])
dataframe.show()

spark_config = str(uuid.uuid4())

# I know it is not meant to be used like this, but this is just to reproduce it for you
context = gx.get_context()
create_spark_configuration = context.data_sources.add_or_update_spark(name=spark_config).add_dataframe_asset(name=spark_config).add_batch_definition_whole_dataframe(spark_config)  # creates the configurations
get_spark_configuration = context.data_sources.get(spark_config).get_asset(spark_config).get_batch_definition(spark_config)  # fetches the configurations

gx_suite = context.suites.add(gx.ExpectationSuite(name=datetime.now().strftime("%Y-%m-%d %H:%M:%S")))  # creates a TEMPLATE so we can add our rules for our data
expectation = gx_suite.add_expectation(gxe.ExpectColumnValuesToNotBeNull(column="ID"))  # rule 1: ExpectColumnValuesToNotBeNull
expectation = gx_suite.add_expectation(gxe.ExpectColumnValuesToBeUnique(column="Navn"))  # rule 3: ExpectColumnValuesToBeUnique
validator = get_spark_configuration.get_batch(batch_parameters={"dataframe": dataframe}).validate(gx_suite, result_format={
    "result_format": "COMPLETE"
})
validator
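For reference, this is how I read the generated queries back out of the validation result (a sketch; it assumes each expectation result dict exposes unexpected_index_query under the COMPLETE result format, as quoted above):
for expectation_result in validator.results:
    # each result dict carries the generated Spark filter for the failing rows
    print(expectation_result.result.get("unexpected_index_query"))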
Expected behavior
I expect it to return valid, usable queries that can be executed immediately, without having to be edited manually.
Environment (please complete the following information):
Additional context
This is the key to being able to remove bad records, and it would make the library much more than a data testing and data quality tool. I have seen from others that this is a wished-for feature.
Our alternative right now is to generate the code dynamically at run time for EACH expectation, which creates overhead. All the queries simply need quotation marks around the expression and they work.
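Concretely, that dynamic workaround looks something like this (a sketch; extract_expression is a hypothetical helper of ours, and it assumes the emitted query always has the df.filter(F.expr(...)) shape shown above):
def extract_expression(index_query: str) -> str:
    # strip the "df.filter(F.expr(" wrapper so only the bare SQL expression remains
    prefix, suffix = "df.filter(F.expr(", "))"
    if index_query.startswith(prefix) and index_query.endswith(suffix):
        return index_query[len(prefix):-len(suffix)]
    return index_query

bad_records = dataframe.filter(extract_expression("df.filter(F.expr(NOT (ID IS NOT NULL)))"))
Note that the window-function query would still need the rewrite shown earlier, since Spark rejects window functions inside WHERE even when quoted.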
This would be a big help and would make it more suitable for enterprise scale :-)