Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates #11261
Conversation
This issue is still present in Spark 3.5, and fixing it would really be a big capability to have for data that is all in structured format.
Same here: we have a big dataset and almost all of the data is in nested structs. Now we need to delete data based on nested struct values, and this issue is blocking us.
I think this change is logically sound, thank you @zhongyujiang!
I'll want to double-check why this wasn't done originally, though; that's a bit intriguing to me. I'll take a look with fresh eyes tomorrow.
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java (review thread resolved)
Overall @zhongyujiang I think the change is good, just some cleanup in tests would be great before we get this in!
api/src/test/java/org/apache/iceberg/expressions/TestStrictMetricsEvaluator.java (outdated review thread, resolved)
spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java (review thread resolved)
@amogh-jahagirdar Thanks for reviewing, tests updated.
Thanks @zhongyujiang, from my side the changes look good. I'll give some time for others to review before merging
LGTM, thanks for fixing this @zhongyujiang
@zhongyujiang @nastra @amogh-jahagirdar great to see this merged! Is the timeline for the Iceberg 1.6.2 release known, and will this be for Spark 3.3.x or just Spark 3.5.x?
@blakewhatley82 I am not clear on the timeline for the 1.6.2 release. This fix is effective for all Spark versions.
The community is planning a 1.7.0 release at the end of October, so this will be shipped with that.
Currently, the StrictMetricsEvaluator fails when evaluating expressions with nested columns, causing Spark's DELETE FROM statement to throw an exception if the WHERE condition uses nested columns as predicates.
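To make the failure mode concrete, here is a hypothetical Spark job that would hit it before this change. The table db.customers and the nested column address.zip are invented for illustration, and the session is assumed to already be configured with an Iceberg catalog.

```java
import org.apache.spark.sql.SparkSession;

public class NestedPredicateDeleteExample {
  public static void main(String[] args) {
    // Assumes the usual spark.sql.catalog.* settings for an Iceberg catalog
    // are supplied via Spark configuration; omitted here for brevity.
    SparkSession spark = SparkSession.builder()
        .appName("nested-predicate-delete")
        .getOrCreate();

    // Hypothetical table and nested column. Before this change, planning this
    // DELETE could fail because StrictMetricsEvaluator threw on the nested
    // predicate (address.zip) instead of returning a conservative answer.
    spark.sql("DELETE FROM db.customers WHERE address.zip = '94105'");

    spark.stop();
  }
}
```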
The StrictMetricsEvaluator requires null count data for columns during evaluation, but the null counts collected for nested columns in the current metadata might be incorrect (see #8611). Therefore, the StrictMetricsEvaluator cannot support evaluating filters on nested columns. However, I think we can at least return ROWS_MIGHT_NOT_MATCH when encountering filters on nested columns instead of directly throwing exceptions and causing job failures. We are already doing this in the evaluation of startsWith:
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java, lines 464 to 467 in 5922b4b
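For illustration, below is a minimal, hypothetical sketch of that pattern, not the actual code at the linked lines (which lives inside StrictMetricsEvaluator's internal visitor). The class name NestedAwareVisitor and its constant are made up; the point is the fallback: when the referenced field id does not resolve to a top-level column of the schema struct, the handler returns the conservative ROWS_MIGHT_NOT_MATCH instead of throwing.

```java
import org.apache.iceberg.expressions.BoundReference;
import org.apache.iceberg.expressions.ExpressionVisitors;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

// Illustrative only: a stripped-down visitor showing the conservative fallback
// for nested-column predicates; not Iceberg's actual StrictMetricsEvaluator.
class NestedAwareVisitor extends ExpressionVisitors.BoundExpressionVisitor<Boolean> {
  private static final Boolean ROWS_MIGHT_NOT_MATCH = false;

  private final Types.StructType struct;

  NestedAwareVisitor(Types.StructType struct) {
    this.struct = struct;
  }

  @Override
  public <T> Boolean startsWith(BoundReference<T> ref, Literal<T> lit) {
    Types.NestedField field = struct.field(ref.fieldId());
    if (field == null) {
      // The reference points to a nested column; null counts for nested
      // columns may be unreliable (#8611), so metrics cannot prove that
      // every row matches. Bail out conservatively instead of throwing.
      return ROWS_MIGHT_NOT_MATCH;
    }
    // ... real evaluation against lower/upper bounds and null counts ...
    return ROWS_MIGHT_NOT_MATCH;
  }
}
```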
This fixes #7065.
@aokolnychyi @szehon-ho @RussellSpitzer @Fokko can you please take a look when you have time? Thanks.