Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates #11261

zhongyujiang · 2024-10-05T08:38:59Z

Currently, the StrictMetricsEvaluator fails when evaluating expressions with nested columns, causing Spark's DELETE FROM statement to throw an exception if the WHERE condition uses nested columns as predicates.

The StrictMetricsEvaluator requires null count data for columns during evaluation, but the null count data for nested columns collected in the current metadata might be incorrect (see #8611). Therefore, the StrictMetricsEvaluator cannot support the evaluation of filters on nested columns.

However, I think we can at least return ROWS_MIGHT_NOT_MATCH when encountering filters with nested columns, instead of throwing an exception and failing the job. We are already doing this in the evaluation of startsWith:

iceberg/api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java

Lines 464 to 467 in 5922b4b

    public <T> Boolean startsWith(BoundReference<T> ref, Literal<T> lit) {
      return ROWS_MIGHT_NOT_MATCH;
    }
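
To make the proposed fallback concrete, here is a minimal, self-contained sketch of the idea. It is a toy model, not the actual StrictMetricsEvaluator code: the class `StrictEvalSketch`, its constructor, and the `isNestedColumn` helper are hypothetical names chosen for illustration, and "nested" is assumed to mean "field id is not a top-level column". Only the constant name ROWS_MIGHT_NOT_MATCH comes from the snippet above.

```java
import java.util.Map;
import java.util.Set;

/** Toy model of a strict metrics check. Hypothetical; not Iceberg's implementation. */
public class StrictEvalSketch {
  // Strict semantics: true only if every row in the file is guaranteed to match.
  public static final boolean ROWS_MUST_MATCH = true;
  public static final boolean ROWS_MIGHT_NOT_MATCH = false;

  private final Set<Integer> topLevelFieldIds; // assumption: ids of top-level columns
  private final Map<Integer, Long> valueCounts; // per-field value counts from file metadata
  private final Map<Integer, Long> nullCounts; // per-field null counts from file metadata

  public StrictEvalSketch(
      Set<Integer> topLevelFieldIds,
      Map<Integer, Long> valueCounts,
      Map<Integer, Long> nullCounts) {
    this.topLevelFieldIds = topLevelFieldIds;
    this.valueCounts = valueCounts;
    this.nullCounts = nullCounts;
  }

  /** Assumption for this sketch: any field that is not top-level is nested. */
  public boolean isNestedColumn(int fieldId) {
    return !topLevelFieldIds.contains(fieldId);
  }

  /** Strict evaluation of "column IS NULL" for one data file. */
  public boolean isNull(int fieldId) {
    // Metrics for nested columns may be unreliable, so give the conservative
    // answer instead of throwing an exception.
    if (isNestedColumn(fieldId)) {
      return ROWS_MIGHT_NOT_MATCH;
    }

    Long values = valueCounts.get(fieldId);
    Long nulls = nullCounts.get(fieldId);
    // Every value is null only if the null count equals the value count.
    if (values != null && nulls != null && values.equals(nulls)) {
      return ROWS_MUST_MATCH;
    }

    return ROWS_MIGHT_NOT_MATCH;
  }

  public static void main(String[] args) {
    // Field 1 is top-level, field 2 is nested; both report 10 values, 10 nulls.
    Map<Integer, Long> counts = Map.of(1, 10L, 2, 10L);
    StrictEvalSketch eval = new StrictEvalSketch(Set.of(1), counts, counts);

    System.out.println(eval.isNull(1)); // true: top-level column, all values null
    System.out.println(eval.isNull(2)); // false: nested column, conservative fallback
  }
}
```

The conservative answer is always safe under strict semantics: it only means the file cannot be proven to match entirely, so a caller can presumably fall back to a slower path rather than fail.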

This fixes #7065.

@aokolnychyi @szehon-ho @RussellSpitzer @Fokko can you please take a look when you have time? Thanks.

blakewhatley82 · 2024-10-07T15:02:09Z

This issue is still present in Spark 3.5, and fixing it would be a big capability to have for data that is all in structured format.

Leonti · 2024-10-08T02:56:55Z

Same here, we have a big dataset and almost all of the data is in nested structs. Now we need to delete data based on the nested struct value and this issue is blocking us.
Would love to see this merged!

amogh-jahagirdar

I think this change is logically sound, thank you @zhongyujiang!
I do want to double-check why this wasn't done originally, though; that's a bit intriguing to me. I'll take a look with fresh eyes tomorrow.

api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java

amogh-jahagirdar

Overall @zhongyujiang I think the change is good, just some cleanup in tests would be great before we get this in!

api/src/test/java/org/apache/iceberg/expressions/TestStrictMetricsEvaluator.java

spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java

zhongyujiang · 2024-10-13T10:42:30Z

@amogh-jahagirdar Thanks for reviewing, tests updated.

amogh-jahagirdar

Thanks @zhongyujiang, from my side the changes look good. I'll give some time for others to review before merging

nastra

LGTM, thanks for fixing this @zhongyujiang

blakewhatley82 · 2024-10-14T08:50:52Z

@zhongyujiang @nastra @amogh-jahagirdar great to see this merged! Is the timeline for the Iceberg 1.6.2 release known, and will this be for Spark 3.3.x or just Spark 3.5.x?

zhongyujiang · 2024-10-14T09:55:53Z

@blakewhatley82 I am not clear on the timeline for the 1.6.2 release. This fix is effective for all Spark versions.

nastra · 2024-10-14T10:09:54Z

The community is planning a 1.7.0 release at the end of October, so this will ship with that.


github-actions bot added API spark labels Oct 5, 2024

zhongyujiang mentioned this pull request Oct 5, 2024

Core, Spark: Fix delete with filter on nested columns #7132

Closed

Make StrictMetricsEvaluator not fail on nested column predicates.

4864458

zhongyujiang force-pushed the gh/strict-metric-evaluator-nested-col branch from 5922b4b to 4864458 Compare October 5, 2024 08:49

amogh-jahagirdar self-requested a review October 11, 2024 04:55

amogh-jahagirdar reviewed Oct 11, 2024


api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java

amogh-jahagirdar reviewed Oct 12, 2024


api/src/test/java/org/apache/iceberg/expressions/TestStrictMetricsEvaluator.java (outdated)

spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java

Review comments.

720032c

amogh-jahagirdar approved these changes Oct 13, 2024


amogh-jahagirdar requested a review from nastra October 13, 2024 15:02

nastra approved these changes Oct 14, 2024


nastra merged commit ca8a3a4 into apache:main Oct 14, 2024
50 checks passed

zhongyujiang deleted the gh/strict-metric-evaluator-nested-col branch October 14, 2024 07:16

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024

API, Spark: Make StrictMetricsEvaluator not fail on nested column pre…

0d34601

…dicates (apache#11261)
