Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Api, Spark: Make StrictMetricsEvaluator not fail on nested column predicates #11261

Merged

Conversation

zhongyujiang
Copy link
Contributor

Currently, the StrictMetricsEvaluator fails when evaluating expressions with nested columns, causing Spark's DELETE FROM statement to throw an exception if the WHERE condition uses nested columns as predicates.

The StrictMetricsEvaluator requires null count data for columns during evaluation, but the null count data for nested columns collected in the current metadata might be incorrect (see #8611). Therefore, the StrictMetricsEvaluator cannot support the evaluation of filters on nested columns.

However, I think we can at least return ROWS_MIGHT_NOT_MATCH when encountering filters with nested columns instead of directly throwing exceptions and causing job failures. We are currently alreadt doing this in the evaluation of startsWith:

public <T> Boolean startsWith(BoundReference<T> ref, Literal<T> lit) {
return ROWS_MIGHT_NOT_MATCH;
}

This fixes #7065.

@aokolnychyi @szehon-ho @RussellSpitzer @Fokko can you please take a look when you have time? Thanks.

@zhongyujiang zhongyujiang force-pushed the gh/strict-metric-evaluator-nested-col branch from 5922b4b to 4864458 Compare October 5, 2024 08:49
@blakewhatley82
Copy link

This issue is still around in spark 3.5 and would really be a big capability to have for data that is all in structured format

@Leonti
Copy link

Leonti commented Oct 8, 2024

Same here, we have a big dataset and almost all of the data is in nested structs. Now we need to delete data based on the nested struct value and this issue is blocking us.
Would love to see this merged!

@amogh-jahagirdar amogh-jahagirdar self-requested a review October 11, 2024 04:55
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is logically sound, thank you @zhongyujiang!
I'll want to double check though why this wasn't done originally, that's a bit intriguing to me. I'll take a look with fresh eyes tomorrow.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall @zhongyujiang I think the change is good, just some cleanup in tests would be great before we get this in!

@zhongyujiang
Copy link
Contributor Author

@amogh-jahagirdar Thanks for reviewing, tests updated.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zhongyujiang, from my side the changes look good. I'll give some time for others to review before merging

Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this @zhongyujiang

@nastra nastra merged commit ca8a3a4 into apache:main Oct 14, 2024
50 checks passed
@zhongyujiang zhongyujiang deleted the gh/strict-metric-evaluator-nested-col branch October 14, 2024 07:16
@blakewhatley82
Copy link

blakewhatley82 commented Oct 14, 2024

@zhongyujiang @nastra @amogh-jahagirdar great to see this merged! Is it known the timeline for 1.6.2 iceberg release and will this be for spark 3.3x or just spark 3.5x?

@zhongyujiang
Copy link
Contributor Author

@blakewhatley82 I am not clear on the timeline for the 1.6.2 release. This fix is effective for all Spark versions.

@nastra
Copy link
Contributor

nastra commented Oct 14, 2024

The community is planning a 1.7.0 release end of October, so this will be shipped with that

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot delete column with nested field filter
5 participants