Core, Spark: Fix delete with filter on nested columns #7132
Conversation
sql("INSERT INTO TABLE %s VALUES (1, named_struct(\"c1\", 3, \"c2\", \"v1\"))", tableName); | ||
sql("INSERT INTO TABLE %s VALUES (2, named_struct(\"c1\", 2, \"c2\", \"v2\"))", tableName); | ||
|
||
sql("DELETE FROM %s WHERE complex.c1 = 3", tableName); |
Delete conditions in testDeleteWithConditionOnNestedColumn cannot be pushed down, so I added this UT to cover the corresponding scenario.
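For context, here is a hedged sketch of what this UT could look like end to end, assuming the usual Iceberg Spark test helpers (sql, row, and the list-based assertEquals) and a table shaped like (id INT, complex STRUCT&lt;c1: INT, c2: STRING&gt;); the assertion shape is an assumption, not necessarily the PR's exact test:

```java
@Test
public void testDeleteWithConditionOnNestedColumn() {
  // assumed schema: (id INT, complex STRUCT<c1: INT, c2: STRING>)
  sql("INSERT INTO TABLE %s VALUES (1, named_struct(\"c1\", 3, \"c2\", \"v1\"))", tableName);
  sql("INSERT INTO TABLE %s VALUES (2, named_struct(\"c1\", 2, \"c2\", \"v2\"))", tableName);

  sql("DELETE FROM %s WHERE complex.c1 = 3", tableName);

  // only the row whose nested c1 equals 3 should be gone
  assertEquals(
      "Should delete rows by a predicate on a nested column",
      ImmutableList.of(row(2)),
      sql("SELECT id FROM %s ORDER BY id", tableName));
}
```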
@aokolnychyi @rdblue can you help review this?
while (parent != null) {
  Type type = schema.findType(parent);
  // fields nested under a list or map have no usable per-leaf metrics,
  // so strict (whole-file) evaluation is not possible for them
  if (type.isListType() || type.isMapType()) {
    evaluable = false;
So you are skipping if it's a list/map type, but allowing struct, if I understand correctly? It makes sense to me, as I believe we do have nested column stats, but I'd definitely like @rdblue @RussellSpitzer @aokolnychyi to do a sanity check here on the overall direction.
Yes, your understanding is correct.
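For readers skimming the thread, here is a minimal self-contained sketch of the rule being agreed on: strict evaluation is allowed only when every ancestor of the referenced column is a struct. The class and helper names and the dotted-path walk are illustrative assumptions, not the PR's exact code:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Type;

public class NestedColumnCheck {
  // Illustrative helper: returns false as soon as any ancestor of the
  // column is a list or map, since metrics for such leaves cannot prove
  // that every row in a file matches a predicate.
  static boolean isStrictlyEvaluable(Schema schema, String column) {
    int lastDot = column.lastIndexOf('.');
    String parent = lastDot < 0 ? null : column.substring(0, lastDot);
    while (parent != null) {
      Type type = schema.findType(parent);
      if (type.isListType() || type.isMapType()) {
        return false; // nested under a list or map: skip strict evaluation
      }
      lastDot = parent.lastIndexOf('.');
      parent = lastDot < 0 ? null : parent.substring(0, lastDot);
    }
    return true; // top-level column or a pure chain of struct ancestors
  }
}
```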
would love to see it merged
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This issue is still around in Spark 3.5, and fixing it would be a big capability to have for data that is all in structured format.
Agreed. Can this be revived, @szehon-ho? Are you able to re-open it, @zhongyujiang?
@blakewhatley82 @mdub I think this fix is incorrect because the null count data of nested columns in metadata might be incorrect for now, see #8611. I am not able to reopen this; I've created a new PR #11261 with a different approach to address this issue.
Fixed by #11261. |
Fixes #7065.
This fixes Spark DELETE operations that use a filter on nested columns. Currently such operations fail because Spark calls canDeleteUsingMetadata, which uses StrictMetricsEvaluator to evaluate whether a file can be deleted in its entirety; however, StrictMetricsEvaluator does not yet support evaluation on nested columns, so an NPE is thrown, see #7065.

This updates StrictMetricsEvaluator to support evaluation on nested columns (only for columns nested in a chain of struct fields; it returns ROWS_MIGHT_NOT_MATCH if a column is nested in a map or list field), which solves the problem.
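To make the described flow concrete, here is a hedged sketch of how a metadata-only delete decision could use StrictMetricsEvaluator; table and dataFilesMatching are stand-ins (assumptions) for the table instance and a scan that lists the files matching the filter:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Schema;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.expressions.StrictMetricsEvaluator;

// A file can be dropped purely from metadata only if the strict evaluator
// proves that ALL of its rows match the delete filter. With this change,
// a filter on a struct-nested column such as complex.c1 is evaluated
// instead of throwing an NPE; columns nested under a map or list yield
// ROWS_MIGHT_NOT_MATCH, so Spark falls back to rewriting the file.
Schema schema = table.schema(); // 'table' assumed to be a loaded Table
Expression filter = Expressions.equal("complex.c1", 3);
StrictMetricsEvaluator strict = new StrictMetricsEvaluator(schema, filter);

boolean canDeleteUsingMetadataOnly = true;
for (DataFile file : dataFilesMatching(filter)) { // hypothetical helper
  if (!strict.eval(file)) { // false: some rows might not match
    canDeleteUsingMetadataOnly = false;
    break;
  }
}
```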