Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core, Spark: Fix delete with filter on nested columns #7132

Closed

Conversation

zhongyujiang
Copy link
Contributor

@zhongyujiang zhongyujiang commented Mar 17, 2023

Fixes #7065.

This fixes Spark delete data when using a filter on nested columns. Now such operations will fail because Spark calls canDeleteUsingMetadata which uses StrictMetricsEvaluator to evaluate whether a file should be completely deleted, however StrictMetricsEvaluator doesn't support evaluate on nested columns now, and a NPE will be thrown out, see #7065.

This updates StrictMetricsEvaluator to support evaluation on nested columns(only for columns nested in a chain of Struct fileds, will return ROWS_MIGHT_NOT_MATCH if columns are nested in Map or List fields), which solve this problem.

sql("INSERT INTO TABLE %s VALUES (1, named_struct(\"c1\", 3, \"c2\", \"v1\"))", tableName);
sql("INSERT INTO TABLE %s VALUES (2, named_struct(\"c1\", 2, \"c2\", \"v2\"))", tableName);

sql("DELETE FROM %s WHERE complex.c1 = 3", tableName);
Copy link
Contributor Author

@zhongyujiang zhongyujiang Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete conditions in testDeleteWithConditionOnNestedColumn can not be push down, so added this UT to cover the corresponding scenario.

@zhongyujiang
Copy link
Contributor Author

@aokolnychyi @rdblue can you help review this?

while (parent != null) {
Type type = schema.findType(parent);
if (type.isListType() || type.isMapType()) {
evaluable = false;
Copy link
Collaborator

@szehon-ho szehon-ho Mar 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you are skipping if list/map type, but allowing struct, if I understand? I think it makes sense to me, as I feel we have nested column stats. but definitely like @rdblue @RussellSpitzer @aokolnychyi to have a sanity check here on the overall direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your understanding is correct.

@bluzy
Copy link

bluzy commented Dec 28, 2023

@eshishki
Copy link

eshishki commented Jul 7, 2024

would love to see it merged

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 28, 2024
Copy link

github-actions bot commented Sep 5, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Sep 5, 2024
@blakewhatley82
Copy link

This issue is still around in spark 3.5 and would really be a big capability to have for data that is all in structured format

@mdub
Copy link

mdub commented Oct 3, 2024

Agreed. Can this be revived, @szehon-ho? Are you able to re-open it, @zhongyujiang?

@zhongyujiang
Copy link
Contributor Author

@blakewhatley82 @mdub I think this fix is incorrect because the null count data of nested columns in metadata might be incorrect for now, see #8611. I am not able to reopen this, I've created a new PR #11261 with a different approach to address this issue.

@zhongyujiang
Copy link
Contributor Author

Fixed by #11261.

@zhongyujiang zhongyujiang deleted the fix_delete_using_nested_column branch December 18, 2024 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot delete column with nested field filter
6 participants