Core, Spark: Fix delete with filter on nested columns #7132

zhongyujiang · 2023-03-17T15:58:13Z

This fixes Spark DELETE when the filter references a nested column. Such operations currently fail: Spark calls canDeleteUsingMetadata, which uses StrictMetricsEvaluator to decide whether a file can be deleted in its entirety. StrictMetricsEvaluator does not yet support evaluating nested columns, so an NPE is thrown; see #7065.

This updates StrictMetricsEvaluator to support evaluation on nested columns, which solves the problem. Only columns nested in a chain of struct fields are supported; the evaluator returns ROWS_MIGHT_NOT_MATCH if a column is nested in a map or list field.
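The struct-only rule can be sketched as a walk up the parent chain of a field. This is a minimal, self-contained illustration and not the code in this PR: `NestedFieldCheck`, its `Kind` enum, and the field ids are hypothetical stand-ins for an Iceberg schema, used only to show when a nested column would be treated as evaluable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a schema: field id -> kind, and field id -> parent field id.
class NestedFieldCheck {
  enum Kind { STRUCT, LIST, MAP, PRIMITIVE }

  private final Map<Integer, Kind> kinds = new HashMap<>();
  private final Map<Integer, Integer> parents = new HashMap<>();

  // Registers a field; parentId is null for top-level fields.
  public NestedFieldCheck add(int id, Kind kind, Integer parentId) {
    kinds.put(id, kind);
    if (parentId != null) {
      parents.put(id, parentId);
    }
    return this;
  }

  // Returns true only if every ancestor of the field is a struct.
  // A list or map anywhere in the chain makes the field non-evaluable.
  public boolean isEvaluable(int fieldId) {
    Integer parent = parents.get(fieldId);
    while (parent != null) {
      Kind kind = kinds.get(parent);
      if (kind == Kind.LIST || kind == Kind.MAP) {
        return false;
      }
      parent = parents.get(parent);
    }
    return true;
  }

  // Example layout (assumed for illustration):
  //   1: id, 2: complex (struct), 3: complex.c1,
  //   4: tags (list), 5: tags.element (struct), 6: tags.element.name
  public static NestedFieldCheck example() {
    return new NestedFieldCheck()
        .add(1, Kind.PRIMITIVE, null)
        .add(2, Kind.STRUCT, null)
        .add(3, Kind.PRIMITIVE, 2)
        .add(4, Kind.LIST, null)
        .add(5, Kind.STRUCT, 4)
        .add(6, Kind.PRIMITIVE, 5);
  }

  public static void main(String[] args) {
    NestedFieldCheck schema = example();
    System.out.println("complex.c1 evaluable: " + schema.isEvaluable(3));
    System.out.println("tags.element.name evaluable: " + schema.isEvaluable(6));
  }
}
```

In this sketch a non-evaluable field would correspond to the evaluator answering ROWS_MIGHT_NOT_MATCH, so the file is never dropped purely on metadata.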

zhongyujiang · 2023-03-17T16:02:20Z

spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestDelete.java

+    sql("INSERT INTO TABLE %s VALUES (1, named_struct(\"c1\", 3, \"c2\", \"v1\"))", tableName);
+    sql("INSERT INTO TABLE %s VALUES (2, named_struct(\"c1\", 2, \"c2\", \"v2\"))", tableName);
+
+    sql("DELETE FROM %s WHERE complex.c1 = 3", tableName);


The delete conditions in testDeleteWithConditionOnNestedColumn cannot be pushed down, so I added this UT to cover the corresponding scenario.

zhongyujiang · 2023-03-17T16:03:07Z

@aokolnychyi @rdblue can you help review this?

szehon-ho · 2023-03-23T18:24:48Z

api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java

+        while (parent != null) {
+          Type type = schema.findType(parent);
+          if (type.isListType() || type.isMapType()) {
+            evaluable = false;


So you are skipping list/map types but allowing structs, if I understand correctly? That makes sense to me, since I believe we have nested column stats. But I would definitely like @rdblue @RussellSpitzer @aokolnychyi to do a sanity check on the overall direction.

Yes, your understanding is correct.

bluzy · 2023-12-28T02:20:27Z

PTAL @rdblue @RussellSpitzer @aokolnychyi @szehon-ho

eshishki · 2024-07-07T12:45:48Z

would love to see it merged

github-actions · 2024-08-28T00:13:56Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions · 2024-09-05T00:13:41Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

blakewhatley82 · 2024-09-23T08:53:57Z

This issue is still present in Spark 3.5. Fixing it would be a really big capability for data that is entirely in structured format.

mdub · 2024-10-03T03:09:23Z

Agreed. Can this be revived, @szehon-ho? Are you able to re-open it, @zhongyujiang?

zhongyujiang · 2024-10-05T08:45:36Z

@blakewhatley82 @mdub I think this fix is incorrect because the null count data for nested columns in metadata may currently be incorrect; see #8611. I am not able to reopen this, so I've created a new PR, #11261, which takes a different approach to address the issue.
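To illustrate why an unreliable null count matters for a metadata-only delete, here is a simplified model of a strict equality check. It is a sketch under stated assumptions, not Iceberg's implementation and not a reproduction of #8611: `StrictEqSketch` and `ColumnStats` are hypothetical names, and the stats are reduced to integer bounds plus a null count.

```java
// Simplified model: a file may be dropped by metadata alone only if
// every row is guaranteed to match the predicate `column = literal`.
class StrictEqSketch {

  // Hypothetical per-file stats for one integer column.
  static final class ColumnStats {
    final int lower;
    final int upper;
    final long nullCount;

    ColumnStats(int lower, int upper, long nullCount) {
      this.lower = lower;
      this.upper = upper;
      this.nullCount = nullCount;
    }
  }

  // True only when the stats prove that all rows equal the literal:
  // no nulls, and both bounds equal to the literal.
  public static boolean allRowsMustEqual(ColumnStats stats, int literal) {
    if (stats.nullCount > 0) {
      return false; // a null row never satisfies `column = literal`
    }
    return stats.lower == literal && stats.upper == literal;
  }

  public static void main(String[] args) {
    // Assume the file really holds the values {3, null} for complex.c1.
    ColumnStats correct = new ColumnStats(3, 3, 1);
    ColumnStats wrongNullCount = new ColumnStats(3, 3, 0);

    // With correct stats the file is kept for row-level handling.
    System.out.println(allRowsMustEqual(correct, 3));
    // With a wrong null count the whole file would be dropped,
    // including the row where complex.c1 is null.
    System.out.println(allRowsMustEqual(wrongNullCount, 3));
  }
}
```

Under this model, a null count that is too low turns "some rows match" into "all rows match", so `DELETE ... WHERE complex.c1 = 3` would also remove rows it should have kept.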

zhongyujiang · 2024-10-14T07:18:35Z

Fixed by #11261.

Core, Spark: Fix delete with filter on nested columns.

08f5455

github-actions bot added API spark labels Mar 17, 2023


github-actions bot added the stale label Aug 28, 2024

github-actions bot closed this Sep 5, 2024

zhongyujiang deleted the fix_delete_using_nested_column branch December 18, 2024 09:07
