Iceberg delete files are read multiple times during query processing, causing delays #6527
Conversation
```java
@Override
@SuppressWarnings("CollectionUndefinedEquality")
protected boolean shouldKeep(T posDelete) {
  return dataLocation.contains(FILENAME_ACCESSOR.get(posDelete));
}
```
Why does this remove the optimization above?
restored
```java
 *
 * @return an {@link ExecutorService} that uses delete worker pool
 */
public static ExecutorService getDeleteWorkerPool() {
```
Can you remove the changes that overlap with #6432?
removed; however, #6432 looks abandoned
```java
  roaring64Bitmap.or(((BitmapPositionDeleteIndex) deleteIndex).roaring64Bitmap);
  return this;
}
throw new IllegalArgumentException();
```
This should not throw an exception with no context. Please use Preconditions and produce a helpful error message.
fixed
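A minimal sketch of the suggested fix, assuming Guava's Preconditions as used throughout Iceberg (the method shape and message wording are illustrative, not the PR's final code):

```java
import com.google.common.base.Preconditions;

// Illustrative merge method: fail fast with a descriptive message instead of a
// bare IllegalArgumentException when the other index is not bitmap-backed.
public PositionDeleteIndex merge(PositionDeleteIndex deleteIndex) {
  Preconditions.checkArgument(
      deleteIndex instanceof BitmapPositionDeleteIndex,
      "Cannot merge delete index: expected BitmapPositionDeleteIndex, got %s",
      deleteIndex.getClass().getName());

  roaring64Bitmap.or(((BitmapPositionDeleteIndex) deleteIndex).roaring64Bitmap);
  return this;
}
```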
```java
deleteFiles,
deletes ->
    CloseableIterable.transform(
        locationFilter.filter(deletes),
```
Changing the filter has removed the need for having one in the first place. Instead, I think this should use CloseableIterable.filter directly.
fixed
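A minimal sketch of the suggested shape, assuming dataLocation is the set of data file paths for the task, as in the snippets above (illustrative, not the merged code):

```java
// Filter position deletes for the current data files inline, without a separate
// filter object; CloseableIterable.filter keeps the result closeable.
CloseableIterable<Record> filtered =
    CloseableIterable.filter(
        deletes, delete -> dataLocation.contains(FILENAME_ACCESSOR.get(delete)));
```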
```java
CloseableIterable.transform(
    locationFilter.filter(deletes),
    row ->
        Pair.of(
```
There's no need to convert to Pair only to consume those pairs in the same function. Just use the accessors below.
fixed
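A minimal sketch of the suggestion, assuming the FILENAME_ACCESSOR and POSITION_ACCESSOR seen in the surrounding code (illustrative only):

```java
// Read the path and position through the accessors at the point of use,
// instead of materializing an intermediate Pair per row.
CharSequence path = (CharSequence) FILENAME_ACCESSOR.get(row);
long position = (Long) POSITION_ACCESSOR.get(row);

positionDeleteIndex
    .computeIfAbsent(path, p -> new BitmapPositionDeleteIndex())
    .delete(position);
```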
```java
deletes.forEach(
    entry ->
        positionDeleteIndex
            .computeIfAbsent(entry.first(), f -> new BitmapPositionDeleteIndex())
```
Instead of using computeIfAbsent on every row, this should pre-populate the map using dataLocations, since those are all known.
I don't see any issue here, since nothing extra is done when the key already exists:

```java
default V computeIfAbsent(K key, Function<? super K, ? extends V> mappingFunction) {
    Objects.requireNonNull(mappingFunction);
    V v;
    if ((v = get(key)) == null) {
        V newValue;
        if ((newValue = mappingFunction.apply(key)) != null) {
            put(key, newValue);
            return newValue;
        }
    }
    return v;
}
```
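For comparison, a minimal sketch of the reviewer's pre-population idea, assuming dataLocations is the known set of data file paths (names are illustrative):

```java
// Pre-populate the index map once, so the per-row loop is a plain get().
Map<CharSequence, PositionDeleteIndex> positionDeleteIndex =
    Maps.newHashMapWithExpectedSize(dataLocations.size());
for (CharSequence location : dataLocations) {
  positionDeleteIndex.put(location, new BitmapPositionDeleteIndex());
}

deletes.forEach(
    entry -> positionDeleteIndex.get(entry.first()).delete(entry.second()));
```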
```diff
@@ -266,13 +291,23 @@ private CloseableIterable<T> createDeleteIterable(
        : Deletes.filterDeleted(records, isDeleted, counter);
  }

  static CloseableIterable<Record> openPosDeletes(FileIO io, DeleteFile file) {
    InputFile input = io.newInputFile(file.path().toString());
```
This should use getInputFile instead of calling io.newInputFile. In Spark and Flink, the input files are already created in a bulk operation.
fixed
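A minimal sketch of the suggested change, assuming getInputFile is the bulk-aware lookup the reviewer refers to (e.g., the abstract lookup on the delete filter); openDeletes stands in for the existing reader logic and is hypothetical:

```java
// Resolve the delete file via the filter's lookup, which Spark and Flink back
// with input files created in a single bulk operation, instead of creating a
// fresh InputFile per delete file.
CloseableIterable<Record> openPosDeletes(DeleteFile file) {
  InputFile input = getInputFile(file.path().toString());
  return openDeletes(input, file); // openDeletes is a hypothetical helper here
}
```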
```java
List<DeleteFile> posDeletes = distinctPosDeletes(fileTasks);
if (posDeletes.isEmpty()) {
  return ImmutableMap.of();
}
```
Please follow style guidelines. Control flow blocks should be separated from the following statement by a line of whitespace.
fixed
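For reference, a minimal illustration of the style rule, using the snippet above (the follow-up statement is hypothetical):

```java
List<DeleteFile> posDeletes = distinctPosDeletes(fileTasks);
if (posDeletes.isEmpty()) {
  return ImmutableMap.of();
}

// A blank line separates the control-flow block from the next statement.
return readPosDeleteIndexes(posDeletes); // hypothetical follow-up call
```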
spark/v3.3/build.gradle (outdated)
```diff
@@ -62,6 +62,7 @@ project(":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}") {
  implementation("org.scala-lang.modules:scala-collection-compat_${scalaVersion}")

  compileOnly "com.google.errorprone:error_prone_annotations"
  compileOnly "com.github.ben-manes.caffeine:caffeine"
```
These changes look incorrect. Why is this new compile dependency needed when there is no code change?
There is a change in GenericReader that could reuse positional-deletes info between tasks from the same split; see https://github.com/apache/iceberg/pull/6527/files#diff-98d1b57871903c422d33d86cc7781f33b844cef31c58938218d8fcc439b12131R76-R80
```java
    return posIndexMap;
  }

  return null;
```
Looks like a correctness bug. This can't ignore deletes if there are too many.
It's not. There is special handling logic in the PositionalDeletes class:

```java
Optional<PositionDeleteIndex> positionIndex =
    Optional.ofNullable(positionIndexMap).map(cache -> cache.get(filePath));
boolean skipPosDeletes = positionIndexMap != null && !positionIndex.isPresent();
```

I tried returning an empty Optional when we can't fit the delete ids in memory; however, I think that just complicated the code.
@rdblue, thank you for the review! I've addressed most of the comments and provided answers for the others.
@rdblue, gentle reminder. Please take a look once you get a chance.
@aokolnychyi, could you please help with the review? Thanks!
I believe @bryanck also ran into this; he might be interested in reviewing this as well.
ref: https://issues.apache.org/jira/browse/HIVE-26714
The current optimization covers only positional deletes: it creates a PositionDeleteIndex bitmap for every task in a combined TableScan, avoiding multiple reads of the delete files per task.
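A minimal sketch of the bitmap-per-data-file idea described above, assuming a Roaring64Bitmap-backed index keyed by data file path (class and method names are illustrative, not Iceberg's actual API):

```java
import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.longlong.Roaring64Bitmap;

// One bitmap of deleted row positions per data file: delete files are read once
// up front, then each task probes the in-memory index instead of re-reading them.
class PositionDeleteIndexSketch {
  private final Map<String, Roaring64Bitmap> indexByFile = new HashMap<>();

  // Record that row `pos` of data file `path` is deleted.
  void markDeleted(String path, long pos) {
    indexByFile.computeIfAbsent(path, p -> new Roaring64Bitmap()).addLong(pos);
  }

  // During the scan, check whether a row position was deleted.
  boolean isDeleted(String path, long pos) {
    Roaring64Bitmap bitmap = indexByFile.get(path);
    return bitmap != null && bitmap.contains(pos);
  }
}
```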