
Flink: implement range partitioner for map data statistics #9321

Merged 3 commits into apache:main on Mar 27, 2024

Conversation

@stevenzwu (Contributor) commented Dec 17, 2023

No description provided.

@stevenzwu force-pushed the range-partitioner branch 5 times, most recently from 66d6976 to f742a3e on December 18, 2023 05:07
@stevenzwu changed the title from "Flink: implement range partitioner that leverages traffic distributio…" to "Flink: implement range partitioner for map data statistics" on Dec 18, 2023
@stevenzwu requested a review from pvary on December 18, 2023 05:09
assignedSubtasks.add(subtaskId);
// assign the remaining weight of key to the current subtask if it is the last subtask
// or if the subtask has more capacity than the remaining key weight
if (subtaskId == numPartitions - 1 || keyRemainingWeight < subtaskRemainingWeight) {
Contributor:

If I understand correctly, keyRemainingWeight < subtaskRemainingWeight should always be true when subtaskId == numPartitions - 1.
How confident are you in the algorithm above? (I did not find any issue, but...) Would it be worth at least logging a message if something is off and we put keys on the last subtask only because of a calculation error?

Contributor Author:
Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly keyRemainingWeight < subtaskRemainingWeight should always be true for subtaskId == numPartitions - 1.

Not sure I fully understand the comment. This is an OR condition: fully assign the remaining key weight to the subtask

  1. if it is the last subtask
  2. (or) if the remaining key weight is less than the subtask's remaining capacity

(A sketch of the surrounding assignment loop follows below.)
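
A minimal sketch of the kind of greedy loop this condition sits in. Only the variable names from the quoted snippet come from the PR; the surrounding structure and the assignKey signature are assumptions, not the actual implementation.

import java.util.List;

// Sketch only: greedily spill one key's weight across subtasks, starting at startSubtask.
static void assignKey(
    long keyWeight,
    int startSubtask,
    int numPartitions,
    long targetWeightPerSubtask,
    long[] assignedWeights,
    List<Integer> assignedSubtasks) {
  long keyRemainingWeight = keyWeight;
  int subtaskId = startSubtask;
  while (keyRemainingWeight > 0 && subtaskId < numPartitions) {
    long subtaskRemainingWeight = targetWeightPerSubtask - assignedWeights[subtaskId];
    assignedSubtasks.add(subtaskId);
    // assign the remaining weight of the key to the current subtask if it is the last subtask
    // or if the subtask has more capacity than the remaining key weight
    if (subtaskId == numPartitions - 1 || keyRemainingWeight < subtaskRemainingWeight) {
      assignedWeights[subtaskId] += keyRemainingWeight;
      keyRemainingWeight = 0;
    } else {
      // otherwise fill the current subtask to its target and spill the rest to the next one
      assignedWeights[subtaskId] += subtaskRemainingWeight;
      keyRemainingWeight -= subtaskRemainingWeight;
      subtaskId++;
    }
  }
}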

Contributor:
Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did the calculation correctly, then even for the last subtask the assigned size should be smaller than subtaskRemainingWeight.
If we ever depend on the subtaskId == numPartitions - 1 part of the if clause, then we have a wrong distribution.

Contributor Author:
Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We compute the target weight using a ceiling function, so the last subtask should only get less than or equal to the fair share.

long targetWeightPerSubtaskWithCloseFileCost =
          (long) Math.ceil(((double) totalWeightWithCloseFileCost) / numPartitions);
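
For illustration, a quick worked example with made-up numbers showing why the ceiling keeps the last subtask at or below its fair share:

long totalWeightWithCloseFileCost = 100L;
int numPartitions = 3;
long targetWeightPerSubtaskWithCloseFileCost =
    (long) Math.ceil(((double) totalWeightWithCloseFileCost) / numPartitions); // ceil(33.3...) = 34
// Subtasks 0 and 1 are filled up to the target (34 + 34 = 68),
// so the last subtask only receives the remainder: 100 - 68 = 32, which is <= 34.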

I agree that we shouldn't need subtaskId == numPartitions - 1 in theory; it was added for extra safety. Please let me know your opinion on the current behavior (option 1) vs the alternatives below.

Option 2: log an error while maintaining the permissive behavior

if (subtaskId == numPartitions - 1 && keyRemainingWeight > subtaskRemainingWeight) {
    LOG.error("Invalid assignment: last subtask is assigned more weight than target");
}

Option 3: throw an exception

if (subtaskId == numPartitions - 1 && keyRemainingWeight > subtaskRemainingWeight) {
    throw new IllegalStateException("Invalid assignment: last subtask is assigned more weight than target");
}

Contributor:

I would go for option 2, and maybe add a metric?

With option 3 we just restart the job. The first checkpoint will run without a known distribution, and the job would probably continue to run without an issue.

Both options 2 and 3 require conscious monitoring from the job owners, and option 2 is better in many ways.

Contributor Author (@stevenzwu, Jan 17, 2024):

I am leaning toward option 3 (failure). I agree with your assessment that the first checkpoint would still succeed and the job would restart after it, but at least the job would keep restarting so that the algorithm error can be surfaced.

Contributor:

the job is constantly restarting

Do we store the distribution in the state? Is that the reason why the job would fail again after a restart?

jmh.gradle (conversation resolved)
this.sortedStatsWithCloseFileCost = Maps.newTreeMap(comparator);
mapStatistics.forEach(
    (k, v) -> {
      int estimatedSplits = (int) Math.ceil(v / targetWeightPerSubtask);
Contributor:

We should not forget that this is only an estimation.
The number is correct for the first key, but could be off for subsequent keys, since we are filling up the remaining slots.

Example:

  • targetWeightPerSubtask = 10
  • SORT_KEY_0 = 5, SORT_KEY_1 = 20

In this case we estimate 2 splits for SORT_KEY_1, but it will definitely be distributed across 3 splits.
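
A short illustration of that example, assuming the greedy fill described earlier (all numbers come from the comment above):

long targetWeightPerSubtask = 10L;
// SORT_KEY_0 (weight 5): estimate ceil(5 / 10) = 1 split; it lands in subtask 0
// and leaves only 5 units of capacity there.
// SORT_KEY_1 (weight 20): estimate ceil(20 / 10) = 2 splits, but the greedy fill
// spreads it over 5 (rest of subtask 0) + 10 (subtask 1) + 5 (subtask 2) = 3 splits.
int estimatedSplits = (int) Math.ceil(20.0 / targetWeightPerSubtask); // 2
int actualSplits = 3; // under the fill order above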

@stevenzwu force-pushed the range-partitioner branch 2 times, most recently from 82d4854 to 46745b3 on January 18, 2024 00:35
@stevenzwu force-pushed the range-partitioner branch 4 times, most recently from 25152b9 to c6c7837 on March 6, 2024 22:35
return assignmentInfo;
}

private Map<SortKey, KeyAssignment> buildAssignment(
Contributor Author:

@pvary @yegangy0718 this is the main/tricky part.

@Warmup(iterations = 3)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.SingleShotTime)
public class MapRangePartitionerBenchmark {
Contributor Author:

The benchmark shows that the cost of partitioner.partition(row, numPartitions) is about 0.1 µs per call.

[Screenshot of the benchmark results for 100K calls omitted.]
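
For reference, a JMH method along these lines would exercise the measured path. This is a sketch only (the field names partitioner, row, and numPartitions are assumed), not the benchmark code from the PR:

@Benchmark
public void testPartition(Blackhole blackhole) {
  // measures a single partition(row, numPartitions) call;
  // the Blackhole keeps the JIT from eliminating the result
  blackhole.consume(partitioner.partition(row, numPartitions));
}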

@stevenzwu force-pushed the range-partitioner branch from c6c7837 to 66c60c2 on March 7, 2024 00:20
@stevenzwu force-pushed the range-partitioner branch from 6c2dcc2 to 36cd418 on March 8, 2024 15:15
// If the assigned weight is less than the close file cost, pad it up with the close file cost.
// This might cause the subtask's assigned weight to go over the target weight,
// but by no more than one close file cost. Small skew is acceptable.
if (assignedWeight <= closeFileCostInWeight) {
Contributor (@pvary, Mar 14, 2024):

So, instead of not assigning the key when there is not enough weight left on the current subtask, we push a bit more onto it to warrant opening a new file?

How do we handle the case where the current key doesn't have enough weight left for the next subtask to warrant opening a new file?

Contributor:

Oh... I see, we ignore those

Contributor Author (@stevenzwu, Mar 14, 2024):

This might lead to some inaccuracy in the weight calculation. E.g., assume the key weight is 2 and the close file cost is 2, so the key weight with close cost is 4. Let's assume the previous task has a weight of 3 available. Then a weight of 3 for this key is assigned to the task and the residual weight of 1 is dropped. The routing weight for this key becomes 1 (after subtracting the close file cost), which is inaccurate, as the accurate weight should be 2.

I thought about adding the residual weight back to the previous assignment, but that is also not always accurate. E.g., the key weight is 11 and the target task weight before close cost is 10, so this key should be split into 2 files. Assuming a close file cost of 1, the key weight with close cost would be 13 (11 + 1x2). Let's say the target task weight with close cost is 12. With add-back, the task would be assigned a weight of 13, and the routing weight would be 12 (13 - 1 close file cost). That is also inaccurate.

With this simple greedy heuristic there is always some inaccuracy one way or the other, but the inaccuracy should be small and shouldn't skew the traffic distribution much.
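
Restating the two scenarios above in code form (a sketch; every number is taken from the comment, and the variable names are illustrative only):

// Scenario 1: drop the residual.
long keyWeight = 2;
long closeFileCost = 2;
long keyWeightWithCloseCost = keyWeight + closeFileCost;         // 4
long previousTaskAvailable = 3;
long assignedToPreviousTask = previousTaskAvailable;              // 3 assigned, residual 1 dropped
long routingWeight = assignedToPreviousTask - closeFileCost;      // 1, while the accurate weight is 2

// Scenario 2: add the residual back to the previous assignment.
long keyWeight2 = 11;                                             // splits into 2 files (target before close cost is 10)
long closeFileCost2 = 1;
long keyWeightWithCloseCost2 = keyWeight2 + 2 * closeFileCost2;   // 13
long targetTaskWeightWithCloseCost = 12;
long assignedWithAddBack = keyWeightWithCloseCost2;               // 13, exceeds the target of 12
long routingWeight2 = assignedWithAddBack - closeFileCost2;       // 12, while the accurate weight is 11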

Contributor:

3 things to consider:

  • If we merge, we might prefer merging files instead of splitting them out (to help future readers)
  • If the key weight is very small, we might end up removing it altogether
  • We might prefer your simpler algorithm for handling all of the edge cases mentioned above

Will try to take another, more serious look at this soon.

@pvary (Contributor) left a comment:

Thanks @stevenzwu!
LGTM

@stevenzwu merged commit 81b62c7 into apache:main on Mar 27, 2024
41 checks passed
@stevenzwu (Contributor Author):

thanks @pvary and @yegangy0718 for the code review

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Mar 28, 2024
nk1506 pushed a commit to nk1506/iceberg that referenced this pull request Apr 2, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
3 participants