[GLUTEN-6863][VL] Pre-alloc and reuse compress buffer to avoid OOM in spill #6869
Conversation
Should we allocate the buffer using the global allocator, which is counted into overhead memory?
sortedBuffer_ = facebook::velox::AlignedBuffer::allocate<char>(kSortedBufferSize, veloxPool_.get());
rawBuffer_ = sortedBuffer_->asMutable<uint8_t>();
// In Spark, sortedBuffer_ memory and compressionBuffer_ memory are pre-allocated and counted into executor
// memory overhead. To align with Spark, we use arrow::default_memory_pool() to avoid counting these memory in Gluten.
@FelixYBW arrow::default_memory_pool is used to allocate the sort buffer and compress buffer.
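For reference, a minimal sketch of how pre-allocating the compression buffer from the Arrow global pool could look; the helper name, codec handling, and sizing are assumptions based on the quoted diff and the PR description, not the actual Gluten code.

#include <arrow/buffer.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/util/compression.h>
#include <memory>

// Hypothetical helper: the compression buffer is sized for the worst case of a
// full, fixed-size sorted buffer and is allocated from arrow::default_memory_pool(),
// so it is not tracked by Gluten's managed pools -- mirroring Spark, which treats
// these pre-allocated buffers as executor memory overhead.
arrow::Result<std::unique_ptr<arrow::ResizableBuffer>> preAllocateCompressionBuffer(
    arrow::util::Codec* codec, int64_t sortedBufferSize) {
  // The LZ4/ZSTD bound computations do not read the input data, so nullptr is acceptable here.
  const int64_t maxCompressedLength =
      codec == nullptr ? sortedBufferSize : codec->MaxCompressedLen(sortedBufferSize, nullptr);
  return arrow::AllocateResizableBuffer(maxCompressedLength, arrow::default_memory_pool());
}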
@zhztheplayer Can you help take a look here? Thanks!
Do we need to add a function defaultArrowMemoryPool to VeloxMemoryManager to unify the memory pool usage?
I think the code looks fine now, as we don't have a mechanism to count Arrow's global allocations into Spark overhead memory.
In the future we may report both Arrow's and Velox's global pool usage to one counter, which requires some design work. So far we don't have that.
@jinchengchenghh can you take a look?
Thanks! Added some comments.
@@ -548,42 +543,14 @@ arrow::Status LocalPartitionWriter::finishSpill(bool close) {
  return arrow::Status::OK();
}

arrow::Status LocalPartitionWriter::evict(
arrow::Status LocalPartitionWriter::hashEvict(
    uint32_t partitionId,
    std::unique_ptr<InMemoryPayload> inMemoryPayload,
    Evict::type evictType,
Looks like we don't need evictType.
hashEvict needs this param to know whether the evict source is a spill or not. If it's a spill, the partition writer will write the payload to disk immediately; otherwise it will cache the payload.
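To illustrate the distinction (the enum and function below are hypothetical stand-ins, not the PR's actual signatures):

// Hypothetical stand-ins to illustrate the thread above; the PR's real
// Evict::type and partition writer interfaces are not reproduced here.
enum class EvictKind { kSpill, kCache };

// A spill-driven eviction must release memory immediately, so the payload is
// written to disk right away; otherwise the payload is cached for the final
// merge when the writer stops.
const char* describeEvict(EvictKind kind) {
  return kind == EvictKind::kSpill ? "write payload to spill file now"
                                   : "cache payload until stop";
}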
"Compressed buffer length < maxCompressedLength. (", compressed->size(), " vs ", maxLength, ")")); | ||
output = const_cast<uint8_t*>(compressed->data()); | ||
} else { | ||
ARROW_ASSIGN_OR_RAISE(compressedBuffer, arrow::AllocateResizableBuffer(maxLength, pool)); |
Can we reuse the buffer for the uncompressed payload type?
We hold the original evicted buffer for the uncompressed payload, so there is no extra copy.
@@ -329,6 +329,21 @@ int64_t BlockPayload::rawSize() {
  return getBufferSize(buffers_);
}

int64_t BlockPayload::maxCompressedLength(
Can we move it to an anonymous namespace?
It's a public API of BlockPayload and is used by other components.
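For context, a hedged sketch of how such a bound could be computed with Arrow's codec API; the real BlockPayload::maxCompressedLength signature is not shown above, so the free function and per-buffer framing here are assumptions.

#include <arrow/buffer.h>
#include <arrow/util/compression.h>
#include <memory>
#include <vector>

// Worst-case compressed size for a set of buffers: the sum of each buffer's
// per-codec compression bound plus a fixed allowance for per-buffer length
// fields written alongside the data (the exact framing in Gluten is an
// assumption here).
int64_t maxCompressedLengthSketch(
    const std::vector<std::shared_ptr<arrow::Buffer>>& buffers,
    arrow::util::Codec* codec) {
  constexpr int64_t kPerBufferHeader = 2 * sizeof(int64_t);  // assumed framing
  int64_t total = 0;
  for (const auto& buffer : buffers) {
    const int64_t size = buffer == nullptr ? 0 : buffer->size();
    total += kPerBufferHeader;
    total += codec == nullptr
        ? size
        : codec->MaxCompressedLen(size, buffer == nullptr ? nullptr : buffer->data());
  }
  return total;
}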
@@ -314,7 +314,7 @@ std::shared_ptr<ColumnarBatch> VeloxHashShuffleReaderDeserializer::next() {
  uint32_t numRows;
  GLUTEN_ASSIGN_OR_THROW(
      auto arrowBuffers, BlockPayload::deserialize(in_.get(), codec_, memoryPool_, numRows, decompressTime_));
  if (numRows == 0) {
  if (arrowBuffers.empty()) {
Why do we have this change?
Before this PR, numRows was set to zero in BlockPayload::deserialize once it reached EOS. This PR removes that logic and uses numRows = 0 to represent a segment of a large row that cannot be compressed within one block.
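In other words, the reader-side branching roughly becomes the following; this is a simplified sketch with hypothetical helper names, not the actual deserializer code.

#include <arrow/buffer.h>
#include <cstdint>
#include <memory>
#include <vector>

// After this PR, the reader distinguishes three situations:
//   - deserialize() returned no buffers at all -> end of stream;
//   - numRows == 0 with non-empty buffers      -> one segment of a large row
//     that could not be compressed in a single block;
//   - numRows > 0                               -> a normal block to cache.
void consumeBlockSketch(
    uint32_t numRows, const std::vector<std::shared_ptr<arrow::Buffer>>& buffers) {
  if (buffers.empty()) {
    return;  // EOS: nothing left to read
  }
  if (numRows == 0) {
    // keep reading segments and stitch the large row back together
  } else {
    // cache the block's rows for deserialization
  }
}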
      cachedInputs_.emplace_back(numRows, wrapInBufferViewAsOwner(buffer->data(), buffer->size(), buffer));
      cachedRows_ += numRows;
    } else {
      // For a large row, read all segments.
Can you explain a bit more? I don't quite follow the context here.
Added some comments here to indicate that this case only occurs in the sort shuffle writer, and numRows is 0. Do we have a friendlier way to mark a large row that has been split?
@@ -266,6 +273,7 @@ arrow::Status VeloxSortShuffleWriter::evictAllPartitions() {
}

arrow::Status VeloxSortShuffleWriter::evictPartition(uint32_t partitionId, size_t begin, size_t end) {
  VELOX_CHECK(begin < end);
VELOX_DCHECK
  for (auto useRadixSort : {true, false}) {
    params.push_back(ShuffleTestParams{
        ShuffleWriterType::kSortShuffle, PartitionWriterType::kLocal, compression, 0, 0, useRadixSort});
  for (const auto compressionBufferSize : {4, 56, 32 * 1024}) {
Do we have a test for splitting a large row?
Yes. The condition for splitting a large row is row size > compressionBufferSize. When compressionBufferSize is 4, most of the rows will be split.
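A hedged sketch of the split behavior being discussed; the segmenting loop is an assumption about how the writer slices an oversized row into compression-buffer-sized pieces.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// If a serialized row is larger than the compression buffer, it cannot be
// compressed in a single block, so it is emitted as multiple segments of at
// most compressionBufferSize bytes each (each segment carries numRows == 0).
std::vector<std::pair<int64_t, int64_t>> splitLargeRowSketch(
    int64_t rowSize, int64_t compressionBufferSize) {
  std::vector<std::pair<int64_t, int64_t>> segments;  // (offset, length)
  for (int64_t offset = 0; offset < rowSize; offset += compressionBufferSize) {
    segments.emplace_back(offset, std::min(compressionBufferSize, rowSize - offset));
  }
  return segments;
}

With compressionBufferSize = 4, the smallest value in the test parameters above, almost every row exceeds the buffer and takes this path.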
Do we need to support this case, or should we require the compression buffer to be at least as large as one row and throw an exception otherwise? I think we should have a check for the minimum config value. @FelixYBW
Spark doesn't throw an exception. It copies the row into a default 32k buffer for compression.
OK, it's fine to align with Spark's behavior here.
if ("lz4" == codec) { | ||
Math.max( | ||
conf.get(IO_COMPRESSION_LZ4_BLOCKSIZE).toInt, | ||
GlutenConfig.GLUTEN_SHUFFLE_COMPRESSION_BUFFER_MIN_SIZE) |
Can we support setting the config GLUTEN_SHUFFLE_COMPRESSION_BUFFER_MIN_SIZE?
Looks like the default value 64 is much smaller than 32 * 1024, the default for other compression kinds.
64 is not the default value unless the user sets IO_COMPRESSION_LZ4_BLOCKSIZE to a very small size.
We could set a more reasonable value, maybe 32 * 1024?
Per discussion, we will throw an exception if IO_COMPRESSION_LZ4_BLOCKSIZE < 4. For each serialized row, the row size takes 4 bytes, so 4 bytes is the minimum acceptable compression block size in Gluten.
For reference, the exceptions Spark itself throws:
- lz4: spark.io.compression.lz4.blockSize=0
Caused by: java.lang.IllegalArgumentException: blockSize must be >= 64, got 0
at net.jpountz.lz4.LZ4BlockOutputStream.compressionLevel(LZ4BlockOutputStream.java:60)
at net.jpountz.lz4.LZ4BlockOutputStream.<init>(LZ4BlockOutputStream.java:101)
at org.apache.spark.io.LZ4CompressionCodec.compressedOutputStream(CompressionCodec.scala:151)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$2(TorrentBroadcast.scala:361)
at scala.Option.map(Option.scala:230)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:361)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:161)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1662)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1644)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1585)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1402)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1337)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3003)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
- zstd: spark.io.compression.zstd.bufferSize=0
Caused by: java.lang.IllegalArgumentException: Buffer size <= 0
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:74)
at org.apache.spark.io.ZStdCompressionCodec.compressedOutputStream(CompressionCodec.scala:237)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$2(TorrentBroadcast.scala:361)
at scala.Option.map(Option.scala:230)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:361)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:161)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:38)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:78)
at org.apache.spark.SparkContext.broadcastInternal(SparkContext.scala:1662)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1644)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1585)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1402)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1337)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3003)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
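Following the decision above, a sketch of the kind of minimum-size check this implies; where Gluten actually validates the config is not shown in this thread, so the function below is illustrative only.

#include <cstdint>
#include <stdexcept>
#include <string>

// Each serialized row is prefixed by a 4-byte length field, so a compression
// block smaller than 4 bytes cannot even hold that header; reject it early
// instead of failing later in the native writer.
int64_t validateCompressionBlockSize(int64_t blockSizeBytes) {
  constexpr int64_t kRowLengthHeaderBytes = 4;
  if (blockSizeBytes < kRowLengthHeaderBytes) {
    throw std::invalid_argument(
        "Compression block size must be >= " + std::to_string(kRowLengthHeaderBytes) +
        " bytes, got " + std::to_string(blockSizeBytes));
  }
  return blockSizeBytes;
}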
@jinchengchenghh Do you have further comments? Thanks!
      cachedInputs_.emplace_back(numRows, wrapInBufferViewAsOwner(buffer->data(), buffer->size(), buffer));
      cachedRows_ += numRows;
    } else {
      // numRows = 0 indicates a segment of a large row.
Can we extract the numRows = 0 logic into a function to make the code more readable?
  RowSizeType bytes = 0;
  auto* dst = rowBuffer->mutable_data();
  for (const auto& buffer : buffers) {
    VELOX_CHECK_NOT_NULL(buffer);
Use VELOX_DCHECK; internal code logic checks should use DCHECK.
Thanks!
During a sort-shuffle spill, allocating the compression buffer can trigger another spill and lead to OOM. Because the sort buffer has a fixed size, the maximum compressed buffer size can be computed at the very beginning, and the compression buffer can be pre-allocated and reused for spill.
The compression buffer size is derived from spark.io.compression.lz4.blockSize and spark.io.compression.zstd.bufferSize to align with Spark. The sort buffer and compression buffer are allocated from the default memory pool, as Spark counts this part of the allocation into executor memory overhead.
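Putting the description together, a minimal sketch of the pre-allocate-once, reuse-per-spill pattern; the class and member names are assumptions, and only the idea of sizing the compression buffer once from the fixed sort buffer is taken from the text above.

#include <arrow/buffer.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/util/compression.h>
#include <memory>

class SpillCompressorSketch {
 public:
  // Size the compression buffer once from the fixed sort buffer size, so no
  // allocation (and hence no recursive spill) happens on the spill path.
  arrow::Status init(arrow::util::Codec* codec, int64_t sortBufferSize) {
    codec_ = codec;
    // LZ4/ZSTD bound computations do not read the input data, so nullptr is acceptable.
    const int64_t maxLen = codec_->MaxCompressedLen(sortBufferSize, nullptr);
    ARROW_ASSIGN_OR_RAISE(
        compressionBuffer_,
        arrow::AllocateResizableBuffer(maxLen, arrow::default_memory_pool()));
    return arrow::Status::OK();
  }

  // Reuse the same buffer for every spill; only the valid length changes.
  arrow::Result<int64_t> compress(const uint8_t* data, int64_t length) {
    return codec_->Compress(
        length, data, compressionBuffer_->size(), compressionBuffer_->mutable_data());
  }

 private:
  arrow::util::Codec* codec_{nullptr};
  std::unique_ptr<arrow::ResizableBuffer> compressionBuffer_;
};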