ParallelIterable may cause memory pressure for presto coordinator #3179
Comments
Limiting the queue size and blocking the tasks that are reading manifests feels like the best trade-off: it fixes the problem without adding too much complexity. Say the queue allows a buffer of, just throwing a number out there, ~10,000 tasks. It seems unlikely that the consumer would first be slow enough for the queue to fill and then suddenly read tasks off fast enough to clear the queue and outpace the producers. Here's a stalled attempt from the last time this came up: #7844 (comment)
This got replaced by, and merged as, #9402.
No limit for split queue

The Presto Iceberg connector uses `BaseTableScan#planTasks()` to generate splits, which are processed by a `ParallelIterable` so that planning runs in parallel. The basic logic of `ParallelIterable` is that when `hasNext()` is called, planning tasks (one task per manifest) are submitted to a thread pool to plan data files from each manifest, and the generated `FileScanTask`s are stored in a queue and then consumed by `next()`. However, there is no size limit or target size for the queue. That is to say, once a task is submitted, all data files of that manifest are added to the queue continuously until the task finishes. So in extreme cases, if a manifest holds a huge number of data files and the consumer is slow, the queue grows quickly and puts memory pressure on the Presto coordinator, especially when each `FileScanTask` is large, e.g. when planning a table with an extremely complicated schema.
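To make the failure mode concrete, here is a minimal sketch of the producer/consumer pattern described above (class and field names are illustrative, not Iceberg's actual implementation, and task submission is simplified into the constructor rather than happening lazily in `hasNext()`): producers drain whole manifests into an unbounded `ConcurrentLinkedQueue`, so a slow consumer never slows them down.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;

// Simplified sketch of the pattern described above; not Iceberg's actual code.
class UnboundedParallelIterator<T> implements Iterator<T> {
  private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>(); // no capacity limit

  UnboundedParallelIterator(List<Iterable<T>> manifests, ExecutorService workerPool) {
    for (Iterable<T> manifest : manifests) {
      // Each task drains an entire manifest into the shared queue as fast as it can,
      // no matter how slowly next() is consuming items on the other side.
      workerPool.submit(() -> manifest.forEach(queue::add));
    }
  }

  @Override
  public boolean hasNext() {
    // The real implementation also tracks whether producer tasks are still running;
    // omitted to keep the sketch short.
    return !queue.isEmpty();
  }

  @Override
  public T next() {
    return queue.poll();
  }
}
```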
Split queue is not cleared after iterator closed

Another problem is that the queue is not cleared when the `ParallelIterable` is closed (see the paraphrased `close()` sketch below). Since Presto keeps completed queries around for a while, the `FileScanTask`s remaining in the queue stay referenced by the scheduler, and through it by the Presto query, so they cannot be collected by GC. In extreme cases this may cause the coordinator to OOM.
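For context, the `close()` behavior being described looks roughly like the following (a paraphrased sketch with illustrative field names, not the verbatim Iceberg source): the background tasks are cancelled and a closed flag is set, but the buffered items are left in the queue.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Future;

// Paraphrased sketch of the close() behavior described above; field names are
// illustrative and this is not the verbatim Iceberg source.
class ParallelIteratorCloseSketch<T> implements AutoCloseable {
  private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
  private final Future<?>[] taskFutures = new Future<?>[0];
  private volatile boolean closed = false;

  @Override
  public void close() {
    // Cancel any in-flight manifest-reading tasks and mark the iterator closed...
    for (Future<?> taskFuture : taskFutures) {
      if (taskFuture != null && !taskFuture.isDone()) {
        taskFuture.cancel(true);
      }
    }
    this.closed = true;
    // ...but `queue` is left untouched, so the buffered FileScanTasks stay reachable
    // for as long as this iterator itself is referenced (e.g. by a completed query
    // kept in the coordinator's query history).
  }
}
```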
In our test, we set up a Presto coordinator (which also acted as a worker node) with -Xmx=40GB and ran a SQL query like `select col from table limit 1`, where the table has a very complicated schema. After this query was submitted 8-10 times in a row, the coordinator crashed with an OOM error. In the heap dump we found that the `FileScanTask`s in `ParallelIterator#queue` reached 2.5GB per query and quickly filled the JVM heap.
Of course, we could explicitly set the iterator reference of `combinedScanIterable` in the Presto Iceberg connector to `null` so that GC can collect the splits after the query completes, but I think it is reasonable to clear the queue when closing the `ParallelIterator`, since the iterator can no longer be accessed once it is closed. Maybe we can make some improvements to `ParallelIterable`, e.g. add a size limit for the queue and clear the queue after it is closed.
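As a rough illustration of those two improvements (a sketch under my own assumptions, not a patch against the real `ParallelIterable`): producers could write into a bounded `LinkedBlockingQueue`, which blocks the manifest-reading tasks once the buffer is full, and `close()` could cancel the tasks and clear the queue so the buffered `FileScanTask`s become unreachable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Rough sketch of the proposed behavior (illustrative names, not Iceberg's actual code).
class BoundedParallelIterator<T> implements Iterator<T>, AutoCloseable {
  private static final int MAX_BUFFERED_ITEMS = 10_000; // cap chosen for illustration only

  private final LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>(MAX_BUFFERED_ITEMS);
  private final List<Future<?>> taskFutures = new ArrayList<>();

  BoundedParallelIterator(List<Iterable<T>> manifests, ExecutorService workerPool) {
    for (Iterable<T> manifest : manifests) {
      taskFutures.add(workerPool.submit(() -> {
        try {
          for (T item : manifest) {
            queue.put(item); // blocks when the buffer is full, applying back-pressure to producers
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // stop early if the iterator was closed
        }
      }));
    }
  }

  @Override
  public boolean hasNext() {
    // A real implementation also has to track whether producer tasks are still running;
    // omitted here to keep the sketch short.
    return !queue.isEmpty();
  }

  @Override
  public T next() {
    T item = queue.poll();
    if (item == null) {
      throw new NoSuchElementException();
    }
    return item;
  }

  @Override
  public void close() {
    // Cancel producers and drop everything that was buffered, so a finished query
    // no longer keeps FileScanTasks reachable through this iterator.
    for (Future<?> future : taskFutures) {
      future.cancel(true);
    }
    queue.clear();
  }
}
```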