Parallel Runner & AutoScaling Issues #1816

JavierGOrdonnez · 2024-12-21T10:16:12Z

Long Story Short
Using the Parallel Runner to run 50 instantiations of a large computational template, I see unexpected behaviour in the autoscaling (private cluster is enabled). In addition, the Parallel Runner fails to keep 10 jobs running simultaneously, and the timeout does not work as expected: it seems to wait until the job is done and only then kills the worker?

Expected Behavior
10x 8xlarge machines spawn.

Actual behaviour
Just 5 of those spawn, and 2x machines are also created (maybe for the smaller, accessory services in the pipeline? But the pipeline is fully linear).

This screenshot was taken after 10 hours of running, with 10 parallel jobs (no batching). I would expect 1-2h per job, thus 5-10h total runtime for the 50 jobs. However, there is still a significant amount of computation left to do.
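The runtime estimate above can be reproduced with a short sketch. It assumes that jobs run in full waves of equal duration and that each job occupies one machine; both are simplifying assumptions, and the function name is hypothetical rather than part of the Parallel Runner.

```python
import math


def expected_runtime_hours(n_jobs: int, n_parallel: int, hours_per_job: float) -> float:
    """Rough wall-clock estimate, assuming jobs run in full waves of n_parallel."""
    waves = math.ceil(n_jobs / n_parallel)
    return waves * hours_per_job


# 50 jobs, 10 in parallel, 1-2h per job -> 5-10h in total
best_case = expected_runtime_hours(50, 10, 1)
worst_case = expected_runtime_hours(50, 10, 2)

# If only 5 jobs actually run concurrently (assumption: one job per machine),
# the same estimate doubles.
degraded_best = expected_runtime_hours(50, 5, 1)
degraded_worst = expected_runtime_hours(50, 5, 2)
```

Under these assumptions, having only 5 of the 10 machines available would already push the estimate to 10-20h, even before any timeout problems.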

JavierGOrdonnez · 2024-12-21T10:19:30Z

TLDR of the issues:

@sanderegg why don't 10 8xlarge machines spawn?
@wvangeit the timeout seems to be functioning incorrectly: it kills the job only AFTER it has run forever, and then doesn't spawn a new one.
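For reference, the timeout behaviour expected here (a job that exceeds its timeout is cancelled promptly and its slot is handed to the next job) can be sketched with plain `asyncio`. This is only an illustration of the expected semantics, not the Parallel Runner's implementation; all names are hypothetical and `asyncio.sleep` stands in for a real job.

```python
import asyncio


async def run_job(job_id, duration, timeout, slots, results):
    """Run one simulated job, cancelling it if it exceeds the timeout."""
    async with slots:  # at most max_parallel jobs hold a slot at once
        try:
            # asyncio.sleep stands in for the real computation
            await asyncio.wait_for(asyncio.sleep(duration), timeout)
            results[job_id] = "done"
        except asyncio.TimeoutError:
            # the job was cancelled at the timeout; leaving the
            # `async with` block frees the slot for the next job
            results[job_id] = "timed out"


async def run_all(durations, timeout, max_parallel):
    slots = asyncio.Semaphore(max_parallel)
    results = {}
    await asyncio.gather(
        *(run_job(i, d, timeout, slots, results) for i, d in enumerate(durations))
    )
    return results


def demo():
    # job 1 would take 5s but is cancelled after 0.2s; the others finish
    return asyncio.run(run_all([0.01, 5.0, 0.01, 0.01], timeout=0.2, max_parallel=2))
```

The point of the sketch is that the timeout fires while the job is still running, and the remaining jobs still get scheduled afterwards.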

These issues can be tackled after the break, no hurry. I will make this run work one way or another over the weekend. I just wanted to report and document it.

JavierGOrdonnez · 2024-12-21T19:02:23Z

Btw this made the full run go to waste (eventually there weren't any runners left), along with the associated large-machine costs. On the other hand, I believe this identifies the issue we were having a few weeks back :)

JavierGOrdonnez added the type:bug and Feedback labels Dec 21, 2024

JavierGOrdonnez assigned wvangeit and sanderegg Dec 21, 2024
