Long Story Short
Using the Parallel Runner to run 50 instantiations of a large computational template, I see unexpected behaviour in the autoscaling (private cluster is enabled). The Parallel Runner also fails to keep 10 jobs running simultaneously. The timeout does not work as expected either: it seems to wait until the job is done and only then kills the worker.
Expected Behavior
10x 8xlarge machines spawn.
Actual Behavior
Just 5 of those spawn, and 2x machines are also created (maybe for the smaller, accessory services in the pipeline? But the pipeline is fully linear).
This screenshot is after 10 hours of running, with 10 parallel jobs (no batching). I would expect 1-2 h per job, thus 5-10 h total runtime. However, there is still a significant amount of computation left to do.
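For reference, the runtime estimate above is just the number of job waves times the per-job duration. A minimal sketch of that arithmetic (the function name is only illustrative, and the last line assumes that only the 5 large machines actually run jobs, which I have not verified):

```python
import math


def estimate_runtime_hours(n_jobs, n_parallel, hours_per_job):
    """Wall-clock estimate when jobs run in waves of n_parallel."""
    waves = math.ceil(n_jobs / n_parallel)
    return waves * hours_per_job


# Expected setup: 50 jobs, 10 in parallel, 1-2 h each.
print(estimate_runtime_hours(50, 10, 1), estimate_runtime_hours(50, 10, 2))

# Assumption: only 5 jobs can actually run at once (one per spawned machine).
print(estimate_runtime_hours(50, 5, 1), estimate_runtime_hours(50, 5, 2))
```

Under that assumption the estimate doubles to 10-20 h, which would be consistent with a lot of work still being left after 10 hours.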
@wvangeit the timeout seems to be malfunctioning: it kills the job only AFTER letting it run for as long as it takes, and then doesn't spawn a new one.
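To make the expectation concrete, here is a rough sketch of the semantics I would expect: a job is killed while it is still running once it exceeds the timeout, and its slot is refilled immediately. This is not the Parallel Runner's code; `run_pool` and `sleep_cmd` are hypothetical names, and the jobs are plain subprocesses standing in for the real pipeline.

```python
import subprocess
import sys
import time


def run_pool(commands, max_parallel, timeout_s, poll_s=0.05):
    """Keep up to max_parallel jobs alive; kill any job that exceeds timeout_s.

    Returns a dict mapping job index to "done" (process exited) or "timeout".
    """
    pending = list(enumerate(commands))
    running = {}  # job index -> (process, start time)
    status = {}
    while pending or running:
        # Refill free slots first, so a killed or finished job is replaced at once.
        while pending and len(running) < max_parallel:
            idx, cmd = pending.pop(0)
            running[idx] = (subprocess.Popen(cmd), time.monotonic())
        for idx, (proc, started) in list(running.items()):
            if proc.poll() is not None:
                status[idx] = "done"
                del running[idx]
            elif time.monotonic() - started > timeout_s:
                proc.kill()  # enforce the timeout while the job is still running
                proc.wait()
                status[idx] = "timeout"
                del running[idx]
        time.sleep(poll_s)
    return status


def sleep_cmd(seconds):
    # Stand-in for a real job: a child Python process that just sleeps.
    return [sys.executable, "-c", f"import time; time.sleep({seconds})"]


if __name__ == "__main__":
    # The second job would sleep for 30 s but should be killed after about 3 s.
    jobs = [sleep_cmd(0.1), sleep_cmd(30), sleep_cmd(0.1)]
    print(run_pool(jobs, max_parallel=2, timeout_s=3))
```

The point is the ordering: the timeout check happens while the job is alive, and the freed slot goes straight back to the queue, so the number of runners never drains to zero.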
These issues can be tackled after the break, no hurry. Will make this run work one way or the other over the weekend, and we can tackle this later. Just wanted to report it & document it.
Btw this made the full run go to waste (eventually there weren't any runners left), and with it the associated large-machine costs. On the other hand, I believe this identifies the issue we were having a few weeks back :)