Parallel Runner & AutoScaling Issues #1816

Open
JavierGOrdonnez opened this issue Dec 21, 2024 · 2 comments
Assignees
Labels
Feedback Feedback through frontend type:bug Issue that prevents to perform a certain task, features that don't work as t

Comments

@JavierGOrdonnez
Contributor

Long Story Short
Using the Parallel Runner to run 50 instantiations of a large computational template, I see unexpected autoscaling behaviour (private cluster is enabled). The Parallel Runner also fails to keep 10 jobs running simultaneously. And the timeout fails: it seems to wait until the job is done and only then kill the worker?

Expected Behavior
10x 8xlarge machines spawn.

Actual behaviour
Only 5 of those spawn, and 2x machines are also created (maybe for the smaller, accessory services in the pipeline? But the pipeline is fully linear).

This screenshot was taken after 10 hours of running, with 10 parallel jobs configured (no batching). I would expect 1-2 h per job, thus 5-10 h total runtime for the 50 jobs. However, a significant amount of computation is still pending.
image
image
image
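
For reference, a rough sanity check of the runtime estimate above. This is only a back-of-the-envelope sketch: it assumes one job per machine, uniform job durations, and no time lost to timeouts or restarts. The function name `expected_hours` is made up for illustration and is not part of the Parallel Runner.

```python
import math


def expected_hours(n_jobs: int, n_workers: int, hours_per_job: float) -> float:
    """Idealised wall-clock time: jobs run in 'waves' of n_workers at a time."""
    waves = math.ceil(n_jobs / n_workers)
    return waves * hours_per_job


# 50 jobs at 1-2 h each, with the 10 machines expected vs. the 5 that spawned.
for workers in (10, 5):
    low = expected_hours(50, workers, 1)
    high = expected_hours(50, workers, 2)
    print(f"{workers} workers: {low}-{high} h")
```

Under these assumptions, halving the number of machines doubles the number of waves, so part of the overrun could come from the missing machines alone, independently of the timeout problem.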

@JavierGOrdonnez JavierGOrdonnez added type:bug Issue that prevents to perform a certain task, features that don't work as t Feedback Feedback through frontend labels Dec 21, 2024
@JavierGOrdonnez
Contributor Author

JavierGOrdonnez commented Dec 21, 2024

TLDR of the issues:

  • @sanderegg why do 10 8xlarge machines not spawn?
  • @wvangeit the timeout seems to be malfunctioning: it kills the job only AFTER it has run forever, and then doesn't spawn a new one.
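
To make the expected timeout behaviour concrete, here is a minimal sketch of the semantics I had in mind. It is not the Parallel Runner's actual code: `run_with_timeout`, `run_all`, and the `job(attempt)` callable signature are hypothetical names for illustration, and plain `asyncio` stands in for the real job submission. A job is cancelled as soon as the timeout elapses (not after it finishes), a replacement is started right away, and at most `max_parallel` jobs run at once.

```python
import asyncio


async def run_with_timeout(job, timeout_s, max_retries=1):
    """Run job(attempt); cancel it when timeout_s elapses, then start a new attempt."""
    for attempt in range(max_retries + 1):
        try:
            # wait_for cancels the running job once the timeout expires.
            return await asyncio.wait_for(job(attempt), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # the slot stays occupied by the replacement attempt
    return None  # all attempts timed out


async def run_all(jobs, max_parallel, timeout_s):
    """Keep up to max_parallel jobs running until every job has been processed."""
    slots = asyncio.Semaphore(max_parallel)

    async def guarded(job):
        async with slots:
            return await run_with_timeout(job, timeout_s)

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

With these semantics a stuck job costs at most `timeout_s` per attempt, and the number of active slots never shrinks, which is the opposite of what I observed (runners disappearing one by one).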

These issues can be tackled after the break, no hurry. Will make this run work one way or the other over the weekend, and we can tackle this later. Just wanted to report it & document it.

@JavierGOrdonnez
Contributor Author

Btw, this made the full run go to waste (eventually there weren't any runners left), along with the associated large-machine costs. On the other hand, I believe this identifies the issue we were having a few weeks back :)
image


3 participants