Use sacct to get slurm job information #3422
Conversation
This is a smart idea. I can see many benefits of using `sacct`.
ok this looks interesting. give me some time to understand it.
@tylern4 so I think my understanding is that this pushes job status to come from a later, possibly time delayed information source - and I think that means that under not much load, this PR won't change behaviour, but under load, I guess that means that sacct data might be a bit delayed compared to squeue? If so, then I guess there are two things I would consider:

i) shortly after job submission, does this mean that sacct might report a job does not exist at all, even though it has been submitted? If so, there's still a "job missing" case, but now around the start of a job rather than around the end of a job. Without more work, that might manifest in Parsl as: launch a job/block, see it is missing, launch another job/block to replace it, see it is missing, launch another job/block to replace it, ... and now you have infinity jobs/blocks when you only wanted one. If this race condition can exist, then I guess this PR should do something with jobs we know were submitted but haven't yet appeared in sacct? (including ... what happens if I submit a job and it never appears in sacct, perhaps?)

ii) at the end of a job, when a job is finished, sacct might not report it as finished yet, and so if Parsl should then be starting up a new job to replace it, that won't happen until the info propagates to sacct. In this case, I'm less concerned about changing anything and more about being aware of (or documenting?) this behaviour: if the system is under so much load that we can't see jobs finish, it's probably better behaviour to be slowing down our new job submissions anyway.
There shouldn't be a delay in the information, since the slurm job id is the index in the database. This should just change "who" is querying the database: either the command asks the manager to query the database, or we query the database directly.
(sorry, clicked enter too soon)
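(Aside, for concreteness: if the "submitted but not yet visible in sacct" race described above can happen, one way to guard against it is to give recently submitted jobs a grace period before treating them as missing. This is only a sketch under that assumption - the `submit_times` bookkeeping, `classify_unreported_job` helper, and `SACCT_GRACE_SECONDS` value are hypothetical, not part of this PR.)

```python
import time

# Hypothetical guard: a job that sacct does not report yet is treated as
# PENDING for a short grace period after submission, instead of "missing",
# so the scaling code does not immediately launch a replacement block.
SACCT_GRACE_SECONDS = 120


def classify_unreported_job(job_id, submit_times, now=None):
    """Provisional state for a job id that sacct did not return.

    submit_times maps job_id -> unix timestamp recorded at submission time.
    """
    now = time.time() if now is None else now
    submitted_at = submit_times.get(job_id)
    if submitted_at is not None and (now - submitted_at) < SACCT_GRACE_SECONDS:
        # Most likely the job has not propagated to the accounting DB yet.
        return 'PENDING'
    # Genuinely unknown: let the provider log a warning and decide what to do.
    return 'MISSING'
```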
parsl/providers/slurm/slurm.py (Outdated)
@@ -193,13 +202,15 @@ def _status(self):
        stderr_path=self.resources[job_id]['job_stderr_path'])
    jobs_missing.remove(job_id)

# TODO: This code can probably be deprecated with the use of sacct
I guess then if this code path is being hit, something is going wrong, rather than this being a normal/expected code path? In which case the log line right below might be better upgraded to a warning from a debug? I feel like the provider should be doing something to handle the missing jobs case rather than removing this path entirely - even if it's not expected any more, because I've seen enough stuff happen in the world of schedulers with unexpected output/lack of output...
Yeah, it makes sense to leave it to catch any slurm weirdness for missing jobs. I've updated the message to log as a warning now.
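For reference, the fallback path being discussed could look roughly like the sketch below. This is not the actual diff in this PR: `mark_missing_jobs` is an invented helper name, and it assumes the `parsl.jobs.states` module for `JobState`/`JobStatus`.

```python
import logging

from parsl.jobs.states import JobState, JobStatus

logger = logging.getLogger(__name__)


def mark_missing_jobs(resources, jobs_missing):
    """Handle jobs that sacct did not report at all.

    With sacct this path should only be hit when the scheduler does
    something unexpected, so it is logged at warning level, not debug.
    """
    for job_id in jobs_missing:
        logger.warning("Job %s missing from sacct output; marking it as completed", job_id)
        resources[job_id]['status'] = JobStatus(JobState.COMPLETED)
```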
I encountered a NERSC user using an older version of Parsl, where it looks like they encountered a situation where a job disappeared from squeue, and so scaling broke and their workflow didn't finish - I think this PR would have avoided that problem. This PR is on my list of things to test, but I haven't got round to it yet - we don't have automated tests for most providers, so I would like to test this manually.
Glad this might help! I'm staff at NERSC so let me know if there's anything you need for testing on Perlmutter.
@tylern4 sent you a private email on a different but related topic to do with slurm testing.
Description
Changes from the slurm `squeue` command to the `sacct` command. The `sacct` command is a bit easier on the Slurm scheduler, as it connects to the slurm database instead of the slurm controller. There's a larger discussion I found on it if you're curious. One other benefit of using `sacct` is that you can get job information for jobs which have finished as well, so there may not be a need for the `jobs_missing` checks that are currently in the code.
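For illustration, the kind of query this switches to is sketched below. This is not the code in this PR - the exact `sacct` flags and the (partial) state mapping are my own example:

```python
import subprocess

# Partial map from Slurm state names to a simplified status; sacct can also
# report states such as NODE_FAIL, PREEMPTED, OUT_OF_MEMORY, etc.
translate_table = {
    'PENDING': 'PENDING',
    'RUNNING': 'RUNNING',
    'COMPLETED': 'COMPLETED',
    'FAILED': 'FAILED',
    'CANCELLED': 'CANCELLED',
    'TIMEOUT': 'TIMEOUT',
}


def sacct_states(job_ids):
    """Query sacct for the given job ids and return {job_id: simplified state}."""
    cmd = [
        'sacct',
        '-X',                    # one record per job allocation, not per step
        '-P', '--noheader',      # machine-parsable output, no header line
        '--format=JobID,State',
        '-j', ','.join(job_ids),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    states = {}
    for line in out.splitlines():
        if not line:
            continue
        job_id, _, state = line.partition('|')
        # sacct reports e.g. "CANCELLED by 12345", so keep only the first word
        states[job_id] = translate_table.get(state.split()[0], 'UNKNOWN')
    return states
```

Because sacct also returns finished jobs, a submitted job id should normally always appear in this output, which is what would let the `jobs_missing` handling become an unexpected-case warning rather than a normal code path.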
Changed Behaviour
Users shouldn't see a difference in running their jobs unless they are scraping the logs for the `squeue` keyword.

Type of change
Choose which options apply, and delete the ones which do not apply.