Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use sacct to get slurm job information #3422

Merged
merged 7 commits into from
Jun 4, 2024

Conversation

tylern4
Copy link
Contributor

@tylern4 tylern4 commented May 8, 2024

Description

Changes from the slurm squeue command to the sacct command. The sacct command is a bit easier on the Slurm scheduler as it connects to the slurm database instead of the slurm controller. There's a larger discussion I found on it if you're curious. One other part of using sacct is you can get job information for jobs which has finished as well so there may not be a need for the jobs_missing checks that are currently in the code.

Changed Behaviour

Users shouldn't see a difference in running their jobs unless they are scraping the logs for the squeue keyword.

Type of change

Choose which options apply, and delete the ones which do not apply.

  • New feature

@Andrew-S-Rosen
Copy link
Contributor

This is a smart idea. I can see many benefits of using sacct over squeue.

@benclifford
Copy link
Collaborator

ok this looks interesting. give me some time to understand it.

@benclifford
Copy link
Collaborator

@tylern4 so I think my understanding is that this pushes job status to come from a later, possibly time delayed information source - and I think that means that under not much load, this PR won't change behaviour, but under load, I guess that means that sacct data might be a bit delayed compared to squeue?

If so, then I guess there are two things I would consider:

i) shortly after job submission, does this mean that sacct might report a job does not exist at all, even though it has been submitted? if so, there's a still a "job missing" case but now around the start of a job, rather than around the end of a job. If so, how that might manifest in Parsl without more work is that: launch a job/block, see it is missing, launch another job/block to replace it, see it is missing, launch another job/block to replace it, ... now you have infinity jobs/blocks when you only wanted one. If this race condition can exist, then I guess this PR should do something with jobs we know were submitted but haven't yet appeared in sacct? (including ... what happens I submit a job and it never appears in sacct, perhaps?)

ii) at the end of a job, when a job is finished, sacct might not report it as finished yet, and so if Parsl should then be starting up a new job to replace it, that won't happen until the info propagates to sacct. In this case, I'm less concerned about changing anything, but more about being aware (or documenting?) this behaviour: if the system is under load so much that we can't see jobs finished, it's probably better behaviour to be slowing down our new job submissions anyway.

@tylern4
Copy link
Contributor Author

tylern4 commented May 15, 2024

There shouldn't be delay between the information since the slurm job id is the index in the database. This should just change "who" is querying the database, either the command asks the manager to query the database or we query the database directly.

  1. Since jobs are submitted with the sbatch command and sbatch blocks until it gets a job id from the database there shouldn't be a time parsl wouldn't be able to query the job id. There may be a time when you could query sacct (without a job id) and not see a job that is currently being sbatched simultaneously in another process, but since parsl blocks on submitting a job and then once it gets a job id puts it in the pending state automatically this shouldn't be a problem.

(sorry clicked enter too soon)
2. This comes down to the CG, completing, state in slurm. The slurm manager knows a job is stopping but it hasn't been updated in the database yet, but the current code still has that job as JobState.RUNNING when it's completing which would be the same case if you queried the database with sacct and still saw the job as running.

@benclifford benclifford changed the title Proposing to use sacct to get slurm job information Use sacct to get slurm job information May 16, 2024
@@ -193,13 +202,15 @@ def _status(self):
stderr_path=self.resources[job_id]['job_stderr_path'])
jobs_missing.remove(job_id)

# TODO: This code can probably be depricated with the use of sacct
Copy link
Collaborator

@benclifford benclifford May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess then if this code path is being hit, something is going wrong, rather than this being a normal/expected code path? in which case the log line right below might be better upgraded to a warning from a debug? I feel like the provider should be doing something to handle the missing jobs case rather than removing this path entirely - even if it's not expected any more, because I've seen enough stuff happen in the world of schedulers with unexpected output/lack of output...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it makes sense to leave it to catch any slurm weirdness for missing jobs. I've updated the message to log as a warning now.

@benclifford
Copy link
Collaborator

I encountered a NERSC user using an older version of Parsl, where it looks like they encountered a situation where a job disappeared from squeue, and so then scaling broken, and their workflow didn't finish - I think this PR would have avoided that problem.

This PR is on my list of things to test, but I haven't got round to it yet - we don't have automated tests for most providers so I would like to test this manually.

@tylern4
Copy link
Contributor Author

tylern4 commented May 29, 2024

Glad this might help! I'm staff at NERSC so let me know if there's anything you need for testing on Perlmutter.

@benclifford
Copy link
Collaborator

@tylern4 sent you a private email on a different but related topic to do with slurm testing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants