Use sacct to get slurm job information #3422
Conversation
This is a smart idea. I can see many benefits of using `sacct`.
ok this looks interesting. give me some time to understand it.
@tylern4 so I think my understanding is that this pushes job status to come from a later, possibly time delayed information source - and I think that means that under not much load, this PR won't change behaviour, but under load, I guess that means that sacct data might be a bit delayed compared to squeue? If so, then I guess there are two things I would consider:

i) shortly after job submission, does this mean that sacct might report a job does not exist at all, even though it has been submitted? If so, there's still a "job missing" case, but now around the start of a job rather than around the end of a job. Without more work, that might manifest in Parsl as: launch a job/block, see it is missing, launch another job/block to replace it, see it is missing, launch another job/block to replace it, ... and now you have infinity jobs/blocks when you only wanted one. If this race condition can exist, then I guess this PR should do something with jobs we know were submitted but haven't yet appeared in sacct? (including ... what happens if I submit a job and it never appears in sacct, perhaps?)

ii) at the end of a job, when a job is finished, sacct might not report it as finished yet, and so if Parsl should then be starting up a new job to replace it, that won't happen until the info propagates to sacct. In this case, I'm less concerned about changing anything and more about being aware of (or documenting?) this behaviour: if the system is under so much load that we can't see jobs finish, it's probably better behaviour to be slowing down our new job submissions anyway.
There shouldn't be a delay in the information, since the slurm job id is the index in the database. This should just change "who" is querying the database: either the command asks the manager to query the database, or we query the database directly.
(sorry, clicked enter too soon)
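(Aside, for concreteness: if the "submitted but not yet visible in sacct" race described above can happen, one way to guard against it is to give recently submitted jobs a grace period before treating them as missing. This is only a sketch under that assumption - the `submit_times` bookkeeping, `classify_unreported_job` helper, and `SACCT_GRACE_SECONDS` value are hypothetical, not part of this PR.)

```python
import time

# Hypothetical guard: a job that sacct does not report yet is treated as
# PENDING for a short grace period after submission, instead of "missing",
# so the scaling code does not immediately launch a replacement block.
SACCT_GRACE_SECONDS = 120


def classify_unreported_job(job_id, submit_times, now=None):
    """Provisional state for a job id that sacct did not return.

    submit_times maps job_id -> unix timestamp recorded at submission time.
    """
    now = time.time() if now is None else now
    submitted_at = submit_times.get(job_id)
    if submitted_at is not None and (now - submitted_at) < SACCT_GRACE_SECONDS:
        # Most likely the job has not propagated to the accounting DB yet.
        return 'PENDING'
    # Genuinely unknown: let the provider log a warning and decide what to do.
    return 'MISSING'
```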
parsl/providers/slurm/slurm.py (Outdated)
@@ -193,13 +202,15 @@ def _status(self):
        stderr_path=self.resources[job_id]['job_stderr_path'])
    jobs_missing.remove(job_id)

# TODO: This code can probably be deprecated with the use of sacct
I guess then if this code path is being hit, something is going wrong, rather than this being a normal/expected code path? In which case the log line right below might be better upgraded to a warning from a debug? I feel like the provider should be doing something to handle the missing jobs case rather than removing this path entirely - even if it's not expected any more, because I've seen enough stuff happen in the world of schedulers with unexpected output/lack of output...
Yeah, it makes sense to leave it to catch any slurm weirdness for missing jobs. I've updated the message to log as a warning now.
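For reference, the fallback path being discussed could look roughly like the sketch below. This is not the actual diff in this PR: `mark_missing_jobs` is an invented helper name, and it assumes the `parsl.jobs.states` module for `JobState`/`JobStatus`.

```python
import logging

from parsl.jobs.states import JobState, JobStatus

logger = logging.getLogger(__name__)


def mark_missing_jobs(resources, jobs_missing):
    """Handle jobs that sacct did not report at all.

    With sacct this path should only be hit when the scheduler does
    something unexpected, so it is logged at warning level, not debug.
    """
    for job_id in jobs_missing:
        logger.warning("Job %s missing from sacct output; marking it as completed", job_id)
        resources[job_id]['status'] = JobStatus(JobState.COMPLETED)
```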
I encountered a NERSC user using an older version of Parsl, where it looks like they encountered a situation where a job disappeared from squeue, and so scaling broke and their workflow didn't finish - I think this PR would have avoided that problem. This PR is on my list of things to test, but I haven't got round to it yet - we don't have automated tests for most providers, so I would like to test this manually.
Glad this might help! I'm staff at NERSC so let me know if there's anything you need for testing on Perlmutter.
@tylern4 sent you a private email on a different but related topic to do with slurm testing.
Description
Changes from the slurm `squeue` command to the `sacct` command. The `sacct` command is a bit easier on the Slurm scheduler, as it connects to the slurm database instead of the slurm controller. There's a larger discussion I found on it if you're curious. One other benefit of using `sacct` is that you can get job information for jobs which have finished as well, so there may not be a need for the `jobs_missing` checks that are currently in the code.
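For illustration, the kind of query this switches to is sketched below. This is not the code in this PR - the exact `sacct` flags and the (partial) state mapping are my own example:

```python
import subprocess

# Partial map from Slurm state names to a simplified status; sacct can also
# report states such as NODE_FAIL, PREEMPTED, OUT_OF_MEMORY, etc.
translate_table = {
    'PENDING': 'PENDING',
    'RUNNING': 'RUNNING',
    'COMPLETED': 'COMPLETED',
    'FAILED': 'FAILED',
    'CANCELLED': 'CANCELLED',
    'TIMEOUT': 'TIMEOUT',
}


def sacct_states(job_ids):
    """Query sacct for the given job ids and return {job_id: simplified state}."""
    cmd = [
        'sacct',
        '-X',                    # one record per job allocation, not per step
        '-P', '--noheader',      # machine-parsable output, no header line
        '--format=JobID,State',
        '-j', ','.join(job_ids),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    states = {}
    for line in out.splitlines():
        if not line:
            continue
        job_id, _, state = line.partition('|')
        # sacct reports e.g. "CANCELLED by 12345", so keep only the first word
        states[job_id] = translate_table.get(state.split()[0], 'UNKNOWN')
    return states
```

Because sacct also returns finished jobs, a submitted job id should normally always appear in this output, which is what would let the `jobs_missing` handling become an unexpected-case warning rather than a normal code path.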
Changed Behaviour
Users shouldn't see a difference in running their jobs unless they are scraping the logs for the `squeue` keyword.

Type of change
Choose which options apply, and delete the ones which do not apply.