Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: parsl fails on small slurm clusters without slurm accounting #3590

Closed
raymondEhlers opened this issue Aug 16, 2024 · 6 comments · Fixed by #3591
Closed

Regression: parsl fails on small slurm clusters without slurm accounting #3590

raymondEhlers opened this issue Aug 16, 2024 · 6 comments · Fixed by #3591
Labels

Comments

@raymondEhlers
Copy link

raymondEhlers commented Aug 16, 2024

Describe the bug
Although larger systems will consistently have slurm accounting, not all small clusters do (it's useful, but not required). #3422 introduced a change to retrieve the job information from the slurm accounting rather than from the scheduler. This breaks parsl on those clusters.

To Reproduce
On any cluster without slurm accounting, call sacct from a shell. It will fail

Expected behavior
From reading through #3422, I understand using sacct reduces load on the scheduler, which is likely beneficial, especially on larger systems. However, it unexpected breaks systems which previously worked. Although we don't want to add a big support burden, a fairly straightforward fix could be to conditionally try sacct and then fall back to the previous behavior based on squeue.

If sacct will be a requirement going forward, this would be useful to discuss and understand. Thanks!

Environment
A cluster without slurm accounting available

Distributed Environment
N/A

cc: @fjonasALICE

@benclifford
Copy link
Collaborator

looks reasonable to try both. tagging @tylern4 who has more understanding of things than me.

@tylern4
Copy link
Contributor

tylern4 commented Aug 16, 2024

Right I think that makes sense to still support squeue for systems without the accounting database. I wonder if this should be a one time test in the __init__ function and then have two _status functions, the original one called _status_squeue and the new one _status_sacct so the condition only has to be checked once.

It might be good for @raymondEhlers to give the outputs from a system without accounting as well but checking on a local install of slurm without accounting enabled it seems sacct gives return code of 1.

$ sacct; echo $?
Slurm accounting storage is disabled
1

So we should be able to do something like this:

# in the SlurmProvider.__init__
cmd = "sacct"
retcode, stdout, stderr = self.execute_wait(cmd)
if retcode == 0:
    self._status = self._status_sacct
else:
    self._status = self._status_squeue

If that sounds reasonable I'm happy to implement it.

@benclifford
Copy link
Collaborator

@tylern4 sounds reasonable

@benclifford
Copy link
Collaborator

(please add lots of DEBUG level logging around this decision point)

@raymondEhlers
Copy link
Author

raymondEhlers commented Aug 16, 2024

Thanks for your quick responses and actions!

I can confirm the return value of sacct on our small cluster without accounting:

rehlers@pc059 ~ $ sacct; echo $?
Slurm accounting storage is disabled
1

@benclifford
Copy link
Collaborator

fast fix - @raymondEhlers can you test PR #3591 in your environment to check it fixes your problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants