-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regression: parsl fails on small slurm clusters without slurm accounting #3590
Comments
looks reasonable to try both. tagging @tylern4 who has more understanding of things than me. |
Right I think that makes sense to still support It might be good for @raymondEhlers to give the outputs from a system without accounting as well but checking on a local install of slurm without accounting enabled it seems $ sacct; echo $?
Slurm accounting storage is disabled
1 So we should be able to do something like this: # in the SlurmProvider.__init__
cmd = "sacct"
retcode, stdout, stderr = self.execute_wait(cmd)
if retcode == 0:
self._status = self._status_sacct
else:
self._status = self._status_squeue If that sounds reasonable I'm happy to implement it. |
@tylern4 sounds reasonable |
(please add lots of |
Thanks for your quick responses and actions! I can confirm the return value of rehlers@pc059 ~ $ sacct; echo $?
Slurm accounting storage is disabled
1 |
fast fix - @raymondEhlers can you test PR #3591 in your environment to check it fixes your problem? |
Describe the bug
Although larger systems will consistently have slurm accounting, not all small clusters do (it's useful, but not required). #3422 introduced a change to retrieve the job information from the slurm accounting rather than from the scheduler. This breaks parsl on those clusters.
To Reproduce
On any cluster without slurm accounting, call
sacct
from a shell. It will failExpected behavior
From reading through #3422, I understand using
sacct
reduces load on the scheduler, which is likely beneficial, especially on larger systems. However, it unexpected breaks systems which previously worked. Although we don't want to add a big support burden, a fairly straightforward fix could be to conditionally trysacct
and then fall back to the previous behavior based onsqueue
.If sacct will be a requirement going forward, this would be useful to discuss and understand. Thanks!
Environment
A cluster without slurm accounting available
Distributed Environment
N/A
cc: @fjonasALICE
The text was updated successfully, but these errors were encountered: