Regression: parsl fails on small slurm clusters without slurm accounting #3590

raymondEhlers · 2024-08-16T16:17:20Z

Describe the bug
Although larger systems will consistently have slurm accounting, not all small clusters do (it's useful, but not required). #3422 introduced a change to retrieve the job information from the slurm accounting rather than from the scheduler. This breaks parsl on those clusters.

To Reproduce
On any cluster without slurm accounting, call sacct from a shell. It will fail.

Expected behavior
From reading through #3422, I understand using sacct reduces load on the scheduler, which is likely beneficial, especially on larger systems. However, it unexpectedly breaks systems which previously worked. Although we don't want to add a big support burden, a fairly straightforward fix could be to conditionally try sacct and then fall back to the previous behavior based on squeue.

If sacct will be a requirement going forward, this would be useful to discuss and understand. Thanks!

Environment
A cluster without slurm accounting available

Distributed Environment
N/A

cc: @fjonasALICE

benclifford · 2024-08-16T17:00:46Z

looks reasonable to try both. tagging @tylern4 who has more understanding of things than me.

tylern4 · 2024-08-16T17:31:15Z

Right, I think it makes sense to still support squeue for systems without the accounting database. I wonder if this should be a one-time test in the __init__ function, with two _status functions: the original one called _status_squeue and the new one called _status_sacct, so the condition only has to be checked once.

It might be good for @raymondEhlers to give the outputs from a system without accounting as well, but checking on a local install of slurm without accounting enabled, it seems sacct gives a return code of 1.

$ sacct; echo $?
Slurm accounting storage is disabled
1

So we should be able to do something like this:

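# In SlurmProvider.__init__: probe sacct once and pick the status function.
cmd = "sacct"
retcode, stdout, stderr = self.execute_wait(cmd)
if retcode == 0:
    # Accounting is available, so query job state through sacct.
    self._status = self._status_sacct
else:
    # sacct failed (e.g. accounting storage disabled), so fall back to squeue.
    self._status = self._status_squeue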

If that sounds reasonable I'm happy to implement it.

benclifford · 2024-08-16T17:32:35Z

@tylern4 sounds reasonable

benclifford · 2024-08-16T17:41:32Z

(please add lots of DEBUG level logging around this decision point)
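For illustration only, here is a minimal standalone sketch of the one-time probe proposed above, with DEBUG logging around the decision point. It is not the code from the eventual PR. The names run_command and choose_status_backend are hypothetical, and subprocess stands in for the provider's execute_wait. The only behaviour it assumes is what is shown in this thread: sacct exits non-zero when accounting storage is disabled.

```python
import logging
import subprocess
from typing import Callable, Tuple

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the provider's execute_wait: runs a shell
# command and returns (retcode, stdout, stderr).
def run_command(cmd: str, timeout: float = 10.0) -> Tuple[int, str, str]:
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # Treat a hung probe as a failure so the caller falls back to squeue.
        return 124, "", "timed out"
    return proc.returncode, proc.stdout, proc.stderr


# Hypothetical helper: probe sacct once and report which backend to use.
def choose_status_backend(
    execute_wait: Callable[[str], Tuple[int, str, str]] = run_command,
) -> str:
    logger.debug("Probing for Slurm accounting by running 'sacct'")
    retcode, stdout, stderr = execute_wait("sacct")
    if retcode == 0:
        logger.debug("sacct returned 0; using sacct for job status")
        return "sacct"
    logger.debug(
        "sacct returned %s (stderr: %r); falling back to squeue",
        retcode,
        stderr.strip(),
    )
    return "squeue"
```

Calling choose_status_backend() once at start-up and storing the result keeps the check out of every status poll, which is the point of doing the test in __init__.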

raymondEhlers · 2024-08-16T19:53:36Z

Thanks for your quick responses and actions!

I can confirm the return value of sacct on our small cluster without accounting:

rehlers@pc059 ~ $ sacct; echo $?
Slurm accounting storage is disabled
1

benclifford · 2024-08-17T07:43:49Z

fast fix - @raymondEhlers can you test PR #3591 in your environment to check it fixes your problem?

raymondEhlers added the bug label Aug 16, 2024

tylern4 mentioned this issue Aug 16, 2024

Fallback to squeue when sacct is missing in SlurmProvider #3591

Merged

benclifford closed this as completed in #3591 Aug 20, 2024

benclifford closed this as completed in bdfbb26 Aug 20, 2024
