slurmd nodes fail health check, enter drain state #22

bnordgren · 2024-08-21T19:27:32Z

Bug Description

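Nodes come up and briefly run jobs, but within minutes NHC (the node health check) sets them to the drain state. The logged reason is that check_ps_service finds sshd not running.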

ubuntu@hpc-login:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmd       up   infinite      2  drain node-b,node-d
ubuntu@hpc-login:~$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
NHC: check_ps_servic root      2024-08-21T13:10:00 node-b,node-d
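
The REASON column in the sinfo -R output is cut off at 20 characters, which is the default width for that listing. A minimal sketch for printing the full reason follows. The function name full_drain_reasons and the width of 80 are illustrative choices, not part of this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and column width are illustrative, not from the report).
# Lists drain reasons with a wider REASON column than the default `sinfo -R` layout.
full_drain_reasons() {
    # %E = reason, %u = user who set it, %H = timestamp of the reason, %N = node list
    sinfo -R -o "%80E %9u %19H %N"
}
```

Running full_drain_reasons on the login node should show the whole NHC message. scontrol show node node-b also prints the untruncated Reason field for a single node.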

To Reproduce

1. scontrol update nodename=node-b state=resume
2. Wait, usually no more than 5 minutes.
3. sinfo
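
The steps above can be sketched as a small polling loop that records when the node changes state. This is only a sketch: the function name resume_and_watch, the 30 second interval, and the limit of 20 polls are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resume a node, then poll its state until it is drained.
# Arguments: node name, poll interval in seconds (default 30), max polls (default 20).
# Returns 0 once the node is draining or drained, 1 if the resume fails,
# and 2 if the node never drained within the polling window.
resume_and_watch() {
    local node="$1" interval="${2:-30}" max_polls="${3:-20}" state i
    scontrol update nodename="$node" state=resume || return 1
    for ((i = 0; i < max_polls; i++)); do
        # -h drops the header, -n limits output to one node, %T prints the node state
        state="$(sinfo -h -n "$node" -o "%T" | head -n 1)"
        printf '%s %s %s\n' "$(date +%T)" "$node" "$state"
        case "$state" in
            drain*) return 0 ;;
        esac
        sleep "$interval"
    done
    return 2
}
```

For example, resume_and_watch node-b 30 20 prints one timestamped state per poll, which gives the elapsed time between the resume and the drain.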

Environment

Base OS is Ubuntu 22.04 deployed by MAAS. Channel is latest/edge.

Relevant log output

juju exec -u slurmctld/0 -- sudo journalctl -u slurmctld -x
...
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b state set to IDLE
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d state set to IDLE
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: Node node-d now responding
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: Node node-b now responding
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: sched/backfill: _start_job: Started JobId=3 in slurmd on node-b,node-d
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=3 WTERMSIG 53
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=3 done
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _slurm_rpc_submit_batch_job: JobId=4 InitPrio=4294901756 usec=909
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: sched: Allocate JobId=4 NodeList=node-b,node-d #CPUs=4 Partition=slurmd
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=4 WEXITSTATUS 0
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=4 done
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d reason set to: NHC: check_ps_service:  Service sshd (process sshd) owned by root not running; start in progress
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d state set to DRAINED
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b reason set to: NHC: check_ps_service:  Service sshd (process sshd) owned by root not running; start in progress
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b state set to DRAINED
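
To pull only the NHC drain reasons out of a longer journal, a filter along these lines may help. It is a sketch that assumes the line layout shown in the log above (a 15-character timestamp followed by an update_node message). The function name nhc_reasons is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads slurmctld journal lines on stdin and prints
# "<timestamp> <node>: <reason>" for every "reason set to" entry.
# Assumes the journalctl layout shown in this report.
nhc_reasons() {
    sed -n 's/^\(.\{15\}\).*update_node: node \([^ ]*\) reason set to: \(.*\)$/\1 \2: \3/p'
}
```

It can be fed from the same command used to collect the log, for example by piping juju exec -u slurmctld/0 -- sudo journalctl -u slurmctld -x into nhc_reasons.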

Additional context

No response

NucciTheBoss added the "needs triage" label (needs further investigation to determine cause and/or work required to implement fix/feature) on Aug 21, 2024
