Bug Description
Nodes come up temporarily but are quickly set back to the drain state.
ubuntu@hpc-login:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurmd       up   infinite      2  drain node-b,node-d
ubuntu@hpc-login:~$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
NHC: check_ps_servic root      2024-08-21T13:10:00 node-b,node-d
To Reproduce
scontrol update nodename=node-b state=resume
Wait for the node to drain again. This usually takes no more than 5 minutes.
sinfo
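The steps above can be sketched as a small helper that resumes a node and polls its state until it drains again. This is only a sketch: the function name `watch_redrain`, the 600-second timeout, and the 10-second poll interval are assumptions for illustration, and it assumes the node belongs to a single partition so that `sinfo` prints one state line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resume a node, then poll until it drains again.
# Usage: watch_redrain <node> [timeout_seconds] [poll_interval_seconds]
watch_redrain() {
  local node=$1 timeout=${2:-600} interval=${3:-10} elapsed=0 state
  # Same command as the first reproduction step.
  scontrol update nodename="$node" state=resume || return 2
  while [ "$elapsed" -lt "$timeout" ]; do
    # -h drops the header, -n selects the node, %T prints the extended state.
    state=$(sinfo -h -n "$node" -o '%T')
    case $state in
      drain*)
        echo "$node drained again after ~${elapsed}s (state: $state)"
        return 0
        ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "$node still not drained after ${timeout}s"
  return 1
}
```

Running `watch_redrain node-b` on the login node should report roughly how long the node stayed up before NHC drained it.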
Environment
Base OS is Ubuntu 22.04 deployed by MAAS. Channel is latest/edge.
Relevant log output
juju exec -u slurmctld/0 -- sudo journalctl -u slurmctld -x
...
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b state set to IDLE
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d state set to IDLE
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: Node node-d now responding
Aug 21 13:01:21 hpc-login slurmctld[523332]: slurmctld: Node node-b now responding
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: sched/backfill: _start_job: Started JobId=3 in slurmd on node-b,node-d
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=3 WTERMSIG 53
Aug 21 13:01:30 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=3 done
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _slurm_rpc_submit_batch_job: JobId=4 InitPrio=4294901756 usec=909
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: sched: Allocate JobId=4 NodeList=node-b,node-d #CPUs=4 Partition=slurmd
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=4 WEXITSTATUS 0
Aug 21 13:01:53 hpc-login slurmctld[523332]: slurmctld: _job_complete: JobId=4 done
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d reason set to: NHC: check_ps_service: Service sshd (process sshd) owned by root not running; start in progress
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-d state set to DRAINED
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b reason set to: NHC: check_ps_service: Service sshd (process sshd) owned by root not running; start in progress
Aug 21 13:10:00 hpc-login slurmctld[523332]: slurmctld: update_node: node node-b state set to DRAINED
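To pull only the drain reasons out of a long journal, a filter along these lines may help. It is a minimal sketch that assumes the log lines keep the `update_node: node <name> reason set to: <reason>` shape shown above. The function name `drain_reasons` is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read slurmctld journal lines on stdin and print
# "<timestamp> | <node> | <reason>" for each "reason set to" entry.
drain_reasons() {
  sed -n -E 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) .*update_node: node ([^ ]+) reason set to: (.*)$/\1 | \2 | \3/p'
}
```

It can be fed from the same command used to collect the log, for example by piping the output of `juju exec -u slurmctld/0 -- sudo journalctl -u slurmctld -x` into `drain_reasons`.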
Additional context
No response