---
attr_reveal: ':frag (none none appear)'
author: Janne Blomqvist
title: Slurm problems
theme: white
---

Reasons for queued jobs

  • See the list of pending jobs with squeue -t PD
  • The last column, NODELIST(REASON), shows why the job isn't running (see the example after this list)
  • Open the manual page for squeue with man squeue
    • See the section JOB REASON CODES towards the end
      • The list is NOT exhaustive
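
A sketch of what the pending-job listing might look like (the job IDs, names, and users here are made up):

$ squeue -t PD
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 12345     batch   sim_01    alice PD       0:00      1 (Priority)
 12346       gpu  train_a      bob PD       0:00      2 (Resources)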

The most common reason codes

  • Priority: There is another pending job with higher priority
  • Resources: The job has the highest priority but is waiting for resources to become available, i.e. for some running job to finish.
  • AssocMaxJobsLimit: The association (= user/group) has reached its limit on the number of simultaneously running jobs. You must wait for some of your running jobs to finish before this one can be started.
  • QOSResourceLimit: The job is hitting a QOS (Quality of Service) resource limit (a quick way to tally reasons is shown below).
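
If there are many pending jobs, you can count them per reason code with a one-liner (a sketch; %r is squeue's format specifier for the reason field):

squeue -t PD -h -o "%r" | sort | uniq -c | sort -rn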

Most common reason codes 2

  • launch failed requeued held: Job launch failed; Slurm requeued the job, but it must be released manually with scontrol release JOBID
  • Dependency: Job cannot start before some other job is finished.
  • DependencyNeverSatisfied: The job cannot start before some other job finishes, but that other job failed. You must cancel the job with scancel JOBID (examples after this list).
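
For example (12345 is a made-up job ID):

scontrol release 12345    # release a job stuck in "launch failed requeued held"
scancel 12345             # remove a job whose dependency can never be satisfied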

GrpTRESRunMins?

To prevent a single user from monopolizing all resources, we use GrpTRESRunMins: it limits the sum, over all your running jobs, of allocated TRES (CPUs, GPUs, memory, ...) multiplied by the remaining run time. See your limits with

sacctmgr list user $USER withassoc \
format=Account,MaxCPUs,GrpTRESRunMins%50,GrpTRES%40
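
As a worked example (limit and job sizes made up): with a limit of cpu=1000000, a running job using 100 CPUs with 100 hours left on its time limit counts as

100 CPUs × 100 h × 60 min/h = 600,000 CPU-minutes

against the limit, so a second identical job would push the total over the limit and stay pending until the first job burns down enough of its remaining run time.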

Queue priority

On Triton we use a hierarchical fair-share algorithm which assigns a priority value to each pending job.

  • The goal is to balance usage so that each department and user gets their own fair share of the available resources.
  • Historical usage is taken into account, exponentially decayed with a half-life of 2 weeks (illustrated below).
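
Concretely, a 2-week half-life weights past usage roughly like this:

usage from 2 weeks ago → weight 1/2
usage from 4 weeks ago → weight 1/4
usage from 8 weeks ago → weight 1/16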

Queue prio 2

  • That is:
    • Use lots of resources => lower priority
    • Use fewer resources => higher priority
  • See the sprio tool for a per-job breakdown of the priority factors (example below)
  • Also: scontrol show job JOBID shows the estimated start time of your job (if the scheduler has gotten far enough to compute one)
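
For example (12345 is a made-up job ID):

sprio -u $USER                            # per-factor priority breakdown of your pending jobs
scontrol show job 12345 | grep StartTime  # estimated start time, if computed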

Job failure

  • Check your recently completed jobs: slurm history 3days
  • Check an individual job record: sacct -j JOBID (example after this list)
  • Get more detail with the -l switch
  • See the State and ExitCode columns
    • State=COMPLETED and ExitCode=0:0 means (slurm thinks) the job completed successfully
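
For example (job ID made up; the format fields are standard sacct ones):

sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS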

Job failure 2

  • If the job failed:
    • The stdout file (slurm-JOBID.out by default) often contains the reason.
      • E.g. the job was killed for exceeding its time limit or memory limit (see the commands below).
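
A quick way to look for the cause in the output file (job ID made up; adjust to yours):

tail -n 20 slurm-12345.out
grep -iE 'error|killed|limit' slurm-12345.out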