Commit

Fix test & Add sync boost throttle reason
Deezzir committed Oct 1, 2024
1 parent 8416584 commit 78c31bc
Showing 2 changed files with 209 additions and 20 deletions.
46 changes: 33 additions & 13 deletions src/prometheus_alert_rules/dcgm.yaml
@@ -5,50 +5,68 @@ groups:
  # isolate the least significant 8 bits with % 256
  # check whether bit 7 (starts from bit 0) has been set with the >= 128 comparison
  expr: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS % 256 >= 128
- for: 3m
+ for: 5m
  labels:
  severity: warning
  annotations:
  summary: GPU Hardware Power Brake Slowdown throttling detected. (instance {{ $labels.Hostname }})
  description: |
- HW Power Brake Slowdown (reducing the core clocks by a factor of 2 or more) is engaged on NVIDIA GPU: {{ $labels.gpu }}.
+ HW Power Brake Slowdown (reducing the core clocks by a factor of 2 or more) is engaged on NVIDIA GPU: {{ $labels.gpu }}
  This is an indicator of:
  - External Power Brake Assertion being triggered (e.g. by the system power supply)
- LABELS = {{ $labels }}
+ Throttle reasons (bitmask): {{ $value }}
+ LABELS = {{ $labels }}
  - alert: HWThermalThrottle
  # isolate the least significant 7 bits with % 128
  # check whether bit 6 (starts from bit 0) has been set with the >= 64 comparison
  expr: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS % 128 >= 64
- for: 3m
+ for: 5m
  labels:
  severity: warning
  annotations:
  summary: GPU Hardware Thermal throttling detected. (instance {{ $labels.Hostname }})
  description: |
- HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged on NVIDIA GPU: {{ $labels.gpu }}.
+ HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged on NVIDIA GPU: {{ $labels.gpu }}
  This is an indicator of:
  - Temperature being too high
- LABELS = {{ $labels }}
+ Throttle reasons (bitmask): {{ $value }}
+ LABELS = {{ $labels }}
  - alert: SWThermalThrottle
  # isolate the least significant 6 bits with % 64
  # check whether bit 5 (starts from bit 0) has been set with the >= 32 comparison
  expr: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS % 64 >= 32
- for: 3m
+ for: 5m
  labels:
  severity: warning
  annotations:
  summary: GPU Software Thermal throttling detected. (instance {{ $labels.Hostname }})
  description: |
- SW Thermal Slowdown is engaged on NVIDIA GPU: {{ $labels.gpu }}.
+ SW Thermal Slowdown is engaged on NVIDIA GPU: {{ $labels.gpu }}
  This is an indicator of:
  - Current GPU temperature above the GPU Max Operating Temperature
  - Current memory temperature above the Memory Max Operating Temperature
- LABELS = {{ $labels }}
+ Throttle reasons (bitmask): {{ $value }}
+ LABELS = {{ $labels }}
+ - alert: SyncBoostThrottle
+ # isolate the least significant 5 bits with % 32
+ # check whether bit 4 (starts from bit 0) has been set with the >= 16 comparison
+ expr: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS % 32 >= 16
+ for: 5m
+ labels:
+ severity: warning
+ annotations:
+ summary: GPU Sync Boost throttling detected. (instance {{ $labels.Hostname }})
+ description: |
+ This NVIDIA GPU: {{ $labels.gpu }} has been added to a Sync boost group with nvidia-smi or DCGM in order to maximize performance per watt.
+ All GPUs in the sync boost group will boost to the minimum possible clocks across the entire group.
+ Look at the throttle reasons for other GPUs in the system to see why those GPUs are holding this one at lower clocks.
+ Throttle reasons (bitmask): {{ $value }}
+ LABELS = {{ $labels }}
  - alert: HWSlowdownThrottle
  # isolate the least significant 4 bits with % 16
  # check whether bit 3 (starts from bit 0) has been set with the >= 8 comparison
  expr: DCGM_FI_DEV_CLOCK_THROTTLE_REASONS % 16 >= 8
- for: 3m
+ for: 5m
  labels:
  severity: warning
  annotations:
@@ -60,7 +78,8 @@ groups:
  - External Power Brake Assertion is triggered (e.g. by the system power supply)
  - Power draw is too high and Fast Trigger protection is reducing the clocks
  - May be also reported during PState or clock change
- LABELS = {{ $labels }}
+ Throttle reasons (bitmask): {{ $value }}
+ LABELS = {{ $labels }}
  - alert: SWPowerThrottle
  # isolate the least significant 3 bits with % 8
  # check whether bit 2 (starts from bit 0) has been set with the >= 4 comparison
@@ -71,5 +90,6 @@ groups:
  annotations:
  summary: GPU Software Power throttling detected. (instance {{ $labels.Hostname }})
  description: |
- SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: {{ $labels.gpu }}.
- LABELS = {{ $labels }}
+ SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: {{ $labels.gpu }}
+ Throttle reasons (bitmask): {{ $value }}
+ LABELS = {{ $labels }}
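
Each rule in this diff tests a single bit of DCGM_FI_DEV_CLOCK_THROTTLE_REASONS with a modulo followed by a comparison, and the new `Throttle reasons (bitmask): {{ $value }}` annotation exposes the raw number. The Python sketch below is not part of the commit; it only illustrates that idiom and shows one way the annotated value could be decoded by hand. The bit-to-alert mapping is taken from the rule comments in this diff; the names `THROTTLE_BITS`, `bit_is_set_modulo`, and `decode` are hypothetical, and the metric is assumed to be a non-negative integer.

```python
# Bit positions as stated in the rule comments of this diff
# (bit 0 is the least significant bit). Names are the alert names.
THROTTLE_BITS = {
    7: "HWPowerBrakeThrottle",
    6: "HWThermalThrottle",
    5: "SWThermalThrottle",
    4: "SyncBoostThrottle",
    3: "HWSlowdownThrottle",
    2: "SWPowerThrottle",
}


def bit_is_set_modulo(value: int, bit: int) -> bool:
    """Mirror the PromQL idiom: value % 2**(bit + 1) >= 2**bit.

    The modulo discards every bit above `bit`; the remainder is then
    at least 2**bit exactly when `bit` itself is set.
    """
    return value % (1 << (bit + 1)) >= (1 << bit)


def decode(value: int) -> list:
    """Return the alert names whose bit is set, highest bit first."""
    return [
        name
        for bit, name in sorted(THROTTLE_BITS.items(), reverse=True)
        if bit_is_set_modulo(value, bit)
    ]


if __name__ == "__main__":
    # 72 = 64 + 8, i.e. bits 6 and 3 are set.
    print(decode(72))  # ['HWThermalThrottle', 'HWSlowdownThrottle']
```

Because PromQL has no bitwise operators, the modulo-and-compare form stands in for the usual `value & (1 << bit)` test; for non-negative integers the two give the same result.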
