Fix test labels
Deezzir committed Oct 1, 2024
1 parent 78c31bc commit 4cce89d
Showing 1 changed file with 15 additions and 15 deletions.
30 changes: 15 additions & 15 deletions tests/unit/test_alert_rules/test_dcgm.yaml
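Note on the expected strings: every hunk in this file swaps one expected `LABELS = map[...]` line for another. The new form matches how Go's `fmt` prints a map (keys in sorted order, so `Hostname` precedes `gpu`) once the `__name__` label is no longer present. The sketch below reproduces that rendering. `render_labels` is a hypothetical helper, not part of the repository, and dropping `__name__` is an assumption about what the alert expression does to the series.

```python
def render_labels(labels: dict[str, str]) -> str:
    """Render labels the way a Go map prints: sorted keys, space-separated.

    Assumption: the metric name label (__name__) is not present on the
    alert's label set, so it is filtered out here.
    """
    visible = {k: v for k, v in labels.items() if k != "__name__"}
    body = " ".join(f"{key}:{value}" for key, value in sorted(visible.items()))
    return f"LABELS = map[{body}]"


series = {
    "__name__": "DCGM_FI_DEV_CLOCK_THROTTLE_REASONS",
    "gpu": "0",
    "Hostname": "ubuntu-0",
}
print(render_labels(series))  # LABELS = map[Hostname:ubuntu-0 gpu:0]
```

Uppercase letters sort before lowercase ones in byte order, which is why `Hostname` comes first in the corrected lines.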
@@ -24,7 +24,7 @@ tests:
This is an indicator of:
- External Power Brake Assertion being triggered (e.g. by the system power supply)
Throttle reasons (bitmask): 128
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0]
+ LABELS = map[Hostname:ubuntu-0 gpu:0]
- eval_time: 5m
alertname: HWThermalThrottle
exp_alerts: []
@@ -61,7 +61,7 @@ tests:
This is an indicator of:
- Temperature being too high
Throttle reasons (bitmask): 64
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:1 Hostname:ubuntu-0]
+ LABELS = map[Hostname:ubuntu-0 gpu:1]
- eval_time: 5m
alertname: HWPowerBrakeThrottle
exp_alerts: []
@@ -99,7 +99,7 @@ tests:
- Current GPU temperature above the GPU Max Operating Temperature
- Current memory temperature above the Memory Max Operating Temperature
Throttle reasons (bitmask): 32
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-1]
+ LABELS = map[Hostname:ubuntu-1 gpu:0]
- eval_time: 5m
alertname: HWPowerBrakeThrottle
exp_alerts: []
@@ -136,7 +136,7 @@ tests:
All GPUs in the sync boost group will boost to the minimum possible clocks across the entire group.
Look at the throttle reasons for other GPUs in the system to see why those GPUs are holding this one at lower clocks.
Throttle reasons (bitmask): 16
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:1 Hostname:ubuntu-1]
+ LABELS = map[Hostname:ubuntu-1 gpu:1 ]
- eval_time: 5m
alertname: HWPowerBrakeThrottle
exp_alerts: []
@@ -176,7 +176,7 @@ tests:
- Power draw is too high and Fast Trigger protection is reducing the clocks
- May be also reported during PState or clock change
Throttle reasons (bitmask): 8
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-2]
+ LABELS = map[Hostname:ubuntu-2 gpu:0]
- eval_time: 5m
alertname: HWPowerBrakeThrottle
exp_alerts: []
@@ -211,7 +211,7 @@ tests:
description: |
SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: 1
Throttle reasons (bitmask): 4
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:1 Hostname:ubuntu-2]
+ LABELS = map[Hostname:ubuntu-2 gpu:1]
- eval_time: 5m
alertname: HWPowerBrakeThrottle
exp_alerts: []
@@ -273,7 +273,7 @@ tests:
This is an indicator of:
- External Power Brake Assertion being triggered (e.g. by the system power supply)
Throttle reasons (bitmask): 511
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3]
+ LABELS = map[Hostname:ubuntu-3 gpu:2]
- eval_time: 5m
alertname: HWThermalThrottle
exp_alerts:
@@ -288,7 +288,7 @@ tests:
This is an indicator of:
- Temperature being too high
Throttle reasons (bitmask): 511
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3]
+ LABELS = map[Hostname:ubuntu-3 gpu:2]
- eval_time: 5m
alertname: SWThermalThrottle
exp_alerts:
@@ -304,7 +304,7 @@ tests:
- Current GPU temperature above the GPU Max Operating Temperature
- Current memory temperature above the Memory Max Operating Temperature
Throttle reasons (bitmask): 511
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3]
+ LABELS = map[Hostname:ubuntu-3 gpu:2]
- eval_time: 5m
alertname: SyncBoostThrottle
exp_alerts:
@@ -319,7 +319,7 @@ tests:
All GPUs in the sync boost group will boost to the minimum possible clocks across the entire group.
Look at the throttle reasons for other GPUs in the system to see why those GPUs are holding this one at lower clocks.
Throttle reasons (bitmask): 511
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3]
+ LABELS = map[Hostname:ubuntu-3 gpu:2]
- eval_time: 5m
alertname: HWSlowdownThrottle
exp_alerts:
@@ -337,7 +337,7 @@ tests:
- Power draw is too high and Fast Trigger protection is reducing the clocks
- May be also reported during PState or clock change
Throttle reasons (bitmask): 511
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3]
+ LABELS = map[Hostname:ubuntu-3 gpu:2]
- eval_time: 5m
alertname: SWPowerThrottle
exp_alerts:
@@ -350,7 +350,7 @@ tests:
description: |
SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: 2
Throttle reasons (bitmask): 511
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3]
+ LABELS = map[Hostname:ubuntu-3 gpu:2]
# Multiple throttling reasons
- interval: 1m
@@ -372,7 +372,7 @@ tests:
This is an indicator of:
- External Power Brake Assertion being triggered (e.g. by the system power supply)
Throttle reasons (bitmask): 196
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0]
+ LABELS = map[Hostname:ubuntu-0 gpu:0]
- eval_time: 5m
alertname: HWThermalThrottle
exp_alerts:
@@ -387,7 +387,7 @@ tests:
This is an indicator of:
- Temperature being too high
Throttle reasons (bitmask): 196
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0]
+ LABELS = map[Hostname:ubuntu-0 gpu:0]
- eval_time: 5m
alertname: SWPowerThrottle
exp_alerts:
@@ -400,7 +400,7 @@ tests:
description: |
SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: 0
Throttle reasons (bitmask): 196
- LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0]
+ LABELS = map[Hostname:ubuntu-0 gpu:0]
- eval_time: 5m
alertname: SyncBoostThrottle
exp_alerts: []
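
Note on the bitmask values: the test cases use `Throttle reasons (bitmask)` values of 4, 8, 16, 32, 64, 128, 196 and 511. A rough sketch of how a combined value maps to the alerts expected to fire is shown here. The bit-to-alert table is inferred from the single-reason cases in this file (for example, 128 appears with the power brake description), not taken from the alert rule definitions themselves, and `expected_alerts` is a hypothetical helper.

```python
# Bit values inferred from the single-reason test cases in this file;
# the real alert rules are not shown in this diff, so treat as an assumption.
THROTTLE_BITS = {
    4: "SWPowerThrottle",
    8: "HWSlowdownThrottle",
    16: "SyncBoostThrottle",
    32: "SWThermalThrottle",
    64: "HWThermalThrottle",
    128: "HWPowerBrakeThrottle",
}


def expected_alerts(bitmask: int) -> list[str]:
    """Return the alert names whose bit is set in the bitmask, lowest bit first."""
    return [name for bit, name in sorted(THROTTLE_BITS.items()) if bitmask & bit]


# 196 = 128 + 64 + 4, the "Multiple throttling reasons" case.
print(expected_alerts(196))
# 511 sets every bit covered by the table above.
print(expected_alerts(511))
```

For 196 this yields SWPowerThrottle, HWThermalThrottle and HWPowerBrakeThrottle, which agrees with the test expecting `exp_alerts: []` for SyncBoostThrottle in that case.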