diff --git a/tests/unit/test_alert_rules/test_dcgm.yaml b/tests/unit/test_alert_rules/test_dcgm.yaml index 541cbe9c..36d15914 100644 --- a/tests/unit/test_alert_rules/test_dcgm.yaml +++ b/tests/unit/test_alert_rules/test_dcgm.yaml @@ -24,7 +24,7 @@ tests: This is an indicator of: - External Power Brake Assertion being triggered (e.g. by the system power supply) Throttle reasons (bitmask): 128 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0] + LABELS = map[Hostname:ubuntu-0 gpu:0] - eval_time: 5m alertname: HWThermalThrottle exp_alerts: [] @@ -61,7 +61,7 @@ tests: This is an indicator of: - Temperature being too high Throttle reasons (bitmask): 64 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:1 Hostname:ubuntu-0] + LABELS = map[Hostname:ubuntu-0 gpu:1] - eval_time: 5m alertname: HWPowerBrakeThrottle exp_alerts: [] @@ -99,7 +99,7 @@ tests: - Current GPU temperature above the GPU Max Operating Temperature - Current memory temperature above the Memory Max Operating Temperature Throttle reasons (bitmask): 32 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-1] + LABELS = map[Hostname:ubuntu-1 gpu:0] - eval_time: 5m alertname: HWPowerBrakeThrottle exp_alerts: [] @@ -136,7 +136,7 @@ tests: All GPUs in the sync boost group will boost to the minimum possible clocks across the entire group. Look at the throttle reasons for other GPUs in the system to see why those GPUs are holding this one at lower clocks. 
Throttle reasons (bitmask): 16 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:1 Hostname:ubuntu-1] + LABELS = map[Hostname:ubuntu-1 gpu:1] - eval_time: 5m alertname: HWPowerBrakeThrottle exp_alerts: [] @@ -176,7 +176,7 @@ tests: - Power draw is too high and Fast Trigger protection is reducing the clocks - May be also reported during PState or clock change Throttle reasons (bitmask): 8 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-2] + LABELS = map[Hostname:ubuntu-2 gpu:0] - eval_time: 5m alertname: HWPowerBrakeThrottle exp_alerts: [] @@ -211,7 +211,7 @@ tests: description: | SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: 1 Throttle reasons (bitmask): 4 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:1 Hostname:ubuntu-2] + LABELS = map[Hostname:ubuntu-2 gpu:1] - eval_time: 5m alertname: HWPowerBrakeThrottle exp_alerts: [] @@ -273,7 +273,7 @@ tests: This is an indicator of: - External Power Brake Assertion being triggered (e.g.
by the system power supply) Throttle reasons (bitmask): 511 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3] + LABELS = map[Hostname:ubuntu-3 gpu:2] - eval_time: 5m alertname: HWThermalThrottle exp_alerts: @@ -288,7 +288,7 @@ tests: This is an indicator of: - Temperature being too high Throttle reasons (bitmask): 511 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3] + LABELS = map[Hostname:ubuntu-3 gpu:2] - eval_time: 5m alertname: SWThermalThrottle exp_alerts: @@ -304,7 +304,7 @@ tests: - Current GPU temperature above the GPU Max Operating Temperature - Current memory temperature above the Memory Max Operating Temperature Throttle reasons (bitmask): 511 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3] + LABELS = map[Hostname:ubuntu-3 gpu:2] - eval_time: 5m alertname: SyncBoostThrottle exp_alerts: @@ -319,7 +319,7 @@ tests: All GPUs in the sync boost group will boost to the minimum possible clocks across the entire group. Look at the throttle reasons for other GPUs in the system to see why those GPUs are holding this one at lower clocks. 
Throttle reasons (bitmask): 511 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3] + LABELS = map[Hostname:ubuntu-3 gpu:2] - eval_time: 5m alertname: HWSlowdownThrottle exp_alerts: @@ -337,7 +337,7 @@ tests: - Power draw is too high and Fast Trigger protection is reducing the clocks - May be also reported during PState or clock change Throttle reasons (bitmask): 511 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3] + LABELS = map[Hostname:ubuntu-3 gpu:2] - eval_time: 5m alertname: SWPowerThrottle exp_alerts: @@ -350,7 +350,7 @@ tests: description: | SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: 2 Throttle reasons (bitmask): 511 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:2 Hostname:ubuntu-3] + LABELS = map[Hostname:ubuntu-3 gpu:2] # Multiple throttling reasons - interval: 1m @@ -372,7 +372,7 @@ tests: This is an indicator of: - External Power Brake Assertion being triggered (e.g. by the system power supply) Throttle reasons (bitmask): 196 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0] + LABELS = map[Hostname:ubuntu-0 gpu:0] - eval_time: 5m alertname: HWThermalThrottle exp_alerts: @@ -387,7 +387,7 @@ tests: This is an indicator of: - Temperature being too high Throttle reasons (bitmask): 196 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0] + LABELS = map[Hostname:ubuntu-0 gpu:0] - eval_time: 5m alertname: SWPowerThrottle exp_alerts: @@ -400,7 +400,7 @@ tests: description: | SW Power Scaling algorithm is reducing the clocks below requested clocks on NVIDIA GPU: 0 Throttle reasons (bitmask): 196 - LABELS = map[__name__:DCGM_FI_DEV_CLOCK_THROTTLE_REASONS gpu:0 Hostname:ubuntu-0] + LABELS = map[Hostname:ubuntu-0 gpu:0] - eval_time: 5m alertname: SyncBoostThrottle exp_alerts: []
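Review note: the expectations above rely on each alert corresponding to one bit of the throttle-reasons bitmask (for example, 196 = 128 + 64 + 4). The following is a minimal sketch for checking which alerts a given test value should fire. The bit-to-alert pairing is read off the single-reason test cases in this diff (the 128 pairing is inferred from the alert name); `THROTTLE_ALERTS` and `decode_throttle_reasons` are hypothetical names and not part of the repository.

```python
# Bit values paired with alert names, as implied by the single-reason
# test cases in this diff (assumption: one bit per alert rule).
THROTTLE_ALERTS = {
    4: "SWPowerThrottle",
    8: "HWSlowdownThrottle",
    16: "SyncBoostThrottle",
    32: "SWThermalThrottle",
    64: "HWThermalThrottle",
    128: "HWPowerBrakeThrottle",
}


def decode_throttle_reasons(bitmask: int) -> list[str]:
    """Return the alert names whose bit is set in bitmask, lowest bit first.

    Bits that have no alert in THROTTLE_ALERTS are ignored.
    """
    return [name for bit, name in sorted(THROTTLE_ALERTS.items()) if bitmask & bit]


if __name__ == "__main__":
    # Values taken from the test cases in the diff.
    for value in (16, 196, 511):
        print(value, decode_throttle_reasons(value))
```

For 196 this yields the SW power, HW thermal, and HW power brake alerts and omits SyncBoostThrottle, which matches the `exp_alerts: []` expectation for that alert in the multiple-reasons test.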