Everything was running fine with ROC-smi, and all the graphics were displaying. The performance level was on Auto.
The GPU is a Radeon RX 5700 XT, and it's running in a server without displays.
But I ran the test script from this GitHub repository, and all these errors were displayed:
root@srv-0003:/ROC-smi# ./test-rocm-smi.sh
===Start of ROCM-SMI test suite===
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0] : Unable to reset clocks
WARNING: One or more commands failed
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -i...
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -i
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -t...
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon1/temp1_input
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon1/temp2_input
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon1/temp3_input
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -t
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -c...
ERROR: GPU[0] : Unable to display dcefclk
ERROR: GPU[0] : Unable to display fclk
ERROR: GPU[0] : Unable to display mclk
ERROR: GPU[0] : Unable to display sclk
ERROR: GPU[0] : Unable to display socclk
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -c
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -f...
FAILURE: GPU fan percentage from rocm-smi 100%) does not match 100
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -f
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -p...
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -p
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -s...
FAILURE: Supported PCIE clock frequencies from rocm-smi do not match sysfs values
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -s
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -o...
FAILURE: OverDrive level from rocm-smi 0 does not match 0%
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -o
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setfan...
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set fan speed to Level 0
WARNING: One or more commands failed
FAILURE: Could not set fan to minimum value 0
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set fan speed to Level 255
WARNING: One or more commands failed
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set fan speed to Level 255
WARNING: One or more commands failed
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setfan
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --resetfans...
FAILURE: Could not set fan controls to auto (2), hwmon1 still at 1
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --resetfans
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk...
ERROR: GPU[0] : Unable to display dcefclk
ERROR: GPU[0] : Unable to display fclk
ERROR: GPU[0] : Unable to display mclk
ERROR: GPU[0] : Unable to display sclk
ERROR: GPU[0] : Unable to display socclk
ERROR: GPU[0] : Unable to set clock level
WARNING: GPU[0] : Unable to get max level for clock type sclk
WARNING: One or more commands failed
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setmclk...
ERROR: GPU[0] : Unable to display dcefclk
ERROR: GPU[0] : Unable to display fclk
ERROR: GPU[0] : Unable to display mclk
ERROR: GPU[0] : Unable to display sclk
ERROR: GPU[0] : Unable to display socclk
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setmclk
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk...
ERROR: GPU[0] : Unable to display dcefclk
ERROR: GPU[0] : Unable to display fclk
ERROR: GPU[0] : Unable to display mclk
ERROR: GPU[0] : Unable to display sclk
ERROR: GPU[0] : Unable to display socclk
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -r...
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0] : Unable to reset clocks
WARNING: One or more commands failed
FAILURE: Could not reset clocks
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -r
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setperflevel...
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0] : Unable to set current Performance Level to low
WARNING: One or more commands failed
FAILURE: Could not set Performance Level to low
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setperflevel
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setoverdrive...
cat: /sys/class/drm/card0/device/pp_od_clk_voltage: No such file or directory
OverDrive not supported. Skipping test.
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setprofile...
Testing Set Profile currently disabled
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setprofile
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --resetprofile...
cat: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
Power Profile not supported. Exiting
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0] : Unable to reset clocks
WARNING: One or more commands failed
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --save...
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set fan speed to Level 229
WARNING: One or more commands failed
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0] : Unable to get power profile
grep: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
FAILURE: Saved OverDrive does not match current OverDrive setting 0
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --save
Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --load...
ERROR: NOTE: GPU and MEM Overdrive have been deprecated in the kernel. Use --setslevel/--setmlevel instead
ERROR: Non-integer characters are present in value None
WARNING: One or more commands failed
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0] : Unable to set current Performance Level to high
WARNING: One or more commands failed
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0] : Unable to set profile
ERROR: GPU[0] : Unable to set Power Profile to level 4
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0] : Unable to get power profile
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0] : Unable to set current Performance Level to auto
WARNING: One or more commands failed
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0] : Unable to reset Power Profile
grep: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set fan speed to Level 255
ERROR: NOTE: GPU and MEM Overdrive have been deprecated in the kernel. Use --setslevel/--setmlevel instead
ERROR: Non-integer characters are present in value None
ERROR: NOTE: GPU and MEM Overdrive have been deprecated in the kernel. Use --setslevel/--setmlevel instead
ERROR: Non-integer characters are present in value None
WARNING: One or more commands failed
grep: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
FAILURE: Failed to load OverDrive values from save file /tmp/tmp.hw6xwdXt4X/clocks.tmp
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --load
9 failure(s) occurred
===End of ROCM-SMI test suite===
WARNING: IO or OS error
ERROR: GPU[0] : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0] : Unable to reset clocks
WARNING: One or more commands failed
After that, rocm-smi no longer shows the information:
root@srv-0003:/home/zymvol/git/ROC-smi# rocm-smi
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]: power: Data (usually from reading a file) was not of the type that was expected
================================================================================
================================================================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 N/A N/A None None 100.0% manual 220.0W 2% 0%
================================================================================
WARNING: One or more commands failed
============================= End of ROCm SMI Log ==============================
And I get all these errors in dmesg:
[334436.022960] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.031358] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.031363] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.031365] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.031369] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.039771] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.039776] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.039779] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.039784] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
I wanted to change the performance level to auto again but I couldn't:
When you see an error like that, it usually means that the SMU is hanging. A GPU reset often clears it; otherwise a full system reset is required. Note that the test-rocm-smi.sh script was mostly designed for testing the flags (like a conformance test) and hasn't been touched in a while. I'd suggest running rsmitst instead, as that is up to date. If you still see the SMU hang with rsmitst, please try the ROCm 4.1 release, which features new kernel error handling and SMU firmware that should hopefully address this issue.
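Once the SMU responds again (after a GPU or system reset), the performance level can be checked and set back to auto directly through sysfs. The following is only a minimal sketch, not part of rocm-smi: the helper names are hypothetical, and it assumes the card is card0 (as the paths in the log above suggest) and that the amdgpu driver exposes `power_dpm_force_performance_level` for it. While the SMU is still hung, the write is expected to fail with an I/O error, just as rocm-smi did.

```python
from pathlib import Path

# Assumption: card0 is the RX 5700 XT, as the sysfs paths in the log suggest.
DEFAULT_DEVICE_DIR = Path("/sys/class/drm/card0/device")
# amdgpu sysfs file that holds the forced performance level (auto, low, high, manual, ...).
PERF_LEVEL_FILE = "power_dpm_force_performance_level"


def read_perf_level(device_dir=DEFAULT_DEVICE_DIR):
    """Return the current performance level, or None if the file cannot be read."""
    try:
        return (Path(device_dir) / PERF_LEVEL_FILE).read_text().strip()
    except OSError:
        return None


def set_perf_level(level, device_dir=DEFAULT_DEVICE_DIR):
    """Try to write a performance level; return True only if it reads back as written."""
    try:
        (Path(device_dir) / PERF_LEVEL_FILE).write_text(level + "\n")
    except OSError as err:
        # A hung SMU or missing root privileges both surface here as an OS error.
        print(f"Could not write {level!r}: {err}")
        return False
    return read_perf_level(device_dir) == level
```

Run as root, `set_perf_level("auto")` followed by `read_perf_level()` shows whether the change took effect. If the write keeps failing, the SMU is most likely still hung and a reset is needed first.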
Hello,
Also, the sensors are not displaying the temperature of the GPU:
Could you help me revert the changes made by the test script?
Thank you,
Berta