Errors after executing Test script #97

Open
BertaViader-zv opened this issue Feb 1, 2021 · 1 comment

BertaViader-zv commented Feb 1, 2021

Hello,

Everything was running fine with ROC-smi, and all the graphics were displaying. The performance level was set to auto.

The GPU is a Radeon RX 5700 XT, and it's running in a server without displays.

But then I ran the test script from this GitHub repository and all these errors were displayed:

root@srv-0003:/ROC-smi# ./test-rocm-smi.sh
===Start of ROCM-SMI test suite===
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0]           : Unable to reset clocks
WARNING: One or more commands failed

Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -i...
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -i


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -t...
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon1/temp1_input
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon1/temp2_input
WARNING: GPU[0] : Unable to read /sys/class/hwmon/hwmon1/temp3_input
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -t


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -c...
ERROR: GPU[0]           : Unable to display dcefclk
ERROR: GPU[0]           : Unable to display fclk
ERROR: GPU[0]           : Unable to display mclk
ERROR: GPU[0]           : Unable to display sclk
ERROR: GPU[0]           : Unable to display socclk
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -c


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -f...
FAILURE: GPU fan percentage from rocm-smi 100%) does not match 100
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -f


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -p...
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -p


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -s...
FAILURE: Supported PCIE clock frequencies from rocm-smi do not match sysfs values
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -s


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -o...
FAILURE: OverDrive level from rocm-smi 0 does not match 0%
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -o


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setfan...
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set fan speed to Level 0
WARNING: One or more commands failed
FAILURE: Could not set fan to minimum value 0
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set fan speed to Level 255
WARNING: One or more commands failed
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set fan speed to Level 255
WARNING: One or more commands failed
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setfan


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --resetfans...
FAILURE: Could not set fan controls to auto (2), hwmon1 still at 1
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --resetfans


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk...
ERROR: GPU[0]           : Unable to display dcefclk
ERROR: GPU[0]           : Unable to display fclk
ERROR: GPU[0]           : Unable to display mclk
ERROR: GPU[0]           : Unable to display sclk
ERROR: GPU[0]           : Unable to display socclk
ERROR: GPU[0]           : Unable to set clock level
WARNING: GPU[0] : Unable to get max level for clock type sclk
WARNING: One or more commands failed
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setmclk...
ERROR: GPU[0]           : Unable to display dcefclk
ERROR: GPU[0]           : Unable to display fclk
ERROR: GPU[0]           : Unable to display mclk
ERROR: GPU[0]           : Unable to display sclk
ERROR: GPU[0]           : Unable to display socclk
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setmclk


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk...
ERROR: GPU[0]           : Unable to display dcefclk
ERROR: GPU[0]           : Unable to display fclk
ERROR: GPU[0]           : Unable to display mclk
ERROR: GPU[0]           : Unable to display sclk
ERROR: GPU[0]           : Unable to display socclk
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setsclk


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 -r...
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0]           : Unable to reset clocks
WARNING: One or more commands failed
FAILURE: Could not reset clocks
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 -r


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setperflevel...
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0]           : Unable to set current Performance Level to low
WARNING: One or more commands failed
FAILURE: Could not set Performance Level to low
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setperflevel


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setoverdrive...
cat: /sys/class/drm/card0/device/pp_od_clk_voltage: No such file or directory
OverDrive not supported. Skipping test.

Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setprofile...
Testing Set Profile currently disabled
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --setprofile


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --resetprofile...
cat: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
Power Profile not supported. Exiting
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0]           : Unable to reset clocks
WARNING: One or more commands failed

Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --save...
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set fan speed to Level 229
WARNING: One or more commands failed
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0]           : Unable to get power profile
grep: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
FAILURE: Saved OverDrive  does not match current OverDrive setting 0
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --save


Testing /home/zymvol/git/ROC-smi/rocm-smi -d 0 --load...
ERROR: NOTE: GPU and MEM Overdrive have been deprecated in the kernel. Use --setslevel/--setmlevel instead
ERROR: Non-integer characters are present in value None
WARNING: One or more commands failed
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0]           : Unable to set current Performance Level to high
WARNING: One or more commands failed
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0]           : Unable to set profile
ERROR: GPU[0]           : Unable to set Power Profile to level 4
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0]           : Unable to get power profile
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0]           : Unable to set current Performance Level to auto
WARNING: One or more commands failed
WARNING: GPU[0] : Unable to read /sys/class/drm/card0/device/pp_power_profile_mode
ERROR: GPU[0]           : Unable to reset Power Profile
grep: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set fan speed to Level 255
ERROR: NOTE: GPU and MEM Overdrive have been deprecated in the kernel. Use --setslevel/--setmlevel instead
ERROR: Non-integer characters are present in value None
ERROR: NOTE: GPU and MEM Overdrive have been deprecated in the kernel. Use --setslevel/--setmlevel instead
ERROR: Non-integer characters are present in value None
WARNING: One or more commands failed
grep: /sys/class/drm/card0/device/pp_power_profile_mode: Input/output error
FAILURE: Failed to load OverDrive values from save file /tmp/tmp.hw6xwdXt4X/clocks.tmp
Test complete: /home/zymvol/git/ROC-smi/rocm-smi -d 0 --load

9 failure(s) occurred
===End of ROCM-SMI test suite===
WARNING: IO or OS error
ERROR: GPU[0]           : Unable to set Performance Level, exiting
ERROR: Performance Level sysfs file could not be written
ERROR: GPU[0]           : Unable to reset clocks
WARNING: One or more commands failed

After that, rocm-smi no longer shows the information:

root@srv-0003:/home/zymvol/git/ROC-smi# rocm-smi


======================= ROCm System Management Interface =======================
================================= Concise Info =================================
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
ERROR: 15 GPU[0]: power: Data (usually from reading a file) was not of the type that was expected
================================================================================
================================================================================
GPU  Temp  AvgPwr  SCLK  MCLK  Fan     Perf    PwrCap  VRAM%  GPU%
0    N/A   N/A     None  None  100.0%  manual  220.0W    2%   0%
================================================================================
WARNING:                 One or more commands failed
============================= End of ROCm SMI Log ==============================

And I get all these errors in dmesg:

[334436.022960] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.031358] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.031363] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.031365] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.031369] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.039771] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.039776] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!
[334436.039779] amdgpu 0000:4b:00.0: amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[334436.039784] amdgpu 0000:4b:00.0: amdgpu: Failed to export SMU metrics table!

I wanted to change the performance level to auto again but I couldn't:

root@srv-0003:~# echo "auto" > /sys/class/drm/card0/device/power_dpm_force_performance_level
bash: echo: write error: Invalid argument

Also, the sensors command is not displaying the temperature of the GPU:

root@srv-0003:~# sensors

amdgpu-pci-4b00
Adapter: PCI adapter
vddgfx:        1.16 V
fan1:        12316 RPM  (min =    0 RPM, max = 3500 RPM)
edge:             N/A  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:         N/A  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:              N/A  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
power1:           N/A  (cap = 220.00 W)

Could you help me revert the changes made by the test script?

Thank you,
Berta

@kentrussell
Contributor

When you see an error like that, it usually means that the SMU is hanging. A GPU reset often clears this; otherwise a full system reset is required. Note that the test-rocm-smi.sh script was mostly designed for testing the flags (like a conformance test) and hasn't been touched in a while. I'd suggest running rsmitst instead, as that is up to date. If you still see the SMU hang with rsmitst, please try the ROCm 4.1 release, which features new kernel error handling and SMU firmware that should hopefully address this issue.
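
The SMU hang mentioned in this reply shows up in the kernel log as the repeated amdgpu messages quoted earlier in the issue. As a minimal sketch (not part of ROC-smi or rsmitst), the Python below counts those two messages in a block of dmesg text and flags a likely hang. The function names and the threshold of 2 are assumptions chosen for illustration; only the two message strings come from the log in this issue.

```python
import re

# Message fragments copied from the dmesg excerpt in this issue.
SMU_ERROR_PATTERNS = (
    re.compile(r"Failed to export SMU metrics table"),
    re.compile(r"SMU may be not in the right state"),
)


def count_smu_errors(dmesg_text):
    """Count amdgpu log lines that contain one of the SMU error messages."""
    return sum(
        1
        for line in dmesg_text.splitlines()
        if "amdgpu" in line
        and any(pattern.search(line) for pattern in SMU_ERROR_PATTERNS)
    )


def smu_looks_hung(dmesg_text, threshold=2):
    """Heuristic (assumed threshold): repeated SMU errors suggest a hang."""
    return count_smu_errors(dmesg_text) >= threshold


# Two lines taken from the dmesg output reported above.
sample_log = (
    "[334436.022960] amdgpu 0000:4b:00.0: amdgpu: "
    "Failed to export SMU metrics table!\n"
    "[334436.031358] amdgpu 0000:4b:00.0: amdgpu: "
    "Msg issuing pre-check failed and SMU may be not in the right state!\n"
)

print(smu_looks_hung(sample_log))
```

To check a live system, the text passed to smu_looks_hung would be the output of dmesg, which may need root to read. If the check reports a likely hang, the remedies are the ones given above: a GPU reset or a full system reset.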
