RSMI_STATUS_PERMISSION on rocm-smi --setmclk #117

Open
sandrain opened this issue Jan 5, 2023 · 7 comments

Comments

@sandrain

sandrain commented Jan 5, 2023

  • System: ubuntu-focal (5.4.0-109-generic)
  • rocm-5.2.1
  • GPU: MI250X

I am trying to set the memory clock frequency using rocm-smi, and it fails with the RSMI_STATUS_PERMISSION error. The performance level was set to manual:

$ rocm-smi --showhw


======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU  DID   GFX RAS  SDMA RAS  UMC RAS  VBIOS           BUS
0    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:31:00.0
1    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:34:00.0
2    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:11:00.0
3    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:14:00.0
4    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:AE:00.0
5    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:B3:00.0
6    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:8E:00.0
7    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:93:00.0
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --showclkfrq --showperflevel


======================= ROCm System Management Interface =======================
============================ Show Performance Level ============================
GPU[0]          : Performance Level: manual
================================================================================
========================= Supported clock frequencies ==========================
GPU[0]          :
GPU[0]          : Supported fclk frequencies on GPU0
GPU[0]          : 0: 0Mhz *
GPU[0]          :
GPU[0]          : Supported mclk frequencies on GPU0
GPU[0]          : 0: 400Mhz
GPU[0]          : 1: 700Mhz
GPU[0]          : 2: 1200Mhz
GPU[0]          : 3: 1600Mhz *
GPU[0]          :
GPU[0]          : Supported sclk frequencies on GPU0
GPU[0]          : 0: 500Mhz
GPU[0]          : 1: 1700Mhz *
GPU[0]          :
GPU[0]          : Supported socclk frequencies on GPU0
GPU[0]          : 0: 666Mhz
GPU[0]          : 1: 857Mhz
GPU[0]          : 2: 1000Mhz
GPU[0]          : 3: 1090Mhz *
GPU[0]          : 4: 1333Mhz
GPU[0]          :
--------------------------------------------------------------------------------
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 2


======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x4
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 0


======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x1
================================================================================
============================= End of ROCm SMI Log ==============================

I found that only sclk is configurable. Is this expected, or am I missing something? Thanks!
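
The two failed calls above report bitmask 0x4 for level 2 and 0x1 for level 0, which is consistent with one bit per clock level. A minimal Python sketch of that mapping, assuming bit N corresponds to level N (the function names are hypothetical and not part of rocm-smi):

```python
def levels_to_bitmask(levels):
    """Build a bitmask with bit N set for every clock level N in `levels`."""
    mask = 0
    for level in levels:
        if level < 0:
            raise ValueError("clock level must be non-negative")
        mask |= 1 << level
    return mask


def bitmask_to_levels(mask):
    """Return the sorted list of clock levels whose bits are set in `mask`."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


# Level 2 (1200Mhz in the listing above) and level 0 (400Mhz)
print(hex(levels_to_bitmask([2])))
print(hex(levels_to_bitmask([0])))
```

Under this assumption, a request for several levels at once would simply OR the individual bits together.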

@rakataprime

Did you set the feature mask and the performance level to manual, like this?
rocm-smi --setperflevel manual
sudo rocm-smi --setvc 2 1701 915 --autorespond y
sudo rocm-smi --setsrange 808 1740 --autorespond y
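
A rough Python sketch for checking the feature mask, assuming "feature mask" here means the amdgpu.ppfeaturemask module parameter and that the value is readable from /sys/module/amdgpu/parameters/ppfeaturemask. The OverDrive bit value 0x4000 is an assumption taken from the kernel's PP_OVERDRIVE_MASK definition and should be verified against your kernel headers; the function names are hypothetical.

```python
# Assumption: OverDrive is bit 14 of the mask (PP_OVERDRIVE_MASK in the
# amdgpu kernel headers). Verify against the headers of your kernel.
PP_OVERDRIVE_MASK = 0x4000

# Assumed sysfs location of the module parameter.
PPFEATUREMASK_PATH = "/sys/module/amdgpu/parameters/ppfeaturemask"


def parse_ppfeaturemask(raw):
    """Parse the parameter text, which may be hex ("0x...") or decimal."""
    return int(raw.strip(), 0)


def overdrive_enabled(mask):
    """Return True if the assumed OverDrive bit is set in the mask."""
    return bool(mask & PP_OVERDRIVE_MASK)


def read_ppfeaturemask(path=PPFEATUREMASK_PATH):
    """Read and parse the mask from sysfs (requires the amdgpu module)."""
    with open(path) as f:
        return parse_ppfeaturemask(f.read())
```

On a machine with the amdgpu module loaded, `overdrive_enabled(read_ppfeaturemask())` would report whether that bit is set. It says nothing about whether a particular GPU supports changing a given clock.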

@sandrain
Author

@rakataprime Thanks for your input. I've tried the feature mask, which I didn't set properly before. However, I still cannot change the memory clock frequency as I wish.

BTW, I found the following error when the amdgpu module is loaded (regardless of the kernel parameter ppfeaturemask):

[   14.070181] ------------[ cut here ]------------
[   14.070182] RAS ERROR: unexpected block id 15
[   14.070285] WARNING: CPU: 0 PID: 5 at /var/lib/dkms/amdgpu/5.16.9.22.20-1447096~20.04/build/amd/amdgpu/amdgpu_ras.h:579 amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[   14.070285] Modules linked in: crc32_pclmul hid_generic ib_uverbs ib_core amdgpu(OE+) amd_iommu_v2 amdttm(OE) amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm usbhid nvme libahci i2c_algo_bit hid i40e nvme_core i2c_piix4 wmi
[   14.070299] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G           OE     5.4.0-109-generic #123-Ubuntu
[   14.070300] Hardware name: Supermicro AS -4124GQ-TNMI/H12DGQ-NT6, BIOS 2.4 08/23/2022
[   14.070309] Workqueue: events work_for_cpu_fn
[   14.070366] RIP: 0010:amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[   14.070368] Code: d9 63 59 00 01 e8 6c 88 50 ea 0f 0b 45 31 ff e9 79 ff ff ff 44 89 fe 48 c7 c7 80 c1 c2 c0 c6 05 b9 63 59 00 01 e8 4c 88 50 ea <0f> 0b 45 31 ff e9 ba fe ff ff 48 c7 c7 f8 c1 c2 c0 c6 05 9b 63 59
[   14.070369] RSP: 0018:ffffab01c0287bb8 EFLAGS: 00010286
[   14.070370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000001f2a
[   14.070371] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000247
[   14.070371] RBP: ffffab01c0287be8 R08: 0000000000001f2a R09: 0000000000000004
[   14.070372] R10: 0000000000000000 R11: 0000000000000001 R12: ffff94709362c400
[   14.070372] R13: ffff9470801e0000 R14: ffffffffc0ccda20 R15: 000000000000000f
[   14.070373] FS:  0000000000000000(0000) GS:ffff94710cc00000(0000) knlGS:0000000000000000
[   14.070373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.070374] CR2: 0000000000000000 CR3: 0000007d9c5a0005 CR4: 0000000000760ef0
[   14.070374] PKRU: 55555554
[   14.070374] Call Trace:
[   14.070431]  amdgpu_ras_feature_enable_on_boot+0x48/0xd0 [amdgpu]
[   14.070489]  ? sdma_v4_0_set_ecc_irq_state+0x61/0x70 [amdgpu]
[   14.070537]  amdgpu_ras_block_late_init+0x5c/0x1f0 [amdgpu]
[   14.070592]  ? amdgpu_irq_update+0x85/0xa0 [amdgpu]
[   14.070640]  ? amdgpu_irq_get+0x44/0x60 [amdgpu]
[   14.070691]  ? amdgpu_sdma_ras_late_init+0x7b/0xa0 [amdgpu]
[   14.070739]  amdgpu_ras_late_init+0x34/0x90 [amdgpu]
[   14.070787]  amdgpu_device_ip_late_init+0x7d/0x270 [amdgpu]
[   14.070867]  amdgpu_device_init.cold+0x16a3/0x1ea9 [amdgpu]
[   14.070873]  ? pci_read_config_word+0x27/0x40
[   14.070922]  amdgpu_driver_load_kms+0x1a/0x150 [amdgpu]
[   14.070970]  amdgpu_pci_probe+0x1ed/0x3f0 [amdgpu]
[   14.070975]  local_pci_probe+0x48/0x80
[   14.070976]  work_for_cpu_fn+0x1a/0x30
[   14.070978]  process_one_work+0x1eb/0x3b0
[   14.070979]  worker_thread+0x21e/0x400
[   14.070981]  kthread+0x104/0x140
[   14.070982]  ? process_one_work+0x3b0/0x3b0
[   14.070983]  ? kthread_park+0x90/0x90
[   14.070989]  ret_from_fork+0x22/0x40
[   14.070990] ---[ end trace 7be76cc2cca5f417 ]---

@ppanchad-amd

@sandrain Apologies for the lack of response. Please check if your issue still exists with the latest ROCm 6.2. If not, please close the ticket. Thanks!

@harkgill-amd
Contributor

Hi @sandrain, I was not able to reproduce this issue, and several fixes for similar errors have been released since it was first reported. As a result, I will close this issue for now. If you are still encountering this issue on ROCm 6.2, please leave a comment and I will re-open this ticket.

@sandrain
Author

Hi @harkgill-amd, thanks for your response. You may close the ticket. We cannot reproduce the problem anymore.

@kulnaman

Hello, I am facing the same problem.

  • System: Rocky 9.4
  • GPU: MI210
  • ROCm: 6.2

I have set the performance level to manual using rocm-smi --setperflevel manual.
GPU details:
rocm-smi -a


============================ ROCm System Management Interface ============================
============================== Version of System Component ===============================
Driver version: 6.8.5
==========================================================================================
=========================================== ID ===========================================
GPU[0]          : Device Name:          Instinct MI210
GPU[0]          : Device ID:            0x740f
GPU[0]          : Device Rev:           0x02
GPU[0]          : Subsystem ID:         0x0c34
GPU[0]          : GUID:                 13566
==========================================================================================
======================================= Unique ID ========================================
GPU[0]          : Unique ID: 0xd5a1afd4ec7820c1
==========================================================================================
========================================= VBIOS ==========================================
GPU[0]          : VBIOS version: 113-D67301-059
==========================================================================================
====================================== Temperature =======================================
GPU[0]          : Temperature (Sensor edge) (C): 35.0
GPU[0]          : Temperature (Sensor junction) (C): 36.0
GPU[0]          : Temperature (Sensor memory) (C): 48.0
GPU[0]          : Temperature (Sensor HBM 0) (C): 46.0
GPU[0]          : Temperature (Sensor HBM 1) (C): 44.0
GPU[0]          : Temperature (Sensor HBM 2) (C): 48.0
GPU[0]          : Temperature (Sensor HBM 3) (C): 45.0
==========================================================================================
=============================== Current clock frequencies ================================
GPU[0]          : fclk clock level: 0: (400Mhz)
GPU[0]          : mclk clock level: 3: (1600Mhz)
GPU[0]          : sclk clock level: 0: (1700Mhz)
GPU[0]          : socclk clock level: 3: (1090Mhz)
==========================================================================================
=================================== Current Fan Metric ===================================
GPU[0]          : Not supported
==========================================================================================
================================= Show Performance Level =================================
GPU[0]          : Performance Level: manual
==========================================================================================
==================================== OverDrive Level =====================================
GPU[0]          : get_overdrive_level_sclk, Not supported on the given system
==========================================================================================
==================================== OverDrive Level =====================================
GPU[0]          : get_mem_overdrive_level_mclk, Not supported on the given system
==========================================================================================
======================================= Power Cap ========================================
GPU[0]          : Max Graphics Package Power (W): 300.0
==========================================================================================
================================== Show Power Profiles ===================================
GPU[0]          : get_power_profiles, Not supported on the given system
==========================================================================================
=================================== Power Consumption ====================================
GPU[0]          : Average Graphics Package Power (W): 60.0
==========================================================================================
============================== Supported clock frequencies ===============================
GPU[0]          :
GPU[0]          : Supported fclk frequencies on GPU0
GPU[0]          : 0: 400Mhz *
GPU[0]          :
GPU[0]          : Supported mclk frequencies on GPU0
GPU[0]          : 0: 400Mhz
GPU[0]          : 1: 700Mhz
GPU[0]          : 2: 1200Mhz
GPU[0]          : 3: 1600Mhz *
GPU[0]          :
GPU[0]          : Supported sclk frequencies on GPU0
GPU[0]          : 0: 1700Mhz *
GPU[0]          : 1: 1700Mhz
GPU[0]          :
GPU[0]          : Supported socclk frequencies on GPU0
GPU[0]          : 0: 666Mhz
GPU[0]          : 1: 857Mhz
GPU[0]          : 2: 1000Mhz
GPU[0]          : 3: 1090Mhz *
GPU[0]          : 4: 1333Mhz
GPU[0]          :
GPU[0]          :
------------------------------------------------------------------------------------------
==========================================================================================
=================================== % time GPU is busy ===================================
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 18250668
==========================================================================================
=================================== Current Memory Use ===================================
GPU[0]          : GPU Memory Allocated (VRAM%): 0
GPU[0]          : GPU Memory Read/Write Activity (%): 0
GPU[0]          : Memory Activity: 5365780
GPU[0]          : Avg. Memory Bandwidth: 0
==========================================================================================
===================================== Memory Vendor ======================================
GPU[0]          : GPU memory vendor: hynix
==========================================================================================
================================== PCIe Replay Counter ===================================
GPU[0]          : PCIe Replay Count: 0
==========================================================================================
===================================== Serial Number ======================================
GPU[0]          : Serial Number: 692221000867
==========================================================================================
===================================== KFD Processes ======================================
No KFD PIDs currently running
==========================================================================================
================================== GPUs Indexed by PID ===================================
No KFD PIDs currently running
==========================================================================================
======================= GPU Memory clock frequencies and voltages ========================
GPU[0]          : OD_SCLK:
GPU[0]          : 0: 1700Mhz
GPU[0]          : 1: 1700Mhz
GPU[0]          : OD_MCLK:
GPU[0]          : 0: 400Mhz
GPU[0]          : 1: 1600Mhz
==========================================================================================
==================================== Current voltage =====================================
GPU[0]          : Voltage (mV): 931
==========================================================================================
======================================= PCI Bus ID =======================================
GPU[0]          : PCI Bus: 0000:27:00.0
==========================================================================================
================================== Firmware Information ==================================
GPU[0]          : get_firmware_version_ASD, Not supported on the given system
GPU[0]          : get_firmware_version_CE, Not supported on the given system
GPU[0]          : get_firmware_version_DMCU, Not supported on the given system
GPU[0]          : get_firmware_version_MC, Not supported on the given system
GPU[0]          : get_firmware_version_ME, Not supported on the given system
GPU[0]          : MEC firmware version:         83
GPU[0]          : MEC2 firmware version:        83
GPU[0]          : get_firmware_version_MES, Not supported on the given system
GPU[0]          : get_firmware_version_MES KIQ, Not supported on the given system
GPU[0]          : get_firmware_version_PFP, Not supported on the given system
GPU[0]          : RLC firmware version:         17
GPU[0]          : get_firmware_version_RLC SRLC, Not supported on the given system
GPU[0]          : get_firmware_version_RLC SRLG, Not supported on the given system
GPU[0]          : get_firmware_version_RLC SRLS, Not supported on the given system
GPU[0]          : SDMA firmware version:        8
GPU[0]          : SDMA2 firmware version:       8
GPU[0]          : SMC firmware version:         00.68.60.00
GPU[0]          : SOS firmware version:         0x00270082
GPU[0]          : TA RAS firmware version:      27.00.01.60
GPU[0]          : TA XGMI firmware version:     32.00.00.19
GPU[0]          : get_firmware_version_UVD, Not supported on the given system
GPU[0]          : get_firmware_version_VCE, Not supported on the given system
GPU[0]          : VCN firmware version:         0x0110101c
==========================================================================================
====================================== Product Info ======================================
GPU[0]          : Card Series:          Instinct MI210
GPU[0]          : Card Model:           0x740f
GPU[0]          : Card Vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             D67301
GPU[0]          : Subsystem ID:         0x0c34
GPU[0]          : Device Rev:           0x02
GPU[0]          : Node ID:              2
GPU[0]          : GUID:                 13566
GPU[0]          : GFX Version:          gfx9010
==========================================================================================
======================================= Pages Info =======================================
==========================================================================================
================================= Show Valid sclk Range ==================================
GPU[0]          : Valid sclk range: 1700Mhz - 1700Mhz
==========================================================================================
================================= Show Valid mclk Range ==================================
GPU[0]          : Valid mclk range: 400Mhz - 1600Mhz
==========================================================================================
================================ Show Valid voltage Range ================================
ERROR: GPU[0]   : Voltage curve regions unsupported.
==========================================================================================
================================== Voltage Curve Points ==================================
ERROR: GPU[0]   : Voltage curve Points unsupported.
==========================================================================================
==================================== Consumed Energy =====================================
GPU[0]          : Energy counter: 1497096717012
GPU[0]          : Accumulated Energy (uJ): 22905580055832.14
==========================================================================================
=============================== Current Compute Partition ================================
GPU[0]          : Not supported on the given system
==========================================================================================
================================ Current Memory Partition ================================
GPU[0]          : Not supported on the given system
==========================================================================================
================================== End of ROCm SMI Log ===================================

and running:

sudo rocm-smi --setmclk 2


============================ ROCm System Management Interface ============================
=================================== Set mclk Frequency ===================================
GPU[0]          : set_gpu_clk_freq_mclk, Permission denied
ERROR: GPU[0]   : Unable to set mclk bitmask to: 0x4
==========================================================================================
================================== End of ROCm SMI Log ===================================
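
For comparing the listings in this thread, here is a small Python sketch that extracts the supported levels and the active level (marked with `*`) from the "Supported ... frequencies" text. It assumes the line layout shown in the logs above; the regular expressions and function names are illustrative and not part of rocm-smi.

```python
import re

# "GPU[0]          : Supported mclk frequencies on GPU0"
HEADER_RE = re.compile(r"Supported (\w+) frequencies on GPU\d+")
# "GPU[0]          : 3: 1600Mhz *"
LEVEL_RE = re.compile(r"GPU\[\d+\]\s*:\s*(\d+):\s*(\d+)Mhz\s*(\*)?")


def parse_clock_levels(text):
    """Return {clock: [(level, mhz, is_current), ...]} from rocm-smi text."""
    clocks = {}
    current = None
    for line in text.splitlines():
        if line.startswith(("=", "-")):
            current = None  # a separator line ends the current section
            continue
        header = HEADER_RE.search(line)
        if header:
            current = header.group(1)
            clocks[current] = []
            continue
        match = LEVEL_RE.search(line)
        if match and current is not None:
            level, mhz, star = match.groups()
            clocks[current].append((int(level), int(mhz), star is not None))
    return clocks


SAMPLE = """\
GPU[0]          : Supported mclk frequencies on GPU0
GPU[0]          : 0: 400Mhz
GPU[0]          : 1: 700Mhz
GPU[0]          : 2: 1200Mhz
GPU[0]          : 3: 1600Mhz *
GPU[0]          :
GPU[0]          : Supported sclk frequencies on GPU0
GPU[0]          : 0: 1700Mhz *
GPU[0]          : 1: 1700Mhz
"""

levels = parse_clock_levels(SAMPLE)
print(levels["mclk"])
```

Note that a level appearing in this listing does not mean it can be selected: the MI210 above lists four mclk levels, yet --setmclk still fails.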

harkgill-amd reopened this issue on Sep 26, 2024
@harkgill-amd
Contributor

Hi @kulnaman, thanks for bringing this back to our attention. On MI200/MI210, changing MCLK is not supported; the memory only operates at a single clock.

Despite this, the "set_gpu_clk_freq_mclk, Permission denied" error is misleading, because it suggests a misconfiguration in user space rather than that setmclk is unsupported. We are working towards a fix that will make the error message propagation clearer for users.
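
Until such a fix lands, a script wrapping these calls could distinguish the two cases itself. The Python sketch below is only an illustration under one assumption: that a permission error returned to a process already running as root points to an unsupported or locked operation rather than to missing privileges. The function name and messages are hypothetical and do not reflect how rocm-smi or the planned fix works.

```python
import errno


def explain_clock_write_error(err_no, euid):
    """Classify a failed clock-level write.

    err_no: errno value reported by the failed operation.
    euid:   effective user ID of the calling process (0 means root).
    """
    if err_no in (errno.EPERM, errno.EACCES):
        if euid != 0:
            return "insufficient privileges: retry as root"
        # Root was still refused, so privileges are unlikely to be the cause.
        return ("denied even as root: the device or driver may not "
                "support changing this clock")
    return "unexpected error: " + errno.errorcode.get(err_no, str(err_no))


# A root process (euid 0) that still receives EPERM, as in the logs above
print(explain_clock_write_error(errno.EPERM, 0))
```

In real use, `os.geteuid()` would supply the second argument.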
