Monitor - Upgrade pyrsmi to amdsmi python library. #601

Merged
merged 6 commits into release/0.10 from guzhao/amdml_upgrade on Dec 22, 2023

Conversation

guoshzhao
Contributor

Description
Upgrade to the amdsmi Python library, since pyrsmi will be retired, as suggested by AMD:

AMD SMI Python Library: https://github.com/ROCm/amdsmi/tree/develop/py-interface
pyrsmi: https://github.com/RadeonOpenCompute/pyrsmi

guoshzhao requested review from cp5555 and yukirora December 20, 2023 06:17
guoshzhao requested a review from a team as a code owner December 20, 2023 06:17

codecov bot commented Dec 20, 2023

Codecov Report

Attention: 47 lines in your changes are missing coverage. Please review.

Comparison is base (6e50f02) 86.12% compared to head (754dcea) 85.78%.

| Files | Patch % | Lines |
|---|---|---|
| superbench/common/utils/device_manager.py | 2.08% | 47 Missing ⚠️ |
Additional details and impacted files
@@               Coverage Diff                @@
##           release/0.10     #601      +/-   ##
================================================
- Coverage         86.12%   85.78%   -0.35%     
================================================
  Files                97       97              
  Lines              6878     6902      +24     
================================================
- Hits               5924     5921       -3     
- Misses              954      981      +27     
| Flag | Coverage | Patch | Δ |
|---|---|---|---|
| cpu-python3.6-unit-test | 71.59% | 0.00% | -0.26% ⬇️ |
| cpu-python3.7-unit-test | 71.59% | 0.00% | -0.26% ⬇️ |
| cpu-python3.8-unit-test | 72.01% | 0.00% | -0.26% ⬇️ |
| cuda-unit-test | 83.86% | 0.00% | -0.30% ⬇️ |
| directx-unit-test | 34.57% | 2.08% | -0.72% ⬇️ |


cp5555 added the monitor label Dec 20, 2023
cp5555 mentioned this pull request Dec 20, 2023
30 tasks
@yukirora
Contributor

Hi @guoshzhao, please check these error messages from MI300:
[image: screenshot of the error messages]

@guoshzhao
Contributor Author

guoshzhao commented Dec 21, 2023

Thanks. I just checked: the GPU utilization and temperature APIs work on MI250 but appear to be unsupported on MI300.
For the GPU memory API, I have fixed the issue.
For the ECC API, the errors are expected, so I have changed the log level to 'info'.
In addition, I have changed all other log levels from 'error' to 'warning' to avoid confusion when an incompatibility occurs.
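
For illustration only, the log-level handling described above could look roughly like the sketch below. This is not the code from `device_manager.py`: the helper name `query_metric`, the logger name, and the stand-in query functions are all hypothetical, and no real amdsmi call is used. The idea is that a failed vendor query returns a default value and is logged at a level chosen per metric, so an unsupported API on one GPU model does not read like a benchmark error.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("device_manager_sketch")


def query_metric(name, query_fn, failure_level=logging.WARNING, default=None):
    """Call query_fn() and return its result, or `default` if it raises.

    A failure is logged at `failure_level` rather than ERROR, so a metric
    that is unsupported on a given GPU does not look like a hard failure.
    """
    try:
        return query_fn()
    except Exception as exc:  # vendor bindings raise library-specific errors
        logger.log(failure_level, "Get %s failed: %s", name, exc)
        return default


def unsupported_query():
    # Stand-in for a vendor call that is not supported on this GPU model.
    raise RuntimeError("not supported")


# A query that succeeds returns its value unchanged.
temperature = query_metric("device temperature", lambda: 45)

# An expected failure (for example ECC counters) is logged at INFO only.
ecc_errors = query_metric("device ECC errors", unsupported_query,
                          failure_level=logging.INFO)
```

Passing the level as a parameter keeps the decision about how loud a failure should be next to the call site of each metric.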

@yukirora
Contributor

Can we change the warning to be output only once per benchmark? There are too many warnings in the log because of this.
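
One possible way to get the once-per-benchmark behavior requested above is a `logging.Filter` that drops repeated records. The sketch below is an assumption about how it could be done, not what this PR implements; `WarnOnceFilter`, `ListHandler`, and the logger name are hypothetical names introduced for the example. The filter is reset where a new benchmark would start.

```python
import logging


class WarnOnceFilter(logging.Filter):
    """Drop records whose level and message template were already seen."""

    def __init__(self):
        super().__init__()
        self._seen = set()

    def filter(self, record):
        # Key on the unformatted template so per-device arguments
        # do not defeat the de-duplication.
        key = (record.levelno, str(record.msg))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def reset(self):
        """Forget seen messages, e.g. when a new benchmark starts."""
        self._seen.clear()


class ListHandler(logging.Handler):
    """Collect formatted messages in memory so the effect is visible."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("monitor_sketch")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)
once = WarnOnceFilter()
logger.addFilter(once)

# Three devices hit the same unsupported API; only one warning is kept.
for device_id in range(3):
    logger.warning("Get device utilization failed: %s", "not supported")

# A new benchmark starts: reset so the warning can appear once more.
once.reset()
logger.warning("Get device utilization failed: %s", "not supported")
```

Keying on the message template means warnings that differ only in their arguments are collapsed as well, which may or may not be desirable depending on how much per-device detail the log should keep.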

yukirora enabled auto-merge (squash) December 22, 2023 12:17
yukirora merged commit c635f75 into release/0.10 Dec 22, 2023
19 of 20 checks passed
yukirora deleted the guzhao/amdml_upgrade branch December 22, 2023 16:01
abuccts pushed a commit that referenced this pull request Jan 3, 2024
abuccts added a commit that referenced this pull request Jan 8, 2024
**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang
Co-authored-by: Yang Wang
Co-authored-by: Yuting Jiang
Co-authored-by: guoshzhao