Monitor - Upgrade pyrsmi to amdsmi python library. #601

guoshzhao · 2023-12-20T06:17:21Z

Description
Upgrade to the amdsmi Python library, since pyrsmi will be retired, as suggested by AMD:

AMD SMI Python Library: https://github.com/ROCm/amdsmi/tree/develop/py-interface
pyrsmi: https://github.com/RadeonOpenCompute/pyrsmi
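
For readers unfamiliar with the new library, here is a minimal sketch of reading GPU utilization through amdsmi. It is not the code from this PR (that lives in superbench/common/utils/device_manager.py and is not shown here). The helper name `read_gfx_activity` is made up for illustration, and the amdsmi calls used (`amdsmi_init`, `amdsmi_get_processor_handles`, `amdsmi_get_gpu_activity`, `amdsmi_shut_down`) are taken from the py-interface README as of late 2023, so they may differ in other releases.

```python
import logging

logger = logging.getLogger(__name__)

try:
    import amdsmi  # AMD SMI Python bindings; normally only present on ROCm hosts
except Exception:  # ImportError, or a missing native library at import time
    amdsmi = None


def read_gfx_activity():
    """Return the 'gfx_activity' reading per GPU, or [] if amdsmi is unusable.

    Hypothetical helper for illustration only.
    """
    if amdsmi is None:
        logger.warning('amdsmi is not available; skipping AMD GPU metrics.')
        return []
    try:
        amdsmi.amdsmi_init()
    except Exception as err:
        logger.warning('amdsmi_init failed: %s', err)
        return []

    readings = []
    try:
        for handle in amdsmi.amdsmi_get_processor_handles():
            try:
                activity = amdsmi.amdsmi_get_gpu_activity(handle)
                readings.append(activity['gfx_activity'])
            except Exception as err:
                # Some metrics may be unsupported on a given GPU model.
                logger.warning('Get device utilization failed: %s', err)
                readings.append(None)
    except Exception as err:
        logger.warning('Could not enumerate AMD GPUs: %s', err)
    finally:
        try:
            amdsmi.amdsmi_shut_down()
        except Exception:
            pass
    return readings


if __name__ == '__main__':
    print(read_gfx_activity())
```

On a machine without amdsmi or without AMD GPUs the helper simply returns an empty list, so it is safe to call from vendor-agnostic monitoring code.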

codecov · 2023-12-20T06:26:48Z

Codecov Report

Attention: 47 lines in your changes are missing coverage. Please review.

Comparison is base (6e50f02) 86.12% compared to head (754dcea) 85.78%.

Files	Patch %	Lines
superbench/common/utils/device_manager.py	2.08%	47 Missing ⚠️

Additional details and impacted files

@@               Coverage Diff                @@
##           release/0.10     #601      +/-   ##
================================================
- Coverage         86.12%   85.78%   -0.35%     
================================================
  Files                97       97              
  Lines              6878     6902      +24     
================================================
- Hits               5924     5921       -3     
- Misses              954      981      +27
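
The percentages in the diff above can be reproduced from the Hits and Lines rows. The sketch below assumes the report rounds down to two decimals; that assumption is inferred from the numbers themselves, not stated anywhere in this thread. The function names are illustrative.

```python
import math


def coverage(hits, lines):
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


def floor_pct(value, digits=2):
    """Round down (toward negative infinity) to `digits` decimals."""
    scale = 10 ** digits
    return math.floor(value * scale) / scale


base = coverage(5924, 6878)  # base (6e50f02)
head = coverage(5921, 6902)  # head (754dcea)

print(floor_pct(base))         # 86.12
print(floor_pct(head))         # 85.78
print(floor_pct(head - base))  # -0.35
```

Rounding down also explains why the reported delta is -0.35 even though 85.78 minus 86.12 is -0.34: the delta is computed from the unrounded values (about -0.343) and then floored.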

Flag	Coverage Δ
cpu-python3.6-unit-test	`71.59% <0.00%> (-0.26%)`	⬇️
cpu-python3.7-unit-test	`71.59% <0.00%> (-0.26%)`	⬇️
cpu-python3.8-unit-test	`72.01% <0.00%> (-0.26%)`	⬇️
cuda-unit-test	`83.86% <0.00%> (-0.30%)`	⬇️
directx-unit-test	`34.57% <2.08%> (-0.72%)`	⬇️

Flags with carried forward coverage won't be shown.

yukirora · 2023-12-20T13:50:52Z

Hi @guoshzhao, please check these error messages from MI300.

guoshzhao · 2023-12-21T02:34:06Z

> Hi @guoshzhao, please check these error messages from MI300.

Thanks. I just checked that the GPU utilization and temperature APIs work on MI250. They do not appear to be supported on MI300.
For the GPU memory API, I have fixed it.
For the ECC API, the errors are expected, so I have changed the log level to 'info'.
Besides, I have changed all other log levels from 'error' to 'warning' to avoid misunderstanding when an incompatibility occurs.
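
The log-level policy described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `query_metric` and the stand-in query functions are hypothetical names, and only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger('monitor')


def query_metric(name, query, level=logging.WARNING, default=None):
    """Run `query()`; on failure, log at `level` and return `default`.

    Failures that are expected on some hardware (the ECC query in the
    comment above) can pass level=logging.INFO; everything else defaults
    to WARNING rather than ERROR.
    """
    try:
        return query()
    except Exception as err:
        logger.log(level, 'Get device %s failed: %s', name, err)
        return default


# Hypothetical stand-ins for real device queries.
def _unsupported_ecc():
    raise RuntimeError('not supported on this GPU')


def _memory_used():
    return 1024


ecc_count = query_metric('ecc', _unsupported_ecc, level=logging.INFO, default=0)
memory_used = query_metric('memory', _memory_used)
```

Keeping the level as a parameter means a single wrapper covers both the expected ECC failures and the genuinely surprising ones, and the monitor keeps running either way.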

yukirora · 2023-12-21T10:43:36Z

Can we change the warning so it is only output once per benchmark? There are too many warnings in the log because of this.
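
One possible way to get a warning emitted only once per benchmark is a `logging.Filter` that remembers which warning templates it has already let through. This is a sketch of the idea, not necessarily what the later "revise logging" commit does; `WarnOnceFilter` and its `reset` method are hypothetical names.

```python
import logging


class WarnOnceFilter(logging.Filter):
    """Let each distinct WARNING message template through only once."""

    def __init__(self):
        super().__init__()
        self._seen = set()

    def filter(self, record):
        if record.levelno != logging.WARNING:
            return True  # other levels are never suppressed
        # Key on the unformatted template so per-device arguments
        # do not make every record look unique.
        key = (record.name, str(record.msg))
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def reset(self):
        """Forget seen warnings, e.g. at the start of each benchmark."""
        self._seen.clear()


logger = logging.getLogger('monitor.warn_once_demo')
warn_once = WarnOnceFilter()
logger.addFilter(warn_once)

for device_id in range(8):
    # Only the first of these eight warnings reaches the handlers.
    logger.warning('Get device %s utilization failed.', device_id)
```

Calling `warn_once.reset()` when a new benchmark starts would let the same warning appear once more for that benchmark. Note that a filter attached to a logger only applies to records logged directly on that logger, not on its children.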

**Description**
Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**
* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang
Co-authored-by: Yang Wang
Co-authored-by: Yuting Jiang
Co-authored-by: guoshzhao

upgrade pyrsmi to amdsmi python library.

62632a7

guoshzhao requested review from cp5555 and yukirora December 20, 2023 06:17

guoshzhao requested a review from a team as a code owner December 20, 2023 06:17

cp5555 approved these changes Dec 20, 2023

cp5555 added the monitor label Dec 20, 2023

cp5555 mentioned this pull request Dec 20, 2023

V0.10.0 Release Plan #559

Closed

30 tasks

guoshzhao added 3 commits December 21, 2023 10:00

change error to info

be46903

fix memory api

8fdb58e

change log level from error to warning

dd9f798

guoshzhao and others added 2 commits December 22, 2023 15:07

revise logging.

6fd5418

Merge branch 'release/0.10' into guzhao/amdml_upgrade

754dcea

yukirora approved these changes Dec 22, 2023

yukirora enabled auto-merge (squash) December 22, 2023 12:17

yukirora merged commit c635f75 into release/0.10 Dec 22, 2023
19 of 20 checks passed

yukirora deleted the guzhao/amdml_upgrade branch December 22, 2023 16:01

abuccts mentioned this pull request Jan 3, 2024

Release - SuperBench v0.10.0 #607

Merged
