Monitor - Upgrade pyrsmi to amdsmi python library. #601
Codecov Report

| Coverage Diff | release/0.10 | #601 | +/- |
|---|---|---|---|
| Coverage | 86.12% | 85.78% | -0.35% |
| Files | 97 | 97 | |
| Lines | 6878 | 6902 | +24 |
| Hits | 5924 | 5921 | -3 |
| Misses | 954 | 981 | +27 |
Hi @guoshzhao, please check these error messages from MI300.
Thanks. I just checked that the GPU utilization and temperature APIs work on MI250. They do not appear to be supported on MI300.
Can we change the warning so that it is output only once per benchmark? It currently produces too many warnings in the log.
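One way to get the once-per-benchmark behaviour requested above is a small wrapper that remembers which warnings it has already emitted. This is only a sketch using the standard `logging` module; the `WarnOnce` class, its method names, and the key strings are hypothetical and are not taken from this PR.

```python
import logging


class WarnOnce:
    """Emit each distinct warning key at most once until reset() is called."""

    def __init__(self, log: logging.Logger) -> None:
        self._log = log
        self._seen = set()

    def warning(self, key: str, message: str) -> bool:
        """Log `message` the first time `key` is seen; return True if logged."""
        if key in self._seen:
            return False
        self._seen.add(key)
        self._log.warning(message)
        return True

    def reset(self) -> None:
        """Forget all keys, e.g. when a new benchmark starts."""
        self._seen.clear()
```

A monitor loop could call `warning("gpu_temp", ...)` on every sample and `reset()` at the start of each benchmark, so an unsupported metric shows up once per benchmark instead of once per sample.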
**Description**
Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**
* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - Upgrade pyrsmi to amdsmi python library. #601
* Benchmarks: Micro benchmarks - add fp8 and initialization for hipblaslt benchmark #605
* Dockerfile - Add rocm6.0 dockerfile #602
* Bug Fix - Bug fix for latest megatron-lm benchmark #600
* Docs - Upgrade version and release note #606

Co-authored-by: Ziyue Yang <[email protected]>
Co-authored-by: Yang Wang <[email protected]>
Co-authored-by: Yuting Jiang <[email protected]>
Co-authored-by: guoshzhao <[email protected]>
Description
Upgrade to the amdsmi Python library, since pyrsmi will be retired, as suggested by AMD:
AMD SMI Python Library: https://github.com/ROCm/amdsmi/tree/develop/py-interface
pyrsmi: https://github.com/RadeonOpenCompute/pyrsmi
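For illustration, a monitor could read utilization and temperature through amdsmi roughly as sketched below. This is not the code from this PR. The helper name `read_gpu_metrics` and the returned keys are hypothetical, and the amdsmi calls (`amdsmi_get_gpu_activity`, `amdsmi_get_temp_metric`, and the `AmdSmiTemperatureType` and `AmdSmiTemperatureMetric` enums) are used as I understand the library's Python interface, so check them against the installed version. The module is passed in as a parameter so the sketch can be exercised without AMD hardware.

```python
from typing import Any, Dict, Optional


def read_gpu_metrics(smi: Any, handle: Any) -> Dict[str, Optional[float]]:
    """Read GPU utilization and edge temperature; unsupported metrics become None.

    `smi` is assumed to be the imported `amdsmi` module and `handle` one of the
    handles returned by `smi.amdsmi_get_processor_handles()`.
    """
    metrics: Dict[str, Optional[float]] = {"gpu_util": None, "gpu_temp": None}

    # Each metric is read independently, so one unsupported API (as reported
    # for MI300 in this thread) does not hide the other metric.
    try:
        metrics["gpu_util"] = smi.amdsmi_get_gpu_activity(handle)["gfx_activity"]
    except Exception:  # broad on purpose: the exact exception type is an assumption
        pass

    try:
        metrics["gpu_temp"] = smi.amdsmi_get_temp_metric(
            handle,
            smi.AmdSmiTemperatureType.EDGE,
            smi.AmdSmiTemperatureMetric.CURRENT,
        )
    except Exception:
        pass

    return metrics


# Assumed usage on a machine with ROCm and amdsmi installed:
#   import amdsmi
#   amdsmi.amdsmi_init()
#   for h in amdsmi.amdsmi_get_processor_handles():
#       print(read_gpu_metrics(amdsmi, h))
#   amdsmi.amdsmi_shut_down()
```

Returning `None` for a metric that cannot be read leaves it to the caller to decide whether to warn, skip the sample, or fall back to another source.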