feature: Support AMDGPU Data Collection #1641

Open: wants to merge 3 commits into main


@yretenai yretenai commented Dec 2, 2024

Description

Adds GPU metrics gathering via amdgpu_top's libamdgpu_top crate on Linux.

[image]

Some notes: the library queries /proc/{pid}/fdinfo, which can probably be parsed without amdgpu_top's libraries. Intel Arc apparently also uses this fdinfo but I cannot confirm.
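For reference, the fdinfo files mentioned above are plain `key: value` text, so a library-free parser can be small. The following is a minimal sketch, not the code from this PR: the function names `parse_drm_fdinfo` and `leading_number` are hypothetical, and the keys used in the usage note (`drm-driver`, `drm-pdev`, `drm-memory-vram`, `drm-engine-gfx`) are assumed to follow the kernel's DRM client usage stats format.

```rust
use std::collections::HashMap;

/// Parse the `key: value` lines of one fdinfo file into a map,
/// keeping only the keys that start with `drm-`.
/// (Hypothetical helper, not taken from the PR.)
fn parse_drm_fdinfo(contents: &str) -> HashMap<String, String> {
    contents
        .lines()
        // Split at the first colon only, so values such as a PCI address
        // ("0000:03:00.0") stay intact.
        .filter_map(|line| line.split_once(':'))
        .filter(|(key, _)| key.starts_with("drm-"))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

/// Extract the leading integer from a value that carries a unit suffix,
/// e.g. "2048 KiB" or "123456 ns". Returns None if there is no integer.
fn leading_number(value: &str) -> Option<u64> {
    value.split_whitespace().next()?.parse().ok()
}
```

A caller would read each `/proc/{pid}/fdinfo/{fd}` file with `std::fs::read_to_string`, pass the contents to `parse_drm_fdinfo`, and skip files whose map is empty or whose `drm-driver` value is not `amdgpu`.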

Testing

If relevant, please state how this was tested. All changes must be tested to work:

If this is a code change, please also indicate which platforms were tested:

  • Windows
  • macOS
  • Linux

Checklist

If relevant, ensure the following have been met:

  • Areas your change affects have been linted using rustfmt (cargo fmt)
  • The change has been tested and doesn't appear to cause any unintended breakage
  • Documentation has been added/updated if needed (README.md, help menu, doc pages, etc.)
  • The pull request passes the provided CI pipeline
  • There are no merge conflicts
  • If relevant, new tests were added (don't worry too much about coverage)

codecov bot commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 6.62983% with 338 lines in your changes missing coverage. Please review.

Project coverage is 41.37%. Comparing base (1fe17dd) to head (95eab0f).
Report is 2 commits behind head on main.

Files with missing lines      Patch %   Lines
src/data_collection/amd.rs    3.33%     319 Missing ⚠️
src/data_collection.rs        40.62%    19 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1641      +/-   ##
==========================================
- Coverage   42.03%   41.37%   -0.67%     
==========================================
  Files         116      118       +2     
  Lines       17625    17926     +301     
==========================================
+ Hits         7409     7417       +8     
- Misses      10216    10509     +293     
Flag            Coverage Δ
macos-14        37.38% <0.00%> (-0.08%) ⬇️
ubuntu-latest   43.09% <6.62%> (-0.73%) ⬇️
windows-2019    37.31% <0.00%> (-0.08%) ⬇️


@yretenai
Author

yretenai commented Dec 2, 2024

CI is failing because the libdrm/libdrm_amdgpu libraries are missing. What's the best way forward here?

@ClementTsang ClementTsang self-assigned this Dec 2, 2024
@ClementTsang
Owner

Can the libraries be installed?

@yretenai
Author

yretenai commented Dec 2, 2024

libdrm should be available in most Linux distros via their package managers. I'm currently investigating a dependency-free solution that parses /proc/*/fdinfo/* instead.

@yretenai
Author

yretenai commented Dec 4, 2024

Most recent commit parses AMD GPU metrics via procfs (for per-process utilization and video memory usage) and sysfs (for overall AMDGPU memory usage and temperature sensors) and as such doesn't rely on any libraries.

However, the code is significantly more complex.
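To illustrate why per-process utilization needs a previous sample: the `drm-engine-*` fdinfo values are assumed here to be cumulative busy times in nanoseconds, so a percentage only falls out of the difference between two readings. This is a sketch under that assumption, with a hypothetical function name, and is not the PR's implementation.

```rust
use std::time::Duration;

/// Percentage of the sampling interval during which an engine was busy,
/// computed from two cumulative busy-time readings in nanoseconds.
/// (Hypothetical helper, not taken from the PR.)
fn engine_usage_percent(prev_ns: u64, curr_ns: u64, elapsed: Duration) -> f64 {
    let elapsed_ns = elapsed.as_nanos() as f64;
    if elapsed_ns == 0.0 {
        // No time has passed, so no meaningful ratio exists.
        return 0.0;
    }
    // saturating_sub guards against a counter that appears to go backwards,
    // for example if the client was recreated between samples.
    let busy_ns = curr_ns.saturating_sub(prev_ns) as f64;
    (busy_ns / elapsed_ns * 100.0).clamp(0.0, 100.0)
}
```

For example, readings of 1,000,000 ns and 251,000,000 ns taken one second apart correspond to 250 ms of busy time, which is 25% of the interval.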

@ClementTsang ClementTsang changed the title Support AMDGPU Data Collection feature: Support AMDGPU Data Collection Dec 6, 2024
Owner

@ClementTsang ClementTsang left a comment

Left a few comments for now, mostly looks good though.

src/data_collection.rs (outdated, resolved)
}

// needs previous state for usage calculation
static PROC_DATA: LazyLock<Mutex<HashMap<PathBuf, HashMap<u32, AMDGPUProc>>>> =
Owner

non-blocking: This doesn't need to be changed for this PR, but a mutex around this seems a little overkill.

Author

It's the only way I could get a mutable reference to PROC_DATA without making the static itself mutable, and thus requiring unsafe code to access.
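The pattern under discussion can be sketched as follows. This is a simplified illustration, not the PR's code: the static `PREV_BUSY_NS` and the function `swap_reading` are hypothetical, and the map stores a plain `u64` per PID rather than the PR's `AMDGPUProc` type. It assumes Rust 1.80 or later, where `std::sync::LazyLock` is stable.

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// Hypothetical stand-in for the PR's per-process state: the last cumulative
// busy time (in nanoseconds) seen for each PID.
static PREV_BUSY_NS: LazyLock<Mutex<HashMap<u32, u64>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

/// Store the latest reading for `pid` and return the previous one, if any.
fn swap_reading(pid: u32, busy_ns: u64) -> Option<u64> {
    // lock() returns a guard that dereferences mutably to the HashMap, so
    // neither `static mut` nor `unsafe` is needed. A poisoned lock is
    // recovered rather than propagated, since the map holds only plain data.
    let mut state = PREV_BUSY_NS.lock().unwrap_or_else(|e| e.into_inner());
    state.insert(pid, busy_ns)
}
```

One alternative that avoids the global entirely would be to keep the map in whatever struct owns the collection loop and pass it by `&mut`. Whether that fits this codebase is not settled in the thread.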

src/data_collection/amd.rs (outdated, resolved)
Comment on lines +117 to +119
// get vram memory info from sysfs
let vram_total_path = device_path.join("mem_info_vram_total");
let vram_used_path = device_path.join("mem_info_vram_used");
Owner

Would you happen to know whether any of these checks in this file wake up the GPU if it is currently sleeping? We had an issue with temperature checks in https://github.com/ClementTsang/bottom/blob/main/src/data_collection/temperature/linux.rs#L226 where we were waking up devices (mainly GPUs) when checking their temperature, for example.

Author

I do not, but I can ask around.
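One possible way to guard the reads is sketched below. It rests on two unverified assumptions: that the device directory exposes a `power_state` attribute reporting values such as `D0` or `D3cold`, and that reading that attribute does not itself wake the GPU. Whether reading `mem_info_vram_*` wakes a sleeping AMD GPU at all remains the open question from this thread. The function names `is_awake`, `read_u64`, and `vram_used_total` are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// Returns true when the device reports power state D0, or when the state
/// cannot be read at all (assumption: a missing attribute means the check
/// is not applicable, so the caller proceeds as before).
fn is_awake(device_path: &Path) -> bool {
    match fs::read_to_string(device_path.join("power_state")) {
        Ok(state) => state.trim() == "D0",
        Err(_) => true,
    }
}

/// Read a sysfs file that holds a single unsigned integer.
fn read_u64(path: &Path) -> Option<u64> {
    fs::read_to_string(path).ok()?.trim().parse().ok()
}

/// Return (used, total) VRAM as reported by sysfs, or None if the device
/// looks asleep or either file is missing or malformed.
fn vram_used_total(device_path: &Path) -> Option<(u64, u64)> {
    if !is_awake(device_path) {
        return None;
    }
    let used = read_u64(&device_path.join("mem_info_vram_used"))?;
    let total = read_u64(&device_path.join("mem_info_vram_total"))?;
    Some((used, total))
}
```

Returning `None` for a sleeping device means the caller would show no VRAM figure rather than a stale one. Caching the last known value is another option, at the cost of more state.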

})
}

pub fn get_amd_temp(device_path: &Path) -> Option<Vec<AMDGPUTemperature>> {
Owner

Author

Besides having better names, no. They end up symlinking to the same hwmon endpoint.

@yretenai
Author

yretenai commented Dec 6, 2024

> Left a few comments for now, mostly looks good though.

I will make the necessary changes tomorrow.

@yretenai yretenai force-pushed the amdgpu branch 2 times, most recently from e2cd20d to a26e2b2, December 7, 2024 17:59
@yretenai
Author

yretenai commented Dec 7, 2024

Force pushes were rewording commit messages.

yretenai and others added 3 commits December 7, 2024 18:01
gpu: fix clippy issues

Co-authored-by: lvxnull2 <[email protected]>
…ead of current memory usage

gpu: requested syntax changes

Co-authored-by: lvxnull2 <[email protected]>
@yretenai
Author

yretenai commented Dec 7, 2024

I accidentally reset the signature of the 4th commit from HEAD, which I just fixed by resetting the entire branch, apologies! History should be preserved now.

@jamartin9 jamartin9 mentioned this pull request Dec 10, 2024