support mig devices #53
…. Adding some enhanced error catching for nvml queries
Signed-off-by: Bram Vogelaar <[email protected]>
spot check:
🎉
// A30/A100 MIG devices have no stats.
//
// https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#telemetry
//
// Is this fixed on H100 or later? Maybe?
if mode == mig || mode == parent {
	continue
}
Would it be safe to attempt to call DeviceInfoAndStatusByUUID and log/continue on error? I'd just hate for this to be something NVIDIA fixes in a driver update, and then our plugin languishes for months without support because we don't even try.
If we log, we end up emitting a log line for each MIG device every period. In the worst case that's 7 MIG devices on each of 8 GPUs, so 56 log lines every 30 seconds, which is a lot of log spam for the hope that NVIDIA will fix their stuff.
nvml/driver_linux.go
Outdated
}
utzEncU := uint(utzEnc)
memUsedU := mem.Used / (1 << 20)
Add a comment and/or a named const for 1 << 20 (bytes per MiB).
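A minimal sketch of what that could look like, assuming the memory value is a byte count held in a uint64. The names `bytesPerMiB` and `bytesToMiB` are illustrative and not taken from the PR.

```go
package main

import "fmt"

// bytesPerMiB is the number of bytes in one mebibyte (2^20).
const bytesPerMiB = 1 << 20

// bytesToMiB converts a byte count to whole mebibytes, truncating any remainder.
func bytesToMiB(b uint64) uint64 {
	return b / bytesPerMiB
}

func main() {
	// 3 GiB expressed in MiB.
	fmt.Println(bytesToMiB(3 << 30)) // prints 3072
}
```

Naming the constant documents the unit at the call site, so a line such as the memUsedU assignment no longer needs a separate comment.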
Co-authored-by: Michael Schurter <[email protected]>
Incorporates previous PRs by @attachmentgenie and @isidentical while fixing a couple of bugs and adding MIG-specific tests to the mock driver implementation.
Closes #3
Closes #27
Closes #40