Proposal of dynamic GPU slice plugin #3820

Open
wants to merge 1 commit into
base: master

sailorvii

NVIDIA's official GPU sharing mechanisms are time-slicing, MPS, and MIG. Dynamic MPS and MIG partitioning is not currently supported, so we want to add it as a Volcano scheduler plugin.

@volcano-sh-bot
Contributor

Welcome @sailorvii!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign lowang-bh
You can assign the PR to them by writing /assign @lowang-bh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 14, 2024
@Monokaix
Member

Hi, please squash to one commit and sign off.

Member

@JesseStutler JesseStutler left a comment

I have reviewed it; please take a look.

docs/design/dynamic-gpu-slice.md (outdated, resolved)
Member

Where is the logic of this AddPod? Is it in the mig-agent of nos? I'm wondering whether our dynamic GPU slice plugin depends strongly on the nos project. The annotation bears the nos name, and the nos project is not updated frequently.

Author

  1. The AddPod logic is in addResource, in volcano/pkg/scheduler/api/node_info.go.
  2. Three components can be reused from the nos project: the MIG agent, the MPS agent, and the MPS device plugin. They are not the most important part; if needed, we could rewrite them.

Member

/cc @Monokaix, I think we'd better rewrite them as part of Volcano so that they evolve with us.

docs/design/images/dynamicGPUSliceSlice.png (outdated, resolved)
docs/design/images/dynamicGPUSliceScore.png (outdated, resolved)
docs/design/dynamic-gpu-slice.md (outdated, resolved)
@archlitchi
Contributor

A nice feature, but I have a few recommendations:

  1. Please add a user guide for using dynamic MIG and MPS.
  2. Please clarify whether the 'dynamicgpuslice' annotation is a pod annotation or a node annotation.

Refine as JesseStutler's comments
Address the comments by archlitchi.

Signed-off-by: sailorvii <[email protected]>
Signed-off-by: chenw66 <[email protected]>
@sailorvii
Author

archlitchi

Thanks for your time and review.

  1. Added the usage part.
  2. They're all node annotations (the section title says “Node labels and annotations”).

@sailorvii sailorvii closed this Nov 25, 2024
@sailorvii sailorvii reopened this Nov 25, 2024
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: dynamicgpuslice
Member

How about using the deviceshare plugin?
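
For context, the fragment under review is part of the Volcano scheduler configuration. A minimal sketch of how the two alternatives discussed here might look side by side is given below. It assumes the proposed plugin is registered under the name dynamicgpuslice, as in this PR, and that deviceshare would be enabled through its existing vGPU switch. The argument name shown is an assumption taken from the vGPU use case and may not be how dynamic MIG/MPS slicing would eventually be exposed.

```yaml
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  # Option A: the standalone plugin name proposed in this PR.
  - name: dynamicgpuslice
  # Option B (reviewer's suggestion): fold the feature into the existing
  # deviceshare plugin instead of adding a new one.
  # - name: deviceshare
  #   arguments:
  #     deviceshare.VGPUEnable: true  # assumption: existing vGPU switch, shown for illustration only
```

Option B would keep GPU sharing behind a single plugin entry, at the cost of having to reconcile vGPU and MIG/MPS semantics inside deviceshare.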

@Monokaix
Member

We should clarify which device plugin (dp) the user should deploy, and the relationship between dynamic MIG slicing and vGPU. The semantics of vGPU and dynamic MIG slicing are not fully consistent. Whether to use the NVIDIA device plugin or HAMi needs further discussion.

@JesseStutler
Member

Shall we discuss how to evolve this feature at the weekly meeting? Currently there seem to be three repos involved: Volcano does the scheduling, HAMi provides the device plugin, and nos provides the MIG/MPS agents. That is too fragmented. @sailorvii @archlitchi @Monokaix

@sailorvii
Author

> Shall we discuss how to evolve this feature at the weekly meeting? Currently there seem to be three repos involved: Volcano does the scheduling, HAMi provides the device plugin, and nos provides the MIG/MPS agents. That is too fragmented. @sailorvii @archlitchi @Monokaix

Thank you all for your time. It's good to discuss the details in the meeting.

Labels
retest-not-required-docs-only size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants