# NVIDIA GPU MPS and MIG dynamic slice plugin

## Introduction

NVIDIA GPUs offer three built-in sharing methods: time-slicing, MPS, and MIG. Time-slicing loses time to context switches, so we chose MPS and MIG. MIG profiles are variable: a user can request a MIG device for any profile definition, but the current implementation requires the dedicated profiles to be defined before the user's request arrives, which limits the usability of MIG. We want to develop a plugin that slices the GPU automatically, creating the slice when the user requests it, and we also add MPS support.
For the scheduling method, node-level binpack and spread will be supported. Following the binpack plugin, we consider CPU, memory, GPU memory, and other user-defined resources.

## Targets

- Combined scheduling of CPU, memory, and GPU
- Dynamic GPU slicing, covering both MPS and MIG
- Node-level binpack and spread by GPU memory

## Limitations

- x86 architecture only
- Kubernetes 1.26 or higher

## Structure

![](images/dynamicGPUSliceStructure.png)

### Config maps

- volcano-scheduler-configmap

  This configmap defines the plugin configuration, including the schedule policy, the weights, and the MIG geometries.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: dynamicgpuslice
        arguments:
          dynamicgpuslice.schedulePolicy: spread
          dynamicgpuslice.scheduleWeight: 10
          dynamicgpuslice.weight.cpu: 1
          dynamicgpuslice.weight.memory: 2
          dynamicgpuslice.weight.gpu: 5
          dynamicgpuslice.DevicePluginCMName: mps-configmap
          dynamicgpuslice.DevicePluginCMNamespace: default
          dynamicgpuslice.KnownGeometries:
          - models: ["NVIDIA-A100-SXM4-80GB", "NVIDIA-A100-80GB-PCIe"]
            allowedGeometries:
            - 1g.10gb: 7
            - 2g.20gb: 3
            - 3g.40gb: 2
            - 4g.40gb: 1
            - 7g.80gb: 1
```
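
As a rough sketch of how the plugin might consume these arguments, the snippet below parses the policy and weights from a generic key/value map (the real Volcano framework hands arguments to plugins in its own structure; the names and defaults here are illustrative assumptions):

```go
package main

import "fmt"

// pluginConfig holds the parsed dynamicgpuslice arguments from
// volcano-scheduler-configmap.
type pluginConfig struct {
	schedulePolicy string
	scheduleWeight int
	weights        map[string]int // cpu, memory, gpu
}

// parseArguments reads the plugin arguments, modeled here as a plain
// map[string]interface{}; unknown or missing keys fall back to defaults.
func parseArguments(args map[string]interface{}) pluginConfig {
	cfg := pluginConfig{
		schedulePolicy: "binpack", // assumed default policy
		scheduleWeight: 1,
		weights:        map[string]int{"cpu": 1, "memory": 1, "gpu": 1},
	}
	if v, ok := args["dynamicgpuslice.schedulePolicy"].(string); ok {
		cfg.schedulePolicy = v
	}
	if v, ok := args["dynamicgpuslice.scheduleWeight"].(int); ok {
		cfg.scheduleWeight = v
	}
	for _, r := range []string{"cpu", "memory", "gpu"} {
		if v, ok := args["dynamicgpuslice.weight."+r].(int); ok {
			cfg.weights[r] = v
		}
	}
	return cfg
}

func main() {
	cfg := parseArguments(map[string]interface{}{
		"dynamicgpuslice.schedulePolicy": "spread",
		"dynamicgpuslice.scheduleWeight": 10,
		"dynamicgpuslice.weight.gpu":     5,
	})
	fmt.Printf("%+v\n", cfg)
}
```
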
- mps-configmap

  This configmap defines the MPS configuration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-configmap
data:
  default: |
    version: v1
    flags:
      migStrategy: none
    sharing:
      mps:
        failRequestsGreaterThanOne: true
        resources:
        - devices:
          - "0"
          memoryGB: 4
          name: nvidia.com/gpu
          rename: gpu-4gb
          replicas: 2
        - devices:
          - "0"
          memoryGB: 2
          name: nvidia.com/gpu
          rename: gpu-2gb
          replicas: 2
```
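
With this example, the device plugin would advertise two additional resources carved out of GPU "0", presumably named nvidia.com/gpu-4gb and nvidia.com/gpu-2gb after the `rename` fields, with two replicas each; a pod would then request a slice through its resource limits, e.g. `nvidia.com/gpu-4gb: 1`. (The exact resource naming is an assumption based on the config shown, not a statement of the final implementation.)
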
### Node labels and annotations

Labels:
- gpu.partitioning: selects the partitioning method: MIG, MPS, or default (no partitioning)

Annotations:
- spec-gpu-*: when MIG is enabled, the expected GPU partition state (the key is a prefix, e.g. spec-gpu-0-2gb in the example below)
- spec-partitioning-plan: the current partition plan ID
- status-gpu-*: the GPU partition status (also a prefix, e.g. status-gpu-0-2gb-used), reported by the agent
- status-partitioning-plan: the applied plan ID

All of these fields are packed into a single annotation entry as JSON:
```
dynamicgpuslice: '{"spec-gpu-0-2gb": 2, "spec-partitioning-plan": 123456, "status-gpu-0-2gb-used": 1, "status-partitioning-plan": 123456}'
```
We use the plan ID for synchronization. When the scheduler plugin plans a GPU partition, it applies the configuration to the configmap or to the spec-gpu annotation and sets the plan ID. If the plan has not finished when another pod-add request arrives, the adding process returns immediately and waits for the plan to finish.
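
As a minimal, standalone sketch of that check, assuming the JSON annotation layout shown above (the function and variable names are illustrative, not the actual plugin code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// planInSync parses the dynamicgpuslice annotation and reports whether the
// last planned partition (spec-partitioning-plan) has been applied by the
// agent (status-partitioning-plan). The annotation maps keys to integers,
// as in the example above.
func planInSync(annotation string) (bool, error) {
	var fields map[string]int64
	if err := json.Unmarshal([]byte(annotation), &fields); err != nil {
		return false, fmt.Errorf("bad dynamicgpuslice annotation: %w", err)
	}
	spec, ok := fields["spec-partitioning-plan"]
	if !ok {
		return true, nil // nothing has been planned yet
	}
	return spec == fields["status-partitioning-plan"], nil
}

func main() {
	ann := `{"spec-gpu-0-2gb": 2, "spec-partitioning-plan": 123456,
	         "status-gpu-0-2gb-used": 1, "status-partitioning-plan": 123456}`
	inSync, err := planInSync(ann)
	if err != nil {
		panic(err)
	}
	// When the plan IDs differ, a new pod-add request returns immediately
	// and is retried once the agent reports the plan as finished.
	fmt.Println("plan applied:", inSync)
}
```
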
### Modules

- GPU device API: probes the GPUs and stores them in the node info structure; reads GPU info back from the node info structure and plans the GPU partition when a pod is added. When the slice type is MPS, it writes the MPS config.
- GPU schedule plugin: reads the scheduler plugin configuration, including the weights and the MIG geometries; runs the GPU predicate at the allocate stage; at the best-node stage, scores the nodes by the weight configuration and the current resource usage (see the scoring sketch after this list).
- Device plugin: inherits from the official NVIDIA device plugin; adds MPS device support in the device plugin and two other containers:
  * Config manager: reads the mps-configmap, translates it into a config file, and shares it with the device plugin; kills the device plugin process, reads the node annotation "status-gpu", and writes the annotation "status-plan" once the spec equals the status.
  * Device plugin: fakes the MPS devices according to the config and interacts with the kubelet.
  * MPS server: the official NVIDIA MPS server daemon.
- GPU agent: reads the GPU usage through the kubelet API and updates the node annotation "status-gpu".
- MIG agent: calls the NVML library to set up MIG devices according to the node annotation "spec-gpu" and sets "status-plan"; collects the MIG device usage and updates the node annotation "status-gpu".
- MIG manager: enables and disables MIG. Enabling or disabling MIG has some limitations: all MIG-related pods need to be stopped first. The official MIG manager already implements this, so we reuse it but change its main process to only enable and disable MIG.
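
As a rough illustration of the node scoring described in the GPU schedule plugin item, here is a standalone sketch of weighted binpack/spread scoring over CPU, memory, and GPU memory, using the weights from volcano-scheduler-configmap; the exact formula in the real plugin may differ.

```go
package main

import "fmt"

// resourceUsage describes one dimension on a node: how much is used
// (including the pod being scheduled) and how much is allocatable.
type resourceUsage struct {
	weight float64 // e.g. dynamicgpuslice.weight.cpu
	used   float64
	alloc  float64
}

// score computes a node score in [0, 100]. With policy "binpack", fuller
// nodes score higher; with "spread", emptier nodes score higher. The final
// plugin score would then be scaled by dynamicgpuslice.scheduleWeight.
func score(policy string, dims []resourceUsage) float64 {
	var weighted, total float64
	for _, d := range dims {
		if d.alloc == 0 {
			continue // skip dimensions the node does not provide
		}
		weighted += d.weight * (d.used / d.alloc)
		total += d.weight
	}
	if total == 0 {
		return 0
	}
	s := weighted / total * 100
	if policy == "spread" {
		return 100 - s
	}
	return s
}

func main() {
	// Weights from the example configmap: cpu=1, memory=2, gpu=5.
	dims := []resourceUsage{
		{weight: 1, used: 6, alloc: 8},   // CPU cores
		{weight: 2, used: 16, alloc: 32}, // memory GiB
		{weight: 5, used: 40, alloc: 80}, // GPU memory GiB
	}
	fmt.Printf("binpack: %.1f, spread: %.1f\n",
		score("binpack", dims), score("spread", dims))
}
```
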
### Procedures

#### Scheduler plugin initialization

During initialization, the plugin reads the volcano-scheduler-configmap and stores the weights, the policy, and the supported MIG geometries; it also registers two functions with the "allocate" action: a predicate function and a node-order function.
![](images/dynamicGPUSliceInit.png)
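
The registration might look roughly like the following standalone sketch; the session type and callback signatures are simplified stand-ins for Volcano's framework types, not the real API.

```go
package main

import "fmt"

// Simplified stand-ins for the scheduler framework: the real Volcano session
// passes task and node info structures rather than plain strings.
type (
	predicateFn func(task, node string) error
	nodeOrderFn func(task, node string) (float64, error)
)

type session struct {
	predicates map[string]predicateFn
	nodeOrders map[string]nodeOrderFn
}

type dynamicGPUSlicePlugin struct{ name string }

// onSessionOpen registers the two callbacks used by the "allocate" action:
// a predicate (can this node host the task?) and a node-order function
// (how good is this node for the task?).
func (p *dynamicGPUSlicePlugin) onSessionOpen(ssn *session) {
	ssn.predicates[p.name] = func(task, node string) error {
		// Check MIG/MPS capacity on the node; an error filters the node out.
		return nil
	}
	ssn.nodeOrders[p.name] = func(task, node string) (float64, error) {
		// Score the node with the weighted binpack/spread formula.
		return 0, nil
	}
}

func main() {
	ssn := &session{
		predicates: map[string]predicateFn{},
		nodeOrders: map[string]nodeOrderFn{},
	}
	p := &dynamicGPUSlicePlugin{name: "dynamicgpuslice"}
	p.onSessionOpen(ssn)
	fmt.Println("registered callbacks for:", p.name)
}
```
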
#### Allocate and add pod

![](images/dynamicGPUSlicePredicate.png)

![](images/dynamicGPUSliceScore.png)

#### GPU partition plan and status

The best geometry profile is the one that fits the most slices, so we select the profile with the highest slice count, as in the sketch below.
![](images/dynamicGPUSliceSlice.png)
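
A standalone sketch of that selection rule: among the allowed geometries for the GPU model that are large enough for the request, pick the profile that yields the most slices per GPU. The types and the per-slice memory field are illustrative assumptions.

```go
package main

import "fmt"

// geometry is one allowed MIG profile for a GPU model, e.g. "1g.10gb" with
// capacity 7, taken from dynamicgpuslice.KnownGeometries.
type geometry struct {
	profile  string
	memoryGB int // memory per slice, parsed from the profile name
	slices   int // how many such slices fit on one GPU
}

// bestGeometry returns the allowed geometry that satisfies the requested
// slice memory and yields the most slices per GPU.
func bestGeometry(allowed []geometry, requestGB int) (geometry, bool) {
	var best geometry
	found := false
	for _, g := range allowed {
		if g.memoryGB < requestGB {
			continue // slice too small for the request
		}
		if !found || g.slices > best.slices {
			best, found = g, true
		}
	}
	return best, found
}

func main() {
	// Allowed geometries for NVIDIA-A100-SXM4-80GB from the example config.
	allowed := []geometry{
		{"1g.10gb", 10, 7}, {"2g.20gb", 20, 3}, {"3g.40gb", 40, 2},
		{"4g.40gb", 40, 1}, {"7g.80gb", 80, 1},
	}
	if g, ok := bestGeometry(allowed, 10); ok {
		fmt.Println("selected profile:", g.profile) // 1g.10gb: most slices
	}
}
```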