Added GPU enabled sandbox image. (v2?) #4340
base: master
Conversation
Signed-off-by: Future Outlier <[email protected]>
… sandbox-enabled-gpu Signed-off-by: Future Outlier <[email protected]>
Signed-off-by: Future Outlier <[email protected]>
… sandbox-enabled-gpu
Signed-off-by: Danny Farrell <[email protected]>
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
Thanks a lot for your help; you and the author of the first PR really made a significant contribution to Flyte.
Hi, thanks a lot for your contributions.
Here are some questions!
The questions above are related to the 1st GPU PR's discussion here.
```yaml
namespace: kube-system
spec:
  chart: nvidia-device-plugin
  repo: https://nvidia.github.io/k8s-device-plugin
```
```diff
 # enable controllers
-sed -e 's/ / +/g' -e 's/^/+/' <"/sys/fs/cgroup/cgroup.controllers" >"/sys/fs/cgroup/cgroup.subtree_control"
+sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
```
I guess the GPU sandbox will use this command:
```sh
xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
```
Can you explain why the GPU sandbox doesn't use busybox?
I know the reason; it is because of this change:
https://github.com/moby/moby/blob/ed89041433a031cafc0a0f19cfe573c31688d377/hack/dind#L28-L37
busybox isn't installed on the base image (nvidia/cuda:11.8.0-base-ubuntu22.04) by default. Either we install busybox or do a check similar to this; a sketch follows below.
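For illustration only, a hedged sketch of what such a guard could look like (the linked check isn't reproduced here; this is my own guess at its shape):
```sh
# Hypothetical guard: use busybox's xargs when it exists, otherwise fall
# back to the system xargs; either way, move root-cgroup processes into /init.
mkdir -p /sys/fs/cgroup/init
if command -v busybox >/dev/null 2>&1; then
    busybox xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
else
    xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
fi
```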
Thanks, I am not sure whether it is necessary or not.
@jeevb Can you take a look?
Thanks a lot
```
{{- if .NodeConfig.AgentConfig.PauseImage }}
sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
{{end}}

{{- if .NodeConfig.AgentConfig.Snapshotter }}
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
snapshotter = "{{ .NodeConfig.AgentConfig.Snapshotter }}"
disable_snapshot_annotations = {{ if eq .NodeConfig.AgentConfig.Snapshotter "stargz" }}false{{else}}true{{end}}
{{ if eq .NodeConfig.AgentConfig.Snapshotter "stargz" }}
{{ if .NodeConfig.AgentConfig.ImageServiceSocket }}
[plugins."io.containerd.snapshotter.v1.stargz"]
cri_keychain_image_service_path = "{{ .NodeConfig.AgentConfig.ImageServiceSocket }}"
[plugins."io.containerd.snapshotter.v1.stargz".cri_keychain]
enable_keychain = true
{{end}}
{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors."{{$k}}"]
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{if $v.Rewrites}}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors."{{$k}}".rewrite]
{{range $pattern, $replace := $v.Rewrites}}
"{{$pattern}}" = "{{$replace}}"
{{end}}
{{end}}
{{end}}
{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.configs."{{$k}}".auth]
{{ if $v.Auth.Username }}username = {{ printf "%q" $v.Auth.Username }}{{end}}
{{ if $v.Auth.Password }}password = {{ printf "%q" $v.Auth.Password }}{{end}}
{{ if $v.Auth.Auth }}auth = {{ printf "%q" $v.Auth.Auth }}{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = {{ printf "%q" $v.Auth.IdentityToken }}{{end}}
{{end}}
{{ if $v.TLS }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.configs."{{$k}}".tls]
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{ if $v.TLS.InsecureSkipVerify }}insecure_skip_verify = true{{end}}
{{end}}
{{end}}
{{end}}
{{end}}
{{end}}

{{- if not .NodeConfig.NoFlannel }}
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = {{ .SystemdCgroup }}

{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."{{$k}}"]
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{if $v.Rewrites}}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."{{$k}}".rewrite]
{{range $pattern, $replace := $v.Rewrites}}
"{{$pattern}}" = "{{$replace}}"
{{end}}
{{end}}
{{end}}

{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins."io.containerd.grpc.v1.cri".registry.configs."{{$k}}".auth]
{{ if $v.Auth.Username }}username = {{ printf "%q" $v.Auth.Username }}{{end}}
{{ if $v.Auth.Password }}password = {{ printf "%q" $v.Auth.Password }}{{end}}
{{ if $v.Auth.Auth }}auth = {{ printf "%q" $v.Auth.Auth }}{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = {{ printf "%q" $v.Auth.IdentityToken }}{{end}}
{{end}}
{{ if $v.TLS }}
[plugins."io.containerd.grpc.v1.cri".registry.configs."{{$k}}".tls]
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{ if $v.TLS.InsecureSkipVerify }}insecure_skip_verify = true{{end}}
{{end}}
{{end}}
{{end}}

{{range $k, $v := .ExtraRuntimes}}
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}"]
runtime_type = "{{$v.RuntimeType}}"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}".options]
BinaryName = "{{$v.BinaryName}}"
{{end}}
```
Would you like to provide the source URL? Thanks very much.
```dockerfile
ENV CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml

ENTRYPOINT [ "/bin/k3d-entrypoint.sh" ]
CMD [ "server", "--disable=traefik", "--disable=servicelb" ]
```
Would you like to explain the relationship between the Dockerfile and Dockerfile.gpu in the same directory?
Co-authored-by: Future-Outlier <[email protected]> Signed-off-by: Daniel Farrell <[email protected]>
Signed-off-by: Daniel Farrell <[email protected]>
Signed-off-by: Daniel Farrell <[email protected]>
I think after we solve the security issue and remove everything about the …
Co-authored-by: Future-Outlier <[email protected]> Signed-off-by: Daniel Farrell <[email protected]>
Signed-off-by: Danny Farrell <[email protected]>
Signed-off-by: Danny Farrell <[email protected]>
Signed-off-by: Danny Farrell <[email protected]>
Signed-off-by: Danny Farrell <[email protected]>
Signed-off-by: Danny Farrell <[email protected]>
@danpf, it looks good to me. I think after removing these 2 changes, it's time to merge it. Thanks a lot.
Co-authored-by: Future-Outlier <[email protected]> Signed-off-by: Daniel Farrell <[email protected]>
Co-authored-by: Future-Outlier <[email protected]> Signed-off-by: Daniel Farrell <[email protected]>
Do you think we could get anyone to try and follow/install this? Does it still work for you on WSL?
@pingsutw will use an EC2 instance to test this.
It works on WSL, but WSL needs some additional settings, which are complicated for me. In my WSL, I saw all the GPU-related pods start, so I think it's correct.
Hey folks. I am working on a project that would greatly benefit from tasks being able to utilize GPUs in the sandbox. What is the current status of this PR?
It works, but tests haven't been added and it hasn't been reviewed by other maintainers.
You can run:
```sh
cd flyte
gh pr checkout 4340
make build_gpu
```
to create the image, thank you!
Do we still need help testing/installing this?
@davidmirror-ops The current instructions in the OP are up to date (to my knowledge, but it has been some time). We couldn't convince anyone to test/install this. You will need an NVIDIA GPU to do so.
I am building a PC to function as a private workstation. I will be getting a 4090 in about two weeks. I can test once it is finished. This contribution is extremely useful for my intent, thank you for developing the feature!
Hey @granthamtaylor, did you have a chance to try this one?
Preface: This combines work done by @ahlgol and @Future-Outlier with some extra testing/eval and a bunch of NVIDIA-headache fixes to get it working fully on Ubuntu Server. #3256
If @ahlgol merges this into the previous PR, this one will close; otherwise we can just use this one (I kept the previous PR's commits).
Setup / testing
0. Prerequisites
Ensure you have installed all of the following and can run them:
Installing the NVIDIA Container Toolkit:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
Nvidia container-toolkit sample-workload:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
Support for Container Device Interface:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html
NVIDIA device plugin for Kubernetes (Finish All the Quick Start)
https://github.com/NVIDIA/k8s-device-plugin#quick-start
A public Docker registry may be necessary (I pushed just to make sure), so log in with:
```sh
docker login
```
general reqs:
flyte envd reqs:
```sh
pip install flytekitplugins-envd
```
My env (may or may not be necessary), checked with:
```sh
docker context list
lsb_release -a
nvidia-smi
```
1. Get branch
Download the branch, build the Dockerfile, tag the image, and push it:
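The original commands were lost with the page's code blocks; a minimal sketch, using the build target named earlier in this thread and treating the registry user and image tag as placeholders:
```sh
cd flyte
gh pr checkout 4340    # fetch this PR's branch

# "make build_gpu" is the target mentioned above in the thread; the
# tag/push names below are placeholders, not from the original post.
make build_gpu
docker tag flyte-sandbox-gpu:latest <your-registry-user>/flyte-sandbox-gpu:latest
docker push <your-registry-user>/flyte-sandbox-gpu:latest
```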
2. Start the cluster
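The exact command here was also lost; a sketch assuming flytectl's demo cluster is pointed at the image from step 1 (flytectl demo start accepts an --image flag; the image name is the same placeholder):
```sh
flytectl demo start --image <your-registry-user>/flyte-sandbox-gpu:latest
```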
3. See if you can use the GPU
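One way to check, adapted from the NVIDIA device plugin quick start (the pod name gpu-check is illustrative): run a one-off pod that requests a GPU and prints nvidia-smi:
```sh
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# The pod should reach Completed, and the logs should show the nvidia-smi table.
kubectl logs gpu-check
```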
4. Run the final job:
Create the runme.py script shown below, and then run it.
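The exact run command was lost; assuming the workflow in runme.py is named wf (as in the sketch under "Testing scripts" below), something like:
```sh
pyflyte run --remote runme.py wf
```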
Testing scripts
Quickly rebuild and push your Docker image (change the name, obviously):
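The script body is gone; a sketch that simply chains the step-1 build/tag/push (names are placeholders):
```sh
make build_gpu \
  && docker tag flyte-sandbox-gpu:latest <your-registry-user>/flyte-sandbox-gpu:latest \
  && docker push <your-registry-user>/flyte-sandbox-gpu:latest
```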
Start a new Flyte sandbox cluster:
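Also lost from the page; a sketch assuming flytectl demo, tearing down any previous cluster first:
```sh
flytectl demo teardown
flytectl demo start --image <your-registry-user>/flyte-sandbox-gpu:latest
```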
This is the final Flyte script to check whether your GPU is working:
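The original script was lost with the page's code blocks; a minimal stand-in (not the author's runme.py) that requests one GPU via flytekit's Resources and returns the nvidia-smi output:
```python
# runme.py -- minimal GPU smoke test; a stand-in for the lost original.
import subprocess

from flytekit import Resources, task, workflow


@task(requests=Resources(gpu="1"), limits=Resources(gpu="1"))
def check_gpu() -> str:
    # nvidia-smi is injected into the container by the NVIDIA runtime,
    # which this sandbox image is assumed to configure as containerd's default.
    return subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    ).stdout


@workflow
def wf() -> str:
    return check_gpu()
```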
Proof!
Previous PR
A new Dockerfile and build-target "build-gpu" in docker/sandbox-bundled that builds a CUDA-enabled image named flyte-sandbox-gpu.
Describe your changes
Check all the applicable boxes
I updated the documentation accordingly.
All new and existing tests passed.
Note to reviewers
Changes have been added following info from these sources (plus some trial and error):
https://itnext.io/enabling-nvidia-gpus-on-k3s-for-cuda-workloads-a11b96f967b0
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
https://k3d.io/v5.4.6/usage/advanced/cuda/