Added GPU enabled sandbox image. (v2?) #4340

Open · wants to merge 16 commits into master

Conversation

@danpf commented Nov 1, 2023

Preface: This combines work done by @ahlgol and @Future-Outlier with some extra testing, evaluation, and a bunch of NVIDIA-headache fixes to get it working fully on Ubuntu Server. #3256

If @ahlgol merges this into the previous PR, this one will close; otherwise we can just use this one (I kept the previous PR's commits).

Setup / testing

0. Prerequisites

Ensure you have these installed and that you can run them all.

My environment (may or may not be necessary; a quick sanity check for these configs is sketched after the list):

  • /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
  • docker context list
NAME        DESCRIPTION                               DOCKER ENDPOINT               ERROR
default *   Current DOCKER_HOST based configuration   unix:///var/run/docker.sock
  • /etc/containerd/config.toml
version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true
  • lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
  • nvidia-smi
Wed Nov  1 03:52:14 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:05.0 Off |                    0 |
| N/A   32C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
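If helpful, a quick sanity check that the configs above took effect (a sketch, assuming the NVIDIA Container Toolkit is installed; not part of the PR's setup steps):

# restart the daemons so the configs above are picked up
sudo systemctl restart docker containerd
# should print "nvidia" given the daemon.json above
docker info --format '{{.DefaultRuntime}}'
# should show the same nvidia-smi table as on the host
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi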

1. Get branch

Download the branch, build the image from the Dockerfile, tag it, and push it:

git clone https://github.com/danpf/flyte
cd flyte
git checkout danpf-sandbox-gpu
cd docker/sandbox-bundled
make build-gpu
docker tag flyte-sandbox-gpu:latest dancyrusbio/flyte-sandbox-gpu:latest
docker login
docker push dancyrusbio/flyte-sandbox-gpu:latest

2. Start the cluster

flytectl demo start --image dancyrusbio/flyte-sandbox-gpu:latest --disable-agent --force

3. See if you can use the GPU

$ kubectl describe node | grep -i gpu
  nvidia.com/gpu:     2
  nvidia.com/gpu:     2
  nvidia.com/gpu     0           0
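If the nvidia.com/gpu resource does not show up, a hypothetical troubleshooting step (not part of this PR's instructions) is to check that the NVIDIA device plugin pods came up and that the node actually advertises GPUs:

# check the device plugin / GPU operator pods
kubectl get pods -A | grep -i nvidia
# check the node's capacity/allocatable entries
kubectl get nodes -o json | grep -i 'nvidia.com/gpu'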

4. Run the final job

Create the runme.py script shown below, and then run:

pyflyte run --remote runme.py check_if_gpu_available

Testing scripts

# create_envd_context.sh
envd context create --name flyte-sandbox --builder tcp --builder-address localhost:30003 --use

Quickly rebuild and push your Docker image (change the image name to your own, obviously):

# rebuild.sh
make build-gpu && docker tag flyte-sandbox-gpu dancyrusbio/flyte-sandbox-gpu && docker push dancyrusbio/flyte-sandbox-gpu

Start a new Flyte sandbox cluster:

# start_new_flyte_cluster.sh
flytectl demo start --image dancyrusbio/flyte-sandbox-gpu:latest --disable-agent --force

This is the final Flyte script to check whether your GPU is working:

# runme.py
from flytekit import ImageSpec, Resources, task

gpu = "1"

@task(
    retries=2,
    cache=True,
    cache_version="1.0",
    requests=Resources(gpu=gpu),
    environment={"PYTHONPATH": "/root"},
    container_image=ImageSpec(
            cuda="11.8.0",
            python_version="3.9.13",
            packages=["flytekit", "torch"],
            apt_packages=["git"],
            registry="localhost:30000",
    )
)
def check_if_gpu_available() -> bool:
    import torch
    return torch.cuda.is_available()
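As a quick sanity check of the task definition itself, you can also run it locally first (a sketch, assuming flytekit and torch are installed locally; this executes on your machine, so it reports your local GPU, not the sandbox's):

# local run (no --remote): executes check_if_gpu_available in your local environment
pyflyte run runme.py check_if_gpu_available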

Proof!

$ kubectl describe node | grep -i gpu
  nvidia.com/gpu:     2
  nvidia.com/gpu:     2
  nvidia.com/gpu     0           0

Previous PR

A new Dockerfile and build target "build-gpu" in docker/sandbox-bundled that builds a CUDA-enabled image named flyte-sandbox-gpu.
Describe your changes

Build target added in Makefile for "build-gpu" that builds Dockerfile.gpu
Build target added in Makefile for "manifests-gpu" that adds gpu-operator.yaml to the manifests
Dockerfile.gpu is based on the existing Dockerfile, but uses a base image from NVIDIA, installs k3s and crictl, and adds a containerd config template for the NVIDIA container runtime
Adds bin/k3d-entrypoint-gpu-check.sh, which checks whether the container was started in an NVIDIA-enabled image and exits otherwise (a rough sketch of such a check is shown after this list)
bin/k3d-entrypoint.sh has been modified to let stderr pass through to the output, so warnings from the other entrypoint scripts can be seen (they will now be missing from the logfile, however)
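For readers who haven't opened the diff, a rough, hypothetical sketch of what an entrypoint GPU check of this kind might look like (not the PR's exact script; see docker/sandbox-bundled/bin/k3d-entrypoint-gpu-check.sh for the real one):

#!/bin/sh
# Hypothetical sketch only: fail fast if the container was not started with the
# NVIDIA runtime, i.e. the driver utilities were never injected into the container.
if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "gpu-check: nvidia-smi not found; is the NVIDIA container runtime configured?" >&2
    exit 1
fi
if ! nvidia-smi >/dev/null 2>&1; then
    echo "gpu-check: nvidia-smi failed; no usable GPU visible in this container" >&2
    exit 1
fi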

Check all the applicable boxes

I updated the documentation accordingly.
All new and existing tests passed.

All commits are signed-off.

Note to reviewers

Changes have been added following info from these sources (plus some trial and error):
https://itnext.io/enabling-nvidia-gpus-on-k3s-for-cuda-workloads-a11b96f967b0
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
https://k3d.io/v5.4.6/usage/advanced/cuda/

Future Outlier and others added 5 commits October 15, 2023 09:57
Signed-off-by: Future Outlier <[email protected]>
… sandbox-enabled-gpu

Signed-off-by: Future Outlier <[email protected]>
Signed-off-by: Future Outlier <[email protected]>
Signed-off-by: Danny Farrell <[email protected]>

welcome bot commented Nov 1, 2023

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@Future-Outlier (Member)

Thanks a lot for your help; you and the author of the first PR have really made a significant contribution to Flyte.

@danpf marked this pull request as ready for review November 1, 2023 03:53
@Future-Outlier (Member)

Hi, thanks a lot for your contributions.
These are really amazing.

@Future-Outlier (Member)

Here are some questions!
I believe that if you can answer them, you will help lots of Flyte users use the sandbox GPU image, and also help reviewers review it more easily.

  1. Do we need to taint the GPU node? Why or why not?

  2. Do we need to set the config in the flyte sandbox-config? Why or why not?

  3. Do we need to change the k3d-entrypoint-gpu-check permissions? Why or why not?

  4. Does the CUDA version need to be the same as your GPU's CUDA version? Does it have any limit?

Those questions above are related to the 1st GPU PR's discussion here.
#3256 (comment)

docker/sandbox-bundled/bin/k3d-entrypoint-gpu-check.sh (review thread; outdated, resolved)
docker/sandbox-bundled/kustomize/gpu-operator.yaml (review thread; outdated, resolved)
namespace: kube-system
spec:
chart: nvidia-device-plugin
repo: https://nvidia.github.io/k8s-device-plugin
Member

Suggested change
repo: https://nvidia.github.io/k8s-device-plugin
repo: https://nvidia.github.io/k8s-device-plugin

docker/sandbox-bundled/manifests/complete-agent.yaml (review thread; outdated, resolved)
# enable controllers
sed -e 's/ / +/g' -e 's/^/+/' <"/sys/fs/cgroup/cgroup.controllers" >"/sys/fs/cgroup/cgroup.subtree_control"
sed -e 's/ / +/g' -e 's/^/+/' < /sys/fs/cgroup/cgroup.controllers > /sys/fs/cgroup/cgroup.subtree_control
Member

I guess that the GPU sandbox will use this command.
xargs -rn1 < /sys/fs/cgroup/cgroup.procs > /sys/fs/cgroup/init/cgroup.procs || :
Can you explain why the GPU sandbox doesn't use busybox?


Author

busybox isn't installed on the base image (nvidia/cuda:11.8.0-base-ubuntu22.04) by default. Either we install busybox or do a check similar to this.

Member

Thanks, I am not sure whether it is necessary or not.
@jeevb Can you take a look?
Thanks a lot

Comment on lines 1 to 118

{{- if .NodeConfig.AgentConfig.PauseImage }}
sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
{{end}}

{{- if .NodeConfig.AgentConfig.Snapshotter }}
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
snapshotter = "{{ .NodeConfig.AgentConfig.Snapshotter }}"
disable_snapshot_annotations = {{ if eq .NodeConfig.AgentConfig.Snapshotter "stargz" }}false{{else}}true{{end}}
{{ if eq .NodeConfig.AgentConfig.Snapshotter "stargz" }}
{{ if .NodeConfig.AgentConfig.ImageServiceSocket }}
[plugins."io.containerd.snapshotter.v1.stargz"]
cri_keychain_image_service_path = "{{ .NodeConfig.AgentConfig.ImageServiceSocket }}"
[plugins."io.containerd.snapshotter.v1.stargz".cri_keychain]
enable_keychain = true
{{end}}
{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors."{{$k}}"]
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{if $v.Rewrites}}
[plugins."io.containerd.snapshotter.v1.stargz".registry.mirrors."{{$k}}".rewrite]
{{range $pattern, $replace := $v.Rewrites}}
"{{$pattern}}" = "{{$replace}}"
{{end}}
{{end}}
{{end}}
{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.configs."{{$k}}".auth]
{{ if $v.Auth.Username }}username = {{ printf "%q" $v.Auth.Username }}{{end}}
{{ if $v.Auth.Password }}password = {{ printf "%q" $v.Auth.Password }}{{end}}
{{ if $v.Auth.Auth }}auth = {{ printf "%q" $v.Auth.Auth }}{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = {{ printf "%q" $v.Auth.IdentityToken }}{{end}}
{{end}}
{{ if $v.TLS }}
[plugins."io.containerd.snapshotter.v1.stargz".registry.configs."{{$k}}".tls]
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{ if $v.TLS.InsecureSkipVerify }}insecure_skip_verify = true{{end}}
{{end}}
{{end}}
{{end}}
{{end}}
{{end}}

{{- if not .NodeConfig.NoFlannel }}
[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = {{ .SystemdCgroup }}

{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."{{$k}}"]
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{if $v.Rewrites}}
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."{{$k}}".rewrite]
{{range $pattern, $replace := $v.Rewrites}}
"{{$pattern}}" = "{{$replace}}"
{{end}}
{{end}}
{{end}}

{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins."io.containerd.grpc.v1.cri".registry.configs."{{$k}}".auth]
{{ if $v.Auth.Username }}username = {{ printf "%q" $v.Auth.Username }}{{end}}
{{ if $v.Auth.Password }}password = {{ printf "%q" $v.Auth.Password }}{{end}}
{{ if $v.Auth.Auth }}auth = {{ printf "%q" $v.Auth.Auth }}{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = {{ printf "%q" $v.Auth.IdentityToken }}{{end}}
{{end}}
{{ if $v.TLS }}
[plugins."io.containerd.grpc.v1.cri".registry.configs."{{$k}}".tls]
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{ if $v.TLS.InsecureSkipVerify }}insecure_skip_verify = true{{end}}
{{end}}
{{end}}
{{end}}

{{range $k, $v := .ExtraRuntimes}}
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}"]
runtime_type = "{{$v.RuntimeType}}"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."{{$k}}".options]
BinaryName = "{{$v.BinaryName}}"
{{end}}
Member

Would you like to provide the source URL?
Thanks very much.

docker/sandbox-bundled/bin/k3d-entrypoint-gpu-check.sh (review thread; outdated, resolved)
ENV CRI_CONFIG_FILE=/var/lib/rancher/k3s/agent/etc/crictl.yaml

ENTRYPOINT [ "/bin/k3d-entrypoint.sh" ]
CMD [ "server", "--disable=traefik", "--disable=servicelb" ]
Member

Would you like to explain the relationship between Dockerfile and Dockerfile.gpu in the same directory?

@Future-Outlier (Member)

I think after we solve the security issue and remove everything about the GPU operator file, this PR can be merged. Thanks for all your work.

@danpf (Author)

danpf commented Nov 8, 2023

I'm not sure where else to explain this, but to answer any questions about Dockerfile.gpu vs. the Dockerfile:

Here is a side-by-side diff screenshot of the two files:
[screenshot: side-by-side diff of Dockerfile and Dockerfile.gpu]

The differences between the two files are shown in red. Essentially everything added to Dockerfile.gpu is there because the base image of k3s is scratch, while the base image of our CUDA image is Ubuntu. So you need to install a few requirements, install crictl, set the kubectl alias, and set some extra volumes/paths (at least according to the various docs).
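If you'd rather inspect the exact delta than read the screenshot, a simple way (on the PR branch) is:

# compare the stock sandbox Dockerfile with the GPU variant
cd docker/sandbox-bundled
diff -u Dockerfile Dockerfile.gpu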

docker/sandbox-bundled/Makefile (two review threads; outdated, resolved)
@Future-Outlier (Member)

@danpf, it looks good to me. I think after removing these 2 changes, it's time to merge it. Thanks a lot.

danpf and others added 2 commits November 7, 2023 22:59
Co-authored-by: Future-Outlier <[email protected]>
Signed-off-by: Daniel Farrell <[email protected]>
Co-authored-by: Future-Outlier <[email protected]>
Signed-off-by: Daniel Farrell <[email protected]>
@danpf (Author)

danpf commented Nov 8, 2023

Do you think we could get anyone to try and follow/install this? Does it still work for you on WSL?

@Future-Outlier (Member)

@pingsutw will use an EC2 instance to test this.

@Future-Outlier (Member)

Future-Outlier commented Nov 9, 2023

It works on WSL, but WSL needs some additional settings, which are complicated for me. In my WSL, I saw all the GPU-related pods start, so I think it's correct.

@granthamtaylor

Hey folks. I am working on a project that would greatly benefit from tasks being able to utilize GPUs in the sandbox. What is the current status of this PR?

@Future-Outlier (Member)

> Hey folks. I am working on a project that would greatly benefit from tasks being able to utilize GPUs in the sandbox. What is the current status of this PR?

It works, but we haven't added tests yet and it hasn't been reviewed by other maintainers.

@Future-Outlier (Member)

> Hey folks. I am working on a project that would greatly benefit from tasks being able to utilize GPUs in the sandbox. What is the current status of this PR?

You can

cd flyte
gh pr checkout 4340
cd docker/sandbox-bundled
make build-gpu

to create the image, thank you!

@davidmirror-ops (Contributor)

Do we still need help testing/installing this?
If so, what are the most up-to-date instructions?

@danpf (Author)

danpf commented Feb 20, 2024

@davidmirror-ops The current instructions in the OP are up to date (to my knowledge, but it has been some time). We couldn't convince anyone to test/install this. You will need an NVIDIA GPU to do so.

@granthamtaylor

granthamtaylor commented Feb 20, 2024

I am building a PC to function as a private workstation. I will be getting a 4090 in about two weeks. I can test once it is finished.

This contribution is extremely useful for my intent, thank you for developing the feature!

@davidmirror-ops (Contributor)

Hey @granthamtaylor did you have a chance to try this one?
