Unable to use more than 5 GPU cards #235

Open
junqiang1992 opened this issue Jan 11, 2024 · 2 comments
The environment is a Kata Containers setup. The host has 8 GPU cards. Creating a pod with 1-5 GPUs works fine with nvidia-container-cli, but creating one with 6 GPUs fails. After debugging, the root cause appears to be that the code calls ns_enter, switches into the container's rootfs, and then mount_procfs cannot find the corresponding directory. The environment version information is as follows:

nvidia-container-toolkit version

root@ubuntu-dev:/# dpkg -l | grep nvidia-container-toolkit
ii nvidia-container-toolkit 1.14.3-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.14.3-1 amd64 NVIDIA Container Toolkit Base

The nvidia-container-cli debug log is as follows:

```
-- WARNING, the following logs are for debugging purposes only --

I0111 10:30:12.940268 149 nvc.c:376] initializing library context (version=1.14.1, build=1eb5a30a6ad0415550a9df632ac8832bf7e2bbba)
I0111 10:30:12.940332 149 nvc.c:350] using root /
I0111 10:30:12.940334 149 nvc.c:351] using ldcache /etc/ld.so.cache
I0111 10:30:12.940336 149 nvc.c:352] using unprivileged user 65534:65534
I0111 10:30:12.940347 149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0111 10:30:12.940484 149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0111 10:30:12.943220 179 nvc.c:278] loading kernel module nvidia
I0111 10:30:12.943338 179 nvc.c:282] running mknod for /dev/nvidiactl
I0111 10:30:12.943362 179 nvc.c:286] running mknod for /dev/nvidia0
I0111 10:30:12.943374 179 nvc.c:286] running mknod for /dev/nvidia1
I0111 10:30:12.943382 179 nvc.c:286] running mknod for /dev/nvidia2
I0111 10:30:12.943390 179 nvc.c:286] running mknod for /dev/nvidia3
I0111 10:30:12.943399 179 nvc.c:286] running mknod for /dev/nvidia4
I0111 10:30:12.943407 179 nvc.c:286] running mknod for /dev/nvidia5
I0111 10:30:12.943415 179 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0111 10:30:12.947452 179 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0111 10:30:12.947504 179 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0111 10:30:12.949693 179 nvc.c:296] loading kernel module nvidia_uvm
I0111 10:30:12.949702 179 nvc.c:300] running mknod for /dev/nvidia-uvm
I0111 10:30:12.949735 179 nvc.c:305] loading kernel module nvidia_modeset
I0111 10:30:12.955489 179 nvc.c:309] running mknod for /dev/nvidia-modeset
I0111 10:30:12.955701 183 rpc.c:71] starting driver rpc service
I0111 10:30:19.067427 229 rpc.c:71] starting nvcgo rpc service
I0111 10:30:19.072896 149 nvc_container.c:246] configuring container with 'compute utility supervised'
I0111 10:30:19.077084 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libcuda.so.545.23.08
I0111 10:30:19.077456 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libcudadebugger.so.545.23.08
I0111 10:30:19.077803 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libnvidia-nvvm.so.545.23.08
I0111 10:30:19.078148 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libnvidia-ptxjitcompiler.so.545.23.08
I0111 10:30:19.079973 149 nvc_container.c:268] setting pid to 147
I0111 10:30:19.080003 149 nvc_container.c:269] setting rootfs to /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs
I0111 10:30:19.080011 149 nvc_container.c:270] setting owner to 0:0
I0111 10:30:19.080018 149 nvc_container.c:271] setting bins directory to /usr/bin
I0111 10:30:19.080038 149 nvc_container.c:272] setting libs directory to /usr/lib/x86_64-linux-gnu
I0111 10:30:19.080045 149 nvc_container.c:273] setting libs32 directory to /usr/lib/i386-linux-gnu
I0111 10:30:19.080052 149 nvc_container.c:274] setting cudart directory to /usr/local/cuda
I0111 10:30:19.080058 149 nvc_container.c:275] setting ldconfig to @/sbin/ldconfig.real (host relative)
I0111 10:30:19.080064 149 nvc_container.c:276] setting mount namespace to /proc/147/ns/mnt
I0111 10:30:19.080070 149 nvc_container.c:278] detected cgroupv1
I0111 10:30:19.080077 149 nvc_container.c:279] setting devices cgroup to /sys/fs/cgroup/devices/663e4a00_6864_4e19_8a5f_15d850583969/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0
I0111 10:30:19.080134 149 nvc_info.c:798] requesting driver information with ''
I0111 10:30:19.083496 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.535.146.02
I0111 10:30:19.083715 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.146.02
I0111 10:30:19.083826 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.535.146.02
I0111 10:30:19.083920 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.146.02
I0111 10:30:19.084025 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.535.146.02
I0111 10:30:19.084159 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.535.146.02
I0111 10:30:19.084254 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.535.146.02
I0111 10:30:19.084412 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.146.02
I0111 10:30:19.084542 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.535.146.02
I0111 10:30:19.084657 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.535.146.02
I0111 10:30:19.084783 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.535.146.02
I0111 10:30:19.084883 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.535.146.02
I0111 10:30:19.084963 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.535.146.02
I0111 10:30:19.085071 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.535.146.02
I0111 10:30:19.085101 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.535.146.02
W0111 10:30:19.085145 149 nvc_info.c:402] missing library libnvidia-nscq.so
W0111 10:30:19.085150 149 nvc_info.c:402] missing library libnvidia-gpucomp.so
W0111 10:30:19.085152 149 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so
W0111 10:30:19.085155 149 nvc_info.c:402] missing library libnvidia-compiler.so
W0111 10:30:19.085158 149 nvc_info.c:402] missing library libnvidia-ngx.so
W0111 10:30:19.085160 149 nvc_info.c:402] missing library libnvidia-eglcore.so
W0111 10:30:19.085163 149 nvc_info.c:402] missing library libnvidia-glcore.so
W0111 10:30:19.085165 149 nvc_info.c:402] missing library libnvidia-tls.so
W0111 10:30:19.085167 149 nvc_info.c:402] missing library libnvidia-glsi.so
W0111 10:30:19.085170 149 nvc_info.c:402] missing library libnvidia-ifr.so
W0111 10:30:19.085172 149 nvc_info.c:402] missing library libnvidia-rtcore.so
W0111 10:30:19.085175 149 nvc_info.c:402] missing library libnvoptix.so
W0111 10:30:19.085177 149 nvc_info.c:402] missing library libGLX_nvidia.so
W0111 10:30:19.085180 149 nvc_info.c:402] missing library libEGL_nvidia.so
W0111 10:30:19.085182 149 nvc_info.c:402] missing library libGLESv2_nvidia.so
W0111 10:30:19.085184 149 nvc_info.c:402] missing library libGLESv1_CM_nvidia.so
W0111 10:30:19.085187 149 nvc_info.c:402] missing library libnvidia-glvkspirv.so
W0111 10:30:19.085189 149 nvc_info.c:402] missing library libnvidia-cbl.so
W0111 10:30:19.085192 149 nvc_info.c:406] missing compat32 library libnvidia-ml.so
W0111 10:30:19.085194 149 nvc_info.c:406] missing compat32 library libnvidia-cfg.so
W0111 10:30:19.085197 149 nvc_info.c:406] missing compat32 library libnvidia-nscq.so
W0111 10:30:19.085199 149 nvc_info.c:406] missing compat32 library libcuda.so
W0111 10:30:19.085202 149 nvc_info.c:406] missing compat32 library libcudadebugger.so
W0111 10:30:19.085204 149 nvc_info.c:406] missing compat32 library libnvidia-opencl.so
W0111 10:30:19.085206 149 nvc_info.c:406] missing compat32 library libnvidia-gpucomp.so
W0111 10:30:19.085209 149 nvc_info.c:406] missing compat32 library libnvidia-ptxjitcompiler.so
W0111 10:30:19.085211 149 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so
W0111 10:30:19.085214 149 nvc_info.c:406] missing compat32 library libnvidia-allocator.so
W0111 10:30:19.085216 149 nvc_info.c:406] missing compat32 library libnvidia-compiler.so
W0111 10:30:19.085219 149 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so
W0111 10:30:19.085221 149 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so
W0111 10:30:19.085230 149 nvc_info.c:406] missing compat32 library libnvidia-nvvm.so
W0111 10:30:19.085233 149 nvc_info.c:406] missing compat32 library libnvidia-ngx.so
W0111 10:30:19.085235 149 nvc_info.c:406] missing compat32 library libvdpau_nvidia.so
W0111 10:30:19.085238 149 nvc_info.c:406] missing compat32 library libnvidia-encode.so
W0111 10:30:19.085241 149 nvc_info.c:406] missing compat32 library libnvidia-opticalflow.so
W0111 10:30:19.085243 149 nvc_info.c:406] missing compat32 library libnvcuvid.so
W0111 10:30:19.085246 149 nvc_info.c:406] missing compat32 library libnvidia-eglcore.so
W0111 10:30:19.085249 149 nvc_info.c:406] missing compat32 library libnvidia-glcore.so
W0111 10:30:19.085251 149 nvc_info.c:406] missing compat32 library libnvidia-tls.so
W0111 10:30:19.085254 149 nvc_info.c:406] missing compat32 library libnvidia-glsi.so
W0111 10:30:19.085256 149 nvc_info.c:406] missing compat32 library libnvidia-fbc.so
W0111 10:30:19.085259 149 nvc_info.c:406] missing compat32 library libnvidia-ifr.so
W0111 10:30:19.085262 149 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so
W0111 10:30:19.085264 149 nvc_info.c:406] missing compat32 library libnvoptix.so
W0111 10:30:19.085267 149 nvc_info.c:406] missing compat32 library libGLX_nvidia.so
W0111 10:30:19.085270 149 nvc_info.c:406] missing compat32 library libEGL_nvidia.so
W0111 10:30:19.085272 149 nvc_info.c:406] missing compat32 library libGLESv2_nvidia.so
W0111 10:30:19.085275 149 nvc_info.c:406] missing compat32 library libGLESv1_CM_nvidia.so
W0111 10:30:19.085277 149 nvc_info.c:406] missing compat32 library libnvidia-glvkspirv.so
W0111 10:30:19.085280 149 nvc_info.c:406] missing compat32 library libnvidia-cbl.so
I0111 10:30:19.085495 149 nvc_info.c:302] selecting /usr/bin/nvidia-smi
I0111 10:30:19.085511 149 nvc_info.c:302] selecting /usr/bin/nvidia-debugdump
I0111 10:30:19.085525 149 nvc_info.c:302] selecting /usr/bin/nvidia-persistenced
I0111 10:30:19.085549 149 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-control
I0111 10:30:19.085563 149 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-server
W0111 10:30:19.085591 149 nvc_info.c:428] missing binary nv-fabricmanager
I0111 10:30:19.085667 149 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.146.02/gsp_ga10x.bin
I0111 10:30:19.085671 149 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.146.02/gsp_tu10x.bin
I0111 10:30:19.085688 149 nvc_info.c:561] listing device /dev/nvidiactl
I0111 10:30:19.085691 149 nvc_info.c:561] listing device /dev/nvidia-uvm
I0111 10:30:19.085693 149 nvc_info.c:561] listing device /dev/nvidia-uvm-tools
I0111 10:30:19.085696 149 nvc_info.c:561] listing device /dev/nvidia-modeset
W0111 10:30:19.085712 149 nvc_info.c:352] missing ipc path /var/run/nvidia-persistenced/socket
W0111 10:30:19.085724 149 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket
W0111 10:30:19.085735 149 nvc_info.c:352] missing ipc path /tmp/nvidia-mps
I0111 10:30:19.085739 149 nvc_info.c:854] requesting device information with ''
I0111 10:30:19.093013 149 nvc_info.c:745] listing device /dev/nvidia0 (GPU-15dd6db0-ca52-31f5-3daf-2019882683b0 at 00000000:02:00.0)
I0111 10:30:19.101035 149 nvc_info.c:745] listing device /dev/nvidia1 (GPU-2f4bb339-fc05-e25d-512c-05c7eefd99e1 at 00000000:04:00.0)
I0111 10:30:19.109848 149 nvc_info.c:745] listing device /dev/nvidia2 (GPU-23e7b85e-0792-0721-9d22-ac2a0e9bac2b at 00000000:06:00.0)
I0111 10:30:19.118514 149 nvc_info.c:745] listing device /dev/nvidia3 (GPU-d78a3ff1-47b6-80be-95a5-ffe77853335f at 00000000:08:00.0)
I0111 10:30:19.126998 149 nvc_info.c:745] listing device /dev/nvidia4 (GPU-3dbafd65-2a6e-1445-9569-a70fa746902d at 00000000:0a:00.0)
I0111 10:30:19.135642 149 nvc_info.c:745] listing device /dev/nvidia5 (GPU-d9bc47e7-fd6d-f1dc-ff31-d9675bb73087 at 00000000:0c:00.0)
```

The relevant nvidia-container-cli code is in src/nvc_mount.c:

```c
int
nvc_driver_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_driver_info *info)
{
        const char **mnt, **ptr, **tmp;
        size_t nmnt;
        int rv = -1;

        if (validate_context(ctx) < 0)
                return (-1);
        if (validate_args(ctx, cnt != NULL && info != NULL) < 0)
                return (-1);

        if (ns_enter(&ctx->err, cnt->mnt_ns, CLONE_NEWNS) < 0)
                return (-1);

        nmnt = 2 + info->nbins + info->nlibs + cnt->nlibs + info->nlibs32 + info->nipcs + info->ndevs + info->nfirmwares;
        mnt = ptr = (const char **)array_new(&ctx->err, nmnt);
        if (mnt == NULL)
                goto fail;

        /* Procfs mount */
        if (ctx->dxcore.initialized)
                log_warn("skipping procfs mount on WSL");
        else if ((*ptr++ = mount_procfs(&ctx->err, ctx->cfg.root, cnt)) == NULL)
                goto fail;
```
Debugging shows the problem occurs around the ns_enter function. In the working cases, even after ns_enter is called, the filesystem paths inside the virtual machine remain visible, so the mount succeeds; the mount target is /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/proc/driver/nvidia. In the failing case with 6 GPU cards, however, after ns_enter the process ends up with the container's rootfs as its root directory, so mounting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/proc/driver/nvidia fails because the path cannot be found. I don't understand why this happens or how to solve it.
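To illustrate the suspected mechanism, here is a minimal, hypothetical sketch (not the libnvidia-container code) of what ns_enter boils down to: a setns(fd, CLONE_NEWNS) into the target mount namespace, after which all paths are resolved against that namespace's mounts. If the host-side /run/kata-containers/<id>/rootfs mount is not visible there, mount_procfs has no directory to mount over.

```c
/*
 * Hypothetical sketch, not part of libnvidia-container: join the mount
 * namespace of <pid> (as ns_enter does via setns) and check whether a
 * given path is still visible from inside that namespace.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char ns[64];
        int fd;
        struct stat st;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <pid> <path>\n", argv[0]);
                return 1;
        }

        snprintf(ns, sizeof(ns), "/proc/%s/ns/mnt", argv[1]);
        if ((fd = open(ns, O_RDONLY)) < 0) {
                perror("open");
                return 1;
        }

        /* Join the target mount namespace, like ns_enter(). */
        if (setns(fd, CLONE_NEWNS) < 0) {
                perror("setns");
                return 1;
        }
        close(fd);

        /* Path resolution now happens against the joined namespace. */
        printf("%s %s in pid %s's mount namespace\n", argv[2],
               stat(argv[2], &st) == 0 ? "exists" : "does not exist", argv[1]);
        return 0;
}
```

Running this as root with the container PID from the log (147) and the rootfs path above would show whether that path disappears once the namespace is joined.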

junqiang1992 (author) commented:

The nvidia-container-cli process never exited:

root@localhost:/# ps -ef | grep con
root 82 2 0 10:30 ? 00:00:00 [ipv6_addrconf]
root 103 2 0 10:30 ? 00:00:00 [ext4-rsv-conver]
nobody 183 123 0 10:30 ? 00:00:05 /usr/bin/nvidia-container-cli --load-kmods --debug=/run/nvidia-container-toolkit.log configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=12.3 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536 --pid=147 /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs
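One way to check where the hung process ended up (a hypothetical helper, not part of the toolkit) is to compare its mount-namespace link under /proc with that of pid 1; if they differ, the process has already completed ns_enter and is stuck inside the container's mount namespace.

```c
/*
 * Hypothetical helper, not part of the NVIDIA toolkit: print the mount
 * namespace links of pid 1 (host) and of a given pid, so a stuck
 * nvidia-container-cli process can be checked for whether it has
 * already joined the container's mount namespace.
 */
#include <stdio.h>
#include <unistd.h>

static void print_mnt_ns(const char *pid)
{
        char path[64], link[128];
        ssize_t n;

        snprintf(path, sizeof(path), "/proc/%s/ns/mnt", pid);
        if ((n = readlink(path, link, sizeof(link) - 1)) < 0) {
                perror(path);
                return;
        }
        link[n] = '\0';
        printf("%s -> %s\n", path, link);
}

int main(int argc, char **argv)
{
        print_mnt_ns("1");                          /* host namespace */
        print_mnt_ns(argc > 1 ? argv[1] : "self");  /* process to check */
        return 0;
}
```

Invoked as root with the stuck PID (183 in the ps output above), it prints both namespace links for comparison.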

junqiang1992 (author) commented:

The problem has been solved. It was caused by a timeout bug in kata-agent.
