add support for nvidia gpu access #5132

Merged
merged 18 commits into from
Dec 20, 2024

Conversation

MondoGao
Contributor

@MondoGao MondoGao commented Aug 16, 2024

Close #4277

Add a new environment variable to enable Nvidia GPU access when creating containers. It's similar to the /dev/dri support. See #1525.

Note: I didn't fully inspect and test this change since I don't have a PHP development environment set up on my machine. Feel free to take over this PR to improve its maintainability.

@szaimen
Collaborator

szaimen commented Aug 16, 2024

Hi @MondoGao, thanks for your PR! Can you please fix DCO? We need that in order to be able to merge it later on. See https://github.com/nextcloud/all-in-one/pull/5132/checks?check_run_id=28844781358

@szaimen szaimen added the "2. developing" (Work in progress) and "enhancement" (New feature or request) labels Aug 16, 2024
@szaimen szaimen added this to the next milestone Aug 16, 2024
@MondoGao
Contributor Author

Hi @MondoGao, thanks for your PR! Can you please fix DCO? We need that in order to be able to merge it later on. See https://github.com/nextcloud/all-in-one/pull/5132/checks?check_run_id=28844781358

Sign added :)

@szaimen
Collaborator

szaimen commented Aug 19, 2024

Hi, I had a quick look at this and I think it would add the capability to all containers that are controlled by AIO. It would probably be better to add this as a capability to containers-schema.json and enable it only for certain containers via containers.json.

Also, some places are still missing, e.g. documentation for it in the readme. See #1659 for inspiration.

Additionally, do you know if this is also going to work with AMD and Intel GPUs? In the best case, we create only one setting that works for all of them.
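
The per-container opt-in suggested here could look roughly like the sketch below. It is not taken from the PR diff: the `enable_nvidia_gpu` key, the `container_name` key, and the `nextcloud-aio-example` entry are assumptions made for illustration. The idea is only that a global switch and a per-container flag must both be set before a container receives GPU access.

```python
from typing import Any


def containers_with_gpu(
    definitions: list[dict[str, Any]], global_gpu_enabled: bool
) -> list[str]:
    """Return the names of containers that should receive GPU access.

    A container qualifies only if the global switch is on AND its own
    definition opts in. The key "enable_nvidia_gpu" is a hypothetical name.
    """
    if not global_gpu_enabled:
        return []
    return [
        d["container_name"]
        for d in definitions
        if d.get("enable_nvidia_gpu", False)
    ]


# Illustrative definitions only; the second entry is a made-up container.
definitions = [
    {"container_name": "nextcloud-aio-local-ai", "enable_nvidia_gpu": True},
    {"container_name": "nextcloud-aio-example"},
]

gpu_containers = containers_with_gpu(definitions, global_gpu_enabled=True)
```

With this shape, containers that never declare the flag are unaffected even when the global setting is enabled.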

@MondoGao
Contributor Author

Hi, I had a quick look at this and I think it would add the capability to all containers that are controlled by AIO. It would probably be better to add this as a capability to containers-schema.json and enable it only for certain containers via containers.json.

Also, some places are still missing, e.g. documentation for it in the readme. See #1659 for inspiration.

Additionally, do you know if this is also going to work with AMD and Intel GPUs? In the best case, we create only one setting that works for all of them.

I believe the Docker runtime only supports Nvidia GPU passthrough. Hardware acceleration on Intel and AMD CPUs/GPUs is exposed through /dev/dri.

Looks like I have to install a PHP dev environment to polish this PR, so please expect a late response.
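
To make the distinction in this reply concrete, here is a minimal sketch of how the two mechanisms differ in a Docker Engine API container-create request. It is an illustration, not the code from this PR: `build_host_config` is a hypothetical helper, while `DeviceRequests` and `Devices` are existing `HostConfig` fields of the Engine API.

```python
import json
from typing import Any


def build_host_config(nvidia_gpu: bool, dri_device: bool) -> dict[str, Any]:
    """Build the HostConfig part of a container-create request body."""
    host_config: dict[str, Any] = {}
    if nvidia_gpu:
        # Comparable to `docker run --gpus all`: request all GPUs
        # (Count -1) with the "gpu" capability from the nvidia driver.
        host_config["DeviceRequests"] = [
            {"Driver": "nvidia", "Count": -1, "Capabilities": [["gpu"]]}
        ]
    if dri_device:
        # Comparable to `docker run --device /dev/dri`: map the host's
        # DRI device nodes into the container.
        host_config["Devices"] = [
            {
                "PathOnHost": "/dev/dri",
                "PathInContainer": "/dev/dri",
                "CgroupPermissions": "rwm",
            }
        ]
    return host_config


print(json.dumps(build_host_config(nvidia_gpu=True, dri_device=True), indent=2))
```

The Nvidia path depends on the NVIDIA Container Toolkit being installed on the host, whereas the /dev/dri path is a plain device mapping.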

@luzfcb

luzfcb commented Aug 19, 2024

@MondoGao Thank you for starting this effort.

@MondoGao @szaimen

Besides supporting Nvidia GPU passthrough, there are cases where the container may require the same specific version of the NVIDIA driver installed on the host system. It might be a good idea to create an environment variable for the Nvidia driver version and let the container handle the Nvidia driver setup and configuration logic inside the container.

That is, Nextcloud AIO could either provide some way for the user to manually specify the version or, if it is not provided by the user, then have an automatic way to obtain this information and fill an NVIDIA_DRIVER_VERSION environment variable that will be available to all containers managed by Nextcloud AIO.

I have exactly this with docker-steam-headless, and it works great.

This is the driver download and installation script: https://github.com/Steam-Headless/docker-steam-headless/blob/860451da74b397385f1b1658545d2bb891aa8e46/overlay/etc/cont-init.d/60-configure_gpu_driver.sh

This script creates the X Server configuration files and other related configurations. It probably does not apply to Nextcloud and plugins since they do not use X Server, but I am including it here just for reference.

https://github.com/Steam-Headless/docker-steam-headless/blob/860451da74b397385f1b1658545d2bb891aa8e46/overlay/etc/cont-init.d/70-configure_xorg.sh
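
The fallback proposed in this comment (use the driver version supplied by the user, otherwise detect it automatically) might be sketched as follows. This is only an assumption about how such detection could work: the function names are hypothetical, and the detection step relies on `nvidia-smi --query-gpu=driver_version --format=csv,noheader` being available on the host.

```python
import re
import subprocess
from typing import Optional


def parse_driver_version(smi_output: str) -> Optional[str]:
    """Return the first line that looks like a driver version, e.g. 565.57.01."""
    for line in smi_output.splitlines():
        line = line.strip()
        if re.fullmatch(r"\d+(\.\d+)+", line):
            return line
    return None


def detect_driver_version() -> Optional[str]:
    """Ask nvidia-smi for the host driver version; None if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        # nvidia-smi is missing, failed, or timed out.
        return None
    return parse_driver_version(result.stdout)


def resolve_driver_version(user_value: Optional[str]) -> Optional[str]:
    """Prefer the user-supplied value, fall back to auto-detection."""
    return user_value or detect_driver_version()
```

The resolved value could then be exported as the NVIDIA_DRIVER_VERSION environment variable described above.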

@docjyJ
Collaborator

docjyJ commented Aug 21, 2024

It would be nice to have this feature!

We can have Nextcloud Assistant servers with high performance easily!!!!!!

I agree with @szaimen the configuration should be done in the containers.json schema.

Otherwise it looks good to me.

AMD GPU : https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html

@szaimen szaimen modified the milestones: v9.5.0, next, v9.5.1 Sep 4, 2024
@szaimen szaimen modified the milestones: v9.6.0, next Sep 18, 2024
@szaimen szaimen modified the milestones: v9.7.0, next Oct 10, 2024
@szaimen szaimen removed this from the next milestone Oct 20, 2024
@szaimen szaimen marked this pull request as draft October 22, 2024 09:43
@gbertolottiKS

Just to be sure I have understood this topic correctly: currently there's no way to use a dedicated GPU in Nextcloud AIO, since development work is needed so that it can see (and then use) the graphics card. Is that correct?
This PR should resolve this: once finished, AIO can use the GPU "natively", right?

Aren't there some workarounds to use the GPU in the meantime? We are happy to help with testing if necessary.

Thanks

@szaimen
Collaborator

szaimen commented Dec 13, 2024

Just to be sure I have understood this topic correctly: currently there's no way to use a dedicated GPU in Nextcloud AIO, since development work is needed so that it can see (and then use) the graphics card. Is that correct? This PR should resolve this: once finished, AIO can use the GPU "natively", right?

Yes

Aren't there some workarounds to use the GPU in the meantime?

Currently no, as this PR is unfortunately not even close to being finished. You could try to find someone to take over and finish this PR to speed up development.

Collaborator

@docjyJ docjyJ left a comment

Any news?
What's blocking the merge?

php/src/Data/ConfigurationManager.php Outdated
php/src/Data/ConfigurationManager.php Outdated
@szaimen
Collaborator

szaimen commented Dec 17, 2024

Any news? What's blocking the merge?

See #5132 (comment)

@szaimen
Collaborator

szaimen commented Dec 17, 2024

Also, twig CI seems to fail...

@docjyJ
Collaborator

docjyJ commented Dec 17, 2024

Any news? What's blocking the merge?

See #5132 (comment)

I'll try to look into this.

@docjyJ
Collaborator

docjyJ commented Dec 17, 2024

Also, twig CI seems to fail...

Fixed

@docjyJ
Collaborator

docjyJ commented Dec 17, 2024

Hi, I had a quick look at this and I think it would add the capability to all containers that are controlled by AIO. It would probably be better to add this as a capability to containers-schema.json and enable it only for certain containers via containers.json.

Done

Signed-off-by: Jean-Yves <[email protected]>
@docjyJ docjyJ requested a review from szaimen December 20, 2024 09:06
Collaborator

@docjyJ docjyJ left a comment

LGTM 🎉

@docjyJ
Collaborator

docjyJ commented Dec 20, 2024

Thank you @MondoGao

Collaborator

@szaimen szaimen left a comment

LGTM. Thanks a lot @MondoGao and @docjyJ! 😊

@szaimen szaimen added the "3. to review" (Waiting for reviews) label and removed the "2. developing" (Work in progress) label Dec 20, 2024
@szaimen szaimen merged commit 53edc5d into nextcloud:main Dec 20, 2024
10 checks passed
@szaimen
Collaborator

szaimen commented Dec 20, 2024

This is now released with v10.2.0 Beta. Testing and feedback are welcome! See https://github.com/nextcloud/all-in-one#how-to-switch-the-channel

@luzfcb

luzfcb commented Dec 20, 2024

@szaimen the local-ai community container is unable to start with Nextcloud AIO configured with the NEXTCLOUD_ENABLE_DRI_DEVICE=true, ENABLE_NVIDIA_GPU=true, AIO_COMMUNITY_CONTAINERS=local-ai environment variables.

These are the logs that I have in the AIO container:

It is currently difficult to tell quickly from the error log what the cause is. Maybe there is some room for improvement here.

Initial startup of Nextcloud All-in-One complete!
You should be able to open the Nextcloud AIO Interface now on port 8080 of this server!
E.g. https://internal.ip.of.this.server:8080
⚠️ Important: do always use an ip-address if you access this port and not a domain as HSTS might block access to it later!

If your server has port 80 and 8443 open and you point a domain to your server, you can get a valid certificate automatically by opening the Nextcloud AIO Interface via:
https://your-domain-that-points-to-this-server.tld:8443
NOTICE: PHP message: Slim Application Error
Type: Exception
Code: 0
Message: Could not start container nextcloud-aio-local-ai: Server error: `POST http://127.0.0.1/v1.41/containers/nextcloud-aio-local-ai/start` resulted in a `500 Internal Server Error` response:
{"message":"failed to create task for container: failed to create shim task: OCI runtime create failed: runc create fail (truncated...)
File: /var/www/docker-aio/php/src/Docker/DockerActionManager.php
Line: 170
Trace: #0 /var/www/docker-aio/php/src/Controller/DockerController.php(59): AIO\Docker\DockerActionManager->StartContainer(Object(AIO\Container\Container))
#1 /var/www/docker-aio/php/src/Controller/DockerController.php(26): AIO\Controller\DockerController->PerformRecursiveContainerStart('nextcloud-aio-l...', true)
#2 /var/www/docker-aio/php/src/Controller/DockerController.php(209): AIO\Controller\DockerController->PerformRecursiveContainerStart('nextcloud-aio-a...', true)
#3 /var/www/docker-aio/php/src/Controller/DockerController.php(189): AIO\Controller\DockerController->startTopContainer(true)
#4 /var/www/docker-aio/php/vendor/slim/slim/Slim/Handlers/Strategies/RequestResponse.php(38): AIO\Controller\DockerController->StartContainer(Object(GuzzleHttp\Psr7\ServerRequest), Object(GuzzleHttp\Psr7\Response), Array)
#5 /var/www/docker-aio/php/vendor/slim/slim/Slim/Routing/Route.php(363): Slim\Handlers\Strategies\RequestResponse->__invoke(Array, Object(GuzzleHttp\Psr7\ServerRequest), Object(GuzzleHttp\Psr7\Response), Array)
#6 /var/www/docker-aio/php/vendor/slim/slim/Slim/MiddlewareDispatcher.php(73): Slim\Routing\Route->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#7 /var/www/docker-aio/php/vendor/slim/slim/Slim/MiddlewareDispatcher.php(73): Slim\MiddlewareDispatcher->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#8 /var/www/docker-aio/php/vendor/slim/slim/Slim/Routing/Route.php(321): Slim\MiddlewareDispatcher->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#9 /var/www/docker-aio/php/vendor/slim/slim/Slim/Routing/RouteRunner.php(74): Slim\Routing\Route->run(Object(GuzzleHttp\Psr7\ServerRequest))
#10 /var/www/docker-aio/php/vendor/slim/csrf/src/Guard.php(482): Slim\Routing\RouteRunner->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#11 /var/www/docker-aio/php/vendor/slim/slim/Slim/MiddlewareDispatcher.php(177): Slim\Csrf\Guard->process(Object(GuzzleHttp\Psr7\ServerRequest), Object(Slim\Routing\RouteRunner))
#12 /var/www/docker-aio/php/vendor/slim/twig-view/src/TwigMiddleware.php(117): Psr\Http\Server\RequestHandlerInterface@anonymous->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#13 /var/www/docker-aio/php/vendor/slim/slim/Slim/MiddlewareDispatcher.php(129): Slim\Views\TwigMiddleware->process(Object(GuzzleHttp\Psr7\ServerRequest), Object(Psr\Http\Server\RequestHandlerInterface@anonymous))
#14 /var/www/docker-aio/php/src/Middleware/AuthMiddleware.php(36): Psr\Http\Server\RequestHandlerInterface@anonymous->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#15 /var/www/docker-aio/php/vendor/slim/slim/Slim/MiddlewareDispatcher.php(280): AIO\Middleware\AuthMiddleware->__invoke(Object(GuzzleHttp\Psr7\ServerRequest), Object(Psr\Http\Server\RequestHandlerInterface@anonymous))
#16 /var/www/docker-aio/php/vendor/slim/slim/Slim/Middleware/ErrorMiddleware.php(77): Psr\Http\Server\RequestHandlerInterface@anonymous->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#17 /var/www/docker-aio/php/vendor/slim/slim/Slim/MiddlewareDispatcher.php(129): Slim\Middleware\ErrorMiddleware->process(Object(GuzzleHttp\Psr7\ServerRequest), Object(Psr\Http\Server\RequestHandlerInterface@anonymous))
#18 /var/www/docker-aio/php/vendor/slim/slim/Slim/MiddlewareDispatcher.php(73): Psr\Http\Server\RequestHandlerInterface@anonymous->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#19 /var/www/docker-aio/php/vendor/slim/slim/Slim/App.php(209): Slim\MiddlewareDispatcher->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#20 /var/www/docker-aio/php/vendor/slim/slim/Slim/App.php(193): Slim\App->handle(Object(GuzzleHttp\Psr7\ServerRequest))
#21 /var/www/docker-aio/php/public/index.php(189): Slim\App->run()
#22 {main}
Tips: To display error details in HTTP response set "displayErrorDetails" to true in the ErrorHandler constructor.

I think the reason is probably that the Docker image used as the base for aio-local-ai is quay.io/go-skynet/local-ai:v2.24.2-aio-cpu, but when ENABLE_NVIDIA_GPU=true is used, it should be one of the images below:

quay.io/go-skynet/local-ai:v2.24.2-aio-gpu-nvidia-cuda-12
quay.io/go-skynet/local-ai:v2.24.2-aio-gpu-nvidia-cuda-11

Additionally, I saw that there are images for Intel GPUs and similar:

quay.io/go-skynet/local-ai:v2.24.2-aio-gpu-intel-f16
quay.io/go-skynet/local-ai:v2.24.2-aio-gpu-intel-f32
quay.io/go-skynet/local-ai:v2.24.2-aio-gpu-hipblas

Anyway, the main problem with this error is that it stops all the other containers from starting.
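
One way to address both points raised here (a hard-to-read error and a single failure blocking every other container) is to collect per-container failures instead of aborting on the first one. The sketch below is a simplified illustration under that assumption, not AIO's actual start logic, which starts containers recursively by dependency. `start_all` and `fake_start` are hypothetical names, and the container names other than nextcloud-aio-local-ai are illustrative.

```python
from typing import Callable


def start_all(names: list[str], start: Callable[[str], None]) -> dict[str, str]:
    """Try to start every container; return {name: error message} for failures."""
    failures: dict[str, str] = {}
    for name in names:
        try:
            start(name)
        except Exception as exc:
            # Record a short, readable reason and keep going.
            failures[name] = str(exc)
    return failures


started: list[str] = []


def fake_start(name: str) -> None:
    """Stand-in for a real start call; fails only for the local-ai container."""
    if name == "nextcloud-aio-local-ai":
        raise RuntimeError("OCI runtime create failed")
    started.append(name)


failures = start_all(
    ["nextcloud-aio-database", "nextcloud-aio-local-ai", "nextcloud-aio-nextcloud"],
    fake_start,
)
for name, reason in failures.items():
    print(f"Could not start {name}: {reason}")
```

Whether a failure should be tolerated would likely depend on the container: skipping an optional community container is reasonable, while a failed dependency of Nextcloud itself probably still has to stop the start sequence.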

@szaimen
Collaborator

szaimen commented Dec 20, 2024

Hm... This specific error should not happen, it should still be able to start the container... Did you install the Nvidia drivers correctly as mentioned in the readme?

@luzfcb

luzfcb commented Dec 20, 2024

Did you install the Nvidia drivers correctly as mentioned in the readme?

Yes, since I can successfully run the Sample Workload with Docker:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Note:

I'm using Unraid OS 6.12.3 with the Nvidia Driver plugin by ich777 and the Nvidia proprietary driver v565.57.01, and I am able to play games on Steam via the Docker container https://github.com/Steam-Headless/docker-steam-headless

Anyway, the main problem with the error on the local-ai community container is that it somehow blocks all the other Nextcloud containers from being started by Nextcloud AIO.

@szaimen
Collaborator

szaimen commented Dec 20, 2024

So if you remove local-ai from the community containers, nextcloud starts correctly?

@luzfcb

luzfcb commented Dec 20, 2024

So if you remove local-ai from the community containers, nextcloud starts correctly?

Yes, and if I keep local-ai in the community containers and set ENABLE_NVIDIA_GPU=false, Nextcloud also starts correctly.

@szaimen
Collaborator

szaimen commented Dec 20, 2024

Okay, this is weird. I guess we need to remove the setting for local-ai then. Do the other containers that we added start correctly?

@luzfcb

luzfcb commented Dec 20, 2024

I guess we need to remove the setting for local-ai then

Until aio-local-ai provides a new image using quay.io/go-skynet/local-ai:v2.24.2-aio-gpu-nvidia-cuda-12 as its base instead of quay.io/go-skynet/local-ai:v2.24.2-aio-cpu, and the community containers feature implements a way to change the image or tag defined in

"image": "szaimen/aio-local-ai",
"image_tag": "v2",

based on whether ENABLE_NVIDIA_GPU=true is set, probably yes.

Do the other containers that we added start correctly?

facerecognition and memories start correctly.

Since I don't use Plex and Jellyfin, I can't confirm whether they work.

I'm also not sure whether facerecognition and memories recognize the Nvidia GPU immediately without any extra configuration.

I'll try to figure that out over the weekend.

@luzfcb

luzfcb commented Dec 20, 2024

I guess we need to remove the setting for local-ai then

Until aio-local-ai provides a new image using quay.io/go-skynet/local-ai:v2.24.2-aio-gpu-nvidia-cuda-12 as its base instead of quay.io/go-skynet/local-ai:v2.24.2-aio-cpu, and the community containers feature implements a way to change the image or tag defined in https://github.com/nextcloud/all-in-one/blob/109b9dc019ebb499a9571f8cf3129e6e26e1942a/community-containers/local-ai/local-ai.json#L7-L8 based on whether ENABLE_NVIDIA_GPU=true is set, probably yes.

A quick and simple solution is probably to create a new community container named local-ai-nvidia that uses a possible new aio-local-ai nvidia based image.

That is, if I want to use ENABLE_NVIDIA_GPU=true and local-ai, I should use local-ai-nvidia instead of local-ai in the AIO_COMMUNITY_CONTAINERS Nextcloud AIO environment variable.

Labels
3. to review Waiting for reviews enhancement New feature or request
5 participants