Deploying HuggingFace model/pipeline using uvicorn-gunicorn-fastapi-docker on Google Cloud Run #328

GiorgioBarnabo · 2023-02-22T03:06:36Z

Hi everybody,

I am pretty new to web app development and have questions about how to get the most out of this incredible Docker image.
In short, I have been trying to deploy a Hugging Face pipeline on Google Cloud Run using the uvicorn-gunicorn-fastapi-docker image. The model takes about 3.5 GB, while the base Cloud Run instance can have up to 16 vCPUs and 32 GB of RAM. At deployment time, I also need to manually specify the maximum number of concurrent requests before autoscaling happens.

How should I set the number of workers/threads for gunicorn/uvicorn, and how should I size the base Cloud Run instance? I noticed that every additional worker and/or thread needs another 3.5 GB of RAM. Also, memory leaks during execution, so a worker would need to be restarted every now and then.

My naive guess is that I should have as many workers as vCPUs, and at least 3.5 GB of RAM times the number of workers. Is that correct? What about the number of concurrent requests?
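As a rough sketch of that sizing arithmetic (not a recommendation from the thread): if every worker holds its own copy of the model, the worker count is bounded by both the vCPU count and the memory left after reserving some headroom. The function name `max_workers` and the 2 GB headroom default are assumptions made for illustration; the 3.5 GB, 16 vCPU and 32 GB figures are the ones quoted above.

```python
import math


def max_workers(ram_gb: float, vcpus: int, model_gb: float,
                headroom_gb: float = 2.0) -> int:
    """Upper bound on workers when each one loads its own model copy.

    headroom_gb is an assumed reserve for the OS, the Python runtime and
    per-request buffers; tune it from real measurements.
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    # How many model copies fit in the memory that is left over.
    by_memory = math.floor((ram_gb - headroom_gb) / model_gb)
    # Never report fewer than one worker, even if the model barely fits
    # (or does not fit at all, in which case the instance is too small).
    return max(1, min(vcpus, by_memory))


# Figures from the question: 32 GB RAM, 16 vCPUs, 3.5 GB model.
print(max_workers(ram_gb=32, vcpus=16, model_gb=3.5))
```

With these numbers memory, not CPU, is the binding limit, so "one worker per vCPU" would overshoot the available RAM.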

Right now, my uvicorn command in the Dockerfile looks like this:

CMD uvicorn main:app --host 0.0.0.0 --port 8080 --workers 4 --access-log --use-colors

Nonetheless, with this setting, after a while the RAM gets so saturated that the service breaks down :(
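One possible mitigation for slowly leaking workers, sketched here under assumptions: Uvicorn has a `--limit-max-requests` option that terminates a process after it has served that many requests. Whether a replacement worker is then started automatically depends on the Uvicorn version, so this is worth checking on your own setup. The helper name `uvicorn_command` and the value 1000 are hypothetical; `main:app` and port 8080 come from the command above.

```python
import shlex


def uvicorn_command(workers: int, max_requests: int, port: int = 8080) -> list[str]:
    """Build a uvicorn command line that recycles workers periodically."""
    return [
        "uvicorn", "main:app",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--workers", str(workers),
        # Each worker process exits after serving this many requests,
        # which releases whatever memory it has leaked so far.
        "--limit-max-requests", str(max_requests),
    ]


# Print the command in a form that can be pasted after CMD in a Dockerfile.
print(shlex.join(uvicorn_command(workers=4, max_requests=1000)))
```

The right request limit depends on how quickly memory grows per request, which has to be measured.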

Any help is more than welcome.

Thank you in advance. Best

ahron1 · 2023-07-14T06:26:26Z

If you use def for the FastAPI path function, each incoming request is run in a thread taken from a threadpool. There is a single copy of the model on the GPU.
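A standard-library sketch of the "threads share one copy" behaviour described above. It does not use FastAPI or a real model: `FakeModel` and `handle_request` are hypothetical stand-ins, and the thread pool here only imitates a server dispatching requests to threads inside one process.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for an expensive-to-load model."""

    instances = 0  # counts how many times a model was constructed

    def __init__(self) -> None:
        FakeModel.instances += 1

    def predict(self, text: str) -> str:
        return text.upper()


# Loaded once at module level, so once per process.
MODEL = FakeModel()


def handle_request(text: str) -> tuple[int, str]:
    # Every thread in this process sees the same MODEL object.
    return id(MODEL), MODEL.predict(text)


with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, ["a", "b", "c", "d"]))

model_ids = {model_id for model_id, _ in results}
outputs = [output for _, output in results]
print(len(model_ids), FakeModel.instances, outputs)
```

All four calls report the same object identity and the model is constructed only once. Separate worker processes would each run the module-level load themselves and so each hold their own copy.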

If you use async def with N workers, a total of N worker processes are created, and each request is handled by one of them. Each of the N workers holds its own copy of the model on the GPU. Workers don't share memory or other resources.

To decide the number of workers: N = number of threads + 1.
You also need enough GPU memory to fit N copies of the model.

So if you are GPU-limited, that is your criterion for deciding the number of workers.

What I wrote above is based on what I observed in a few tests. It might well be incorrect.

ahron1 · 2023-07-14T15:19:12Z

I would also recommend using Gunicorn instead of Uvicorn to run the app.

tiangolo (Maintainer) · 2024-08-25T04:09:58Z

Now that Uvicorn supports managing workers with --workers, including restarting dead ones, there's no need for Gunicorn. That also means it's much simpler to build a Docker image from scratch now; I updated the docs to explain it.

Because of that, I deprecated this Docker image: https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker#-warning-you-probably-dont-need-this-docker-image

That would also probably make use cases like this simpler to deal with. 🤓
