-
Hi, this is not a diffusers issue though; what you're asking is a question at a level above diffusers. You can instantiate a pipeline, keep it in memory (GPU or CPU), and just run it whenever there's a user request. What you need is to learn how to create an API that keeps a service running without terminating on each run. There are multiple libraries and multiple ways to do this, but as I stated before, this has nothing to do with diffusers. A very basic way of doing it is to load the pipeline once and loop over user input:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline a single time and keep it on the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
).to("cuda")

while True:
    prompt = input("Enter a prompt: ")
    image = pipe(prompt=prompt).images[0]
    # save the image
```

Now if you replace the user-input part with something like an incoming request to your service, you have the basic idea. There are multiple other issues that come with this, like concurrent user calls, model management, etc. But again, diffusers is the library you use as the foundation to generate images; you'll need to build a complete web API solution on top of it.
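Purely as an illustration, here is a minimal sketch of that idea using FastAPI. The framework choice, the `/generate` endpoint, the example model id, and the lock around the GPU call are all assumptions on my part, not the only way to do it:

```python
import io
import threading

import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI, Response

app = FastAPI()

# Load the pipeline once when the server process starts; it stays in GPU memory
# and is reused by every incoming request.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id, replace with your own
    torch_dtype=torch.float16,
).to("cuda")

# Serialize access to the GPU so concurrent requests don't run on the same
# pipeline at the same time.
gpu_lock = threading.Lock()

@app.get("/generate")
def generate(prompt: str):
    with gpu_lock:
        image = pipe(prompt=prompt).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")
```

You'd run it with something like `uvicorn app:app`; the expensive `from_pretrained` call happens once at startup, and each request only pays the generation cost.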
-
I'm looking all around the internet for this:

I'm deploying an application that, for every request from the users, runs the Python script, loads the diffusers library, instantiates a pipeline (`pipeline = DiffusionPipeline.from_pretrained(...)`), and then generates an image to return to the user. However, my inference times are already fast (calling `pipeline(prompt="example_prompt", ...)` only takes about 0.3 seconds), yet every user request takes more than 40 seconds, because on every request from a different user `pipeline = DiffusionPipeline.from_pretrained(...)` needs to be loaded over and over again, making the whole inference process very slow.

So my question is this: is it possible to instantiate (or cache) a single `pipeline = DiffusionPipeline.from_pretrained(...)` so it can be shared and reused across different HTTP requests, and not have to be run for each request to my server? The closest thing that I've found is Mystic (https://docs.mystic.ai/docs/deploy-a-stable-diffusion-pipeline), which allows you to load the model into the cache of the GPU, but it's loading the whole model, and I only want to instantiate the pipeline.
Regards,