Saving disk and download time (plus VRAM) #19

set-soft opened this issue Nov 10, 2024 · 5 comments

@set-soft

I manually downloaded the model from here:

https://huggingface.co/silveroxides/OmniGen-V1/tree/main

Renamed the FP8 file to model.safetensors and got it working.

The FP8 model is just 3.7 GB
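
For reference, a minimal sketch of how the file can be fetched and put in place; the FP8 filename and the destination directory below are assumptions, check the repo listing and your ComfyUI models path:

```python
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

# Fetch the FP8 checkpoint from the mirror repo.
src = hf_hub_download(
    repo_id="silveroxides/OmniGen-V1",
    filename="OmniGen-v1-fp8_e4m3fn.safetensors",  # hypothetical name, verify in the repo
)

# The node looks for "model.safetensors", so copy the download under that name.
dst = Path("models/OmniGen-v1/model.safetensors")  # adjust to where this node expects the model
dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(src, dst)
```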

@1038lab (Owner) commented Nov 10, 2024

How's the FP8 version, does it run faster?

@set-soft (Author)

Not sure if it's faster, but it saves disk space and download time.
You should also investigate: https://github.com/newgenai79/OmniGen/
It also keeps the model in FP8 in VRAM, enabling its use on 8 GiB boards.
I tried your addon in combination with the OmniGen code from https://github.com/chflame163/ComfyUI_OmniGen_Wrapper
That code applies some quantization and was fine for a 12 GiB board.
For this I used the FP8 file, just because my internet connection is slow and I didn't want to wait for hours (I already had the FP8 downloaded). So the FP8 file (on disk) works for your addon, but using chflame163's copy of the OmniGen code.
I also verified that newgenai79's code (which is for the original demo, not a ComfyUI addon) works perfectly with the FP8 file and uses only 55% of my VRAM.
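
As a rough illustration of why keeping the weights in FP8 lands around half the footprint of FP16, the per-element storage can be checked directly (this assumes a PyTorch build with the float8 dtypes, 2.1 or newer):

```python
import torch

# Per-weight storage: float8 = 1 byte, float16 = 2 bytes, float32 = 4 bytes.
for dtype in (torch.float8_e4m3fn, torch.float16, torch.float32):
    t = torch.empty(1, dtype=dtype)
    print(dtype, "->", t.element_size(), "byte(s) per element")

# A checkpoint that is 3.7 GB in FP8 would be roughly 7.4 GB once expanded
# to FP16, before counting activations and the VAE.
```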

@1038lab (Owner) commented Nov 11, 2024

Updated, try the new version.

@set-soft (Author)

Hi @1038lab!
I'm afraid it doesn't work; it doesn't even start the inference.
For some reason you are unconditionally loading everything to VRAM, first the VAE (330 MB) and then the model, and at that point not even 12 GB is enough. The call "pipe = pipe.to(device)" fails, the model isn't loaded, and VRAM is left with 10926 MB from the failed load.
Then at the beginning of the pipeline you call "self.model.to(dtype)", which fails on top of the previous failure.
Your strategy only works for boards with 16 GB or more.
This is with "memory_management" set to "Memory Priority".
The only thing "Memory Priority" does is request "offload_model", which doesn't help much and makes things really slow. When I tested it on an older version it didn't help at all, and it moved layers using just one CPU core; I'm not sure if the code still does this.
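
Just to illustrate, a defensive sketch of what the loader could do instead of an unconditional "pipe.to(device)"; this assumes the pipeline exposes its transformer as pipe.model (as the "self.model.to(dtype)" call suggests) and is not meant as your actual code:

```python
import torch

def move_if_it_fits(pipe, device="cuda", reserve_mb=1024):
    """Move the pipeline to the GPU only if the weights plausibly fit.

    Otherwise keep it on the CPU (for an offloading path) instead of
    leaving VRAM half-filled by a failed .to() call.
    """
    if device.startswith("cuda") and torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        needed = sum(p.numel() * p.element_size() for p in pipe.model.parameters())
        if needed + reserve_mb * 1024 * 1024 > free_bytes:
            return pipe  # not enough free VRAM; keep on CPU / use offloading
    return pipe.to(device)
```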

The main problem I see here is the strategy of downloading the upstream code, which doesn't implement a good memory strategy. You should incorporate it into your repository and patch it to do the proper thing.

Also: loading the FP8 file alone won't solve the memory issues. PyTorch loads it using the current default dtype, so the weights get expanded once loaded; the file is just small on disk. To get quantization working you must patch the nn.Linear layers.
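
A minimal sketch of what I mean by patching the Linear layers, assuming a PyTorch build with float8 dtypes; a real quantization scheme (like the one in chflame163's wrapper) would also keep scaling factors to preserve accuracy:

```python
import torch
import torch.nn as nn

class Float8Linear(nn.Module):
    """Stores the weight in float8 in VRAM and upcasts it per call."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Keep the packed copy as a buffer; float8_e4m3fn needs PyTorch >= 2.1.
        self.register_buffer("weight_fp8", linear.weight.data.to(torch.float8_e4m3fn))
        self.bias = linear.bias
        self.compute_dtype = linear.weight.dtype  # usually float16/bfloat16

    def forward(self, x):
        # Upcast only for the matmul; the stored copy stays at 1 byte per weight.
        # A real quantizer would also apply per-channel scales here.
        return nn.functional.linear(x, self.weight_fp8.to(self.compute_dtype), self.bias)

def patch_linears(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with the float8 wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, Float8Linear(child))
        else:
            patch_linears(child)
    return module
```

The patching has to happen after the state dict is loaded and before the model is moved to the GPU, otherwise the full-precision weights still hit VRAM first.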

BTW: please don't use print, use logging.debug; with print the messages go only to the console and the GUI can't catch them.
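
For example (load_weights is just a placeholder name):

```python
import logging

logger = logging.getLogger(__name__)

def load_weights(path):
    logger.debug("Loading OmniGen weights from %s", path)  # picked up by any configured log handler
    # ... actual loading ...
    logger.debug("Weights loaded")
```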

@1038lab (Owner) commented Nov 13, 2024

Applying quantization is a good approach. I'll make an effort to update it when I have the time.
