Saving disk and download time (plus VRAM) #19

set-soft opened this issue Nov 10, 2024 · 5 comments

@set-soft

I manually downloaded the model from here:

https://huggingface.co/silveroxides/OmniGen-V1/tree/main

Renamed the FP8 file to model.safetensors and got it working.

The FP8 model is just 3.7 GB
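
For reference, a minimal sketch of how the file can be fetched and put in place; the FP8 filename and the destination directory below are assumptions, check the repo listing and your ComfyUI models path:

```python
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

# Fetch the FP8 checkpoint from the mirror repo.
src = hf_hub_download(
    repo_id="silveroxides/OmniGen-V1",
    filename="OmniGen-v1-fp8_e4m3fn.safetensors",  # hypothetical name, verify in the repo
)

# The node looks for "model.safetensors", so copy the download under that name.
dst = Path("models/OmniGen-v1/model.safetensors")  # adjust to where this node expects the model
dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(src, dst)
```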

@1038lab (Owner) commented Nov 10, 2024

How's the FP8 version, does it run faster?

@set-soft (Author)

Not sure if it's faster, but it saves disk space and download time.
You should also investigate: https://github.com/newgenai79/OmniGen/
It also keeps the model in FP8 in VRAM, enabling its use on 8 GiB boards.
I tried your addon in combination with the OmniGen code from https://github.com/chflame163/ComfyUI_OmniGen_Wrapper
That code applies some quantization and was fine for a 12 GiB board.
For this I used the FP8 file, just because my internet connection is slow and I didn't want to wait for hours (I already had the FP8 downloaded). So the FP8 file (on disk) works for your addon, but using chflame163's copy of the OmniGen code.
I also verified that newgenai79's code (which is for the original demo, not a ComfyUI addon) works perfectly with the FP8 file and uses only 55% of my VRAM.
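
As a rough illustration of why keeping the weights in FP8 lands around half the footprint of FP16, the per-element storage can be checked directly (this assumes a PyTorch build with the float8 dtypes, 2.1 or newer):

```python
import torch

# Per-weight storage: float8 = 1 byte, float16 = 2 bytes, float32 = 4 bytes.
for dtype in (torch.float8_e4m3fn, torch.float16, torch.float32):
    t = torch.empty(1, dtype=dtype)
    print(dtype, "->", t.element_size(), "byte(s) per element")

# A checkpoint that is 3.7 GB in FP8 would be roughly 7.4 GB once expanded
# to FP16, before counting activations and the VAE.
```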

@1038lab (Owner) commented Nov 11, 2024

Updated, try the new version.

@set-soft (Author)

Hi @1038lab!
I'm afraid it doesn't work; it doesn't even start the inference.
For some reason you are unconditionally loading everything to VRAM, first the VAE (330 MB) and then the model, and at that point not even 12 GB is enough. The call "pipe = pipe.to(device)" fails, the model isn't loaded, and VRAM is left with 10926 MB from the failed load.
Then at the beginning of the pipeline you call "self.model.to(dtype)", which fails on top of the previous failure.
Your strategy only works for boards with 16 GB or more.
This is with "memory_management" set to "Memory Priority".
The only thing "Memory Priority" does is request "offload_model", which doesn't help much and makes things really slow. When I tested it on an older version it didn't help at all, and it moved layers using just one CPU core; I'm not sure if the code still does this.
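
Just to illustrate, a defensive sketch of what the loader could do instead of an unconditional "pipe.to(device)"; this assumes the pipeline exposes its transformer as pipe.model (as the "self.model.to(dtype)" call suggests) and is not meant as your actual code:

```python
import torch

def move_if_it_fits(pipe, device="cuda", reserve_mb=1024):
    """Move the pipeline to the GPU only if the weights plausibly fit.

    Otherwise keep it on the CPU (for an offloading path) instead of
    leaving VRAM half-filled by a failed .to() call.
    """
    if device.startswith("cuda") and torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
        needed = sum(p.numel() * p.element_size() for p in pipe.model.parameters())
        if needed + reserve_mb * 1024 * 1024 > free_bytes:
            return pipe  # not enough free VRAM; keep on CPU / use offloading
    return pipe.to(device)
```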

The main problem I see here is the strategy of downloading the upstream code, which doesn't implement a good memory strategy. You should incorporate it into your repository and patch it to do the proper thing.

Also: loading the FP8 file alone won't solve the memory issues. PyTorch loads it using the current default dtype, so the weights get expanded once loaded; the file is just small on disk. To get quantization working you must patch the nn.Linear layers.
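
A minimal sketch of what I mean by patching the Linear layers, assuming a PyTorch build with float8 dtypes; a real quantization scheme (like the one in chflame163's wrapper) would also keep scaling factors to preserve accuracy:

```python
import torch
import torch.nn as nn

class Float8Linear(nn.Module):
    """Stores the weight in float8 in VRAM and upcasts it per call."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Keep the packed copy as a buffer; float8_e4m3fn needs PyTorch >= 2.1.
        self.register_buffer("weight_fp8", linear.weight.data.to(torch.float8_e4m3fn))
        self.bias = linear.bias
        self.compute_dtype = linear.weight.dtype  # usually float16/bfloat16

    def forward(self, x):
        # Upcast only for the matmul; the stored copy stays at 1 byte per weight.
        # A real quantizer would also apply per-channel scales here.
        return nn.functional.linear(x, self.weight_fp8.to(self.compute_dtype), self.bias)

def patch_linears(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with the float8 wrapper."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, Float8Linear(child))
        else:
            patch_linears(child)
    return module
```

The patching has to happen after the state dict is loaded and before the model is moved to the GPU, otherwise the full-precision weights still hit VRAM first.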

BTW: please don't use print, use logging.debug; with print the messages go only to the console and the GUI can't catch them.
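
For example (load_weights is just a placeholder name):

```python
import logging

logger = logging.getLogger(__name__)

def load_weights(path):
    logger.debug("Loading OmniGen weights from %s", path)  # picked up by any configured log handler
    # ... actual loading ...
    logger.debug("Weights loaded")
```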

@1038lab (Owner) commented Nov 13, 2024

Applying quantization is a good approach. I'll make an effort to update it when I have the time.
