CogvideoX VAE decoder consumes significantly more memory in the latest version #10036

ic-synth · 2024-11-27T14:02:53Z

ic-synth
Nov 27, 2024

Describe the bug

The memory consumption for CogVideoX decoder in diffusers 0.31.0 version consumes significantly more memory. To the point where the model goes OOM even on 80G H100 GPUs with a relatively modest frame count.
I include two profiles for very small input tensors of only 5 frames where its visible how much larger the VAE memory consumption is.

Memory footprints for different input sizes are shown below. As you can see, with latest version memory keeps growing with frame count.

Reproduction

Run CogVideoXDecoder3D model with diffusers 0.30.3 and 0.31.0 on the inputs of the same shape and measure the memory consumption as the frame count increases.

# Code requires a GPU with 50+ gigabytes on RAM 
import torch
import diffusers
from diffusers import AutoencoderKLCogVideoX

with torch.no_grad():
    vae = AutoencoderKLCogVideoX().to(dtype=torch.bfloat16).eval()
    vae.decoder = vae.decoder.to(device="cuda:0")
    input_tensor = torch.randn(1,16,5,96,170).to(device="cuda:0", dtype=torch.bfloat16)
    print("Decoding ... Input size:", input_tensor.shape, "Diffuser version", diffusers.__version__)
    vae.decode(input_tensor)
    print(torch.cuda.max_memory_allocated() / (1024 ** 3) , torch.cuda.max_memory_reserved() / (1024 ** 3))

Logs

No response

System Info

Python 3.11.
Diffusers 0.30.3 vs 0.31.0

Who can help?

@sayakpaul @DN6 @yiyixuxu

sayakpaul · 2024-11-27T15:49:10Z

sayakpaul
Nov 27, 2024
Maintainer

It's not a bug, though.

Have you tried using the memory optimization techniques mentioned in the docs?
https://huggingface.co/docs/diffusers/en/api/pipelines/cogvideox#memory-optimization

0 replies

ic-synth · 2024-11-28T11:33:26Z

ic-synth
Nov 28, 2024
Author

Hey @sayakpaul . In the 0.30.3 version, the memory consumption does not grow together with temporal dimension, so it's serialised on that axis. On the screenshots I provided above that was 52.8G for inputs on 5 or 13 temporal steps. With the latest changes, that is no longer the case and the memory consumption is not just increased, but is now dependent on the temporal axis as it goes from 56.4G to a whopping 72.1G with the same inputs as before.

I understand that cpu_offloading and other tricks are available, but it's worth comparing apples to apples and enable/disable this trick for both tests. Here both runs don't use cpu offloading, and I in general the cpu speeds are prohibitively slow for the model of this size.

My main point remains: in 0.30.3 version memory is O(1) in temporal input dimension. In 0.31.0 memory consumption becomes O(n) in temporal dimension. If it's not a bug, than its a very serious regression that should be explicitly indicated that VAE does not scale any more with input sizes.

0 replies

ic-synth · 2024-12-02T13:51:22Z

ic-synth
Dec 2, 2024
Author

@sayakpaul Hello, is there an update on this? I'd like to reopen this as an issue because this memory consumption is not sustainable even for H100 usage and using cpu for offloading negates the benefits of using GPUs

0 replies

sayakpaul · 2024-12-02T14:23:36Z

sayakpaul
Dec 2, 2024
Maintainer

Will let @a-r-r-o-w and @DN6 comment here. It will be helpful if you provide actual code snippets instead of screenshots so that the team members could try it out.

0 replies

ic-synth · 2024-12-02T16:06:52Z

ic-synth
Dec 2, 2024
Author

@sayakpaul @a-r-r-o-w @DN6 actual code snippet is provided in the Reproduction section above and the screenshot shows this very snippet being run with two different versions of diffusers, as I'm not sure its possible to run two versions of the same module in the same process

2 replies

sayakpaul Dec 2, 2024
Maintainer

Could you do this benchmarking in a script rather than a notebook? I don't see any freeing of the GPU resources before running the second memory benchmark.

ic-synth Dec 2, 2024
Author

In order to update the module version I had to restart the kernel which also frees GPU resources. you can see cell execution counts go [1][2][3] [1][2][3] rather than [1][2][3][4][5][6]. You can achieve the same effect with two virtual environments that contain two versions of diffusers and run the same code. I did this in a notebook for demonstrative purposes: you can run the reproducer code snippet with two envs. Do you need a bash script to create two python venvs and pip install diffusers into them?

yiyixuxu · 2024-12-03T01:27:19Z

yiyixuxu
Dec 3, 2024
Maintainer

thanks for the issue!
we will look into this!
cc @zRzRzRzRzRzRzR too if you have time to look into this

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CogvideoX VAE decoder consumes significantly more memory in the latest version #10036

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 2 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

CogvideoX VAE decoder consumes significantly more memory in the latest version #10036

ic-synth Nov 27, 2024

Describe the bug

Reproduction

Logs

System Info

Who can help?

Replies: 6 comments · 2 replies

sayakpaul Nov 27, 2024 Maintainer

ic-synth Nov 28, 2024 Author

ic-synth Dec 2, 2024 Author

sayakpaul Dec 2, 2024 Maintainer

ic-synth Dec 2, 2024 Author

sayakpaul Dec 2, 2024 Maintainer

ic-synth Dec 2, 2024 Author

yiyixuxu Dec 3, 2024 Maintainer

ic-synth
Nov 27, 2024

Replies: 6 comments 2 replies

sayakpaul
Nov 27, 2024
Maintainer

ic-synth
Nov 28, 2024
Author

ic-synth
Dec 2, 2024
Author

sayakpaul
Dec 2, 2024
Maintainer

ic-synth
Dec 2, 2024
Author

sayakpaul Dec 2, 2024
Maintainer

ic-synth Dec 2, 2024
Author

yiyixuxu
Dec 3, 2024
Maintainer