Hi everyone 🤗 We're looking into supporting long video generation techniques for the various video models that are supported in Diffusers - AnimateDiff, SVD, Latte, I2VGenXL, Zeroscope, Text2Video-Zero, etc.
There have been a few requests in the past that we'd like to address in the next few weeks - #5576, #7731, #8274, and similar. To prioritize what to work on first (currently, FreeNoise) and to gauge the speed/quality trade-offs and community perception of the various techniques, we'd like to hear your thoughts and integrate what you think is best. We benchmarked some of these techniques ourselves; the memory/time results are presented below. Many of the techniques are model-agnostic and can be applied to any underlying video model.
For all runs, the following is true:

- The prompt used for generation was "A pirate ship trapped in a cosmic maelstrom, nebulae, 4k, high definition".
- Run on a single A100 (80GB) in fp16.
- DDIM with 20 steps. The number of generated frames was set to 64, but the internal code may produce fewer output frames or process more total frames.
- No optimizations enabled and mostly default repository settings.
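For reference, here is a minimal sketch of how one of these runs can be timed and memory-profiled with Diffusers. The model id and measurement scaffolding are illustrative; the numbers below came from the respective research repositories, not from this exact script:

```python
import time
import torch
from diffusers import DDIMScheduler, TextToVideoSDPipeline

# Illustrative: Zeroscope v2 via the standard text-to-video pipeline.
pipe = TextToVideoSDPipeline.from_pretrained(
    "cerspense/zeroscope_v2_576w", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

prompt = "A pirate ship trapped in a cosmic maelstrom, nebulae, 4k, high definition"

torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
frames = pipe(prompt, num_frames=64, num_inference_steps=20).frames
elapsed = time.perf_counter() - start

print(f"time: {elapsed / 60:.1f} min")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GB")
```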
Here are the results:
LD stands for Lookahead Denoising. You can read more about it in the FIFO-Diffusion paper.

| Method | Resolution (W x H x F) | Time (minutes) | Memory (GB) |
|---|---|---|---|
| FIFO (VideoCrafter2, with LD) | 512 x 320 x 64 | 22 | 7 |
| FIFO (VideoCrafter2, without LD) | 512 x 320 x 64 | 12 | 7 |
| FIFO (Zeroscope v2, with LD) | 512 x 320 x 64 | 10 | 6.8 |
| FIFO (Zeroscope v2, without LD) | 512 x 320 x 64 | 5.5 | 6.8 |
| StreamingT2V (AnimateDiff) | 512 x 512 x 56 (256 x 256 x 64 in preview) | 10 | 18.1 |
| StreamingT2V (ModelScopeT2V) | 512 x 512 x 56 (256 x 256 x 64 in preview) | 10 | 18.3 |
| StreamingT2V (SVD) | 512 x 512 x 56 (256 x 256 x 64 in preview) | 10 | 24.1 |
| FreeNoise (VideoCrafter2) | 512 x 320 x 64 | - | - |
| FreeNoise (VideoCrafter2) | 1024 x 576 x 64 | 7.2 | 36.7 |

The memory/time requirements specified for AnimateDiff below are for context windows of 16 frames and 24 frames; the 24-frame measurements are in parentheses (see the sketch after the table).

| Method | Resolution (W x H x F) | Time (minutes) | Memory (GB) |
|---|---|---|---|
| FreeNoise (AnimateDiff) | 512 x 512 x 64 | 1.5 (1.2) | 6.8 (8.2) |
| Context Scheduler (AnimateDiff, static) | 512 x 512 x 64 | 1.5 (1.2) | 6.8 (8.3) |
| Context Scheduler (AnimateDiff, uniform) | 512 x 512 x 64 | 1.6 (1.4) | 6.8 (8.2) |
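For context on the last two rows: the context scheduler splits a long video into overlapping windows that the base 16-frame AnimateDiff model can handle, then blends the overlapping predictions. Below is a minimal, illustrative sketch of how the two schedules can enumerate windows; the function names and parameters are my assumptions, loosely modeled on the community AnimateDiff context-window implementations, not an existing Diffusers API:

```python
# Illustrative sketch only: window enumeration for a 64-frame video.
# Function names/parameters are hypothetical, not a Diffusers API.

def static_schedule(num_frames, context_length=16, overlap=4):
    """Fixed overlapping windows swept once across the video."""
    stride = context_length - overlap
    for start in range(0, num_frames - overlap, stride):
        yield list(range(start, min(start + context_length, num_frames)))

def uniform_schedule(num_frames, context_length=16, max_stride=4):
    """Windows at several strides (1, 2, 4, ...) so that distant frames
    are also denoised together, improving long-range consistency."""
    stride = 1
    while stride <= max_stride:
        step = context_length * stride
        for start in range(0, num_frames, step):
            yield [f % num_frames for f in range(start, start + step, stride)]
        stride *= 2

print(list(static_schedule(64)))   # 5 windows of 16 frames, 4-frame overlap
print(list(uniform_schedule(64)))  # 7 windows covering strides 1, 2 and 4
```

During sampling, each window is denoised by the base model at every step and overlapping noise predictions are averaged back into the full latent sequence, which is why memory stays close to the single-window cost while runtime grows with the number of windows.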
It is worth noting that some of the methods, for example FIFO-Diffusion, can substantially reduce inference time in multi-GPU settings by processing disjoint sets of video frames in parallel. For a fair comparison, however, all numbers reported here are from a single GPU.
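As a rough illustration of why FIFO-Diffusion parallelizes so well, here's a conceptual sketch of its diagonal denoising queue. `denoise_one_step` is a hypothetical stand-in for the UNet + scheduler call, and all shapes/numbers are illustrative:

```python
import collections
import torch

num_steps = 20             # DDIM steps == queue length
frame_shape = (4, 40, 64)  # latent C x H x W (illustrative)

def denoise_one_step(latents, timesteps):
    # Placeholder: a real implementation would call the video UNet and
    # scheduler here, advancing each frame one step at its *own* timestep.
    # This per-frame timestep is the "diagonal" in FIFO-Diffusion.
    return latents * 0.95  # dummy update so the sketch runs end to end

# The queue holds `num_steps` latent frames; frames near the head have
# undergone more denoising steps than frames near the tail. In practice
# the queue is warmed up from a short base generation, not pure noise.
queue = collections.deque(torch.randn(num_steps, *frame_shape).unbind(0))
finished = []

for _ in range(64):  # emit 64 output frames
    latents = torch.stack(list(queue))
    # Head is nearly clean (small t), tail is pure noise (large t).
    timesteps = torch.arange(num_steps)
    latents = denoise_one_step(latents, timesteps)
    queue = collections.deque(latents.unbind(0))
    finished.append(queue.popleft())        # head frame is now fully denoised
    queue.append(torch.randn(frame_shape))  # fresh noise enters at the tail
```

Because the queue positions sit at disjoint noise-level bands, the queue can be split into contiguous chunks and each chunk stepped on its own GPU, which is where the multi-GPU speedup comes from.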
The generation results can be found here. Looking at them, it is clear that many of these methods cannot run on consumer GPUs without optimizations. We'd like to make the best of them more broadly and easily available, work on optimizing inference, and implement them in ways that are easy to understand and adopt in future open video models.
Let us know your thoughts on what techniques/models you'd like to see added in the future. Keep diffusing 🧨
Which method would you like to see integrated soonest? 🏃‍♂️ This could be based on your experience using it, or just from comparing results on the project page demos, etc.