
[core] LTX Video #10021

Open · wants to merge 38 commits into main
Conversation

@a-r-r-o-w (Member) commented Nov 26, 2024

T2V (text-to-video):

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)

I2V (image-to-video):

import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = LTXImageToVideoPipeline.from_pretrained("a-r-r-o-w/LTX-Video-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
)
prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)
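
If the full pipeline doesn't fit in VRAM, the standard diffusers offloading helper should apply here as well; treat this as a suggestion rather than something validated on this branch:

```python
# Instead of pipe.to("cuda"): keeps submodules on CPU and moves each one to the GPU
# only while it is being used, trading some speed for lower peak memory.
pipe.enable_model_cpu_offload()
```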

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w a-r-r-o-w marked this pull request as ready for review November 27, 2024 14:22
@a-r-r-o-w a-r-r-o-w requested review from yiyixuxu, stevhliu and DN6 and removed request for yiyixuxu November 27, 2024 14:22
Comment on lines +190 to +197
scheduler = FlowMatchEulerDiscreteScheduler(
    use_dynamic_shifting=True,
    base_shift=0.95,
    max_shift=2.05,
    base_image_seq_len=1024,
    max_image_seq_len=4096,
    shift_terminal=0.1,
)
a-r-r-o-w (Member Author)

cc @yiyixuxu for the shift_terminal change
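
For readers following along, here is a minimal standalone sketch of what a terminal shift could look like, assuming the intent is to rescale the sigma schedule so its final value lands on `shift_terminal` instead of 0. It is written as a free function for illustration; the actual scheduler method may differ:

```python
import torch

def stretch_shift_to_terminal(t: torch.Tensor, shift_terminal: float = 0.1) -> torch.Tensor:
    # Linearly stretch a decreasing schedule in (0, 1] so that its last value
    # equals `shift_terminal` while a starting value of 1.0 stays at 1.0.
    one_minus_t = 1.0 - t
    scale = one_minus_t[-1] / (1.0 - shift_terminal)
    return 1.0 - one_minus_t / scale

sigmas = torch.linspace(1.0, 1.0 / 50, 50)      # toy sigma schedule
print(stretch_shift_to_terminal(sigmas)[-1])    # tensor(0.1000)
```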

Comment on lines +199 to +202
elif qk_norm == "rms_norm_across_heads":
    # LTX applies qk norm across all heads
    self.norm_q = RMSNorm(dim_head * heads, eps=eps)
    self.norm_k = RMSNorm(dim_head * kv_heads, eps=eps)
a-r-r-o-w (Member Author)

@DN6 Should I follow your approach with Mochi and create a separate attention class for LTX?

Collaborator

OK, but we want to be more careful here; ideally we'd do that as part of a carefully planned-out refactor.
Maybe it would be safe to just inherit from Attention for now? E.g., we wrote code like this with the assumption in mind that we only have one attention class:
https://github.com/huggingface/diffusers/blob/e47cc1fc1a89a5375c322d296cd122fe71ab859f/src/diffusers/pipelines/pag/pag_utils.py#L57C39-L57C48

cc @DN6 here too
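
For context on the distinction being discussed, a rough sketch of per-head vs. across-heads qk normalization (shapes and names are illustrative, not the diffusers API; assumes a recent PyTorch with `nn.RMSNorm`):

```python
import torch
from torch import nn

batch, seq, heads, dim_head = 2, 16, 8, 64
q = torch.randn(batch, seq, heads * dim_head)  # projected queries, heads still fused

# Per-head variant: compute RMS statistics over each head's own dim_head slice.
per_head_norm = nn.RMSNorm(dim_head)
q_per_head = per_head_norm(q.view(batch, seq, heads, dim_head)).reshape(batch, seq, heads * dim_head)

# Across-heads variant (what the LTX checkpoint expects): compute the RMS statistic
# over the full concatenated heads * dim_head dimension in one normalization.
across_heads_norm = nn.RMSNorm(heads * dim_head)
q_across = across_heads_norm(q)
```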

src/diffusers/models/autoencoders/autoencoder_kl_ltx.py (outdated review comment, resolved)
@@ -169,6 +170,12 @@ def _sigma_to_t(self, sigma):
    def time_shift(self, mu: float, sigma: float, t: torch.Tensor):
        return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)

    def stretch_shift_to_terminal(self, t: torch.Tensor) -> torch.Tensor:

@stevhliu (Member) left a comment

Thanks for adding!

Outdated review comments (resolved):
docs/source/en/api/models/autoencoderkl_ltx.md
docs/source/en/api/pipelines/ltx.md
src/diffusers/models/autoencoders/autoencoder_kl_ltx.py
src/diffusers/models/autoencoders/autoencoder_kl_ltx.py
src/diffusers/models/autoencoders/autoencoder_kl_ltx.py
src/diffusers/models/transformers/transformer_ltx.py
src/diffusers/models/transformers/transformer_ltx.py
src/diffusers/pipelines/ltx/pipeline_ltx.py
src/diffusers/pipelines/ltx/pipeline_ltx.py
src/diffusers/pipelines/ltx/pipeline_ltx.py
Comment on lines 395 to 398
hidden_states = hidden_states.reshape(
    batch_size, -1, post_patch_num_frames, p_t, post_patch_height, p, post_patch_width, p
)
hidden_states = hidden_states.permute(0, 2, 4, 6, 1, 3, 5, 7).flatten(4, 7).flatten(1, 3)
Collaborator

Is there any reason we reshape from 5D -> 3D and then 3D -> 5D on every iteration?

Collaborator

I think it would be nice to:

  1. have `_pack_latents` and `_unpack_latents` methods like we did for Flux
  2. maybe move the rotary embedding into the pipeline so we only pack/unpack once (I know we currently have some discrepancy here, so open to discussion); or we could just configure the rotary pos embed class in the pipeline so we do not need to pass the shape info each time

a-r-r-o-w (Member Author)

Yes, I'll add the pack and unpack latent methods.

Regarding the RoPE, I think a separate layer approach is okay even though it requires recomputation at every step. This is because we're planning to work on caching hooks that would enable the outputs of any layer to be cached and reused. Since RoPE is an integral part of many models, we could add some opt-out logic so that caching is enabled by default on these kinds of model-specific RoPE layers. WDYT?
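
For illustration, a rough sketch of what packing/unpacking could look like, mirroring the reshape/permute in the snippet quoted above (function names and signatures here are illustrative and may not match the final PR):

```python
import torch

def pack_latents(latents: torch.Tensor, patch_size: int = 1, patch_size_t: int = 1) -> torch.Tensor:
    # (B, C, F, H, W) -> (B, F/p_t * H/p * W/p, C * p_t * p * p)
    batch_size, num_channels, num_frames, height, width = latents.shape
    latents = latents.reshape(
        batch_size, num_channels,
        num_frames // patch_size_t, patch_size_t,
        height // patch_size, patch_size,
        width // patch_size, patch_size,
    )
    return latents.permute(0, 2, 4, 6, 1, 3, 5, 7).flatten(4, 7).flatten(1, 3)

def unpack_latents(
    latents: torch.Tensor, num_frames: int, height: int, width: int,
    patch_size: int = 1, patch_size_t: int = 1,
) -> torch.Tensor:
    # Inverse of pack_latents: (B, seq, dim) -> (B, C, F, H, W), where
    # num_frames / height / width are the post-patchification grid sizes.
    batch_size = latents.size(0)
    latents = latents.reshape(batch_size, num_frames, height, width, -1, patch_size_t, patch_size, patch_size)
    return latents.permute(0, 4, 1, 5, 2, 6, 3, 7).flatten(6, 7).flatten(4, 5).flatten(2, 3)

x = torch.randn(1, 128, 13, 16, 24)
packed = pack_latents(x)
assert torch.equal(unpack_latents(packed, 13, 16, 24), x)  # exact round trip
```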

Comment on lines +777 to +806
# ============= TODO(aryan): needs a look by YiYi
latents = latents.float()

noise_pred = self._unpack_latents(
    noise_pred,
    latent_num_frames,
    latent_height,
    latent_width,
    self.transformer_spatial_patch_size,
    self.transformer_temporal_patch_size,
)
latents = self._unpack_latents(
    latents,
    latent_num_frames,
    latent_height,
    latent_width,
    self.transformer_spatial_patch_size,
    self.transformer_temporal_patch_size,
)

noise_pred = noise_pred[:, :, 1:]
noise_latents = latents[:, :, 1:]
pred_latents = self.scheduler.step(noise_pred, t, noise_latents, return_dict=False)[0]

latents = torch.cat([latents[:, :, :1], pred_latents], dim=2)
latents = self._pack_latents(
    latents, self.transformer_spatial_patch_size, self.transformer_temporal_patch_size
)
latents = latents.to(dtype=latents_dtype)
# =============
a-r-r-o-w (Member Author)

@yiyixuxu They use per-latent-frame timesteps (actually, it's per-token timesteps, but all tokens corresponding to the same frame share the same timestep). Since we don't support that in our schedulers, we can't really do the normal scheduler.step(). These changes were required to make the pipeline at least generate reasonable results. The quality of the generations looks similar to me, but I will try to match it numerically.
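
To make the idea concrete, here is a purely hypothetical illustration of per-latent-frame timesteps (not a diffusers API): every token belonging to latent frame i shares that frame's timestep, and the conditioning frame stays clean.

```python
import torch

batch_size, num_latent_frames = 1, 21
current_t = 800.0  # the denoising timestep for the frames being generated

# One timestep per latent frame; the first frame is the clean image condition
# and is therefore kept at t = 0 for the whole denoising loop.
per_frame_t = torch.full((batch_size, num_latent_frames), current_t)
per_frame_t[:, 0] = 0.0
```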
