Replies: 2 comments
-
The mapping from the condition (text) and noise to an image happens in the diffusion model (a UNet or DiT). Also, the CLIP text encoder is pretrained on massive datasets (LAION, OpenAI's 400M image-text pairs, etc.).
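To make the first point concrete, here is a toy sketch of the mechanism by which the text condition enters the diffusion model: the UNet/DiT queries the CLIP text-token embeddings via cross-attention. This is a pure-Python illustration with invented dimensions and values, not the diffusers implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(latent_q, text_kv):
    # latent_q: one query vector derived from the noisy image latent
    # text_kv: list of (key, value) pairs, one per text-token embedding
    d = len(latent_q)
    scores = [sum(q * k for q, k in zip(latent_q, key)) / math.sqrt(d)
              for key, _ in text_kv]
    weights = softmax(scores)
    out = [0.0] * len(text_kv[0][1])
    for w, (_, value) in zip(weights, text_kv):
        out = [o + w * v for o, v in zip(out, value)]
    return out

# One latent query and two text tokens; the query matches the first token.
q = [1.0, 0.0]
tokens = [([1.0, 0.0], [10.0, 0.0]),   # token aligned with the query
          ([0.0, 1.0], [0.0, 10.0])]   # unrelated token
out = cross_attention(q, tokens)
# The output is pulled toward the value of the matching text token,
# which is how the text condition steers the denoising prediction.
assert out[0] > out[1]
```

This is why the UNet/DiT, not the text encoder, is what learns the text-to-image mapping: the encoder only supplies the keys and values being attended over.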
-
I know it's a late answer, but just in case someone reads this later: we added the option to train the text encoders with SD3 LoRA training; you can see an example here.
-
I mean, if the concept is slightly different from what the model was trained on, it trains very badly without text encoder LoRA training.
(I mean only the CLIP text encoder; please don't add LoRA training on top of T5XXL.)
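As a toy sketch of what LoRA on the CLIP text encoder amounts to, and why it is cheap: each targeted linear layer gets a frozen base weight W plus a small trainable low-rank update B·A. All shapes and values below are invented for illustration; this is not the diffusers/PEFT implementation.

```python
def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0, rank=1):
    # y = W x + (alpha / rank) * B (A x)
    # W is frozen; only the small factors A and B are trained.
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]

# Frozen base weight: 4x4 identity, so the base output equals the input.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Trainable low-rank factors (rank 1 here): A is 1x4, B is 4x1.
A = [[0.5, 0.5, 0.5, 0.5]]
B = [[1.0] for _ in range(4)]

x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(W, A, B, x)  # A x = [5.0], so each output gains 5.0

# With B zero-initialized (as at the start of training), the layer's
# output is exactly the frozen model's output, so training starts
# from the base model's behavior.
B0 = [[0.0] for _ in range(4)]
assert lora_forward(W, A, B0, x) == matvec(W, x)

# Parameter count for a hypothetical 768x768 CLIP layer: LoRA adds
# rank*(d_in + d_out) parameters vs d_in*d_out for a full fine-tune.
full_ft = 768 * 768          # 589824
lora_r4 = 4 * (768 + 768)    # 6144
assert lora_r4 < full_ft
```

Because the added parameter count is so small, adapting the CLIP text encoder this way is a cheap lever when the concept's text embedding is off, whereas T5XXL is far larger and better left frozen.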