diff --git a/index.html b/index.html
index 3745c3b..cfff8a8 100644
--- a/index.html
+++ b/index.html
@@ -231,7 +231,6 @@

the text prompt, thus simplifying the learning of the mapping from embeddings to image outputs. Finally, to align the pre-trained Stable Diffusion model (1.4) with the embeddings of our modular encoder, we retrain the conditioning by finetuning the cross-attention weights (2.2).
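The selective finetuning described above can be sketched as a parameter filter: freeze everything and train only the cross-attention weights. This is a minimal sketch, not the authors' code; the parameter names below are hypothetical, assuming the common Stable Diffusion convention where cross-attention layers are named `attn2`.

```python
def trainable_params(named_params):
    """Return the names of parameters left trainable: cross-attention only.
    All other weights (self-attention, convolutions, etc.) stay frozen."""
    return [name for name in named_params if ".attn2." in name]

# Hypothetical parameter names, illustrating the filtering:
params = [
    "down_blocks.0.attn1.to_q.weight",   # self-attention: frozen
    "down_blocks.0.attn2.to_k.weight",   # cross-attention: finetuned
    "down_blocks.0.attn2.to_v.weight",   # cross-attention: finetuned
    "mid_block.resnets.0.conv1.weight",  # convolution: frozen
]
print(trainable_params(params))
# ['down_blocks.0.attn2.to_k.weight', 'down_blocks.0.attn2.to_v.weight']
```

In a real training loop one would set `requires_grad = False` on every parameter not in this list before building the optimizer.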

- src/imgs/architecture.png
@@ -243,7 +242,7 @@

Image Fidelity and Text-to-Image Alignment

We measure image fidelity and image-text alignment using the standard metrics FID-30K and CLIP score. We find that MultiFusion prompted with text only performs on par with Stable Diffusion, despite the extension of the encoder to support multiple languages and modalities.
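For reference, the CLIP score used here is typically defined as the cosine similarity between the CLIP image and text embeddings, scaled by 100 and clamped at zero. The sketch below illustrates the formula with plain Python lists standing in for CLIP features; it is not the evaluation code used in the paper.

```python
import math

def clip_score(img_emb, txt_emb):
    """CLIP score: max(100 * cos(img_emb, txt_emb), 0).
    The lists here are stand-ins for CLIP embedding vectors."""
    dot = sum(a * b for a, b in zip(img_emb, txt_emb))
    norm_i = math.sqrt(sum(a * a for a in img_emb))
    norm_t = math.sqrt(sum(b * b for b in txt_emb))
    return max(100.0 * dot / (norm_i * norm_t), 0.0)

print(clip_score([1.0, 0.0], [1.0, 0.0]))   # identical embeddings -> 100.0
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # opposite embeddings -> 0.0 (clamped)
```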

- method
+

Compositional Robustness

Image composition is a known limitation of diffusion models. Through evaluation on our new benchmark MCC-250, we show that multimodal prompting leads to greater compositional robustness, as judged by humans.

method