diff --git a/index.html b/index.html
index 3745c3b..cfff8a8 100644
--- a/index.html
+++ b/index.html
@@ -231,7 +231,6 @@

the text prompt, thus simplifying the learning of the mapping from embeddings to image outputs. Finally, to align the pre-trained Stable Diffusion model (1.4) with the embeddings of our modular encoder, we retrain the conditioning by finetuning the cross-attention weights (2.2).
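The selective finetuning described above can be sketched as a parameter filter: freeze everything and train only the cross-attention weights. This is a minimal sketch, not the authors' code; the parameter names below are hypothetical, assuming the common Stable Diffusion convention where cross-attention layers are named `attn2`.

```python
def trainable_params(named_params):
    """Return the names of parameters left trainable: cross-attention only.
    All other weights (self-attention, convolutions, etc.) stay frozen."""
    return [name for name in named_params if ".attn2." in name]

# Hypothetical parameter names, illustrating the filtering:
params = [
    "down_blocks.0.attn1.to_q.weight",   # self-attention: frozen
    "down_blocks.0.attn2.to_k.weight",   # cross-attention: finetuned
    "down_blocks.0.attn2.to_v.weight",   # cross-attention: finetuned
    "mid_block.resnets.0.conv1.weight",  # convolution: frozen
]
print(trainable_params(params))
# ['down_blocks.0.attn2.to_k.weight', 'down_blocks.0.attn2.to_v.weight']
```

In a real training loop one would set `requires_grad = False` on every parameter not in this list before building the optimizer.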

- src/imgs/architecture.png
@@ -243,7 +242,7 @@

Image Fidelity and Text-to-Image Alignment

We measure image fidelity and image-text alignment using the standard metrics FID-30K and CLIP score. We find that MultiFusion prompted with text only performs on par with Stable Diffusion, despite the extension of the encoder to support multiple languages and modalities.
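For reference, the CLIP score used here is typically defined as the cosine similarity between the CLIP image and text embeddings, scaled by 100 and clamped at zero. The sketch below illustrates the formula with plain Python lists standing in for CLIP features; it is not the evaluation code used in the paper.

```python
import math

def clip_score(img_emb, txt_emb):
    """CLIP score: max(100 * cos(img_emb, txt_emb), 0).
    The lists here are stand-ins for CLIP embedding vectors."""
    dot = sum(a * b for a, b in zip(img_emb, txt_emb))
    norm_i = math.sqrt(sum(a * a for a in img_emb))
    norm_t = math.sqrt(sum(b * b for b in txt_emb))
    return max(100.0 * dot / (norm_i * norm_t), 0.0)

print(clip_score([1.0, 0.0], [1.0, 0.0]))   # identical embeddings -> 100.0
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # opposite embeddings -> 0.0 (clamped)
```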

- method
+

Compositional Robustness

Image composition is a known limitation of diffusion models. Through evaluation on our new benchmark MCC-250, we show that multimodal prompting leads to greater compositional robustness, as judged by humans.

method