diff --git a/_posts/2024-08-12-AIGC.md b/_posts/2024-08-12-AIGC.md
new file mode 100644
index 00000000..ea522964
--- /dev/null
+++ b/_posts/2024-08-12-AIGC.md
@@ -0,0 +1,275 @@
+---
+layout: post
+title: Generative AI
+author: [Richard Kuo]
+category: [Lecture]
+tags: [jekyll, ai]
+---
+
+This introduction covers Text-to-Image, Text-to-Video, Text-to-Motion, Text-to-3D, and Image-to-3D.
+
+---
+## Text-to-Image
+**News:** [An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy.](https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html)
+![](https://static01.nyt.com/images/2022/09/01/business/00roose-1/merlin_212276709_3104aef5-3dc4-4288-bb44-9e5624db0b37-superJumbo.jpg?quality=75&auto=webp)
+
+**Blog:** [DALL-E, DALL-E2 and StoryDALL-E](https://zhangtemplar.github.io/dalle/)
+
+---
+### DALL.E
+DALL·E is a 12-billion parameter version of GPT-3 trained to generate images from text descriptions, using a dataset of text–image pairs.
+
+**Blog:** [https://openai.com/blog/dall-e/](https://openai.com/blog/dall-e/)
+**Paper:** [Zero-Shot Text-to-Image Generation](https://arxiv.org/abs/2102.12092)
+**Code:** [openai/DALL-E](https://github.com/openai/DALL-E)
+
+An overview of DALL-E is illustrated below. It has two components: a discrete VAE (dVAE) maps each 256x256 image to a 32x32 grid of image tokens, each drawn from a codebook of 8192 possible values; these image tokens are then concatenated with the caption's BPE-encoded text tokens (at most 256 of them), and the combined sequence is used to train an autoregressive transformer.
+![](https://raw.githubusercontent.com/zhangtemplar/zhangtemplar.github.io/master/uPic/2022_09_30_16_08_31_105325789-46d94700-5bcd-11eb-9c91-818e8b5d6a35.jpeg)
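+
+Below is a minimal, illustrative PyTorch sketch of this token layout, not the official DALL-E implementation: image codes from a toy dVAE codebook are appended to the caption tokens and the combined sequence is modeled autoregressively. The sizes mirror the numbers quoted above (16,384 BPE text tokens, 8192 image codes), but the model itself is tiny and untrained.
+
+```python
+# Toy sketch of DALL-E's single-stream token modeling (illustrative only).
+import torch
+import torch.nn as nn
+
+TEXT_VOCAB, TEXT_LEN = 16384, 256        # BPE text tokens (at most 256 per caption)
+IMAGE_VOCAB, IMAGE_LEN = 8192, 32 * 32   # dVAE codes for a 32x32 latent grid
+
+class ToyDallE(nn.Module):
+    def __init__(self, d_model=512):
+        super().__init__()
+        # One shared vocabulary: text ids first, image ids offset after them.
+        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
+        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
+        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
+        self.head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)
+
+    def forward(self, text_ids, image_ids):
+        # Concatenate caption tokens and image tokens into one sequence and
+        # model it with a causal (autoregressive) attention mask.
+        seq = torch.cat([text_ids, image_ids + TEXT_VOCAB], dim=1)
+        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
+        h = self.blocks(self.embed(seq), mask=mask)
+        return self.head(h)
+
+text = torch.randint(0, TEXT_VOCAB, (1, TEXT_LEN))
+image = torch.randint(0, IMAGE_VOCAB, (1, IMAGE_LEN))
+print(ToyDallE()(text, image).shape)   # (1, 256 + 1024, 16384 + 8192)
+```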
+
+---
+### Contrastive Language-Image Pre-training (CLIP)
+**Paper:** [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020)
+![](https://production-media.paperswithcode.com/methods/3d5d1009-6e3d-4570-8fd9-ee8f588003e7.png)
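+
+The core of CLIP is a symmetric contrastive objective over a batch of image-text pairs. Below is a minimal sketch of that loss on toy (random) features; it mirrors the pseudocode in the paper rather than OpenAI's released code, and the encoders are omitted.
+
+```python
+# Symmetric contrastive (InfoNCE-style) loss used by CLIP, on toy features.
+import torch
+import torch.nn.functional as F
+
+batch = 8
+image_features = F.normalize(torch.randn(batch, 512), dim=-1)  # from image encoder
+text_features  = F.normalize(torch.randn(batch, 512), dim=-1)  # from text encoder
+temperature = 0.07
+
+# Cosine-similarity logits between every image and every text in the batch.
+logits = image_features @ text_features.t() / temperature
+labels = torch.arange(batch)  # matching pairs sit on the diagonal
+
+# Symmetric cross-entropy: pick the right text for each image, and vice versa.
+loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
+print(loss.item())
+```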
+
+---
+### [DALL.E-2](https://openai.com/dall-e-2/)
+DALL·E 2 is a new AI system that can create realistic images and art from a description in natural language.
+
+**Blog:** [How DALL-E 2 Actually Works](https://www.assemblyai.com/blog/how-dall-e-2-actually-works/)
+"a bowl of soup that is a portal to another dimension as digital art".
+![](https://www.assemblyai.com/blog/content/images/size/w1000/2022/04/soup.png)
+
+**Paper:** [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125)
+![](https://pic3.zhimg.com/80/v2-e096e3cf8a1e7a9f569b18f658da574e_720w.jpg)
+
+---
+### [LAION-5B Dataset](https://laion.ai/blog/laion-5b/)
+A dataset of 5.85 billion CLIP-filtered image-text pairs.
+**Paper:** [LAION-5B: An open large-scale dataset for training next generation image-text models](https://arxiv.org/abs/2210.08402)
+![](https://lh5.googleusercontent.com/u4ax53sZ0oABJ2tCt4FH6fs4V6uUQ_DRirV24fX0EPpGLMZrA8OlknEohbC0L1Nctvo7hLi01R4I0a3HCfyUMnUcCm76u86ML5CyJ-5boVk_8E5BPG5Z2eeJtPDQ00IhVE-camk4)
+
+---
+### [DALL.E-3](https://openai.com/dall-e-3)
+![](https://media.cloudbooklet.com/uploads/2023/09/23121557/DALL-E-3.jpg)
+
+**Paper:** [Improving Image Generation with Better Captions](https://cdn.openai.com/papers/dall-e-3.pdf)
+
+**Blog:** [DALL-E 2 vs DALL-E 3 Everything you Need to Know](https://www.cloudbooklet.com/dall-e-2-vs-dall-e-3/)
+
+**Dataset Recaptioning**
+![](https://github.com/rkuo2000/AI-course/blob/main/images/DALL-E3_Descriptive_Synthetic_Captions.png?raw=true)
+
+---
+### Stable Diffusion
+**Paper:** [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)
+![](https://miro.medium.com/v2/resize:fit:720/format:webp/0*rW_y1kjruoT9BSO0.png)
+**Blog:** [Stable Diffusion: Best Open Source Version of DALL·E 2](https://towardsdatascience.com/stable-diffusion-best-open-source-version-of-dall-e-2-ebcdf1cb64bc)
+![](https://miro.medium.com/v2/resize:fit:828/format:webp/1*F3jVIlEAyLkMpJFhb4fxKQ.png)
+**Code:** [Stable Diffusion](https://github.com/CompVis/stable-diffusion)
+![](https://github.com/CompVis/stable-diffusion/blob/main/assets/stable-samples/txt2img/merged-0005.png?raw=true)
+![](https://github.com/CompVis/stable-diffusion/blob/main/assets/stable-samples/txt2img/merged-0007.png?raw=true)
+
+**Demo:** [Stable Diffusion Online (SDXL)](https://stablediffusionweb.com/)
+Stable Diffusion XL is a latent text-to-image diffusion model capable of generating photorealistic images from any text prompt, letting anyone produce detailed imagery within seconds.
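+
+A minimal text-to-image sketch with the Hugging Face `diffusers` library (assuming a CUDA GPU and that the CompVis/stable-diffusion-v1-4 weights can be downloaded):
+
+```python
+# Minimal Stable Diffusion text-to-image example with diffusers (illustrative).
+import torch
+from diffusers import StableDiffusionPipeline
+
+pipe = StableDiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
+).to("cuda")
+
+image = pipe("a photograph of an astronaut riding a horse").images[0]
+image.save("astronaut_rides_horse.png")
+```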
+
+---
+### [Imagen](https://imagen.research.google/)
+**Paper:** [Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding](https://arxiv.org/abs/2205.11487)
+**Blog:** [How Imagen Actually Works](https://www.assemblyai.com/blog/how-imagen-actually-works/)
+![](https://www.assemblyai.com/blog/content/images/size/w1000/2022/06/imagen_examples.png)
+![](https://www.assemblyai.com/blog/content/images/size/w1000/2022/06/image-6.png)
+The text encoder in Imagen is the encoder network of T5 (Text-to-Text Transfer Transformer).
+![](https://www.assemblyai.com/blog/content/images/2022/06/t5_tasksgif.gif)
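+
+As a small illustration of this role, the `transformers` library exposes the T5 encoder on its own; the sketch below uses `t5-small` as a stand-in (Imagen itself uses a far larger frozen T5-XXL encoder).
+
+```python
+# Using the T5 encoder as a frozen text encoder (illustrative; requires sentencepiece).
+from transformers import T5Tokenizer, T5EncoderModel
+
+tokenizer = T5Tokenizer.from_pretrained("t5-small")
+encoder = T5EncoderModel.from_pretrained("t5-small")
+
+tokens = tokenizer("A brain riding a rocketship heading towards the moon",
+                   return_tensors="pt")
+text_embeddings = encoder(**tokens).last_hidden_state  # (1, seq_len, d_model)
+print(text_embeddings.shape)
+```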
+
+---
+### Diffusion Models
+**Blog:** [Introduction to Diffusion Models for Machine Learning](https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/)
+
+Diffusion Models are a method of generating data that resembles the data they were trained on.
+They are trained by destroying the training data through the successive addition of noise, and then learning to recover the data by reversing this noising process. Given an input image, the Diffusion Model iteratively corrupts it with Gaussian noise over a series of timesteps, ultimately leaving pure Gaussian noise, or "TV static".
+![](https://www.assemblyai.com/blog/content/images/size/w1000/2022/06/image-5.png)
+The Diffusion Model then works backwards, learning how to isolate and remove the noise at each timestep, undoing the destruction process that just occurred.
+Once trained, the reverse process can be run on its own: starting from randomly sampled Gaussian noise, the Diffusion Model gradually denoises it to generate a new image.
+![](https://www.assemblyai.com/blog/content/images/size/w1000/2022/06/image-4.png)
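+
+A toy sketch of the forward (noising) half of this process, using the standard closed-form q(x_t | x_0) with a linear variance schedule (illustrative only, not tied to any particular implementation):
+
+```python
+# Forward diffusion: mix an image with Gaussian noise according to a schedule.
+import torch
+
+T = 1000
+betas = torch.linspace(1e-4, 0.02, T)            # linear variance schedule
+alphas_cumprod = torch.cumprod(1.0 - betas, 0)   # cumulative product of alphas
+
+def q_sample(x0, t):
+    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
+    noise = torch.randn_like(x0)
+    abar = alphas_cumprod[t]
+    return abar.sqrt() * x0 + (1 - abar).sqrt() * noise
+
+x0 = torch.rand(3, 64, 64)              # a fake "image" in [0, 1]
+for t in (0, 250, 999):
+    xt = q_sample(x0, t)
+    print(t, xt.mean().item(), xt.std().item())  # drifts toward pure noise
+```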
+
+---
+### SDXL
+**Paper:** [SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://arxiv.org/abs/2307.01952)
+**Code:** [Generative Models by Stability AI](https://github.com/stability-ai/generative-models)
+![](https://github.com/Stability-AI/generative-models/blob/main/assets/000.jpg?raw=true)
+![](https://github.com/Stability-AI/generative-models/blob/main/assets/tile.gif?raw=true)
+
+**Huggingface:** [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
+![](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/pipeline.png)
+SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.
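+
+A hedged sketch of this base-plus-refiner ("ensemble of experts") usage with `diffusers`, assuming a CUDA GPU and access to the SDXL 1.0 weights:
+
+```python
+# SDXL base produces (noisy) latents; the refiner finishes the denoising.
+import torch
+from diffusers import DiffusionPipeline
+
+base = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
+).to("cuda")
+refiner = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-refiner-1.0",
+    text_encoder_2=base.text_encoder_2, vae=base.vae,
+    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
+).to("cuda")
+
+prompt = "Astronaut in a jungle, cold color palette, detailed, 8k"
+latents = base(prompt, num_inference_steps=40, denoising_end=0.8,
+               output_type="latent").images
+image = refiner(prompt, num_inference_steps=40, denoising_start=0.8,
+                image=latents).images[0]
+image.save("sdxl_astronaut.png")
+```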
+
+**Kaggle:** [https://www.kaggle.com/rkuo2000/stable-diffusion-xl](https://www.kaggle.com/rkuo2000/stable-diffusion-xl/)
+
+---
+### [Transfusion](https://arxiv.org/html/2408.11039v1)
+**Paper:** [Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model](https://www.arxiv.org/abs/2408.11039)
+**Code:**
+![](https://arxiv.org/html/2408.11039v1/x2.png)
+
+---
+## Text-to-Video
+
+### Tune-A-Video
+**Paper:** [Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2212.11565)
+**Code:** [https://github.com/showlab/Tune-A-Video](https://github.com/showlab/Tune-A-Video)
+
+Given a video-text pair as input, Tune-A-Video fine-tunes a pre-trained text-to-image diffusion model for text-to-video generation.
+