[Paper] [Project Page ✨] [Pre-trained Models in 🤗Hugging Face] [Demo] [Civitai]
by Fu-Yun Wang1, Zhaoyang Huang2, Alexander William Bergman3,6, Dazhong Shen4, Peng Gao4, Michael Lingelbach3,6, Keqiang Sun1, Weikang Bian1, Guanglu Song5, Yu Liu4, Hongsheng Li1, Xiaogang Wang1
1CUHK-MMLab 2Avolution AI 3Hedra 4Shanghai AI Lab 5SenseTime 6Stanford University
- [2024.07.27]: Release Training Scripts of PCM-LoRA with Stable Diffusion XL.
- [2024.07.14]: Fix an inference bug caused by the default parameters of DDIM.
- [2024.06.19]: Release the training script of PCM-LoRA with Stable Diffusion 3. See text_to_image_sd3. Release the weights of PCM-LoRA with Stable Diffusion 3. See PCM_Weights.
PCM-SD3-2step-Deterministic | PCM-SD3-4step-Deterministic | PCM-SD3-Stochastic (treat it as a clearer LCM) |
---|---|---|
- [2024.06.04]: Hugging Face Demo is available. Thanks @radames for the commit!
- [2024.06.01]: Release PCM-LoRA weights of Stable Diffusion v1.5 and Stable Diffusion XL on Hugging Face.
- [2024.06.01]: Release Training Script of PCM-LoRA with Stable Diffusion v1.5. See tran_pcm_lora_sd15.sh.
We trained the weights on 8 A800 GPUs, but my tentative experiments suggest that a single GPU can still achieve good results. Happy Children's Day! Never too old to celebrate the joys of childhood!
Note that the adversarial (adv) loss may slightly hurt FID when NFE >= 4, but it generally gives better visual quality.
- [2024.05.30]: Technical report is available on arXiv.
One-Step Generation Comparison by HyperSD | One-Step Generation Comparison by PCM |
---|---|
hypersd | ours |
Our model has clearly better generation diversity than the concurrent work HyperSD.
The Consistency Model (CM) is a promising new family of generative models that can generate high-fidelity images in very few steps (generally 2 steps) under unconditional and class-conditional settings. Previous work, the latent consistency model (LCM), tried to replicate the power of consistency models for text-conditioned generation, but generally failed to achieve pleasing results, especially in the low-step regime (1~4 steps). We believe PCM is a much more successful extension of the original consistency models to high-resolution, text-conditioned image generation, better replicating their power in more advanced generation settings.
Generally, we show there are mainly three limitations of (L)CMs:
- LCM lacks flexibility for CFG choosing and is insensitive to negative prompts.
- LCM fails to produce consistent results across different numbers of inference steps. Its results are blurry when the step count is too large (stochastic sampling error) or too small (inability to generate well in so few steps).
- LCM produces bad and blurry results in the low-step regime.
These limitations can be seen explicitly in the following figure.
We generalize the design space of consistency models for high-resolution text-conditioned image generation, analyzing and tackling the limitations in the previous work LCM.
From a continuous-time perspective, a diffusion model defines a forward conditional probability path, which can equivalently be described by a forward SDE.
For the forward SDE, a remarkable property is that there exists a reverse-time ODE trajectory, termed the probability flow ODE (PF ODE) by Song et al., which introduces no additional stochasticity and still satisfies the pre-defined marginal distributions.
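For reference, these objects can be written in a commonly used notation. The symbols $\alpha_t$, $\sigma_t$, $f_t$, $g_t$, and $p_t$ below follow standard conventions from the diffusion literature and are assumptions here, so they may differ in detail from the notation used in the paper. The three lines give the forward conditional path, the equivalent forward SDE, and the PF ODE:

$$
\begin{aligned}
\mathbf{x}_t &= \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\
\mathrm{d}\mathbf{x}_t &= f_t \mathbf{x}_t \,\mathrm{d}t + g_t \,\mathrm{d}\mathbf{w}_t, \qquad f_t = \frac{\mathrm{d}\log \alpha_t}{\mathrm{d}t}, \quad g_t^2 = \frac{\mathrm{d}\sigma_t^2}{\mathrm{d}t} - 2\,\frac{\mathrm{d}\log \alpha_t}{\mathrm{d}t}\,\sigma_t^2 \\
\frac{\mathrm{d}\mathbf{x}_t}{\mathrm{d}t} &= f_t \mathbf{x}_t - \frac{g_t^2}{2}\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}_t)
\end{aligned}
$$

Here $\mathbf{w}_t$ is a standard Wiener process and $p_t$ is the marginal distribution of $\mathbf{x}_t$. In practice the score $\nabla_{\mathbf{x}} \log p_t$ is unknown and is approximated by a trained network.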
Generally speaking, there are infinitely many possible paths for reversing the SDE. However, the ODE trajectory, being free of stochasticity, is more stable for sampling. Most schedulers used in the Stable Diffusion community, including DDIM, DPM-Solver, Euler, and Heun, are based on the principle of better approximating the ODE trajectory. Most distillation-based methods, including rectified flow and guided distillation, can also be seen as better approximating the ODE trajectory with larger steps (though most of them do not discuss this connection).
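As an illustration of how a scheduler follows the ODE trajectory without injecting noise, here is a minimal sketch of one deterministic DDIM update (eta = 0). The function name `ddim_step` and the toy values are hypothetical and are not taken from this repository. The noise prediction `eps_pred` would normally come from the diffusion network; in the toy check it is known exactly because the data is a single point.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_t, sigma_t, alpha_s, sigma_s):
    """One deterministic DDIM update from noise level t to noise level s.

    Assumes the parameterization x_t = alpha_t * x_0 + sigma_t * eps.
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - sigma_t * eps_pred) / alpha_t
    # Re-noise the estimate to level s with the same predicted noise (no fresh noise).
    return alpha_s * x0_pred + sigma_s * eps_pred

# Toy check: with point-mass data at x0, the true noise is known exactly,
# so a single step lands on the same trajectory at the lower noise level.
x0 = np.array([1.5, -0.5])
eps = np.array([0.3, -1.2])
alpha_t, sigma_t = 0.6, 0.8   # noisier level t
alpha_s, sigma_s = 0.8, 0.6   # cleaner level s
x_t = alpha_t * x0 + sigma_t * eps
x_s = ddim_step(x_t, eps, alpha_t, sigma_t, alpha_s, sigma_s)
```

With a learned network the prediction is only approximate, so larger steps accumulate discretization error, which is the gap that higher-order solvers and distillation methods try to close.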
Consistency models aim to directly learn the solution point of the ODE trajectory, either through distillation or through training from scratch.
In PCMs, we focus on distillation, which is generally easier to learn. We leave training from scratch for future research.
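To make "the solution point of the ODE trajectory" concrete, the following toy sketch uses a 1D Gaussian data distribution, for which the PF ODE can be solved in closed form. Everything here is an illustrative assumption (the variance-exploding noise model, the constants `S`, `SIGMA_MIN`, `SIGMA_MAX`, and all function names); it is not the training code of this repository. It checks the self-consistency property: every point on one ODE trajectory maps to the same solution point.

```python
import numpy as np

S = 2.0            # std of the toy Gaussian data distribution (hypothetical)
SIGMA_MIN = 0.01   # noise level treated as the end of the trajectory
SIGMA_MAX = 10.0   # noise level where sampling starts

def pf_ode_drift(x, sigma):
    """dx/dsigma of the PF ODE when x_sigma ~ N(0, S^2 + sigma^2)."""
    score = -x / (S**2 + sigma**2)   # exact score of the Gaussian marginal
    return -sigma * score

def consistency_fn(x, sigma):
    """Closed-form map from (x, sigma) to the trajectory's point at SIGMA_MIN."""
    return x * np.sqrt((S**2 + SIGMA_MIN**2) / (S**2 + sigma**2))

def euler_solve(x, sigma_start, sigma_end, n_steps=20000):
    """Integrate the PF ODE from sigma_start to sigma_end with Euler steps."""
    sigmas = np.linspace(sigma_start, sigma_end, n_steps + 1)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * pf_ode_drift(x, s_cur)
    return x

x_T = 3.0                                      # a sample at the highest noise level
x_mid = euler_solve(x_T, SIGMA_MAX, 5.0)       # a later point on the same trajectory
target = consistency_fn(x_T, SIGMA_MAX)        # solution point, one function call
solved = euler_solve(x_T, SIGMA_MAX, SIGMA_MIN)  # solution point, many solver steps
```

A consistency model replaces the closed-form `consistency_fn`, which is unavailable for real data, with a network trained so that points on the same trajectory give the same output.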
Consistency Trajectory Model (CTM) points out that CMs suffer from stochasticity error accumulation when multistep sampling is used to improve sample quality, and proposes a more general framework that allows moving between arbitrary pairs of points along the ODE trajectory. Yet it requires an additional target-timestep embedding, which is not aligned with the design space of traditional diffusion models. Additionally, CTM is harder to train. Say we discretize the ODE trajectory into
The core idea of our method is phasing the whole ODE trajectory into multiple sub-trajectories. The following figure illustrates the learning paradigm difference among diffusion models (DMs), consistency models (CMs), consistency trajectory models (CTMs), and our proposed phased consistency models (PCMs).
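The phasing idea can be sketched as a small helper that splits the timestep range into sub-trajectories and returns, for any timestep, the boundary that its sub-trajectory should map to. This is a minimal sketch under stated assumptions: evenly spaced boundaries, the convention that a timestep lying exactly on a boundary belongs to the phase below it, and the function names `phase_edges` and `phase_target_timestep` are all hypothetical. The actual training scripts may choose boundaries differently.

```python
import numpy as np

def phase_edges(num_train_timesteps, num_phases):
    """Evenly spaced phase boundaries over [0, num_train_timesteps]."""
    edges = np.linspace(0, num_train_timesteps, num_phases + 1)
    return edges.round().astype(int)

def phase_target_timestep(t, edges):
    """Lower boundary of the phase containing timestep t.

    Within one phase, every point on the ODE sub-trajectory is trained to
    map to this boundary, instead of all the way to timestep 0.
    """
    t = np.asarray(t)
    # Index of the phase whose half-open interval (lower, upper] contains t.
    idx = np.searchsorted(edges, t, side="left") - 1
    idx = np.clip(idx, 0, len(edges) - 2)   # t = 0 falls into the first phase
    return edges[idx]

edges = phase_edges(1000, 4)                  # boundaries at 0, 250, 500, 750, 1000
target = phase_target_timestep(600, edges)    # timestep 600 maps to boundary 500
```

With `num_phases=1` every timestep maps to 0, which corresponds to the single-trajectory setup of an ordinary consistency model. With more phases, multistep sampling can move deterministically from boundary to boundary.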
For a better comparison, we also implement a baseline, which we term SimpleCTM. We adapt the high-level idea of CTM from the k-diffusion framework to the DDPM framework with Stable Diffusion and compare its performance. When trained with the same resources, our method achieves significantly better performance.
PCM can achieve text-conditioned image synthesis with good quality in 1, 2, 4, 8, 16 steps.
PCM achieves better generation results than current openly available fast generation models, including the GAN-based methods SDXL-Turbo, SD-Turbo, and SDXL-Lightning; the rectified-flow-based method InstaFlow; and the CM-based methods LCM and SimpleCTM.
If you have any questions about the code, please do not hesitate to contact me!
Email: [email protected]
@article{wang2024phased,
title={Phased Consistency Model},
author={Wang, Fu-Yun and Huang, Zhaoyang and Bergman, Alexander William and Shen, Dazhong and Gao, Peng and Lingelbach, Michael and Sun, Keqiang and Bian, Weikang and Song, Guanglu and Liu, Yu and others},
journal={arXiv preprint arXiv:2405.18407},
year={2024}
}
@article{wang2024animatelcm,
title={AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning},
author={Wang, Fu-Yun and Huang, Zhaoyang and Shi, Xiaoyu and Bian, Weikang and Song, Guanglu and Liu, Yu and Li, Hongsheng},
journal={arXiv preprint arXiv:2402.00769},
year={2024}
}
@article{wang2024rectified,
title={Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow},
author={Wang, Fu-Yun and Yang, Ling and Huang, Zhaoyang and Wang, Mengdi and Li, Hongsheng},
journal={arXiv preprint arXiv:2410.07303},
year={2024}
}
@article{wang2024stable,
title={Stable Consistency Tuning: Understanding and Improving Consistency Models},
author={Wang, Fu-Yun and Geng, Zhengyang and Li, Hongsheng},
journal={arXiv preprint arXiv:2410.18958},
year={2024}
}
@incollection{wang2024animatelcm,
title={AnimateLCM: Computation-Efficient Personalized Style Video Generation without Personalized Video Data},
author={Wang, Fu-Yun and Huang, Zhaoyang and Bian, Weikang and Shi, Xiaoyu and Sun, Keqiang and Song, Guanglu and Liu, Yu and Li, Hongsheng},
booktitle={SIGGRAPH Asia 2024 Technical Communications},
pages={1--5},
year={2024}
}