Commit 7740afa

Add FAQ and citation

hkchengrex committed Dec 22, 2024
1 parent 8624361 commit 7740afa

Showing 1 changed file (README.md) with 40 additions and 7 deletions.
We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
<!-- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`) -->

**1. Install prerequisites if not yet met:**

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
```

(Or any other CUDA version that your GPU/driver supports.)
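
As an optional sanity check, you can confirm that the installed PyTorch build sees your GPU:

```bash
# Prints the installed PyTorch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```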

<!-- ```
In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode).
### Command-line interface

With `demo.py`

```bash
python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
```

The output (audio in `.flac` format and video in `.mp4` format) will be saved in `./output`.
See `demo.py` for more options.
Simply omit the `--video` option for text-to-audio synthesis.
The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
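
For example, a text-to-audio run (keeping the default 8-second duration) simply drops the `--video` argument:

```bash
# Text-to-audio: no --video given, so audio is generated from the prompt alone
python demo.py --duration=8 --prompt "your prompt"
```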


### Gradio interface

Supports video-to-audio and text-to-audio synthesis. Use [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) (e.g., `ssh -L 7860:localhost:7860 server`) if necessary. The default port is `7860`, which you can change in `gradio_demo.py`.

```bash
python gradio_demo.py
```
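
For instance, when the demo runs on a remote machine, a typical session (assuming the default port; `user@server` is a placeholder) looks like this:

```bash
# On your local machine: forward local port 7860 to port 7860 on the server
ssh -L 7860:localhost:7860 user@server

# On the server: launch the demo, then open http://localhost:7860 locally
python gradio_demo.py
```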

### FAQ

1. Video processing
    - Processing higher-resolution videos takes longer due to encoding and decoding, but it does not improve the quality of results.
    - The CLIP encoder resizes input frames to 384×384 pixels.
    - Synchformer resizes the shorter edge to 224 pixels and applies a center crop, focusing only on the central square of each frame.
2. Frame rates
    - The CLIP model operates at 8 FPS, while Synchformer works at 25 FPS.
    - Frame rate conversion happens on-the-fly via the video reader.
    - For input videos with a frame rate below 25 FPS, frames will be duplicated to match the required rate. (A quick way to inspect your video's resolution and frame rate is shown after this FAQ.)
3. Failure cases
    As with most models of this type, failures can occur, and the reasons are not always clear. Below are some known failure modes. If you notice a failure mode or believe there's a bug, feel free to open an issue in the repository.
4. Performance variations
    We have noticed subtle performance variations across different hardware and software environments. Contributing factors include using/not using `torch.compile`, the video reader library/backend, inference precision, batch sizes, random seeds, etc. We (will) provide pre-computed results on standard benchmarks for reference. Results obtained from this codebase should be similar but might not be exactly the same.
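
To check the resolution and frame rate of an input video before processing, you can use `ffprobe` (bundled with ffmpeg). This is an optional convenience, not part of the MMAudio pipeline, and `input.mp4` below is a placeholder:

```bash
# Print the width, height, and frame rate of the first video stream
ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height,r_frame_rate \
  -of default=noprint_wrappers=1 input.mp4
```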

### Known limitations

1. The model sometimes generates unintelligible human speech-like sounds
2. The model sometimes generates background music (without explicit training, it would not be high quality)
3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfire" but not "RPG firing".

We believe all three of these limitations can be addressed with more high-quality training data.

## Training

Work in progress.

## Evaluation

Work in progress.

## Training Datasets

MMAudio was trained on several datasets, including [AudioSet](https://research.google.com/audioset/), [Freesound](https://github.com/LAION-AI/audio-dataset/blob/main/laion-audio-630k/README.md), [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [AudioCaps](https://audiocaps.github.io/), and [WavCaps](https://github.com/XinhaoMei/WavCaps). These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.

## Citation

```bibtex
@inproceedings{cheng2024putting,
  title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
  author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
  booktitle={arXiv},
  year={2024}
}
```

## Acknowledgement

Many thanks to:
- [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model and the VAE architecture
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
