Commit 7740afa

Add FAQ and citation

hkchengrex committed Dec 22, 2024
1 parent 8624361 commit 7740afa

Showing 1 changed file (README.md) with 40 additions and 7 deletions.
We recommend using a [miniforge](https://github.com/conda-forge/miniforge) environment.
<!-- - ffmpeg<7 ([this is required by torchaudio](https://pytorch.org/audio/master/installation.html#optional-dependencies), you can install it in a miniforge environment with `conda install -c conda-forge 'ffmpeg<7'`) -->

**1. Install prerequisites if not yet met:**

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --upgrade
```

(Or any other CUDA version that your GPU/driver supports.)
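
As an optional sanity check, you can confirm that the installed PyTorch build sees your GPU:

```bash
# Prints the installed PyTorch version and whether CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```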

<!-- ```
In our experiments, inference only takes around 6GB of GPU memory (in 16-bit mode).
### Command-line interface

With `demo.py`

```bash
python demo.py --duration=8 --video=<path to video> --prompt "your prompt"
```

The output (audio in `.flac` format and video in `.mp4` format) will be saved in `./output`.
See `demo.py` for more options.
Simply omit the `--video` option for text-to-audio synthesis.
The default output (and training) duration is 8 seconds. Longer or shorter durations can also work, but a large deviation from the training duration may result in lower quality.
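
For example, a text-to-audio run (keeping the default 8-second duration) simply drops the `--video` argument:

```bash
# Text-to-audio: no --video given, so audio is generated from the prompt alone
python demo.py --duration=8 --prompt "your prompt"
```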


### Gradio interface

Supports video-to-audio and text-to-audio synthesis. Use [port forwarding](https://unix.stackexchange.com/questions/115897/whats-ssh-port-forwarding-and-whats-the-difference-between-ssh-local-and-remot) (e.g., `ssh -L 7860:localhost:7860 server`) if necessary. The default port is `7860`, which you can change in `gradio_demo.py`.

```bash
python gradio_demo.py
```
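
For instance, when the demo runs on a remote machine, a typical session (assuming the default port; `user@server` is a placeholder) looks like this:

```bash
# On your local machine: forward local port 7860 to port 7860 on the server
ssh -L 7860:localhost:7860 user@server

# On the server: launch the demo, then open http://localhost:7860 locally
python gradio_demo.py
```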

### FAQ

1. Video processing
    - Processing higher-resolution videos takes longer due to encoding and decoding, but it does not improve the quality of results.
    - The CLIP encoder resizes input frames to 384×384 pixels.
    - Synchformer resizes the shorter edge to 224 pixels and applies a center crop, focusing only on the central square of each frame.
2. Frame rates
    - The CLIP model operates at 8 FPS, while Synchformer works at 25 FPS.
    - Frame rate conversion happens on-the-fly via the video reader.
    - For input videos with a frame rate below 25 FPS, frames will be duplicated to match the required rate. (A quick way to inspect your video's resolution and frame rate is shown after this FAQ.)
3. Failure cases
    As with most models of this type, failures can occur, and the reasons are not always clear. Below are some known failure modes. If you notice a failure mode or believe there's a bug, feel free to open an issue in the repository.
4. Performance variations
    We have noticed subtle performance variations across different hardware and software environments. Contributing factors include using/not using `torch.compile`, the video reader library/backend, inference precision, batch sizes, random seeds, etc. We (will) provide pre-computed results on standard benchmarks for reference. Results obtained from this codebase should be similar but might not be exactly the same.
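
To check the resolution and frame rate of an input video before processing, you can use `ffprobe` (bundled with ffmpeg). This is an optional convenience, not part of the MMAudio pipeline, and `input.mp4` below is a placeholder:

```bash
# Print the width, height, and frame rate of the first video stream
ffprobe -v error -select_streams v:0 \
  -show_entries stream=width,height,r_frame_rate \
  -of default=noprint_wrappers=1 input.mp4
```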

### Known limitations

1. The model sometimes generates unintelligible human speech-like sounds
2. The model sometimes generates background music (without explicit training, it would not be high quality)
3. The model struggles with unfamiliar concepts, e.g., it can generate "gunfire" but not "RPG firing".

We believe all three of these limitations can be addressed with more high-quality training data.

## Training

Work in progress.

## Evaluation

Work in progress.

## Training Datasets

MMAudio was trained on several datasets, including [AudioSet](https://research.google.com/audioset/), [Freesound](https://github.com/LAION-AI/audio-dataset/blob/main/laion-audio-630k/README.md), [VGGSound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/), [AudioCaps](https://audiocaps.github.io/), and [WavCaps](https://github.com/XinhaoMei/WavCaps). These datasets are subject to specific licenses, which can be accessed on their respective websites. We do not guarantee that the pre-trained models are suitable for commercial use. Please use them at your own risk.

## Citation

```bibtex
@inproceedings{cheng2024putting,
  title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
  author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
  booktitle={arXiv},
  year={2024}
}
```

## Acknowledgement

Many thanks to:
- [Make-An-Audio 2](https://github.com/bytedance/Make-An-Audio-2) for the 16kHz BigVGAN pretrained model and the VAE architecture
- [BigVGAN](https://github.com/NVIDIA/BigVGAN)
