- Create an environment with CUDA v11+ and cuDNN v8+ (recent versions are needed for TF); Anaconda is recommended:
$ conda create -n wav2lip -c main python=3.8 cudnn=8.0 cudatoolkit=11.0 -c conda-forge
- Clone the repo
$ git clone https://github.com/Rudrabha/Wav2Lip.git
- Install ffmpeg:
$ sudo apt-get install ffmpeg
- Install requirements:
$ pip install -r requirements_mod.txt
NOTE: These requirements are modified to be compatible with EC2. Also, make sure that opencv-python is installed properly.
- Install torch
$ pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
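A quick sanity check (a minimal snippet, run inside the wav2lip environment) that the pinned build installed correctly and can see the GPU:

```python
# Verify the pinned torch build and GPU visibility.
import torch

print(torch.__version__)          # expect 1.8.0+cu111
print(torch.cuda.is_available())  # should print True on a GPU instance
```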
- Download the s3fd model for face detection (86 MB):
$ wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "Wav2Lip/face_detection/detection/sfd/s3fd.pth"
NOTE: The file is saved to a specific path with the name s3fd.pth, so make sure to run this command from the directory containing Wav2Lip (its parent directory).
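A quick optional check (assuming you are still in the directory above Wav2Lip) that the weights landed where the code expects them:

```python
# Confirm the s3fd weights are in place and roughly the right size (~86 MB).
import os

path = "Wav2Lip/face_detection/detection/sfd/s3fd.pth"
assert os.path.isfile(path), "s3fd.pth not found - rerun the wget above"
print(round(os.path.getsize(path) / 1e6), "MB")
```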
- Inference models: the authors provide four different models to work with:
Model | Description | Link to the model |
---|---|---|
Wav2Lip | Highly accurate lip-sync | wav2lip.pth (416 MB) |
Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | wav2lip_gan.pth (416 MB) |
Expert Discriminator | Weights of the expert discriminator | lipsync_expert.pth (188 MB) |
Visual Quality Discriminator | Weights of the visual disc trained in a GAN setup | visual_quality_disc.pth (162 MB) |
To download any of them (e.g. wav2lip_gan), you can use the link manually; make sure to place the file at ./checkpoints/.
If you prefer to download it using a script, I provided dl.py, which is very easy and simple to use and read:
$ pip install gdrive
$ python3 dl.py
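For illustration only, here is a minimal sketch of what a downloader like dl.py does; the real dl.py uses the gdrive package, and the URL below is a placeholder, not the actual checkpoint link:

```python
# Hypothetical sketch of a checkpoint downloader (the real dl.py differs).
import os
import urllib.request

URL = "https://example.com/wav2lip_gan.pth"  # placeholder; use the real link
DEST = os.path.join("checkpoints", "wav2lip_gan.pth")

os.makedirs("checkpoints", exist_ok=True)  # inference.py expects ./checkpoints/
urllib.request.urlretrieve(URL, DEST)
print("saved to", DEST)
```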
You need one .mp4 video and one audio file to start with. Make sure to pass the appropriate paths for them.
The result is saved (by default) in results/result_voice.mp4. You can specify it as an argument, similar to several other available options. The audio source can be any file supported by FFMPEG containing audio data: *.wav, *.mp3 or even a video file, from which the code will automatically extract the audio.
$ python inference.py --checkpoint_path checkpoints/wav2lip_gan.pth --face "../sample_data/input_vid.mp4" --audio "../sample_data/input_audio.wav"
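If you want to extract the audio yourself first (optional; as noted above, inference.py also accepts a video as --audio and extracts the track for you), a quick ffmpeg call does it; paths match the example above:

```python
# Optional: pre-extract a WAV track from a video with ffmpeg.
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "../sample_data/input_vid.mp4",
    "-vn", "-acodec", "pcm_s16le", "../sample_data/input_audio.wav",
], check=True)
```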
You may use parameters provided by the authors such as --box, --nosmooth, etc. We'll discuss some of them in the following subsection:
- --box: Bypasses the s3fd face detector so you can manually specify the face's location within the video/image. Works better with images, to save time. But the authors forgot to implement it for images, so I did that in the modified version in this repo. The modified version also stores an image with the box drawn on it when the box parameter is used (see the sketch after this list). Try it out.
- --crop: To crop the image and work on a smaller scale.
- --nosmooth: Disables smoothing of face detections over a short temporal window; proved to give better results. Recommended.
- --fps: Doesn't do its intended job, so I implemented it within the code.
- --resize_factor: Since the model is trained on small-scale images and produces at most 512×512 videos, scaling down is required for HD images/videos. This parameter is a number greater than one by which the width and height are divided.
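To sanity-check --box values before a run, here is a small helper sketch (my own, not part of the repo; the image path is a placeholder). Wav2Lip's --box takes the coordinates as top, bottom, left, right:

```python
# Hypothetical helper to preview a --box region (top, bottom, left, right).
import cv2

top, bottom, left, right = 50, 300, 80, 330       # example values for --box
img = cv2.imread("../sample_data/input_img.jpg")  # placeholder test image
cv2.rectangle(img, (left, top), (right, bottom), (0, 255, 0), 2)
cv2.imwrite("box_preview.jpg", img)
```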
RuntimeError: CUDA error: an illegal memory access was encountered
Then..
RuntimeError: Image too big to run face detection on GPU. Please use the --resize_factor argument
- --resize_factor larger than 1 (e.g. 2, 10, 100):
- Doesn't always work; needs a lot of trial and error depending on your video's dimensions.
- If the image is still too big, the same error occurs; if the image is very tiny, another error occurs. A heuristic for picking the factor is sketched after this list.
- Using a bounding box (semi-success)
- Needs to be tuned exactly to the face, otherwise it produces awful results.
- Running on CPU (worked but very slow: ~17 s/it, where one iteration = one image, since there is no batching on CPU)
- Takes up to one hour to detect faces in a 120-second video at 25 fps.
- CPU with fewer fps (5 fps)
- Takes up to 15 mins. Results are ..
- The problem is that the fps argument is practically useless; to adjust the video's fps you have to modify it manually. This version of the code is modified: it will ask you for the fps while running.
- Run on smaller images (works, but bad/pixelated results)
- I guess the limit is 512×512!
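The heuristic mentioned above is sketched here (my own idea, built on the apparent 512×512 limit): pick the smallest --resize_factor that brings the longer side under ~512 px:

```python
# Suggest a --resize_factor from the video's dimensions (assumes a 512 px limit).
import math
import cv2

cap = cv2.VideoCapture("../sample_data/input_vid.mp4")
w = cap.get(cv2.CAP_PROP_FRAME_WIDTH)
h = cap.get(cv2.CAP_PROP_FRAME_HEIGHT)
cap.release()
print("suggested --resize_factor:", max(1, math.ceil(max(w, h) / 512)))
```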
RuntimeError: Error opening 'temp/temp.wav': System error.
Then..
FileNotFoundError: [Errno 2] No such file or directory: 'temp/temp.wav'
- Reinstalling ffmpeg (failed):
$ sudo apt-get install ffmpeg
- Executing within the Wav2Lip directory (success). The problem was that I ran the inference.py file from another directory; it was a relative-path error, not a package one.
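In other words, temp/temp.wav is resolved relative to the working directory, so always run from the repo root. A minimal sketch (same sample paths as the example above; adjust to your layout):

```python
# Run inference.py from inside Wav2Lip so 'temp/temp.wav' resolves correctly.
import os
import subprocess

os.chdir("Wav2Lip")  # adjust to wherever you cloned the repo
subprocess.run([
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
    "--face", "../sample_data/input_vid.mp4",
    "--audio", "../sample_data/input_audio.wav",
], check=True)
```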
"All results are currently limited to (utmost) 480p resolution and will be cropped to max. 20s to minimize compute latency" ~said the author.
- Ask the authors for the HD model; they did not release it.
If one frame doesn't contain a detectable face, the whole process is aborted (a rough pre-check is sketched after this list).
- Try modifying the code and the list-index handling if you want to.
- Use an image instead of a video. (Works)
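To catch this before a long run, here is a rough pre-check (my own idea, using OpenCV's Haar cascade as a cheap stand-in for s3fd, so it is only approximate):

```python
# Rough pre-check: flag frames where no face is detected (Haar, not s3fd).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture("../sample_data/input_vid.mp4")
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if len(cascade.detectMultiScale(gray)) == 0:
        print("no face found in frame", idx)
    idx += 1
cap.release()
```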
PaddleGAN is a very comprehensive API consisting of various generative models, collecting projects from different sources. Wav2Lip is one of these projects; notice that they run the same Wav2Lip pipeline linked above, adding the face-enhancement feature that PaddleGAN provides.
So, in a nutshell: PaddleGAN Wav2Lip = Wav2Lip + face enhancement.
- Follow the same instructions provided above.
- Install PaddlePaddle (see the official installation guide for more):
# CUDA10.1
$ python -m pip install paddlepaddle-gpu==2.1.0.post101 -f https://mirror.baidu.com/pypi/simple
# CPU
$ python -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
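To verify the install, PaddlePaddle ships a built-in self-check:

```python
# Quick sanity check for the PaddlePaddle install (GPU or CPU).
import paddle

print(paddle.__version__)
paddle.utils.run_check()  # reports whether PaddlePaddle works on this machine
```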
- Execute
$ cd applications
$ python tools/wav2lip.py \
--face ../docs/imgs/mona7s.mp4 \
--audio ../docs/imgs/guangquan.m4a \
--outfile pp_guangquan_mona7s.mp4 \
--face_enhancement
mona7s.mp4 and guangquan.m4a already exist in the repo. --face_enhancement will automatically download the enhancement model produced by PaddleGAN (270+ MB).
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
- Upgrade the package (solved):
$ pip install numpy --upgrade
- Run on Google Colab instead (provided with the code).
- Run on CPU: pass the --cpu parameter to the command above; it works but is slow (takes up to 40 mins for a 5-second video).
- Use the Docker image (didn't try it).
- To install cuDNN on conda and check it:
$ conda install -c anaconda cudnn
$ conda list cudnn
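And to confirm from Python that the environment actually sees cuDNN (checked through torch, assuming the setup above):

```python
# Check that PyTorch sees CUDA and cuDNN in this environment.
import torch

print(torch.cuda.is_available())           # True if the GPU is usable
print(torch.backends.cudnn.is_available())
print(torch.backends.cudnn.version())      # e.g. 8005 for cuDNN 8.0.5
```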