This repository contains the code for lip reading using 3D cross audio-visual Convolutional Neural Networks.
Link to our project report: ref
In this small project, we tried to re-engineer [1] using a similar network architecture, but with our own data and different video and audio preprocessing techniques, as described below. Due to the large computational requirements of the audio and visual preprocessing, we trained the model on a dummy dataset, with random placeholders in place of actual intensity values.
- Download either the VidTIMIT or the BBC Lip Reading in the Wild (LRW) dataset and place it in the `./dataset/` folder.
- To extract the lip region (bounding box) using Histogram of Oriented Gradients: `cd Visual_Preprocessing`, then run `python mouth_cropping_in_video.py` to obtain crops of the mouth region from the video (see the first sketch after this list).
- To run the audio preprocessing: `cd Audio_Preperocessing`, then run `MMSESTSA84.m` in MATLAB, which performs the audio preprocessing using the MMSE-STSA method. An alternative audio preprocessing step, energy-based Voice Activity Detection, is also supported and can be run with `python unsupervised_vad.py` (see the second sketch after this list).
- To train the CNN model, run `python train.py` with the appropriate paths to the audio and video files (see the third sketch after this list).
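The following is a minimal sketch of the HOG-based mouth cropping in the visual preprocessing step, assuming dlib (whose frontal face detector is a HOG + linear-SVM detector) and its 68-point landmark model `shape_predictor_68_face_landmarks.dat`; the input path, padding, and output naming are illustrative and may differ from `mouth_cropping_in_video.py`.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()  # HOG + linear SVM face detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture("sample_video.mp4")   # hypothetical input video
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray):
        landmarks = predictor(gray, face)
        # Points 48-67 of the 68-point model outline the mouth region.
        xs = [landmarks.part(i).x for i in range(48, 68)]
        ys = [landmarks.part(i).y for i in range(48, 68)]
        pad = 10                             # small margin around the lips
        top, left = max(0, min(ys) - pad), max(0, min(xs) - pad)
        crop = frame[top:max(ys) + pad, left:max(xs) + pad]
        cv2.imwrite(f"mouth_{frame_idx:05d}.png", crop)
    frame_idx += 1
cap.release()
```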
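For the audio side, here is a minimal sketch of an energy-based voice activity detector in the spirit of `unsupervised_vad.py`; the frame length, hop, threshold, and file name are assumptions, not values taken from the repository.

```python
import numpy as np
from scipy.io import wavfile

def energy_vad(signal, rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Flag each frame as speech (True) or silence (False) by its log energy."""
    frame_len = int(rate * frame_ms / 1000)
    hop_len = int(rate * hop_ms / 1000)
    signal = signal.astype(np.float64)
    signal /= np.max(np.abs(signal)) + 1e-12        # normalize to [-1, 1]
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    decisions = np.zeros(n_frames, dtype=bool)
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len]
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        decisions[i] = energy_db > threshold_db
    return decisions

rate, audio = wavfile.read("sample_audio.wav")      # hypothetical input file
if audio.ndim > 1:
    audio = audio.mean(axis=1)                      # collapse stereo to mono
print(f"{energy_vad(audio, rate).mean():.0%} of frames flagged as speech")
```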
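Finally, to make the "dummy dataset with random placeholders" concrete, here is a minimal sketch of a single 3D convolution over a clip of mouth crops. PyTorch and all layer sizes are assumptions for illustration; this shows only the spatio-temporal tensor shapes involved, not the coupled audio-visual architecture of [1] that `train.py` implements.

```python
import torch
import torch.nn as nn

# A clip of 9 consecutive grayscale mouth crops, 60x100 pixels each:
# shape is (batch, channels, frames, height, width).
clip = torch.randn(1, 1, 9, 60, 100)  # random placeholder, as in our dummy dataset

conv3d = nn.Sequential(
    nn.Conv3d(in_channels=1, out_channels=16, kernel_size=(3, 5, 5)),  # spatio-temporal filter
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool spatially, keep temporal resolution
)

features = conv3d(clip)
print(features.shape)  # torch.Size([1, 16, 7, 28, 48])
```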
- [1] Amirsina Torfi, Seyed Mehdi Iranmanesh, Nasser Nasrabadi, and Jeremy Dawson. "3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition." IEEE Access, vol. 5, 2017.