Original founder: @staplesinLA
Major contributors: @nryant, @mmmaat (many thanks!)
The repository provides tools to reproduce the enhancement results of the speech preprocessing part of our DIHARD18 system[1]. The deep-learning based denoising model is trained on 400 hours of English and Mandarin audio; for full details see [1,2,3]. Currently the tools accept 16 kHz, 16-bit monochannel WAV files. Please convert the audio format in advance.
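Before running the tools, it can help to confirm that a file really is in the expected format. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` is hypothetical and not part of this repository.

```python
import wave


def check_wav_format(path):
    """Return a list of problems; an empty list means the file is
    16 kHz, 16-bit, single-channel PCM WAV as required above."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append("sample rate is %d Hz, expected 16000" % wav.getframerate())
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16 bits
            problems.append("sample width is %d bytes, expected 2" % wav.getsampwidth())
        if wav.getnchannels() != 1:
            problems.append("file has %d channels, expected 1" % wav.getnchannels())
    return problems
```

Calling `check_wav_format("some_file.wav")` on each input and converting any file that reports problems is one way to avoid format errors later on.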
Additionally, this package integrates a voice activity detection (VAD) module based on py-webrtcvad, which provides a Python interface to the WebRTC VAD. The default parameters are tuned on the development set of DIHARD18.
[1] Sun, Lei, et al. "Speaker Diarization with Enhancing Speech for the First DIHARD Challenge." Proc. Interspeech 2018 (2018): 2793-2797.
[2] Gao, Tian, et al. "Densely connected progressive learning for LSTM-based speech enhancement." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[3] Sun, Lei, et al. "Multiple-target deep learning for LSTM-RNN based speech enhancement." 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA). IEEE, 2017.
-
Install all dependencies (note that Python and pip must already be installed on your system):
sudo apt-get install openmpi-bin
pip install numpy scipy librosa
pip install cntk-gpu
pip install webrtcvad
pip install wurlitzer
pip install joblib
Make sure the CNTK engine installed successfully by querying its version:
python -c "import cntk; print(cntk.__version__)"
-
Download the speech enhancement repository:
git clone https://github.com/staplesinLA/denoising_DIHARD18.git
-
Install the pretrained model:
cd denoising_DIHARD18
./install_model.sh
-
Specify parameters in run_eval.sh:
-
For the speech enhancement tool:
WAV_DIR=<path to original wavs>
SE_WAV_DIR=<path to output dir>
USE_GPU=<true|false, if false use CPU, default=true>
GPU_DEVICE_ID=<GPU device id on your machine, default=0>
TRUNCATE_MINUTES=<audio chunk length in minutes, default=10>
We recommend using a GPU for decoding as it's much faster than CPU. If decoding fails with a CUDA Error: out of memory error, reduce the value of TRUNCATE_MINUTES.
-
For the VAD tool:
VAD_DIR=<path to output dir>
HOPLENGTH=<duration in milliseconds of VAD frame size, default=30>
MODE=<WebRTC aggressiveness, default=3>
NJOBS=<number of parallel processes, default=1>
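To illustrate why lowering TRUNCATE_MINUTES reduces memory use, here is a minimal sketch of how a long recording could be cut into chunks of at most that many minutes. This is an assumption about the general idea only; the function `chunk_bounds` is hypothetical and the repository's actual splitting logic may differ.

```python
def chunk_bounds(n_samples, sample_rate=16000, truncate_minutes=10):
    """Yield (start, end) sample indices that cover n_samples in
    consecutive chunks of at most truncate_minutes each."""
    chunk_len = int(truncate_minutes * 60 * sample_rate)
    if chunk_len <= 0:
        raise ValueError("truncate_minutes must be positive")
    for start in range(0, n_samples, chunk_len):
        yield start, min(start + chunk_len, n_samples)


# A 25-minute recording at 16 kHz with the default 10-minute chunks:
# two full chunks followed by one 5-minute remainder.
bounds = list(chunk_bounds(25 * 60 * 16000))
```

Each chunk is processed on its own, so the peak memory needed scales with the chunk length rather than with the full recording length.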
-
-
Execute run_eval.sh:
./run_eval.sh
-
Install docker
-
Install nvidia docker, a plugin to use your GPUs within docker
-
Build the image using the provided Dockerfile:
docker build -t dihard18 .
-
Run the evaluation script within docker with the following commands:
docker run -it --rm --runtime=nvidia -v /abs/path/to/dihard/data:/data dihard18 /bin/bash
# you are now in the docker machine
./run_eval.sh  # before launching the script you can edit it to modify the parameters
-
The option --runtime=nvidia enables the use of GPUs within docker.
-
The option -v /absolute/path/to/dihard/data:/data mounts the folder where the data are stored into Docker in the /data folder. The directory /absolute/path/to/dihard/data must contain a wav/ subdirectory. The results will be stored in the directories wav_pn_enhanced/ and vad/.
-
-
Speech enhancement model
The scripts accept 16 kHz, 16-bit monochannel WAV files; please convert the audio format in advance. The input feature is the log-power spectrum (LPS), which makes it easy to rebuild the waveform. The model has two outputs, an ideal ratio mask ("IRM") and "LPS", and only the "IRM" output is used in the final system: it is applied directly as a mask to the original speech. Compared with the "LPS" output, this yields better speech intelligibility and fewer distortions.
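As a rough illustration of the two quantities mentioned above, the sketch below computes an LPS feature for one frame and applies a ratio mask to the noisy magnitudes. It is a toy example on plain lists, not the repository's code: the function names, the natural-log convention, the flooring constant, and the choice to mask magnitudes (rather than power) are all assumptions.

```python
import math


def log_power_spectrum(magnitudes, floor=1e-10):
    """LPS of one frame: natural log of the squared magnitude per
    frequency bin, floored to avoid log(0)."""
    return [math.log(max(m * m, floor)) for m in magnitudes]


def apply_ratio_mask(magnitudes, mask):
    """Scale each noisy magnitude bin by its mask value, clipped to [0, 1]."""
    return [m * min(max(g, 0.0), 1.0) for m, g in zip(magnitudes, mask)]


noisy = [0.5, 2.0, 1.0, 0.1]   # noisy magnitude spectrum of one frame
mask = [0.9, 0.2, 1.0, 0.0]    # hypothetical mask predicted by a model
features = log_power_spectrum(noisy)
enhanced = apply_ratio_mask(noisy, mask)
```

Because the mask only rescales the original spectrum, bins the model considers clean (mask near 1) pass through unchanged, which is one intuition for why masking tends to introduce fewer distortions than regenerating the spectrum from a predicted LPS.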
-
VAD module
The optional parameters of WebRTC VAD are aggressiveness mode (default=3) and hop length (default=30 ms). The default settings are tuned on the development set of the First DIHARD challenge. For the development set, here is the comparison between original speech and processed speech in terms of VAD metrics:
VAD (default)   Original_Dev   Processed_Dev
Miss            11.85          7.21
FA              6.12           6.17
Total           17.97          13.38

And the performance on the evaluation set:

VAD (default)   Original_Eval   Processed_Eval
Miss            17.49           8.89
FA              6.36            6.4
Total           23.85           15.29
-
Effectiveness
The contribution of any single sub-module to the final speaker diarization performance is too small to analyze in isolation. However, the results above clearly show that enhancement-based pre-processing benefits at least VAD performance. Users can also tune the default VAD parameters to obtain the desired trade-off between Miss and False Alarm (FA) rates.
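To make the role of the hop length concrete, here is a minimal sketch of frame-based VAD post-processing: the audio is cut into fixed-length frames, each frame gets a speech/non-speech decision, and consecutive speech frames are merged into segments. The helper names `split_frames` and `speech_segments` are hypothetical and this is not the repository's implementation.

```python
def split_frames(pcm, sample_rate=16000, hop_ms=30):
    """Split 16-bit mono PCM bytes into consecutive frames of hop_ms
    milliseconds; a trailing partial frame is dropped."""
    frame_bytes = sample_rate * hop_ms // 1000 * 2  # 2 bytes per sample
    return [pcm[i:i + frame_bytes]
            for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]


def speech_segments(decisions, hop_ms=30):
    """Merge per-frame booleans into (onset, offset) pairs in seconds."""
    segments = []
    start = None
    for i, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * hop_ms / 1000.0, i * hop_ms / 1000.0))
            start = None
    if start is not None:  # speech runs until the end of the audio
        segments.append((start * hop_ms / 1000.0,
                         len(decisions) * hop_ms / 1000.0))
    return segments


# With py-webrtcvad installed, per-frame decisions could be obtained as:
#   vad = webrtcvad.Vad(3)   # aggressiveness mode 0-3
#   decisions = [vad.is_speech(f, 16000) for f in split_frames(pcm)]
segments = speech_segments([False, True, True, False, True])
```

In general, a higher aggressiveness mode makes the WebRTC VAD more likely to label a frame as non-speech, which tends to lower False Alarm at the cost of more Miss, so the two parameters can be swept together on a development set.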