ClearerVoice-Studio: Train Speech Enhancement Models

1. Introduction

This repository provides training scripts for speech enhancement models. It currently supports training from scratch or fine-tuning for the following models:

| Model Name | Sampling Rate | Paper Link |
|------------|---------------|------------|
| FRCRN_SE_16K | 16000 | FRCRN (Paper, ICASSP 2022) |
| MossFormerGAN_SE_16K | 16000 | MossFormer2 Backbone + GAN (Paper, ICASSP 2024) |
| MossFormer2_SE_48K | 48000 | MossFormer2 Backbone + Masking (Paper, ICASSP 2024) |
1.1 FRCRN_SE_16K

FRCRN uses a complex-valued network for single-channel speech enhancement. It is a general method for enhancing speech in a variety of noise environments. Our trained FRCRN model achieved strong results in the IEEE ICASSP 2022 DNS Challenge. Please check our paper.

The FRCRN model is built on a Convolutional Recurrent Encoder-Decoder (CRED) framework, which extends the Convolutional Encoder-Decoder (CED) architecture. CRED significantly improves the performance of the convolution kernels by widening the limited receptive fields of CED's convolutions with frequency recurrent layers. In addition, we introduce the Complex Feedforward Sequential Memory Network (CFSMN) to reduce the complexity of the recurrent network, and apply complex-valued network operations throughout to realize a fully complex deep model, which not only models long speech sequences more effectively but also enhances the magnitude and phase of speech simultaneously.
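The complex-valued operations mentioned above can be illustrated with a minimal complex linear layer built from two real weight matrices. This is a sketch of the general technique, not the repository's actual implementation (the real model uses complex convolutions and CFSMN layers):

```python
import numpy as np

def complex_linear(x, w_real, w_imag):
    """Complex-valued linear layer built from two real weight matrices.

    A complex weight W = w_real + j*w_imag applied to a complex input
    x = xr + j*xi expands into four real matrix products:
        real part: xr @ w_real.T - xi @ w_imag.T
        imag part: xr @ w_imag.T + xi @ w_real.T
    """
    xr, xi = x.real, x.imag
    real = xr @ w_real.T - xi @ w_imag.T
    imag = xr @ w_imag.T + xi @ w_real.T
    return real + 1j * imag
```

Frameworks without native complex autograd support typically implement complex layers exactly this way, as paired real-valued operations.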

(Figure: FRCRN model architecture)

1.2 MossFormerGAN_SE_16K

MossFormerGAN is motivated by CMGAN and TF-GridNet. We use an extended MossFormer2 backbone (see the figure below) to replace the Conformer in CMGAN, and add the full-band self-attention module proposed in TF-GridNet. The whole speech enhancement network is optimized with the adversarial training scheme described in CMGAN, except that we extend the discriminator from a CNN to an attention-based network. MossFormerGAN is trained for 16 kHz speech enhancement.
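The adversarial scheme used in CMGAN-style training is a metric GAN: the discriminator learns to predict a perceptual quality score rather than a real/fake label. A minimal sketch of the discriminator loss shape (the scalar outputs and normalized metric, e.g. PESQ mapped to [0, 1], are stand-ins; the actual discriminator here is an attention-based network):

```python
def metric_disc_loss(d_clean, d_enhanced, metric_score):
    """Discriminator target in a metric-GAN scheme (as in CMGAN).

    The discriminator should output 1.0 for the (clean, clean) spectrogram
    pair and the normalized metric score for the (clean, enhanced) pair.
    d_clean / d_enhanced are its scalar outputs for those two pairs; this
    illustrates only the loss shape, not the backbone.
    """
    return (d_clean - 1.0) ** 2 + (d_enhanced - metric_score) ** 2
```

When the discriminator approximates the metric well, the generator can be trained against the discriminator's output, effectively optimizing the (non-differentiable) perceptual metric.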

(Figure: extended MossFormer2 block)

1.3 MossFormer2_SE_48K

MossFormer2_SE_48K is a full-band (48 kHz) speech enhancement model. Full-band 48 kHz speech enhancement is becoming increasingly important due to advances in communication platforms and high-quality media consumption. Several open-source GitHub repositories, such as FullSubNet, DeepFilterNet, and resemble-enhance, have released pre-trained models. We provide a more competitive MossFormer2_SE_48K model in ClearVoice, along with the training and fine-tuning scripts here.

MossFormer2_SE_48K uses the following model architecture. It takes noisy fbank features as input and predicts a Phase-Sensitive Mask (PSM). The predicted mask is applied to the noisy STFT spectrogram, and the estimated spectrogram is converted back to a waveform by the inverse STFT. The main component is the MossFormer2 block, which consists of a MossFormer module and a recurrent module. The number of MossFormer2 blocks can be adjusted to deepen the network; MossFormer2_SE_48K uses 24 blocks.
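The STFT → mask → inverse-STFT pipeline above can be sketched as follows, with the trained MossFormer2 mask predictor replaced by an arbitrary callable. This is an illustration of the masking approach, not the repository's code; the FFT size and hop length are placeholder values:

```python
import numpy as np
from scipy.signal import stft, istft

def enhance_with_mask(noisy, mask_fn, sr=48000, n_fft=1024, hop=256):
    """Mask-based enhancement sketch: STFT -> predict mask -> apply -> iSTFT.

    `mask_fn` stands in for the trained mask predictor: any callable mapping
    the noisy magnitude spectrogram to a mask of the same shape (e.g. a PSM).
    """
    _, _, spec = stft(noisy, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mask = mask_fn(np.abs(spec))   # predicted mask from magnitude features
    _, enhanced = istft(mask * spec, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return enhanced
```

With an all-ones mask the pipeline reduces to an STFT round trip, which is a useful sanity check that the analysis/synthesis parameters satisfy perfect reconstruction.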

(Figure: MossFormer2_SE_48K model architecture)

We provide performance comparisons of our released models against publicly available models on the ClearVoice page.

2. Usage

Step-by-Step Guide

If you haven't created a Conda environment for ClearerVoice-Studio yet, follow steps 1 and 2. Otherwise, skip directly to step 3.

1. Clone the Repository

```bash
git clone https://github.com/modelscope/ClearerVoice-Studio.git
```
2. Create Conda Environment

```bash
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
```
3. Prepare Dataset

If you don't have a training dataset to start with, we recommend downloading the VoiceBank-DEMAND dataset (link). You may store the dataset anywhere. To start model training, you need to create two .scp files as shown in data/tr_demand_28_spks_16k.scp and data/cv_demand_testset_16k.scp: the former contains the training data list, the latter the testing data list.

Replace data/tr_demand_28_spks_16k.scp and data/cv_demand_testset_16k.scp with your new .scp files in config/train/*.yaml. The models are then ready to train.
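A script like the following can generate such a file list. It assumes the .scp format is simply one audio path per line; check the provided data/*.scp files for the exact layout your config expects:

```python
import os

def write_scp(wav_dir, scp_path):
    """Walk wav_dir, collect .wav files, and write one path per line.

    Assumes the .scp format is a plain list of audio file paths; verify
    against data/tr_demand_28_spks_16k.scp before training.
    """
    wav_files = sorted(
        os.path.join(root, name)
        for root, _, names in os.walk(wav_dir)
        for name in names
        if name.endswith(".wav")
    )
    with open(scp_path, "w") as fh:
        fh.writelines(p + "\n" for p in wav_files)
    return len(wav_files)
```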

4. Start Training

```bash
bash train.sh
```

You may need to set the correct network in train.sh and choose either fresh training or fine-tuning via:

```bash
network=MossFormer2_SE_48K    # Train the MossFormer2_SE_48K model
train_from_last_checkpoint=1  # Set to 1 to resume from the last checkpoint, if one exists
init_checkpoint_path=./       # Path to your initial model when fine-tuning; otherwise set to 'None'
```