Tacotron + HiFiGAN vocoder for Vietnamese datasets.
A synthesized audio clip: clip.wav. A Colab notebook: notebook.
git clone https://github.com/NTT123/vietTTS.git
cd vietTTS
pip3 install -e .
bash ./scripts/quick_start.sh
bash ./scripts/download_aligned_infore_dataset.sh
Note: this is a denoised and aligned version of the original dataset, which was donated by the InfoRe Technology company (see here). The original dataset (InfoRe Technology 1) can be downloaded here.
Train the duration model, then the acoustic model:
python3 -m vietTTS.nat.duration_trainer
python3 -m vietTTS.nat.acoustic_trainer
We use the original implementation from the HiFiGAN authors at https://github.com/jik876/hifi-gan. Use the config file at assets/hifigan/config.json to train your model.
git clone https://github.com/jik876/hifi-gan.git
# create dataset in hifi-gan format
ln -sf "$(pwd)/train_data" hifi-gan/data
cd hifi-gan/data
ls -1 *.TextGrid | sed -e 's/\.TextGrid$//' > files.txt
cd ..
head -n 100 data/files.txt > val_files.txt
tail -n +101 data/files.txt > train_files.txt
rm data/files.txt
# training
python3 train.py \
--config ../assets/hifigan/config.json \
--input_wavs_dir=data \
--input_training_file=train_files.txt \
--input_validation_file=val_files.txt
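Before starting a long training run, it can help to confirm that every name in the file lists has a matching audio file. The sketch below assumes that hifi-gan resolves each list entry to `<input_wavs_dir>/<name>.wav`, taking the part before the first `|` as the name, and that `train_data` holds one `.wav` per `.TextGrid`. `check_filelist` is a hypothetical helper, not part of vietTTS or hifi-gan.

```shell
# Report entries in a file list that have no matching .wav file.
# Usage: check_filelist <wav_dir> <file_list>
# Returns 0 if every entry has a .wav file, non-zero otherwise.
check_filelist() {
  wav_dir=$1
  file_list=$2
  missing=0
  while IFS= read -r line || [ -n "$line" ]; do
    name=${line%%|*}            # keep the part before the first '|'
    [ -z "$name" ] && continue  # skip empty lines
    if [ ! -f "$wav_dir/$name.wav" ]; then
      echo "missing: $wav_dir/$name.wav"
      missing=$((missing + 1))
    fi
  done < "$file_list"
  echo "$missing missing"
  [ "$missing" -eq 0 ]
}

# Example (run inside the hifi-gan directory):
# check_filelist data train_files.txt && check_filelist data val_files.txt
```

If the check reports missing files, the likely cause is a `.TextGrid` without a corresponding recording in `train_data`.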
Fine-tune on ground-truth aligned (GTA) mel-spectrograms:
cd /path/to/vietTTS # go to vietTTS directory
python3 -m vietTTS.nat.zero_silence_segments -o train_data # zero all [sil, sp, spn] segments
python3 -m vietTTS.nat.gta -o /path/to/hifi-gan/ft_dataset # create gta melspectrograms at hifi-gan/ft_dataset directory
# turn on finetune
cd /path/to/hifi-gan
python3 train.py \
--fine_tuning True \
--config ../assets/hifigan/config.json \
--input_wavs_dir=data \
--input_training_file=train_files.txt \
--input_validation_file=val_files.txt
Then, use the following command to convert the PyTorch model to Haiku format:
cd ..
python3 -m vietTTS.hifigan.convert_torch_model_to_haiku \
--config-file=assets/hifigan/config.json \
--checkpoint-file=hifi-gan/cp_hifigan/g_[latest_checkpoint]
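The `g_[latest_checkpoint]` placeholder stands for the newest generator checkpoint. Here is a minimal sketch for picking it automatically, assuming checkpoints are saved as `g_` followed by a zero-padded step count (for example `g_00050000`), so that alphabetical order matches training order. `latest_generator` is a hypothetical helper, not part of either repository.

```shell
# Print the path of the newest generator checkpoint in a directory.
# Assumes zero-padded names such as g_00050000 (an assumption, see above).
# Usage: latest_generator <checkpoint_dir>
latest_generator() {
  ckpt_dir=$1
  latest=
  # Globs expand in sorted order, so the last match has the highest step.
  for f in "$ckpt_dir"/g_*; do
    [ -f "$f" ] || continue   # also skips the unexpanded pattern when nothing matches
    latest=$f
  done
  if [ -z "$latest" ]; then
    echo "no generator checkpoint found in $ckpt_dir" >&2
    return 1
  fi
  echo "$latest"
}

# Example (run from the vietTTS directory):
# python3 -m vietTTS.hifigan.convert_torch_model_to_haiku \
#   --config-file=assets/hifigan/config.json \
#   --checkpoint-file="$(latest_generator hifi-gan/cp_hifigan)"
```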
Synthesize speech from text:
python3 -m vietTTS.synthesizer \
--lexicon-file=train_data/lexicon.txt \
--text="hôm qua em tới trường" \
--output=clip.wav