From 27d0633d6dee498027db9f5f84255b1eee5193d2 Mon Sep 17 00:00:00 2001
From: Yiwen Zhao
Date: Thu, 25 Jul 2024 20:47:09 -0400
Subject: [PATCH 01/13] add aishell3_tts2 recipe

---
 egs2/aishell3/tts2/README.md                   |  163 +++
 egs2/aishell3/tts2/cmd.sh                      |  110 ++
 .../tts2/conf/decode_fastspeech2.yaml          |   10 +
 egs2/aishell3/tts2/conf/decode_teacher.yaml    |   15 +
 egs2/aishell3/tts2/conf/mfcc.conf              |    7 +
 egs2/aishell3/tts2/conf/pbs.conf               |   11 +
 egs2/aishell3/tts2/conf/queue.conf             |   12 +
 egs2/aishell3/tts2/conf/slurm.conf             |   14 +
 .../aishell3/tts2/conf/train_fastspeech2.yaml  |  104 ++
 egs2/aishell3/tts2/conf/train_teacher.yaml     |   81 ++
 egs2/aishell3/tts2/conf/vad.conf               |    4 +
 egs2/aishell3/tts2/db.sh                       |    1 +
 egs2/aishell3/tts2/local/data.sh               |   97 ++
 egs2/aishell3/tts2/local/data_prep.py          |    1 +
 .../aishell3/tts2/local/download_and_untar.sh  |    1 +
 egs2/aishell3/tts2/local/path.sh               |    0
 egs2/aishell3/tts2/local/run_mfa.sh            |   23 +
 egs2/aishell3/tts2/path.sh                     |    1 +
 egs2/aishell3/tts2/pyscripts                   |    1 +
 egs2/aishell3/tts2/run.sh                      |   65 +
 egs2/aishell3/tts2/run_train_teacher.sh        |   61 +
 egs2/aishell3/tts2/scripts                     |    1 +
 egs2/aishell3/tts2/sid                         |    1 +
 egs2/aishell3/tts2/steps                       |    1 +
 egs2/aishell3/tts2/tts.sh                      | 1215 +++++++++++++++++
 egs2/aishell3/tts2/tts2.sh                     |    1 +
 egs2/aishell3/tts2/utils                       |    1 +
 27 files changed, 2002 insertions(+)
 create mode 100644 egs2/aishell3/tts2/README.md
 create mode 100644 egs2/aishell3/tts2/cmd.sh
 create mode 100644 egs2/aishell3/tts2/conf/decode_fastspeech2.yaml
 create mode 100644 egs2/aishell3/tts2/conf/decode_teacher.yaml
 create mode 100644 egs2/aishell3/tts2/conf/mfcc.conf
 create mode 100644 egs2/aishell3/tts2/conf/pbs.conf
 create mode 100644 egs2/aishell3/tts2/conf/queue.conf
 create mode 100644 egs2/aishell3/tts2/conf/slurm.conf
 create mode 100644 egs2/aishell3/tts2/conf/train_fastspeech2.yaml
 create mode 100644 egs2/aishell3/tts2/conf/train_teacher.yaml
 create mode 100644 egs2/aishell3/tts2/conf/vad.conf
 create mode 120000 egs2/aishell3/tts2/db.sh
 create mode 100644 egs2/aishell3/tts2/local/data.sh
 create mode 120000 egs2/aishell3/tts2/local/data_prep.py
 create mode 120000 egs2/aishell3/tts2/local/download_and_untar.sh
 create mode 100644 egs2/aishell3/tts2/local/path.sh
 create mode 100755 egs2/aishell3/tts2/local/run_mfa.sh
 create mode 120000 egs2/aishell3/tts2/path.sh
 create mode 120000 egs2/aishell3/tts2/pyscripts
 create mode 100755 egs2/aishell3/tts2/run.sh
 create mode 100755 egs2/aishell3/tts2/run_train_teacher.sh
 create mode 120000 egs2/aishell3/tts2/scripts
 create mode 120000 egs2/aishell3/tts2/sid
 create mode 120000 egs2/aishell3/tts2/steps
 create mode 100755 egs2/aishell3/tts2/tts.sh
 create mode 120000 egs2/aishell3/tts2/tts2.sh
 create mode 120000 egs2/aishell3/tts2/utils

diff --git a/egs2/aishell3/tts2/README.md b/egs2/aishell3/tts2/README.md
new file mode 100644
index 00000000000..bdb7d4fc8f8
--- /dev/null
+++ b/egs2/aishell3/tts2/README.md
@@ -0,0 +1,163 @@
# AISHELL3 RECIPE

This is the recipe of a Mandarin multi-speaker TTS2 model built on the [aishell3](https://www.openslr.org/93/) corpus.

See the following pages for running on clusters. They can help you set up the environment and get familiar with the ESPnet repository structure.
- [PSC usage tutorial](https://www.wavlab.org/activities/2022/psc-usage/)
- [ESPnet recipe tutorial](https://github.com/espnet/notebook/blob/master/ESPnet2/Course/CMU_SpeechRecognition_Fall2022/recipe_tutorial.ipynb)


## Brief on TTS2

- In terms of features

  ``tts2`` uses discrete acoustic features instead of the continuous features used in ``tts1``. The current TEMPLATE supports training a discrete FastSpeech2 model.
- In terms of data

  ``tts2`` additionally requires duration information, which can be obtained from **Speech-Text Alignment Tools** (a Tacotron teacher model or MFA). According to the FastSpeech2 paper, MFA gives higher-quality alignments.
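  For intuition, the extra duration information is just a per-phone frame count aligned with the acoustic frames. The path and numbers below are purely illustrative assumptions; the actual file name and location depend on the aligner and on the stage that dumps durations.

  ```bash
  # Illustrative only: inspect one duration entry after alignment has been run.
  # Both the path and the values are hypothetical, not this recipe's exact layout.
  head -n 1 dump/raw/train_phn/durations
  # SSB00050353 12 8 9 7 11 6 5 10 ...   <- frames per phone; the sum equals the utterance length in frames
  ```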
## Run the Recipe

🌟 Please note that most of the bash scripts are symbolically linked from the TEMPLATE. They may be updated by later commits for other corpora, so please double-check and customize the parameters before you run.

Here is the basic order for running the scripts, followed by more details (a consolidated command sketch follows section 4).

1. ``./local/run_mfa.sh``
2. ``./run_train_teacher.sh`` (up to stage 8; teacher forcing must be used in decoding)
3. ``./run_train_teacher.sh`` (stage 6 only, to extract energy and pitch)
4. ``./run.sh --stop_stage 8 --s3prl_upstream_name hf_hubert_custom``
5. Train a vocoder with [PWG](https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs) (we use a discrete HiFi-GAN here)
6. ``./run.sh --stage 9``
7. Evaluate the generated wav using the [scripts here](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#evaluation)


### 1. Data Preparation

* Download the aishell-3 dataset (train set & test set).
* Trim silence to improve efficiency and potentially improve the quality of the generated waveform by cutting off noise.
* Get the initial ``{dset}_phn`` dictionary.
* Split 250 samples from the train set to form the dev set.

```
//{dset}/text sample
SSB00050353 深交所副总经理周明指出
//{dset}_phn/text sample
SSB00050353 shen1 jiao1 suo3 fu4 zong3 jing1 li3 zhou1 ming2 zhi3 chu1
```

NOTE: Parameters such as ``fs`` and ``n_fft`` in ``trim_silence.sh`` do not have to match those in ``run.sh``; they only determine the precision of silence trimming, and the outcome with different parameter sets will be roughly the same (a corpus with minimal silence).

### 2. Train the teacher model
Following ``tts1``, we train a Tacotron2 model as the teacher model for FastSpeech2 in ``tts2``.

Setting ``audio_format=wav`` is recommended, as wav files can be processed directly if you want to use x-vectors. You can also use ``flac``, but then take ``egs2/librispeech/asr1/local/data.sh`` as a reference for the ``uttid path-to-utt`` format.

Remember to keep the frame shift the same for the teacher model and the student model; only then can the soft targets generated by the teacher Tacotron2 be aligned with the FastSpeech2 input.

### 3. Extract additional features

Calculate pitch and energy (still following ``tts1``) for FastSpeech2.

### 4. Train discrete FastSpeech2
The datasets include text, durations, speech, discrete speech, pitch, energy, and speaker embeddings (spembs). We use cn_hubert (pretrained on Mandarin) for discrete TTS feature extraction.
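For reference, the preparation and training steps in sections 1-4 map roughly onto the commands below. This is a hedged sketch: only ``--stage``, ``--stop_stage``, and ``--s3prl_upstream_name hf_hubert_custom`` are taken from this README; the exact stage numbers of ``run_train_teacher.sh`` are assumptions and should be checked against the scripts.

```bash
# Hedged consolidation of sections 1-4; verify stage numbers against run.sh /
# run_train_teacher.sh before use.
./local/run_mfa.sh                                               # 1. alignments + *_phn data
./run_train_teacher.sh --stop_stage 8                            # 2. teacher Tacotron2 (decode with teacher forcing)
./run_train_teacher.sh --stage 6 --stop_stage 6                  # 3. re-run stage 6 to extract pitch and energy
./run.sh --stop_stage 8 --s3prl_upstream_name hf_hubert_custom   # 4. discrete FastSpeech2
```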
### 5. Train a vocoder
A vocoder customized for aishell3 discrete features is necessary in order to generate ``wav`` from discrete HuBERT features.

The tts2 vocoder is not exactly a mel-to-wav model, so our goal here is not to train a rule-based vocoder as in ``tts1``, but a dedicated vocoder that maps discrete features to waveforms.

We use the [PWG repo](https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs); here are the detailed steps:

* ``git clone https://github.com/kan-bayashi/ParallelWaveGAN.git``, then ``cd ParallelWaveGAN/egs/aishell3/hubert_voc1``.

* Collect the HuBERT text into a single file, which can be done conveniently using ``vim``:

  ```shell
  vim path/to/train_hubert.txt
  :r path/to/dev_hubert.txt
  :r path/to/test_hubert.txt
  :w path/to/newfile_all.txt
  :q!
  ```
* Modify ``hubert_text`` in ``./run.sh``. Follow the instructions in stage 0 to symlink the data (the ``wav`` format is better supported in kaldiio than ``flac``). Note that aishell3 has unknown speakers, so we do not use sid.

* Modify ``num_embs`` (equal to the number of k-means clusters) and other custom parameters in the config file ``conf/hifigan_hubert_24k.v1.yaml``.

* Start feature extraction and training from stage 1.


### 6. Inference
Run the inference stage of the espnet2 recipe with your trained vocoder; this time the waveform is generated directly.

### 7. Evaluate model performance
Please follow the [scripts here](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#evaluation).


## Other references

**Speech-Text Alignment Tools**

Token durations are predicted using speech-text alignment tools, which can be either a forced aligner or an attention-based autoregressive model (e.g., Tacotron2). Please refer to [Alignment from Tacotron2](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#fastspeech-training) and [Montreal Forced Aligner (MFA)](https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1#new-mfa-aligments-generation) for details.


**MFA**

First, make sure ``mfa`` has been prepared in the environment:
```
cd ../../../tools
make mfa.done
cd -
```

Originally, ``Stage 1`` in ``run.sh`` calls ``local/data.sh``, but here we do not run ``Stage 1``. Instead, we use

```
./local/run_mfa.sh
```

which is an entry point that calls ``scripts/mfa.sh``, which in turn calls ``local/data.sh``. With ``--train false``, this script downloads pretrained G2P and acoustic models; with ``--train true``, it generates the alignments. The generated results are stored in the ``_phn`` lexicon.

For aishell-3, we train a new G2P model on the ``mandarin_china_mfa`` dictionary and generate the lexicon, then train the MFA speech-text alignment model.

If you want to use the durations extracted by MFA, you can continue training with the main script from ``Stage 2``:

```
./run.sh --stage 2 --stop_stage 2 --teacher_dumpdir "data"
```

### Multi-Speakers tts2

In the multi-speaker scenario, adding a speaker id or a speaker embedding helps the model tell speakers apart; this is specified with ``--use_spk_embed`` or ``--use_sid``. Since aishell-3 is not a fixed-speaker corpus, i.e. it contains speakers with unknown ids, we use speaker embeddings here.

**Speaker Embeddings**

ESPnet supports several types of speaker embeddings (Kaldi x-vector, SpeechBrain, espnet_spk). The recently proposed espnet_spk shows state-of-the-art performance on many tasks, so we use it here.
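As a rough illustration of the multi-speaker setting, the sketch below uses only the switches named above; how the espnet_spk backend itself is selected is not shown and is an assumption, so check ``tts2.sh`` in your checkout.

```bash
# Hedged sketch: train with speaker embeddings rather than speaker ids.
# --use_spk_embed / --use_sid are the switches mentioned in this README; the
# option that picks the espnet_spk extractor is assumed to be set inside tts2.sh.
./run.sh --stage 2 --use_spk_embed true --use_sid false
```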
### Discrete Speech Challenge Baseline

| Model | MCD ⬇️ | Log F0 RMSE ⬇️ | CER | UTMOS ⬆️ |
| --- | --- | --- | --- | --- |
| HuBERT-base-layer6 | 11.7626 ± 1.6673 | 0.4608 ± 0.1724 | - | 1.4078 ± 0.1414 |
* CER is currently left unfilled since it requires an additional ASR model.

diff --git a/egs2/aishell3/tts2/cmd.sh b/egs2/aishell3/tts2/cmd.sh
new file mode 100644
index 00000000000..2aae6919fef
--- /dev/null
+++ b/egs2/aishell3/tts2/cmd.sh
@@ -0,0 +1,110 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
#  run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
#   --time