# How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation (CLiC-it 2023)

Instructions to reproduce the paper
["How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation"](http://arxiv.org/abs/2310.15114).

## 📍 Preprocess and Setup

Download all the corpora listed in our paper and preprocess them as explained [here](SPEECHFORMER.md#preprocessing).

## 🏃 Training
The models of the paper were trained with the following scripts.
All the scripts below assume 4 GPUs, each with at least 16 GB of VRAM.
On different hardware, you may need to adjust `--max-tokens` (e.g., lower it if you have less VRAM)
and `--update-freq` so that the product `num_gpus * max_tokens * update_freq` remains the same.
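As an example of this rule, the default setup in the commands below (4 GPUs, `--max-tokens 10000`, `--update-freq 8`) corresponds to an effective batch of 320,000 tokens. A small helper (a sketch, not part of the codebase) can compute the `--update-freq` needed on other hardware:

```python
# The effective batch size (in tokens) should stay constant across setups.
def update_freq_for(num_gpus, max_tokens, target=4 * 10000 * 8):
    # target: effective batch of the default setup (4 GPUs, 10k tokens, freq 8)
    assert target % (num_gpus * max_tokens) == 0, "pick values dividing the target"
    return target // (num_gpus * max_tokens)

print(update_freq_for(num_gpus=2, max_tokens=5000))  # → 32
```

For instance, a single GPU with half the VRAM (`--max-tokens 5000`) would need `--update-freq 64`.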
### Multi-gender Baseline

To train multi-gender models, you first need to edit the YAML config file
generated by the preprocessing script, so as to have:

```
audio_root: $YOUR_AUDIO_ROOT_DIR
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_TGTLANG_SENTENCEPIECE_MODEL
bpe_tokenizer_src:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_ENGLISH_SENTENCEPIECE_MODEL
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
prepend_tgt_lang_tag: True
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - utterance_cmvn
  _train:
  - utterance_cmvn
  - specaugment
vocab_filename: $YOUR_TGTLANG_SENTENCEPIECE_TOKENS_TXT
vocab_filename_src: $YOUR_ENGLISH_SENTENCEPIECE_TOKENS_TXT
```

which we name `config_st_mix_multigender.yaml` hereinafter.
Mind the `prepend_tgt_lang_tag: True`.
|
||
Your SentencePiece models should contain tags for the two genders as the special tokens | ||
`<lang:He>` and `<lang:She>`. In addition, the TSV you have obtained from the preprocessing | ||
of your data must be enriched with a `tgt_lang` column containing either `He` or `She` according to | ||
the gender of the speaker (in the following, we assume the TSV is named `train_st_src_gender_multilang.tsv`. | ||
To know the gender of each speaker, please refer to | ||
[MuST-Speakers](https://mt.fbk.eu/must-speakers/). | ||
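As an illustration, the `tgt_lang` column can be added with a small standard-library script. This is a sketch: the speaker-ID column name (here `speaker`) and the speaker-to-gender mapping are assumptions to adapt to your TSV:

```python
import csv

def add_tgt_lang_column(tsv_in, tsv_out, speaker_gender):
    """Add a `tgt_lang` column (He/She) from a speaker -> gender mapping.

    `speaker_gender` maps speaker IDs (assumed to be in a `speaker` column)
    to "He" or "She".
    """
    with open(tsv_in, newline="") as fi, open(tsv_out, "w", newline="") as fo:
        reader = csv.DictReader(fi, delimiter="\t")
        writer = csv.DictWriter(fo, fieldnames=reader.fieldnames + ["tgt_lang"],
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            # Look up the speaker's gender and store it as the target "language"
            row["tgt_lang"] = speaker_gender[row["speaker"]]
            writer.writerow(row)
```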
Then, train multi-gender models with the following command:

```
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --arch conformer \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
### Finetuned Multi-gender Baseline

To obtain a multi-gender model that is fine-tuned from the base ST one,
add `--allow-extra-tokens --finetune-from-model $BASE_ST_MODEL_CHECKPOINT` to the training command above,
change the learning rate to `5e-4`, and the `--lr-scheduler` to `fixed`.
### Multi-gender Gradient Reversal

First, you need to add the following lines to the YAML config file, so as to obtain `config_st_mix_multigender_with_aux.yaml`:

```
aux_classes:
- He
- She
```

Then, you need to duplicate the `tgt_lang` column in the TSV files,
naming the new column `auxiliary_target`.
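The duplication can be done, for instance, with a short standard-library script (a sketch, assuming well-formed tab-separated files with a header row):

```python
import csv

def duplicate_tgt_lang(tsv_in, tsv_out):
    # Copy the TSV, duplicating `tgt_lang` into a new `auxiliary_target` column.
    with open(tsv_in, newline="") as fi, open(tsv_out, "w", newline="") as fo:
        reader = csv.DictReader(fi, delimiter="\t")
        writer = csv.DictWriter(
            fo, fieldnames=reader.fieldnames + ["auxiliary_target"],
            delimiter="\t")
        writer.writeheader()
        for row in reader:
            row["auxiliary_target"] = row["tgt_lang"]
            writer.writerow(row)
```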
The training can be executed with the following script:

```
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 --keep-last-epochs 10 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender_with_aux.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \
    --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 0.5 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
    --warmup-updates 25000 \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```

To obtain the **weighted** variant, add `--auxiliary-loss-class-weights 0.8 1.4` to the command above.
### Finetuned Multi-gender Gradient Reversal

To fine-tune from a pre-trained multi-gender model, the procedure is the same as above,
but the script is the following:

```
python train.py ${DATA_ROOT} \
    --train-subset train_st_src_gender_multilang \
    --valid-subset dev_with_gender_lang \
    --save-dir ${ST_SAVE_DIR} \
    --num-workers 5 --max-update 50000 \
    --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
    --user-dir examples/speech_to_text \
    --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender.yaml \
    --ignore-prefix-size 1 \
    --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \
    --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 10 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --allow-extra-tokens --allow-partial-loading --finetune-from-model $PATH_TO_PRETRAINED_MULTIGENDER_MODEL \
    --optimizer adam --lr 5e-4 --lr-scheduler fixed \
    --clip-norm 10.0 \
    --seed 1 --update-freq 8 \
    --skip-invalid-size-inputs-valid-test \
    --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```

Similarly, the **weighted** variant is obtained by adding
`--auxiliary-loss-class-weights 0.8 1.4` to the command above.
### Audio Manipulation

To enable the audio manipulation that converts speakers' vocal traits into those of the opposite gender,
edit the `config_st_mix_multigender.yaml` file adding:

```
opposite_pitch:
  gender_tsv: /home/ubuntu/disk2/corpora/MuST-Speakers_v1.1/MuST-Speakers_v1.1.tsv
  sampling_rate: 16000
  p_male: $PROB_MANIP
  p_female: $PROB_MANIP
raw_transforms:
  _train:
  - opposite_pitch
waveform_sample_rate: 16000
is_input_waveform: True
```

where `gender_tsv` must point to your local copy of the
[MuST-Speakers](https://mt.fbk.eu/must-speakers/) TSV, and `$PROB_MANIP`
was set to 0.5 and 0.8 in the experiments reported in the paper.
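For intuition only, a resampling-based pitch shift can be sketched in pure Python as below. This is a toy illustration of the underlying idea, not the `opposite_pitch` transform implemented in the codebase:

```python
def naive_pitch_shift(samples, semitones):
    # Resample-based pitch shift: raising pitch by `semitones` shortens the
    # signal by the same factor (a production transform would use PSOLA or a
    # phase vocoder to preserve duration).
    factor = 2 ** (semitones / 12.0)
    new_len = int(round(len(samples) / factor))
    out = []
    for i in range(new_len):
        # Linear interpolation between neighboring input samples
        pos = i * (len(samples) - 1) / max(new_len - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```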
## 🔍 Evaluation

Evaluation of the system outputs was performed with SacreBLEU v2.0
and the [MuST-SHE Gender Accuracy Script](../examples/speech_to_text/scripts/gender/mustshe_gender_accuracy.py)
v1.1.
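For intuition, a MuST-SHE-style gender accuracy can be sketched as below. This toy function is a simplification, not the official v1.1 script: it assumes annotations that pair each gender-marked word with its correct and its wrong (opposite-gender) form, and computes accuracy over the terms found in the hypothesis in either form:

```python
def gender_accuracy(hypotheses, annotations):
    """Toy MuST-SHE-style gender accuracy.

    `annotations[i]` is a list of (correct_form, wrong_form) pairs for
    hypothesis i; accuracy = correct / (correct + wrong) over found terms.
    """
    correct = wrong = 0
    for hyp, terms in zip(hypotheses, annotations):
        tokens = hyp.lower().split()
        for good, bad in terms:
            if good.lower() in tokens:
                correct += 1
            elif bad.lower() in tokens:
                wrong += 1
    found = correct + wrong
    return correct / found if found else 0.0
```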
## ⭐ Citation

If you use this work, please cite:

```bibtex
@inproceedings{gaido-et-al-multigender,
  title = {{How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation}},
  author = {Gaido, Marco and Fucci, Dennis and Negri, Matteo and Bentivogli, Luisa},
  year = {2023},
  booktitle = {Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)},
  address = {Venice, Italy}
}
```