From 4b7966b0b2cf8a2c260cdd0479632da8fd30b56c Mon Sep 17 00:00:00 2001 From: Marco Gaido Date: Thu, 26 Oct 2023 10:42:38 +0200 Subject: [PATCH] [!145][RELEASE] Gradient-reversal and multi-gender models to control gender (CLiC-it 2023) # Which work do we release? How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation (CLiC-it 2023) # What changes does this release refer to? 12884f97dd1a76ee79218dd8d4b790d0a29b38fe 538639e93c7926a6fd5bf1aa1824bb832e5fa172 --- README.md | 1 + fbk_works/MULTIGENDER_CLIC_2023.md | 196 +++++++++++++++++++++++++++++ 2 files changed, 197 insertions(+) create mode 100644 fbk_works/MULTIGENDER_CLIC_2023.md diff --git a/README.md b/README.md index e3719ae3..fa9c461c 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,7 @@ Dedicated README for each work can be found in the `fbk_works` directory. ### 2023 + - [[CLiC-IT 2023] **How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation**](fbk_works/MULTIGENDER_CLIC_2023.md) - [[EMNLP 2023] **Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection**](fbk_works/SHALLOW_FUSION_GENDER_BIAS.md) - [[WMT 2023] **Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES**](fbk_works/INES_eval.md) - [[ASRU 2023] **No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation**](fbk_works/PITCH_MANIPULATION_ASR.md) diff --git a/fbk_works/MULTIGENDER_CLIC_2023.md b/fbk_works/MULTIGENDER_CLIC_2023.md new file mode 100644 index 00000000..147e87ee --- /dev/null +++ b/fbk_works/MULTIGENDER_CLIC_2023.md @@ -0,0 +1,196 @@ +# How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation (CLiC-it 2023) + +Instructions to reproduce the paper +["How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender 
Translation"](http://arxiv.org/abs/2310.15114).

## 📍 Preprocess and Setup

Download all the corpora listed in our paper and preprocess them as explained [here](SPEECHFORMER.md#preprocessing).

## 🏃 Training
The models in the paper were trained with the following scripts.
All the scripts below assume 4 GPUs with at least 16GB of VRAM each.
On different hardware, you may need to adjust the parameters `--max-tokens` (e.g., lower it if you have less VRAM)
and `--update-freq` so that the product `num_gpus * max_tokens * update_freq` remains the same.

### Multi-gender Baseline

To train multi-gender models, you first need to edit the YAML config file
generated by the preprocessing script, so that it contains:

```
audio_root: $YOUR_AUDIO_ROOT_DIR
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_TGTLANG_SENTENCEPIECE_MODEL
bpe_tokenizer_src:
  bpe: sentencepiece
  sentencepiece_model: $YOUR_ENGLISH_SENTENCEPIECE_MODEL
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
prepend_tgt_lang_tag: True
specaugment:
  freq_mask_F: 27
  freq_mask_N: 1
  time_mask_N: 1
  time_mask_T: 100
  time_mask_p: 1.0
  time_wrap_W: 0
transforms:
  '*':
  - utterance_cmvn
  _train:
  - utterance_cmvn
  - specaugment
vocab_filename: $YOUR_TGTLANG_SENTENCEPIECE_TOKENS_TXT
vocab_filename_src: $YOUR_ENGLISH_SENTENCEPIECE_TOKENS_TXT
```

which we hereinafter name `config_st_mix_multigender.yaml`.
Note the `prepend_tgt_lang_tag: True` setting.

Your SentencePiece models should contain tags for the two genders as the special tokens
`<lang:He>` and `<lang:She>`. In addition, the TSV you have obtained from the preprocessing
of your data must be enriched with a `tgt_lang` column containing either `He` or `She` according to
the gender of the speaker (in the following, we assume the TSV is named `train_st_src_gender_multilang.tsv`).
To know the gender of each speaker, please refer to
[MuST-Speakers](https://mt.fbk.eu/must-speakers/).
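As a minimal sketch of this enrichment step (the `speaker` field name and the `speaker -> gender` mapping are illustrative assumptions: adapt them to your actual TSV schema and to the information you extract from MuST-Speakers):

```python
def add_tgt_lang_column(rows, speaker_gender):
    # Add a tgt_lang value ("He" or "She") to each TSV row, looked up by
    # speaker id. `speaker_gender` would be derived from MuST-Speakers;
    # the "speaker" field name is an assumption about your TSV schema.
    return [{**row, "tgt_lang": speaker_gender[row["speaker"]]} for row in rows]
```

Rows can be read and written back with `csv.DictReader`/`csv.DictWriter` using `delimiter="\t"`.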
Then, train multi-gender models with the following command:

```
python train.py ${DATA_ROOT} \
  --train-subset train_st_src_gender_multilang \
  --valid-subset dev_with_gender_lang \
  --save-dir ${ST_SAVE_DIR} \
  --num-workers 5 --max-update 50000 \
  --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
  --user-dir examples/speech_to_text \
  --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender.yaml \
  --ignore-prefix-size 1 \
  --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --arch conformer \
  --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
  --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
  --warmup-updates 25000 \
  --clip-norm 10.0 \
  --seed 1 --update-freq 8 \
  --skip-invalid-size-inputs-valid-test \
  --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```

### Finetuned Multi-gender Baseline

To obtain a multi-gender model fine-tuned from the base ST one,
add `--allow-extra-tokens --finetune-from-model $BASE_ST_MODEL_CHECKPOINT` to the training command above,
change the learning rate to `5e-4`, and set the `--lr-scheduler` to `fixed`.

### Multi-gender Gradient Reversal

First, add the following lines to the YAML config file, so as to obtain `config_st_mix_multigender_with_aux.yaml`:

```
aux_classes:
  - He
  - She
```

Then, duplicate the `tgt_lang` column in the TSV files,
naming the new column `auxiliary_target`.
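The duplication is mechanical; a minimal sketch over rows already loaded as dicts (reading and writing the TSV itself is omitted):

```python
def add_auxiliary_target(rows):
    # Copy the existing tgt_lang value into a new auxiliary_target column,
    # leaving tgt_lang in place: the former drives the target prefix token,
    # the latter feeds the auxiliary gender classifier.
    return [{**row, "auxiliary_target": row["tgt_lang"]} for row in rows]
```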
+ +The training can be executed with the following script: + +``` +python train.py ${DATA_ROOT} \ + --train-subset train_st_src_gender_multilang \ + --valid-subset dev_with_gender_lang \ + --save-dir ${ST_SAVE_DIR} \ + --num-workers 5 --max-update 50000 --keep-last-epochs 10 \ + --max-tokens 10000 --adam-betas '(0.9, 0.98)' \ + --user-dir examples/speech_to_text \ + --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender_with_aux.yaml \ + --ignore-prefix-size 1 \ + --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \ + --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 0.5 \ + --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \ + --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \ + --warmup-updates 25000 \ + --clip-norm 10.0 \ + --seed 1 --update-freq 8 \ + --skip-invalid-size-inputs-valid-test \ + --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err +``` + +To obtain the **weighted** variant, add `--auxiliary-loss-class-weights 0.8 1.4` to the command above. 
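For intuition on what `--reverted-classifier` and `--reverted-lambda` control, here is a toy, framework-free sketch of a gradient-reversal layer (the actual implementation lives in this repository's fairseq-based code): the forward pass is the identity, while the backward pass flips and scales the gradient flowing from the gender classifier into the encoder, so the encoder is pushed to *discard* speaker-gender cues rather than preserve them.

```python
def grad_reverse_forward(x):
    # Forward pass: identity, the classifier sees the encoder states unchanged.
    return x

def grad_reverse_backward(grad_output, lambda_=0.5):
    # Backward pass: flip the sign and scale by lambda (--reverted-lambda), so
    # minimizing the classifier loss *maximizes* it w.r.t. encoder parameters.
    return [-lambda_ * g for g in grad_output]
```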
### Finetuned Multi-gender Gradient Reversal

To fine-tune from a pre-trained multi-gender model, the procedure is the same as above,
but the training script is the following:

```
python train.py ${DATA_ROOT} \
  --train-subset train_st_src_gender_multilang \
  --valid-subset dev_with_gender_lang \
  --save-dir ${ST_SAVE_DIR} \
  --num-workers 5 --max-update 50000 \
  --max-tokens 10000 --adam-betas '(0.9, 0.98)' \
  --user-dir examples/speech_to_text \
  --task speech_to_text_aux_classification --config-yaml config_st_mix_multigender_with_aux.yaml \
  --ignore-prefix-size 1 \
  --criterion ctc_multi_loss --underlying-criterion cross_entropy_multi_task --label-smoothing 0.1 \
  --arch multitask_conformer --reverted-classifier --auxiliary-loss-weight 0.5 --reverted-lambda 10 \
  --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
  --allow-extra-tokens --allow-partial-loading --finetune-from-model $PATH_TO_PRETRAINED_MULTIGENDER_MODEL \
  --optimizer adam --lr 5e-4 --lr-scheduler fixed \
  --clip-norm 10.0 \
  --seed 1 --update-freq 8 \
  --skip-invalid-size-inputs-valid-test \
  --log-format simple >> ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```

Similarly, the **weighted** variant is obtained by adding
`--auxiliary-loss-class-weights 0.8 1.4` to the command above.

### Audio Manipulation

To enable the audio manipulation that converts speakers' vocal traits into those of the opposite gender,
edit the `config_st_mix_multigender.yaml` file adding:

```
opposite_pitch:
  gender_tsv: $PATH_TO_MUST_SPEAKERS_TSV
  sampling_rate: 16000
  p_male: $PROB_MANIP
  p_female: $PROB_MANIP
raw_transforms:
  _train:
  - opposite_pitch
waveform_sample_rate: 16000
is_input_waveform: True
```

where `$PATH_TO_MUST_SPEAKERS_TSV` points to the [MuST-Speakers](https://mt.fbk.eu/must-speakers/) TSV
(e.g., `MuST-Speakers_v1.1.tsv`) and `$PROB_MANIP` was set to 0.5 and 0.8 in the experiments reported in the paper.
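To clarify the role of `p_male` and `p_female`: each training utterance is independently manipulated with the probability associated with its speaker's gender. A hypothetical sketch of this sampling gate (the pitch conversion itself is performed by the `opposite_pitch` transform, not shown here):

```python
import random

def should_flip_voice(speaker_gender, p_male, p_female, rand=random.random):
    # Return True if this utterance's vocal traits should be converted to the
    # opposite gender. `speaker_gender` is "He" or "She", matching tgt_lang;
    # p_male/p_female correspond to the opposite_pitch config entries.
    p = p_male if speaker_gender == "He" else p_female
    return rand() < p
```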
## 🔍 Evaluation

The evaluation of the system outputs was performed with SacreBLEU v2.0
and the [MuST-SHE Gender Accuracy Script](../examples/speech_to_text/scripts/gender/mustshe_gender_accuracy.py)
v1.1.

## ⭐ Citation

If you use this work, please cite:

```bibtex
@inproceedings{gaido-et-al-multigender,
  title = {{How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation}},
  author = {Gaido, Marco and Fucci, Dennis and Negri, Matteo and Bentivogli, Luisa},
  year = {2023},
  booktitle = {Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)},
  address = {Venice, Italy}
}
```