From 14ad2a597ae77f470f23e4ebf54604a44db27be6 Mon Sep 17 00:00:00 2001 From: dennis Date: Wed, 25 Oct 2023 14:55:02 +0200 Subject: [PATCH] [!147][RELEASE] Shallow Fusion with Gender-Specific LM to Mitigate Gender Bias (EMNLP 2023) # Which work do we release? "Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection" # What changes does this release refer to? 2f05ba7b862045de37640bf794ad96b27199abda 2bb5679e7ea86f8ec4a33d70a9171155deb85e47 ee402de718b39a98ec009b04fc300ab6fd72f507 9bf56b2b73865f50637a93bbaa66d21da4433d43 --- README.md | 3 +- fbk_works/SHALLOW_FUSION_GENDER_BIAS.md | 293 ++++++++++++++++++++++++ 2 files changed, 295 insertions(+), 1 deletion(-) create mode 100644 fbk_works/SHALLOW_FUSION_GENDER_BIAS.md diff --git a/README.md b/README.md index 0f78a4c3..e3719ae3 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,8 @@ Dedicated README for each work can be found in the `fbk_works` directory. ### 2023 - - [[WMT 2023] **Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES**](fbk_works/INES_eval.md) + - [[EMNLP 2023] **Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection**](fbk_works/SHALLOW_FUSION_GENDER_BIAS.md) + - [[WMT 2023] **Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES**](fbk_works/INES_eval.md) - [[ASRU 2023] **No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation**](fbk_works/PITCH_MANIPULATION_ASR.md) - [[TACL 2023] **Direct Speech Translation for Automatic Subtitling**](fbk_works/DIRECT_SUBTITLING.md) - [[INTERSPEECH 2023] **AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation**](fbk_works/ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md) diff --git a/fbk_works/SHALLOW_FUSION_GENDER_BIAS.md b/fbk_works/SHALLOW_FUSION_GENDER_BIAS.md new file mode 100644 index 
00000000..3b6835e6 --- /dev/null +++ b/fbk_works/SHALLOW_FUSION_GENDER_BIAS.md @@ -0,0 +1,293 @@ +# Shallow Fusion with Gender-Specific LM to Mitigate Gender Bias (EMNLP 2023) + +Code and models for the paper: +[**Integrating Language Models into Direct Speech Translation: +An Inference-Time Solution to Control Gender Inflection**](https://arxiv.org/abs/2310.15752v1) +accepted at EMNLP 2023. + + +## Models + +To ensure reproducibility, we release the model checkpoints used in our experiments, +together with the SentencePiece model, the vocabulary files, and the yaml files: +- **Baseline ST Models**: [en-es](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EcDXRCaV4m9MqDZDUziWk7YB2cBwp2Px6NY_eFbRNj1tSA?e=oVqxzY), +[vocab_src_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/ETUbJ1Up0HxAlFHJeibnCDsBQa_jCmGrRoh-RYSJ14nvuQ?e=8siU4y), [spm_src_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EYkFmRy5d-9BvAYyoSjMrDoBufj4_wC-A00X5C3gAP13KQ?e=OchY8h), +[vocab_tgt_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/Ea2pHVkEWoFGqx93rj9TTw0B9LfXcgCTDjDBrcRZNfaZXg?e=qJbajC), [spm_tgt_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ETpA0Ynx_LBBuqlsGkAyXKsBIY8Is37-hSvvkB-cjlKKBA?e=bhdgEM) | +[en-fr](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EbNA-4pHCm5HkxJDabbgCIsB_6GUKny8ucT4W0EvrkQWzw?e=ZQLzcQ), +[vocab_src_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EeOSsAx_KKpOgGQuofC04_8BIPUqC6gJSw4igBvnNrGtCw?e=PH8vwZ), [spm_src_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ESI_HTQ9oAVMrAy69v1XlNgBW9bmNXvK2pCuQNy2bmXXWA?e=T3acyA), +[vocab_tgt_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EYNb1GFe46tAlSP-DZsOGwgBkG1RzkdQjLJsrQiKOtyuRg?e=Y0STTC), [spm_tgt_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EXQUA2DeVNJNpIvoqARrSk4B_kBgF1QngWjCYT5S_5xhfQ?e=3rwbSB) | 
+[en-it](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EW8NLZO0xshPjQZtj1ZNNXUBOBeNgjcQ_bKbc1m837W53w?e=TeNkFt), +[vocab_src_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EelnpJXTtrNMnOKEoIm475wBrb8kCz06rU-FtL8HW5dpLw?e=zYBLpK), [spm_src_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EbGGwMccLH9GgJfpQS3OMuwBMsmen3FpGwhTmfvFSpi3eQ?e=NZA1af), +[vocab_tgt_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EXlf-eZMgVFHop4hSzJPVasBAsNC-o4WXvZaZfsX55SUlw?e=FuthQK), [spm_tgt_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EVv1ftuBOSREs6BcP80iKLsBFIfKq6Or0h2x0Ujg0OQuCA?e=AsIr9L) | +[config_file](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ET3L78C9LCpHsp9rsvjVU3IBJk8dFiXmkMnvWZOKf6w-_A?e=GZUYQl) +- **Specialized ST Models**: en-es: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EQasJ8PW0nlNtgtIHPlCfJ0Bgqgmv0VX8dCn4ZS5Ox_Uxw?e=tOAuDo), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EV3kY0-ufv9Jk7v1WGPkE1cB-8tBCyEZzF1Ruj2yjWJvYg?e=DYGbxQ) | +en-fr: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EdjvXBG1h8NAgRflBWQLhCgBMjo-jKEL1ilhBqsvlvMqsg?e=3GaWgO), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EUiQR0momOdBg2xsNM1k3OYBKXx32pTdzqJKYxu6JhO9kw?e=u1t10A) | +en-it: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EQksjXQCRilCsY9GsqdDk10BH2YAKTZb7wlK5Hh5YaeV2A?e=3GKsu2), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ESYdZ7hFAk9Bj4zF8kw0i7YB7UOk7ias8VsNyDu-41nxrQ?e=C5h59i) +- **Language Models**: es: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ERZAXNUR8XNCsCd6TK2sTCgBSMmuyOWQDcJgjM_eVGZecQ?e=PrwhKz), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EWWLefKIv4ZNo6ZhjnwVqc4B3eDt4dFVkCL6vty4udwjiA?e=y7YWZ7) | +fr: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ERjieMoB4z9HowQIxmQ8e3wBfqsrGmZtZ67WcIc0jADaPg?e=YcJg86), 
[F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EWNkjJvWtTRDr2fErUPQ8XIBFrf4751gLO4XhRbr4tXg8w?e=O309kq) | +it: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EXM6-sPQJ-pEiHT1PbR-la4BmcjAm3qqeyDMpKkFuMYhHg?e=GqH1Tm), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EbxrARzsnO9Lvk7Yr6r-RdIBoIAeFXrZNJbGCESmX0QxVA?e=OeNqFv) + + +## How to run + +With the following instructions, it is possible to retrain our models from scratch, +as well as to use our models for inference. + + +### Data and Preprocessing + +#### ST Data + +To train the ST models, we have used the [MuST-C dataset](https://mt.fbk.eu/must-c/). \ +The data, comprising text files found in `$MUSTC_TEXT_DATA_FOLDER` and audio files in `$MUSTC_WAV_FOLDER`, +can be preprocessed for each target language (`$TGT_LANG`) using the following command. +The results will be saved in `$MUSTC_SAVE_FOLDER`. + +```bash +python ${FBK_FAIRSEQ}/examples/speech_to_text/preprocess_generic.py \ + --data-root $MUSTC_TEXT_DATA_FOLDER \ + --save-dir $MUSTC_SAVE_FOLDER \ + --wav-dir $MUSTC_WAV_FOLDER \ + --split train dev \ + --vocab-type bpe \ + --vocab-size 8000 \ + --src-lang en --tgt-lang $TGT_LANG \ + --task st \ + --n-mel-bins 80 +``` + +For testing the ST models, we have used the [MuST-SHE](https://mt.fbk.eu/must-she/) test set. \ +The text files located in `$MUSTSHE_EXT_DATA_FOLDER` and the audio files in `$MUSTSHE_WAV_FOLDER` +can be preprocessed for each target language (`$TGT_LANG`) using the text vocabularies (`$VOCAB_SRC` and +`$VOCAB_TGT`) obtained in the previous preprocessing step. +The results will be saved in `$MUSTSHE_SAVE_FOLDER`. 
+ +```bash +python ${FBK_FAIRSEQ}/examples/speech_to_text/preprocess_generic.py \ + --data-root $MUSTSHE_EXT_DATA_FOLDER \ + --save-dir $MUSTSHE_SAVE_FOLDER \ + --wav-dir $MUSTSHE_WAV_FOLDER \ + --split MONOLINGUAL.${TGT_LANG}_v1.2 \ + --vocab-type bpe \ + --vocab-file-src $VOCAB_SRC \ + --vocab-file-tgt $VOCAB_TGT \ + --src-lang en --tgt-lang $TGT_LANG \ + --task st \ + --n-mel-bins 80 +``` + +#### Monolingual Text (LM) Data + +To train the LMs, we have used [GenderCrawl](https://mt.fbk.eu/gendercrawl/), a set of text corpora +derived from [ParaCrawl](https://www.paracrawl.eu/) by selecting the sentences that contain +gender-marked words referring to the speaker. \ +These data can be preprocessed using the following command, where `$TRAIN_DATA_TOKENIZED` and +`$DEV_DATA_TOKENIZED` are the text training and validation data tokenized with the SentencePiece model +obtained from the ST preprocessing, `$VOCAB_SRC` is the txt vocabulary obtained from the ST preprocessing, +and `$GENDERCRAWL_SAVE_FOLDER` is the folder where the preprocessed data are stored. + +```bash +fairseq-preprocess \ + --task language_modeling \ + --cpu \ + --only-source \ + --trainpref $TRAIN_DATA_TOKENIZED \ + --validpref $DEV_DATA_TOKENIZED \ + --srcdict $VOCAB_SRC \ + --destdir $GENDERCRAWL_SAVE_FOLDER +``` + +### Training + +#### Base ST Models + +To train the base ST models, we have used the following command +(parameters intended for training on 4 GPUs, each with 40 GB of VRAM). +`$TRAIN_MUSTC` and `$DEV_MUSTC` are the TSV files obtained after preprocessing and located in `$MUSTC_FOLDER`, +and `$CONFIG_ST` is the YAML file that can be downloaded above. +The final checkpoint and log information will be saved in `$ST_BASE_SAVE_DIR`. 
+ +```bash +python ${FBK_FAIRSEQ}/train.py $MUSTC_FOLDER \ + --train-subset $TRAIN_MUSTC \ + --valid-subset $DEV_MUSTC \ + --save-dir $ST_BASE_SAVE_DIR \ + --num-workers 3 \ + --max-update 50000 \ + --max-tokens 40000 --adam-betas '(0.9, 0.98)' \ + --user-dir examples/speech_to_text \ + --task speech_to_text_ctc \ + --config-yaml $CONFIG_ST \ + --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ + --arch conformer \ + --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \ + --optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \ + --warmup-updates 25000 \ + --clip-norm 10.0 \ + --update-freq 2 \ + --skip-invalid-size-inputs-valid-test \ + --log-format simple >> $ST_BASE_SAVE_DIR/train.log 2> $ST_BASE_SAVE_DIR/train.err + +python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $ST_BASE_SAVE_DIR \ + --num-update-checkpoints 7 \ + --checkpoint-upper-bound 50000 \ + --output $ST_BASE_SAVE_DIR/avg7.pt +``` + +#### Specialized ST Models + +To finetune the base ST models and obtain the specialized ST models, we have used the following +command, where `$TRAIN_MUSTC_${GDR}` and `$DEV_MUSTC_${GDR}` are the gender-specific portions of the MuST-C +TSV files obtained with the [MuST-Speakers](https://mt.fbk.eu/must-speakers/) resource. +The final checkpoint and log information will be saved in `$ST_SPECIALIZED_SAVE_DIR`. 
+ +```bash +python ${FBK_FAIRSEQ}/train.py $MUSTC_FOLDER \ + --train-subset $TRAIN_MUSTC_${GDR} \ + --valid-subset $DEV_MUSTC_${GDR} \ + --save-dir $ST_SPECIALIZED_SAVE_DIR \ + --num-workers 3 \ + --max-epoch 7 \ + --max-tokens 40000 --adam-betas '(0.9, 0.98)' \ + --user-dir examples/speech_to_text \ + --task speech_to_text_ctc \ + --finetune-from-model $ST_BASE_SAVE_DIR/avg7.pt \ + --config-yaml $CONFIG_ST \ + --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ + --arch conformer \ + --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \ + --optimizer adam --lr 1e-3 \ + --warmup-updates 25000 \ + --clip-norm 10.0 \ + --update-freq 2 \ + --skip-invalid-size-inputs-valid-test \ + --log-format simple >> $ST_SPECIALIZED_SAVE_DIR/finetuning.log 2> $ST_SPECIALIZED_SAVE_DIR/finetuning.err + +python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $ST_SPECIALIZED_SAVE_DIR \ + --num-epoch-checkpoints 4 \ + --checkpoint-upper-bound 7 \ + --output $ST_SPECIALIZED_SAVE_DIR/avg4.pt +``` + +#### Language Models + +To train the LMs, we have used the following command (parameters intended for training +on 1 GPU with 12 GB of VRAM). +`$TRAIN_GENDERCRAWL` and `$DEV_GENDERCRAWL` are the bin files obtained from the preprocessing and are located in +`$GENDERCRAWL_FOLDER`. The final checkpoint will be saved in `$LM_SAVE_FOLDER`. 
+ +```bash +fairseq-train $GENDERCRAWL_FOLDER \ + --task language_modeling \ + --train-subset $TRAIN_GENDERCRAWL \ + --valid-subset $DEV_GENDERCRAWL \ + --validate-interval 10000 --validate-interval-updates 100 \ + --save-dir $LM_SAVE_FOLDER \ + --arch transformer_lm --share-decoder-input-output-embed \ + --dropout 0.1 \ + --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \ + --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 200 --warmup-init-lr 1e-07 \ + --tokens-per-sample 1024 --sample-break-mode none \ + --max-tokens 16384 --update-freq 8 \ + --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \ + --patience 5 \ + --fp16 \ + --save-interval-updates 100 \ + --keep-interval-updates 10 \ + --no-epoch-checkpoints + + python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $LM_SAVE_FOLDER \ + --num-epoch-checkpoints 5 \ + --checkpoint-upper-bound $(ls $LM_SAVE_FOLDER | head -n 5 | tail -n 1 | grep -o "[0-9]*") \ + --output $LM_SAVE_FOLDER/avg5.pt +``` + + +### Inference + +For the **base** and **specialized ST models**, whose checkpoint is `$CHECKPOINT`, we have used the following +command, where `$MUSTSHE_DATA` is the preprocessed TSV file located in +`$MUSTSHE_SAVE_FOLDER` and `$CONFIG_FILE` is the YAML file provided above. +The output translations will be saved in `$OUTPUT`. + +```bash +python ${FBK_FAIRSEQ}/fairseq_cli/generate.py $MUSTSHE_SAVE_FOLDER \ + --gen-subset $MUSTSHE_DATA \ + --user-dir examples/speech_to_text \ + --max-tokens 20000 \ + --config-yaml $CONFIG_FILE \ + --beam 5 \ + --max-source-positions 10000 \ + --max-target-positions 1000 \ + --task speech_to_text_ctc \ + --criterion ctc_multi_loss \ + --underlying-criterion label_smoothed_cross_entropy \ + --label-smoothing 0.1 \ + --no-repeat-ngram-size 5 \ + --path $CHECKPOINT > $OUTPUT +``` + +For shallow fusion, inference can be executed with the following command. 
Here, +`$AVG_ENCODER` is the averaged encoder output provided above, and `$ILMW` and `$ELMW` are the weights for the +internal LM contribution and the external LM contribution, respectively. +The output translations will be saved in `$OUTPUT_SHALLOW_FUSION`. + +```bash +python ${FBK_FAIRSEQ}/fairseq_cli/generate.py $MUSTSHE_SAVE_FOLDER \ + --gen-subset $TEST_DATA \ + --user-dir examples/speech_to_text \ + --max-tokens 20000 \ + --config-yaml $CONFIG_FILE \ + --beam 5 \ + --max-source-positions 10000 \ + --max-target-positions 1000 \ + --task speech_to_text_ctc_noilm \ + --criterion ctc_multi_loss \ + --underlying-criterion label_smoothed_cross_entropy \ + --label-smoothing 0.1 \ + --no-repeat-ngram-size 5 \ + --path $ST_BASE_SAVE_DIR/avg7.pt \ + --lm-path $LM_SAVE_FOLDER/best.pt \ + --ilm-weight $ILMW \ + --lm-weight $ELMW \ + --encoder-avg-outs $AVG_ENCODER > $OUTPUT_SHALLOW_FUSION +``` + +### Evaluation + +We have used [SacreBLEU](https://github.com/mjpost/sacrebleu) 2.0.0 to compute BLEU +and the official script of MuST-SHE to compute the _gender accuracy_ and _term coverage_ metrics. \ +To run the paired bootstrap resampling, we have used the implementation in SacreBLEU for BLEU scores, +and the following command for the gender accuracy and term coverage scores (`$MUSTSHE_REF` is the text file +containing the reference sentences, `$BASE_OUTPUTS` and `$EXP_OUTPUTS` are the text files containing +the sentences translated by the two systems to be compared, and `$CATEGORIES` are the categories to be evaluated). 
+ +```bash +python ${FBK_FAIRSEQ}/examples/speech_to_text/scripts/gender/paired_bootstrap_resampling_mustshe.py \ + --reference-file $MUSTSHE_REF \ + --baseline-file $BASE_OUTPUTS \ + --experimental-file $EXP_OUTPUTS \ + --categories $CATEGORIES \ + --num-samples 1000 \ + --significance-level 0.05 +``` + + +## Citation + +```bibtex +@inproceedings{fucci-etal-2023-integrating, + title = "Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection", + author = "Fucci, Dennis and + Gaido, Marco and + Papi, Sara and + Cettolo, Mauro and + Negri, Matteo and + Bentivogli, Luisa}, + booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing", + month = dec, + year = "2023", + address = "Singapore", + publisher = "Association for Computational Linguistics", +} +```