Skip to content

Commit

Permalink
[!147][RELEASE] Shallow Fusion with Gender-Specific LM to Mitigate Ge…
Browse files Browse the repository at this point in the history
…nder Bias (EMNLP 2023)

# Which work do we release?
"Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection"

# What changes does this release refer to?
2f05ba7b862045de37640bf794ad96b27199abda 2bb5679e7ea86f8ec4a33d70a9171155deb85e47 ee402de718b39a98ec009b04fc300ab6fd72f507 9bf56b2b73865f50637a93bbaa66d21da4433d43
  • Loading branch information
deanfcc authored and mgaido91 committed Oct 25, 2023
1 parent c4eef68 commit 14ad2a5
Show file tree
Hide file tree
Showing 2 changed files with 295 additions and 1 deletion.
3 changes: 2 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,8 @@ Dedicated README for each work can be found in the `fbk_works` directory.

### 2023

- [[WMT 2023] **Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES**](fbk_works/INES_eval.md)
- [[EMNLP 2023] **Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection**](fbk_works/SHALLOW_FUSION_GENDER_BIAS.md)
- [[WMT 2023] **Test Suites Task: Evaluation of Gender Fairness in MT with MuST-SHE and INES**](fbk_works/INES_eval.md)
- [[ASRU 2023] **No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation**](fbk_works/PITCH_MANIPULATION_ASR.md)
- [[TACL 2023] **Direct Speech Translation for Automatic Subtitling**](fbk_works/DIRECT_SUBTITLING.md)
- [[INTERSPEECH 2023] **AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation**](fbk_works/ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md)
Expand Down
293 changes: 293 additions & 0 deletions fbk_works/SHALLOW_FUSION_GENDER_BIAS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,293 @@
# Shallow Fusion with Gender-Specific LM to Mitigate Gender Bias (EMNLP 2023)

Code and models for the paper:
[**Integrating Language Models into Direct Speech Translation:
An Inference-Time Solution to Control Gender Inflection**](https://arxiv.org/abs/2310.15752v1)
accepted at EMNLP 2023.


## Models

To ensure reproducibility, we release the model checkpoints used in our experiments,
together with the SentencePiece model, the vocabulary files, and the yaml files:
- **Baseline ST Models**: [en-es](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EcDXRCaV4m9MqDZDUziWk7YB2cBwp2Px6NY_eFbRNj1tSA?e=oVqxzY),
[vocab_src_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/ETUbJ1Up0HxAlFHJeibnCDsBQa_jCmGrRoh-RYSJ14nvuQ?e=8siU4y), [spm_src_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EYkFmRy5d-9BvAYyoSjMrDoBufj4_wC-A00X5C3gAP13KQ?e=OchY8h),
[vocab_tgt_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/Ea2pHVkEWoFGqx93rj9TTw0B9LfXcgCTDjDBrcRZNfaZXg?e=qJbajC), [spm_tgt_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ETpA0Ynx_LBBuqlsGkAyXKsBIY8Is37-hSvvkB-cjlKKBA?e=bhdgEM) |
[en-fr](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EbNA-4pHCm5HkxJDabbgCIsB_6GUKny8ucT4W0EvrkQWzw?e=ZQLzcQ),
[vocab_src_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EeOSsAx_KKpOgGQuofC04_8BIPUqC6gJSw4igBvnNrGtCw?e=PH8vwZ), [spm_src_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ESI_HTQ9oAVMrAy69v1XlNgBW9bmNXvK2pCuQNy2bmXXWA?e=T3acyA),
[vocab_tgt_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EYNb1GFe46tAlSP-DZsOGwgBkG1RzkdQjLJsrQiKOtyuRg?e=Y0STTC), [spm_tgt_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EXQUA2DeVNJNpIvoqARrSk4B_kBgF1QngWjCYT5S_5xhfQ?e=3rwbSB) |
[en-it](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EW8NLZO0xshPjQZtj1ZNNXUBOBeNgjcQ_bKbc1m837W53w?e=TeNkFt),
[vocab_src_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EelnpJXTtrNMnOKEoIm475wBrb8kCz06rU-FtL8HW5dpLw?e=zYBLpK), [spm_src_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EbGGwMccLH9GgJfpQS3OMuwBMsmen3FpGwhTmfvFSpi3eQ?e=NZA1af),
[vocab_tgt_txt](https://fbk-my.sharepoint.com/:t:/g/personal/dfucci_fbk_eu/EXlf-eZMgVFHop4hSzJPVasBAsNC-o4WXvZaZfsX55SUlw?e=FuthQK), [spm_tgt_model](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EVv1ftuBOSREs6BcP80iKLsBFIfKq6Or0h2x0Ujg0OQuCA?e=AsIr9L) |
[config_file](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ET3L78C9LCpHsp9rsvjVU3IBJk8dFiXmkMnvWZOKf6w-_A?e=GZUYQl)
- **Specialized ST Models**: en-es: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EQasJ8PW0nlNtgtIHPlCfJ0Bgqgmv0VX8dCn4ZS5Ox_Uxw?e=tOAuDo), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EV3kY0-ufv9Jk7v1WGPkE1cB-8tBCyEZzF1Ruj2yjWJvYg?e=DYGbxQ) |
en-fr: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EdjvXBG1h8NAgRflBWQLhCgBMjo-jKEL1ilhBqsvlvMqsg?e=3GaWgO), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EUiQR0momOdBg2xsNM1k3OYBKXx32pTdzqJKYxu6JhO9kw?e=u1t10A) |
en-it: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EQksjXQCRilCsY9GsqdDk10BH2YAKTZb7wlK5Hh5YaeV2A?e=3GKsu2), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ESYdZ7hFAk9Bj4zF8kw0i7YB7UOk7ias8VsNyDu-41nxrQ?e=C5h59i)
- **Language Models**: es: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ERZAXNUR8XNCsCd6TK2sTCgBSMmuyOWQDcJgjM_eVGZecQ?e=PrwhKz), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EWWLefKIv4ZNo6ZhjnwVqc4B3eDt4dFVkCL6vty4udwjiA?e=y7YWZ7) |
fr: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/ERjieMoB4z9HowQIxmQ8e3wBfqsrGmZtZ67WcIc0jADaPg?e=YcJg86), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EWNkjJvWtTRDr2fErUPQ8XIBFrf4751gLO4XhRbr4tXg8w?e=O309kq) |
it: [M](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EXM6-sPQJ-pEiHT1PbR-la4BmcjAm3qqeyDMpKkFuMYhHg?e=GqH1Tm), [F](https://fbk-my.sharepoint.com/:u:/g/personal/dfucci_fbk_eu/EbxrARzsnO9Lvk7Yr6r-RdIBoIAeFXrZNJbGCESmX0QxVA?e=OeNqFv)


## How to run

With the following instructions, it is possible to reimplement our models from scratch,
as well as to use our models for inference.


### Data and Preprocessing

#### ST Data

To train the ST models, we have used the [MuST-C dataset](https://mt.fbk.eu/must-c/). \
The data, comprising text files found in `$MUSTC_TEXT_DATA_FOLDER` and audio files in `$MUSTC_WAV_FOLDER`,
can be preprocessed for each target language (`$TGT_LANG`) using the following command.
The results will be saved in `$MUSTC_SAVE_FOLDER`.

```bash
python ${FBK_FAIRSEQ}/examples/speech_to_text/preprocess_generic.py \
--data-root $MUSTC_TEXT_DATA_FOLDER \
--save-dir $MUSTC_SAVE_FOLDER \
--wav-dir $MUSTC_WAV_FOLDER \
--split train dev \
--vocab-type bpe \
--vocab-size 8000 \
--src-lang en --tgt-lang $TGT_LANG \
--task st \
--n-mel-bins 80
```

For testing the ST models, we have used the [MuST-SHE](https://mt.fbk.eu/must-she/) test set. \
The text files located in `$MUSTSHE_EXT_DATA_FOLDER` and the audio files in `$MUSTSHE_WAV_FOLDER`
which can be preprocessed for each target language (`$TGT_LANG`) using the text vocabularies (`$VOCAB_SRC` and
`$VOCAB_TGT`) obtained by the previous preprocessing step.
The results will be saved in `$MUSTSHE_SAVE_FOLDER`.

```bash
python ${FBK_FAIRSEQ}/examples/speech_to_text/preprocess_generic.py \
--data-root $MUSTSHE_EXT_DATA_FOLDER \
--save-dir $MUSTSHE_SAVE_FOLDER \
--wav-dir $MUSTSHE_WAV_FOLDER \
--split MONOLINGUAL.${TGT_LANG}_v1.2 \
--vocab-type bpe \
--vocab-file-src /$VOCAB_SRC \
--vocab-file-tgt $VOCAB_TGT \
--src-lang en --tgt-lang $TGT_LANG \
--task st \
--n-mel-bins 80
```

#### Monolingual Text (LM) Data

To train the LMs we have used [GenderCrawl](https://mt.fbk.eu/gendercrawl/), a set of text corpora
derived from [ParaCrawl](https://www.paracrawl.eu/) by selecting the sentences that contain
gender-marked words referring to the speaker. \
These data can be preprocessed using the following command, where `$TRAIN_DATA_TOKENIZED` and
`$DEV_DATA_TOKENIZED` are the text training and validation data tokenized with the SentencePiece model
obtained by the ST preprocessing, `$VOCAB_SRC` is the txt vocabulary obtained used for the ST preprocessing,
and `$GENDERCRAWL_SAVE_FOLDER` is the folder where the preprocessed data are stored.

```bash
fairseq-preprocess \
--task language_modeling \
--cpu \
--only-source \
--trainpref $TRAIN_DATA_TOKENIZED \
--validpref $DEV_DATA_TOKENIZED \
--srcdict $VOCAB_SRC \
--destdir $GENDERCRAWL_SAVE_FOLDER
```

### Training

#### Base ST Models

To train the base ST models we have used the following command
(parameters intended for training on 4 GPUs, each with 40 GB of VRAM).
The `$TRAIN_MUSTC` and `$DEV_MUSTC` files are in TSV format located in `$MUSTC_FOLDER`,
obtained after preprocessing. The `$CONFIG_ST` is a YAML file and can be downloaded above.
Final checkpoint and log information will be saved in `$ST_BASE_SAVE_DIR`.

```bash
python ${FBK_FAIRSEQ}/train.py $MUSTC_FOLDER \
--train-subset $TRAIN_MUSTC \
--valid-subset $DEV_MUSTC \
--save-dir $ST_BASE_SAVE_DIR \
--num-workers 3 \
--max-update 50000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc \
--config-yaml $CONFIG_ST \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--clip-norm 10.0 \
--update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> $ST_BASE_SAVE_DIR/train.log 2> $ST_BASE_SAVE_DIR/train.err

python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $ST_BASE_SAVE_DIR \
--num-update-checkpoints 7 \
--checkpoint-upper-bound 50000 \
--output $ST_BASE_SAVE_DIR/avg7.pt
```

#### Specialized ST Models

To finetune the base ST models and develop the specialized ST models, we used the following
command, where `$TRAIN_MUSTC_${GDR}` and `$DEV_MUSTC_${GDR}` are the gender-specific portions of MuST-C
TSV files obtained with the [MuST-Speaker](https://mt.fbk.eu/must-speakers/) resource.
Final checkpoint and log information will be saved in `$ST_SPECIALIZED_SAVE_DIR`.

```bash
python ${FBK_FAIRSEQ}/train.py $MUSTC_FOLDER \
--train-subset $TRAIN_MUSTC_${GDR} \
--valid-subset $DEV_MUSTC_${GDR} \
--save-dir $ST_SPECIALIZED_SAVE_DIR \
--num-workers 3 \
--max-epoch 7 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc \
--finetune-from-model $ST_BASE_SAVE_DIR/avg7.pt \
--config-yaml $CONFIG_ST \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 1e-3 \
--warmup-updates 25000 \
--clip-norm 10.0 \
--update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> $ST_SPECIALIZED_SAVE_DIR/finetuning.log 2> $ST_SPECIALIZED_SAVE_DIR/finetuning.err

python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $ST_SPECIALIZED_SAVE_DIR \
--num-epoch-checkpoints 4 \
--checkpoint-upper-bound 7 \
--output $ST_SPECIALIZED_SAVE_DIR/avg4.pt
```

#### Language Models

To train the LMs, we have used the following command (parameters intended for training
on 1 GPU with 12G of VRAM).
`$TRAIN_GENDERCRAWL` and `$DEV_GENDERCRAWL` are the bin files obtained from the preprocessing and are located in
`$EGOCRAWL_FOLDER`. The final checkpoint will be saved in $LM_SAVE_FOLDER.

```bash
fairseq-train $GENDERCRAWL_FOLDER \
--task language_modeling \
--train-subset $TRAIN_GENDERCRAWL \
--valid-subset $DEV_GENDERCRAWL \
--validate-interval 10000 --validate-interval-updates 100 \
--save-dir $LM_SAVE_FOLDER \
--arch transformer_lm --share-decoder-input-output-embed \
--dropout 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 200 --warmup-init-lr 1e-07 \
--tokens-per-sample 1024 --sample-break-mode none \
--max-tokens 16384 --update-freq 8 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--patience 5 \
--fp16 \
--save-interval-updates 100 \
--keep-interval-updates 10 \
--no-epoch-checkpoints

python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $LM_SAVE_FOLDER \
--num-epoch-checkpoints 5 \
--checkpoint-upper-bound $(ls $LM_SAVE_FOLDER | head -n 5 | tail -n 1 | grep -o "[0-9]*") \
--output $LM_SAVE_FOLDER/avg5.pt
```


### Inference

For the **base** and **specialized ST models**, whose checkpoint is `$CHECKPOINT`, we have used the following
command, where `MUSTSHE_DATA` is the preprocessed TSV file located in
`$MUSTSHE_SAVE_FOLDER`, `CONFIG_FILE` is the YAML file provided above.
The output translations will be saved in `$OUTPUT`.

```bash
python ${FBK_FAIRSEQ}/fairseq_cli/generate.py $MUSTSHE_SAVE_FOLDER \
--gen-subset $MUSTSHE_DATA \
--user-dir examples/speech_to_text \
--max-tokens 20000 \
--config-yaml $CONFIG_FILE \
--beam 5 \
--max-source-positions 10000 \
--max-target-positions 1000 \
--task speech_to_text_ctc \
--criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--no-repeat-ngram-size 5 \
--path $CHECKPOINT > $OUTPUT
```

For shallow fusion, inference can be executed with the following command. Here,
`$AVG_ENCODER` is the averaged encoder output provided above, and `$ILMW` and `$ELMW` are the weights for the
internal LM contribution and the external LM contribution, respectively.
The output translations will be saved in `$OUTPUT_SHALLOW_FUSION`.

```bash
python ${FBK_FAIRSEQ}/fairseq_cli/generate.py $MUSTSHE_SAVE_FOLDER \
--gen-subset $TEST_DATA \
--user-dir examples/speech_to_text \
--max-tokens 20000 \
--config-yaml $CONFIG_FILE \
--beam 5 \
--max-source-positions 10000 \
--max-target-positions 1000 \
--task speech_to_text_ctc_noilm \
--criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--no-repeat-ngram-size 5 \
--path $ST_BASE_SAVE_DIR/avg7.pt \
--lm-path $LM_SAVE_FOLDER/best.pt \
--ilm-weight $ILMW \
--lm-weight $ELMW \
--encoder-avg-outs $AVG_ENCODER > $OUTPUT_SHALLOW_FUSION
```

### Evaluation

We have used [SacreBLEU](https://github.com/mjpost/sacrebleu) 2.0.0 to compute BLEU
and the official script of MuST-SHE to compute _gender accuracy_ and _term coverage_ metrics. \
To run the paired bootstrap resampling we have used the implementation in SacreBLEU for BLEU scores,
and the following command for the gender accuracy and term coverage scores (`$MUSTSHE_REF` is the text file
containing reference sentences, `$BASE_OUTPUTS` and `$EXP_OUTPUTS` are the text files containing
the translated sentences by the two systems to be compared, and `$CATEGORIES` is the categories to be evaluated).

```bash
python ${FBK_FAIRSEQ}/examples/speech_to_text/scripts/gender/paired_bootstrap_resampling_mustshe.py \
--reference-file $MUSTSHE_REF \
--baseline-file $BASE_OUTPUTS \
--experimental-file $EXP_OUTPUTS \
--categories $CATEGORIES \
--num-samples 1000 \
--significance-level 0.05
```


## Citation

```bibtex
@inproceedings{fucci-etal-2023-integrating,
title = "Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection",
author = "Fucci, Dennis and
Gaido, Marco and
Papi, Sara and
Cettolo, Mauro and
Negri, Matteo and
Bentivogli, Luisa},
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
}
```

0 comments on commit 14ad2a5

Please sign in to comment.