Code and models for the paper "Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection", accepted at EMNLP 2023.
To ensure reproducibility, we release the model checkpoints used in our experiments, together with the SentencePiece models, the vocabulary files, and the YAML configuration files:
- Baseline ST Models:
  - en-es: vocab_src_txt, spm_src_model, vocab_tgt_txt, spm_tgt_model
  - en-fr: vocab_src_txt, spm_src_model, vocab_tgt_txt, spm_tgt_model
  - en-it: vocab_src_txt, spm_src_model, vocab_tgt_txt, spm_tgt_model
  - config_file
- Specialized ST Models: en-es: M, F | en-fr: M, F | en-it: M, F
- Language Models: es: M, F | fr: M, F | it: M, F
The following instructions make it possible to retrain our models from scratch, as well as to use the released checkpoints for inference.
To train the ST models, we have used the MuST-C dataset.
The data, comprising the text files in $MUSTC_TEXT_DATA_FOLDER and the audio files in $MUSTC_WAV_FOLDER, can be preprocessed for each target language ($TGT_LANG) with the following command. The results will be saved in $MUSTC_SAVE_FOLDER.
python ${FBK_FAIRSEQ}/examples/speech_to_text/preprocess_generic.py \
--data-root $MUSTC_TEXT_DATA_FOLDER \
--save-dir $MUSTC_SAVE_FOLDER \
--wav-dir $MUSTC_WAV_FOLDER \
--split train dev \
--vocab-type bpe \
--vocab-size 8000 \
--src-lang en --tgt-lang $TGT_LANG \
--task st \
--n-mel-bins 80
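For concreteness, the command above can be driven by a small wrapper that sets the variables and loops over the three target languages. All paths below are placeholders for a local setup, and the per-language save subfolder is our convention, not a requirement; the commands are collected and printed for inspection rather than executed.

```shell
# Placeholder paths: adapt to your local setup.
FBK_FAIRSEQ=/path/to/FBK-fairseq
MUSTC_TEXT_DATA_FOLDER=/data/mustc/text
MUSTC_WAV_FOLDER=/data/mustc/wav
MUSTC_SAVE_FOLDER=/data/mustc/preprocessed

# Collect one preprocessing command per target language; print them for
# inspection, then run each with e.g. `eval "${CMDS[0]}"`.
CMDS=()
for TGT_LANG in es fr it; do
  CMDS+=("python ${FBK_FAIRSEQ}/examples/speech_to_text/preprocess_generic.py \
    --data-root $MUSTC_TEXT_DATA_FOLDER \
    --save-dir $MUSTC_SAVE_FOLDER/en-$TGT_LANG \
    --wav-dir $MUSTC_WAV_FOLDER \
    --split train dev \
    --vocab-type bpe --vocab-size 8000 \
    --src-lang en --tgt-lang $TGT_LANG \
    --task st --n-mel-bins 80")
done
printf '%s\n\n' "${CMDS[@]}"
```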
For testing the ST models, we have used the MuST-SHE test set.
The text files in $MUSTSHE_EXT_DATA_FOLDER and the audio files in $MUSTSHE_WAV_FOLDER can be preprocessed for each target language ($TGT_LANG) using the text vocabularies ($VOCAB_SRC and $VOCAB_TGT) obtained in the previous preprocessing step. The results will be saved in $MUSTSHE_SAVE_FOLDER.
python ${FBK_FAIRSEQ}/examples/speech_to_text/preprocess_generic.py \
--data-root $MUSTSHE_EXT_DATA_FOLDER \
--save-dir $MUSTSHE_SAVE_FOLDER \
--wav-dir $MUSTSHE_WAV_FOLDER \
--split MONOLINGUAL.${TGT_LANG}_v1.2 \
--vocab-type bpe \
--vocab-file-src $VOCAB_SRC \
--vocab-file-tgt $VOCAB_TGT \
--src-lang en --tgt-lang $TGT_LANG \
--task st \
--n-mel-bins 80
To train the LMs, we have used GenderCrawl, a set of text corpora derived from ParaCrawl by selecting the sentences that contain gender-marked words referring to the speaker.
These data can be preprocessed with the following command, where $TRAIN_DATA_TOKENIZED and $DEV_DATA_TOKENIZED are the training and validation text data tokenized with the SentencePiece model obtained from the ST preprocessing, $VOCAB_SRC is the txt vocabulary used for the ST preprocessing, and $GENDERCRAWL_SAVE_FOLDER is the folder where the preprocessed data are stored.
fairseq-preprocess \
--task language_modeling \
--cpu \
--only-source \
--trainpref $TRAIN_DATA_TOKENIZED \
--validpref $DEV_DATA_TOKENIZED \
--srcdict $VOCAB_SRC \
--destdir $GENDERCRAWL_SAVE_FOLDER
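The tokenized files expected above can be produced with SentencePiece's `spm_encode` CLI. The file names below are placeholders; the SentencePiece model must be the one from the ST preprocessing, so that the tokens match the vocabulary passed via --srcdict. The block only builds and prints the command, since running it requires the sentencepiece CLI.

```shell
# Hypothetical file names: adapt to your setup.
SPM_MODEL=spm.model                  # SentencePiece model from the ST preprocessing
RAW=gendercrawl.train.raw            # raw GenderCrawl text, one sentence per line
TOK=gendercrawl.train.tok            # tokenized output, i.e. $TRAIN_DATA_TOKENIZED

# Build the command; run it with `eval "$CMD"` (requires the sentencepiece CLI).
CMD="spm_encode --model=$SPM_MODEL --output_format=piece < $RAW > $TOK"
echo "$CMD"
```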
To train the base ST models, we have used the following command (parameters intended for training on 4 GPUs, each with 40 GB of VRAM). $TRAIN_MUSTC and $DEV_MUSTC are the TSV files obtained from the preprocessing, located in $MUSTC_FOLDER. $CONFIG_ST is a YAML file and can be downloaded above. The final checkpoint and log information will be saved in $ST_BASE_SAVE_DIR.
python ${FBK_FAIRSEQ}/train.py $MUSTC_FOLDER \
--train-subset $TRAIN_MUSTC \
--valid-subset $DEV_MUSTC \
--save-dir $ST_BASE_SAVE_DIR \
--num-workers 3 \
--max-update 50000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc \
--config-yaml $CONFIG_ST \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--clip-norm 10.0 \
--update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> $ST_BASE_SAVE_DIR/train.log 2> $ST_BASE_SAVE_DIR/train.err
python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $ST_BASE_SAVE_DIR \
--num-update-checkpoints 7 \
--checkpoint-upper-bound 50000 \
--output $ST_BASE_SAVE_DIR/avg7.pt
To obtain the specialized ST models, we fine-tuned the base ST models with the following command, where $TRAIN_MUSTC_${GDR} and $DEV_MUSTC_${GDR} are the gender-specific portions of the MuST-C TSV files, obtained with the MuST-Speaker resource. The final checkpoint and log information will be saved in $ST_SPECIALIZED_SAVE_DIR.
python ${FBK_FAIRSEQ}/train.py $MUSTC_FOLDER \
--train-subset $TRAIN_MUSTC_${GDR} \
--valid-subset $DEV_MUSTC_${GDR} \
--save-dir $ST_SPECIALIZED_SAVE_DIR \
--num-workers 3 \
--max-epoch 7 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc \
--finetune-from-model $ST_BASE_SAVE_DIR/avg7.pt \
--config-yaml $CONFIG_ST \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 1e-3 \
--warmup-updates 25000 \
--clip-norm 10.0 \
--update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> $ST_SPECIALIZED_SAVE_DIR/finetuning.log 2> $ST_SPECIALIZED_SAVE_DIR/finetuning.err
python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $ST_SPECIALIZED_SAVE_DIR \
--num-epoch-checkpoints 4 \
--checkpoint-upper-bound 7 \
--output $ST_SPECIALIZED_SAVE_DIR/avg4.pt
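As an illustration of how such gender-specific portions can be derived, the sketch below filters a MuST-C-style TSV on a speaker-gender column. The column name, its position, and the M/F values are hypothetical here; the actual annotation comes from the MuST-Speaker resource.

```shell
# Toy TSV with a hypothetical "speaker_gender" column (column 3 here).
cat > /tmp/mustc_sample.tsv <<'EOF'
id	audio	speaker_gender	tgt_text
1	a.wav	F	hola
2	b.wav	M	ciao
3	c.wav	F	bonjour
EOF

# Keep the header plus the rows whose gender column matches $GDR.
GDR=F
awk -F'\t' -v g="$GDR" 'NR==1 || $3==g' /tmp/mustc_sample.tsv > /tmp/mustc_${GDR}.tsv
```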
To train the LMs, we have used the following command (parameters intended for training on 1 GPU with 12 GB of VRAM). $TRAIN_GENDERCRAWL and $DEV_GENDERCRAWL are the binarized files obtained from the preprocessing, located in $GENDERCRAWL_FOLDER (the $GENDERCRAWL_SAVE_FOLDER of the preprocessing step). The final checkpoint will be saved in $LM_SAVE_FOLDER.
fairseq-train $GENDERCRAWL_FOLDER \
--task language_modeling \
--train-subset $TRAIN_GENDERCRAWL \
--valid-subset $DEV_GENDERCRAWL \
--validate-interval 10000 --validate-interval-updates 100 \
--save-dir $LM_SAVE_FOLDER \
--arch transformer_lm --share-decoder-input-output-embed \
--dropout 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 200 --warmup-init-lr 1e-07 \
--tokens-per-sample 1024 --sample-break-mode none \
--max-tokens 16384 --update-freq 8 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--patience 5 \
--fp16 \
--save-interval-updates 100 \
--keep-interval-updates 10 \
--no-epoch-checkpoints
python ${FBK_FAIRSEQ}/scripts/average_checkpoints.py --input $LM_SAVE_FOLDER \
--num-epoch-checkpoints 5 \
--checkpoint-upper-bound $(ls $LM_SAVE_FOLDER | head -n 5 | tail -n 1 | grep -o "[0-9]*") \
--output $LM_SAVE_FOLDER/avg5.pt
For inference with the base and specialized ST models, whose checkpoint is $CHECKPOINT, we have used the following command, where $MUSTSHE_DATA is the preprocessed TSV file located in $MUSTSHE_SAVE_FOLDER and $CONFIG_FILE is the YAML file provided above. The output translations will be saved in $OUTPUT.
python ${FBK_FAIRSEQ}/fairseq_cli/generate.py $MUSTSHE_SAVE_FOLDER \
--gen-subset $MUSTSHE_DATA \
--user-dir examples/speech_to_text \
--max-tokens 20000 \
--config-yaml $CONFIG_FILE \
--beam 5 \
--max-source-positions 10000 \
--max-target-positions 1000 \
--task speech_to_text_ctc \
--criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--no-repeat-ngram-size 5 \
--path $CHECKPOINT > $OUTPUT
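fairseq's generate.py interleaves source (S-), reference (T-), and hypothesis (H-/D-) lines in $OUTPUT, so the detokenized hypotheses must be extracted and restored to sentence order before evaluation. A common pipeline is sketched below on a toy excerpt; the exact field layout can vary slightly across fairseq versions.

```shell
# Toy excerpt mimicking fairseq-generate output (sentences out of order).
cat > /tmp/gen.out <<'EOF'
S-1	source two
T-1	ref two
H-1	-0.5	hyp two tokens
D-1	-0.5	hypothesis two
S-0	source one
T-0	ref one
H-0	-0.3	hyp one tokens
D-0	-0.3	hypothesis one
EOF

# Keep D- lines (detokenized hypotheses), sort by sentence id, keep the text field.
grep '^D-' /tmp/gen.out | sed 's/^D-//' | sort -n | cut -f3 > /tmp/hyp.txt
```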
For shallow fusion, inference can be executed with the following command. Here, $AVG_ENCODER is the averaged encoder output provided above, and $ILMW and $ELMW are the weights for the internal and external LM contributions, respectively. The output translations will be saved in $OUTPUT_SHALLOW_FUSION.
python ${FBK_FAIRSEQ}/fairseq_cli/generate.py $MUSTSHE_SAVE_FOLDER \
--gen-subset $TEST_DATA \
--user-dir examples/speech_to_text \
--max-tokens 20000 \
--config-yaml $CONFIG_FILE \
--beam 5 \
--max-source-positions 10000 \
--max-target-positions 1000 \
--task speech_to_text_ctc_noilm \
--criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--no-repeat-ngram-size 5 \
--path $ST_BASE_SAVE_DIR/avg7.pt \
--lm-path $LM_SAVE_FOLDER/best.pt \
--ilm-weight $ILMW \
--lm-weight $ELMW \
--encoder-avg-outs $AVG_ENCODER > $OUTPUT_SHALLOW_FUSION
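For intuition, the two weights enter the beam-search score roughly as follows. This is our notation for standard shallow fusion with internal-LM subtraction, not necessarily the paper's exact formulation; see the paper for the precise details.

```
score(y_t) = log P_ST(y_t | y_<t, x)            # ST model
           + ELMW * log P_ELM(y_t | y_<t)       # external LM (--lm-weight)
           - ILMW * log P_ILM(y_t | y_<t)       # internal LM estimate (--ilm-weight)
```

Here P_ILM is an estimate of the ST decoder's internal language model, which the --encoder-avg-outs flag suggests is obtained by feeding the decoder the averaged encoder outputs ($AVG_ENCODER) in place of the actual audio representation.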
We have used SacreBLEU 2.0.0 to compute BLEU
and the official script of MuST-SHE to compute gender accuracy and term coverage metrics.
To run the paired bootstrap resampling, we have used the SacreBLEU implementation for BLEU scores, and the following command for the gender accuracy and term coverage scores ($MUSTSHE_REF is the text file containing the reference sentences, $BASE_OUTPUTS and $EXP_OUTPUTS are the text files containing the translations produced by the two systems to be compared, and $CATEGORIES lists the categories to be evaluated).
python ${FBK_FAIRSEQ}/examples/speech_to_text/scripts/gender/paired_bootstrap_resampling_mustshe.py \
--reference-file $MUSTSHE_REF \
--baseline-file $BASE_OUTPUTS \
--experimental-file $EXP_OUTPUTS \
--categories $CATEGORIES \
--num-samples 1000 \
--significance-level 0.05
@inproceedings{fucci-etal-2023-integrating,
title = "Integrating Language Models into Direct Speech Translation: An Inference-Time Solution to Control Gender Inflection",
author = "Fucci, Dennis and
Gaido, Marco and
Papi, Sara and
Cettolo, Mauro and
Negri, Matteo and
Bentivogli, Luisa},
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
}