Code and models for the paper: "No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation" accepted at ASRU 2023.
To ensure complete reproducibility, we release the ASR model checkpoints used in our experiments, together with the SentencePiece model, the vocabulary files, the YAML files, and the outputs obtained by each model:
- Baseline: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + VTLP: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Random: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Opposite: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Random - Formant Shifting: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Baseline + Random - Formant Shifting - Gender Swapping: checkpoint | config.yaml | tst-COMMON.out | tst-HE.out
- Vocabulary: vocab.txt | spm_model
The data (MuST-C v1, en-es direction) has to be preprocessed with:
python /path/to/fbk-fairseq/examples/speech_to_text/preprocess_generic.py --data-root /data/to/mustc \
--save-dir /data/to/mustc/save_folder --wav-dir /data/to/mustc/wav_folder \
--split train dev tst-HE tst-COMMON --vocab-type bpe --src-lang en --tgt-lang en \
--task asr --n-mel-bins 80 --store-waveform
The following parameters are intended for training on a system with 4 GPUs, each having 16 GB of VRAM; with a different number of GPUs, --update-freq should be scaled so that the product of the two (and hence the effective batch size) stays the same. The training_data and dev_data files are the TSVs obtained from the preprocessing step, and the config_file is one of the YAML files that can be downloaded above.
python train.py /path/to/data_folder \
--train-subset training_data --valid-subset dev_data \
--save-dir /path/to/save_folder \
--num-workers 5 --max-update 50000 --patience 10 --keep-last-epochs 13 \
--max-tokens 10000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml config_file \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--clip-norm 10.0 \
--seed 1 --update-freq 8 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> /path/to/save_folder/train.log 2> /path/to/save_folder/train.err
The epoch checkpoints are then averaged into a single model (avg5.pt) with:
python /path/to/fbk-fairseq/scripts/average_checkpoints.py \
--input /path/to/save_folder --num-epoch-checkpoints 5 \
--checkpoint-upper-bound $(ls /path/to/save_folder | head -n 5 | tail -n 1 | grep -o "[0-9]*") \
--output /path/to/save_folder/avg5.pt
Inference can be executed with the following command, setting TEST_DATA to the name of one of the TSVs obtained from the preprocessing and CONFIG_FILE to one of the YAML files provided above:
python /path/to/fbk-fairseq/fairseq_cli/generate.py /path/to/data_folder \
--gen-subset $TEST_DATA \
--user-dir examples/speech_to_text \
--max-tokens 40000 \
--config-yaml $CONFIG_FILE \
--beam 5 \
--max-source-positions 10000 \
--max-target-positions 1000 \
--task speech_to_text_ctc \
--criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--no-repeat-ngram-size 5 \
--path /path/to/checkpoint > /path/to/output_file
We use the Python package JiWER to compute the word error rate. Gender-specific evaluations are performed by partitioning the test sets based on the MuST-Speaker resource.
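For reference, the following is a minimal sketch (not part of the released code) of how the WER can be computed with jiwer from an output file produced by generate.py. It assumes the standard fairseq output format, in which references appear on tab-separated lines prefixed with T-<id> and hypotheses on lines prefixed with D-<id>, and it merges SentencePiece pieces if they are still present in the text; the file path is a placeholder.

import jiwer

def clean(text):
    # Merge SentencePiece pieces if the text has not been detokenized yet.
    if "\u2581" in text:
        text = text.replace(" ", "").replace("\u2581", " ").strip()
    return text

refs, hyps = {}, {}
with open("/path/to/output_file", encoding="utf-8") as f:
    for line in f:
        if line.startswith("T-"):  # T-<id>\t<reference>
            utt_id, text = line.rstrip("\n").split("\t", 1)
            refs[int(utt_id[2:])] = clean(text)
        elif line.startswith("D-"):  # D-<id>\t<score>\t<hypothesis>
            utt_id, _, text = line.rstrip("\n").split("\t", 2)
            hyps[int(utt_id[2:])] = clean(text)

# For gender-specific scores, restrict `ids` to the utterances spoken by
# female or male speakers according to the MuST-Speaker annotations.
ids = sorted(refs)
print(jiwer.wer([refs[i] for i in ids], [hyps[i] for i in ids]))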
If you use this work, please cite:

@inproceedings{fucci2023pitch,
  title = {{No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation}},
  author = {Dennis Fucci and Marco Gaido and Matteo Negri and Mauro Cettolo and Luisa Bentivogli},
  booktitle = {IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year = {2023},
  month = dec,
  address = {Taipei, Taiwan}
}