This README contains the instructions to replicate the training and evaluation of the models presented in the paper *How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena*. In addition, we release the pre-trained models used in the paper.

Clone this repository and install it as explained in the original Fairseq(-py). Our experiments use MuST-C: make sure to download the corpus and preprocess it by following the preprocessing steps of Speechformer.
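As a minimal setup sketch (the clone URL placeholder and the preparation flags below are assumptions; the Speechformer README remains the authoritative reference for preprocessing):

```bash
# Clone and install this fairseq fork in editable mode, as in upstream Fairseq(-py).
git clone <URL-of-this-repository> FBK-fairseq
cd FBK-fairseq
pip install --editable ./

# Rough sketch of MuST-C preparation with the stock fairseq script;
# the actual steps and flags should follow the Speechformer README.
python examples/speech_to_text/prep_mustc_data.py \
    --data-root ${MUSTC_ROOT} --task st \
    --vocab-type unigram --vocab-size 8000
```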
Below we release the dictionary/config files and the pre-trained checkpoints obtained in our experiments. The dictionary and config files are the same as those used for the Conformer baseline, whose checkpoints can be found here.
- Source dictionary SentencePiece model and fairseq dictionary: srcdict.model, srcdict.txt
- Target dictionary SentencePiece model and fairseq dictionary:
  - en (ASR): same as srcdict.model and srcdict.txt
  - en-de: tgtdict.model, tgtdict.txt
  - en-es: tgtdict.model, tgtdict.txt
  - en-fr: tgtdict.model, tgtdict.txt
  - en-it: tgtdict.model, tgtdict.txt
  - en-nl: tgtdict.model, tgtdict.txt
  - en-pt: tgtdict.model, tgtdict.txt
  - en-ro: tgtdict.model, tgtdict.txt
- config yaml:
  ```yaml
  bpe_tokenizer:
    bpe: sentencepiece
    sentencepiece_model: tgtdict.model
  bpe_tokenizer_src:
    bpe: sentencepiece
    sentencepiece_model: srcdict.model
  input_channels: 1
  input_feat_per_channel: 80
  sampling_alpha: 1.0
  specaugment:
    freq_mask_F: 27
    freq_mask_N: 1
    time_mask_N: 1
    time_mask_T: 100
    time_mask_p: 1.0
    time_wrap_W: 0
  transforms:
    '*':
    - utterance_cmvn
    _train:
    - utterance_cmvn
    - specaugment
  vocab_filename: tgtdict.txt
  vocab_filename_src: srcdict.txt
  ```
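As a usage note, fairseq resolves the file names in config.yaml (SentencePiece models and dictionaries) relative to the data directory (or, depending on the fairseq version, the directory containing config.yaml), so a layout along these lines is expected. The layout itself is an assumption based on standard fairseq speech-to-text setups:

```bash
# Hypothetical contents of ${MUSTC_ROOT} after preprocessing and after
# copying the released files (names as in the list above):
ls ${MUSTC_ROOT}
# config.yaml  srcdict.model  srcdict.txt  tgtdict.model  tgtdict.txt
# train_st_src.tsv  dev_st_src.tsv  tst-COMMON_st_src.tsv
```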
| Model | en (ASR) | en-de | en-es | en-fr | en-it | en-nl | en-pt | en-ro |
|---|---|---|---|---|---|---|---|---|
| ConfHyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
| - non-causal Hyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
| Hybrid ConfHyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
| - non-causal Hyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
For the Conformer baseline, please refer to the bug-free Conformer README.
For the Hybrid ConfHyena models, training was executed with the following script:
```bash
LANG=$1
MUSTC_ROOT=$2
TASK=$3
SAVE_DIR=$4

mkdir -p $SAVE_DIR

# Train the model (Hybrid ConfHyena configuration).
python ${FBK_fairseq}/train.py ${MUSTC_ROOT} \
    --train-subset train_${TASK}_src --valid-subset dev_${TASK}_src \
    --user-dir examples/speech_to_text --seed 1 \
    --num-workers 2 --max-update 100000 --patience 10 --keep-last-epochs 12 \
    --max-tokens 40000 --update-freq 4 \
    --task speech_to_text_ctc --config-yaml config.yaml \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --arch confhyena --conformer-after-compression --stride 2 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 25000 \
    --clip-norm 10.0 \
    --skip-invalid-size-inputs-valid-test \
    --save-dir ${SAVE_DIR} \
    --log-format simple > $SAVE_DIR/train.log 2> $SAVE_DIR/train.err

# Average 5 epoch checkpoints into avg5.pt and, once the average exists,
# delete the per-epoch checkpoints to save disk space.
python ${FBK_fairseq}/scripts/average_checkpoints.py \
    --input $SAVE_DIR --num-epoch-checkpoints 5 \
    --checkpoint-upper-bound $(ls $SAVE_DIR | head -n 5 | tail -n 1 | grep -o "[0-9]*") \
    --output $SAVE_DIR/avg5.pt

if [ -f $SAVE_DIR/avg5.pt ]; then
  rm $SAVE_DIR/checkpoint??.pt
fi
```
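For example, assuming the script above is saved as train_confhyena.sh (a hypothetical name), an en-de ST training run could be launched as:

```bash
# FBK_fairseq must point to the root of this repository.
export FBK_fairseq=/path/to/FBK-fairseq
bash train_confhyena.sh de /path/to/MUSTC_en-de st checkpoints/hybrid_confhyena_en-de
```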
The ConfHyena models can be obtained by removing the `--conformer-after-compression` parameter. The causal version of the two architectures ("- non-causal Hyena" in the paper and in the table above) can be obtained by adding the `--hyena-causal` parameter to the command. The command is meant to be executed on 2 A100 GPUs with 40GB of VRAM each.
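To summarize, the four variants in the table above correspond to the following flag combinations (a recap of the instructions above, everything else in the training command unchanged):

```bash
# Hybrid ConfHyena (default command above):
#   --arch confhyena --conformer-after-compression
# Hybrid ConfHyena, causal ("- non-causal Hyena" row):
#   --arch confhyena --conformer-after-compression --hyena-causal
# ConfHyena:
#   --arch confhyena
# ConfHyena, causal ("- non-causal Hyena" row):
#   --arch confhyena --hyena-causal
```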
Once you have downloaded the pre-trained checkpoints and the related config/dictionary files, generate the output with:
```bash
python ${FBK_fairseq}/fairseq_cli/generate.py ${MUSTC_ROOT} \
    --user-dir examples/speech_to_text \
    --config-yaml config.yaml --gen-subset tst-COMMON_st_src \
    --max-source-positions 10000 --max-target-positions 1000 \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
    --beam 5 --no-repeat-ngram-size 5 --path ${PRETRAINED_CHECKPOINT} > ${OUTPUT_FILE}
```
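The fairseq generation log interleaves hypotheses with other information. To extract the detokenized hypotheses and score them, a common recipe is the following (the scoring tool, sacrebleu, is an assumption and not part of this repository):

```bash
# D-<id> lines hold the detokenized hypotheses: "D-<id>\t<score>\t<text>".
grep ^D- ${OUTPUT_FILE} | sed 's/^D-//' | sort -n | cut -f3 > hypotheses.txt
# Score against the plain-text references, e.g. with sacrebleu:
sacrebleu references.txt -i hypotheses.txt -m bleu
```

For the ASR model, WER would be computed on the extracted hypotheses instead of BLEU.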
Please cite as:

```bibtex
@inproceedings{gaido-et-al-2024-hyena,
  title = {{How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena}},
  author = {Marco Gaido and Sara Papi and Matteo Negri and Luisa Bentivogli},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year = {2024},
  address = {Turin, Italy},
}
```