This README contains the instructions to replicate the training and evaluation of the models presented in the paper *How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena*. In addition, we release the pre-trained models used in the paper.

Clone this repository and install it as explained in the original Fairseq(-py). Our experiments use MuST-C: make sure to download the corpus and preprocess it by following the preprocessing steps of Speechformer.
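As a minimal setup sketch (the clone URL placeholder and the preparation flags below are assumptions; the Speechformer README remains the authoritative reference for preprocessing):

```bash
# Clone and install this fairseq fork in editable mode, as in upstream Fairseq(-py).
git clone <URL-of-this-repository> FBK-fairseq
cd FBK-fairseq
pip install --editable ./

# Rough sketch of MuST-C preparation with the stock fairseq script;
# the actual steps and flags should follow the Speechformer README.
python examples/speech_to_text/prep_mustc_data.py \
    --data-root ${MUSTC_ROOT} --task st \
    --vocab-type unigram --vocab-size 8000
```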
Below we release the dictionary/config files and the pre-trained checkpoints obtained in our experiments. The dictionary and config files are the same as those used for the Conformer baseline, whose checkpoints can be found here.
- Source dictionary SentencePiece model and fairseq dictionary: srcdict.model, srcdict.txt
- Target dictionary SentencePiece model and fairseq dictionary:
  - en (ASR): same as srcdict.model and srcdict.txt
  - en-de: tgtdict.model, tgtdict.txt
  - en-es: tgtdict.model, tgtdict.txt
  - en-fr: tgtdict.model, tgtdict.txt
  - en-it: tgtdict.model, tgtdict.txt
  - en-nl: tgtdict.model, tgtdict.txt
  - en-pt: tgtdict.model, tgtdict.txt
  - en-ro: tgtdict.model, tgtdict.txt
- config yaml:
  ```yaml
  bpe_tokenizer:
    bpe: sentencepiece
    sentencepiece_model: tgtdict.model
  bpe_tokenizer_src:
    bpe: sentencepiece
    sentencepiece_model: srcdict.model
  input_channels: 1
  input_feat_per_channel: 80
  sampling_alpha: 1.0
  specaugment:
    freq_mask_F: 27
    freq_mask_N: 1
    time_mask_N: 1
    time_mask_T: 100
    time_mask_p: 1.0
    time_wrap_W: 0
  transforms:
    '*':
    - utterance_cmvn
    _train:
    - utterance_cmvn
    - specaugment
  vocab_filename: tgtdict.txt
  vocab_filename_src: srcdict.txt
  ```
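As a usage note, fairseq resolves the file names in config.yaml (SentencePiece models and dictionaries) relative to the data directory (or, depending on the fairseq version, the directory containing config.yaml), so a layout along these lines is expected. The layout itself is an assumption based on standard fairseq speech-to-text setups:

```bash
# Hypothetical contents of ${MUSTC_ROOT} after preprocessing and after
# copying the released files (names as in the list above):
ls ${MUSTC_ROOT}
# config.yaml  srcdict.model  srcdict.txt  tgtdict.model  tgtdict.txt
# train_st_src.tsv  dev_st_src.tsv  tst-COMMON_st_src.tsv
```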
| Model | en (ASR) | en-de | en-es | en-fr | en-it | en-nl | en-pt | en-ro |
|---|---|---|---|---|---|---|---|---|
| ConfHyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
| - non-causal Hyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
| Hybrid ConfHyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
| - non-causal Hyena | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt | ckp.pt |
For the Conformer baseline, please refer to the bug-free Conformer README.
For the Hybrid ConfHyena models, training was executed with the following script:
```bash
LANG=$1
MUSTC_ROOT=$2
TASK=$3
SAVE_DIR=$4

mkdir -p $SAVE_DIR

# Train the model (Hybrid ConfHyena configuration).
python ${FBK_fairseq}/train.py ${MUSTC_ROOT} \
    --train-subset train_${TASK}_src --valid-subset dev_${TASK}_src \
    --user-dir examples/speech_to_text --seed 1 \
    --num-workers 2 --max-update 100000 --patience 10 --keep-last-epochs 12 \
    --max-tokens 40000 --update-freq 4 \
    --task speech_to_text_ctc --config-yaml config.yaml \
    --criterion ctc_multi_loss \
    --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --arch confhyena --conformer-after-compression --stride 2 \
    --ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 2e-3 --lr-scheduler inverse_sqrt --warmup-updates 25000 \
    --clip-norm 10.0 \
    --skip-invalid-size-inputs-valid-test \
    --save-dir ${SAVE_DIR} \
    --log-format simple > $SAVE_DIR/train.log 2> $SAVE_DIR/train.err

# Average 5 epoch checkpoints into avg5.pt and, once the average exists,
# delete the per-epoch checkpoints to save disk space.
python ${FBK_fairseq}/scripts/average_checkpoints.py \
    --input $SAVE_DIR --num-epoch-checkpoints 5 \
    --checkpoint-upper-bound $(ls $SAVE_DIR | head -n 5 | tail -n 1 | grep -o "[0-9]*") \
    --output $SAVE_DIR/avg5.pt

if [ -f $SAVE_DIR/avg5.pt ]; then
  rm $SAVE_DIR/checkpoint??.pt
fi
```
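For example, assuming the script above is saved as train_confhyena.sh (a hypothetical name), an en-de ST training run could be launched as:

```bash
# FBK_fairseq must point to the root of this repository.
export FBK_fairseq=/path/to/FBK-fairseq
bash train_confhyena.sh de /path/to/MUSTC_en-de st checkpoints/hybrid_confhyena_en-de
```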
The ConfHyena models can be obtained by removing the `--conformer-after-compression` parameter. The causal version of the two architectures ("- non-causal Hyena" in the paper and in the table above) can be obtained by adding the `--hyena-causal` parameter to the command. The command is meant to be executed on 2 A100 GPUs with 40GB of VRAM each.
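To summarize, the four variants in the table above correspond to the following flag combinations (a recap of the instructions above, everything else in the training command unchanged):

```bash
# Hybrid ConfHyena (default command above):
#   --arch confhyena --conformer-after-compression
# Hybrid ConfHyena, causal ("- non-causal Hyena" row):
#   --arch confhyena --conformer-after-compression --hyena-causal
# ConfHyena:
#   --arch confhyena
# ConfHyena, causal ("- non-causal Hyena" row):
#   --arch confhyena --hyena-causal
```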
Once you have downloaded the pre-trained checkpoints and the related config/dictionary files, generate the output with:
```bash
python ${FBK_fairseq}/fairseq_cli/generate.py ${MUSTC_ROOT} \
    --user-dir examples/speech_to_text \
    --config-yaml config.yaml --gen-subset tst-COMMON_st_src \
    --max-source-positions 10000 --max-target-positions 1000 \
    --task speech_to_text_ctc \
    --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
    --beam 5 --no-repeat-ngram-size 5 --path ${PRETRAINED_CHECKPOINT} > ${OUTPUT_FILE}
```
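The fairseq generation log interleaves hypotheses with other information. To extract the detokenized hypotheses and score them, a common recipe is the following (the scoring tool, sacrebleu, is an assumption and not part of this repository):

```bash
# D-<id> lines hold the detokenized hypotheses: "D-<id>\t<score>\t<text>".
grep ^D- ${OUTPUT_FILE} | sed 's/^D-//' | sort -n | cut -f3 > hypotheses.txt
# Score against the plain-text references, e.g. with sacrebleu:
sacrebleu references.txt -i hypotheses.txt -m bleu
```

For the ASR model, WER would be computed on the extracted hypotheses instead of BLEU.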
Please cite as:

```bibtex
@inproceedings{gaido-et-al-2024-hyena,
  title = {{How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena}},
  author = {Marco Gaido and Sara Papi and Matteo Negri and Luisa Bentivogli},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  year = {2024},
  address = {Turin, Italy},
}
```