We provide the implementation for speech-to-unit translation (S2UT) proposed in Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation (Popuri et al. 2022) and the various pretrained models used.
For unit extraction, we used the multilingual HuBERT model open sourced in Textless S2ST with Real Data.
wav2vec 2.0:

Language | Block type | Model size | Dataset | Model |
---|---|---|---|---|
Es | Transformer | BASE | Voxpopuli | ckpt |
Es | Transformer | LARGE | Voxpopuli | ckpt |
Es | Conformer | LARGE | Voxpopuli | ckpt |
En | Transformer | BASE | Librilight | ckpt |
En | Conformer | LARGE | Librilight | ckpt |
Unit mBART:

Unit size | Dataset | Unit config | Model |
---|---|---|---|
1000 | Voxpopuli En, Es unlabelled speech | mbart_large | ckpt |
Data preparation

- To prepare data for S2UT finetuning, follow the steps from Direct S2ST with Discrete Units and format the data in the S2UT format. Note that we use 1000 units from the eleventh layer (--layer 11) of the multilingual HuBERT model linked above instead (a unit-extraction sketch follows this list).
- Run the following to rewrite the header row of each manifest into the S2UT format:
var="id\taudio\tn_frames\ttgt_text\ttgt_n_frames"
sed -i "1s/.*/$var/" ${SPLIT}.tsv
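For reference, here is a minimal unit-extraction sketch. It assumes the k-means quantization script from fairseq's GSLM speech2unit example together with the multilingual HuBERT checkpoint and a 1000-cluster k-means model; the script path and argument names should be verified against the Direct S2ST with Discrete Units instructions, and ${HUBERT_CKPT}, ${KM_MODEL}, ${MANIFEST}, and ${OUT_UNITS} are placeholder paths.

```bash
# Sketch: quantize target speech into 1000 discrete units using eleventh-layer
# mHuBERT features. Flag names follow the GSLM speech2unit quantization script;
# check them against your fairseq checkout before running.
python examples/textless_nlp/gslm/speech2unit/clustering/quantize_with_kmeans.py \
  --feature_type hubert \
  --acoustic_model_path ${HUBERT_CKPT} \
  --kmeans_model_path ${KM_MODEL} \
  --layer 11 \
  --manifest_path ${MANIFEST} \
  --out_quantized_file_path ${OUT_UNITS} \
  --extension ".wav"
```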
Speech-to-unit translation (S2UT)
Here is an example of finetuning S2UT models with 1000 discrete units as the target. You can download the sample config file and vocabulary for Es-En from here:
fairseq-train $DATA_ROOT \
--config-yaml config.yaml \
--task speech_to_text --arch xm_transformer \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--share-decoder-input-output-embed --adaptor-n-layers 1 --normalize \
--dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
--train-subset train --valid-subset dev \
--load-pretrained-decoder-from ${unit_mBART} --w2v-path ${wav2vec2.0} \
--mask-prob 0.3 --mask-channel-length 32 --mask-channel-prob 0.25 \
--save-dir ${MODEL_DIR} --checkpoint-activations --encoder-proj \
--lr 0.0005 --lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-7 --warmup-updates 10000 \
--optimizer adam --adam-betas "(0.9,0.98)" --clip-norm 10.0 \
--max-update 20000 --max-tokens 4000 --max-tokens-valid 4000 --max-source-positions 4000 \
--max-target-positions 4000 --update-freq 15 \
--seed 1 --fp16 --num-workers 1
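For reference, the placeholder variables in the command above could be set along these lines (hypothetical local paths; point them at your own data directory and the downloaded checkpoints):

```bash
# Hypothetical paths; substitute your own locations. ${unit_mBART} and
# ${wav2vec2.0} above are documentation placeholders; a dot is not valid in a
# shell variable name, so rename the latter (e.g. ${W2V_CKPT}) when running.
DATA_ROOT=/path/to/s2ut_data          # contains config.yaml, the unit vocabulary, and {train,dev}.tsv
MODEL_DIR=/path/to/save/checkpoints   # output directory for finetuned checkpoints
unit_mBART=/path/to/unit_mbart.pt     # unit mBART checkpoint from the table above
W2V_CKPT=/path/to/wav2vec2.pt         # wav2vec 2.0 checkpoint from the table above
```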
- Adjust --update-freq according to the number of GPUs available. In the above we set --update-freq 15 (e.g., with 8 GPUs) to simulate training with 120 GPUs.
- In the above setting we finetune the model end to end, corresponding to the full setup in the paper.
- To apply LNA-E partial finetuning, add --finetune-w2v-params layer_norm,self_attn.
- To apply LNA-D partial finetuning, add --finetune-decoder-params encoder_attn,layer_norm,self_attn. To optionally freeze the encoder for the first ${K} updates, use --freeze-finetune-updates ${K}.
- For LNA-E,D partial finetuning, add both of the above options (see the sketch after this list).
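As an illustration, the LNA-E,D variant adds the following flags to the fairseq-train command above (a sketch assuming the flag names listed in this section; "..." stands for the remaining arguments of the full command, and --freeze-finetune-updates is optional):

```bash
# LNA-E,D partial finetuning: both the encoder- and decoder-side flags are
# appended to the full fairseq-train command shown earlier.
fairseq-train $DATA_ROOT \
  ... \
  --finetune-w2v-params layer_norm,self_attn \
  --finetune-decoder-params encoder_attn,layer_norm,self_attn \
  --freeze-finetune-updates ${K}
```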
Unit-based HiFi-GAN vocoder
We apply the unit-based HiFi-GAN vocoders open sourced in Textless S2ST with Real Data to convert the predicted unit sequences to waveform.
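The waveform-generation command later in this section refers to the vocoder through two variables; a minimal setup, with hypothetical local paths pointing at the downloaded checkpoint and config:

```bash
# Hypothetical local paths to the downloaded unit-based HiFi-GAN vocoder files.
VOCODER_CKPT=/path/to/vocoder_checkpoint
VOCODER_CFG=/path/to/vocoder_config.json
```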
Speech-to-unit translation (S2UT)
- Follow the same inference process as in fairseq-S2T to generate unit sequences (${RESULTS_PATH}/generate-${GEN_SUBSET}.txt).
fairseq-generate $DATA_ROOT \
--config-yaml config.yaml \
--task speech_to_text \
--path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
--max-tokens 10000 --max-source-positions 10000 --max-target-positions 10000 \
--beam 10 --max-len-a 1 --max-len-b 200 \
--results-path ${RESULTS_PATH}
- Convert unit sequences to waveform.
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
sed 's/^D-//ig' | sort -nk1 | cut -f3 \
> ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit
python examples/speech_to_speech/generate_waveform_from_code.py \
--in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
--vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
--results-path ${RESULTS_PATH} --dur-prediction
To evaluate speech translation output, we first apply ASR on the speech output and then compute the BLEU score between the ASR-decoded text and the references using sacreBLEU (a minimal scoring sketch follows the list below).
- Text normalization: We use the text cleaner at https://github.com/keithito/tacotron for pre-processing reference English text for ASR BLEU evaluation. The text cleaner used for Spanish text normalization will be updated here shortly.
- En ASR: We use the "Wav2Vec 2.0 Large (LV-60) + Self Training / 960 hours / Libri-Light + Librispeech" En ASR model open-sourced by the wav2vec project. The model is also available on Hugging Face.
- Es ASR: We use the Wav2Vec2-Large-XLSR-53-Spanish model finetuned on Spanish Common Voice, open sourced by jonatasgrosman on Hugging Face (https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-spanish).
- See instructions on how to run inference with a wav2vec-based ASR model.
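A minimal scoring sketch, assuming the generated waveforms have already been transcribed with the ASR models above and the same text normalization has been applied to both sides (${ASR_HYP} and ${REF_TEXT} are hypothetical one-sentence-per-line files in matching order):

```bash
# ASR-BLEU with sacreBLEU: the reference file is a positional argument and the
# ASR transcripts of the translated speech are passed via -i.
sacrebleu ${REF_TEXT} -i ${ASR_HYP} -m bleu
```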
Finetuned model checkpoints

ID | En - Es | Es - En |
---|---|---|
S2UT systems without pre-training | ||
S2UT with multitask | checkpoint | checkpoint |
S2UT systems with model pre-training | ||
w2v2-L | checkpoint | checkpoint |
w2v2-L + mBART (LNA-E) | checkpoint | checkpoint |
w2v2-L + mBART (LNA-D) | checkpoint | checkpoint |
w2v2-L + mBART (LNA-E,D) | checkpoint | checkpoint |
S2UT systems with model pre-training and data augmentation | ||
w2v2-L + mBART (LNA-D) | checkpoint | checkpoint |
Note: Some of the models use the speech_to_text_sharded task, which is not yet open sourced, so make sure to override the task to speech_to_text (as in the commands above) when using those models.