specified by the command `-d speech_translation`
The following is an example of preparing the CoVoST2 en-de dataset:

- Download the Common Voice audio clips and transcripts (English, Common Voice Corpus 4).

- Change the path in `prepare_data/prepare_covo.sh` (you can also change `src_lang` and `tgt_lang` to prepare data for other language pairs):

  ```bash
  covo_root="root directory of covost"
  src_lang=en
  tgt_lang=de
  ```

- Run the following script:

  ```bash
  cd downstream/speech_translation/prepare_data/
  bash prepare_covo.sh
  ```
- Check the prepared file structure (a quick sanity-check sketch follows this list):

  - processed data

    ```
    s3prl/
    └── data/
        └── covost_en_de/
            ├── train.tsv
            ├── dev.tsv
            ├── test.tsv
            ├── spm-[src|tgt]_text.[model|vocab|text]
            ├── config.yaml
            └── prepare_data.log
    ```

  - original tsv files

    ```
    s3prl/
    └── downstream/
        └── speech_translation/
            └── prepare_data/
                └── covost_tsv/
                    └── covost_v2.<src_lang>_<tgt_lang>.[train|dev|test].tsv
    ```
- [Optional] If you use another dataset or change the language/data config in `prepare_data/prepare_covo.sh`, you will also need to change the language/data config in `config.yaml`:

  ```yaml
  downstream_expert:
    src_lang: "other source language"
    tgt_lang: "other target language"
    taskrc:
      data: "other data directory"
  ```
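
A quick way to sanity-check the prepared tsv files is to load one with pandas. This is only a sketch: the full set of columns is defined by `prepare_covo.sh`, though `src_text` (transcript) and `tgt_text` (translation) match the naming used elsewhere in this config.

```python
# Sketch: inspect one of the prepared tsv files with pandas.
import pandas as pd

df = pd.read_csv("s3prl/data/covost_en_de/train.tsv", sep="\t")
print(df.columns.tolist())           # see the actual column headers
print(len(df), "training examples")
print(df.head())
```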
Details about preprocessing:

- For the text data, we do the following preprocessing (a minimal sketch follows this list):
  - transcript: lowercasing, removing punctuation except apostrophes and hyphens
  - translation: normalizing punctuation
    - the normalization is done with [alvations/sacremoses](https://github.com/alvations/sacremoses)
- We also remove noisy examples based on the lengths of the transcript and translation and the ratio between them, as well as all examples containing "REMOVE".
- For the tokenization, we create character-level dictionaries with [google/sentencepiece](https://github.com/google/sentencepiece) for the transcript and translation separately.
- For more details, you can check the files under `prepare_data/`.
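
The sketch below illustrates these steps with `sacremoses` and `sentencepiece`. It is an approximation for reference only: the exact cleaning rules and sentencepiece settings live in the scripts under `prepare_data/`, and the regex and `vocab_size` here are assumptions.

```python
# Sketch of the text preprocessing and char-level tokenization above.
import re

import sentencepiece as spm
from sacremoses import MosesPunctNormalizer

def clean_transcript(text: str) -> str:
    """Lowercase and drop punctuation except apostrophes and hyphens."""
    return re.sub(r"[^\w\s'\-]", "", text.lower())

# Punctuation normalization for the translation side.
normalizer = MosesPunctNormalizer(lang="de")

def clean_translation(text: str) -> str:
    return normalizer.normalize(text)

# Train a char-level sentencepiece model per side (producing files like
# spm-src_text.model / spm-src_text.vocab in the data directory).
spm.SentencePieceTrainer.train(
    input="src_text.txt",         # one cleaned sentence per line
    model_prefix="spm-src_text",
    model_type="char",
    vocab_size=100,               # illustrative; the real setting may differ
)
```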
Train the model:

```bash
python3 run_downstream.py -n ExpName -m train -u fbank -d speech_translation
```
- For the downstream model architecture, we delegate the configuration and creation to [pytorch/fairseq](https://github.com/pytorch/fairseq). You can adjust the model architecture directly at `downstream_expert/modelrc` in `config.yaml`. (For more configurations, please refer to pytorch/fairseq.)

  ```yaml
  downstream_expert:
    modelrc:
      # you can set the model architecture here
      arch: s2t_transformer
      # set other model configurations here
      max_source_positions: 6000
      max_target_positions: 1024
      encoder_layers: 3
      decoder_layers: 3
  ```
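
  If fairseq is installed, you can list its registered speech-to-text architectures to see valid values for `arch` (a small sketch using fairseq's model registry):

  ```python
  # Print the s2t_* architectures fairseq registers.
  from fairseq.models import ARCH_MODEL_REGISTRY

  print(sorted(a for a in ARCH_MODEL_REGISTRY if a.startswith("s2t_")))
  ```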
- We truncate the input wav to the maximum input size, which is `max_source_positions` * `upstream_rate` samples. For example, with `max_source_positions: 6000` and an upstream rate of 320 samples per frame, waveforms are truncated to 6000 * 320 = 1,920,000 samples (120 seconds of 16 kHz audio).
- We also support multitask learning with ASR. You can set `downstream_expert/taskrc/use_asr=True` in `config.yaml` to enable it. (Make sure you have transcripts in the training tsv file.) A sketch of what the loss weight means follows the config below.

  ```yaml
  downstream_expert:
    taskrc:
      use_asr: True
      asrrc:
        weight: 0.3 # the weight of ASR loss in [0, 1]
        datarc:
          key: src_text # header of transcript in tsv file
  ```
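
  Illustrative only: one common way to combine two task losses with a weight in [0, 1], which is how we read the `weight` option above. The actual combination is defined in the downstream expert code and may differ.

  ```python
  # Hypothetical sketch of an interpolated multitask loss.
  def multitask_loss(st_loss: float, asr_loss: float, weight: float = 0.3) -> float:
      # weight = 0 -> pure ST objective; weight = 1 -> pure ASR objective
      return (1.0 - weight) * st_loss + weight * asr_loss
  ```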
- You can downsample the upstream features to a certain upstream rate by setting `downstream_expert/upstream_rate` in `config.yaml`, with different choices of `downstream_expert/downsample_method` (see the sketch after the config below).

  ```yaml
  downstream_expert:
    upstream_rate: -1 # -1 for no downsampling, 320 for applying downsampling
    downsample_method: 'drop' # 'drop'/'concat'/'average'
  ```
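
  A minimal sketch of what the three methods do, assuming features of shape `(T, D)` and an integer downsampling factor `k` (the actual implementation is in the downstream expert):

  ```python
  import torch

  def downsample(feat: torch.Tensor, k: int, method: str) -> torch.Tensor:
      T, D = feat.shape
      feat = feat[: T - T % k]                  # trim ragged tail frames
      if method == "drop":                      # keep every k-th frame
          return feat[::k]
      if method == "concat":                    # stack k frames -> (T//k, k*D)
          return feat.reshape(-1, k * D)
      if method == "average":                   # mean over windows of k frames
          return feat.reshape(-1, k, D).mean(dim=1)
      raise ValueError(f"unknown method: {method}")
  ```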
- For other training configurations, including batch size, learning rate, etc., you can also change them in `config.yaml`.
Test the model:

```bash
python3 run_downstream.py -m evaluate -t test -e result/downstream/ExpName/dev-best.ckpt
```

You can change the beam size and maximum decoding length in `config.yaml`:

```yaml
downstream_expert:
  generatorrc:
    beam: 20
    max_len_a: 0
    max_len_b: 400
```
We report case-sensitive detokenized BLEU for ST using [mjpost/sacrebleu](https://github.com/mjpost/sacrebleu) and CER/WER for ASR using [roy-ht/editdistance](https://github.com/roy-ht/editdistance) when multitask learning is enabled. A small scoring sketch follows.
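
If you want to score decoding outputs yourself, sacrebleu can reproduce the case-sensitive detokenized BLEU; the hypotheses and references below are placeholders.

```python
# Minimal sketch: corpus-level BLEU with sacrebleu.
import sacrebleu

hyps = ["Das ist ein Test."]              # detokenized system outputs
refs = [["Das ist ein Test."]]            # one inner list per reference set
bleu = sacrebleu.corpus_bleu(hyps, refs)  # case-sensitive by default
print(bleu.score)
```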
The decoding results will be written to files named `<output_prefix>-[st|asr]-[dev|test].tsv`. You can change the prefix of the output files in `config.yaml`:

```yaml
downstream_expert:
  output_prefix: output # set prefix of output files
```