specified by the command `-d speech_translation`
The following is an example of preparing the CoVoST2 en-de dataset:

- Download the Common Voice audio clips and transcripts (English, Common Voice Corpus 4).

- Change the path in `prepare_data/prepare_covo.sh` (you can also change `src_lang` and `tgt_lang` to prepare data for other language pairs):

  ```bash
  covo_root="root directory of covost"
  src_lang=en
  tgt_lang=de
  ```

- Run the following script:

  ```bash
  cd downstream/speech_translation/prepare_data/
  bash prepare_covo.sh
  ```
- Check the prepared file structure (a quick sanity-check sketch follows this list):

  - processed data

    ```
    s3prl/
    └── data/
        └── covost_en_de/
            ├── train.tsv
            ├── dev.tsv
            ├── test.tsv
            ├── spm-[src|tgt]_text.[model|vocab|text]
            ├── config.yaml
            └── prepare_data.log
    ```

  - original tsv files

    ```
    s3prl/
    └── downstream/
        └── speech_translation/
            └── prepare_data/
                └── covost_tsv/
                    └── covost_v2.<src_lang>_<tgt_lang>.[train|dev|test].tsv
    ```
- [Optional] If you use another dataset or change the language/data config in `prepare_data/prepare_covo.sh`, you will also need to change the language/data config in `config.yaml`:

  ```yaml
  downstream_expert:
    src_lang: "other source language"
    tgt_lang: "other target language"
    taskrc:
      data: "other data directory"
  ```
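
A quick way to sanity-check the prepared tsv files is to load one with pandas. This is only a sketch: the full set of columns is defined by `prepare_covo.sh`, though `src_text` (transcript) and `tgt_text` (translation) match the naming used elsewhere in this config.

```python
# Sketch: inspect one of the prepared tsv files with pandas.
import pandas as pd

df = pd.read_csv("s3prl/data/covost_en_de/train.tsv", sep="\t")
print(df.columns.tolist())           # see the actual column headers
print(len(df), "training examples")
print(df.head())
```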
Details about preprocessing:

- For the text data, we do the following preprocessing (a minimal sketch follows this list):
  - transcript: lowercasing, removing punctuation except apostrophes and hyphens
  - translation: normalizing punctuation
    - the normalization is done with [alvations/sacremoses](https://github.com/alvations/sacremoses)
- We also remove noisy examples based on the lengths of the transcript and translation and the ratio between them, as well as all examples containing "REMOVE".
- For the tokenization, we create character-level dictionaries with [google/sentencepiece](https://github.com/google/sentencepiece) for the transcript and translation separately.
- For more details, you can check the files under `prepare_data/`.
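
The sketch below illustrates these steps with `sacremoses` and `sentencepiece`. It is an approximation for reference only: the exact cleaning rules and sentencepiece settings live in the scripts under `prepare_data/`, and the regex and `vocab_size` here are assumptions.

```python
# Sketch of the text preprocessing and char-level tokenization above.
import re

import sentencepiece as spm
from sacremoses import MosesPunctNormalizer

def clean_transcript(text: str) -> str:
    """Lowercase and drop punctuation except apostrophes and hyphens."""
    return re.sub(r"[^\w\s'\-]", "", text.lower())

# Punctuation normalization for the translation side.
normalizer = MosesPunctNormalizer(lang="de")

def clean_translation(text: str) -> str:
    return normalizer.normalize(text)

# Train a char-level sentencepiece model per side (producing files like
# spm-src_text.model / spm-src_text.vocab in the data directory).
spm.SentencePieceTrainer.train(
    input="src_text.txt",         # one cleaned sentence per line
    model_prefix="spm-src_text",
    model_type="char",
    vocab_size=100,               # illustrative; the real setting may differ
)
```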
Train the model:

```bash
python3 run_downstream.py -n ExpName -m train -u fbank -d speech_translation
```
- For the downstream model architecture, we delegate the configuration and creation to [pytorch/fairseq](https://github.com/pytorch/fairseq). You can adjust the model architecture directly at `downstream_expert/modelrc` in `config.yaml`. (For more configurations, please refer to pytorch/fairseq.)

  ```yaml
  downstream_expert:
    modelrc:
      # you can set the model architecture here
      arch: s2t_transformer
      # set other model configurations here
      max_source_positions: 6000
      max_target_positions: 1024
      encoder_layers: 3
      decoder_layers: 3
  ```
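
  If fairseq is installed, you can list its registered speech-to-text architectures to see valid values for `arch` (a small sketch using fairseq's model registry):

  ```python
  # Print the s2t_* architectures fairseq registers.
  from fairseq.models import ARCH_MODEL_REGISTRY

  print(sorted(a for a in ARCH_MODEL_REGISTRY if a.startswith("s2t_")))
  ```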
- We truncate the input wav to the maximum input size, which is `max_source_positions` * `upstream_rate` samples. For example, with `max_source_positions: 6000` and an upstream rate of 320 samples per frame, waveforms are truncated to 6000 * 320 = 1,920,000 samples (120 seconds of 16 kHz audio).
- We also support multitask learning with ASR. You can set `downstream_expert/taskrc/use_asr=True` in `config.yaml` to enable it. (Make sure you have transcripts in the training tsv file.) A sketch of what the loss weight means follows the config below.

  ```yaml
  downstream_expert:
    taskrc:
      use_asr: True
      asrrc:
        weight: 0.3 # the weight of ASR loss in [0, 1]
        datarc:
          key: src_text # header of transcript in tsv file
  ```
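
  Illustrative only: one common way to combine two task losses with a weight in [0, 1], which is how we read the `weight` option above. The actual combination is defined in the downstream expert code and may differ.

  ```python
  # Hypothetical sketch of an interpolated multitask loss.
  def multitask_loss(st_loss: float, asr_loss: float, weight: float = 0.3) -> float:
      # weight = 0 -> pure ST objective; weight = 1 -> pure ASR objective
      return (1.0 - weight) * st_loss + weight * asr_loss
  ```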
- You can downsample the upstream features to a certain upstream rate by setting `downstream_expert/upstream_rate` in `config.yaml`, with different choices of `downstream_expert/downsample_method` (see the sketch after the config below).

  ```yaml
  downstream_expert:
    upstream_rate: -1 # -1 for no downsampling, 320 for applying downsampling
    downsample_method: 'drop' # 'drop'/'concat'/'average'
  ```
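
  A minimal sketch of what the three methods do, assuming features of shape `(T, D)` and an integer downsampling factor `k` (the actual implementation is in the downstream expert):

  ```python
  import torch

  def downsample(feat: torch.Tensor, k: int, method: str) -> torch.Tensor:
      T, D = feat.shape
      feat = feat[: T - T % k]                  # trim ragged tail frames
      if method == "drop":                      # keep every k-th frame
          return feat[::k]
      if method == "concat":                    # stack k frames -> (T//k, k*D)
          return feat.reshape(-1, k * D)
      if method == "average":                   # mean over windows of k frames
          return feat.reshape(-1, k, D).mean(dim=1)
      raise ValueError(f"unknown method: {method}")
  ```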
- For other training configurations, including batch size, learning rate, etc., you can also change them in `config.yaml`.
Test the model:

```bash
python3 run_downstream.py -m evaluate -t test -e result/downstream/ExpName/dev-best.ckpt
```

You can change the beam size and maximum decoding length in `config.yaml`:

```yaml
downstream_expert:
  generatorrc:
    beam: 20
    max_len_a: 0
    max_len_b: 400
```
We report case-sensitive detokenized BLEU for ST using [mjpost/sacrebleu](https://github.com/mjpost/sacrebleu) and CER/WER for ASR using [roy-ht/editdistance](https://github.com/roy-ht/editdistance) when multitask learning is enabled. A small scoring sketch follows.
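
If you want to score decoding outputs yourself, sacrebleu can reproduce the case-sensitive detokenized BLEU; the hypotheses and references below are placeholders.

```python
# Minimal sketch: corpus-level BLEU with sacrebleu.
import sacrebleu

hyps = ["Das ist ein Test."]              # detokenized system outputs
refs = [["Das ist ein Test."]]            # one inner list per reference set
bleu = sacrebleu.corpus_bleu(hyps, refs)  # case-sensitive by default
print(bleu.score)
```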
The decoding results will be written to files named `<output_prefix>-[st|asr]-[dev|test].tsv`. You can change the prefix of the output files in `config.yaml`:

```yaml
downstream_expert:
  output_prefix: output # set prefix of output files
```