IWSLT 2022 Evaluation Campaign: Simultaneous Translation Baseline (English-to-Japanese Speech-to-Text)
- Linux-based system
- The scripts were tested with Ubuntu 18.04 LTS but should also work on 20.04 LTS
- Bash
- Python >= 3.7.0
- (CUDA; not mandatory but highly recommended)
- PyTorch (the following command installs 1.10.1 built for CUDA 11.3)
$ pip3 install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
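Optionally, you can verify the install before proceeding (a minimal check, not part of the official scripts; it prints the installed torch version and whether CUDA is visible):

```shell
# Sanity check: report the PyTorch version and CUDA availability.
python3 - <<'EOF'
try:
    import torch
    print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())
except ImportError:
    print("torch is not installed yet")
EOF
```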
$ git clone --recursive https://github.com/ksudoh/IWSLT2022_simul_s2t_baseline_enja.git
$ cd IWSLT2022_simul_s2t_baseline_enja
$ pip3 install -r requirements.txt
$ pushd fairseq
$ python3 setup.py build_ext --inplace
$ popd
$ pushd fairseq-mma-il/fairseq
$ python3 setup.py build_ext --inplace
$ popd
$ pushd SimulEval
$ python3 setup.py install --prefix ./
$ popd
- Download MuST-C v2.0 and extract the package
- Suppose you put the extracted directory `en-ja` in `/path/to/MuST-C/`.
- The baseline system scripts use the following environment variables.
- `WORKDIR` specifies the directory used to store the data and models.
- You may change the setting of `TMPDIR` if you would like to use a temporary space other than `/tmp`.
$ export SRC=en
$ export TRG=ja
$ export MUSTC_ROOT=/path/to/MuST-C
$ export WORKDIR=/path/to/work
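Before running the scripts, you may want to confirm the directory layout matches the variables above (a sketch; `check_mustc_layout` is a hypothetical helper, not part of the repository):

```shell
# Verify that the MuST-C en-ja directory exists and create the work directory.
check_mustc_layout() {
  # Assumes SRC, TRG, MUSTC_ROOT, and WORKDIR are exported as above.
  if [ ! -d "${MUSTC_ROOT}/${SRC}-${TRG}" ]; then
    echo "Missing ${MUSTC_ROOT}/${SRC}-${TRG}; check MUSTC_ROOT" >&2
    return 1
  fi
  mkdir -p "${WORKDIR}"
}
```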
- The wrapper script `10-preprocess.sh` conducts the required preprocessing
$ bash ./10-preprocess.sh
- The wrapper script `11-prepare-eval-data.sh` prepares the test data
$ bash ./11-prepare-eval-data.sh
- Set a variable `CUDA_VISIBLE_DEVICES`.
- You may use multiple GPUs, but the batch size becomes larger accordingly.
$ env CUDA_VISIBLE_DEVICES=0 bash ./20-train-pretraining.sh
- Set variables `K` and `CUDA_VISIBLE_DEVICES`.
- You may use multiple GPUs, but the batch size becomes larger accordingly.
$ env K=20 CUDA_VISIBLE_DEVICES=0 bash ./30-train-st-wait-k.sh
- SimulEval sometimes fails to establish the connection between the server and the client; in such a case, please terminate the process and re-run.
$ env K=20 CUDA_VISIBLE_DEVICES=0 bash ./31-test-wait-k.sh
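Since the SimulEval server-client connection can occasionally fail, the terminate-and-re-run step above can be automated with a small wrapper (a sketch; `run_with_retry` is a hypothetical helper, not part of the repository):

```shell
# Retry a command a few times before giving up.
run_with_retry() {
  max_tries=3
  for i in $(seq 1 "$max_tries"); do
    "$@" && return 0
    echo "Attempt $i failed; retrying..." >&2
  done
  return 1
}

# Example invocation (matching the command above):
# run_with_retry env K=20 CUDA_VISIBLE_DEVICES=0 bash ./31-test-wait-k.sh
```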
- Set a variable `CUDA_VISIBLE_DEVICES`.
- You may use multiple GPUs, but the batch size becomes larger accordingly.
$ env CUDA_VISIBLE_DEVICES=0 bash ./40-train-st-mma-il.sh