IWSLT 2022 Evaluation Campaign: Simultaneous Translation Baseline (English-to-Japanese Text-to-Text)
- Linux-based system
- The scripts were tested on Ubuntu 18.04 LTS but should also work on 20.04 LTS
- Bash
- Python >= 3.7.0
- (CUDA; not mandatory but highly recommended)
- PyTorch (the following command installs 1.10.1 built for CUDA 11.3)
$ pip3 install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
$ git clone --recursive https://github.com/ksudoh/IWSLT2022_simul_t2t_baseline_enja.git
$ cd IWSLT2022_simul_t2t_baseline_enja
$ pip3 install -r requirements.txt
$ pushd fairseq-wait-k/fairseq
$ python3 setup.py build_ext --inplace
$ popd
$ pushd fairseq-mma-il/fairseq
$ python3 setup.py build_ext --inplace
$ popd
$ pushd SimulEval
$ python3 setup.py install --prefix ./
$ popd
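Since CUDA is strongly recommended, it may help to verify that the installed PyTorch build can actually see a GPU before starting any training. The short check below is only an illustration and is not part of the repository:

import torch

print("PyTorch version:", torch.__version__)         # expect 1.10.1+cu113 with the command above
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))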
- Download MuST-C v2.0 and extract the package
- Suppose you put the extracted directory en-ja in /path/to/MuST-C/.
- (If needed) Download WMT21 En-Ja data and extract the packages
  - JParaCrawl
  - News Commentary v16
  - WikiTitles v3
  - WikiMatrix v1
  - JESC
  - KFTT
- Suppose you put the extracted files/directories in /path/to/WMT-train/.
- Data preparation helper modules for the datasets above are included in this repository
- The baseline system scripts use the following environment variables.
- WORKDIR specifies the directory used to store the data and models.
- You may change the setting of TMPDIR if you would like to use temporary space other than /tmp.
$ export SRC=en
$ export TRG=ja
$ export MUSTC_ROOT=/path/to/MuST-C
$ export WMT_DATA_ROOT=/path/to/WMT-train
$ export WORKDIR=/path/to/work
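A quick sanity check of these settings can catch path typos before the longer scripts run. The snippet below is only an illustration (not part of the baseline scripts); it assumes the variables above were exported in the same shell and relies on the MuST-C layout described earlier (en-ja under MUSTC_ROOT):

import os

for name in ("SRC", "TRG", "MUSTC_ROOT", "WMT_DATA_ROOT", "WORKDIR"):
    print(name, "=", os.environ.get(name))

# The MuST-C en-ja directory is expected directly under MUSTC_ROOT (see above).
mustc_pair = os.path.join(os.environ["MUSTC_ROOT"], "en-ja")
print("MuST-C en-ja directory exists:", os.path.isdir(mustc_pair))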
- The wrapper script 10-preprocess.sh performs the following preprocessing steps:
  - Extract bilingual sentences from the datasets
  - Train a unigram subword model with SentencePiece, shared across the two languages (illustrated in the sketch after the command below)
  - Tokenize the bilingual sentences
  - Binarize the training and development data with fairseq-preprocess
$ bash ./10-preprocess.sh
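For readers unfamiliar with the subword step, the sketch below shows how a shared unigram SentencePiece model is trained and applied in general. The file name, vocabulary size, and other options are placeholders; the settings actually used by the baseline are defined inside 10-preprocess.sh and are not reproduced here:

import sentencepiece as spm

# Train a shared unigram model on a file containing both English and Japanese
# sentences, one per line (file name and vocab_size are placeholders).
spm.SentencePieceTrainer.train(
    input="train.both-languages.txt",
    model_prefix="spm_unigram",
    model_type="unigram",
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
print(sp.encode("Simultaneous translation is hard.", out_type=str))
# e.g. ['▁Sim', 'ult', 'aneous', '▁translation', '▁is', '▁hard', '.']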
- Set the variables K and CUDA_VISIBLE_DEVICES. K is the wait-k value, i.e. how many source tokens the model reads before it starts writing (see the sketch after the commands below).
- You may use multiple GPUs, but the effective batch size grows accordingly.
$ env K=20 CUDA_VISIBLE_DEVICES=0 bash ./20-train-wait-k.sh
$ env K=20 CUDA_VISIBLE_DEVICES=0 bash ./21-test-wait-k.sh
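As a reminder of what K means: a wait-k model first reads k source tokens and then alternates between writing one target token and reading one more source token until the source is exhausted, after which it writes the rest of the translation. The toy function below only enumerates that READ/WRITE schedule; the real policy is implemented inside the fairseq-wait-k code:

def wait_k_actions(src_len, tgt_len, k):
    """Toy READ/WRITE schedule of a wait-k policy (illustration only)."""
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read < min(src_len, written + k):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

# wait-3 on a 6-token source and 6-token target:
print(wait_k_actions(6, 6, 3))
# ['READ', 'READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE',
#  'READ', 'WRITE', 'WRITE', 'WRITE']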
- Set the variable CUDA_VISIBLE_DEVICES.
- You may use multiple GPUs, but the effective batch size grows accordingly.
$ env CUDA_VISIBLE_DEVICES=0 bash ./30-train-mma-il.sh
$ env CUDA_VISIBLE_DEVICES=0 bash ./31-test-mma-il.sh
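The test scripts above (21-test-wait-k.sh and 31-test-mma-il.sh) are expected to run the models through SimulEval (installed earlier), which reports translation quality together with latency metrics such as Average Lagging (AL). The toy function below shows how AL is computed from the per-token delays g(t), the number of source tokens read before emitting target token t; it is for intuition only, since SimulEval computes the metric itself:

def average_lagging(delays, src_len, tgt_len):
    """Toy Average Lagging (AL) computation for intuition only.

    delays[t-1] = g(t): source tokens read before writing target token t.
    """
    gamma = tgt_len / src_len
    # tau: first target position whose delay already covers the whole source
    tau = next((t for t, g in enumerate(delays, start=1) if g >= src_len), tgt_len)
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Delays of the wait-3 schedule above; AL equals k when source and target lengths match.
print(average_lagging([3, 4, 5, 6, 6, 6], src_len=6, tgt_len=6))  # 3.0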