IWSLT 2022 Evaluation Campaign: Simultaneous Translation Baseline (English-to-Japanese Text-to-Text)
- Linux-based system
- The scripts were tested on Ubuntu 18.04 LTS but should also work on 20.04 LTS
- Bash
- Python >= 3.7.0
- (CUDA; not mandatory but highly recommended)
- PyTorch (the following command installs 1.10.1 built for CUDA 11.3)
$ pip3 install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
$ git clone --recursive https://github.com/ksudoh/IWSLT2022_simul_t2t_baseline_enja.git
$ cd IWSLT2022_simul_t2t_baseline_enja
$ pip3 install -r requirements.txt
$ pushd fairseq-wait-k/fairseq
$ python3 setup.py build_ext --inplace
$ popd
$ pushd fairseq-mma-il/fairseq
$ python3 setup.py build_ext --inplace
$ popd
$ pushd SimulEval
$ python3 setup.py install --prefix ./
$ popd
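Since CUDA is strongly recommended, it may help to verify that the installed PyTorch build can actually see a GPU before starting any training. The short check below is only an illustration and is not part of the repository:

import torch

print("PyTorch version:", torch.__version__)         # expect 1.10.1+cu113 with the command above
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))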
- Download MuST-C v2.0 and extract the package
- Suppose you put the extracted directory en-ja in /path/to/MuST-C/.
- (If needed) Download WMT21 En-Ja data and extract the packages
  - JParaCrawl
  - News Commentary v16
  - WikiTitles v3
  - WikiMatrix v1
  - JESC
  - KFTT
- Suppose you put the extracted files/directories in /path/to/WMT-train/.
- Data preparation helper modules for the datasets above are included in this repository
- The baseline system scripts use the following environment variables.
- WORKDIR specifies the directory used to store the data and models.
- You may change the setting of TMPDIR if you would like to use temporary space other than /tmp.
$ export SRC=en
$ export TRG=ja
$ export MUSTC_ROOT=/path/to/MuST-C
$ export WMT_DATA_ROOT=/path/to/WMT-train
$ export WORKDIR=/path/to/work
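A quick sanity check of these settings can catch path typos before the longer scripts run. The snippet below is only an illustration (not part of the baseline scripts); it assumes the variables above were exported in the same shell and relies on the MuST-C layout described earlier (en-ja under MUSTC_ROOT):

import os

for name in ("SRC", "TRG", "MUSTC_ROOT", "WMT_DATA_ROOT", "WORKDIR"):
    print(name, "=", os.environ.get(name))

# The MuST-C en-ja directory is expected directly under MUSTC_ROOT (see above).
mustc_pair = os.path.join(os.environ["MUSTC_ROOT"], "en-ja")
print("MuST-C en-ja directory exists:", os.path.isdir(mustc_pair))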
- The wrapper script 10-preprocess.sh performs the following preprocessing steps:
  - Extract bilingual sentences from the datasets
  - Train a unigram subword model with SentencePiece, shared across the two languages (illustrated in the sketch after the command below)
  - Tokenize the bilingual sentences
  - Binarize the training and development data with fairseq-preprocess
$ bash ./10-preprocess.sh
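For readers unfamiliar with the subword step, the sketch below shows how a shared unigram SentencePiece model is trained and applied in general. The file name, vocabulary size, and other options are placeholders; the settings actually used by the baseline are defined inside 10-preprocess.sh and are not reproduced here:

import sentencepiece as spm

# Train a shared unigram model on a file containing both English and Japanese
# sentences, one per line (file name and vocab_size are placeholders).
spm.SentencePieceTrainer.train(
    input="train.both-languages.txt",
    model_prefix="spm_unigram",
    model_type="unigram",
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
print(sp.encode("Simultaneous translation is hard.", out_type=str))
# e.g. ['▁Sim', 'ult', 'aneous', '▁translation', '▁is', '▁hard', '.']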
- Set the variables K and CUDA_VISIBLE_DEVICES. K is the wait-k value, i.e. how many source tokens the model reads before it starts writing (see the sketch after the commands below).
- You may use multiple GPUs, but the effective batch size grows accordingly.
$ env K=20 CUDA_VISIBLE_DEVICES=0 bash ./20-train-wait-k.sh
$ env K=20 CUDA_VISIBLE_DEVICES=0 bash ./21-test-wait-k.sh
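As a reminder of what K means: a wait-k model first reads k source tokens and then alternates between writing one target token and reading one more source token until the source is exhausted, after which it writes the rest of the translation. The toy function below only enumerates that READ/WRITE schedule; the real policy is implemented inside the fairseq-wait-k code:

def wait_k_actions(src_len, tgt_len, k):
    """Toy READ/WRITE schedule of a wait-k policy (illustration only)."""
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read < min(src_len, written + k):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

# wait-3 on a 6-token source and 6-token target:
print(wait_k_actions(6, 6, 3))
# ['READ', 'READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE',
#  'READ', 'WRITE', 'WRITE', 'WRITE']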
- Set the variable CUDA_VISIBLE_DEVICES.
- You may use multiple GPUs, but the effective batch size grows accordingly.
$ env CUDA_VISIBLE_DEVICES=0 bash ./30-train-mma-il.sh
$ env CUDA_VISIBLE_DEVICES=0 bash ./31-test-mma-il.sh
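The test scripts above (21-test-wait-k.sh and 31-test-mma-il.sh) are expected to run the models through SimulEval (installed earlier), which reports translation quality together with latency metrics such as Average Lagging (AL). The toy function below shows how AL is computed from the per-token delays g(t), the number of source tokens read before emitting target token t; it is for intuition only, since SimulEval computes the metric itself:

def average_lagging(delays, src_len, tgt_len):
    """Toy Average Lagging (AL) computation for intuition only.

    delays[t-1] = g(t): source tokens read before writing target token t.
    """
    gamma = tgt_len / src_len
    # tau: first target position whose delay already covers the whole source
    tau = next((t for t, g in enumerate(delays, start=1) if g >= src_len), tgt_len)
    return sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# Delays of the wait-3 schedule above; AL equals k when source and target lengths match.
print(average_lagging([3, 4, 5, 6, 6, 6], src_len=6, tgt_len=6))  # 3.0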