This is the repository for the EMNLP 2020 paper "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information".
[paper]
mRASP, short for multilingual Random Aligned Substitution Pre-training, is a pre-trained multilingual neural machine translation model. mRASP is pre-trained on a large-scale multilingual corpus containing 32 language pairs. The resulting model can be further fine-tuned on downstream language pairs. To bring words and phrases with similar meanings closer together in the representation space across languages, we introduce the Random Aligned Substitution (RAS) technique. Extensive experiments in different scenarios demonstrate the efficacy of mRASP. For details, please refer to the paper.
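As a rough illustration of the RAS idea, the sketch below replaces source-side words with dictionary translations at random. It is not the repository's `replace_word.py`: the toy dictionary, the `ras_substitute` function name, and the replacement probability are all assumptions made for this example.

```python
import random

# Hypothetical toy dictionary: English word -> translations in other languages.
TOY_DICT = {
    "love": ["aime", "liebe"],
    "you": ["vous", "dich"],
}


def ras_substitute(tokens, dictionary, prob, rng):
    """Replace each token found in `dictionary` by a random translation with probability `prob`."""
    out = []
    for tok in tokens:
        candidates = dictionary.get(tok.lower())
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))  # code-switch this word
        else:
            out.append(tok)  # keep the original word
    return out


rng = random.Random(0)  # fixed seed so the example is reproducible
source = "I love you".split()
substituted = ras_substitute(source, TOY_DICT, prob=0.5, rng=rng)
print(" ".join(substituted))
```

The substituted sentence is paired with the unchanged target sentence, so the model sees words from different languages in the same context.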
.
├── experiments # Example files: including configs and data
├── preprocess # The preprocess step
│ ├── tools/
│ │ ├── __init__.py
│ │ ├── common.sh
│ │ ├── data_preprocess/ # clean + tokenize
│ │ │ ├── __init__.py
│ │ │ ├── clean_scripts/
│ │ │ ├── tokenize_scripts/
│ │ │ ├── clean_each.sh
│ │ │ ├── prep_each.sh
│ │ │ ├── prep_mono.sh # preprocess a monolingual corpus
│ │ │ ├── prep_parallel.sh # preprocess a parallel corpus
│ │ │ └── tokenize_each.sh
│ │ ├── misc/
│ │ │ ├── __init__.py
│ │ │ ├── multilingual_preprocess_yml_generator.py
│ │ │ └── multiprocess.sh
│ │ ├── ras/
│ │ │ ├── __init__.py
│ │ │ ├── random_alignment_substitution.sh
│ │ │ ├── random_alignment_substitution_w_multi.sh
│ │ │ ├── replace_word.py # RAS using MUSE bilingual dict
│ │ │ └── replace_word_w_multi.py # RAS using multi-way parallel dict
│ │ └── subword/
│ │   ├── __init__.py
│ │   ├── multilingual_apply_subword_vocab.sh # script to only apply subword (w/o learning new vocab)
│ │   ├── multilingual_learn_apply_subword_vocab_joint.sh # script to learn new vocab and apply subword
│ │   └── scripts/
│ ├── __init__.py
│ ├── multilingual_merge.sh # script to merge multiple parallel dataset
│ ├── multilingual_preprocess_main.sh # main entry for preprocess
│ └── README.md
├── train
│ ├── __init__.py
│ ├── misc/
│ │ ├── load_config.sh
│ │ └── monitor.sh # script to monitor the generation of checkpoint and evaluate them
│ ├── scripts/
│ │ ├── __init__.py
│ │ ├── average_checkpoints_from_file.py
│ │ ├── average_ckpt.sh # checkpoint average
│ │ ├── common_scripts.sh
│ │ ├── get_worst_ckpt.py
│ │ ├── keep_top_ckpt.py
│ │ ├── remove_bpe.py
│ │ └── rerank_utils.py
│ ├── pre-train.sh # main entry for pre-train
│ ├── fine-tune.sh # main entry for fine-tune
│ └── README.md
├── requirements.txt
└── README.md
pip install -r requirements.txt
The pipeline contains two steps: Pre-train and Fine-tune. We first pre-train our model on multiple language pairs jointly. Then we further fine-tune on downstream language pairs.
The preprocess pipeline is composed of the following 4 separate steps:
- Data filtering and cleaning
- Tokenization
- Learn / Apply joint BPE sub-word vocabulary
- Random Alignment Substitution (optional, only valid for train set)
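To make the first step concrete, here is a minimal sketch of a parallel-corpus filter. The rules (drop empty lines, drop overly long sentences, drop pairs with an extreme length ratio) and the thresholds are common heuristics assumed for illustration; the repository's own rules live in `clean_scripts/` and may differ.

```python
def keep_pair(src, tgt, max_len=250, max_ratio=3.0):
    """Return True if a sentence pair passes simple, assumed filtering heuristics."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # one side is empty
    if src_len > max_len or tgt_len > max_len:
        return False  # overly long sentence
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio  # reject badly mismatched lengths


pairs = [
    ("Hello world .", "Hallo Welt ."),
    ("", "Leere Quelle"),
    ("Yes", "Ja , das ist eine sehr lange Antwort"),
]
cleaned = [(s, t) for s, t in pairs if keep_pair(s, t)]
print(cleaned)  # only the first pair survives
```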
We provide a script to run all the above steps in one command:
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${config_yaml_file}
step1: preprocess train data and learn a joint BPE subword vocabulary across all languages.
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train.yml
The command above runs cleaning, subword segmentation, merging, and RAS in sequence. The result is a BPE vocabulary and a RAS-processed multilingual dataset merged from multiple language pairs.
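The joint BPE vocabulary is learned over text from all languages at once. The toy sketch below shows the merge-learning loop of BPE on a tiny mixed-language word list. It is a simplified illustration rather than the script in `subword/`, and the word counts and the number of merges are made up.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} mapping."""
    # Each word becomes a tuple of symbols ending with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges


# Words from several languages share one count table, so merges are learned jointly.
joint_counts = {"nation": 5, "national": 3, "nación": 2, "nationale": 2}
merges = learn_bpe(joint_counts, num_merges=5)
print(merges)
```

Because frequent character sequences are shared across languages, a joint vocabulary tends to segment related words into the same subword units.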
step2: preprocess development data
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/dev.yml
We create a multilingual development set to help choose the best pre-trained checkpoint.
step3: binarize data
bash ${PROJECT_ROOT}/experiments/example/bin_pretrain.sh
step4: pre-train on RASed multilingual corpus
export CUDA_VISIBLE_DEVICES=0,1,2,3 && bash ${PROJECT_ROOT}/train/pre-train.sh ${PROJECT_ROOT}/experiments/example/configs/train/pre-train/transformer_big.yml
You can modify the configs to choose the model architecture or dataset used.
step1: preprocess train/test data
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/train_en2de.yml
bash ${PROJECT_ROOT}/preprocess/multilingual_preprocess_main.sh ${PROJECT_ROOT}/experiments/example/configs/preprocess/test_en2de.yml
The commands above perform cleaning and subword segmentation.
step2: binarize data
bash ${PROJECT_ROOT}/experiments/example/bin_finetune.sh
step3: fine-tune on specific language pairs
export CUDA_VISIBLE_DEVICES=0,1,2 && export EVAL_GPU_INDEX=${eval_gpu_index} && bash ${PROJECT_ROOT}/train/fine-tune.sh ${PROJECT_ROOT}/experiments/example/configs/train/fine-tune/en2de_transformer_big.yml ${PROJECT_ROOT}/experiments/example/configs/eval/en2de_eval.yml
`eval_gpu_index` is the index of the GPU on your machine that will be allocated to evaluate the model. If you set it to `-1`, the CPU will be used for evaluation during training.
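The convention for `eval_gpu_index` can be summarized in a few lines. This is a sketch of the described behavior, not code from the repository; `resolve_eval_device` is a hypothetical helper, and the `cuda:N` naming is borrowed from PyTorch device strings as an assumption.

```python
def resolve_eval_device(eval_gpu_index):
    """Map an EVAL_GPU_INDEX value to a device string: -1 means CPU, otherwise a GPU index."""
    index = int(eval_gpu_index)  # environment variables arrive as strings
    if index == -1:
        return "cpu"
    if index < -1:
        raise ValueError(f"invalid eval_gpu_index: {index}")
    return f"cuda:{index}"


print(resolve_eval_device("-1"))  # cpu
print(resolve_eval_device("3"))   # cuda:3
```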
We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original corpus of 32 language pairs contains about 197M sentence pairs. After applying RAS we obtain about 262M sentence pairs, since we keep both the original sentences and the substituted sentences. We release both the original dataset and the dataset after applying RAS.
Dataset | #Pair |
---|---|
32-lang-pairs-TRAIN | 197603294 |
32-lang-pairs-RAS-TRAIN | 262662792 |
32-lang-pairs-DEV | 156587 |
Vocab | - |
BPE Code | - |
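The figures above can be checked with a little arithmetic. The sketch below uses only the numbers from the table; reading the difference as the number of substituted sentence pairs added by RAS is an inference from the statement that both original and substituted sentences are kept.

```python
num_language_pairs = 32
original_pairs = 197_603_294   # 32-lang-pairs-TRAIN
ras_pairs = 262_662_792        # 32-lang-pairs-RAS-TRAIN

# Each English-centric pair is used in both directions (En->X and X->En).
directed_pairs = 2 * num_language_pairs

# Sentence pairs present only in the RAS version of the corpus.
added_by_ras = ras_pairs - original_pairs
growth = ras_pairs / original_pairs

print(directed_pairs)      # 64
print(added_by_ras)        # 65059498
print(f"{growth:.2f}x")    # 1.33x
```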
We release checkpoints trained on 32-lang-pairs and 32-lang-pairs-RAS. We also extend our model to 58 language pairs.
Dataset | Checkpoint |
---|---|
32-lang-pairs | 32-lang-pairs-ckp |
32-lang-pairs-RAS | 32-lang-pairs-RAS-ckp |
58-lang-pairs-RAS | - |
We release En-Ro, En2De and En2Fr benchmark checkpoints and the corresponding configs.
Lang-Pair | Datasource | Checkpoints | Configs | tok-BLEU | detok-BLEU |
---|---|---|---|---|---|
En2Ro | WMT16 En-Ro | en2ro | en2ro_config | 39.0 | 37.6 |
Ro2En | WMT16 Ro-En | ro2en | ro2en_config | 37.7 | 36.9 |
En2De | WMT16 En-De | en2de | en2de_config | 30.3 | - |
En2Fr | WMT14 En-Fr | en2fr | en2fr_config | 44.3 | - |
mBART is a pre-trained model trained on large-scale multilingual corpora. To illustrate the superiority of mRASP, we also compare our results with mBART. We choose different scales of language pairs and use the same test sets as mBART.
Lang-pairs | Size | Direction | Datasource | Testset | Checkpoint | mBART | mRASP |
---|---|---|---|---|---|---|---|
En-Gu | 10K | ⟶ | en_gu_train | newstest19 | en2gu | 0.1 | 3.2 |
En-Gu | 10K | ⟵ | en_gu_train | newstest19 | gu2en | 0.3 | 0.6 |
En-Kk | 128K | ⟶ | en_kk_train | newstest19 | en2kk | 2.5 | 8.2 |
En-Kk | 128K | ⟵ | en_kk_train | newstest19 | kk2en | 7.4 | 12.3 |
En-Tr | 388K | ⟶ | en_tr_train | newstest17 | en2tr | 17.8 | 20.0 |
En-Tr | 388K | ⟵ | en_tr_train | newstest17 | tr2en | 22.5 | 23.4 |
En-Et | 2.3M | ⟶ | en_et_train | newstest18 | en2et | 21.4 | 20.9 |
En-Et | 2.3M | ⟵ | en_et_train | newstest18 | et2en | 27.8 | 26.8 |
En-Fi | 4M | ⟶ | en_fi_train | newstest17 | en2fi | 22.4 | 24.0 |
En-Fi | 4M | ⟵ | en_fi_train | newstest17 | fi2en | 28.5 | 28.0 |
En-Lv | 5.5M | ⟶ | en_lv_train | newstest17 | en2lv | 15.9 | 21.6 |
En-Lv | 5.5M | ⟵ | en_lv_train | newstest17 | lv2en | 19.3 | 24.4 |
En-Cs | 978K | ⟶ | en_cs_train | newstest19 | en2cs | 18.0 | 19.9 |
En-De | 4.5M | ⟶ | en_de_train | newstest19 | en2de | 30.5 | 35.2 |
En-Fr | 40M | ⟶ | en_fr_train | newstest14 | en2fr | 41.0 | 44.3 |
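For a quick summary of the table above, the sketch below counts the directions in which mRASP scores higher and averages the BLEU difference. The scores are copied from the table; the unweighted average over directions is a crude summary added here for illustration and is not a figure reported by the authors.

```python
# (mBART, mRASP) BLEU scores copied from the comparison table.
scores = {
    "en2gu": (0.1, 3.2), "gu2en": (0.3, 0.6),
    "en2kk": (2.5, 8.2), "kk2en": (7.4, 12.3),
    "en2tr": (17.8, 20.0), "tr2en": (22.5, 23.4),
    "en2et": (21.4, 20.9), "et2en": (27.8, 26.8),
    "en2fi": (22.4, 24.0), "fi2en": (28.5, 28.0),
    "en2lv": (15.9, 21.6), "lv2en": (19.3, 24.4),
    "en2cs": (18.0, 19.9),
    "en2de": (30.5, 35.2),
    "en2fr": (41.0, 44.3),
}

# Per-direction difference, rounded to the table's one-decimal precision.
diffs = {name: round(mrasp - mbart, 1) for name, (mbart, mrasp) in scores.items()}
wins = sum(1 for d in diffs.values() if d > 0)
mean_gain = sum(diffs.values()) / len(diffs)

print(f"mRASP is higher in {wins} of {len(diffs)} directions")  # 12 of 15
print(f"mean BLEU difference: {mean_gain:+.2f}")                # +2.49
```

mBART scores higher in three directions (en2et, et2en, and fi2en).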
If you are interested in mRASP, please consider citing our paper:
@inproceedings{lin-etal-2020-pre,
title = "Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information",
author = "Lin, Zehui and
Pan, Xiao and
Wang, Mingxuan and
Qiu, Xipeng and
Feng, Jiangtao and
Zhou, Hao and
Li, Lei",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.210",
pages = "2649--2663",
}