This directory contains scripts for preparing data, training, and evaluating a bilingual neural machine translation model.
Training the model requires an NVIDIA GPU, CUDA, and NCCL. For cloud-based training, it is recommended to use a virtual machine with pre-installed packages for convenience, such as the Google Cloud Deep Learning VM or Amazon Deep Learning Containers.

- Install fairseq:

      git clone https://github.com/pytorch/fairseq
      cd fairseq
      git checkout 920a548ca770fb1a951f7f4289b4d3a0c1bc226f
      pip install --editable ./

- Install SentencePiece:

      git clone https://github.com/google/sentencepiece.git
      cd sentencepiece
      git checkout d8f741853847553169444afc12c00f4bbff3e9ce
      mkdir build
      cd build
      cmake ..
      make -j $(nproc)
      sudo make install
      sudo ldconfig -v

- Install sacrebleu for evaluation:

      pip install sacrebleu

- Clone the repository:

      git clone https://github.com/josecols/seed-cat.git
      cd seed-cat/nmt

- Set up the train, valid, and test data directories.

  For instance, if you're training an English-Spanish model, you can use the Seed and FLORES+ datasets. Download the eng_Latn corpus from Seed and the latest release of the FLORES+ dataset. For FLORES+, use the dev split for validation and the devtest split for testing.

  Example directory structure:

      nmt/
      ├── train/
      │   ├── eng
      │   └── spa
      ├── valid/
      │   ├── eng
      │   └── spa
      └── test/
          ├── eng
          └── spa
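
Before running the preparation step, it can help to confirm that the layout is complete and that the two sides of each split are aligned. The following is a minimal sketch, not part of the repository. It assumes that `eng` and `spa` are plain-text files with one sentence per line; `check_parallel` is a hypothetical helper name, and the demo builds a throwaway tree with toy sentences so the snippet runs on its own.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: verify that every split has both language files
# and that the two files have the same number of lines.
# Assumption: each language file is plain text, one sentence per line.
check_parallel() {
  local root="$1" src="$2" tgt="$3"
  local split lang n_src n_tgt
  for split in train valid test; do
    for lang in "$src" "$tgt"; do
      if [ ! -f "$root/$split/$lang" ]; then
        echo "missing file: $root/$split/$lang" >&2
        return 1
      fi
    done
    # Arithmetic expansion strips the padding some wc implementations add.
    n_src=$(( $(wc -l < "$root/$split/$src") ))
    n_tgt=$(( $(wc -l < "$root/$split/$tgt") ))
    if [ "$n_src" -ne "$n_tgt" ]; then
      echo "$split: line count mismatch ($n_src vs $n_tgt)" >&2
      return 1
    fi
    echo "$split: $n_src sentence pairs"
  done
}

# Demo on a temporary tree with toy data.
demo_root=$(mktemp -d)
for split in train valid test; do
  mkdir -p "$demo_root/$split"
  printf 'Hello.\nGood morning.\n' > "$demo_root/$split/eng"
  printf 'Hola.\nBuenos días.\n' > "$demo_root/$split/spa"
done
check_parallel "$demo_root" eng spa
```

For real data, the same check would be `check_parallel . eng spa`, run from the nmt/ directory.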

Before preparing the data, configure the language pair and specify the path to the sentencepiece installation in the vars.sh script. You can also set a wandb project name to track training jobs and metrics.
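
As an illustration of the kind of settings involved, here is a hedged sketch of what such a configuration could look like. Every variable name below (`SRC_LANG`, `TGT_LANG`, `SPM_BIN_DIR`, `WANDB_PROJECT`) is an assumption made for this example; the actual vars.sh in the repository may use different names and should be treated as the reference.

```shell
#!/usr/bin/env bash
# Hypothetical configuration sketch: all variable names are assumptions
# for illustration; check the actual vars.sh for the names it uses.

# Language pair; assumed to match the file names under train/, valid/, test/.
SRC_LANG="eng"
TGT_LANG="spa"

# Assumed location of spm_train and spm_encode after the CMake build above.
SPM_BIN_DIR="$HOME/sentencepiece/build/src"

# Optional wandb project name; empty means no experiment tracking.
WANDB_PROJECT=""

export SRC_LANG TGT_LANG SPM_BIN_DIR WANDB_PROJECT
```

Keeping these values in one sourced file means the preparation, training, and evaluation scripts all agree on the language pair.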

The prepare.sh script trains a SentencePiece model using a combined vocabulary from the training data of both languages. This model is used to tokenize the text into sub-word units. Additionally, the script creates a fairseq dictionary and binarizes the data.

    bash prepare.sh
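
For readers who want to see what these steps look like on the command line, the following is a rough sketch under stated assumptions, not a copy of prepare.sh. The vocabulary size, the `prepared/` output directory, and the function names are hypothetical. It also lets `fairseq-preprocess` build a joined dictionary from the encoded text, which may differ from how prepare.sh derives its dictionary. The commands only run when the tools and the train/ directory are present.

```shell
#!/usr/bin/env bash
set -eu

SRC=eng; TGT=spa     # language pair (assumed to match the data file names)
VOCAB_SIZE=8000      # illustrative value, not taken from prepare.sh
OUT=prepared         # hypothetical output directory

# Train one joint SentencePiece model on both training files.
train_spm() {
  spm_train --input="train/$SRC,train/$TGT" \
            --model_prefix="$OUT/spm" \
            --vocab_size="$VOCAB_SIZE" \
            --character_coverage=1.0
}

# Tokenize every split and language into sub-word pieces.
encode_all() {
  local split lang
  for split in train valid test; do
    for lang in "$SRC" "$TGT"; do
      spm_encode --model="$OUT/spm.model" --output_format=piece \
        < "$split/$lang" > "$OUT/$split.$lang"
    done
  done
}

# Build a shared dictionary and binarize the encoded text for fairseq.
binarize() {
  fairseq-preprocess --source-lang "$SRC" --target-lang "$TGT" \
    --trainpref "$OUT/train" --validpref "$OUT/valid" --testpref "$OUT/test" \
    --joined-dictionary --destdir "$OUT/data-bin" --workers 4
}

if command -v spm_train >/dev/null 2>&1 \
   && command -v fairseq-preprocess >/dev/null 2>&1 \
   && [ -d train ]; then
  mkdir -p "$OUT"
  train_spm && encode_all && binarize
else
  echo "Skipping: spm_train, fairseq-preprocess, or train/ not found." >&2
fi
```

A joint model over both languages gives source and target the same sub-word inventory, which is what allows a single shared dictionary.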

You can configure the model architecture and training parameters in the train.sh script. This script trains a transformer model using fairseq-train and selects the best model checkpoint based on the validation BLEU score.

    bash train.sh
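
To make the BLEU-based checkpoint selection concrete, here is a hedged sketch of a `fairseq-train` invocation that validates with BLEU and keeps the best-scoring checkpoint. The hyperparameters, the `prepared/data-bin` and `checkpoints` paths, and the `train_model` function name are assumptions for illustration and are not taken from train.sh; flag availability can also vary with the fairseq commit in use.

```shell
#!/usr/bin/env bash
set -eu

DATA_BIN=prepared/data-bin   # hypothetical path to the binarized data
SAVE_DIR=checkpoints         # hypothetical checkpoint directory

train_model() {
  # Hyperparameters are illustrative defaults, not the values in train.sh.
  fairseq-train "$DATA_BIN" \
    --task translation --source-lang eng --target-lang spa \
    --arch transformer --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu --eval-bleu-args '{"beam": 5}' \
    --eval-bleu-remove-bpe sentencepiece \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --save-dir "$SAVE_DIR"
}

# Only start training when fairseq and the binarized data are available.
if command -v fairseq-train >/dev/null 2>&1 && [ -d "$DATA_BIN" ]; then
  train_model
else
  echo "Skipping: fairseq-train or $DATA_BIN not found." >&2
fi
```

The pair `--best-checkpoint-metric bleu --maximize-best-checkpoint-metric` is what switches checkpoint selection from lowest validation loss to highest validation BLEU.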

The eval.sh script generates translation hypotheses for the test set and evaluates them using sacrebleu. The default metric reported is chrF.

    bash eval.sh
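
The scoring step can also be tried in isolation. The sketch below is an assumption-laden example rather than an excerpt from eval.sh: it writes two made-up sentences to temporary hypothesis and reference files and, if the `sacrebleu` command is installed, reports chrF for them. The file names are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Toy hypothesis and reference files, one sentence per line (made-up data).
work=$(mktemp -d)
printf 'Hola mundo.\nBuenos días a todos.\n' > "$work/hyp.spa"
printf 'Hola, mundo.\nBuenos días a todos.\n' > "$work/ref.spa"

if command -v sacrebleu >/dev/null 2>&1; then
  # Positional argument: reference file; -i: system output; -m: metric.
  sacrebleu "$work/ref.spa" -i "$work/hyp.spa" -m chrf
else
  echo "sacrebleu not found; install it with: pip install sacrebleu" >&2
fi
```

Hypotheses and references must be detokenized and line-aligned before scoring, since sacrebleu compares them line by line.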