Training a Machine Translation Model

This directory contains scripts for preparing data, training, and evaluating a bilingual neural machine translation model.

Getting Started

Training the model requires an NVIDIA GPU, CUDA, and NCCL. For cloud-based training, an environment with these packages pre-installed is the most convenient option, for example the Google Cloud Deep Learning VM or Amazon Deep Learning Containers.

Prerequisites

Install fairseq:

```
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 920a548ca770fb1a951f7f4289b4d3a0c1bc226f
pip install --editable ./
```

Install SentencePiece:

```
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
git checkout d8f741853847553169444afc12c00f4bbff3e9ce
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v
```

Install sacrebleu for evaluation:
```
pip install sacrebleu
```

Installation

Clone the repository:

```
git clone https://github.com/josecols/seed-cat.git
cd seed-cat/nmt
```

Set up the train, valid, and test data directories.

For instance, if you're training an English-Spanish model, you can use the Seed and FLORES+ datasets. Download the eng_Latn corpus from Seed and the latest release of the FLORES+ dataset. For FLORES+, use the dev split for validation and the devtest split for testing.

Example directory structure:
```
nmt/
├── train/
│   ├── eng
│   └── spa
├── valid/
│   ├── eng
│   └── spa
└── test/
    ├── eng
    └── spa
```
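
Before running the pipeline, it can help to confirm that each split is actually parallel. The following is a minimal sketch, not part of the repository: the `check_split` helper is hypothetical, and it assumes each language entry (for example `train/eng`) is a plain-text file with one sentence per line.

```shell
# Hypothetical helper (not shipped with the repository).
# Assumes <split>/<lang> is a plain-text file, one sentence per line.
check_split() {
  split="$1"; src="$2"; tgt="$3"

  # Both sides of the language pair must exist.
  for lang in "$src" "$tgt"; do
    if [ ! -f "$split/$lang" ]; then
      echo "missing: $split/$lang" >&2
      return 1
    fi
  done

  # Parallel data must have the same number of lines on both sides.
  n_src=$(($(wc -l < "$split/$src")))
  n_tgt=$(($(wc -l < "$split/$tgt")))
  if [ "$n_src" -ne "$n_tgt" ]; then
    echo "line count mismatch in $split: $src=$n_src $tgt=$n_tgt" >&2
    return 1
  fi
  echo "$split: $n_src sentence pairs"
}

# Usage, from the nmt/ directory:
#   for s in train valid test; do check_split "$s" eng spa || break; done
```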

Usage

Before preparing the data, edit the vars.sh script to set the language pair and the path to the SentencePiece installation. You can also set a wandb (Weights & Biases) project name to track training jobs and metrics.
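
As a rough illustration of the kind of settings involved, a configuration for the English-Spanish example might look like the sketch below. Every variable name here is an assumption made for illustration; the real vars.sh may use different names, so treat the file itself as the reference.

```shell
# Hypothetical settings; the variable names are NOT taken from vars.sh.
SRC_LANG="eng"               # source language (matches train/eng, valid/eng, test/eng)
TGT_LANG="spa"               # target language
SPM_BIN_DIR="/usr/local/bin" # assumed location of spm_train / spm_encode after `sudo make install`
WANDB_PROJECT=""             # assumed convention: empty means no wandb tracking
```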

Data Preparation

The prepare.sh script trains a SentencePiece model using a combined vocabulary from the training data of both languages. This model is used to tokenize the text into sub-word units. Additionally, the script creates a fairseq dictionary and binarizes the data.

```
bash prepare.sh
```
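
To make the three stages concrete, the sketch below prints (rather than runs) one plausible sequence of commands. It is not the contents of prepare.sh: the function name, the output layout, the default vocabulary size of 8000, and the use of `--joined-dictionary` are all assumptions.

```shell
# Hypothetical helper: prints the commands so they can be reviewed first.
# Paths, vocabulary size, and flags are assumptions, not prepare.sh itself.
print_prepare_plan() {
  src="$1"; tgt="$2"; out="$3"; vocab="${4:-8000}"

  echo "mkdir -p $out"

  # 1. Train one SentencePiece model on both languages (shared vocabulary).
  echo "spm_train --input=train/$src,train/$tgt --model_prefix=$out/spm --vocab_size=$vocab"

  # 2. Tokenize every split of both languages into sub-word pieces.
  for split in train valid test; do
    for lang in "$src" "$tgt"; do
      echo "spm_encode --model=$out/spm.model --output_format=piece < $split/$lang > $out/$split.spm.$lang"
    done
  done

  # 3. Build the fairseq dictionary and binarize all splits.
  echo "fairseq-preprocess --source-lang $src --target-lang $tgt --trainpref $out/train.spm --validpref $out/valid.spm --testpref $out/test.spm --destdir $out/bin --joined-dictionary"
}

# Usage:
#   print_prepare_plan eng spa data          # review the commands
#   print_prepare_plan eng spa data | bash   # run them (paths without spaces)
```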

Training

You can configure the model architecture and training parameters in the train.sh script. This script trains a transformer model using fairseq-train and selects the best model checkpoint based on the validation BLEU score.

```
bash train.sh
```
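
For orientation, checkpoint selection by validation BLEU is typically enabled in fairseq with `--eval-bleu`, `--best-checkpoint-metric bleu`, and `--maximize-best-checkpoint-metric`. The sketch below shows one way such an invocation could look; the hyperparameters, the `train_model` and `run` helpers, and the directory arguments are assumptions and may not match train.sh.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Hypothetical wrapper; hyperparameter values are illustrative only.
train_model() {
  data_bin="$1"; src="$2"; tgt="$3"; save_dir="$4"
  run fairseq-train "$data_bin" \
    --task translation --source-lang "$src" --target-lang "$tgt" \
    --arch transformer \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu --eval-bleu-remove-bpe sentencepiece \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --save-dir "$save_dir"
}

# Usage:
#   DRY_RUN=1 train_model data/bin eng spa checkpoints   # print only
#   train_model data/bin eng spa checkpoints             # train
```

With these flags, fairseq writes the checkpoint with the highest validation BLEU to `checkpoint_best.pt` inside the save directory.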

Evaluation

The eval.sh script generates translation hypotheses for the test set and evaluates them using sacrebleu. The default metric reported is chrF.

```
bash eval.sh
```
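
One step that is easy to get wrong when scoring by hand is recovering the hypotheses from the `fairseq-generate` output, which is not emitted in test-set order. The sketch below shows one way to do it; the `extract_hypotheses` helper and the paths in the usage comments are assumptions, and eval.sh may handle this differently.

```shell
# Hypothetical helper: read fairseq-generate output on stdin and print one
# hypothesis per line, restored to the original test-set order.
extract_hypotheses() {
  grep '^H-' |        # keep hypothesis lines: H-<id><TAB><score><TAB><text>
    sed 's/^H-//' |   # strip the prefix so the id is a plain number
    sort -n -k1,1 |   # sort numerically by sentence id
    cut -f3-          # drop id and score, keep the text
}

# Usage (paths are assumptions):
#   fairseq-generate data/bin --path checkpoints/checkpoint_best.pt \
#     --source-lang eng --target-lang spa --gen-subset test \
#     --remove-bpe sentencepiece | extract_hypotheses > hyp.txt
#   sacrebleu test/spa -i hyp.txt -m chrf
```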
