Training a Machine Translation Model

This directory contains scripts for preparing data, training, and evaluating a bilingual neural machine translation model.

Getting Started

Training the model requires an NVIDIA GPU, CUDA, and NCCL. For cloud-based training, it is recommended to use a virtual machine with pre-installed deep learning packages, such as the Google Cloud Deep Learning VM or Amazon Deep Learning Containers.
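
To verify the environment before training, a quick check along these lines can help (a minimal sketch, assuming PyTorch is already installed; on builds without NCCL the second command may fail):

    # Sanity check: confirm the GPU driver is visible and that PyTorch
    # detects CUDA and NCCL. Assumes a working PyTorch installation.
    nvidia-smi
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.nccl.version())"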

Prerequisites

  1. Install fairseq:

    git clone https://github.com/pytorch/fairseq
    cd fairseq
    git checkout 920a548ca770fb1a951f7f4289b4d3a0c1bc226f
    pip install --editable ./
  2. Install SentencePiece:

    git clone https://github.com/google/sentencepiece.git 
    cd sentencepiece
    git checkout d8f741853847553169444afc12c00f4bbff3e9ce
    mkdir build
    cd build
    cmake ..
    make -j $(nproc)
    sudo make install
    sudo ldconfig -v
  3. Install sacrebleu for evaluation:

    pip install sacrebleu
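
After installing the prerequisites, it can be worth confirming that all three tools are available. This is a minimal sanity check, not part of the repository's scripts:

    # Verify that fairseq, SentencePiece, and sacrebleu are installed.
    python -c "import fairseq; print(fairseq.__version__)"
    command -v spm_train spm_encode
    sacrebleu --version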

Installation

  1. Clone the repository:

    git clone https://github.com/josecols/seed-cat.git
    cd seed-cat/nmt
  2. Set up the train, valid, and test data directories.

    For instance, if you're training an English-Spanish model, you can use the Seed and FLORES+ datasets. Download the eng_Latn corpus from Seed and the latest release of FLORES+. For FLORES+, use the dev split for validation and the devtest split for testing; a download sketch follows the directory layout below.

    Example directory structure:

    nmt/
    ├── train/
    │   ├── eng
    │   ├── spa
    ├── valid/
    │   ├── eng
    │   ├── spa
    ├── test/
    │   ├── eng
    │   ├── spa
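
    As a rough sketch, the directories can be populated as follows. The repository URLs and file paths are assumptions based on the openlanguagedata releases and may differ from the current layout:

        # Hypothetical download sketch for an English-Spanish setup; the
        # repository layout and file names are assumptions and may change.
        mkdir -p train valid test

        # Seed corpus: aligned training data, one sentence per line.
        git clone https://github.com/openlanguagedata/seed.git
        cp seed/seed/eng_Latn train/eng
        cp seed/seed/spa_Latn train/spa

        # FLORES+: dev split for validation, devtest split for testing.
        git clone https://github.com/openlanguagedata/flores.git
        cp flores/dev/dev.eng_Latn valid/eng
        cp flores/dev/dev.spa_Latn valid/spa
        cp flores/devtest/devtest.eng_Latn test/eng
        cp flores/devtest/devtest.spa_Latn test/spa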

Usage

Before preparing the data, configure the language pair and specify the path to the SentencePiece installation in the vars.sh script. You can also set a wandb project name to track training jobs and metrics.
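
For reference, a configuration for English-Spanish might look like the following. The variable names here are illustrative assumptions; use the names actually defined in vars.sh:

    # Hypothetical vars.sh contents; actual variable names may differ.
    SRC=eng                        # source language
    TGT=spa                        # target language
    SPM_PATH=/usr/local/bin        # location of the SentencePiece binaries
    WANDB_PROJECT=nmt-eng-spa      # optional Weights & Biases project name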

Data Preparation

The prepare.sh script trains a SentencePiece model on the combined training data of both languages, yielding a shared subword vocabulary. This model tokenizes the text into subword units. The script then creates a fairseq dictionary and binarizes the data.

bash prepare.sh
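
In outline, the pipeline behind prepare.sh resembles the following sketch. The exact flag values (for example, the vocabulary size) are assumptions; the script itself is the source of truth:

    # Sketch of the preparation pipeline (flag values are assumptions).
    # 1. Train a joint SentencePiece model on both sides of the training data.
    cat train/eng train/spa > train.joint
    spm_train --input=train.joint --model_prefix=spm --vocab_size=8000 \
        --model_type=unigram

    # 2. Tokenize each split into subword units.
    for split in train valid test; do
        for lang in eng spa; do
            spm_encode --model=spm.model --output_format=piece \
                < $split/$lang > $split.spm.$lang
        done
    done

    # 3. Build the fairseq dictionary and binarize the data.
    fairseq-preprocess --source-lang eng --target-lang spa \
        --trainpref train.spm --validpref valid.spm --testpref test.spm \
        --destdir data-bin --joined-dictionary --workers $(nproc)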

Training

You can configure the model architecture and training parameters in the train.sh script. This script trains a transformer model using fairseq-train and selects the best model checkpoint based on the validation BLEU score.

bash train.sh
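
For illustration, a fairseq-train invocation of this kind could look as follows. The hyperparameters shown are placeholders, not the values set in train.sh:

    # Illustrative fairseq-train invocation; hyperparameters are placeholders.
    fairseq-train data-bin \
        --arch transformer --share-all-embeddings \
        --optimizer adam --adam-betas '(0.9, 0.98)' \
        --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 \
        --eval-bleu --eval-bleu-remove-bpe sentencepiece \
        --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
        --save-dir checkpoints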

Evaluation

The eval.sh script generates translation hypotheses for the test set and evaluates them using sacrebleu. The default metric reported is chrF.

bash eval.sh
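
Under the hood, the generation and scoring steps can be approximated like this. Paths and flag values are assumptions; see eval.sh for the actual commands:

    # Generate detokenized hypotheses for the test set.
    fairseq-generate data-bin --path checkpoints/checkpoint_best.pt \
        --gen-subset test --beam 5 --remove-bpe sentencepiece \
        > gen.out

    # fairseq prefixes hypotheses with "H-<id>" and emits them out of
    # order; sort by id to restore the original sentence order.
    grep ^H gen.out | sed 's/^H-//' | sort -n -k 1 | cut -f3 > gen.hyp

    # Score against the raw test references; chrF is the default metric here.
    sacrebleu test/spa -i gen.hyp -m chrf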