---
layout: docs
title: MT Marathon 2019
permalink: /examples/mtm2019
icon: fa-cogs
---
In this tutorial we will learn how to do efficient neural machine translation using the Marian toolkit by optimizing the speed, accuracy and use of resources for training and decoding of NMT models.
No background knowledge about Marian is required, but if you are completely new to neural machine translation and Marian, you may want to take a look at the introductory tutorial to Marian first. We assume that the reader is familiar with basic Linux commands.
The tutorial requires Marian version 1.7.12 or newer, which is currently only available in the marian-dev repository. However, most parts are also applicable to older versions of Marian.
Marian can be compiled on machines with NVIDIA GPU devices and CUDA 8.0+ or on CPU-only machines. For this tutorial we need Marian compiled with support for CPU and SentencePiece. This requires installing Intel MKL and Protocol Buffers first.
To download the repository and compile Marian, run the following commands:
git clone https://github.com/marian-nmt/marian-dev
cd marian-dev
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=on -DCOMPILE_CPU=on
make -j
cd ../..
In this tutorial we assume that Marian is compiled in your home directory. If everything worked correctly you can display the list of options with:
~/marian-dev/build/marian --help |& less
First we need to download models and data (611MB) that we will use for the tutorial:
wget http://data.statmt.org/romang/marian-examples/marian-tutorial-mtm19.tgz
tar zxvf marian-tutorial-mtm19.tgz
cd marian-tutorial-mtm19
This will give you the following files:
marian-tutorial-mtm19
├ 1_decoding
│ ├ data
│ │ ├ newstest2013.de
│ │ ├ newstest2013.en
│ │ ├ newstest2014.de
│ │ └ newstest2014.en
│ ├ model.npz
│ ├ run-me.sh
│ └ vocab.spm
├ 2_training
│ ├ config.yml
│ ├ download-data.sh
│ ├ run-me.sh
│ └ train.log
├ 3_student
│ ├ data
│ │ ├ newstest2014.bpe.en
│ │ └ newstest2014.de
│ ├ lex.s2t
│ ├ model.student.base
│ │ └ model.npz
│ ├ model.student.small
│ │ └ model.npz
│ ├ model.student.small.aan.nogate.noffn
│ │ └ model.npz
│ ├ run-me.sh
│ └ vocab.ende.yml
├ detruecase.perl
└ multi-bleu.perl
For this part of the tutorial we will use an RNN model trained on the WMT'14
English-German corpus provided by the Stanford NLP
Group pre-processed with a true-caser
and SentencePiece-based subword segmentation. The model is in the
marian-tutorial-mtm19/1_decoding
folder.
Models trained with Marian can be decoded using the marian-decoder
command.
The basic usage requires specifying paths to the model and vocabularies:
~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm < data/newstest2014.en > output.de
where newstest2014.en
is an input file with pre-processed source sentences
and output.de
is an output file with translations.
Now we want to calculate the speed and quality of the translation. Marian already prints the total time required to translate the input file (exclusive of loading times of models and vocabularies) at the end of the decoding, for example:
[2019-08-25 13:17:33] Total time: 119.24629s wall
To estimate the quality, we need to run a BLEU scorer:
cat output.de | ../multi-bleu.perl data/newstest2014.de
This should display:
BLEU = 23.90, 56.0/29.6/17.7/11.1 (BP=1.000, ratio=1.003, hyp_len=59469, ref_len=59297)
Batched decoding parallelizes the translation of multiple sentences. It generates translations for whole mini-batches and significantly increases translation speed, roughly by a factor of 10 or more.
In Marian there are a couple of options that are important for batched translation:
- --mini-batch enables translation with a mini-batch size of N, i.e. translating N sentences at once.
- --mini-batch-words allows specifying the size of a mini-batch in terms of the number of words instead of the number of sentences.
- --maxi-batch preloads M mini-batches, i.e. M x N sentences.
- --maxi-batch-sort sorts them according to source sentence length, which allows for better packing of the batch.
An important option is also --workspace or -w, which sets the working memory. The default working memory is 512 MB and Marian will increase it to match the requirements during translation, but pre-allocating more memory usually makes translation a bit faster.
Last but not least, you can also parallelize translation by running it on
multiple GPUs using the --devices
option.
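As a rough sketch of what such a batched run can look like (the batch sizes, workspace and device id below are example values, not tuned recommendations):

```bash
# Batched GPU decoding; mini-batch/maxi-batch sizes, workspace (in MB) and
# the device id should be adapted to your hardware.
~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
    --mini-batch 64 --maxi-batch 1000 --maxi-batch-sort src \
    -w 4000 --devices 0 \
    < data/newstest2014.en > output.batched.de
```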
Measure the translation speed for the following configurations and fill in the table:

| System | Speed (sec.) |
|---|---|
| Line-by-line translation | |
| Adding mini-batch of 32 | |
| Switching to mini-batch of 64 | |
| Increasing workspace to 4GB | |
| Adding maxi-batch of 1000 | |
| Adding mini-batch-words of 2000 | |
{: .table .table-bordered .table-striped }
There are a few options that impact the translation speed and quality, and beam size is one of the most important. It determines the number of translation hypotheses considered at each step of the beam search. Exploring too few hypotheses may lead to a globally suboptimal translation, while a beam that is too large may increase resource usage and computation time without improving the translation quality much.
Options that can be easily tuned include:
- --beam-size determines the number of translation hypotheses explored at each step of the beam search algorithm. Common beam sizes are 8 to 12, depending on the model.
- --max-length-factor sets the maximum target length as the source length times this factor. The default value is 3.
- --normalize divides the translation score by the translation length raised to the power of $$\alpha$$.
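For illustration, a decoder invocation that sets all three options explicitly might look as follows (the concrete values are assumptions, not recommendations):

```bash
# Decoding with an explicit beam size, maximum length factor and length
# normalization, combined with the batching options from the previous section.
~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
    -b 6 --max-length-factor 2 -n 0.6 \
    --mini-batch 64 --maxi-batch 1000 --maxi-batch-sort src \
    < data/newstest2014.en > output.b6.de
```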
Explore different configurations of the beam size and maximum length factor and try to improve the translation time while keeping BLEU at a good level. Next, grid-search the value of the length normalization parameter on newstest2013 and check if the improvement in BLEU transfers to newstest2014.
What are reasonable values for the beam size? What is the danger of using a value that is too small for the maximum length factor?
This can be done with a simple bash script:
for i in 1 2 4 8 12 24; do
~/marian-dev/build/marian-decoder -b $i [OTHER OPTIONS] < data/newstest2013.en > output.b$i.de
done
Reasonable values to consider for --max-length-factor are in the range of 1 to 3. The normalization parameter $$\alpha$$ can be grid-searched in the same way on newstest2013.
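For example, a loop over a small grid of values could look like this (the grid of values below is an arbitrary example):

```bash
# Grid search over the length normalization parameter on newstest2013;
# replace [OTHER OPTIONS] with the model, vocabulary and batching options.
for a in 0.2 0.4 0.6 0.8 1.0; do
    ~/marian-dev/build/marian-decoder -n $a [OTHER OPTIONS] < data/newstest2013.en > output.n$a.de
    cat output.n$a.de | ../multi-bleu.perl data/newstest2013.de
done
```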
Generating back-translations often requires translating large amounts of monolingual data (sometimes even hundreds of millions of sentences), so optimizing this process can save you quite a lot of time.
- If your dataset is large, consider splitting it into smaller chunks (see GNU split) and translating the chunks sequentially; see the sketch after this list. This will not make the translation faster, but it will make it more stable.
- Use batched translation and adjust your decoding parameters to speed up the translation.
- Because monolingual data, especially if it comes from web crawling, can contain very long sentences, use a smaller --max-length and turn on --max-length-crop.
- Use sampled back-translations with --output-sampling.
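Putting these tips together, a back-translation run over a large monolingual file could be organized roughly as follows (mono.en and the chunk size are hypothetical, and the decoding options are example values):

```bash
# Split a large monolingual corpus into 1M-line chunks and translate them
# sequentially with batching, length cropping and sampling enabled.
split -d -l 1000000 mono.en mono.chunk.
for f in mono.chunk.*; do
    ~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
        --mini-batch 64 --maxi-batch 1000 --maxi-batch-sort src \
        --max-length 100 --max-length-crop --output-sampling \
        < $f > $f.de
done
```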
To test sampled translations, run the following command several times and compare the outputs:
head data/newstest2014.en | ~/marian-dev/build/marian-decoder -m model.npz -v vocab.spm vocab.spm \
--output-sampling
Efficient training of an NMT model might mean minimizing the training time or improving convergence. It also means choosing a model architecture that is best suited to your needs.
Transformer models often offer the best quality, but they are also more difficult to train than RNNs and need more careful setting of training parameters. A good starting point for finding a good set of training hyper-parameters is our repository with Marian examples.
Since version 1.7.12 Marian provides the --task option, which offers pre-defined settings for transformer-base and transformer-big:
~/marian-dev/build/marian --task transformer-base --dump-config expand
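The expanded settings can also be saved to a file and edited before training; a minimal sketch (the file name is arbitrary):

```bash
# Dump the expanded transformer-base preset into a YAML config file.
~/marian-dev/build/marian --task transformer-base --dump-config expand > transformer-base.yml
```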
You should keep in mind that those settings are just a starting point and they should be adjusted for your scenario. For example, a low-resource scenario will probably require stronger regularization, etc.
Training of some model architectures like Transformer may benefit from large mini-batches.
--mini-batch-fit
overrides the specified mini-batch size and automatically
chooses the largest mini-batch for a given sentence length that fits the
specified memory. When --mini-batch-fit
is set, memory requirements are
guaranteed to fit into the specified workspace. Choosing a workspace that is too small will result in small mini-batches, which can harm learning.
In a multi-GPU setting, synchronous training (--sync-sgd) can also serve as a method to increase the effective size of mini-batches.
The option --workspace sets the size of the memory available for the forward and backward steps of the training procedure. This does not include the model and optimizer parameters, which are allocated outside the workspace. Hence you cannot allocate all GPU memory to the workspace.
It is a bit tricky to determine the effective size of a mini-batch when mini-batches are sized dynamically. An alternative is to think in terms of the total workspace used to generate the mini-batches for a single update, which can be calculated with the following formula:

$$\text{total workspace} = \text{workspace} \times \text{number of GPUs} \times \text{optimizer delay}$$
Cumulative or delayed gradient
updates increase the effective
batch size, i.e. the amount of data used for each update of the model. They can be
enabled with --optimizer-delay N
.
Download the training data:
cd ../2_training
bash download-data.sh
Start with the following command and set --workspace
, --devices
and
--optimizer-delay
options:
~/marian-dev/build/marian --task transformer-base -c config.yml
The optimal values for these options will differ depending on the number of GPUs and available memory per single GPU.
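As a starting point, a run on a machine with four GPUs might look roughly like this (the workspace size, device ids and delay are assumptions for illustration):

```bash
# Example training invocation for 4 GPUs: larger workspace (in MB),
# synchronous multi-GPU training and delayed gradient updates.
~/marian-dev/build/marian --task transformer-base -c config.yml \
    -w 9000 --devices 0 1 2 3 --sync-sgd --optimizer-delay 2
```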
Knowledge distillation approaches, like the teacher-student method, are used for training a smaller student model to perform better by learning from a larger teacher model.
In NMT this might mean training a strong (and slow) ensemble of Transformer models as a teacher, then translating the entire source side of the training data with the teacher, and finally training a small (and fast) model on the original source and the translated target data. The student model usually reaches a lower BLEU score, but is much faster. A useful side effect of teacher-student learning is improved performance of the student with smaller beam sizes.
More details can be found in our paper on cost-effective and high-quality NMT with Marian.
The teacher-student method enables efficient translation on CPU. A couple of things can speed it up further:
- Batched translation as presented in the first part of the tutorial.
- A lexical shortlist, which restricts the output vocabulary to a small subset of translation candidates for each word.
- Auto-tuning of the matrix product implementation with --optimize.

The --cpu-threads option enables decoding on the CPU.
Experiment with different settings for decoding on a single CPU. Check different student models, batched translation, lexical shortlists and auto-tuning.
How does a student model perform with small beam sizes? What is the speed improvement from a lexical shortlist on CPU and GPU? What are the architectures of the student models and how large are they? What is the issue with decoding on multiple CPU threads with model files in the .npz format?
The models can be found in the ./3_student
folder. They have been trained on
true-cased and BPE-segmented data, so require pre- and post-processing:
cd ../3_student
cat data/newstest2014.bpe.en \
| ~/marian-dev/build/marian-decoder -m model.student.small/model.npz -v vocab.ende.yml vocab.ende.yml \
-d 0 --cpu-threads 1 \
--mini-batch 64 --maxi-batch 100 \
-b 1 --max-length-factor 1.5 -n 0.6 \
| perl -pe 's/@@ //g' \
| ../detruecase.perl \
| ../multi-bleu.perl data/newstest2014.de
The shortlist file is 3_student/data/lex.s2t
. Model-specific parameters can
be read from a model.npz
file using the script
marian-dev/scripts/contrib/model_info.py
.
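For the shortlist experiments, the shortlist can be added to the decoding command above roughly as follows (the candidate counts 100 100 are a common example setting, and the path to lex.s2t may need adjusting to your copy of the data):

```bash
# CPU decoding with a lexical shortlist; "100 100" limits the number of
# translation candidates considered per source word.
cat data/newstest2014.bpe.en \
    | ~/marian-dev/build/marian-decoder -m model.student.small/model.npz -v vocab.ende.yml vocab.ende.yml \
      --cpu-threads 1 --mini-batch 64 --maxi-batch 100 \
      -b 1 --max-length-factor 1.5 -n 0.6 \
      --shortlist data/lex.s2t 100 100 \
    | perl -pe 's/@@ //g' \
    | ../detruecase.perl \
    | ../multi-bleu.perl data/newstest2014.de
```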
That's all for now. Hope you have found something useful here!