From fe35a486198a8b44c2fa288e8bc343e827b390e6 Mon Sep 17 00:00:00 2001
From: Roman Grundkiewicz
Date: Thu, 10 Feb 2022 17:17:54 +0000
Subject: [PATCH 1/7] Add new sections in docs/

---
 docs/index.md | 104 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 82 insertions(+), 22 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index efd410a5d..9720dc77d 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -40,28 +40,6 @@ for [previous releases]({% link docs/cmd/index.md %}).

-### Model types
-
-- `s2s`: An RNN-based encoder-decoder model with an attention mechanism. The
-  architecture is equivalent to the
-  [DL4MT](https://github.com/nyu-dl/dl4mt-tutorial) or
-  [Nematus](https://github.com/EdinburghNLP/nematus) models
-  ([Sennrich et al., 2017](https://arxiv.org/abs/1703.04357)).
-- `transformer`: A model originally proposed by Google
-  [(Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762), based solely on
-  attention mechanisms.
-- `multi-s2s`: As `s2s`, but uses two or more encoders, allowing multi-source
-  neural machine translation.
-- `multi-transformer`: As `transformer`, but uses multiple encoders.
-- `amun`: A model equivalent to Nematus models unless layer normalization is
-  used. Can be decoded with Amun as the _nematus_ model type.
-- `nematus`: A model type developed for decoding deep RNN-based
-  encoder-decoder models created by the Edinburgh MT group for WMT 2017 using
-  the Nematus toolkit. Can be decoded with Amun as the _nematus2_ model type.
-- `lm`: An RNN language model.
-- `lm-transformer`: A transformer-based language model.

### Developer API

[The developer documentation for Marian]({{ 'docs/api/' | relative_url }}) is

@@ -283,6 +261,28 @@ Command-line options overwrite options stored in the configuration file.

+### Model types
+
+- `s2s`: An RNN-based encoder-decoder model with an attention mechanism. The
+  architecture is equivalent to the
+  [DL4MT](https://github.com/nyu-dl/dl4mt-tutorial) or
+  [Nematus](https://github.com/EdinburghNLP/nematus) models
+  ([Sennrich et al., 2017](https://arxiv.org/abs/1703.04357)).
+- `transformer`: A model originally proposed by Google
+  [(Vaswani et al., 2017)](https://arxiv.org/abs/1706.03762), based solely on
+  attention mechanisms.
+- `multi-s2s`: As `s2s`, but uses two or more encoders, allowing multi-source
+  neural machine translation.
+- `multi-transformer`: As `transformer`, but uses multiple encoders.
+- `amun`: A model equivalent to Nematus models unless layer normalization is
+  used. Can be decoded with Amun as the _nematus_ model type.
+- `nematus`: A model type developed for decoding deep RNN-based
+  encoder-decoder models created by the Edinburgh MT group for WMT 2017 using
+  the Nematus toolkit. Can be decoded with Amun as the _nematus2_ model type.
+- `lm`: An RNN language model.
+- `lm-transformer`: A transformer-based language model.
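
The model type is selected at training time with the `--type` option. As an
illustration only, a minimal training invocation choosing one of the types
listed above could look like the following sketch, in which all file paths are
placeholders and most options are left at their defaults:

```
# Train a Transformer model on a parallel corpus (placeholder paths).
./build/marian --type transformer \
    --train-sets corpus.src corpus.trg \
    --vocabs vocab.src.spm vocab.trg.spm \
    --model model/model.npz
```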
### Multi-GPU training

For multi-GPU training you only need to specify the device ids of the GPUs you

@@ -489,6 +489,11 @@ words in the corresponding target training sentence.

### Tied embeddings

TODO


### Custom embeddings

Marian can handle custom embedding vectors trained with

@@ -518,6 +523,21 @@ Other options for managing embedding vectors:

### Model pre-training

TODO


### Fine-tuning

TODO


### Right-to-left models

TODO


### Guided alignment

Training with guided alignment may improve alignments produced by RNN models

@@ -551,6 +571,31 @@ Marian has a few more options related to guided alignment training:
  alignment training; only for training transformer models

### Pre-defined architecture settings

TODO


### Factored models

TODO


### FP16 training

TODO


### Multi-node training

TODO


### Training from stdin

TODO


## Translation

@@ -695,6 +740,21 @@ directory contains `fast_align` and `atools` from

### Word-level scores

TODO


### Noisy back-translation

TODO


### Binary models

TODO


### Web server

The `marian-server` command starts a web-socket server providing CPU and GPU

From 351cd9cae4c060b16ae1f635609b896bf32fe5e6 Mon Sep 17 00:00:00 2001
From: Roman Grundkiewicz
Date: Thu, 10 Feb 2022 19:37:35 +0000
Subject: [PATCH 2/7] Add documentation

---
 docs/index.md | 92 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 84 insertions(+), 8 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 9720dc77d..a7504b6aa 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -571,30 +571,106 @@ Marian has a few more options related to guided alignment training:
  alignment training; only for training transformer models

### Pre-defined configurations

Marian provides the `--task` option, which is a handy shortcut for setting the
model architecture and training options of common NMT model configurations.
The list of predefined configurations includes:

- `best-deep` - the RNN BiDeep architecture proposed by [Miceli Barone et al.
  (2017)](http://www.aclweb.org/anthology/W17-4710)
- `transformer-base` and `transformer-big` - architectures and proposed
  training settings for a Transformer "base" model and a Transformer "big"
  model, respectively, both introduced in [Vaswani et al.
  (2017)](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)
- `transformer-base-prenorm` and `transformer-big-prenorm` - variants of the
  two Transformer models with "prenorm", i.e. layer normalization is performed
  as the first block-wise preprocessing step

Options that are set automatically via `--task` can be overridden by
specifying them separately on the command line. For example, `--task
transformer-base --dim-emb 1024` will train a Transformer "base" model, but
with an embedding size of 1024 instead of 512.


### Factored models

Marian supports training models with source and/or target side factors. To
train a factored model, the training data needs to be in a specific format,
and a special vocabulary is required. More information on using Marian with
factors can be found in [the documentation on factored
models](https://marian-nmt.github.io/docs/api/factors).


### Mixed precision training

Marian supports mixed precision training, available on NVIDIA Volta and newer
architectures. The option `--fp16` provides a shortcut with default settings
for mixed precision training with float16 and cost-scaling.
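
For example, assuming an existing training configuration, enabling this
shortcut only requires adding the flag to the usual training command (the
config file name here is a placeholder):

```
./build/marian -c config.yml --fp16
```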
Other options related to mixed precision training:

- `--precision` - defines the types used for the forward/backward passes and
  for optimization,
- `--cost-scaling` - option values for dynamic cost scaling,
- `--gradient-norm-average` - window size over which the exponential average
  of the gradient norm is recorded,
- `--dynamic-gradient-scaling` - re-scale the gradient to the average gradient
  norm if the (log) gradient norm diverges from the average by the given
  number of sigmas,
- `--check-gradient-nan` - skip the parameter update in case of NaNs in the
  gradient.


### Training from stdin

Parallel training data can be provided to Marian as a tab-separated file,
where, by convention, the first field corresponds to the source side and the
second field to the target side of the parallel corpus. For example, instead
of providing two files to `--train-sets`:
```
./build/marian -c config.yml -t file.src file.trg
```

a single file can be specified with the `--tsv` option:
```
./build/marian -c config.yml --tsv -t file.src-trg
```

The example can be further extended to train from a corpus provided directly
on the standard input:

```
paste file.src file.trg | ./build/marian -c config.yml -t stdin --no-shuffle
```

This might be useful when using a custom tool for training data preparation.
Note that the user takes responsibility for randomizing the input data; this
is why `--no-shuffle` is added to the training command (alternatively,
`--shuffle batches` can be used).

#### Logical epochs

The notion of an epoch is less clear when the training data is provided on
stdin, as the corpus cannot easily be rewound and shuffled by Marian. Thus, it
is possible to define a logical epoch in terms of the number of updates or
labels; for example, `--logical-epoch 1Gt` will re-define the epoch as 1
billion target tokens instead of the traditional one pass over the training
data. This is especially useful if the data is provided as an infinite stream
on stdin.

#### Guided alignment and data weighting

Training with guided alignment and data weighting is supported when the corpus
is provided on stdin. Simply add new fields to the input TSV file and specify
the indices of the fields that contain word alignments or weights. For
example:

```
cat file.src-trg-aln-w | ./build/marian -t stdin --guided-alignment 2 --data-weighting 3
```

From f373f4eb78782742dc501c6252131ec3483c0cc6 Mon Sep 17 00:00:00 2001
From: Roman Grundkiewicz
Date: Fri, 11 Feb 2022 10:49:47 +0000
Subject: [PATCH 3/7] Update docs

---
 docs/index.md | 78 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 70 insertions(+), 8 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index a7504b6aa..c0055009e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -491,7 +491,19 @@

### Tied embeddings

Tying embedding matrices can help to reduce model size and memory footprint
during training. Tying the target embeddings and the last output layer does
not decrease quality and saves a significant number of parameters. Tying all
embedding layers and the output layer is a common practice for translation
models between languages that use the same script.

Related options:

- `--tied-embeddings` - tie target embeddings and output embeddings in the
  output layer,
+ ### Custom embeddings @@ -523,19 +535,48 @@ Other options for managing embedding vectors: -### Model pre-training +### Fine-tuning -TODO +A common domain adaptation technique is continued training via fine-tuning of +an existing model on new training data. +You can start continued training by copying your model to a new folder and +setting the `--model` option to point to that model. This will reload the model +from the path and also overwrite it during the next checkpoint saving. Note +that this overrides the model parameters with the model parameters from the +file, so the architectures cannot be changed between continued trainings. -### Fine-tuning +This method also works well for normal continued training. You can interrupt +your running training, change the training corpus and run the same command you +used before for the training to resume. In the case where the training files +change, the option `--no-restore-corpus` should be added to not restore the +corpus positions. If your validation data change, consider adding +`--valid-reset-stalled` to reset validation counters. You can also change other +training parameters like learning rate or early stopping criteria. If the new +training corpus is much smaller, it is usually recommended to decrease the +learning rate and validate the model more frequently. -TODO +See also [model pre-training]({{ 'docs#model-pre-training' | relative_url }}). + + +### Model pre-training + +A transfer learning technique related to fine-tuning is initializing model +weights from a pre-trained model. Marian provides the `--pretrained-model +model.npz` option that will load weight matrices from the pre-trained model +that match in name corresponding parameters from the model's architecture. +Matrices that are not present in the pre-trained model are initialized randomly +by default. + +For instance, you can initialize the decoder of a encoder-decoder translation +model with a pre-trained language model or deep models with shallow models. ### Right-to-left models -TODO +Marian provides an option for training on reversed input sequence via +`--right-left`. Combining a traditional left-to-right models and right-to-left +models may lead to an improved performance for some tasks. ### Guided alignment @@ -599,7 +640,7 @@ Marian supports training models with source and/or target side factors. To train a factored model, the training data needs to be in a specific format, and a special vocabulary is required. More information on using Marian with factors can be found in [the documentation on factored -models](https://marian-nmt.github.io/docs/api/factors). +models]({{ 'docs/api/factors' | relative_url }}). ### Mixed precision training @@ -818,7 +859,28 @@ directory contains `fast_align` and `atools` from ### Word-level scores -TODO + +In addition to sentence-level scores, Marian can also output word-level scores. +The option `--word-scores` prints one score per subword unit, for example: + +``` +echo "This is a test." | ./build/marian-decoder -c config.yml --word-scores +Tohle je test. ||| WordScores= -1.51923 -0.21951 -1.48668 -0.24813 -0.22176 +``` + +Note that if you use the built-in SentencePiece subword segmentation, the +number of scores will not much the output tokens. Also, word scores are not +normalized even if `--normalize` is used. You may want to normalize and map the +word scores into output tokens as a custom post-processing step. 
Adding `--no-spm-decode` or `--alignment` will deliver all the information
needed to do that:

```
echo "This is a test." | ./build/marian-decoder -c config.yml --word-scores --no-spm-decode --alignment
▁Tohle ▁je ▁test . ||| 1-0 5-1 5-2 5-3 5-4 ||| WordScores= -1.51923 -0.21951 -1.48668 -0.24813 -0.22176
```

The option `--word-scores` is also available in `marian-scorer`.


### Noisy back-translation

TODO

From d7ce409c71f6a1063c04a4cd210c478aa891307c Mon Sep 17 00:00:00 2001
From: Graeme Nail
Date: Fri, 11 Feb 2022 10:56:51 +0000
Subject: [PATCH 4/7] Describe early-stopping-on

---
 docs/index.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/index.md b/docs/index.md
index c0055009e..2964a80d5 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -410,6 +410,9 @@ By default we use early stopping with patience of 10, i.e. `--early-stopping
steps. Usually this will signal convergence or --- if the scores get worse
with later validation steps --- potential overfitting.

If multiple metrics are used for validation, the stopping condition can be
applied to `any` or `all` of these metrics. This is controlled with the flag
`--early-stopping-on`; the default considers only the `first` listed metric.

### Regularization

From a124ef4e6113179cb54b378527f02cad2da5137f Mon Sep 17 00:00:00 2001
From: Graeme Nail
Date: Fri, 11 Feb 2022 12:02:29 +0000
Subject: [PATCH 5/7] Describe binary models

---
 docs/index.md | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/docs/index.md b/docs/index.md
index 2964a80d5..63e11823c 100644
--- a/docs/index.md
+++ b/docs/index.md

### Binary models

Marian has support for models in a custom binary format. This format supports
mmap loading as well as both normal and packed memory layouts. Binary models
offer decreased load times compared to `.npz`, and are identifiable by their
`.bin` extension.

The `marian-conv` command is able to convert to and from `npz` and `bin`
models. The memory layout of the binary model is influenced by the
`--gemm-type` flag; by default this is retained as `float32`.

To generate a binary model from an `npz` model:
```shell
./marian-conv --from model.npz --to model.bin
```

The basic usage is as simple as replacing `model.npz` with `model.bin` in your
command arguments. When decoding on CPU, it is possible to enable mmap loading
with the flag `--model-mmap`.

Lexical shortlists also have a binary format. From a shortlist `lex.s2t`, the
binary version can be generated with:
```shell
./marian-conv --shortlist lex.s2t 50 50 0 \
    --dump lex.bin \
    --vocabs vocab.l1.spm vocab.l2.spm
```
The `--shortlist` argument points to the lexical shortlist file and specifies
the `first` (50), `best` (50) and `prune` (0) options for the shortlist. Note
that these options are **hardcoded** into the binary shortlist at conversion!
The `--dump` option gives the location for the binary shortlist, and
`--vocabs` specifies the vocabulary files for the source (l1) and target (l2)
languages.

To use the binary shortlist, the `--shortlist lex.s2t 50 50 0` argument in
your command should be replaced with
```
--shortlist lex.bin false
```
which provides the path to the binary shortlist `lex.bin`; the second option,
`false` (optional, true by default), specifies whether the contents should be
verified.
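
As a rough sketch of how these pieces combine, a CPU decoding command using a
binary model with mmap loading and a binary shortlist might look like the
following; the config name, model, shortlist and thread count are placeholders
and would need to be adapted to your setup:

```
./build/marian-decoder -c config.yml -m model.bin \
    --cpu-threads 8 --model-mmap \
    --shortlist lex.bin false
```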
### Web server

The `marian-server` command starts a web-socket server providing CPU and GPU

From c867e337f0a4f03751561ab665af3e27b401563f Mon Sep 17 00:00:00 2001
From: Roman Grundkiewicz
Date: Fri, 11 Feb 2022 13:27:52 +0000
Subject: [PATCH 6/7] Update docs

---
 docs/index.md | 51 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 40 insertions(+), 11 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 63e11823c..5b4909cce 100644
--- a/docs/index.md
+++ b/docs/index.md

### Noisy back-translation

The `--output-sampling` option in Marian allows to noise output layer with
gumbel noise, which can be used for generating [noisy
back-translations](https://aclanthology.org/D18-1045.pdf).

```
./build/marian-decoder -b 1 -i input.src --output-sampling
```

By default the sampling is from full model distribution. Top-k sampling can be
achieved providing `topk N` as arguments, for example:

```
./build/marian-decoder -b 1 -i input.src --output-sampling topk 10
```

Note that output sampling and beam search are generally contradictory methods
and using them together is not recommended, so we advise to set `--beam-size 1`
when using the sampling.


### Binary models

Marian has support for models in a custom binary format. This format supports
mmap loading as well as both normal and packed memory layouts. Binary models
offer decreased load times compared to `.npz`, and are identifiable by their
`.bin` extension.

The `marian-conv` command is able to convert to and from `npz` and `bin`
models. The memory layout of the binary model is influenced by the
`--gemm-type` flag; by default this is retained as `float32`.

To generate a binary model from an `npz` model:
```shell
./marian-conv --from model.npz --to model.bin
```

The basic usage is as simple as replacing `model.npz` with `model.bin` in your
command arguments. When decoding on CPU, it is possible to enable mmap loading
with the flag `--model-mmap`.

Lexical shortlists also have a binary format. From a shortlist `lex.s2t`, the
binary version can be generated with:
```shell
./marian-conv --shortlist lex.s2t 50 50 0 \
    --dump lex.bin \
    --vocabs vocab.l1.spm vocab.l2.spm
```
The `--shortlist` argument points to the lexical shortlist file and specifies
the `first` (50), `best` (50) and `prune` (0) options for the shortlist. Note
that these options are **hardcoded** into the binary shortlist at conversion!
The `--dump` option gives the location for the binary shortlist, and
`--vocabs` specifies the vocabulary files for the source (l1) and target (l2)
languages.
To use the binary shortlist, the `--shortlist lex.s2t 50 50 0` argument in
your command should be replaced with
```
--shortlist lex.bin false
```
which provides the path to the binary shortlist `lex.bin`; the second option,
`false` (optional, true by default), specifies whether the contents should be
verified.


### Web server

From 9c94a0bfeb39fb34f95796d067240659371e9ee1 Mon Sep 17 00:00:00 2001
From: Graeme Nail
Date: Fri, 11 Feb 2022 14:30:18 +0000
Subject: [PATCH 7/7] Fix typo and add clarifications

---
 docs/index.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/docs/index.md b/docs/index.md
index 5b4909cce..a7bff04aa 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -578,8 +578,10 @@ model with a pre-trained language model or deep models with shallow models.

### Right-to-left models

Marian provides an option for training on reversed input sequences via
`--right-left`. Combining traditional left-to-right models and right-to-left
models may lead to improved performance for some tasks. One such approach
would be to perform sequential decoding. Note, however, that combining
left-to-right and right-to-left models together in an ensemble is not
possible.

#### Logical epochs

The notion of an epoch is less clear when the training data is provided on
stdin, as the corpus cannot easily be rewound and shuffled by Marian. Thus, it
is possible to define a logical epoch in terms of the number of updates or
labels; for example, `--logical-epoch 1Gt` will re-define the epoch as 1
billion target tokens instead of the traditional one pass over the training
data. This is especially useful if the data is provided as an infinite stream
on stdin.

### Noisy back-translation

The `--output-sampling` option in Marian allows one to noise the output layer
with Gumbel noise, which can be used for generating [noisy
back-translations](https://aclanthology.org/D18-1045.pdf).

```
./build/marian-decoder -b 1 -i input.src --output-sampling
```

By default, the sampling is from the full model distribution. Top-k sampling
can be achieved by providing `topk N` as arguments, for example:

```
./build/marian-decoder -b 1 -i input.src --output-sampling topk 10
```
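
For context, in a typical noisy back-translation pipeline this sampling
decoder is run with a reverse-direction (target-to-source) model over
target-language monolingual data, and the sampled outputs are paired with the
original sentences to form synthetic parallel data. A rough sketch with
placeholder file and model names:

```
# Sample noisy "source" sides for monolingual target-language data
# using a reverse (target-to-source) model.
cat mono.trg | ./build/marian-decoder -c trg2src/config.yml -b 1 --output-sampling > synthetic.src

# Pair the sampled source sides with the original target sentences
# to build a synthetic parallel corpus.
paste synthetic.src mono.trg > synthetic.src-trg.tsv
```

The synthetic corpus can then be mixed with genuine parallel data for
training, for example via the `--tsv` option described earlier.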