v0.10.1 (2018-07-18)
- Fix error when starting seq2seq training
v0.10.0 (2018-07-18)
- Introduce hook mechanism for additional customization of workflows
- Sentence-level negative log-likelihood criterion for sequence tagging
- `-` stands for stdin for the inference tools (translate, lm, tag); see the command sketch after this release's entries
- Optional source features per request (for domain control) with the REST translation server
- Display OOV rate (source/target) in translation
- Introduce `max_tokens` allowing longer sentence handling and larger batch sizes
- Add `-log_tag` option to add a tag in logs for easier automatic processing
- The `withAttn` option in the REST translation server now also returns the source and target tokens
- Miscellaneous fixes to lexical beam search
- Fix batch size option not working with `rest_translation_server.lua`
- Introduce `-tokenizer max` option to the scorer for evaluation on non-tokenized test data
- Fix non-deterministic inference of language models
- Fix language model sampling mode
- Fix retraining from a language model
- Fix `-update_vocab` option for language models
- Fix error when using translation-based validation metrics
- Correct error handling for all file open commands
- Reduce the Docker image size
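A minimal sketch of the stdin shortcut above, assuming a released translation model at `model_release.t7` (the model path and input sentence are placeholders, and `-` is assumed to be passed as the `-src` value):

```bash
# Translate a sentence read from stdin: '-' stands for the source file.
# The model path and input are placeholders.
echo "Hello world ." | th translate.lua -model model_release.t7 -src -
```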
v0.9.7 (2017-12-19)
- Fix detokenization when replaced target tokens contain spaces
v0.9.6 (2017-12-11)
- Protected sequence outputs correctly deserialize protected characters (%abcd)
- Bypass case management on protected sequences
v0.9.5 (2017-12-07)
- Enable constrained beam search for protected sequence
- Fix invalid `NOERROR` log level (renamed to `NONE`)
v0.9.4 (2017-11-30)
- Fix regression when normalizing protected sequences
v0.9.3 (2017-11-30)
- Fix vocabulary extraction of protected sequences (#444)
v0.9.2 (2017-11-27)
- Fix empty translation returned by the REST translation server
- Fix random split of protected sequences by BPE (#441)
- Fix error when using `-update_vocab` with additional word features
v0.9.1 (2017-11-16)
- Fix missing normalization during translation
- Fix normalization when the command contains pipes
- Fix incorrect TER normalization (#424)
- Fix error when the file to translate contains empty lines
v0.9.0 (2017-11-07)
- Learning rate is also decayed when using Adam
- Fix some wrong tokenization rules (punctuation-numbers)
- `-report_every` option is renamed to `-report_progress_every`
- `-EOT_marker` option is renamed to `-bpe_EOT_marker` for `tokenize.lua`
- `-BOT_marker` option is renamed to `-bpe_BOT_marker` for `tokenize.lua`
- `bit32` package is now required for LuaJIT users
- Dynamic dataset to train on large and raw training data repository
- Convolutional encoder
- Shallow fusion of language model in decoder
- Lexically constrained beam search
- TER validation metric
- Protection blocks for tokenization and placeholder implementation
- Hook to call external normalization
- JSON log formatting when the log file suffix is `.json`
- Training option to save the validation translation to a file
- Training option to reset the optimizer states when the learning rate is decayed
- Training option to update the vocabularies during a retraining
- Translation option to save alignment history
- Translation option to mark replaced tokens with `⦅unk:xxxxx⦆`
- Tokenization option to split numbers on each digit
- Multi-model REST server using a YAML configuration file
- Allow disabling gradient clipping with `-max_grad_norm 0` (see the training command sketch after this release's entries)
- Allow disabling global parameter initialization with `-param_init 0`
- Introduce error estimation in scorer for all metrics
- Reduce memory footprint of Adam, Adadelta and Adagrad optimizers
- Make validation data optional for training
- Faster tokenization (up to x2 speedup)
- Fix missing final model with some values of `-save_every_epochs`
- Fix validation score delta that was applied in the incorrect direction
- Fix LuaJIT out-of-memory issues in `learn_bpe.lua`
- Fix documentation generation of embedded tokenization options
- Fix release of sequence tagger models
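A hedged example combining the two new "disable" values above in a single training command (the data and model paths are placeholders):

```bash
# -max_grad_norm 0 disables gradient clipping and
# -param_init 0 disables global parameter initialization.
# Data and model paths are placeholders.
th train.lua -data demo-train.t7 -save_model demo-model \
  -max_grad_norm 0 -param_init 0
```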
v0.8.0 (2017-06-28)
- Models previously trained with `-pdbrnn` or `-dbrnn` are no longer compatible
- `-start_decay_ppl_delta` option is renamed to `-start_decay_score_delta`
- `-decay perplexity_only` option is renamed to `-decay score_only`
- `-brnn`, `-dbrnn` and `-pdbrnn` options are replaced by `-encoder_type <type>` for future extensions
- `-sample_tgt_vocab` option is renamed to `-sample_vocab` and is extended to language models
- Implement inference for language models for scoring or sampling
- Support variational dropout and dropout on source sequence
- Support several validation metrics: loss, perplexity, BLEU and Damerau-Levenshtein edit ratio
- Add option in preprocessing to check that lengths of source and target are equal (e.g. for sequence tagging)
- Add `-pdbrnn_merge` option to define how to reduce the time dimension
- Add option to segment mixed-case words
- Add option to segment words of given alphabets or when switching alphabets
- Add Google's NMT encoder
- Add external scorer script for BLEU and Damerau-Levenshtein edit ratio
- Add script to average multiple models
- Add option to save the beam search as JSON
- Support input vectors for sequence tagging
- Fix incorrect gradients when using variable length batches and bidirectional encoders
v0.7.1 (2017-05-29)
- Fix backward compatibility with older models using target features
- Fix importance sampling when using multiple GPUs
- Fix language model training
v0.7.0 (2017-05-19)
- `-sample_w_ppl` option is renamed to `-sample_type` for future extensions
- Support vectors as inputs using Kaldi input format
- Support parallel file alignment by index in addition to line-by-line
- Add script to generate pretrained word embeddings:
- from Polyglot repository
- from pretrained word2vec, GloVe or fastText files
- Add an option to only fix the pretrained part of word embeddings
- Add a bridge layer between the encoder and decoder to define how encoder states are passed to the decoder
- Add `epoch_only` decay strategy to decay the learning rate based on epochs only
- Make the epoch model save frequency configurable
- Optimize decoding and training with target vocabulary reduction (importance sampling)
- Introduce `partition` data sampling
- Improve command line and configuration file parser
- space-separated list of values
- boolean arguments
- disallow duplicate command line options
- clearer error messages
- Improve correctness of `DBiEncoder` and `PDBiEncoder` implementations
- Improve Unicode support for languages using combining marks like Hindi
- Improve logging during preprocessing with dataset statistics
- Fix translation error of models profiled during training
- Fix translation error of models trained without attention
- Fix error when using one-layer GRU
- Fix incorrect coverage normalization formula applied during the beam search
v0.6.0 (2017-04-07)
- `-fix_word_vecs` options now require `0` or `1` as an argument for a better retraining experience
- Add new encoders: deep bidirectional and pyramidal deep bidirectional
- Add attention variants: no attention and dot, general or concat global attention
- Add alternative learning rate decay strategy for SGD training
- Introduce dynamic parameter change for dropout and fixed word embeddings
- Add length and coverage normalization during the beam search
- Add translation option to dump input sentence encoding
- Add TensorBoard metrics visualisation with Crayon
- [experimental] Add sequence tagger model
- Check consistency of option settings when training from checkpoints
- Save and restore random number generator states from checkpoints
- Output more dataset metrics during the preprocessing
- Improve error message on invalid options
- Fix missing n-best hypotheses list in the output file
- Fix individual losses that were always computed when using random sampling
- Fix duplicated logs in parallel mode
v0.5.3 (2017-03-30)
- Fix data loading during training
v0.5.2 (2017-03-29)
- Improve compatibility with older Torch versions missing the `fmod` implementation
v0.5.1 (2017-03-28)
- Fix translation with FP16 precision
- Fix regression that made `tds` mandatory for translation
v0.5.0 (2017-03-06)
- Training code is now part of the library
- Add `-fallback_to_cpu` option to continue execution on CPU if the GPU can't be used
- Add standalone script to generate vocabularies
- Add script to extract word embeddings
- Add option to prune vocabularies by minimum word frequency
- New REST server
- [experimental] Add data sampling during training
- [experimental] Add half floating point (fp16) support (with cutorch@359ee80)
- Make sure released models do not contain any serialized functions
- Reduce size of released BRNN models (up to 2x smaller)
- Reported metrics are no longer averaged on the entire epoch
- Improve logging in asynchronous training
- Allow fixing word embeddings without providing pre-trained embeddings
- Fix pretrained word embeddings that were overridden by parameter initialization
- Fix error when using translation server with GPU model
- Fix gold data perplexity reporting during translation
- Fix wrong number of attention vectors returned by the translator
v0.4.1 (2017-02-16)
- Fix translation server error when clients send escaped unicode sequences
- Fix compatibility issue with the `:split()` function
v0.4.0 (2017-02-10)
- New translator API for better integration
- Profiler option
- Support hypotheses filtering during the beam search
- Support individually setting features vocabulary and embedding size
- [experimental] Scripts to interact with the benchmark platform
- [experimental] Language modeling example
- Improve beam search speed (up to 90% faster)
- Reduce released model size (up to 2x smaller)
- Fix tokenization of text containing the joiner marker character
- Fix `-joiner_new` option when using BPE
- Fix joiner marker generated without the option enabled
- Fix translation server crash on Lua errors
- Fix error when loading configuration files containing the `gpuid` option
- Fix BLEU drop when applying beam search on some models
- Fix error when using asynchronous parallel mode
- Fix non-SGD model serialization after retraining
- Fix error when using `-replace_unk` with empty sentences in the batch
- Fix error when translating an empty batch
v0.3.0 (2017-01-23)
- Rename `-epochs` option to `-end_epoch` to clarify its behavior
- Remove `-nparallel` option and support a list of comma-separated identifiers on `-gpuid`
- Rename `-sep_annotate` option to `-joiner_annotate`
- ZeroMQ translation server
- Advanced log management
- GRU cell
- Tokenization option to make the token separator an independent token
- Tokenization can run in parallel mode
- The Zero-Width Joiner Unicode character (ZWJ) is now tokenized, but as a joiner
- Fix Hangul tokenization
- Fix duplicated tokens in aggressive tokenization
- Fix error when using BRNN and multiple source features
- Fix error when preprocessing empty lines and using additional features
- Fix error when translating empty sentences
- Fix error when retraining a BRNN model on multiple GPUs
v0.2.0 (2017-01-02)
- `-seq_length` option is split into `-src_seq_length` and `-tgt_seq_length`
- Asynchronous SGD
- Detokenization
- BPE support in tokenization
- Smaller memory footprint during training
- Smaller released model size after a non-SGD training
- Fix out of memory errors in preprocessing
- Fix BRNN models serialization and release
- Fix error when retraining a model
- Fix error when using more than one feature
v0.1.0 (2016-12-19)
Initial release.