Releases · google/sentencepiece
v0.1.92
v0.1.91
New API
- [Python] Added a feature to feed training data as a Python iterable object (see the sketch after this list).
  https://github.com/google/sentencepiece/tree/master/python#training-without-local-filesystem
- [Python] Added a feature to set a model writer so the output model can be emitted to any non-local device.
  https://github.com/google/sentencepiece/tree/master/python#training-without-local-filesystem
- [C++] Added an API that returns the trained model directly as a std::string.
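A minimal sketch of the new Python training flow, assuming a placeholder corpus file and vocab size:

```python
import io
import sentencepiece as spm

# Feed training data from any Python iterable and emit the trained model
# to an in-memory writer instead of the local filesystem.
def sentences():
    with open("corpus.txt", "r", encoding="utf-8") as f:  # placeholder corpus
        for line in f:
            yield line.strip()

model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=sentences(), model_writer=model, vocab_size=1000)

# Load the serialized model directly, without writing it to disk.
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
```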
Bug Fixes
- Ignore the nbest parameter in BPE-dropout.
- Fixed a build error when SPM_ENABLE_NFKC_COMPILE=ON.
- Fixed the cost computation around user_defined_symbol and the faster encoding introduced in the previous release.
v0.1.90
Features:
- --byte_fallback: Fall back UNK tokens into UTF-8 byte sequences; 256 byte symbols are reserved in advance.
  https://arxiv.org/pdf/1909.03341.pdf
  Note that you need to set --character_coverage to less than 1.0, otherwise byte-fallback tokens may not appear in the training data.
- BPE-dropout: Implemented BPE-dropout. https://arxiv.org/abs/1910.13267
  The sampling API is available for BPE (see the sketch after this list).
  https://github.com/google/sentencepiece/blob/master/src/sentencepiece_processor.h#L287
- --required_chars=chars: Specify the set of Unicode characters that must be included in the final vocab.
- --split_digits: Split all digits (0-9) into separate pieces (disabled by default)
- Denormalization: Apply an extra normalization rule after decoding. The rule can be specified as TSV via the --denormalization_rule_tsv=file flag. Note that offset information may not always be preserved.
- --train_extremely_large_corpus: Train the unigram model from an extremely large corpus (>10M sentences) while avoiding integer overflow. Note that this increases memory usage; 300GB or more of memory might be necessary.
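A sketch combining several of the new trainer flags with BPE-dropout sampling; the file names, vocab size, and dropout probability are placeholders, and combining all of these flags with a BPE model is an illustrative assumption:

```python
import sentencepiece as spm

# Placeholder corpus/model names; flags correspond to the options above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="m",
    vocab_size=8000,
    model_type="bpe",            # BPE model so that BPE-dropout sampling applies
    byte_fallback=True,          # --byte_fallback: UNK falls back to UTF-8 bytes
    character_coverage=0.9995,   # keep below 1.0, as noted above
    split_digits=True,           # --split_digits
)

sp = spm.SentencePieceProcessor(model_file="m.model")

# BPE-dropout sampling: for BPE models, alpha is the dropout probability
# and the nbest parameter is ignored.
print(sp.encode("New York", out_type=str, enable_sampling=True, alpha=0.1))
```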
Performance improvement:
- A 30%-50% performance improvement in the default unigram one-best tokenization.
New API
- [Python] Added a Python-friendly API (see the sketch after this list). The new API allows feeding any characters to user_defined_symbols during training. The old methods are still available.
  https://github.com/google/sentencepiece/tree/master/python#segmentation
- [C++] Added an interface to feed training data via an arbitrary iterator object.
  https://github.com/google/sentencepiece/blob/master/src/sentencepiece_trainer.h#L40
- [C++] Added an interface to set a pre-tokenizer that specifies word boundaries. This is used as a word-boundary constraint when building the seed vocabulary and is not used at inference time.
  https://github.com/google/sentencepiece/blob/master/src/pretokenizer_for_training.h
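As a quick illustration of the Python-friendly segmentation API (the model file name is a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")  # placeholder model file

print(sp.encode("This is a test.", out_type=str))   # subword pieces
print(sp.encode("This is a test.", out_type=int))   # piece ids
print(sp.decode(sp.encode("This is a test.", out_type=int)))
```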
v0.1.86
v0.1.85
v0.1.84
v0.1.83
SentencePiece re-release
Releases a new version of SentencePiece with major refactorings:
- Builds with Bazel
- Re-uses existing open source libraries whenever possible
- Refactors internal dependencies
- New sets of features for configuring tokenizers
- Separation from TensorFlow