Refactor spm_encode #952

Closed · wants to merge 49 commits

Commits
35c2eba
Add "nfkc_code" normalization rules
vmarkovtsev May 18, 2023
9e15573
Add multiple memory and performance optimizations
vmarkovtsev May 18, 2023
8fc56b5
Add installation instructions and use external abseil
vmarkovtsev May 18, 2023
180a6fb
Allow switching between internal and external abseil
vmarkovtsev May 18, 2023
527af0c
Fix the overflow errors
vmarkovtsev May 18, 2023
23c0c09
Update the built-in protobuf sources
vmarkovtsev May 19, 2023
f64cbf4
Repair the python package build
vmarkovtsev May 19, 2023
20be455
Allow missing tcmalloc
vmarkovtsev May 19, 2023
8d932e5
Support --verbatim_control_char to mark source code sentences
vmarkovtsev May 22, 2023
8f26fad
Ensure that Encode/Decode restores whitespace in the code
vmarkovtsev May 22, 2023
20f12dc
Fix not adding all-whitespace pairs on --verbatim_control_char
vmarkovtsev May 23, 2023
6faa2d5
Get rid of TBB to sort in parallel
vmarkovtsev May 23, 2023
27f78bd
Parallel spm_encode
vmarkovtsev May 23, 2023
d020431
Indicate progress in spm_encode
vmarkovtsev May 23, 2023
581efe6
Link external protobuf statically
vmarkovtsev May 24, 2023
4d156ed
Cache sentence frequencies to a file
vmarkovtsev May 24, 2023
e8f8d3b
Implement merging cached bpe frequency dicts
vmarkovtsev May 25, 2023
28f7bf1
Reduce memory pressure in bpe trainer symbols construction
vmarkovtsev May 26, 2023
4f984b0
Fix memory pressure in sorting final merged bpe cache
vmarkovtsev May 26, 2023
8e28f03
Initialize bpe symbols in parallel
vmarkovtsev May 26, 2023
429c830
Successfully load 80% of the Pile
vmarkovtsev May 26, 2023
cc6a20f
Switch to 32-bit bpe pointers
vmarkovtsev May 26, 2023
f637bf0
Parallelize bpe pairs construction
vmarkovtsev May 26, 2023
aeb2bd7
Polish UpdateActiveSymbols
vmarkovtsev May 26, 2023
ebd85f3
Add cache only mode in GetPairSymbol
vmarkovtsev May 26, 2023
cc8c56d
Make bpe position updates parallel
vmarkovtsev May 28, 2023
b6c7666
Avoid the second overhead of GetCachedPairSymbol()
vmarkovtsev May 29, 2023
17acb6d
Optimize the parallel BPE iterations
vmarkovtsev May 29, 2023
ba48031
Balance sorting and freq computation
vmarkovtsev May 29, 2023
150ce9d
Fix encoding non-verbatim text
vmarkovtsev May 31, 2023
d1a4bb4
Add some install instructions for finding the build path.
joerowell Oct 5, 2023
5cec47d
Fixing tests (#6)
kuba-- Oct 28, 2023
f0821a6
Add development documentation
Oct 31, 2023
dc661b8
Merge pull request #10 from poolsideai/devdocs
rbehjati Nov 1, 2023
8985958
handle 0x01~0x04 delimiters
yiyangh-ps Oct 26, 2023
f5f0392
Refactor code
yiyangh-ps Oct 31, 2023
dde84eb
Process lines fewer than 1000
yiyangh-ps Oct 23, 2023
515e65b
Support mixed code-text format
yiyangh-ps Oct 25, 2023
d092158
Fix build command
yiyangh-ps Oct 25, 2023
a237ae4
Remove unused eos/bos/verbatim_control_char
yiyangh-ps Oct 25, 2023
752a0c0
Remove 0x04 from encoding sequence
yiyangh-ps Oct 26, 2023
7f16648
Refactor encoding using MixedTextCodeIterator
yiyangh-ps Oct 31, 2023
d16e5da
Read from stdin
Oct 30, 2023
1bce046
Fix a race condition issue
yiyangh-ps Nov 3, 2023
aa566d3
Add ReadLineStdin to allow reading from stdin
Nov 13, 2023
c4bd5ea
Use std::shared_ptr and a slow-down mechanism
Nov 14, 2023
61265db
Fixes after review
Nov 15, 2023
ce412d7
Add log lines instead of assert
Nov 18, 2023
2055686
Refactor spm_encode_main
vmarkovtsev Dec 19, 2023
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,3 +1,4 @@
.idea
Makefile
Makefile.in
/ar-lib
@@ -72,6 +73,5 @@ libsentencepiece.so*
libsentencepiece_train.so*
python/bundled
_sentencepiece.*.so
third_party/abseil-cpp

python/sentencepiece
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "third_party/abseil-cpp"]
path = third_party/abseil-cpp
url = https://github.com/abseil/abseil-cpp.git
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.!

cmake_minimum_required(VERSION 3.1 FATAL_ERROR)
cmake_minimum_required(VERSION 3.9 FATAL_ERROR)
file(STRINGS "VERSION.txt" SPM_VERSION)
message(STATUS "VERSION: ${SPM_VERSION}")

24 changes: 24 additions & 0 deletions README.md
@@ -18,6 +18,30 @@ with the extension of direct training from raw sentences. SentencePiece allows u

**This is not an official Google product.**

## Vadim's notes

Proper installation:

```
sudo apt install libgoogle-perftools-dev
cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D SPM_USE_EXTERNAL_ABSL=off -D SPM_ENABLE_TCMALLOC=on -D SPM_ENABLE_NFKC_COMPILE=on ..
```

1. The built-in abseil's containers are aliases to the stdlib ones. Building with a real abseil is WIP.
2. Adding new spec options requires regenerating the protobuf sources.
3. tcmalloc is a must: the stdlib's malloc fails to return freed memory to the system.
4. NFKC compilation is needed to edit the normalization rules.

If you want to install the library and you are not installing the Python package globally, you may need to set the package configuration path:

```
make install
ldconfig
cd ../python
PKG_CONFIG_PATH=../build python -m pip install -ve .
```


## Technical highlights
- **Purely data driven**: SentencePiece trains tokenization and detokenization
models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.