Merge pull request #400 from sjosund/master
Removes the reference to Korean having no spaces between words
taku910 authored Sep 17, 2019
2 parents bca47c0 + bd0ea9b commit 60e42a7
Showing 1 changed file with 1 addition and 1 deletion.
README.md:

@@ -60,7 +60,7 @@ The number of merge operations is a BPE-specific parameter and not applicable to
 
 #### Trains from raw sentences
 Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance.
-The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese, Japanese and Korean where no explicit spaces exist between words.
+The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.
 
 #### Whitespace is treated as a basic symbol
 The first step of Natural Language processing is text tokenization. For

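The changed lines claim that SentencePiece trains directly on raw, untokenized sentences. A minimal sketch of that workflow with the SentencePiece Python bindings follows; the file name corpus.txt, the vocabulary size, and the sample sentence are placeholder assumptions, not part of this commit.

# Minimal sketch: train a SentencePiece model from raw, untokenized
# sentences, with no language-dependent pre-tokenizer in front of it.
# "corpus.txt" and the parameter values below are placeholders.
import sentencepiece as spm

# corpus.txt holds one raw sentence per line; no prior word segmentation
# is required, which is what makes this usable for Chinese and Japanese.
spm.SentencePieceTrainer.train(
    input='corpus.txt',      # raw text, one sentence per line
    model_prefix='m',        # writes m.model and m.vocab
    vocab_size=8000,
    model_type='unigram',    # the default; 'bpe' is also supported
)

sp = spm.SentencePieceProcessor(model_file='m.model')
pieces = sp.encode('吾輩は猫である。', out_type=str)  # subword pieces
print(pieces)
print(sp.decode(pieces))     # lossless detokenization back to raw text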