From what I understand, the allow_whitespace_only_pieces training argument, implemented in the word-level pretokeniser at this line, allows multiple spaces to appear next to each other in the strings that result from the pretokeniser (let's call them "pre-tokens"). Because the trainer gets its substrings from inside pre-tokens, having multiple spaces in one pre-token allows it to learn tokens consisting of more than one space.
I have two questions:
Is this not a confusing name for this option? When allow_whitespace_only_pieces is false, the pretokeniser produces pre-tokens that consist of whitespace only, which is completely counterintuitive. (It also means that at least one whitespace-only token will still be allowed.)
For my application, what I need is what you would actually expect an option called "allow whitespace-only pieces" to do: produce pre-tokens containing only whitespace, and never mix whitespace with non-whitespace in a token. Is this straightforward to achieve with training options, or does it need extra implementation?
To illustrate all of this with an example: the sentence This is a test sentence. is split as follows in the three cases outlined above:
allow_whitespace_only_pieces = false: This ▁is ▁a ▁ ▁ ▁ ▁test ▁sentence. (seemingly allows pieces that are whitespace-only)
allow_whitespace_only_pieces = true: This ▁is ▁a ▁▁▁▁test ▁sentence.
What I need: This ▁ is ▁ a ▁▁▁▁ test ▁ sentence.
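The splitting behaviour I am after could be sketched as a pretokenisation pass that emits alternating runs of whitespace and non-whitespace, so the two never mix inside a pre-token. This is just an illustrative helper (the function name and approach are my own, not part of the library):

```python
import re

def split_whitespace_runs(text):
    # Split into alternating runs of whitespace (\s+) and
    # non-whitespace (\S+), so no pre-token mixes the two.
    return re.findall(r"\s+|\S+", text)

print(split_whitespace_runs("This is a    test sentence."))
# ['This', ' ', 'is', ' ', 'a', '    ', 'test', ' ', 'sentence.']
```

A trainer that only drew substrings from pre-tokens like these could never learn a token mixing whitespace with non-whitespace, which is the third splitting shown above.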
Thanks.