Is it possible to use the Metaspace pre-tokenizer without splitting? #1168
Comments
I came to a solution, but how funny it is that a problem that haunted you for days gets solved immediately once you seek help.
For a merge table containing spaces, please check out:
No no, that PR changes the representation of merges. Since there wasn't a huge push for it, that PR is still dangling (it would change the on-disk layout of the tokenizer, so there needs to be a strong reason for us to do it). However, in the short term, it should allow you to experiment with this.
Thanks to the community! That makes sense: using a space as the delimiter is not that formal, and I know the format is inherited from OpenAI GPT. For the time being I will use the metaspace workaround.
Hi to the community! Recently I'd like to do some work on code, where indentation, made up of several spaces, carries real semantics. So I think space runs of different lengths should be treated as different tokens, e.g. runs of spaces one, two, or three indent levels deep should each be their own token.
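For instance (an illustrative sketch of the desired behavior, not the output of any existing tokenizer; the snippet is hypothetical), the indentation in a small piece of Python source would ideally surface as one token per run of spaces:

```python
src = "if x:\n    y = 1\n        pass"

# Desired pre-tokenization: each run of leading spaces stays one piece,
# so the 4-space and 8-space indents become *different* tokens:
#   ["if", "x", ":", "\n", "    ", "y", "=", "1", "\n", "        ", "pass"]
```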
But the vocab cannot directly store a token that consists of several spaces, because each merge rule is saved as a space-delimited pair
a b
so when a (or b) is itself a run of spaces, the merge rule cannot be parsed unambiguously, and you will receive an error when you load such a tokenizer.json.
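To see why, here is a minimal sketch of the ambiguity (this mimics the space-delimited parsing of a merges entry; it is not the library's actual loading code):

```python
# A normal merge rule: merge "a" and "b" into "ab".
rule = "a b"
left, right = rule.split(" ")   # ("a", "b") -- unambiguous

# A hypothetical rule that would merge "  " (two spaces) with "  "
# (two more spaces): the stored line is just five spaces in a row.
rule = "     "
print(rule.split(" "))          # ['', '', '', '', '', ''] -- the original
# halves can no longer be recovered, which is why loading such a
# tokenizer.json fails.
```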
Following the usual convention, the natural fix is to replace each space with the metaspace character "▁" (U+2581). But pre_tokenizers.Metaspace not only replaces spaces with metaspace, it also splits them into individual pieces, i.e. "▁", "▁", "▁", "▁", whereas what I want is "▁▁▁▁".
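For example (a sketch against the Python `tokenizers` bindings; the exact `Metaspace` constructor arguments and the reported offsets may vary between versions):

```python
from tokenizers import pre_tokenizers

pre_tok = pre_tokenizers.Metaspace()        # replacement="▁" by default
pieces = pre_tok.pre_tokenize_str("    code")

# Each metaspace ends up as its own piece instead of staying grouped,
# roughly: [("▁", ...), ("▁", ...), ("▁", ...), ("▁code", ...)]
print([text for text, offsets in pieces])
```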
I didn't see any workaround. I can certainly group runs of spaces of different lengths using pre_tokenizers.Split(tokenizers.Regex(...)), but I cannot replace them with metaspace afterwards, since there is no method like pre_tokenizers.Replace. And when I use pre_tokenizers.Metaspace, it splits multiple metaspaces into single characters.

It is also not possible to manually add "▁▁▁▁" to the vocab and rules like "▁▁ ▁▁" to the merge table, since the Metaspace pre-tokenizer will always split the spaces into individual pieces, and if I remove the Metaspace pre-tokenizer, nothing handles the spaces at all.
What should I do? Thanks in advance!
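One direction that seems worth trying (a hedged sketch, assuming `normalizers.Replace` and `pre_tokenizers.Split` from the Python bindings behave as documented; this is not a confirmed fix from the maintainers): do the space-to-metaspace substitution in the normalizer, before any pre-tokenizer runs, and then split while keeping runs of ▁ together.

```python
from tokenizers import Regex, normalizers, pre_tokenizers

# Replace every raw space with ▁ during normalization, so no
# pre-tokenizer (and no merge rule) ever has to deal with " " itself.
normalizer = normalizers.Replace(Regex(" "), "▁")

# Split so that each run of ▁ is kept together as a single piece
# ("isolated" emits the matched run as its own token).
pre_tokenizer = pre_tokenizers.Split(Regex("▁+"), "isolated")

normalized = normalizer.normalize_str("    code")
print(pre_tokenizer.pre_tokenize_str(normalized))
# roughly: [("▁▁▁▁", (0, 4)), ("code", (4, 8))]
```

With this arrangement the merge table only ever sees ▁-based tokens such as "▁▁ ▁▁", which are unambiguous under the space-delimited merges format.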