Is it possible to use the Metaspace pre-tokenizer without splitting consecutive spaces? #1168

Closed
Maxlinn opened this issue Mar 6, 2023 · 5 comments


Comments


Maxlinn commented Mar 6, 2023

Hi to the community. Recently I've been working with code, where indentation, made of several spaces, carries a lot of meaning. So I think runs of spaces of different lengths should be treated as different tokens.

But the vocab cannot directly store a token made of several spaces, because each entry in the merge table is saved as the string "a b". When a (or b) is itself a run of spaces, the merge rule cannot be parsed unambiguously, and you will get an error when you load such a tokenizer.json.
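A minimal sketch of why a space-delimited merge rule breaks down when the tokens themselves contain spaces. `parse_merge` is a hypothetical helper written for illustration; it is not the library's actual parser, only an assumption about how an "a b" line would be split.

```python
from typing import Tuple


def parse_merge(line: str) -> Tuple[str, str]:
    """Split a merge rule stored as 'left right' on the single space delimiter.

    Hypothetical helper for illustration only, not the tokenizers parser.
    """
    parts = line.split(" ")
    if len(parts) != 2:
        raise ValueError(f"cannot parse merge rule unambiguously: {line!r}")
    return parts[0], parts[1]


# Metaspace tokens parse fine, because "▁" is not the delimiter.
print(parse_merge("▁▁ ▁▁"))

# A rule merging two runs of spaces is just a longer run of spaces,
# so there is no way to tell where the left token ends.
try:
    parse_merge("  " + " " + "  ")
except ValueError as err:
    print(err)
```

Replacing the space with another character before training sidesteps the problem, since the delimiter then never appears inside a token.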

Following the common practice, it seems suitable to replace each space with the metaspace character "▁" (U+2581). But pre_tokenizers.Metaspace not only replaces spaces with metaspace, it also splits them into individual pieces, that is, "▁","▁","▁","▁". What I want is "▁▁▁▁".

I didn't see any workaround. I can certainly group runs of spaces of different lengths using pre_tokenizers.Split with a tokenizers.Regex, but I cannot then replace them with metaspace, since there is nothing like a pre_tokenizers.Replace. And when I use pre_tokenizers.Metaspace, it splits multiple metaspaces into single characters.

It is also not possible to manually add "▁▁▁▁" to the vocab and rules like "▁▁ ▁▁" to the merge table, since the Metaspace pre-tokenizer will always split the spaces into individual pieces before the merges apply. If I remove the Metaspace pre-tokenizer, nothing converts the spaces into metaspace characters.

What should I do? Thanks in advance!

Author

Maxlinn commented Mar 6, 2023

I arrived at a solution: use normalizers.Replace(' ', '▁') and do not use pre_tokenizers.Metaspace. In the decoder we can still use decoders.Metaspace.

But normalizers.Replace seems really slow, so I do the replacement in the dataloader instead. Also, if you use Unicode normalizers like normalizers.NFKC, some characters other than the plain space will be normalized into a space, which puts consecutive spaces into the merge table. Those entries can be deleted manually.
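A minimal sketch of doing the replacement in the data pipeline rather than in the tokenizer, using only the standard library. The function names `preprocess` and `postprocess` are hypothetical. Running NFKC before the replacement is an assumption added here, not something stated in the thread: it is one way to make sure characters such as the no-break space (U+00A0), which NFKC maps to a plain space, are also converted to "▁" instead of leaking spaces into the training text.

```python
import unicodedata

METASPACE = "\u2581"  # "▁"


def preprocess(text: str) -> str:
    """Normalize first, then map every space to the metaspace character."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.replace(" ", METASPACE)


def postprocess(text: str) -> str:
    """Plain inverse mapping; assumes the original text contained no "▁"."""
    return text.replace(METASPACE, " ")


line = "    return x"
encoded = preprocess(line)
print(encoded)                       # ▁▁▁▁return▁x
print(postprocess(encoded) == line)  # True

# U+00A0 becomes a plain space under NFKC, then a metaspace.
print(preprocess("a\u00a0b"))        # a▁b
```

With this ordering, any NFKC step inside the tokenizer no longer sees characters that would turn into spaces, so the merge table should not need manual cleanup. Note that NFKC also rewrites other compatibility characters, which may or may not be acceptable for source code.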

Funny how a problem can haunt you for days and then get solved the moment you ask for help.

Maxlinn closed this as completed Mar 6, 2023
Collaborator

Narsil commented Mar 6, 2023

For the merge table containing spaces, please check out:

#909

Author

Maxlinn commented Mar 6, 2023

> For the merge table containing spaces, please check out:
>
> #909

Many thanks for replying! It seems the solution proposed in #909 is to replace the space with another character like Ġ and then replace it back? That does not seem much different from replacing spaces with metaspace.

Collaborator

Narsil commented Mar 6, 2023

No no, that PR changes the representation of merges from "a b" to ["a", "b"] so that we could include spaces within the merges themselves.

Since there wasn't a huge push for it, that PR is still dangling (it would change the on-disk layout of the saved tokenizer, so there needs to be a strong reason for us to do it).

However, in the short term it should allow you to experiment with it.
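A rough illustration of the idea behind storing each merge as a pair rather than as one space-joined string. This is only a sketch using the standard `json` module; the `"merges"` key and the overall layout are illustrative assumptions, not the actual file format produced by that PR.

```python
import json

# Merge rules whose tokens may themselves contain spaces.
merges = [("▁▁", "▁▁"), (" ", " "), ("  ", "  ")]

# Each rule is serialized as a two-element list, so no delimiter is needed.
payload = json.dumps({"merges": [list(pair) for pair in merges]}, ensure_ascii=False)
print(payload)

# Loading restores every pair exactly, including the all-space tokens.
restored = [tuple(pair) for pair in json.loads(payload)["merges"]]
print(restored == merges)  # True
```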

Author

Maxlinn commented Mar 6, 2023

> No no, that PR changes the representation of merges from "a b" to ["a", "b"] so that we could include spaces within the merges themselves.
>
> Since there wasn't a huge push for it, that PR is still dangling (it would change the on-disk layout of the saved tokenizer, so there needs to be a strong reason for us to do it).
>
> However, in the short term it should allow you to experiment with it.

Thanks to the community! That makes sense, since using a space as the delimiter is not very robust. I know the format is inherited from OpenAI GPT.

For the time being I will use the metaspace workaround, because I later have to convert the result to a merges.txt file in the same format OpenAI uses.
