is it possible to use Metaspace pretokenizer without split them? #1168

Maxlinn · 2023-03-06T06:57:26Z

Hi community, I've recently been working on code, where indentation, made up of several spaces, carries a lot of semantics. So I think runs of spaces of different lengths should be treated as different tokens, that is, each run length would get its own token.

But the vocab cannot directly store a token that consists of several spaces, because each entry in the merge table is saved as "a b". When a (or b) is made of consecutive spaces, the merge rule cannot be determined, and you will receive an error when you load such a tokenizer.json.

Following the trend, it seems suitable to replace the space with the metaspace "▁" (U+2581). But pre_tokenizers.Metaspace not only replaces spaces with metaspaces, it also splits them into individual characters. That is, four spaces become "▁","▁","▁","▁", whereas what I want is "▁▁▁▁".

I didn't see any workaround. I can certainly group runs of spaces of different lengths using pre_tokenizers.Split(tokenizers.Regex), but I cannot then replace them with metaspaces, since there is no method like pre_tokenizers.Replace. And when I use pre_tokenizers.Metaspace, it splits multiple metaspaces into single characters.
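For illustration, the grouping step mentioned above might look like the following. This is only a sketch, assuming the `tokenizers` Python package is installed; the pattern `" +"` and the sample string are made up for the example.

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

# Isolate every run of one or more spaces as its own piece.
# "isolated" keeps the matched run as a separate piece instead of dropping it.
pre = Split(Regex(" +"), behavior="isolated")

sample = "if x:\n    y"  # made-up input with a four-space indent
pieces = [piece for piece, _offsets in pre.pre_tokenize_str(sample)]
print(pieces)
```

The four-space indent comes out as one piece, but it is still made of plain spaces, which is exactly the part that cannot be stored in the merge table.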

It is also not possible to manually add "▁▁▁▁" to the vocab and rules like "▁▁ ▁▁" to the merge table, since the Metaspace pre_tokenizer will always split spaces into individual characters. And if I remove the Metaspace pre_tokenizer, nothing handles the spaces or turns them into metaspaces.

What should I do? Thanks in advance!

Maxlinn · 2023-03-06T09:24:12Z

I arrived at a solution: use normalizers.Replace(' ', '▁') and do not use pre_tokenizers.Metaspace. In the decoder we can still use decoders.Metaspace.
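A minimal sketch of that setup, assuming the `tokenizers` Python package and an empty BPE model that would still need to be trained; the sample string is made up.

```python
from tokenizers import Tokenizer, decoders, normalizers
from tokenizers.models import BPE

# Replace every space with the metaspace character during normalization.
space_to_metaspace = normalizers.Replace(" ", "▁")

tokenizer = Tokenizer(BPE())  # untrained model, for illustration only
tokenizer.normalizer = space_to_metaspace
tokenizer.decoder = decoders.Metaspace()
# Deliberately no pre_tokenizers.Metaspace here, so runs of "▁" are not split.

normalized = space_to_metaspace.normalize_str("    return x")
print(normalized)  # ▁▁▁▁return▁x
```

Because the replacement happens in the normalizer, the model only ever sees "▁" and the merge table never has to contain a literal space.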

But normalizers.Replace seems really slow, so I do the replacement in the dataloader instead. Also, when using Unicode normalizers like normalizers.NFKC, some characters other than space will produce a space, causing consecutive spaces in the merge table; those entries can be deleted manually.
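The dataloader-side replacement could be sketched with the standard library alone. The helper name `to_metaspace` is hypothetical, and running NFKC before the replacement is one possible ordering (an assumption, not something described above) so that spaces produced by NFKC are converted too.

```python
import unicodedata

METASPACE = "\u2581"  # "▁"

def to_metaspace(text: str) -> str:
    """Hypothetical dataloader helper: normalize first, then swap spaces."""
    # NFKC maps some characters (e.g. U+00A0 NO-BREAK SPACE) to a plain space.
    text = unicodedata.normalize("NFKC", text)
    # Replace after normalizing so those new spaces become metaspaces as well.
    return text.replace(" ", METASPACE)

print(to_metaspace("    x\u00a0y"))  # ▁▁▁▁x▁y
```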

Funny how a problem haunts you for days and is solved right after you ask for help.

Narsil · 2023-03-06T09:29:22Z

For the merge table containing spaces, please check out:

#909

Maxlinn · 2023-03-06T10:27:18Z

Many thanks for replying! It seems the solution proposed in #909 is to replace the space with another character like Ġ and then replace it back? That doesn't seem much different from replacing it with metaspace.

Narsil · 2023-03-06T11:05:31Z

No no, that PR changes the representation of merges from "a b" to ["a", "b"] so that we could include spaces within the merges themselves.

Since there wasn't a huge push for it, that PR is still dangling (it would change the on-disk save layout of the tokenizer, so there needs to be a strong reason for us to do it).

However, it should allow you to experiment with it in the short term.
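To make the difference between the two merge representations concrete, here is a plain-Python illustration. It is not the library's actual parser; `parse_legacy_merge` is a hypothetical helper that only mimics splitting an "a b" line on the space delimiter.

```python
import json

def parse_legacy_merge(line: str) -> tuple[str, str]:
    """Hypothetical parser for the "a b" form: split on the space delimiter."""
    left, right = line.split(" ")  # ValueError unless exactly two parts
    return left, right

# Metaspace tokens parse fine because they contain no literal space.
ok = parse_legacy_merge("▁▁ ▁▁")

# A merge of " " and " " is written as three spaces and cannot be split back.
try:
    parse_legacy_merge("   ")
    ambiguous = False
except ValueError:
    ambiguous = True

# The pair form keeps both sides explicit, so spaces survive a round trip.
roundtrip = json.loads(json.dumps([" ", " "]))
print(ok, ambiguous, roundtrip)
```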

Maxlinn · 2023-03-06T11:38:17Z

Thanks to the community! That makes sense, since using a space as a delimiter is not very formal; I know the format is inherited from OpenAI GPT.

For the time being, I will use the metaspace workaround, because I will later have to convert it to merges.txt as OpenAI does.

Maxlinn closed this as completed Mar 6, 2023
