How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer? #1545

I have a model that uses BloomTokenizerFast, which does not have properties like byte_decoder or sp_model, so I can't figure out how it implements the mapping between byte values and Unicode characters. Looking through the source code, I only found that the pre_tokenize_str function converts input text characters into Unicode characters, but I didn't see the mapping it relies on. How can I find this mapping? Or is the mapping used by the fast tokenizer the same as GPT-2's?
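For context, the byte-to-unicode table being asked about is, as far as I know, the one introduced in GPT-2's encoder.py and reused by the ByteLevel pre-tokenizer that byte-level BPE tokenizers such as BloomTokenizerFast rely on. A minimal sketch that reconstructs it in pure Python, assuming that standard GPT-2 scheme:

```python
def bytes_to_unicode():
    # Printable ASCII and most Latin-1 bytes map to themselves;
    # every other byte is shifted into unused code points starting
    # at U+0100, so that no mapped character is invisible.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()                       # byte value -> unicode char
byte_decoder = {c: b for b, c in byte_encoder.items()}  # unicode char -> byte value
print(byte_encoder[32])  # the space byte maps to 'Ġ'
```

Inverting the dict, as in the last lines above, gives the same kind of byte_decoder mapping that slow tokenizers such as GPT2Tokenizer expose as an attribute.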
Comments
Hey! I suppose you are using …
Thank you for your reply, but I didn't fully understand what you meant. After using tokenizer._tokenizer.model I got a BPE object, but I didn't see the attribute I was looking for, namely the mapping from byte values to Unicode. Could you explain it a bit more clearly, please?
You cannot see any attributes because both …
So, is it impossible to read this mapping from the fast tokenizer?
It is coming with the PR that I linked 😉
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Closing as we do have the capabilities merged now!
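For anyone landing here on an older version: the 256-character alphabet is also visible from the ByteLevel pre-tokenizer itself. A small sketch, assuming a tokenizers release that exposes ByteLevel.alphabet() and pre_tokenize_str:

```python
from tokenizers.pre_tokenizers import ByteLevel

# The full 256-character alphabet that byte values are mapped into.
alphabet = ByteLevel.alphabet()
print(len(alphabet))  # 256

# The mapping in action: the space byte shows up as 'Ġ'.
pre = ByteLevel(add_prefix_space=False)
print(pre.pre_tokenize_str("hello world"))
# [('hello', (0, 5)), ('Ġworld', (5, 11))]
```

Note that alphabet() only tells you which characters are used, not their order; to recover the byte-value-to-character pairing you still need the GPT-2-style table sketched above.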