LLamaCausalLM add support for tokenizer.json #9967

robbiemu · 2024-10-20T22:58:09Z

SentencePiece by default uses BPE, which by default also uses a tokenizer.json. The tokenizer.json does not have to be customized with tokens that cannot be read from tokenizer.model, but as BSC-LT/salamandra-7b and related models show, it can be. This PR modifies the LlamaModel class to augment the vocabulary from the JSON file if it is present.
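To illustrate the idea, here is a minimal sketch of that kind of augmentation. It is not the code in this PR: the function name `augment_vocab_from_tokenizer_json` and the list-of-strings vocabulary are hypothetical, and it only assumes the usual Hugging Face tokenizer.json layout (`model.vocab` as a token-to-id mapping, `added_tokens` as a list of entries with `id` and `content`).

```python
import json
from pathlib import Path


def augment_vocab_from_tokenizer_json(tokens, model_dir):
    """Return a copy of `tokens` extended with entries found in tokenizer.json.

    `tokens` is a list indexed by token id, a hypothetical stand-in for the
    vocabulary already read from tokenizer.model. Ids that tokenizer.model
    already defines are left untouched; only missing ids are filled in.
    """
    path = Path(model_dir) / "tokenizer.json"
    if not path.is_file():
        return list(tokens)  # no tokenizer.json, nothing to augment

    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    # Collect token -> id pairs from the BPE vocab and the added tokens.
    entries = {}
    vocab = data.get("model", {}).get("vocab", {})
    if isinstance(vocab, dict):  # BPE stores a mapping; other types may not
        entries.update(vocab)
    for added in data.get("added_tokens", []):
        entries[added["content"]] = added["id"]

    merged = list(tokens)
    for text, token_id in sorted(entries.items(), key=lambda kv: kv[1]):
        if token_id >= len(merged):
            # Grow the list; ids with no known token stay as None.
            merged.extend([None] * (token_id + 1 - len(merged)))
        if merged[token_id] is None:
            merged[token_id] = text
    return merged
```

In this sketch, gaps in the id range are left as `None` so the caller can decide how to pad them; a real converter would also need to carry token types and scores, which are omitted here.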

related issue: #9899

Note: this PR was recreated because I made the mistake of force-pushing the other day after pulling the fork to update my PR branch, which made all the changes in master look like they were being reapplied. This cost a like by LIN72H on the original PR :( I won't do it again.

robbiemu added 5 commits October 17, 2024 23:10

basic concept

3c86af2

basic concept

730756f

Merge remote-tracking branch 'origin/llamacasuallm-sp-bpe' into llama…

a8e48e3

…casuallm-sp-bpe

basic concept

ff906dc

Merge remote-tracking branch 'origin/llamacasuallm-sp-bpe' into llama…

d89f49b

…casuallm-sp-bpe

github-actions bot added the python python script changes label Oct 20, 2024

robbiemu added 2 commits October 20, 2024 19:03

line endings check in PR

c3363f6

Auto stash before rebase of "llamacasuallm-sp-bpe" onto "master"

413a19e

sed -i '' 's/\r$//' .gitignore (apparently I had a couple in there?)

ngxson requested a review from compilade October 22, 2024 13:56
