I can not extend vocab of LLaMA-3 using sentencepiece anymore vs LLaMA-2 ?!? #67
Comments
Is there a way this could be handled in HF tokenizers? A few pointers and/or some code would really help a lot of folks. |
Llama 3 has an improved tokenizer based on Tiktoken, whereas Llama 2 was based on SentencePiece. The Llama 3 tokenizer expands the vocabulary size to 128k (from 32k tokens in the previous version). https://github.com/meta-llama/llama3/blob/main/llama/tokenizer.py Can you try AutoTokenizer instead of LlamaTokenizer? |
@amitsangani AutoTokenizer doesn't work. Ideally, the following was the go-to script to extend the tokenizer in LLaMA-2:
Upon changing LlamaTokenizer to AutoTokenizer and trying to extend the tokenizer on LLaMA-3, the following is the error:
cc @ArthurZucker does this look like an HF issue? |
I tried, but no luck. Some quick code would help. The 128k vocab still does not cover the basic vocabulary of Vietnamese. Thanks in advance. |
despite setting
|
Any help please...! |
@osanseviero @HamidShojanazeri - any ideas on how to resolve this? |
@StephennFernandes , Any update? I am also trying to do the same. |
I did it like this, and I am not sure whether it destroys the LLaMA-2 tokenizer or not! Please comment.
I can save the tokenizer, but reloading takes forever because the new tokens are not standard tokens but added ones. I am also NOT sure that adding tokens/words from SentencePiece training into the LLaMA-3 tiktoken tokenizer is correct either. Please comment with hints, if any. We need a solid solution from Meta. |
Hi all! This is not a from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("my-new-tokenizer") |
Okay, that pretty much solves this. I currently do this:
|
@amitsangani could you also share the steps for training a tiktoken tokenizer from scratch? Given that you have found better tokenizer efficiency, it would be great to train the extended tokenizer using tiktoken and merge it into the Llama tokenizer. |
I did it the way @thusinh1969 suggested. |
It is a completely different tokenizer. You have to do it like this:
Now you can use tokenizer_new_fast as the tokenizer as usual. |
@thusinh1969 |
No. It is a different error regarding your model setting, probably to do with gradient.
That should do it. |
FYI: in order to further finetune a LLaMA-3 finetuned model with this new extended tokenizer in the proper LLaMA-3 format, you have to change the ChatFormat function as follows:
|
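The modified ChatFormat code from the comment above did not survive in this copy of the thread, so the following is not that code. It is only a minimal string-level sketch of the prompt layout that the ChatFormat class in the linked llama3 tokenizer.py produces, which any replacement would presumably need to reproduce. The function name `render_llama3_dialog` and its `add_generation_prompt` parameter are hypothetical, chosen for illustration.

```python
def render_llama3_dialog(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts into Llama 3 chat text (illustrative helper)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Header: role wrapped in header markers, followed by a blank line.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        # Body: stripped content, closed by the end-of-turn marker.
        parts.append(message["content"].strip())
        parts.append("<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant header so the model continues with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_dialog([{"role": "user", "content": " Hello "}])
print(repr(prompt))
```

The rendered text can then be passed to whichever tokenizer is in use, as long as the special tokens in it are encoded as single special tokens rather than as plain text.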
Regarding efficiency, I'll check as well, the |
Something is WRONG. PreTrainedTokenizerFast (which LLaMA-3 uses) decodes to weird output once you add a token to the vocab using the .add_tokens(word) function. I use the standard tokenizer from the LLaMA-3 repo and add only ONE word to the original tokenizer, and...:
It does NOT use the newly added token at all?! Why? Any help, please. Something must be missing. |
When adding a new token , Why is that? @ArthurZucker @thusinh1969 |
@VishnuPJ are you saving the tokenizer and then expanding the token embedding by loading the tokenizer freshly? I don't clearly understand your error; can you elaborate?
Try doing this, then save the fast tokenizer, then freshly load the tokenizer as usual and try to expand the token embedding. |
Sorry for the confusion. I was able to add the tokens, and the tokenizer works as expected. But while running |
@VishnuPJ OK, this seems like a trainer issue. @thusinh1969 can you check what this issue could actually be? I'd recommend cross-checking your code with Chinese-LLaMA-Alpaca-2 in case you haven't already. Besides this, I feel only @ArthurZucker and/or @osanseviero could help us out with this. |
Regarding the newly added token, the "issue" is that you need to make sure you add the correct representation of the string:
>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> pre_tokenizers.ByteLevel(False,False).pre_tokenize_str("Bác")
[('Bác', (0, 3))]
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.decode(tokenizer.encode("Bác"))
'<|begin_of_text|>Bác' |
Since the strings are pre-tokenized to their bytelevel representation (it's not a normalization) then you need to add it using |
Thanks a lot @ArthurZucker 😊 it really means a ton !! |
This does not help. It creates 3 tokens for the single word "Bác", which is exactly what we want to avoid; it should be only 1 token.
This is very inefficient. |
Mmm no then it's not added properly, let me try again, sorry forgot to check the ids |
Ok:
>>> from tokenizers import AddedToken, pre_tokenizers
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.add_tokens(AddedToken("Bác", normalized=False,special=False))
>>> tokenizer.encode("Bác")
128256 # a new token this is alright, the only issue is the decoding. |
This issue is resolved. We need to add the below lines before calling
Previously I added those lines after |
@ArthurZucker so just for clarification, does the decoder produce char/byte-based tokenization while decoding? |
Yep overall the token that was added is |
I think the easiest solution is to simply make sure the Bytelevel decoder does not process the added tokens |
@ArthurZucker How do I output the bos_token? It doesn't work when I set "tokenizer.add_bos_token = True". Thanks |
huggingface/tokenizers#1513 will fix the issue for the new tokens |
Wonderful. Into which repo will it be merged, so we can go back and test? Cheers, |
@ArthurZucker I am confused about the Tokenizer for tiktoken training. What encoding is used for the corpus (such as cl100k_base, or p50k_base) when training the Tokenizer? What is the encoding of these characters? For example,['åIJ¦', 'ãĢĤ']
The input is Chinese characters and the output looks similar to this:
|
It is a unicode representation of the bytes!
>>> word = "否。"
>>> bword = b'\xe5\x90\xa6'
>>> decoded = b'\xe5\x90\xa6'.decode("latin-1")
>>> [ord(char) for char in decoded.decode("latin-1")]
[229, 144, 166] then you fetch the unicode representation of the bytes (which are supposed to come from utf-8): {33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i', 106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q', 114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y', 122: 'z', 123: '{', 124: '|', 125: '}', 126: '~', 161: '¡', 162: '¢', 163: '£', 164: '¤', 165: '¥', 166: '¦', 167: '§', 168: '¨', 169: '©', 170: 'ª', 171: '«', 172: '¬', 174: '®', 175: '¯', 176: '°', 177: '±', 178: '²', 179: '³', 180: '´', 181: 'µ', 182: '¶', 183: '·', 184: '¸', 185: '¹', 186: 'º', 187: '»', 188: '¼', 189: '½', 190: '¾', 191: '¿', 192: 'À', 193: 'Á', 194: 'Â', 195: 'Ã', 196: 'Ä', 197: 'Å', 198: 'Æ', 199: 'Ç', 200: 'È', 201: 'É', 202: 'Ê', 203: 'Ë', 204: 'Ì', 205: 'Í', 206: 'Î', 207: 'Ï', 208: 'Ð', 209: 'Ñ', 210: 'Ò', 211: 'Ó', 212: 'Ô', 213: 'Õ', 214: 'Ö', 215: '×', 216: 'Ø', 217: 'Ù', 218: 'Ú', 219: 'Û', 220: 'Ü', 221: 'Ý', 222: 'Þ', 223: 'ß', 224: 'à', 225: 'á', 226: 'â', 227: 'ã', 228: 'ä', 229: 'å', 230: 'æ', 231: 'ç', 232: 'è', 233: 'é', 234: 'ê', 235: 'ë', 236: 'ì', 237: 'í', 238: 'î', 239: 'ï', 240: 'ð', 241: 'ñ', 242: 'ò', 243: 'ó', 244: 'ô', 245: 'õ', 246: 'ö', 247: '÷', 248: 'ø', 249: 'ù', 250: 'ú', 251: 'û', 252: 'ü', 253: 'ý', 254: 'þ', 255: 'ÿ', 0: 'Ā', 1: 'ā', 2: 'Ă', 3: 'ă', 4: 'Ą', 5: 'ą', 6: 'Ć', 7: 'ć', 8: 'Ĉ', 
9: 'ĉ', 10: 'Ċ', 11: 'ċ', 12: 'Č', 13: 'č', 14: 'Ď', 15: 'ď', 16: 'Đ', 17: 'đ', 18: 'Ē', 19: 'ē', 20: 'Ĕ', 21: 'ĕ', 22: 'Ė', 23: 'ė', 24: 'Ę', 25: 'ę', 26: 'Ě', 27: 'ě', 28: 'Ĝ', 29: 'ĝ', 30: 'Ğ', 31: 'ğ', 32: 'Ġ', 127: 'ġ', 128: 'Ģ', 129: 'ģ', 130: 'Ĥ', 131: 'ĥ', 132: 'Ħ', 133: 'ħ', 134: 'Ĩ', 135: 'ĩ', 136: 'Ī', 137: 'ī', 138: 'Ĭ', 139: 'ĭ', 140: 'Į', 141: 'į', 142: 'İ', 143: 'ı', 144: 'IJ', 145: 'ij', 146: 'Ĵ', 147: 'ĵ', 148: 'Ķ', 149: 'ķ', 150: 'ĸ', 151: 'Ĺ', 152: 'ĺ', 153: 'Ļ', 154: 'ļ', 155: 'Ľ', 156: 'ľ', 157: 'Ŀ', 158: 'ŀ', 159: 'Ł', 160: 'ł', 173: 'Ń'} this basically allows you to represent any byte array in unicodes, simplifying the tokenization process. |
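The table above can be rebuilt in a few lines. The following is a minimal sketch of the GPT-2 style byte-to-character mapping, assuming that is the table the Llama 3 fast tokenizer's byte-level step uses; the helper names (`byte_to_char_table`, `to_byte_level`, `from_byte_level`) are made up for illustration and are not part of tokenizers or tiktoken. It maps each character of "否。" to the strings quoted in the question, and it also suggests why a raw added token such as "Bác" could decode oddly earlier in the thread if its characters were read back as bytes.

```python
def byte_to_char_table():
    """Map every byte value 0..255 to a printable character (GPT-2 style)."""
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    table = {b: chr(b) for b in printable}
    spare = 256  # remaining bytes get code points starting at U+0100
    for b in range(256):
        if b not in table:
            table[b] = chr(spare)
            spare += 1
    return table


BYTE_TO_CHAR = byte_to_char_table()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """UTF-8 encode, then swap each byte for its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(token):
    """Swap stand-in characters back to bytes, then UTF-8 decode."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


# One byte-level string per Chinese character from the question.
pieces = [to_byte_level(ch) for ch in "否。"]
print(pieces)

# A raw (not byte-level) string pushed through the inverse mapping:
# 'á' is read as the single byte 0xE1, which is not valid UTF-8 on its own.
print(repr(from_byte_level("B\u00e1c")))
print(to_byte_level("B\u00e1c"))
```

Under these assumptions, the byte-level form of a word round-trips cleanly through `from_byte_level`, while the raw form does not, which is consistent with the suggestion above that the ByteLevel decoder should skip added tokens.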
( |
Gents and @ArthurZucker, is the decoder fix merged somewhere already? Thanks, |
huggingface/tokenizers#1513 can be used, gonna merge today and prepare the update to |
How do I train a tiktoken tokenizer from scratch? I see even Phi-3 uses a tiktoken tokenizer, but I cannot find any documentation on how to train one. All help would be greatly appreciated. |
Train SentencePiece and merge; see the code above. But its decoder is buggy, hence we have to wait for the change to merge into HF's tokenizers package. @ArthurZucker when should we expect the change to be part of the official tokenizers package? Thanks, |
I know that we could train spm and merge, but that's not the point. My actual query was whether there is a way to train tiktoken from scratch, as I see other orgs using their own custom-trained versions of tiktoken, like the Phi-3 model used |
Gents, I installed tokenizers from source (tokenizers-0.19.1.dev0) from the main branch. It is now working.
I am closing the issue; we can now extend the vocab and continue pretraining LLaMA-3 further. Thanks @ArthurZucker et al., |
🤗 glad I was of help! |
I'm a newbie in the LLM field. I want to extend the Llama 3 tokenizer with a Korean corpus. Can anyone help, please? |
@ArthurZucker when I run the code above, I get the error below:
2 bword = b'\xe5\x90\xa6'
3 decoded = b'\xe5\x90\xa6'.decode("latin-1")
----> 4 [ord(char) for char in decoded.decode("latin-1")]
AttributeError: 'str' object has no attribute 'decode' |
Hey! Decoded is already a string, you probably wanted to do |
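The suggested replacement is cut off in the comment above. One reading that reproduces the `[229, 144, 166]` output shown earlier in the thread is to iterate over the already decoded string directly, without a second `.decode` call. This is a sketch of that reading, not necessarily the exact line that was proposed.

```python
raw = b"\xe5\x90\xa6"            # UTF-8 bytes of "否"
as_text = raw.decode("latin-1")  # latin-1 yields exactly one character per byte
code_points = [ord(ch) for ch in as_text]  # as_text is a str, so no .decode here
print(code_points)
```

Since latin-1 maps each byte to the code point with the same value, `list(raw)` gives the same numbers.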
@amitsangani |
I usually extend the vocab to bring the model closer to the Vietnamese language. The code is below. However, it seems that the LLaMA-3 tokenizer no longer works with SentencePiece. Even LlamaTokenizer is no longer compatible with LLaMA-3. Any hints, please?
Meanwhile, the standard AutoTokenizer can no longer load the new LLaMA-3 tokenizer.model. Any help is highly appreciated.
Thanks,
Steve