feature request - support huggingface tokenizer #1764
I'm trying to convert a model that I've trained and exported in huggingface format, but when I try to convert it, I get an error saying it can't find vocab.json. What exactly is this file, and where would I get it?
Ok, thank you! I've tried taking the vocab section out and putting it in its own file, but I'm getting this error: raise Exception(f"Expected the {len(actual_ids)} added token ID(s) to be sequential in the range {vocab_size} - {expected_end_id}; got {actual_ids}") Any idea what I'm doing wrong? Thanks for the help.
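In case it matters, this is roughly how I split it out (just a quick sketch of what I tried; the paths are my local layout):

import json

# sketch: copy the "vocab" section of a huggingface tokenizer.json into a
# standalone vocab.json next to the model (paths are illustrative)
with open("models/tatsu-lab/alpaca-7b/tokenizer.json", "r", encoding="utf-8") as f:
    tokenizer_data = json.load(f)

vocab = tokenizer_data["model"]["vocab"]  # token -> id mapping

with open("models/tatsu-lab/alpaca-7b/vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)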
The tokenizer is the mapping between the vocab and the encoded token sequence a language model actually sees; that's what sentencepiece is for.
So, I could train a model on a vocab of a fixed size and then use differing techniques like BPE, SPE, Unigram, etc. They all represent supported subword algorithms.

16:29:54 | /mnt/valerie/stanford_alpaca
(.venv) git:(main | Δ) λ bpython
bpython version 0.24 on top of Python 3.11.5 /mnt/valerie/stanford_alpaca/.venv/bin/python
>>> import sentencepiece as spm
>>> model_path = "models/tatsu-lab/alpaca-7b/tokenizer.model"
>>> sp = spm.SentencePieceProcessor(model_file=model_path)
>>> vocab_size = sp.get_piece_size()
>>> print(vocab_size)
32000
>>> sp.encode("This is a sample sentence with <s>special</s> tokens.", out_type=int)
[910, 338, 263, 4559, 10541, 411, 529, 29879, 29958, 18732, 829, 29879, 29958, 18897, 29889]
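Side note: the same encode call with out_type=str returns the subword pieces instead of ids, which makes the mapping easier to see. A quick sketch reusing the sp object from above (calls only, output omitted):

# pieces and ids line up one-to-one: piece_to_id() on each piece returns the
# same ids that encode(..., out_type=int) printed above
pieces = sp.encode("This is a sample sentence with <s>special</s> tokens.", out_type=str)
ids = [sp.piece_to_id(p) for p in pieces]
print(list(zip(pieces, ids)))  # e.g. ('▁This', 910) is the first pair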
A transformer model creates associations for next-token prediction using the vocab during training, usually with SGD (stochastic gradient descent). The derivative allows the model to error-correct throughout the process; this is where the model's weights are adjusted. A transformer model might use Root Mean Square Epsilon as the derivative? (I'm still not sure about this part.) Someone more experienced with this can feel free to correct me. I had GPT analyze my comment and this was part of its analysis.
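Roughly, one SGD step just nudges each weight against its gradient; a toy sketch with made-up numbers (not from any real model):

# toy illustration of a single SGD update: each weight moves a small step
# against its gradient, which is what "adjusting the weights" means here
weights = [0.5, -1.2, 0.3]
gradients = [0.1, -0.4, 0.05]  # dLoss/dWeight, made up for illustration
learning_rate = 0.01

weights = [w - learning_rate * g for w, g in zip(weights, gradients)]
print(weights)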
So, the tokenizer.json itself:

17:16:10 | /mnt/valerie/stanford_alpaca
(.venv) git:(main | Δ) λ bpython
bpython version 0.24 on top of Python 3.11.5 /mnt/valerie/stanford_alpaca/.venv/bin/python
>>> import json
>>> def load_tokenizer_data(file_path):
... """Load tokenizer data from a given JSON file."""
... with open(file_path, "r", encoding="utf-8") as json_file:
... return json.load(json_file)
...
>>> def find_key_path(data, target_key):
... """Recursively search for a key in a nested dictionary and return its path as a list."""
... if isinstance(data, dict):
... if target_key in data:
... return [target_key]
... for key, value in data.items():
... path = find_key_path(value, target_key)
... if path:
... return [key] + path
... elif isinstance(data, list):
... for idx, item in enumerate(data):
... path = find_key_path(item, target_key)
... if path:
... return [idx] + path
... return None
...
>>> tokenizer_data = load_tokenizer_data("models/tatsu-lab/alpaca-7b/tokenizer.json")
>>> tokenizer_data["model"]["type"]
'BPE'
>>> find_key_path(tokenizer_data, "▁This")
['model', 'vocab', '▁This']
>>> tokenizer_data["model"]["vocab"]["▁This"]
910
It's more complicated than this, but this is my general understanding at the moment.
I tried just removing the items in added_tokens.json, and the model converted, but it seems to have lost its ability to stop on its own now, and just rambles about random things after finishing saying what I'd expect.
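If it's the EOS token that got dropped somewhere along the way, that would explain the rambling. A quick way to sanity-check the original sentencepiece model (just a guess on my part; the path is illustrative):

import sentencepiece as spm

# if the converted model never emits the end-of-sequence token, generation
# won't stop on its own; check that it's still defined in the base tokenizer
sp = spm.SentencePieceProcessor(model_file="models/tatsu-lab/alpaca-7b/tokenizer.model")
print(sp.bos_id(), sp.id_to_piece(sp.bos_id()))  # typically 1 and '<s>' for llama
print(sp.eos_id(), sp.id_to_piece(sp.eos_id()))  # typically 2 and '</s>'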
Hi. I wrote a PR that runs convert.py based on the huggingface tokenizer, without tokenizer.model. I hope this helps. PR: #3633
@strutive07 that's great! thanks for letting us know
This issue was closed because it has been inactive for 14 days since being marked as stale.
currently in llama.cpp, convert.py assumes a tokenizer.model file in the model path. seems like this works for any case that uses a sentencepiece tokenizer, but nothing else. huggingface's tokenizers library is neat and provides more options than sentencepiece. it would be really great if ggml supported any tokenizer from huggingface. i believe this means it'd expect merges.txt and vocab.json.
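for what it's worth, the huggingface tokenizers library can also load everything straight from tokenizer.json, so a converter may not even need separate merges.txt / vocab.json files. a minimal sketch (path illustrative, not what convert.py does today):

from tokenizers import Tokenizer

# load a huggingface tokenizer directly from tokenizer.json -- no
# sentencepiece tokenizer.model required (path is illustrative)
tok = Tokenizer.from_file("models/tatsu-lab/alpaca-7b/tokenizer.json")
enc = tok.encode("This is a sample sentence.")
print(enc.ids)     # token ids
print(enc.tokens)  # corresponding subword pieces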