
feature request - support huggingface tokenizer #1764

Closed
keunwoochoi opened this issue Jun 8, 2023 · 8 comments
@keunwoochoi

currently in llama.cpp, convert.py assumes a tokenizer.model file in the model path. seems like this works for any case that uses a sentencepiece tokenizer, but nothing else.

huggingface's tokenizers library is neat and provides more options than sentencepiece. it would be really great if ggml supported any tokenizer from huggingface. i believe this means it'd expect merges.txt and vocab.json.
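
For context (not part of the original comment), a minimal sketch of the two formats involved, with hypothetical paths: sentencepiece reads the tokenizer.model binary that convert.py expects today, while huggingface's tokenizers library reads tokenizer.json (or the vocab.json + merges.txt pair).

import sentencepiece as spm
from tokenizers import Tokenizer

# what convert.py expects today: a sentencepiece model file (hypothetical path)
sp = spm.SentencePieceProcessor(model_file="models/my-model/tokenizer.model")
print(sp.encode("Hello world", out_type=int))

# what a huggingface "fast" tokenizer ships instead: tokenizer.json (hypothetical path)
tok = Tokenizer.from_file("models/my-model/tokenizer.json")
print(tok.encode("Hello world").ids)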

@Ender436

Ender436 commented Oct 9, 2023

I'm trying to convert a model that I've trained and exported in huggingface format, but when I try to convert it, I get an error saying it can't find vocab.json. What exactly is this file, and where would I get it?

@goerch
Collaborator

goerch commented Oct 10, 2023

vocab.json is the vocabulary part of tokenizer.json. I believe these are serialization formats from different versions of the HF tokenizers library.
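
As an illustration (not from the original comment), a minimal sketch of pulling the vocabulary out of tokenizer.json into a standalone vocab.json, assuming the common layout where it lives under model.vocab:

import json

# a minimal sketch, assuming the usual HF layout: {"model": {"vocab": {...}, "merges": [...]}}
with open("tokenizer.json", "r", encoding="utf-8") as f:
    tokenizer_data = json.load(f)

vocab = tokenizer_data["model"]["vocab"]  # token string -> token id

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)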

@Ender436

Ok, thank you! I've tried taking the vocab section out and putting it in its own file, but I'm getting this error:

raise Exception(f"Expected the {len(actual_ids)} added token ID(s) to be sequential in the range {vocab_size} - {expected_end_id}; got {actual_ids}")
Exception: Expected the 3 added token ID(s) to be sequential in the range 32000 - 32002; got [0, 1, 2]

Any idea what I'm doing wrong? Thanks for the help.
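
Not from the thread, but the error message itself spells out what the check expects: with a 32000-entry base vocab, added tokens should carry IDs appended at the end (32000, 32001, 32002 here), whereas [0, 1, 2] look like the IDs of <unk>/<s>/</s>, which are already part of the base sentencepiece vocab. A hypothetical added_tokens.json that would pass the check (token names are made up; assuming the usual token-to-id layout) could be written like this:

import json

# hypothetical new tokens; IDs continue right after the 32000-entry base vocab
added_tokens = {
    "<|new_token_a|>": 32000,
    "<|new_token_b|>": 32001,
    "<|new_token_c|>": 32002,
}

with open("added_tokens.json", "w", encoding="utf-8") as f:
    json.dump(added_tokens, f, indent=2)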

@teleprint-me
Contributor

teleprint-me commented Oct 11, 2023

The tokenizer defines the mapping between the vocab and the integer encodings that represent a sequence for a language model. That's what sentencepiece provides.

SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.

So, I could train on a vocab of a fixed size using differing subword algorithms like BPE, Unigram, etc.; they are all supported (see the training sketch after the session below).

16:29:54 | /mnt/valerie/stanford_alpaca
(.venv) git:(main | Δ) λ bpython            
bpython version 0.24 on top of Python 3.11.5 /mnt/valerie/stanford_alpaca/.venv/bin/python
>>> import sentencepiece as spm
>>> model_path = "models/tatsu-lab/alpaca-7b/tokenizer.model"
>>> sp = spm.SentencePieceProcessor(model_file=model_path)
>>> vocab_size = sp.get_piece_size()
>>> print(vocab_size)
32000
>>> sp.encode("This is a sample sentence with <s>special</s> tokens.", out_type=int)
[910, 338, 263, 4559, 10541, 411, 529, 29879, 29958, 18732, 829, 29879, 29958, 18897, 29889]
>>> 
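
As a sketch of the training side mentioned above (not from the original comment; the corpus path is hypothetical), sentencepiece exposes the subword algorithm via model_type while the vocab size stays fixed:

import sentencepiece as spm

# a minimal sketch, assuming a sufficiently large plain-text corpus at a hypothetical path
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",              # one sentence per line
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,                 # 32000 in the alpaca model above; needs a big enough corpus
        model_type=model_type,           # also supports "char" and "word"
    )
    # produces tok_bpe.model / tok_bpe.vocab, tok_unigram.model / tok_unigram.vocab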

A transformer model creates associations for next-token prediction using the vocab during training, usually via SGD (stochastic gradient descent). The gradient lets the model error-correct throughout the process; this is where the model's weights are adjusted. A transformer model might use the Root Mean Square Epsilon as the derivative? (I'm still not sure about this part.) Someone more experienced with this can feel free to correct me.


I had GPT analyze my comment and this was part of its analysis.

GPT: Regarding the Root Mean Square Epsilon, I believe you might be referring to optimization algorithms like RMSprop. As previously mentioned, RMSprop is not a derivative, but it uses the gradient of the loss with respect to model weights.
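
To make that correction concrete (not from the original comment), a minimal sketch of an RMSprop-style update: the gradient of the loss drives the step, and the epsilon term only guards against division by zero.

import numpy as np

# a minimal sketch of the RMSprop update rule (toy numbers, not a real model)
def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    cache = decay * cache + (1.0 - decay) * grad**2   # running average of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)        # scale the step by the root mean square
    return w, cache

w = np.array([0.5, -0.3])
cache = np.zeros_like(w)
grad = np.array([0.1, -0.2])                          # gradient of the loss w.r.t. w
w, cache = rmsprop_step(w, grad, cache)
print(w)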


So, the tokenizer.json will have the encoded vocabulary mapping in JSON format. We can use this freely to decipher what the encodings are mapped to.

17:16:10 | /mnt/valerie/stanford_alpaca
(.venv) git:(main | Δ) λ bpython
bpython version 0.24 on top of Python 3.11.5 /mnt/valerie/stanford_alpaca/.venv/bin/python
>>> import json
>>> def load_tokenizer_data(file_path):
...     """Load tokenizer data from a given JSON file."""
...     with open(file_path, "r", encoding="utf-8") as json_file:
...         return json.load(json_file)
... 
>>> def find_key_path(data, target_key):
...     """Recursively search for a key in a nested dictionary and return its path as a list."""
...     if isinstance(data, dict):
...         if target_key in data:
...             return [target_key]
...         for key, value in data.items():
...             path = find_key_path(value, target_key)
...             if path:
...                 return [key] + path
...     elif isinstance(data, list):
...         for idx, item in enumerate(data):
...             path = find_key_path(item, target_key)
...             if path:
...                 return [idx] + path
...     return None
... 
>>> tokenizer_data = load_tokenizer_data("models/tatsu-lab/alpaca-7b/tokenizer.json")
>>> tokenizer_data["model"]["type"]
'BPE'
>>> find_key_path(tokenizer_data, "▁This")
['model', 'vocab', '▁This']
>>> tokenizer_data["model"]["vocab"]["▁This"]
910
>>> 

It's more complicated than this, but this is my general understanding at the moment.

https://huggingface.co/docs/tokenizers/quicktour
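
Tying the two sessions above together (not from the original comment): inverting model.vocab from tokenizer.json gives an id-to-piece mapping, so the IDs produced by the sentencepiece encode() call earlier can be read back as pieces.

import json

# a minimal sketch, reusing the tokenizer.json loaded in the session above
with open("models/tatsu-lab/alpaca-7b/tokenizer.json", "r", encoding="utf-8") as f:
    tokenizer_data = json.load(f)

vocab = tokenizer_data["model"]["vocab"]                 # piece -> id
id_to_piece = {i: piece for piece, i in vocab.items()}   # id -> piece

ids = [910, 338, 263, 4559, 10541]                       # first few IDs from the encode() call above
print([id_to_piece[i] for i in ids])                     # should recover the leading pieces, e.g. '▁This', ...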

@Ender436

I tried just removing the items in added_tokens.json, and the model converted, but it seems to have lost its ability to stop on its own now, and just rambles about random things after finishing saying what I'd expect.

@strutive07
Contributor

strutive07 commented Oct 15, 2023

Hi. I wrote a PR that runs convert.py based on the huggingface tokenizer, without tokenizer.model. I hope this helps.

PR: #3633

@keunwoochoi
Author

@strutive07 that's great! thanks for letting us know

@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
