
feature request - support huggingface tokenizer #1764

Closed
keunwoochoi opened this issue Jun 8, 2023 · 8 comments
@keunwoochoi

currently in llama.cpp, convert.py assumes a tokenizer.model file in the model path. seems like this works for any case that uses a sentencepiece tokenizer, but nothing else.

huggingface's tokenizers library is neat and provides more options than sentencepiece. it would be really great if ggml supported any tokenizer from huggingface. i believe this means it'd expect merges.txt and vocab.json.
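
For context (not part of the original comment), a minimal sketch of the two formats involved, with hypothetical paths: sentencepiece reads the tokenizer.model binary that convert.py expects today, while huggingface's tokenizers library reads tokenizer.json (or the vocab.json + merges.txt pair).

import sentencepiece as spm
from tokenizers import Tokenizer

# what convert.py expects today: a sentencepiece model file (hypothetical path)
sp = spm.SentencePieceProcessor(model_file="models/my-model/tokenizer.model")
print(sp.encode("Hello world", out_type=int))

# what a huggingface "fast" tokenizer ships instead: tokenizer.json (hypothetical path)
tok = Tokenizer.from_file("models/my-model/tokenizer.json")
print(tok.encode("Hello world").ids)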

@Ender436

Ender436 commented Oct 9, 2023

I'm trying to convert a model that I've trained and exported in huggingface format, but when I try to convert it, I get an error saying it can't find vocab.json. What exactly is this file, and where would I get it?

@goerch
Collaborator

goerch commented Oct 10, 2023

vocab.json is the vocabulary part of tokenizer.json. I believe these are serialization formats from different versions of the HF tokenizers library.
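
As an illustration (not from the original comment), a minimal sketch of pulling the vocabulary out of tokenizer.json into a standalone vocab.json, assuming the common layout where it lives under model.vocab:

import json

# a minimal sketch, assuming the usual HF layout: {"model": {"vocab": {...}, "merges": [...]}}
with open("tokenizer.json", "r", encoding="utf-8") as f:
    tokenizer_data = json.load(f)

vocab = tokenizer_data["model"]["vocab"]  # token string -> token id

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)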

@Ender436

Ok, thank you! I've tried taking the vocab section out and putting it in its own file, but I'm getting this error:

raise Exception(f"Expected the {len(actual_ids)} added token ID(s) to be sequential in the range {vocab_size} - {expected_end_id}; got {actual_ids}")
Exception: Expected the 3 added token ID(s) to be sequential in the range 32000 - 32002; got [0, 1, 2]

Any idea what I'm doing wrong? Thanks for the help.
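
Not from the thread, but the error message itself spells out what the check expects: with a 32000-entry base vocab, added tokens should carry IDs appended at the end (32000, 32001, 32002 here), whereas [0, 1, 2] look like the IDs of <unk>/<s>/</s>, which are already part of the base sentencepiece vocab. A hypothetical added_tokens.json that would pass the check (token names are made up; assuming the usual token-to-id layout) could be written like this:

import json

# hypothetical new tokens; IDs continue right after the 32000-entry base vocab
added_tokens = {
    "<|new_token_a|>": 32000,
    "<|new_token_b|>": 32001,
    "<|new_token_c|>": 32002,
}

with open("added_tokens.json", "w", encoding="utf-8") as f:
    json.dump(added_tokens, f, indent=2)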

@teleprint-me
Contributor

teleprint-me commented Oct 11, 2023

The tokenizer defines the mapping between the vocab and the integer encodings that represent a sequence for a language model. That's what sentencepiece provides.

SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.

So, I could train on a vocab of a fixed size using differing subword algorithms like BPE, Unigram, etc.; they are all supported (see the training sketch after the session below).

16:29:54 | /mnt/valerie/stanford_alpaca
(.venv) git:(main | Δ) λ bpython            
bpython version 0.24 on top of Python 3.11.5 /mnt/valerie/stanford_alpaca/.venv/bin/python
>>> import sentencepiece as spm
>>> model_path = "models/tatsu-lab/alpaca-7b/tokenizer.model"
>>> sp = spm.SentencePieceProcessor(model_file=model_path)
>>> vocab_size = sp.get_piece_size()
>>> print(vocab_size)
32000
>>> sp.encode("This is a sample sentence with <s>special</s> tokens.", out_type=int)
[910, 338, 263, 4559, 10541, 411, 529, 29879, 29958, 18732, 829, 29879, 29958, 18897, 29889]
>>> 
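
As a sketch of the training side mentioned above (not from the original comment; the corpus path is hypothetical), sentencepiece exposes the subword algorithm via model_type while the vocab size stays fixed:

import sentencepiece as spm

# a minimal sketch, assuming a sufficiently large plain-text corpus at a hypothetical path
for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",              # one sentence per line
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,                 # 32000 in the alpaca model above; needs a big enough corpus
        model_type=model_type,           # also supports "char" and "word"
    )
    # produces tok_bpe.model / tok_bpe.vocab, tok_unigram.model / tok_unigram.vocab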

A transformer model creates associations for next-token prediction using the vocab during training, usually via SGD (stochastic gradient descent). The gradient lets the model error-correct throughout the process; this is where the model's weights are adjusted. A transformer model might use the Root Mean Square Epsilon as the derivative? (I'm still not sure about this part.) Someone more experienced with this can feel free to correct me.


I had GPT analyze my comment and this was part of its analysis.

GPT: Regarding the Root Mean Square Epsilon, I believe you might be referring to optimization algorithms like RMSprop. As previously mentioned, RMSprop is not a derivative, but it uses the gradient of the loss with respect to model weights.
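
To make that correction concrete (not from the original comment), a minimal sketch of an RMSprop-style update: the gradient of the loss drives the step, and the epsilon term only guards against division by zero.

import numpy as np

# a minimal sketch of the RMSprop update rule (toy numbers, not a real model)
def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    cache = decay * cache + (1.0 - decay) * grad**2   # running average of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)        # scale the step by the root mean square
    return w, cache

w = np.array([0.5, -0.3])
cache = np.zeros_like(w)
grad = np.array([0.1, -0.2])                          # gradient of the loss w.r.t. w
w, cache = rmsprop_step(w, grad, cache)
print(w)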


So, the tokenizer.json will have the encoded vocabulary mapping in JSON format. We can use this freely to decipher what the encodings are mapped to.

17:16:10 | /mnt/valerie/stanford_alpaca
(.venv) git:(main | Δ) λ bpython
bpython version 0.24 on top of Python 3.11.5 /mnt/valerie/stanford_alpaca/.venv/bin/python
>>> import json
>>> def load_tokenizer_data(file_path):
...     """Load tokenizer data from a given JSON file."""
...     with open(file_path, "r", encoding="utf-8") as json_file:
...         return json.load(json_file)
... 
>>> def find_key_path(data, target_key):
...     """Recursively search for a key in a nested dictionary and return its path as a list."""
...     if isinstance(data, dict):
...         if target_key in data:
...             return [target_key]
...         for key, value in data.items():
...             path = find_key_path(value, target_key)
...             if path:
...                 return [key] + path
...     elif isinstance(data, list):
...         for idx, item in enumerate(data):
...             path = find_key_path(item, target_key)
...             if path:
...                 return [idx] + path
...     return None
... 
>>> tokenizer_data = load_tokenizer_data("models/tatsu-lab/alpaca-7b/tokenizer.json")
>>> tokenizer_data["model"]["type"]
'BPE'
>>> find_key_path(tokenizer_data, "▁This")
['model', 'vocab', '▁This']
>>> tokenizer_data["model"]["vocab"]["▁This"]
910
>>> 

It's more complicated than this, but this is my general understanding at the moment.

https://huggingface.co/docs/tokenizers/quicktour
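
Tying the two sessions above together (not from the original comment): inverting model.vocab from tokenizer.json gives an id-to-piece mapping, so the IDs produced by the sentencepiece encode() call earlier can be read back as pieces.

import json

# a minimal sketch, reusing the tokenizer.json loaded in the session above
with open("models/tatsu-lab/alpaca-7b/tokenizer.json", "r", encoding="utf-8") as f:
    tokenizer_data = json.load(f)

vocab = tokenizer_data["model"]["vocab"]                 # piece -> id
id_to_piece = {i: piece for piece, i in vocab.items()}   # id -> piece

ids = [910, 338, 263, 4559, 10541]                       # first few IDs from the encode() call above
print([id_to_piece[i] for i in ids])                     # should recover the leading pieces, e.g. '▁This', ...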

@Ender436

I tried just removing the items in added_tokens.json, and the model converted, but it seems to have lost its ability to stop on its own now, and just rambles about random things after finishing saying what I'd expect.

@strutive07
Contributor

strutive07 commented Oct 15, 2023

Hi. I wrote a PR that runs convert.py based on the huggingface tokenizer, without tokenizer.model. I hope this helps.

PR: #3633

@keunwoochoi
Author

@strutive07 that's great! thanks for letting us know

@github-actions github-actions bot added the stale label Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
