Loading GGUF files support #30391
Conversation
Co-authored-by: Younes Belkada <[email protected]> Co-authored-by: 99991 <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
🥳 - exciting to see this added!
Mostly nits and small comments. Addition of gguf flags to common logic in modeling_utils makes me slightly uneasy e.g.
if gguf_path is None and (low_cpu_mem_usage or (use_keep_in_fp32_modules and is_accelerate_available())):
It indicates to me that passing gguf through modeling_utils isn't really compatible. I think it's OK at the moment. If we find there are other formats we want to support, we might have to restructure the logic flow so that we're not passing these flags around everywhere.
Main comment is about the structure of the if/else statements in the code.
I have no idea about the intended logic for dequantization - looks sensible to me but I haven't looked in depth at those methods :)
docs/source/en/gguf.md (Outdated)
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, from_gguf=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, from_gguf=filename)
What would happen if I passed a quantization config in with the from_pretrained call? gguf -> unquantized -> requantized?
I see this is handled in modeling utils ❤️
tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory}
It would be nice if we had this within save_pretrained using e.g. a save_gguf flag.
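For illustration only, the kind of API this suggests might look like the following — a hypothetical sketch, since `save_gguf` is not implemented in this PR (the author notes below that it is planned for a follow-up), and it assumes `model`/`tokenizer` were loaded from a GGUF file as above:

```python
# Hypothetical flag -- not part of this PR.
model.save_pretrained("directory", save_gguf=True)      # would write a .gguf checkpoint directly
tokenizer.save_pretrained("directory", save_gguf=True)
```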
Yes, this is part of the full integration, will do that in a follow-up PR!
if "fast_tokenizer_files" in tokenizer_config:
    fast_tokenizer_file = get_fast_tokenizer_file(tokenizer_config["fast_tokenizer_files"])
    vocab_files["tokenizer_file"] = fast_tokenizer_file
if not from_gguf:
ultranit - it's a bit funny to define the default case as "not gguf", i.e. it centers on gguf as how we look at our objects. If we end up adding another format, this would then have to follow the pattern "if not x and not y"; it's easier to do `if from_gguf` and `else`.
Makes sense to change it: e6c6f6c
@@ -112,6 +115,10 @@ def __init__(self, *args, **kwargs):
        elif slow_tokenizer is not None:
            # We need to convert a slow tokenizer to build the backend
            fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
        elif from_gguf is not None:
            # We need to convert a slow tokenizer to build the backend
            tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
Does this work if kwargs.get("vocab_file") is None?
The correct vocab file is always passed here: #30391 (comment), so this is unlikely to happen, but if one passes None it will indeed fail.
def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
nit - more of a stylistic choice - the checking pattern in the if/elif/else statement looks like it lends itself to an IntEnum
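For illustration, the IntEnum pattern hinted at here could look roughly like this — a sketch only; the F32/F16/Q4_0 ids follow the public GGML type numbering, and the actual dequantization bodies are elided:

```python
from enum import IntEnum

class GGMLType(IntEnum):
    # GGML/GGUF tensor type ids (only a few shown here)
    F32 = 0
    F16 = 1
    Q4_0 = 2

def load_dequant_gguf_tensor(shape, ggml_type, data):
    # IntEnum members compare equal to plain ints, so callers can keep passing raw type ids
    if ggml_type == GGMLType.F32:
        values = data
    elif ggml_type == GGMLType.F16:
        values = data
    else:
        raise NotImplementedError(f"GGML type {ggml_type} is not dequantized in this sketch")
    return values.reshape(shape[::-1])
```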
if "llama" in architecture and "mistral" in model_name:
    updated_architecture = "mistral"
Why is this the case?
unfortunately this is because mistral in llama.cpp uses the exact same arch as llama 😢 will add a comment explaining
if architecture == "llama" and (".attn_k." in name or ".attn_q." in name):
    num_heads = parsed_parameters["config"]["num_attention_heads"]
    tmp_shape = (int(shape[-1] // num_heads // 2), num_heads, 2, shape[0])
    weights = weights.reshape(tmp_shape)
    weights = weights.transpose(0, 2, 1, 3)
    weights = weights.reshape(shape[::-1])
We'll want to make this more general when we have more models - a problem for future us!
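One possible shape for that generalisation, sketched with hypothetical names (the registry and helper below are not part of the PR; the permute itself mirrors the quoted code):

```python
import numpy as np

def reverse_llama_qk_permute(weights: np.ndarray, shape, num_heads: int) -> np.ndarray:
    # Undo the interleaved rotary layout llama.cpp uses for Q/K projection weights
    tmp_shape = (int(shape[-1] // num_heads // 2), num_heads, 2, shape[0])
    return weights.reshape(tmp_shape).transpose(0, 2, 1, 3).reshape(shape[::-1])

# Hypothetical per-architecture registry of tensor fix-ups
TENSOR_FIXUPS = {
    ("llama", ".attn_q."): reverse_llama_qk_permute,
    ("llama", ".attn_k."): reverse_llama_qk_permute,
}

def apply_tensor_fixups(architecture, name, weights, shape, num_heads):
    for (arch, fragment), fixup in TENSOR_FIXUPS.items():
        if arch == architecture and fragment in name:
            return fixup(weights, shape, num_heads)
    return weights
```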
if len(reader_keys) > 0:
    logger.info(f"Some keys of the GGUF file were not considered: {reader_keys}")
Nice :)
Looks great!
AddedToken("<unk>", normalized=False, special=True),
AddedToken("<s>", normalized=False, special=True),
AddedToken("</s>", normalized=False, special=True),
are these always the same? For Llama-based models, no. This would add extra tokens and can mess up the order etc.
not addressed! the added_tokens should all be added, and the special tokens as well.
    ],
    axis=1,
)
I would probably completely separate what was entirely vendored and what we added. Splitting the file here
some methods above were written by us so not entirely vendored
Still some are vendored -> which ones? Which ones did we write ourselves?
class GGUFLlamaConverter(LlamaConverter):
    def __init__(self, tokenizer_dict):
        self.proto = GGUFTokenizerSkeleton(tokenizer_dict)
        self.original_tokenizer = self.proto
the original tokenizer is usually a PreTrainedTokenizer, not a proto
elif from_gguf is not None:
    # We need to convert a slow tokenizer to build the backend
    tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
    fast_tokenizer = convert_gguf_tokenizer(tokenizer_dict)
note for myself: convert from tiktoken could also be added here. Maybe a mapping from_xxx with the function?
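A rough sketch of that mapping idea — the dict and helper below are hypothetical; `load_gguf_checkpoint` and `convert_gguf_tokenizer` come from this PR (their import paths are assumed here), and a tiktoken converter does not exist in it:

```python
from transformers.modeling_gguf_pytorch_utils import load_gguf_checkpoint
from transformers.integrations.ggml import convert_gguf_tokenizer  # assumed location

# Hypothetical dispatch table: checkpoint format -> fast-tokenizer builder
TOKENIZER_FORMAT_CONVERTERS = {
    "gguf": lambda vocab_file: convert_gguf_tokenizer(load_gguf_checkpoint(vocab_file)["tokenizer"]),
    # "tiktoken": convert_tiktoken_tokenizer,  # could be added later
}

def build_fast_tokenizer(checkpoint_format, vocab_file):
    return TOKENIZER_FORMAT_CONVERTERS[checkpoint_format](vocab_file)
```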
out = model.generate(**text, max_new_tokens=10)

EXPECTED_TEXT = "<s> Hello,\n\nI'm trying to create a"
self.assertEqual(tokenizer.decode(out[0], skip_special_tokens=True), EXPECTED_TEXT)
missing tests on special tokens / additional special tokens:
- do we correctly skip special tokens like GGUF would
- do we add extra spaces or not like GGUF
- check out some of these tests: https://github.com/huggingface/transformers/blob/main/tests/models/llama/test_tokenization_llama.py#L731
Makes sense! As discussed offline, I fixed some issues and added some tests here: 3bdbb2e
Ok, can you add just one test with added tokens? Something like ".Hey How.Hey<token>. <token>" with <token> being one of gguf.added_tokens? (so or or )
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
2 nits and 1 test to add!
        )
        return tokenizer

    def decoder(self, replacement, add_prefix_space):
Is add_prefix_space defined in the gguf? It might not be good to always take it from the class (which is what's happening now).
It is not defined from what I read in the GGML docs + when inspecting various checkpoints from the Hub
So it's always adding a prefix space I suppose?
Looks great! Thanks again for adding this - excited to see it in action 🎬
Two general comments:
- The tokenizer logic @ArthurZucker highlighted will need to be addressed before merge
- `from_gguf` as a flag name doesn't align with the other `from_xxx` flags in `from_pretrained`, e.g. `from_tf`, which are bools. Could we align it with a flag closer to its meaning, e.g. `gguf_id` or `gguf_file`? (See the sketch after this list.)
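The rename suggested here was adopted later in the PR (from_gguf -> gguf_file), so loading ends up looking like this; the repo id and filename below are illustrative, taken from a later comment in this thread:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"   # illustrative repo id
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```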
@@ -658,6 +659,8 @@ def _get_config_dict(
        from_auto_class = kwargs.pop("_from_auto", False)
        commit_hash = kwargs.pop("_commit_hash", None)

        from_gguf = kwargs.get("from_gguf", None)
Should this be pop here?
- from_gguf = kwargs.get("from_gguf", None)
+ from_gguf = kwargs.pop("from_gguf", None)
Here I think it should be get, as from_gguf is used later in case one uses Auto classes.
Ah, OK!
tests/quantization/ggml/test_ggml.py (Outdated)
# Otherwise the test takes too long
if i > 100:
    break
A cleaner way to do this is to take a slice of the dataset so that you iterate over a small subset.
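A minimal sketch of that suggestion, assuming a `datasets` Dataset (the dataset name and loop body below are illustrative, not from the PR's test):

```python
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # any small text dataset
subset = dataset.select(range(100))  # fixed-size slice instead of breaking out of the loop

for example in subset:
    _ = example["text"]  # stands in for the body of the original test loop
```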
Makes sense! Done in 65433c4
Thanks both for the extensive review! 🚀
tokenizer.add_special_tokens(
    [AddedToken(added_token, normalized=False, special=False) for added_token in self.added_tokens]
)
Not all of them are special here. You can add them all as special
@younesbelkada this just means that added tokens that are not special will be skipped when decoding.
* Adds support for loading GGUF files
  Co-authored-by: Younes Belkada <[email protected]>
  Co-authored-by: 99991 <[email protected]>
* add q2_k q3_k q5_k support from @99991
* fix tests
* Update doc
* Style
* Docs
* fix CI
* Update docs/source/en/gguf.md
* Update docs/source/en/gguf.md
* Compute merges
* change logic
* add comment for clarity
* add comment for clarity
* Update src/transformers/models/auto/tokenization_auto.py
  Co-authored-by: amyeroberts <[email protected]>
* change logic
* Update src/transformers/modeling_utils.py
  Co-authored-by: amyeroberts <[email protected]>
* change
* Apply suggestions from code review
  Co-authored-by: amyeroberts <[email protected]>
* Update src/transformers/modeling_gguf_pytorch_utils.py
  Co-authored-by: amyeroberts <[email protected]>
* put back comment
* add comment about mistral
* comments and added tests
* fix unconsistent type
* more
* fix tokenizer
* Update src/transformers/modeling_utils.py
  Co-authored-by: amyeroberts <[email protected]>
* address comments about tests and tokenizer + add added_tokens
* from_gguf -> gguf_file
* replace on docs too

---------

Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: 99991 <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
is this correct, or in progress in the v4.41 release?
Hi @brandon-lockaby!
Same error. Created and loaded from this repo: https://huggingface.co/brandonglockaby/Meta-Llama-3-8B-Q4_K_M-GGUF. I should point out that previous attempts were ggufs that work correctly with current releases of llama.cpp and llama-cpp-python.
Indeed I was able to repro, this is because the tokenizer is registered as
@brandon-lockaby - #31175 has been merged and might include a fix for the issue you are facing, can you try to re-run the snippet using the transformers main branch?
Same error related to the tokenizer filename, produced with an updated repo from gguf-my-repo as well as a gguf from my storage.
Hi @brandon-lockaby
Hi @younesbelkada, I have tried #31358, and now the tokenizer can be loaded successfully. However, when I attempt to load the GGUF model, an OSError occurs:
Here is my code:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, LlamaForCausalLM

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = LlamaForCausalLM.from_pretrained(model_id, gguf_file=filename)

Many GGUF models on Hugging Face do not have a config.json. So, I tried to load the config from the raw Meta-Llama-3-8B:

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = LlamaForCausalLM.from_pretrained(model_id, gguf_file=filename, config=config)

This approach works, but I think this is not an elegant solution. Perhaps more modifications are needed here. Thank you for your contribution.
Hi @Lin-xs
Thank you, this approach works.
Hi @younesbelkada @Isotr0py, I encountered a bug when trying to use:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuantFactory/Qwen2-7B-GGUF"
filename = "Qwen2-7B.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

The error message I received is:

The same error occurs when I try to load. Could you please take a look at it? Thanks!
I think probably this is because of the default vocab_size. You can inspect the config extracted from the GGUF file:

from transformers.modeling_gguf_pytorch_utils import load_gguf_checkpoint
from transformers.utils import cached_file

model_id = "Qwen/Qwen2-7B-Instruct-GGUF"
filename = "qwen2-7b-instruct-q2_k.gguf"

gguf_path = cached_file(model_id, filename)
config_dict = load_gguf_checkpoint(gguf_path, return_tensors=False)["config"]
print(config_dict)

the output is:
Very much appreciate the effort to easily use GGUF models with the transformers library :) e.g. when I do the following
I get:
cc @SunMarc
That's right! The goal of this feature is to let users load their GGUF files in transformers so that they can fine-tune them before converting them back to the GGUF format!
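Putting the pieces from this PR's docs together, the round trip described here looks roughly like the following — the repo id is assumed, the filename comes from the PR docs, and the fine-tuning step is only indicated by a comment:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # assumed repo id
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# 1. Load: the GGUF weights are dequantized to float32
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

# 2. Fine-tune the float32 model with your usual training loop / Trainer ...

# 3. Save as a regular transformers checkpoint
tokenizer.save_pretrained("directory")
model.save_pretrained("directory")

# 4. Convert back to GGUF with llama.cpp, as in the PR docs:
#    python ${path_to_llama_cpp}/convert-hf-to-gguf.py directory
```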
ok got it! thank you for confirming :)!!
This PR offers the ability to load .gguf files within transformers, dequantizing them to float32. Doing so enables further training on the GGUF files before converting them back to the GGUF format for usage in the GGML ecosystem.

We enable this through the from_gguf keyword argument of the from_pretrained methods of configurations, tokenizers, and PyTorch models. Here is an example of the API:
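A sketch of that API, based on the usage shown in the docs added by this PR (the repo id is assumed; the keyword argument was later renamed to gguf_file during review):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # assumed repo id
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, from_gguf=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, from_gguf=filename)
```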
Supported quantization types
The initial supported quantization types are decided according to the popular quantized files that have been shared on the Hub.
We take example from, and credit, the excellent 99991/pygguf Python parser to dequantize the weights.
Supported model architectures
For now the supported model architectures are the architectures that have been very popular on the Hub, namely: