Loading GGUF files support #30391
Conversation
Co-authored-by: Younes Belkada <[email protected]> Co-authored-by: 99991 <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
🥳 - exciting to see this added!
Mostly nits and small comments. Addition of gguf flags to common logic in modeling_utils makes me slightly uneasy e.g.
if gguf_path is None and (low_cpu_mem_usage or (use_keep_in_fp32_modules and is_accelerate_available())):
It indicates to me that passing gguf through modeling_utils isn't really compatible. I think it's OK at the moment. If we find there are other formats we want to support, we might have to restructure the logic flow so that we're not passing these flags around everywhere.
Main comment is about the structure of the if/else statements in the code.
I have no idea about the intended logic for dequantization - looks sensible to me but I haven't looked in depth at those methods :)
docs/source/en/gguf.md (Outdated)
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, from_gguf=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, from_gguf=filename)
What would happen if I passed a quantization config in with the from_pretrained call? gguf -> unquantized -> requantized?
I see this is handled in modeling utils ❤️
tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory}
It would be nice if we had this within save_pretrained using e.g. a save_gguf flag.
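For illustration only, the kind of API this suggests might look like the following — a hypothetical sketch, since `save_gguf` is not implemented in this PR (the author notes below that it is planned for a follow-up), and it assumes `model`/`tokenizer` were loaded from a GGUF file as above:

```python
# Hypothetical flag -- not part of this PR.
model.save_pretrained("directory", save_gguf=True)      # would write a .gguf checkpoint directly
tokenizer.save_pretrained("directory", save_gguf=True)
```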
Yes, this is part of the full integration, will do that in a follow-up PR!
if "fast_tokenizer_files" in tokenizer_config:
    fast_tokenizer_file = get_fast_tokenizer_file(tokenizer_config["fast_tokenizer_files"])
    vocab_files["tokenizer_file"] = fast_tokenizer_file
if not from_gguf:
ultranit - it's a bit funny to define the default case as "not gguf", i.e. it centers on gguf as how we look at our objects. If we end up adding another format, this would then have to follow the pattern "if not x and not y"; it's easier to do `if from_gguf` and `else`.
Makes sense to change it: e6c6f6c
@@ -112,6 +115,10 @@ def __init__(self, *args, **kwargs):
        elif slow_tokenizer is not None:
            # We need to convert a slow tokenizer to build the backend
            fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
        elif from_gguf is not None:
            # We need to convert a slow tokenizer to build the backend
            tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
Does this work if kwargs.get("vocab_file") is None?
The correct vocab file is always passed here: #30391 (comment), so this is unlikely to happen, but if one passes None it will indeed fail.
def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
nit - more of a stylistic choice - the checking pattern in the if/elif/else statement looks like it lends itself to an IntEnum
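For illustration, the IntEnum pattern hinted at here could look roughly like this — a sketch only; the F32/F16/Q4_0 ids follow the public GGML type numbering, and the actual dequantization bodies are elided:

```python
from enum import IntEnum

class GGMLType(IntEnum):
    # GGML/GGUF tensor type ids (only a few shown here)
    F32 = 0
    F16 = 1
    Q4_0 = 2

def load_dequant_gguf_tensor(shape, ggml_type, data):
    # IntEnum members compare equal to plain ints, so callers can keep passing raw type ids
    if ggml_type == GGMLType.F32:
        values = data
    elif ggml_type == GGMLType.F16:
        values = data
    else:
        raise NotImplementedError(f"GGML type {ggml_type} is not dequantized in this sketch")
    return values.reshape(shape[::-1])
```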
if "llama" in architecture and "mistral" in model_name:
    updated_architecture = "mistral"
Why is this the case?
unfortunately this is because mistral in llama.cpp uses the exact same arch as llama 😢 will add a comment explaining
if architecture == "llama" and (".attn_k." in name or ".attn_q." in name):
    num_heads = parsed_parameters["config"]["num_attention_heads"]
    tmp_shape = (int(shape[-1] // num_heads // 2), num_heads, 2, shape[0])
    weights = weights.reshape(tmp_shape)
    weights = weights.transpose(0, 2, 1, 3)
    weights = weights.reshape(shape[::-1])
We'll want to make this more general when we have more models - a problem for future us!
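One possible shape for that generalisation, sketched with hypothetical names (the registry and helper below are not part of the PR; the permute itself mirrors the quoted code):

```python
import numpy as np

def reverse_llama_qk_permute(weights: np.ndarray, shape, num_heads: int) -> np.ndarray:
    # Undo the interleaved rotary layout llama.cpp uses for Q/K projection weights
    tmp_shape = (int(shape[-1] // num_heads // 2), num_heads, 2, shape[0])
    return weights.reshape(tmp_shape).transpose(0, 2, 1, 3).reshape(shape[::-1])

# Hypothetical per-architecture registry of tensor fix-ups
TENSOR_FIXUPS = {
    ("llama", ".attn_q."): reverse_llama_qk_permute,
    ("llama", ".attn_k."): reverse_llama_qk_permute,
}

def apply_tensor_fixups(architecture, name, weights, shape, num_heads):
    for (arch, fragment), fixup in TENSOR_FIXUPS.items():
        if arch == architecture and fragment in name:
            return fixup(weights, shape, num_heads)
    return weights
```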
if len(reader_keys) > 0:
    logger.info(f"Some keys of the GGUF file were not considered: {reader_keys}")
Nice :)
Looks great!
AddedToken("<unk>", normalized=False, special=True),
AddedToken("<s>", normalized=False, special=True),
AddedToken("</s>", normalized=False, special=True),
are these always the same? For Llama-based models, no. This would add extra tokens and can mess up the order etc.
not addressed! the added_tokens should all be added, and the special tokens as well.
    ],
    axis=1,
)
I would probably completely separate what was entirely vendored and what we added. Splitting the file here
some methods above were written by us so not entirely vendored
Still some are vendored -> which ones? Which ones did we write ourselves?
class GGUFLlamaConverter(LlamaConverter):
    def __init__(self, tokenizer_dict):
        self.proto = GGUFTokenizerSkeleton(tokenizer_dict)
        self.original_tokenizer = self.proto
the original tokenizer is usually a PreTrainedTokenizer, not a proto
elif from_gguf is not None:
    # We need to convert a slow tokenizer to build the backend
    tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
    fast_tokenizer = convert_gguf_tokenizer(tokenizer_dict)
note for myself: convert from tiktoken could also be added here. Maybe a mapping from_xxx with the function?
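A rough sketch of that mapping idea — the dict and helper below are hypothetical; `load_gguf_checkpoint` and `convert_gguf_tokenizer` come from this PR (their import paths are assumed here), and a tiktoken converter does not exist in it:

```python
from transformers.modeling_gguf_pytorch_utils import load_gguf_checkpoint
from transformers.integrations.ggml import convert_gguf_tokenizer  # assumed location

# Hypothetical dispatch table: checkpoint format -> fast-tokenizer builder
TOKENIZER_FORMAT_CONVERTERS = {
    "gguf": lambda vocab_file: convert_gguf_tokenizer(load_gguf_checkpoint(vocab_file)["tokenizer"]),
    # "tiktoken": convert_tiktoken_tokenizer,  # could be added later
}

def build_fast_tokenizer(checkpoint_format, vocab_file):
    return TOKENIZER_FORMAT_CONVERTERS[checkpoint_format](vocab_file)
```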
out = model.generate(**text, max_new_tokens=10)

EXPECTED_TEXT = "<s> Hello,\n\nI'm trying to create a"
self.assertEqual(tokenizer.decode(out[0], skip_special_tokens=True), EXPECTED_TEXT)
missing tests on special tokens / additional special tokens:
- do we correctly skip special tokens like GGUF would
- do we add extra spaces or not like GGUF
- check out some of these tests: https://github.com/huggingface/transformers/blob/main/tests/models/llama/test_tokenization_llama.py#L731
Makes sense! As discussed offline, I fixed some issues and added some tests here: 3bdbb2e
Ok, can you add just one test with added tokens? Something like ".Hey How.Hey<token>. <token>" with <token> being one of gguf.added_tokens? (so or or )
Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
2 nits and 1 test to add!
        )
        return tokenizer

    def decoder(self, replacement, add_prefix_space):
Is add_prefix_space defined in the gguf? It might not be good to always take it from the class (which is what's happening now).
It is not defined from what I read in the GGML docs + when inspecting various checkpoints from the Hub
So it's always adding a prefix space I suppose?
Looks great! Thanks again for adding this - excited to see it in action 🎬
Two general comments:
- The tokenizer logic @ArthurZucker highlighted will need to be addressed before merge
- `from_gguf` as a flag name doesn't align with the other `from_xxx` flags in `from_pretrained`, e.g. `from_tf`, which are bools. Could we align it with a flag closer to its meaning, e.g. `gguf_id` or `gguf_file`? (See the sketch after this list.)
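The rename suggested here was adopted later in the PR (from_gguf -> gguf_file), so loading ends up looking like this; the repo id and filename below are illustrative, taken from a later comment in this thread:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"   # illustrative repo id
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
```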
@@ -658,6 +659,8 @@ def _get_config_dict(
        from_auto_class = kwargs.pop("_from_auto", False)
        commit_hash = kwargs.pop("_commit_hash", None)

        from_gguf = kwargs.get("from_gguf", None)
Should this be pop here?
- from_gguf = kwargs.get("from_gguf", None)
+ from_gguf = kwargs.pop("from_gguf", None)
Here I think it should be get, as from_gguf is used later in case one uses Auto classes.
Ah, OK!
tests/quantization/ggml/test_ggml.py (Outdated)
# Otherwise the test takes too long
if i > 100:
    break
A cleaner way to do this is to take a slice of the dataset so that you iterate over a small subset.
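A minimal sketch of that suggestion, assuming a `datasets` Dataset (the dataset name and loop body below are illustrative, not from the PR's test):

```python
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")  # any small text dataset
subset = dataset.select(range(100))  # fixed-size slice instead of breaking out of the loop

for example in subset:
    _ = example["text"]  # stands in for the body of the original test loop
```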
Makes sense! Done in 65433c4
Thanks both for the extensive review! 🚀
tokenizer.add_special_tokens(
    [AddedToken(added_token, normalized=False, special=False) for added_token in self.added_tokens]
)
Not all of them are special here. You can add them all as special
@younesbelkada this just means that added tokens that are not special will be skipped when decoding.
* Adds support for loading GGUF files
  Co-authored-by: Younes Belkada <[email protected]>
  Co-authored-by: 99991 <[email protected]>
* add q2_k q3_k q5_k support from @99991
* fix tests
* Update doc
* Style
* Docs
* fix CI
* Update docs/source/en/gguf.md
* Update docs/source/en/gguf.md
* Compute merges
* change logic
* add comment for clarity
* add comment for clarity
* Update src/transformers/models/auto/tokenization_auto.py
  Co-authored-by: amyeroberts <[email protected]>
* change logic
* Update src/transformers/modeling_utils.py
  Co-authored-by: amyeroberts <[email protected]>
* change
* Apply suggestions from code review
  Co-authored-by: amyeroberts <[email protected]>
* Update src/transformers/modeling_gguf_pytorch_utils.py
  Co-authored-by: amyeroberts <[email protected]>
* put back comment
* add comment about mistral
* comments and added tests
* fix unconsistent type
* more
* fix tokenizer
* Update src/transformers/modeling_utils.py
  Co-authored-by: amyeroberts <[email protected]>
* address comments about tests and tokenizer + add added_tokens
* from_gguf -> gguf_file
* replace on docs too

---------

Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: 99991 <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
is this correct, or in progress in the v4.41 release?
Hi @brandon-lockaby!
Same error. Created and loaded from this repo: https://huggingface.co/brandonglockaby/Meta-Llama-3-8B-Q4_K_M-GGUF. I should point out that previous attempts were ggufs that work correctly with current releases of llama.cpp and llama-cpp-python.
Indeed I was able to repro, this is because the tokenizer is registered as
@brandon-lockaby - #31175 has been merged and might include a fix for the issue you are facing, can you try to re-run the snippet using the transformers main branch?
Same error related to the tokenizer filename, produced with an updated repo from gguf-my-repo as well as a gguf from my storage.
Hi @brandon-lockaby
Hi @younesbelkada, I have tried #31358, and now the tokenizer can be loaded successfully. However, when I attempt to load the GGUF model, an OSError occurs:
Here is my code:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, LlamaForCausalLM

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = LlamaForCausalLM.from_pretrained(model_id, gguf_file=filename)

Many GGUF models on Hugging Face do not have a config.json. So, I tried to load the config from the raw Meta-Llama-3-8B:

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = LlamaForCausalLM.from_pretrained(model_id, gguf_file=filename, config=config)

This approach works, but I think this is not an elegant solution. Perhaps more modifications are needed here. Thank you for your contribution.
Hi @Lin-xs
Thank you, this approach works.
Hi @younesbelkada @Isotr0py, I encountered a bug when trying to use:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuantFactory/Qwen2-7B-GGUF"
filename = "Qwen2-7B.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

The error message I received is:

The same error occurs when I try to load. Could you please take a look at it? Thanks!
I think probably this is because of the default vocab_size. You can inspect the config extracted from the GGUF file:

from transformers.modeling_gguf_pytorch_utils import load_gguf_checkpoint
from transformers.utils import cached_file

model_id = "Qwen/Qwen2-7B-Instruct-GGUF"
filename = "qwen2-7b-instruct-q2_k.gguf"

gguf_path = cached_file(model_id, filename)
config_dict = load_gguf_checkpoint(gguf_path, return_tensors=False)["config"]
print(config_dict)

the output is:
Very much appreciate the effort to easily use GGUF models with the transformers library :) e.g. when I do the following
I get:
cc @SunMarc
That's right! The goal of this feature is to let users load their GGUF files in transformers so that they can fine-tune them before converting them back to the GGUF format!
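Putting the pieces from this PR's docs together, the round trip described here looks roughly like the following — the repo id is assumed, the filename comes from the PR docs, and the fine-tuning step is only indicated by a comment:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # assumed repo id
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

# 1. Load: the GGUF weights are dequantized to float32
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

# 2. Fine-tune the float32 model with your usual training loop / Trainer ...

# 3. Save as a regular transformers checkpoint
tokenizer.save_pretrained("directory")
model.save_pretrained("directory")

# 4. Convert back to GGUF with llama.cpp, as in the PR docs:
#    python ${path_to_llama_cpp}/convert-hf-to-gguf.py directory
```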
ok got it! thank you for confirming :)!!
This PR offers the ability to load .gguf files within transformers, dequantizing them to float32. Doing so enables further training on the GGUF files before converting them back to the GGUF format for usage in the GGML ecosystem.

We enable this through the from_gguf keyword argument of the from_pretrained methods of configurations, tokenizers, and PyTorch models. Here is an example of the API:
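A sketch of that API, based on the usage shown in the docs added by this PR (the repo id is assumed; the keyword argument was later renamed to gguf_file during review):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # assumed repo id
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, from_gguf=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, from_gguf=filename)
```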
Supported quantization types
The initial supported quantization types are decided according to the popular quantized files that have been shared on the Hub.
We take example from, and credit, the excellent 99991/pygguf Python parser to dequantize the weights.
Supported model architectures
For now the supported model architectures are the architectures that have been very popular on the Hub, namely: