
Loading GGUF files support #30391

Merged · 37 commits · May 15, 2024
Conversation

@LysandreJik (Member) commented Apr 22, 2024

This PR offers the ability to load .gguf files within transformers, dequantizing them to float32.
Doing so enables further training on the GGUF files before converting them back to the GGUF format for use in the GGML ecosystem.

We enable this through the gguf_file keyword argument of the from_pretrained methods of configurations, tokenizers, and PyTorch models. Here is an example of the API:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

Note
This is still very experimental and we don't expect all files to work seamlessly. We'll work on improving the current implementation and welcome PRs that help us do so.

Supported quantization types

The initially supported quantization types were chosen according to the most popular quantized files shared on the Hub.

  • F32
  • Q2_K
  • Q3_K
  • Q4_0
  • Q4_K
  • Q5_K
  • Q6_K
  • Q8_0

We take inspiration from, and credit, the excellent 99991/pygguf Python parser for dequantizing the weights.
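To give a concrete sense of what the dequantization step involves, here is a minimal sketch for the simplest quantized type, assuming the standard GGML Q8_0 layout (each 34-byte block stores one float16 scale followed by 32 int8 values); this is a simplified stand-in for illustration, not the code added in this PR:

import numpy as np

def dequantize_q8_0(raw: bytes, n_elements: int) -> np.ndarray:
    # Q8_0 block: 2 bytes (float16 scale) + 32 bytes (int8 quants); value = scale * quant
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 34)
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # shape (n_blocks, 1)
    quants = blocks[:, 2:].copy().view(np.int8).astype(np.float32)     # shape (n_blocks, 32)
    return (scales * quants).reshape(-1)[:n_elements]

The K-quant types (Q2_K, Q3_K, Q4_K, Q5_K, Q6_K) use more involved block layouts with per-sub-block scales, which is where the pygguf parser is relied on.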

Supported model architectures

For now, the supported model architectures are those that have been most popular on the Hub, namely:

  • LLaMa
  • Mistral

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment

🥳 - exciting to see this added!

Mostly nits and small comments. Addition of gguf flags to common logic in modeling_utils makes me slightly uneasy e.g.

if gguf_path is None and (low_cpu_mem_usage or (use_keep_in_fp32_modules and is_accelerate_available())):

It indicates to me that passing gguf through modeling_utils isn't really compatible. I think it's OK atm. If we find there are other formats we end up wanting to support, then we might have to restructure the logic flow s.t. we're not having to pass around these flags everywhere.

Main comment is about the structure of the if/else statements in the code.

I have no idea about the intended logic for dequantization - looks sensible to me but I haven't looked in depth at those methods :)

filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, from_gguf=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, from_gguf=filename)
@amyeroberts (Collaborator) commented Apr 25, 2024

What would happen if I passed a quantization config in with the from_pretrained call? gguf -> unquantized -> requantized?

I see this is handled in modeling utils ❤️

Comment on lines +92 to +95
tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory}
Collaborator

It would be nice if we had this within save_pretrained using e.g. a save_gguf flag

Contributor

Yes this is part of the full integration, will do that in a follow up PR !

if "fast_tokenizer_files" in tokenizer_config:
fast_tokenizer_file = get_fast_tokenizer_file(tokenizer_config["fast_tokenizer_files"])
vocab_files["tokenizer_file"] = fast_tokenizer_file
if not from_gguf:
Collaborator

ultranit - it's a bit funny to define the default case as "not gguf", i.e. it centers on gguf as how we look at our objects. If we end up adding another format, this would then have to follow the pattern "if not x and not y"; it's easier to do if from_gguf and else.

Contributor

Makes sense to change it: e6c6f6c

@@ -112,6 +115,10 @@ def __init__(self, *args, **kwargs):
elif slow_tokenizer is not None:
# We need to convert a slow tokenizer to build the backend
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
elif from_gguf is not None:
# We need to convert a slow tokenizer to build the backend
tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
Collaborator

Does this work if kwargs.get("vocab_file") is None?

Contributor

The correct vocab file is always passed here: #30391 (comment), so this is less likely to happen, but if one passes None it will indeed fail.

src/transformers/tokenization_utils_base.py (outdated, resolved)


def load_dequant_gguf_tensor(shape, ggml_type, data):
    if ggml_type == GGML_TYPES["F32"]:
Collaborator

nit - more of a stylistic choice - the checking pattern in the if/elif/else statement looks like it lends itself to an IntEnum
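For illustration, a small sketch of the IntEnum pattern suggested here; the numeric codes follow the usual GGML convention but should be treated as assumptions, and the dispatch table is illustrative rather than the PR's actual code:

from enum import IntEnum

class GGMLType(IntEnum):
    F32 = 0
    Q8_0 = 8
    Q6_K = 14

# Map each type to its dequantization routine once, instead of a long if/elif chain.
DEQUANTIZERS = {
    GGMLType.F32: lambda data: data,
    # GGMLType.Q8_0: dequantize_q8_0, etc.
}

def load_dequant_gguf_tensor(shape, ggml_type, data):
    dequantize = DEQUANTIZERS[GGMLType(ggml_type)]
    return dequantize(data).reshape(shape)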

Comment on lines +90 to +91
if "llama" in architecture and "mistral" in model_name:
updated_architecture = "mistral"
Collaborator

Why is this the case?

Contributor

unfortunately this is because mistral in llama.cpp uses the exact same arch as llama 😢 will add a comment explaining

src/transformers/modeling_gguf_pytorch_utils.py (outdated, resolved)
Comment on lines +146 to +151
if architecture == "llama" and (".attn_k." in name or ".attn_q." in name):
num_heads = parsed_parameters["config"]["num_attention_heads"]
tmp_shape = (int(shape[-1] // num_heads // 2), num_heads, 2, shape[0])
weights = weights.reshape(tmp_shape)
weights = weights.transpose(0, 2, 1, 3)
weights = weights.reshape(shape[::-1])
Collaborator

We'll want to make this more general when we have more models - a problem for future us!

Comment on lines +160 to +161
if len(reader_keys) > 0:
    logger.info(f"Some keys of the GGUF file were not considered: {reader_keys}")
Collaborator

Nice :)

@ArthurZucker (Collaborator) left a comment

Looks great!

Comment on lines +539 to +541
AddedToken("<unk>", normalized=False, special=True),
AddedToken("<s>", normalized=False, special=True),
AddedToken("</s>", normalized=False, special=True),
Collaborator

are these always the same? For Llama-based models, no. This would add extra tokens and can mess up the order, etc.

Collaborator

not addressed! the added_tokens should all be added, and the special tokens as well.


],
axis=1,
)

Collaborator

I would probably completely separate what was entirely vendored and what we added. Splitting the file here

Contributor

some methods above were written by us, so it's not entirely vendored

Collaborator

Still, some are vendored -> which ones? Which ones did we write ourselves?

class GGUFLlamaConverter(LlamaConverter):
    def __init__(self, tokenizer_dict):
        self.proto = GGUFTokenizerSkeleton(tokenizer_dict)
        self.original_tokenizer = self.proto
Collaborator

the original tokenizer is usually a PreTrainedTokenizer, not a proto

src/transformers/tokenization_utils_fast.py (resolved)
elif from_gguf is not None:
    # We need to convert a slow tokenizer to build the backend
    tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
    fast_tokenizer = convert_gguf_tokenizer(tokenizer_dict)
Collaborator

note for myself: converting from tiktoken could also be added here. Maybe a mapping of from_xxx to the corresponding conversion function?
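A purely illustrative sketch of that mapping idea; the converter functions and registry below are hypothetical placeholders, not existing transformers APIs:

def convert_gguf(tokenizer_dict):
    ...  # build a fast tokenizer backend from a GGUF tokenizer dict

def convert_tiktoken(encoding):
    ...  # build a fast tokenizer backend from a tiktoken encoding

BACKEND_CONVERTERS = {
    "gguf": convert_gguf,
    "tiktoken": convert_tiktoken,
}

def build_fast_tokenizer(source_format, payload):
    return BACKEND_CONVERTERS[source_format](payload)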

src/transformers/tokenization_utils_base.py (resolved)
src/transformers/tokenization_utils_base.py (outdated, resolved)
src/transformers/tokenization_utils_base.py (outdated, resolved)
src/transformers/models/auto/tokenization_auto.py (outdated, resolved)
out = model.generate(**text, max_new_tokens=10)

EXPECTED_TEXT = "<s> Hello,\n\nI'm trying to create a"
self.assertEqual(tokenizer.decode(out[0], skip_special_tokens=True), EXPECTED_TEXT)
Collaborator

missing tests on special tokens / additional special tokens:

Contributor

Makes sense ! As discussed offline, I fixed some issues and added some tests here: 3bdbb2e

Collaborator

Ok, can you add just one test with added tokens? Something like ".Hey How.Hey<token>. <token>", with <token> being one of gguf.added_tokens?


@ArthurZucker (Collaborator) left a comment

2 nits and 1 test to add!

out = model.generate(**text, max_new_tokens=10)

EXPECTED_TEXT = "<s> Hello,\n\nI'm trying to create a"
self.assertEqual(tokenizer.decode(out[0], skip_special_tokens=True), EXPECTED_TEXT)
Collaborator

Ok, can you add just one test with added tokens? Something like ".Hey How.Hey<token>. <token>", with <token> being one of gguf.added_tokens?

Comment on lines +539 to +541
AddedToken("<unk>", normalized=False, special=True),
AddedToken("<s>", normalized=False, special=True),
AddedToken("</s>", normalized=False, special=True),
Collaborator

not addressed! the added_tokens should all be added, and the special tokens as well.

    )
    return tokenizer

def decoder(self, replacement, add_prefix_space):
Collaborator

Is add_prefix_space defined in the gguf? It might not be good to always take it from the class (which is what's happening now).

Contributor

It is not defined from what I read in the GGML docs + when inspecting various checkpoints from the Hub

Collaborator

So it's always adding a prefix space I suppose?

@amyeroberts (Collaborator) left a comment

Looks great! Thanks again for adding this - excited to see it in action 🎬

Two general comments:

  • The tokenizer logic @ArthurZucker highlighted will need to be addressed before merge
  • from_gguf as a flag name doesn't align with other from_xxx flags in from_pretrained, e.g. from_tf, which are bools. Could we rename the flag to something closer to its meaning, e.g. gguf_id or gguf_file?

src/transformers/modeling_utils.py (outdated, resolved)
@@ -658,6 +659,8 @@ def _get_config_dict(
        from_auto_class = kwargs.pop("_from_auto", False)
        commit_hash = kwargs.pop("_commit_hash", None)

        from_gguf = kwargs.get("from_gguf", None)
Collaborator

Should this be pop here?

Suggested change:
- from_gguf = kwargs.get("from_gguf", None)
+ from_gguf = kwargs.pop("from_gguf", None)

Contributor

Here I think it should be get, as from_gguf is used later in case one uses Auto classes.

Collaborator

Ah, OK!

Comment on lines 194 to 196
# Otherwise the test takes too long
if i > 100:
    break
Collaborator

A cleaner way to do this is to take a slice of the dataset so that you iterate over a small subset.
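A minimal sketch of that pattern, assuming a Hugging Face datasets.Dataset; the dataset name is an illustrative assumption, not necessarily the one used in the test:

from datasets import load_dataset

# Select a fixed, small subset up front instead of breaking out of the loop.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
subset = dataset.select(range(100))
for example in subset:
    pass  # run the per-example check here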

Contributor

Makes sense ! Done in 65433c4

@younesbelkada (Contributor)

Thanks both for the extensive review ! 🚀

@younesbelkada merged commit a428449 into huggingface:main on May 15, 2024
23 checks passed
Comment on lines +547 to +549
tokenizer.add_special_tokens(
    [AddedToken(added_token, normalized=False, special=False) for added_token in self.added_tokens]
)
Collaborator

Not all of them are special here. You can add them all as special

Collaborator

@younesbelkada this just means that added tokens that are not special will be skipped when decoding.

itazap pushed a commit that referenced this pull request May 24, 2024
* Adds support for loading GGUF files

Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: 99991 <[email protected]>

* add q2_k q3_k q5_k support from @99991

* fix tests

* Update doc

* Style

* Docs

* fix CI

* Update docs/source/en/gguf.md

* Update docs/source/en/gguf.md

* Compute merges

* change logic

* add comment for clarity

* add comment for clarity

* Update src/transformers/models/auto/tokenization_auto.py

Co-authored-by: amyeroberts <[email protected]>

* change logic

* Update src/transformers/modeling_utils.py

Co-authored-by: amyeroberts <[email protected]>

* change

* Apply suggestions from code review

Co-authored-by: amyeroberts <[email protected]>

* Update src/transformers/modeling_gguf_pytorch_utils.py

Co-authored-by: amyeroberts <[email protected]>

* put back comment

* add comment about mistral

* comments and added tests

* fix unconsistent type

* more

* fix tokenizer

* Update src/transformers/modeling_utils.py

Co-authored-by: amyeroberts <[email protected]>

* address comments about tests and tokenizer + add added_tokens

* from_gguf -> gguf_file

* replace on docs too

---------

Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: 99991 <[email protected]>
Co-authored-by: Younes Belkada <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
@brandon-lockaby commented May 26, 2024

Is this correct, or still in progress in the v4.41 release?

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "NousResearch/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B-Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

print(model)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 6
      3 model_id = "NousResearch/Meta-Llama-3-8B-GGUF"
      4 filename = "Meta-Llama-3-8B-Q4_K_M.gguf"
----> 6 tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
      7 model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
      9 print(model)

File ~/.local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:899, in AutoTokenizer.from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    896 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    898 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 899     return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    900 else:
    901     if tokenizer_class_py is not None:

File ~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2110, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, cache_dir, force_download, local_files_only, token, revision, trust_remote_code, *init_inputs, **kwargs)
   2107     else:
   2108         logger.info(f"loading file {file_path} from cache at {resolved_vocab_files[file_id]}")
-> 2110 return cls._from_pretrained(
   2111     resolved_vocab_files,
   2112     pretrained_model_name_or_path,
   2113     init_configuration,
   2114     *init_inputs,
   2115     token=token,
   2116     cache_dir=cache_dir,
   2117     local_files_only=local_files_only,
   2118     _commit_hash=commit_hash,
   2119     _is_local=is_local,
   2120     trust_remote_code=trust_remote_code,
   2121     **kwargs,
   2122 )

File ~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2336, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, trust_remote_code, *init_inputs, **kwargs)
   2334 # Instantiate the tokenizer.
   2335 try:
-> 2336     tokenizer = cls(*init_inputs, **init_kwargs)
   2337 except OSError:
   2338     raise OSError(
   2339         "Unable to load vocabulary from file. "
   2340         "Please check that the provided vocabulary is accessible and not corrupted."
   2341     )

File ~/.local/lib/python3.10/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py:100, in GPT2TokenizerFast.__init__(self, vocab_file, merges_file, tokenizer_file, unk_token, bos_token, eos_token, add_prefix_space, **kwargs)
     89 def __init__(
     90     self,
     91     vocab_file=None,
   (...)
     98     **kwargs,
     99 ):
--> 100     super().__init__(
    101         vocab_file,
    102         merges_file,
    103         tokenizer_file=tokenizer_file,
    104         unk_token=unk_token,
    105         bos_token=bos_token,
    106         eos_token=eos_token,
    107         add_prefix_space=add_prefix_space,
    108         **kwargs,
    109     )
    111     self.add_bos_token = kwargs.pop("add_bos_token", False)
    113     pre_tok_state = json.loads(self.backend_tokenizer.pre_tokenizer.__getstate__())

File ~/.local/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:120, in PreTrainedTokenizerFast.__init__(self, *args, **kwargs)
    117     fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
    118 elif gguf_file is not None:
    119     # We need to convert a slow tokenizer to build the backend
--> 120     tokenizer_dict = load_gguf_checkpoint(kwargs.get("vocab_file"))["tokenizer"]
    121     fast_tokenizer = convert_gguf_tokenizer(tokenizer_dict)
    122 elif self.slow_tokenizer_class is not None:
    123     # We need to create and convert a slow tokenizer to build the backend

File ~/.local/lib/python3.10/site-packages/transformers/modeling_gguf_pytorch_utils.py:81, in load_gguf_checkpoint(gguf_checkpoint_path, return_tensors)
     75     logger.error(
     76         "Loading a GGUF checkpoint in PyTorch, requires both PyTorch and GGUF to be installed. Please see "
     77         "https://pytorch.org/ and https://github.com/ggerganov/llama.cpp/tree/master/gguf-py for installation instructions."
     78     )
     79     raise
---> 81 reader = GGUFReader(gguf_checkpoint_path)
     82 fields = reader.fields
     83 reader_keys = list(fields.keys())

File ~/.local/lib/python3.10/site-packages/gguf/gguf_reader.py:85, in GGUFReader.__init__(self, path, mode)
     84 def __init__(self, path: os.PathLike[str] | str, mode: Literal['r' | 'r+' | 'c'] = 'r'):
---> 85     self.data = np.memmap(path, mode = mode)
     86     offs = 0
     87     if self._get(offs, np.uint32, override_order = '<')[0] != GGUF_MAGIC:

File /usr/lib/python3/dist-packages/numpy/core/memmap.py:228, in memmap.__new__(subtype, filename, dtype, mode, offset, shape, order)
    226     f_ctx = nullcontext(filename)
    227 else:
--> 228     f_ctx = open(os_fspath(filename), ('r' if mode == 'c' else mode)+'b')
    230 with f_ctx as fid:
    231     fid.seek(0, 2)

TypeError: expected str, bytes or os.PathLike object, not NoneType

@younesbelkada (Contributor)

Hi @brandon-lockaby !
I think that GGUF file is broken: https://huggingface.co/NousResearch/Meta-Llama-3-8B-GGUF/discussions/1 - can you try to freshly convert llama-3 using this Space: https://huggingface.co/spaces/ggml-org/gguf-my-repo ?

@brandon-lockaby

@younesbelkada

Same error. Created and loaded from this repo https://huggingface.co/brandonglockaby/Meta-Llama-3-8B-Q4_K_M-GGUF

I should point out that the previous attempts are GGUFs that work correctly with current releases of llama.cpp and llama-cpp-python.

@younesbelkada (Contributor)

Indeed, I was able to repro. This is because the tokenizer is registered as a gpt2 tokenizer; will have a look and provide a fix!

@younesbelkada (Contributor)

@brandon-lockaby - #31175 has been merged and might include a fix for the issue you are facing, can you try to re-run the snippet using transformers main branch?

@brandon-lockaby

@younesbelkada

pip install --upgrade --force-reinstall git+https://github.com/huggingface/transformers
<snip>
Successfully uninstalled transformers-4.41.2

Same error related to the tokenizer filename, produced with an updated repo from gguf-my-repo as well as a GGUF from my storage.

@younesbelkada (Contributor)

Hi @brandon-lockaby
Please see: #31358 for the final fix, let me know if that fixes your issue. It fixes the same issue I had locally

@Lin-xs commented Jun 15, 2024

Hi @brandon-lockaby Please see: #31358 for the final fix, let me know if that fixes your issue. It fixes the same issue I had locally

Hi @younesbelkada,

I have tried #31358, and now the tokenizer can be loaded successfully. However, when I attempt to load the GGUF model, an OSError occurs:

OSError: QuantFactory/Meta-Llama-3-8B-GGUF does not appear to have a file named config.json. Checkout 'https://huggingface.co/QuantFactory/Meta-Llama-3-8B-GGUF/tree/main' for available files.

Here is my code:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, LlamaForCausalLM

model_id = "QuantFactory/Meta-Llama-3-8B-GGUF"
filename = "Meta-Llama-3-8B.Q4_K_M.gguf"
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = LlamaForCausalLM.from_pretrained(model_id, gguf_file=filename)

Many GGUF models on Hugging Face do not have a config.json. So, I tried to load the config from the raw Meta-Llama-3-8B:

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = LlamaForCausalLM.from_pretrained(model_id, gguf_file=filename, config=config)

This approach works, but I think this is not an elegant solution. Perhaps more modifications are needed here.

Thank you for your contribution.

@younesbelkada (Contributor)

Hi @Lin-xs
Thanks a lot! Hmmm, indeed there might be a bug when not using auto classes. Can you try to load the model with AutoModelForCausalLM instead of LlamaForCausalLM?

@Lin-xs commented Jun 16, 2024

Hi @Lin-xs Thanks a lot! Hmmm, indeed there might be a bug when not using auto classes. Can you try to load the model with AutoModelForCausalLM instead of LlamaForCausalLM?

Thank you, this approach works.

@Lin-xs commented Jun 17, 2024

Hi @younesbelkada @Isotr0py ,

I encountered a bug when trying to use AutoModelForCausalLM to load the QuantFactory/Qwen2-7B-GGUF model. Here is the code I used:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuantFactory/Qwen2-7B-GGUF"
filename = "Qwen2-7B.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

The error message I received is:

File ~/miniconda3/envs/llama3/lib/python3.11/site-packages/accelerate/utils/modeling.py:358, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    356 if value is not None:
    357     if old_value.shape != value.shape:
--> 358         raise ValueError(
    359             f'Trying to set a tensor of shape {value.shape} in "{tensor_name}" (which has shape {old_value.shape}), this looks incorrect.'
    360         )
    362     if dtype is None:
    363         # For compatibility with PyTorch load_state_dict which converts state dict dtype to existing dtype in model
    364         value = value.to(old_value.dtype)

ValueError: Trying to set a tensor of shape torch.Size([152064, 3584]) in "weight" (which has shape torch.Size([151936, 3584])), this looks incorrect.

The same error occurs when I try to load Qwen/Qwen2-7B-Instruct-GGUF. It seems that 151936 is the vocab size for Qwen1.5 rather than Qwen2. In test_ggml.py, the attribute q4_0_qwen2_model_id is set to "qwen1_5-0_5b-chat-q4_0.gguf", which might cause the test to pass incorrectly.

Could you please take a look at it?

Thanks!

@Lin-xs commented Jun 17, 2024

I think this is probably because the default vocab_size of Qwen2Config is set to 151936 in configuration_qwen2.py, and the config loaded from the Qwen2 GGUF file does not have "vocab_size":

from transformers.modeling_gguf_pytorch_utils import load_gguf_checkpoint
from transformers.utils import cached_file

model_id = "Qwen/Qwen2-7B-Instruct-GGUF"
filename = "qwen2-7b-instruct-q2_k.gguf"

gguf_path = cached_file(model_id, filename,)
config_dict = load_gguf_checkpoint(gguf_path, return_tensors=False)["config"]
print(config_dict)

the output is

{'model_type': 'qwen2',
 '_model_name_or_path': 'qwen2-7b-instruct',
 'num_hidden_layers': 28,
 'max_position_embeddings': 32768,
 'hidden_size': 3584,
 'intermediate_size': 18944,
 'num_attention_heads': 28,
 'num_key_value_heads': 4,
 'rope_theta': 1000000.0,
 'rms_norm_eps': 9.999999974752427e-07,
 'eos_token_id': 151645,
 'pad_token_id': 151643,
 'bos_token_id': 151643}
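As an untested sketch, the explicit-config workaround used above for Llama-3 should also apply here, assuming the config of the original Qwen/Qwen2-7B-Instruct repo carries the correct vocab_size:

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen2-7B-Instruct-GGUF"
filename = "qwen2-7b-instruct-q2_k.gguf"

# Load the full config from the non-GGUF repo so vocab_size matches the GGUF tensor shapes.
config = AutoConfig.from_pretrained("Qwen/Qwen2-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename, config=config)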

@AghaDurrani

Hi @younesbelkada

I really appreciate the effort to make GGUF models easy to use with the transformers library :)
Just to understand: does the current functionality convert the weights back to FP32, and therefore "reverse" the entire quantization scheme when loading the model (instead of dequantizing on the fly)?

e.g. when I do the following:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q4_K_S.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)

print(model.model.layers[0].self_attn.k_proj.weight.dtype)

I get:
torch.float32

@amyeroberts (Collaborator)

cc @SunMarc

@SunMarc (Member) commented Sep 10, 2024

That's right! The goal of this feature is to let users load their GGUF files in transformers so that they can fine-tune them, before reconverting them to the GGUF format!
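For reference, a rough sketch of that round trip, following the example from the PR description; the output directory name is an illustrative assumption, and the llama.cpp conversion script name may differ between versions:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)  # dequantized to float32

# ... fine-tune `model` here ...

model.save_pretrained("finetuned-tinyllama")
tokenizer.save_pretrained("finetuned-tinyllama")

# Then convert back to GGUF with llama.cpp, e.g.:
#   python ${path_to_llama_cpp}/convert-hf-to-gguf.py finetuned-tinyllama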

@AghaDurrani

ok got it! thank you for confirming :)!!
