Loading GGUF files support #30391

Merged
merged 37 commits into from
May 15, 2024
Changes from 33 commits
fb00288
Adds support for loading GGUF files
LysandreJik Apr 19, 2024
81e4324
add q2_k q3_k q5_k support from @99991
younesbelkada Apr 22, 2024
8a0d5b8
fix tests
younesbelkada Apr 22, 2024
08534f3
Update doc
LysandreJik Apr 22, 2024
ebd9944
Style
LysandreJik Apr 22, 2024
5c913ec
Docs
LysandreJik Apr 22, 2024
8b81bfb
Merge remote-tracking branch 'upstream/main' into HEAD
younesbelkada Apr 22, 2024
c49f1a8
fix CI
younesbelkada Apr 22, 2024
7fa538b
Update docs/source/en/gguf.md
younesbelkada Apr 22, 2024
5485327
Update docs/source/en/gguf.md
younesbelkada Apr 22, 2024
074f05e
Merge branch 'main' into gguf-support
younesbelkada Apr 23, 2024
ca8363e
Compute merges
LysandreJik Apr 23, 2024
2a0c9b0
Merge branch 'main' into gguf-support
younesbelkada Apr 25, 2024
fac7bb3
Merge branch 'main' into gguf-support
younesbelkada Apr 25, 2024
45983db
Merge remote-tracking branch 'upstream/main' into HEAD
younesbelkada Apr 30, 2024
e6c6f6c
change logic
younesbelkada Apr 30, 2024
a6cd08c
add comment for clarity
younesbelkada Apr 30, 2024
6611877
add comment for clarity
younesbelkada Apr 30, 2024
455163b
Update src/transformers/models/auto/tokenization_auto.py
younesbelkada Apr 30, 2024
42d5815
change logic
younesbelkada Apr 30, 2024
1d3acec
Update src/transformers/modeling_utils.py
younesbelkada Apr 30, 2024
af3c42c
change
younesbelkada Apr 30, 2024
a27db0c
Merge branch 'gguf-support' of https://github.com/lysandrejik/transfo…
younesbelkada Apr 30, 2024
14ad10c
Apply suggestions from code review
younesbelkada Apr 30, 2024
ab621a7
Update src/transformers/modeling_gguf_pytorch_utils.py
younesbelkada Apr 30, 2024
207820a
put back comment
younesbelkada Apr 30, 2024
1fef8ad
add comment about mistral
younesbelkada Apr 30, 2024
9ae7363
comments and added tests
younesbelkada Apr 30, 2024
3ed384f
fix merge
younesbelkada May 14, 2024
55eb860
fix unconsistent type
younesbelkada May 14, 2024
f754335
more
younesbelkada May 14, 2024
a449078
Merge remote-tracking branch 'origin/main' into HEAD
younesbelkada May 14, 2024
3bdbb2e
fix tokenizer
younesbelkada May 15, 2024
0ab79f6
Update src/transformers/modeling_utils.py
younesbelkada May 15, 2024
65433c4
address comments about tests and tokenizer + add added_tokens
younesbelkada May 15, 2024
1b5ae54
from_gguf -> gguf_file
younesbelkada May 15, 2024
d6b67c6
replace on docs too
younesbelkada May 15, 2024
3 changes: 3 additions & 0 deletions docker/transformers-all-latest-gpu/Dockerfile
@@ -45,6 +45,9 @@ RUN python3 -m pip install --no-cache-dir git+https://github.com/huggingface/opt
# For video model testing
RUN python3 -m pip install --no-cache-dir decord av==9.2.0

# For GGUF tests
RUN python3 -m pip install --no-cache-dir gguf

# Some slow tests require bnb
RUN python3 -m pip install --no-cache-dir bitsandbytes

2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -137,6 +137,8 @@
title: Troubleshoot
- local: hf_quantizer
title: Contribute new quantization method
- local: gguf
title: Interoperability with GGUF files
title: Developer guides
- sections:
- local: performance
96 changes: 96 additions & 0 deletions docs/source/en/gguf.md
@@ -0,0 +1,96 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# GGUF and interaction with Transformers

The GGUF file format is used to store models for inference with [GGML](https://github.com/ggerganov/ggml) and other
libraries that depend on it, like the very popular [llama.cpp](https://github.com/ggerganov/llama.cpp) or
[whisper.cpp](https://github.com/ggerganov/whisper.cpp).

It is a file format [supported by the Hugging Face Hub](https://huggingface.co/docs/hub/en/gguf) with features
allowing for quick inspection of tensors and metadata within the file.

This file format is designed as a "single-file-format" where a single file usually contains both the configuration
attributes, the tokenizer vocabulary and other attributes, as well as all tensors to be loaded in the model. These
files come in different formats according to the quantization type of the file. We briefly go over some of them
[here](https://huggingface.co/docs/hub/en/gguf#quantization-types).

## Support within Transformers

We have added the ability to load `gguf` files within `transformers` in order to offer further training/fine-tuning
capabilities to gguf models, before converting back those models to `gguf` to use within the `ggml` ecosystem. When
loading a model, we first dequantize it to fp32, before loading the weights to be used in PyTorch.

> [!NOTE]
> The support is still very exploratory and we welcome contributions in order to solidify it across quantization types
> and model architectures.

For now, here are the supported model architectures and quantization types:

### Supported quantization types

The initial supported quantization types are decided according to the popular quantized files that have been shared
on the Hub.

- F32
- Q2_K
- Q3_K
- Q4_0
- Q4_K
- Q5_K
- Q6_K
- Q8_0

We take example from the excellent [99991/pygguf](https://github.com/99991/pygguf) Python parser to dequantize the
weights.
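
To make the dequantization step more concrete, here is a minimal sketch for a single quantization type, Q8_0. It assumes the usual ggml layout for Q8_0 (blocks of 32 values, each block stored as one float16 scale followed by 32 int8 quants). The function name `dequantize_q8_0_sketch` is hypothetical and is not the helper added in this PR; the other quantization types use more involved block layouts that are not shown here.

```py
import numpy as np

VALUES_PER_BLOCK = 32                      # assumed Q8_0 block size
BYTES_PER_BLOCK = 2 + VALUES_PER_BLOCK     # float16 scale + 32 int8 quants


def dequantize_q8_0_sketch(raw: bytes) -> np.ndarray:
    """Turn raw Q8_0 bytes into a flat float32 array (illustrative only)."""
    blocks = np.frombuffer(raw, dtype=np.uint8).reshape(-1, BYTES_PER_BLOCK)
    # First two bytes of every block: the per-block scale, stored as float16.
    scales = blocks[:, :2].copy().view(np.float16).astype(np.float32)
    # Remaining 32 bytes: signed 8-bit quantized values.
    quants = blocks[:, 2:].copy().view(np.int8).astype(np.float32)
    # Each value is recovered as scale * quant; broadcasting handles the blocks.
    return (scales * quants).reshape(-1)
```

Because the result is float32, a dequantized tensor takes roughly four times the memory of its Q8_0 form.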

### Supported model architectures

For now the supported model architectures are the architectures that have been very popular on the Hub, namely:

- LLaMa
- Mistral

## Example usage

In order to load `gguf` files in `transformers`, you should specify the `from_gguf` argument to the `from_pretrained`
methods of both tokenizers and models. Here is how one would load a tokenizer and a model, which can be loaded
from the exact same file:

```py
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
filename = "tinyllama-1.1b-chat-v1.0.Q6_K.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, from_gguf=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, from_gguf=filename)
```

Review comment (@amyeroberts, Collaborator, Apr 25, 2024): What would happen if I passed in a quantization config in with the from_pretrained call? gguf -> unquantized -> requantized?

I see this is handled in modeling utils ❤️

Now you have access to the full, unquantized version of the model in the PyTorch ecosystem, where you can combine it
with a plethora of other tools.

In order to convert back to a `gguf` file, we recommend using the
[`convert-hf-to-gguf.py` file](https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py) from llama.cpp.

Here's how you would complete the script above to save the model and export it back to `gguf`:

```py
tokenizer.save_pretrained('directory')
model.save_pretrained('directory')

!python ${path_to_llama_cpp}/convert-hf-to-gguf.py ${directory}
```

Review comment on lines +92 to +95 (Collaborator): It would be nice if we had this within save_pretrained using e.g. a save_gguf flag

Reply (Contributor): Yes this is part of the full integration, will do that in a follow up PR !
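
Outside a notebook, the `!python ...` line in the doc example can be replaced by a subprocess call. The following is only a sketch: the helper name `build_gguf_export_command` and the example paths are hypothetical, and it assumes the conversion script accepts the saved model directory as its positional argument, as in the doc example.

```py
import sys
from pathlib import Path


def build_gguf_export_command(llama_cpp_dir: str, model_dir: str) -> list[str]:
    """Build the argv list for llama.cpp's conversion script (illustrative)."""
    script = Path(llama_cpp_dir) / "convert-hf-to-gguf.py"
    # Use the current interpreter so the script runs in the same environment.
    return [sys.executable, str(script), str(model_dir)]


# Usage (assumes llama.cpp is checked out locally and 'directory' holds the
# output of save_pretrained):
#   import subprocess
#   subprocess.run(build_gguf_export_command("path/to/llama.cpp", "directory"), check=True)
```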
15 changes: 11 additions & 4 deletions src/transformers/configuration_utils.py
@@ -27,6 +27,7 @@

from . import __version__
from .dynamic_module_utils import custom_object_save
from .modeling_gguf_pytorch_utils import load_gguf_checkpoint
from .utils import (
CONFIG_NAME,
PushToHubMixin,
@@ -658,6 +659,8 @@ def _get_config_dict(
from_auto_class = kwargs.pop("_from_auto", False)
commit_hash = kwargs.pop("_commit_hash", None)

from_gguf = kwargs.get("from_gguf", None)
Review comment (Collaborator): Should this be pop here?

Suggested change:
- from_gguf = kwargs.get("from_gguf", None)
+ from_gguf = kwargs.pop("from_gguf", None)

Reply (Contributor): Here I think it should be get as from_gguf is used later in case one uses Auto classes

Reply (Collaborator): Ah, OK!
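
The difference discussed in this thread can be illustrated with plain dictionaries. This is a minimal sketch, not code from the PR; `peek_option` and `consume_option` are hypothetical names.

```py
def peek_option(kwargs: dict):
    """Read the option with .get(): the key stays in kwargs for later consumers."""
    return kwargs.get("from_gguf", None)


def consume_option(kwargs: dict):
    """Read the option with .pop(): the key is removed from kwargs."""
    return kwargs.pop("from_gguf", None)


kwargs = {"from_gguf": "model.gguf", "revision": "main"}
first = peek_option(kwargs)      # value is read, key is still present
second = consume_option(kwargs)  # value is read, key is now gone
third = peek_option(kwargs)      # nothing left to read, falls back to None
print(first, second, third, sorted(kwargs))
```

With `.get()`, code that runs later on the same `kwargs` (for example an Auto class dispatching to a concrete class) still sees the option; with `.pop()` it would not.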


if trust_remote_code is True:
logger.warning(
"The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is"
@@ -676,10 +679,10 @@ def _get_config_dict(
             resolved_config_file = pretrained_model_name_or_path
             is_local = True
         elif is_remote_url(pretrained_model_name_or_path):
-            configuration_file = pretrained_model_name_or_path
+            configuration_file = pretrained_model_name_or_path if from_gguf is None else from_gguf
             resolved_config_file = download_url(pretrained_model_name_or_path)
         else:
-            configuration_file = kwargs.pop("_configuration_file", CONFIG_NAME)
+            configuration_file = kwargs.pop("_configuration_file", CONFIG_NAME) if from_gguf is None else from_gguf

try:
# Load from local folder or from cache or download from model Hub and cache
@@ -712,8 +715,12 @@ def _get_config_dict(
)

         try:
-            # Load config dict
-            config_dict = cls._dict_from_json_file(resolved_config_file)
+            if from_gguf:
+                config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
+            else:
+                # Load config dict
+                config_dict = cls._dict_from_json_file(resolved_config_file)
+
             config_dict["_commit_hash"] = commit_hash
         except (json.JSONDecodeError, UnicodeDecodeError):
             raise EnvironmentError(
16 changes: 16 additions & 0 deletions src/transformers/integrations/__init__.py
@@ -44,6 +44,14 @@
"unset_hf_deepspeed_config",
],
"eetq": ["replace_with_eetq_linear"],
"ggml": [
"GGUF_CONFIG_MAPPING",
"GGUF_TENSOR_MAPPING",
"GGUF_TOKENIZER_MAPPING",
"_gguf_parse_value",
"load_dequant_gguf_tensor",
"load_gguf",
],
"hqq": ["prepare_for_hqq_linear"],
"integration_utils": [
"INTEGRATION_TO_CALLBACK",
@@ -116,6 +124,14 @@
unset_hf_deepspeed_config,
)
from .eetq import replace_with_eetq_linear
from .ggml import (
GGUF_CONFIG_MAPPING,
GGUF_TENSOR_MAPPING,
GGUF_TOKENIZER_MAPPING,
_gguf_parse_value,
load_dequant_gguf_tensor,
load_gguf,
)
from .hqq import prepare_for_hqq_linear
from .integration_utils import (
INTEGRATION_TO_CALLBACK,