Refactor convert.py and add support for Meta's official Llama 3 model #6819

Closed · 4 tasks done
teleprint-me opened this issue Apr 22, 2024 · 20 comments
Labels: enhancement (New feature or request)

teleprint-me (Contributor) commented Apr 22, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Support the Official Llama 3 PyTorch model distributed by Meta.

Motivation

The convert.py script supports converting the raw Llama 1 and 2 torch models distributed by Facebook Research Labs, but not the raw Llama 3 torch models. PR #6745 implemented the conversion process for HuggingFace's transformers and tokenizers framework implementations, but not for the raw torch models themselves.

The current convert.py implementation suffers from feature creep driven by the desire to support HuggingFace's formats, and those features are now blocking and interfering with the implementation for Llama 3.

Possible Implementation

The official Llama 3 release ships with a plaintext BPE tokenizer.model file in OpenAI's tiktoken format (GPT-2-style byte-level BPE). This means tiktoken is required in order to convert the model to a compatible GGUF format.

We would need to integrate this into the BpeVocab class, which currently only supports HuggingFace's tokenizers format.
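As a rough illustration (a sketch only, with a placeholder path and a truncated special-token list; the authoritative details are in the meta-llama/llama3 reference Tokenizer), reading that file with tiktoken looks roughly like this:

```python
# Hedged sketch: load the official Llama 3 tokenizer.model with tiktoken.
# The path is a placeholder and only two special tokens are shown; the full list
# and the exact split pattern live in the meta-llama/llama3 reference Tokenizer.
import tiktoken
from tiktoken.load import load_tiktoken_bpe

model_path = "Meta-Llama-3-8B-Instruct/tokenizer.model"  # placeholder path

# Maps token bytes -> rank; ranks double as token ids and merge priorities.
mergeable_ranks = load_tiktoken_bpe(model_path)

special_tokens = {  # subset, for illustration only
    "<|begin_of_text|>": len(mergeable_ranks),
    "<|end_of_text|>": len(mergeable_ranks) + 1,
}

encoding = tiktoken.Encoding(
    name="llama3",
    # Split pattern as published in the reference implementation (verify upstream).
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

print(encoding.encode("Hello, Llama 3!"))
```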

Meta has already published the implementation details as official source code in its meta-llama org repo; see https://github.com/meta-llama/llama3 for more information.

The Tokenizer class implementation is already fleshed out, but it needs to be refactored and integrated into the Vocab factory in a reasonable way. This is no small feat, because it breaks the existing pattern and, as a result, deviates from previous releases.

We already support most of these models, and new vocabularies are few and far between, but there are enough abstractions and implementations that the complexity keeps increasing over time.

Some ideas I'm currently considering involve a series of steps, taken over time, to reduce the complexity of the convert.py script and make it easier to maintain and extend.

This means removing any unnecessary and unrelated code from the convert.py script and migrating all HuggingFace source code to the convert-hf-to-gguf.py script. This is a long-term proposal that requires everyone to be on the same page in order to pull it off effectively and efficiently.

I outlined my rationale in the link above referencing PR #6745. A potentially related issue is #6690.

I'm open to any feedback and suggestions here. I'm in no rush to implement this, and I believe it's wise that we don't rush it, as enough technical debt has piled up already. It might be better to discuss this first and determine the best steps to take before moving forward.

@cebtenzzre @ngxson @pcuenca

teleprint-me added the enhancement (New feature or request) label Apr 22, 2024
pcuenca (Contributor) commented Apr 22, 2024

I'm a newcomer to the project, so I can't comment on past design decisions. Before #6144, I think convert.py was used to convert Llama/Mistral models (native weights or in HF transformers format), whereas convert-hf-to-gguf.py was used to convert other architectures available in HF format. It sounds reasonable to me that the hf script only handles the HF format, but I'm not sure if you're implying that support should be removed from convert.py, and what the implications of that would be.

One note about converting the native Llama 3 tokenizer: it is indeed based on tiktoken, which should be just a fast BPE implementation. However, there are some quirks in the implementation that make it a bit tricky to export to pure BPE.

  • To extract the merges, you can start with a method like this, but it won't find all the merges.
  • You need to consider all possible pairs (prefix + suffix) that make up each token in the vocab, and add those that exist in the BPE list. (You can read the list with load_tiktoken_bpe from the tiktoken.load module.) A rough sketch of this step follows below.
  • In addition, this line in the tiktoken implementation deviates from BPE: it shortcuts the BPE algorithm if the substring being considered for tokenization is already in the vocab. This is why the transformers implementation had to add the ignore_merges configuration option. A similar mechanism would have to be added here if it's not already in place.

The first two points (extracting the full list of merges) should be easy once the transformers conversion script is published, which should happen soon. The third point (shortcutting BPE for subwords already in the vocabulary) should be explored. If these details are not in place, tokenization will produce slightly different sequences of token ids, especially for a few words in some languages.
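A rough, hedged sketch of the prefix/suffix idea from the second bullet (the function name is made up here, and this deliberately ignores the ordering subtleties that the transformers conversion script handles):

```python
# Hedged sketch of the merge-recovery idea: for each multi-byte token, collect the
# (prefix, suffix) splits where both halves are themselves tokens, ordered by rank.
from tiktoken.load import load_tiktoken_bpe

def recover_candidate_merges(tokenizer_model_path: str) -> list[tuple[bytes, bytes]]:
    ranks = load_tiktoken_bpe(tokenizer_model_path)  # dict[bytes, int]
    candidates: list[tuple[int, bytes, bytes]] = []
    for token, rank in ranks.items():
        if len(token) < 2:
            continue  # single bytes form the base alphabet, not merges
        for i in range(1, len(token)):
            prefix, suffix = token[:i], token[i:]
            if prefix in ranks and suffix in ranks:
                candidates.append((rank, prefix, suffix))
    candidates.sort(key=lambda c: c[0])  # lower rank roughly means earlier merge
    return [(prefix, suffix) for _, prefix, suffix in candidates]
```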

ryao commented Apr 22, 2024

It already works as far as I can tell:

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
cd /path/to/llama.cpp
./convert.py ${HOME}/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/1448453bdb895762499deb4176c1dd83b145fac1 --outfile Llama-3-8B-Instruct.f32.gguf --outtype f32 --vocab-type bpe
./quantize Llama-3-8B-Instruct.f32.gguf  Llama-3-8B-Instruct.Q8_0.gguf Q8_0
./main -m Llama-3-8B-Instruct.Q8_0.gguf --no-display-prompt -e -c 0  -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a story writing assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nWrite a story about llamas.<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"

Add -ngl 33 to that last command for GPU offload. That is what I did to confirm it worked, as I did not have the patience to wait for this to run on my CPU. Quantization is unnecessary, but I included it to reduce the memory requirements from those of FP32.

teleprint-me (Contributor Author) commented:

@pcuenca Thanks! I really appreciate your input and feedback. I'll check it out when I have some time.

@ryao That's the Huggingface model created with the transformers and tokenizers frameworks. I'm referencing the raw model distributed by Meta directly.

PawelSzpyt commented:

Confirmed: I downloaded the full model directly from Meta and I can't convert it to GGUF with llama.cpp. I'm on a Mac, but I guess it doesn't really matter. I tried convert.py and convert-hf-to-gguf.py, with and without vocab-type and pad-vocab, f16/f32, different Python envs, etc. :)

(llama3) ps@macstudio llama.cpp % python convert.py /Users/ps/llama3-8b/llama3/Meta-Llama-3-8B-Instruct/ --outfile /Users/ps/llama3-8b/llama3/Meta-Llama-3-8b-Instruct/fp32.gguf --vocab-type bpe --pad-vocab --outtype f16
Loading model file /Users/ps/llama3-8b/llama3/Meta-Llama-3-8B-Instruct/consolidated.00.pth
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/Users/ps/llama3-8b/llama3/Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
  File "/Users/ps/llama.cpp/convert.py", line 1555, in <module>
    main()
  File "/Users/ps/llama.cpp/convert.py", line 1522, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
  File "/Users/ps/llama.cpp/convert.py", line 1424, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
  File "/Users/ps/llama.cpp/convert.py", line 1414, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Since you're not in a hurry, I guess I'll just get the HF version. It will probably also be easier to use later in mergekit (?).

teleprint-me (Contributor Author) commented:

What makes the convert.py script so valuable is that it doesn't load the full model into memory, and it is not supposed to depend on any libraries other than numpy and torch.

ryao commented Apr 22, 2024

> Confirmed: I downloaded the full model directly from Meta and I can't convert it to GGUF with llama.cpp.

Use huggingface-cli to download it.

ryao commented Apr 22, 2024

> @ryao That's the Huggingface model created with the transformers and tokenizers frameworks. I'm referencing the raw model distributed by Meta directly.

Aren't the weights the same?

teleprint-me (Contributor Author) commented:

> Aren't the weights the same?

@ryao

The answer is nuanced. The issue stems from the file formats and the tokenizers (aka vocabularies).

The consolidated.00.pth is not the same file format as the 4 model-0000n-of-00004.safetensors parts in the HF repository.

So while the model weights themselves might be the same (I would expect some differences), the file formats are completely different; e.g. a pickled file is not the same as a safetensors file. It should be noted that the HuggingFace repository contains both formats, so I can understand the confusion.

The issue I'm addressing is related to the models' vocabularies, since the convert.py script already handles the pth, bin, and safetensors file formats, as well as the sentencepiece and tokenizers vocabulary formats.

/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B-Instruct
├── checklist.chk  # Checkpoint
├── consolidated.00.pth  # Model
├── params.json  # Hyperparameters
└── tokenizer.model  # Plaintext BPE depends on tiktoken

1 directory, 4 files

What the convert.py script does not handle is the original BPE format, and it does not support the tiktoken format either; the convert.py BPE vocabulary currently only supports the HuggingFace tokenizers format. The tiktoken vocabulary format is not the same as the tokenizers format at all. @pcuenca was kind enough to link to an OpenAI issue related to this, and it seems HuggingFace has had similar issues as well.

/mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct-HF
├── config.json  # transformers hyperparameters
├── generation_config.json  # Generation hyperparameters
├── LICENSE
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── original  # Ignored
│   ├── consolidated.00.pth
│   ├── params.json
│   └── tokenizer.model
├── README.md
├── special_tokens_map.json  # Added tokens
├── tokenizer_config.json  # Tokenizers generation
├── tokenizer.json  # tokenizers vocabulary
└── USE_POLICY.md

2 directories, 16 files
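To make the difference concrete, here is a hypothetical snippet (paths are placeholders mirroring the trees above): the tiktoken tokenizer.model is plaintext lines of a base64-encoded token plus its rank, while the HF tokenizers tokenizer.json is JSON carrying an explicit vocab and merges list.

```python
# Hypothetical snippet contrasting the two on-disk vocabulary formats.
import base64
import json
from pathlib import Path

def peek_tiktoken(path: str, n: int = 3) -> None:
    # Each line is "<base64 token bytes> <rank>".
    for line in Path(path).read_text().splitlines()[:n]:
        token_b64, rank = line.split()
        print(base64.b64decode(token_b64), int(rank))

def peek_hf_tokenizers(path: str, n: int = 3) -> None:
    # tokenizer.json stores the vocab and merge rules explicitly under "model".
    model = json.loads(Path(path).read_text(encoding="utf-8"))["model"]
    print(list(model["vocab"].items())[:n])
    print(model["merges"][:n])

peek_tiktoken("Meta-Llama-3-8B-Instruct/tokenizer.model")        # placeholder path
peek_hf_tokenizers("Meta-Llama-3-8B-Instruct-HF/tokenizer.json")  # placeholder path
```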

The vocabulary and the language model are separate entities in other libraries and frameworks, while GGML unifies the vocabulary and the language model into a single file, along with metadata relevant to the model. The GGUF file format is standardized, so it's easier to follow and comprehend as a result.
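For comparison, a hedged sketch of inspecting a converted GGUF file with the gguf-py package bundled in this repo (the file name is borrowed from an earlier comment in this thread):

```python
# Hedged sketch: peek at a converted GGUF file with gguf-py to show that the
# vocabulary metadata and the tensors live together in one file.
from gguf import GGUFReader

reader = GGUFReader("Llama-3-8B-Instruct.f32.gguf")  # file name from an earlier comment

# Metadata fields include the embedded vocabulary (tokenizer.ggml.* keys) alongside
# the hyperparameters, so no separate tokenizer file is needed at inference time.
for name in list(reader.fields)[:10]:
    print(name)

print(f"{len(reader.tensors)} tensors stored in the same file")
```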

ashwini commented Apr 23, 2024

Confirming convert.py doesn't work for Llama3 70b-instruct direct from Meta (not HF), on macOS (14.4.1) with latest llama.cpp using --vocab-type=bpe:

raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}") FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Edit: clarified that I'm referring to the model directly from Meta, not HF.

junrong1 commented:

> Confirming convert.py doesn't work for Llama3 70b-instruct, on macOS (14.4.1) with latest llama.cpp using --vocab-type=bpe:
>
> raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}") FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Did you try the latest version? It works on my macOS (14.2.1).

PawelSzpyt commented:

I did try the latest version; it does not work on my macOS 14.4.1. It works fine, without any problems and very fast, if you downloaded the weights from Huggingface. It does not work at all if you got them from Meta.

junrong1 commented:

> I did try the latest version; it does not work on my macOS 14.4.1. It works fine, without any problems and very fast, if you downloaded the weights from Huggingface. It does not work at all if you got them from Meta.

Yep I used HF as well. I haven't tried the original from Meta.

leon-maechler commented Apr 23, 2024

EDIT: everything works now; I just had to reset the repo, I had some old version...

./convert.py ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/561487d18c41c76bcb5fc6cfb73a324982f04f47/ --outfile ./models/Llama3-8B.gguf --vocab-type bpe

This works for me.

pcuenca (Contributor) commented Apr 23, 2024

Regarding the conversion of the original tokenizer to pure BPE, the transformers implementation is now available as a PR.

If anyone decides to tackle this, keep in mind the tiktoken deviation from BPE mentioned in a previous comment, which should be replicated if you want to ensure 100% compatibility.

teleprint-me (Contributor Author) commented Apr 23, 2024

@pcuenca Yeah, that makes me glad I'm waiting. Patience usually wins out.

I can implement it later on once I have some more time. I did the vocab factory for the original convert script, so I already understand how it works.

Something to keep in mind about the huggingface conversion scripts is that they're not part of the standard HF API. They're compatibility-oriented CLI tools intended to convert external formats from outside sources into the desired huggingface format, so it's unwise to rely on them. This is why I explicitly recommended migrating any HF-reliant code from convert.py to convert-hf-to-gguf.py; convert.py should not rely on the HF API at all.

The last time I tried to leverage those tools, it did not go as planned and they turned out to be limited. I never got around to fixing it because I only have so much time and bandwidth.

I already prototyped a few iterations, but I don't like any of them.

  • Idea 1: Be as lazy as possible: use the llama3 repo as a dependency, import llama.Tokenizer as Llama3Tokenizer, and encode and decode that way (a sketch of this appears after this list).
  • Idea 2: Don't be lazy at all and implement a custom tokenizer for Llama 3, but that felt out of scope.
  • Idea 3: Build it out fully, but then I realized how much the code base would grow because of the special tokens and the need to reproduce the end result.
  • Idea 4: Accept the convergence of the HF API taking over the conversion scripts and merge them completely.
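For reference, a minimal sketch of what Idea 1 might look like, assuming a checkout of the meta-llama/llama3 repo is importable and using a placeholder model path:

```python
# Hypothetical sketch of Idea 1: reuse the reference tokenizer instead of
# re-implementing it. Assumes the meta-llama/llama3 package is importable
# (e.g. a checkout on PYTHONPATH); the model path is a placeholder.
from llama.tokenizer import Tokenizer as Llama3Tokenizer

tokenizer = Llama3Tokenizer(model_path="Meta-Llama-3-8B-Instruct/tokenizer.model")

ids = tokenizer.encode("Write a story about llamas.", bos=True, eos=False)
print(ids)
print(tokenizer.decode(ids))
```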

I still need time to think about how to go about this. Input and feedback are welcome. It's easier to iterate over ideas than over code.

ashwini commented Apr 24, 2024

Given the importance of Llama 3, shouldn't this be considered a serious bug rather than an enhancement?

teleprint-me (Contributor Author) commented Apr 24, 2024

A bug is when you have code that isn't working as intended; a bug can also be something that is working as intended but can be exploited in an unintended way. An enhancement can be the application of a pattern, or some form of improvement or new addition to the code base. I suppose this could be both? 🤔😅

I would say it's more of a design flaw, as well as a common misunderstanding. Everyone tries to use the convert.py script to convert huggingface models and is then confused as to why it doesn't work. It doesn't work because the script was originally designed to convert a raw torch model.

I labeled it as an enhancement because the code base needs to be rethought now that tiktoken has thrown a wrench into the pipeline.

Technically, I could just include llama3 or tiktoken and call it a day, but I would prefer to not pile onto the technical debt.

If momentum is preferred, then it's probably better to just include the llama3 tokenizer directly and call it a day. This would be the path of least resistance.

It would take me about a day, I think (so, doubling my estimate, probably two days) to include tiktoken, or a few hours if I used the llama3 tokenizer directly. Either way, this is a band-aid on something that needs more attention, and it is a growing issue as more and more models are added over time. The earlier it's handled, the better off everyone will be.

apthagowda97 commented:

Any update on converting RAW meta models to HF??

pcuenca (Contributor) commented Apr 25, 2024

> Any update on converting RAW meta models to HF??

You can use the conversion script that was merged yesterday into transformers @ main.

teleprint-me (Contributor Author) commented May 2, 2024

This is going to get closed after 14 days of inactivity.

I'm waiting on some PRs to get merged, and I'll work on this when I have some more time. I'm spread too thin at the moment, unfortunately.

I'll eventually open a PR once the major changes settle down because a bunch of breaking changes were introduced due to the Llama 3 tokenizer.

I appreciate everyone who participated; it added value to this thread.
