Refactor convert.py and add support for Meta's official Llama 3 model #6819

Closed · 4 tasks done
teleprint-me opened this issue Apr 22, 2024 · 20 comments
Labels: enhancement (New feature or request)

teleprint-me (Contributor) commented Apr 22, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Support the Official Llama 3 PyTorch model distributed by Meta.

Motivation

The convert.py script supports converting the raw Llama 1 and 2 torch models distributed by Facebook Research Labs, but not the raw Llama 3 torch models. PR #6745 implemented the conversion process for HuggingFace's transformers and tokenizers framework implementations, but not for the raw torch models themselves.

The current convert.py implementation suffers from feature creep driven by the desire to support HuggingFace's formats, and those features are now blocking and interfering with the implementation for Llama 3.

Possible Implementation

The official Llama 3 release ships with a plaintext BPE tokenizer.model file in OpenAI's tiktoken format (GPT-2-style byte-level BPE). This means tiktoken is required in order to convert the model to a compatible GGUF format.

We would need to integrate this into the BpeVocab class, which currently only supports HuggingFace's tokenizers format.
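As a rough illustration (a sketch only, with a placeholder path and a truncated special-token list; the authoritative details are in the meta-llama/llama3 reference Tokenizer), reading that file with tiktoken looks roughly like this:

```python
# Hedged sketch: load the official Llama 3 tokenizer.model with tiktoken.
# The path is a placeholder and only two special tokens are shown; the full list
# and the exact split pattern live in the meta-llama/llama3 reference Tokenizer.
import tiktoken
from tiktoken.load import load_tiktoken_bpe

model_path = "Meta-Llama-3-8B-Instruct/tokenizer.model"  # placeholder path

# Maps token bytes -> rank; ranks double as token ids and merge priorities.
mergeable_ranks = load_tiktoken_bpe(model_path)

special_tokens = {  # subset, for illustration only
    "<|begin_of_text|>": len(mergeable_ranks),
    "<|end_of_text|>": len(mergeable_ranks) + 1,
}

encoding = tiktoken.Encoding(
    name="llama3",
    # Split pattern as published in the reference implementation (verify upstream).
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens=special_tokens,
)

print(encoding.encode("Hello, Llama 3!"))
```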

Meta has already published the implementation details as official source code in its meta-llama org repo; see https://github.com/meta-llama/llama3 for more information.

The Tokenizer class implementation is already fleshed out, but it needs to be refactored and integrated into the Vocab factory in a reasonable way. This is no small feat, because it breaks the existing pattern and, as a result, deviates from previous releases.

We already support most of these models, and new vocabularies are few and far between, but there are enough abstractions and implementations that the complexity keeps increasing over time.

Some ideas I'm currently considering involve a series of steps, taken over time, to reduce the complexity of the convert.py script and make it easier to maintain and extend.

This means removing any unnecessary and unrelated code from the convert.py script and migrating all HuggingFace source code to the convert-hf-to-gguf.py script. This is a long-term proposal that requires everyone to be on the same page in order to pull it off effectively and efficiently.

I outlined my rationale in the link above referencing PR #6745. A potentially related issue is #6690.

I'm open to any feedback and suggestions here. I'm in no rush to implement this, and I believe it's wise that we don't rush it, as enough technical debt has piled up already. It might be better to discuss this first and determine the best steps to take before moving forward.

@cebtenzzre @ngxson @pcuenca

teleprint-me added the enhancement (New feature or request) label Apr 22, 2024
pcuenca (Contributor) commented Apr 22, 2024

I'm a newcomer to the project, so I can't comment on past design decisions. Before #6144, I think convert.py was used to convert Llama/Mistral models (native weights or in HF transformers format), whereas convert-hf-to-gguf.py was used to convert other architectures available in HF format. It sounds reasonable to me that the hf script only handles the HF format, but I'm not sure if you're implying that support should be removed from convert.py, and what the implications of that would be.

One note about converting the native Llama 3 tokenizer: it is indeed based on tiktoken, which should be just a fast BPE implementation. However, there are some quirks in the implementation that make it a bit tricky to export to pure BPE.

  • To extract the merges, you can start with a method like this, but it won't find all the merges.
  • You need to consider all possible pairs (prefix + suffix) that make up each token in the vocab, and add those that exist in the BPE list. (You can read the list with load_tiktoken_bpe from the tiktoken.load module.) A rough sketch of this step follows below.
  • In addition, this line in the tiktoken implementation deviates from BPE: it shortcuts the BPE algorithm if the substring being considered for tokenization is already in the vocab. This is why the transformers implementation had to add the ignore_merges configuration option. A similar mechanism would have to be added here if it's not already in place.

The first two points (extracting the full list of merges) should be easy once the transformers conversion script is published, which should happen soon. The third point (shortcutting BPE for subwords already in the vocabulary) should be explored. If these details are not in place, tokenization will produce slightly different sequences of token ids, especially for a few words in some languages.
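A rough, hedged sketch of the prefix/suffix idea from the second bullet (the function name is made up here, and this deliberately ignores the ordering subtleties that the transformers conversion script handles):

```python
# Hedged sketch of the merge-recovery idea: for each multi-byte token, collect the
# (prefix, suffix) splits where both halves are themselves tokens, ordered by rank.
from tiktoken.load import load_tiktoken_bpe

def recover_candidate_merges(tokenizer_model_path: str) -> list[tuple[bytes, bytes]]:
    ranks = load_tiktoken_bpe(tokenizer_model_path)  # dict[bytes, int]
    candidates: list[tuple[int, bytes, bytes]] = []
    for token, rank in ranks.items():
        if len(token) < 2:
            continue  # single bytes form the base alphabet, not merges
        for i in range(1, len(token)):
            prefix, suffix = token[:i], token[i:]
            if prefix in ranks and suffix in ranks:
                candidates.append((rank, prefix, suffix))
    candidates.sort(key=lambda c: c[0])  # lower rank roughly means earlier merge
    return [(prefix, suffix) for _, prefix, suffix in candidates]
```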

ryao commented Apr 22, 2024

It already works as far as I can tell:

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct
cd /path/to/llama.cpp
./convert.py ${HOME}/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/1448453bdb895762499deb4176c1dd83b145fac1 --outfile Llama-3-8B-Instruct.f32.gguf --outtype f32 --vocab-type bpe
./quantize Llama-3-8B-Instruct.f32.gguf  Llama-3-8B-Instruct.Q8_0.gguf Q8_0
./main -m Llama-3-8B-Instruct.Q8_0.gguf --no-display-prompt -e -c 0  -r '<|eot_id|>' --in-prefix '\n<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -p "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a story writing assistant.<|eot_id|>\n<|start_header_id|>user<|end_header_id|>\n\nWrite a story about llamas.<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>\n\n"

Add -ngl 33 to that last command for GPU offload. That is what I did to confirm it worked, as I did not have the patience to wait for this to run on my CPU. Quantization is unnecessary, but I included it to reduce the memory requirements from those of FP32.

teleprint-me (Contributor Author) commented:

@pcuenca Thanks! I really appreciate your input and feedback. I'll check it out when I have some time.

@ryao That's the Huggingface model created with the transformers and tokenizers frameworks. I'm referencing the raw model distributed by Meta directly.

PawelSzpyt commented:

Confirmed: I downloaded the full model directly from Meta and I can't convert it to GGUF with llama.cpp. I'm on a Mac, but I guess it doesn't really matter. I tried convert.py and convert-hf-to-gguf.py, with and without vocab-type and pad-vocab, f16/f32, different Python envs, etc. :)

(llama3) ps@macstudio llama.cpp % python convert.py /Users/ps/llama3-8b/llama3/Meta-Llama-3-8B-Instruct/ --outfile /Users/ps/llama3-8b/llama3/Meta-Llama-3-8b-Instruct/fp32.gguf --vocab-type bpe --pad-vocab --outtype f16
Loading model file /Users/ps/llama3-8b/llama3/Meta-Llama-3-8B-Instruct/consolidated.00.pth
params = Params(n_vocab=128256, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=500000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('/Users/ps/llama3-8b/llama3/Meta-Llama-3-8B-Instruct'))
Traceback (most recent call last):
  File "/Users/ps/llama.cpp/convert.py", line 1555, in <module>
    main()
  File "/Users/ps/llama.cpp/convert.py", line 1522, in main
    vocab, special_vocab = vocab_factory.load_vocab(vocab_types, model_parent_path)
  File "/Users/ps/llama.cpp/convert.py", line 1424, in load_vocab
    vocab = self._create_vocab_by_path(vocab_types)
  File "/Users/ps/llama.cpp/convert.py", line 1414, in _create_vocab_by_path
    raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}")
FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Since you're not in a hurry, I guess I'll just get the HF version. It will probably also be easier to use later in mergekit (?).

teleprint-me (Contributor Author) commented:

What makes the convert.py script so valuable is that it doesn't load the full model into memory, and it is not supposed to depend on any libraries other than numpy and torch.

ryao commented Apr 22, 2024

> Confirmed: I downloaded the full model directly from Meta and I can't convert it to GGUF with llama.cpp.

Use huggingface-cli to download it.

ryao commented Apr 22, 2024

> @ryao That's the Huggingface model created with the transformers and tokenizers frameworks. I'm referencing the raw model distributed by Meta directly.

Aren't the weights the same?

teleprint-me (Contributor Author) commented:

> Aren't the weights the same?

@ryao

The answer is nuanced. The issue stems from the file formats and the tokenizers (aka vocabularies).

The consolidated.00.pth is not the same file format as the 4 model-0000n-of-00004.safetensors parts in the HF repository.

So while the model weights themselves might be the same (I would expect some differences), the file formats are completely different; e.g. a pickled file is not the same as a safetensors file. It should be noted that the HuggingFace repository contains both formats, so I can understand the confusion.

The issue I'm addressing is related to the models' vocabularies, since the convert.py script already handles the pth, bin, and safetensors file formats, as well as the sentencepiece and tokenizers vocabulary formats.

/mnt/scsm/models/facebook/llama-3/Meta-Llama-3-8B-Instruct
├── checklist.chk  # Checkpoint
├── consolidated.00.pth  # Model
├── params.json  # Hyperparameters
└── tokenizer.model  # Plaintext BPE depends on tiktoken

1 directory, 4 files

What the convert.py script does not handle is the original BPE format, and it does not support the tiktoken format either; the convert.py BPE vocabulary currently only supports the HuggingFace tokenizers format. The tiktoken vocabulary format is not the same as the tokenizers format at all. @pcuenca was kind enough to link to an OpenAI issue related to this, and it seems HuggingFace has had similar issues as well.

/mnt/valerie/models/meta-llama/Meta-Llama-3-8B-Instruct-HF
├── config.json  # transformers hyperparameters
├── generation_config.json  # Generation hyperparameters
├── LICENSE
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── original  # Ignored
│   ├── consolidated.00.pth
│   ├── params.json
│   └── tokenizer.model
├── README.md
├── special_tokens_map.json  # Added tokens
├── tokenizer_config.json  # Tokenizers generation
├── tokenizer.json  # tokenizers vocabulary
└── USE_POLICY.md

2 directories, 16 files
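To make the difference concrete, here is a hypothetical snippet (paths are placeholders mirroring the trees above): the tiktoken tokenizer.model is plaintext lines of a base64-encoded token plus its rank, while the HF tokenizers tokenizer.json is JSON carrying an explicit vocab and merges list.

```python
# Hypothetical snippet contrasting the two on-disk vocabulary formats.
import base64
import json
from pathlib import Path

def peek_tiktoken(path: str, n: int = 3) -> None:
    # Each line is "<base64 token bytes> <rank>".
    for line in Path(path).read_text().splitlines()[:n]:
        token_b64, rank = line.split()
        print(base64.b64decode(token_b64), int(rank))

def peek_hf_tokenizers(path: str, n: int = 3) -> None:
    # tokenizer.json stores the vocab and merge rules explicitly under "model".
    model = json.loads(Path(path).read_text(encoding="utf-8"))["model"]
    print(list(model["vocab"].items())[:n])
    print(model["merges"][:n])

peek_tiktoken("Meta-Llama-3-8B-Instruct/tokenizer.model")        # placeholder path
peek_hf_tokenizers("Meta-Llama-3-8B-Instruct-HF/tokenizer.json")  # placeholder path
```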

The vocabulary and the language model are separate entities in other libraries and frameworks, while GGML unifies the vocabulary and the language model into a single file, along with metadata relevant to the model. The GGUF file format is standardized, so it's easier to follow and comprehend as a result.
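For comparison, a hedged sketch of inspecting a converted GGUF file with the gguf-py package bundled in this repo (the file name is borrowed from an earlier comment in this thread):

```python
# Hedged sketch: peek at a converted GGUF file with gguf-py to show that the
# vocabulary metadata and the tensors live together in one file.
from gguf import GGUFReader

reader = GGUFReader("Llama-3-8B-Instruct.f32.gguf")  # file name from an earlier comment

# Metadata fields include the embedded vocabulary (tokenizer.ggml.* keys) alongside
# the hyperparameters, so no separate tokenizer file is needed at inference time.
for name in list(reader.fields)[:10]:
    print(name)

print(f"{len(reader.tensors)} tensors stored in the same file")
```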

ashwini commented Apr 23, 2024

Confirming convert.py doesn't work for Llama3 70b-instruct direct from Meta (not HF), on macOS (14.4.1) with latest llama.cpp using --vocab-type=bpe:

raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}") FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Edit: clarified that I'm referring to the model directly from Meta, not HF.

junrong1 commented:

> Confirming convert.py doesn't work for Llama3 70b-instruct, on macOS (14.4.1) with latest llama.cpp using --vocab-type=bpe:
>
> raise FileNotFoundError(f"Could not find a tokenizer matching any of {vocab_types}") FileNotFoundError: Could not find a tokenizer matching any of ['bpe']

Did you try the latest version? It works on my macOS (14.2.1).

PawelSzpyt commented:

I did try the latest version; it does not work on my macOS 14.4.1. It works fine, without any problems and very fast, if you downloaded the weights from Huggingface. It does not work at all if you got them from Meta.

junrong1 commented:

> I did try the latest version; it does not work on my macOS 14.4.1. It works fine, without any problems and very fast, if you downloaded the weights from Huggingface. It does not work at all if you got them from Meta.

Yep I used HF as well. I haven't tried the original from Meta.

leon-maechler commented Apr 23, 2024

EDIT: everything works now; I just had to reset the repo, I had some old version...

./convert.py ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/561487d18c41c76bcb5fc6cfb73a324982f04f47/ --outfile ./models/Llama3-8B.gguf --vocab-type bpe

This works for me.

pcuenca (Contributor) commented Apr 23, 2024

Regarding the conversion of the original tokenizer to pure BPE, the transformers implementation is now available as a PR.

If anyone decides to tackle this, keep in mind the tiktoken deviation from BPE mentioned in a previous comment, which should be replicated if you want to ensure 100% compatibility.

teleprint-me (Contributor Author) commented Apr 23, 2024

@pcuenca Yeah, that makes me glad I'm waiting. Patience usually wins out.

I can implement it later on once I have some more time. I did the vocab factory for the original convert script, so I already understand how it works.

Something to keep in mind about the huggingface conversion scripts is that they're not part of the standard HF API. They're compatibility-oriented CLI tools intended to convert external formats from outside sources into the desired huggingface format, so it's unwise to rely on them. This is why I explicitly recommended migrating any HF-reliant code from convert.py to convert-hf-to-gguf.py; convert.py should not rely on the HF API at all.

The last time I tried to leverage those tools, it did not go as planned and they turned out to be limited. I never got around to fixing it because I only have so much time and bandwidth.

I already prototyped a few iterations, but I don't like any of them.

  • Idea 1: Be as lazy as possible: use the llama3 repo as a dependency, import llama.Tokenizer as Llama3Tokenizer, and encode and decode that way (a sketch of this appears after this list).
  • Idea 2: Don't be lazy at all and implement a custom tokenizer for Llama 3, but that felt out of scope.
  • Idea 3: Build it out fully, but then I realized how much the code base would grow because of the special tokens and the need to reproduce the end result.
  • Idea 4: Accept the convergence of the HF API taking over the conversion scripts and merge them completely.
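For reference, a minimal sketch of what Idea 1 might look like, assuming a checkout of the meta-llama/llama3 repo is importable and using a placeholder model path:

```python
# Hypothetical sketch of Idea 1: reuse the reference tokenizer instead of
# re-implementing it. Assumes the meta-llama/llama3 package is importable
# (e.g. a checkout on PYTHONPATH); the model path is a placeholder.
from llama.tokenizer import Tokenizer as Llama3Tokenizer

tokenizer = Llama3Tokenizer(model_path="Meta-Llama-3-8B-Instruct/tokenizer.model")

ids = tokenizer.encode("Write a story about llamas.", bos=True, eos=False)
print(ids)
print(tokenizer.decode(ids))
```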

I still need time to think about how to go about this. Input and feedback are welcome. It's easier to iterate over ideas than over code.

ashwini commented Apr 24, 2024

Given the importance of Llama 3, shouldn't this be considered a serious bug rather than an enhancement?

teleprint-me (Contributor Author) commented Apr 24, 2024

A bug is when you have code that isn't working as intended; a bug can also be something that is working as intended but can be exploited in an unintended way. An enhancement can be the application of a pattern, or some form of improvement or new addition to the code base. I suppose this could be both? 🤔😅

I would say it's more of a design flaw, as well as a common misunderstanding. Everyone tries to use the convert.py script to convert huggingface models and is then confused as to why it doesn't work. It doesn't work because the script was originally designed to convert a raw torch model.

I labeled it as an enhancement because the code base needs to be rethought now that tiktoken has thrown a wrench into the pipeline.

Technically, I could just include llama3 or tiktoken and call it a day, but I would prefer to not pile onto the technical debt.

If momentum is preferred, then it's probably better to just include the llama3 tokenizer directly and call it a day. This would be the path of least resistance.

It would take me about a day, I think (so, doubling my estimate, probably two days) to include tiktoken, or a few hours if I used the llama3 tokenizer directly. Either way, this is a band-aid on something that needs more attention, and it is a growing issue as more and more models are added over time. The earlier it's handled, the better off everyone will be.

apthagowda97 commented:

Any update on converting RAW meta models to HF??

pcuenca (Contributor) commented Apr 25, 2024

> Any update on converting RAW meta models to HF??

You can use the conversion script that was merged yesterday into transformers @ main.

teleprint-me (Contributor Author) commented May 2, 2024

This is going to get closed after 14 days of inactivity.

I'm waiting on some PRs to get merged, and I'll work on this when I have some more time. I'm spread too thin at the moment, unfortunately.

I'll eventually open a PR once the major changes settle down because a bunch of breaking changes were introduced due to the Llama 3 tokenizer.

I appreciate everyone who participated; it added value to this thread.
