
Add support for Microsoft Phi-4 model #10817

Merged
fairydreaming merged 4 commits into ggerganov:master on Dec 19, 2024

Conversation

@fairydreaming (Collaborator) commented Dec 13, 2024

This PR adds support for the Microsoft Phi-4 model. Fixes #10814.

The current solution is to:

  • Use the tokenizer_class value from tokenizer_config.json as the condition for using the GPT2 vocab during model conversion.
  • Store an explicit 0 value for the sliding_window hparam if it is null. This allows the old Phi-3 n_swa validation logic to work without any changes. If n_swa is 0, a regular KQ mask is used instead of a sliding window KQ mask in build_phi3().

Initially, the model name value from general.name ("Phi 4") was used to trigger behavior specific to the Phi-4 model:

1. Using the GPT2 vocab during model conversion
2. Ignoring the sliding_window hparam during model conversion
3. Skipping the sliding window length check (n_swa == 0) in build_phi3()
4. Creating a regular KQ mask instead of a sliding window KQ mask in build_phi3()

Let me know if there is a better way to differentiate Phi-4 from other models based on the PHI3 architecture.
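
For illustration, a minimal sketch of the sliding_window handling on the conversion side (assumed structure and helper names, not the exact PR code):

# Sketch: write an explicit 0 when sliding_window is null in config.json,
# so that n_swa == 0 identifies Phi-4 at graph-build time.
import json
from pathlib import Path

def sliding_window_from_config(model_dir: Path) -> int:
    # Phi-4 sets "sliding_window": null, while Phi-3 models store an integer window length.
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    value = config.get("sliding_window")
    return 0 if value is None else int(value)

# The converter would then always write the value, e.g. (assumed call):
#     self.gguf_writer.add_sliding_window(sliding_window_from_config(self.dir_model))

With n_swa stored as 0, build_phi3() can keep its existing validation and build a regular KQ mask instead of the sliding window mask.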


llama : use regular (not a sliding window) attention mask for Phi-4 model
@github-actions bot added the python (python script changes) label on Dec 13, 2024
src/llama.cpp (Outdated)
@@ -12839,7 +12839,13 @@ struct llm_build_context {
        struct ggml_tensor * inp_pos = build_inp_pos();

        // KQ_mask (mask for 1 head, it will be broadcasted to all heads)
        struct ggml_tensor * KQ_mask_swa = build_inp_KQ_mask_swa();
        struct ggml_tensor * KQ_mask = nullptr;
        if (model.name == "Phi 4") {
Collaborator

I think a better solution would be to check if hparams.n_swa != 0.

Collaborator Author

I modified my patch to explicitly store a zero sliding_window value when it is null in config.json and to use the zero value to distinguish Phi-4 from other PHI3-based models.

convert_hf_to_gguf.py (Outdated, resolved)
Comment on lines 2132 to 2133:

    if self.metadata.name == "Phi 4":
        return self._set_vocab_gpt2()
Collaborator

Alternatively, self._set_vocab_gpt2() could be called when tokenizer.model is missing here, regardless of the model name.

Collaborator Author

I modified the solution to check the value of tokenizer_class from tokenizer_config.json and call self._set_vocab_gpt2() if it is GPT2Tokenizer.
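
A minimal sketch of that check (assuming the existing _set_vocab_gpt2() and _set_vocab_sentencepiece() helpers in convert_hf_to_gguf.py; the standalone function shape is illustrative):

# Illustrative version of the tokenizer_class-based vocab selection.
import json
from pathlib import Path

def choose_vocab(model_dir: Path, converter) -> None:
    # Phi-4 ships a GPT2-style tokenizer, while earlier Phi-3 checkpoints use
    # SentencePiece; tokenizer_config.json tells the two apart.
    cfg = json.loads((model_dir / "tokenizer_config.json").read_text(encoding="utf-8"))
    if cfg.get("tokenizer_class") == "GPT2Tokenizer":
        converter._set_vocab_gpt2()
    else:
        converter._set_vocab_sentencepiece()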

@JackCloudman

I tested with https://huggingface.co/JackCloudman/Phi-4-jackterated and it works.

@fairydreaming merged commit 7585edb into ggerganov:master on Dec 19, 2024
51 checks passed
@3Simplex

I tried and failed using the latest master.

INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:C:\Users\3simplex\Llama.Cpp-Toolbox\Converted\Microsoft_Phi-4-f16.gguf: n_tensors = 243, total_size = 29.3G
Writing:   0%|                                                                          | 0.00/29.3G [00:00<?, ?byte/s]
Traceback (most recent call last):
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\convert_hf_to_gguf.py", line 4682, in <module>
    main()
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\convert_hf_to_gguf.py", line 4676, in main
    model_instance.write()
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\convert_hf_to_gguf.py", line 442, in write
    self.gguf_writer.write_tensors_to_file(progress=True)
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\gguf-py\gguf\gguf_writer.py", line 453, in write_tensors_to_file
    ti.tensor.tofile(fout)
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\gguf-py\gguf\lazy.py", line 210, in tofile
    eager = LazyNumpyTensor.to_eager(self)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\gguf-py\gguf\lazy.py", line 169, in to_eager
    return cls._recurse_apply(t, simple_to_eager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\gguf-py\gguf\lazy.py", line 105, in _recurse_apply
    return fn(o)
           ^^^^^
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\gguf-py\gguf\lazy.py", line 160, in simple_to_eager
    _t._data = _t._func(*_t._args, **_t._kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\3simplex\Llama.Cpp-Toolbox\llama.cpp\gguf-py\gguf\lazy.py", line 207, in <lambda>
    return type(self)(meta=meta, args=full_args, kwargs=kwargs, func=(lambda a, *args, **kwargs: a.astype(*args, **kwargs)))
                                                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 980. MiB for an array with shape (100352, 5120) and data type float16

@compilade (Collaborator)

@3Simplex

How does your free RAM look during conversion? Do you have enough RAM?
Do you have enough disk space too?

The convert script needs enough RAM to hold at least the biggest tensor in memory. The biggest tensor is usually the token embeddings tensor, which is usually the first one read and written. For Phi-4, the shape of that tensor is (100352, 5120), around 514M elements.

Since the model files for Phi-4 are in BF16, and Numpy doesn't support that type, the tensors are losslessly converted to F32 before being converted to F16 (because that's the target type in your case). This means at least 4GB of free RAM is required to convert that 29.3GB model.

That is, assuming memory mapping works correctly on Windows (hopefully it does, I don't know). If it doesn't, then you would need at least 64GB of RAM.

Also, if you do have enough RAM, make sure your Python interpreter is a 64-bit build.
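
As a rough back-of-the-envelope check of those numbers (a sketch, not output of the convert script):

# Rough size estimate for the Phi-4 token embeddings tensor during conversion.
rows, cols = 100352, 5120                # shape from the traceback above
elements = rows * cols                   # ~514M elements
f16_bytes = elements * 2                 # ~980 MiB, matching the failed allocation above
f32_bytes = elements * 4                 # ~1.9 GiB for the intermediate F32 copy
print(f"{elements / 1e6:.0f}M elements, "
      f"F16 ~{f16_bytes / 2**20:.0f} MiB, F32 ~{f32_bytes / 2**30:.2f} GiB")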

@3Simplex

My RX 6900 XT usually does fine converting these things. I also have 32 GB of system RAM.
My Python is 3.11.9 (64-bit) and I use pyenv-win to manage versions.
I recently updated transformers to 4.47.1, if that matters.
numpy is 1.26.4.

I tried converting the jackterated model and it gets up to 19 GB / 29 GB before it hangs; there is plenty of space on the NVMe drive.

I can run the GGUF provided by matteogeniaccio, I just prefer converting them myself.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* convert : use GPT2 vocab for Phi-4 model

* convert : use null value of sliding_window to distinguish Phi-4 from other PHI3-based models

* llama : do not use sliding window attention mask for Phi-4 model

---------

Co-authored-by: Stanisław Szymczyk <[email protected]>
Labels: python (python script changes)
Successfully merging this pull request may close these issues:
Feature Request: Add support for Phi-4 model (#10814)

6 participants