Consolidate support for Phi-1, Phi-1.5, and Phi-2 models #4552

Closed
wants to merge 16 commits

Conversation

teleprint-me
Contributor

Overview

This pull request introduces changes to llama.cpp for unified handling of the Phi model variants (Phi-1, Phi-1.5, Phi-2). The modifications simplify architecture handling, tensor mapping, and computational-graph construction for these models; a minimal sketch of the resulting mappings follows the list below.

Changes

  • Replaced LLM_ARCH_PHI2 with LLM_ARCH_PHI across the codebase to create a singular reference for all Phi models.
  • Updated the architecture names mapping to change from "phi2" to "phi", ensuring consistency in architecture identification.
  • Adjusted the tensor names mapping to reflect the consolidated Phi model architecture, enabling correct tensor processing regardless of the specific Phi variant.
  • Modified hyperparameter loading logic to include Phi models with 24 layers, categorizing them as MODEL_1B. This addition caters to the different layer counts found in Phi model variants.
  • Updated the tensor loading sections in the code to utilize the new unified architecture enumeration, ensuring proper tensor instantiation.
  • Renamed the build_phi2() function to build_phi(), aligning it with the unified architecture name and ensuring appropriate computational graph construction for all Phi models.
  • Adjusted graph construction calls within the code to use the updated build_phi() function, maintaining functionality and integration across different Phi model variants.
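To make the shape of these changes concrete, here is a small, self-contained sketch in the spirit of llama.cpp's architecture tables. The identifiers LLM_ARCH_PHI, "phi", and the 24-layer/1B pairing come from the list above; the surrounding structure is illustrative, not a verbatim diff:

```cpp
// Illustrative sketch only: a single PHI architecture id, one "phi" name string,
// and layer-count-based sizing, as described in the list above.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_PHI /* replaces LLM_ARCH_PHI2 */ };

static const std::map<llm_arch, std::string> arch_names = {
    { LLM_ARCH_LLAMA, "llama" },
    { LLM_ARCH_PHI,   "phi"   },  // previously "phi2"
};

// Phi-1 and Phi-1.5 have 24 transformer blocks, so a 24-layer Phi model is
// labeled 1B here; other variants would add their own cases.
static std::string phi_model_type(uint32_t n_layer) {
    switch (n_layer) {
        case 24: return "1B";
        default: return "unknown";
    }
}

int main() {
    std::cout << arch_names.at(LLM_ARCH_PHI) << " -> " << phi_model_type(24) << "\n";  // prints "phi -> 1B"
}
```

In llama.cpp itself the name lives in the architecture-name table and the sizing in the hyperparameter-loading switch, but the mapping expresses the same idea.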

Impact

These changes make llama.cpp more flexible when working with the various Phi models. Consolidating them under a single architecture enumeration and updating the relevant sections of the code improves the maintainability and clarity of the codebase, and the unified approach makes future Phi-related extensions or modifications easier.

Testing

The changes have been tested with Phi-1 and Phi-1.5 models, successfully converting and running inference. The results indicate that the unified handling approach is effective and does not introduce any regressions in the functionality for these models.

15:36:08 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ python convert-hf-to-gguf.py stash/models/microsoft/phi-1_5
Loading model: phi-1_5
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to 'stash/models/microsoft/phi-1_5/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model.bin'
/mnt/valerie/llama.cpp/.venv/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
token_embd.weight, n_dims = 2, torch.float16 --> float16
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi
llama_model_loader: - kv   1:                               general.name str              = Phi
llama_model_loader: - kv   2:                         phi.context_length u32              = 2048
llama_model_loader: - kv   3:                       phi.embedding_length u32              = 2048
llama_model_loader: - kv   4:                    phi.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                            phi.block_count u32              = 24
llama_model_loader: - kv   6:                   phi.attention.head_count u32              = 32
llama_model_loader: - kv   7:                phi.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:           phi.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                   phi.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = Phi
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.09 MiB
llm_load_tensors: mem required  = 2706.37 MiB
................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_build_graph: non-view tensors processed: 582/582
llama_new_context_with_model: compute buffer total size = 159.19 MiB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0


Question: What is the role of ribosomes in cellular biology?
Answer: Ribosomes are responsible for synthesizing proteins, which are essential for various cellular processes. They act as protein factories within cells and play a crucial role in maintaining the overall functionality of living organisms.
 [end of text]

llama_print_timings:        load time =     131.71 ms
llama_print_timings:      sample time =       6.35 ms /    41 runs   (    0.15 ms per token,  6453.64 tokens per second)
llama_print_timings: prompt eval time =     143.38 ms /    17 tokens (    8.43 ms per token,   118.56 tokens per second)
llama_print_timings:        eval time =    2548.84 ms /    40 runs   (   63.72 ms per token,    15.69 tokens per second)
llama_print_timings:       total time =    2711.62 ms
Log end

Looking forward to your feedback and suggestions on these changes.

- Created the `initialize_writer` function to set up GGUF writer with model metadata
- Included validation for file type and architecture
- Default hyperparameter values sourced from MixFormerSequentialConfig
- Function annotations and documentation added for clarity
- Prepared groundwork for MixFormer architecture integration
- Replaced LLM_ARCH_PHI2 with LLM_ARCH_PHI to unify the handling of different Phi model variants (Phi-1, Phi-1.5, Phi-2).
- Updated architecture names map to reflect the consolidated architecture name from "phi2" to "phi".
- Adjusted the tensor names mapping to use the new architecture name "phi" for consistent tensor loading and processing.
- Modified hyperparameter loading to include a case for 24 layers under LLM_ARCH_PHI, classifying it as MODEL_1B. This change accommodates different layer counts for various Phi model variants.
- Updated tensor loading sections to use the new architecture enum, ensuring proper tensor creation based on the model architecture.
- Renamed build_phi2() to build_phi() in the graph building section, aligning with the new architecture name and ensuring correct computational graph construction for Phi models.
- Adjusted graph construction calls to use the renamed build_phi() function, ensuring seamless integration and functionality for different Phi model variants.

These changes aim to streamline the handling of the various Phi models within `llama.cpp`, enhancing its ability to work with these models while maintaining code clarity and consistency. A minimal sketch of the renamed graph-builder dispatch follows.
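Below is a hypothetical, self-contained sketch of that dispatch. Only the LLM_ARCH_PHI and build_phi() identifiers come from the actual change; the context struct and placeholder bodies are invented for illustration:

```cpp
// Hypothetical sketch of routing all Phi variants through a single builder.
// Only the LLM_ARCH_PHI and build_phi() names reflect the real change;
// the rest is placeholder scaffolding.
struct ggml_cgraph;  // opaque stand-in for the real ggml graph type

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_PHI };

struct llm_build_ctx {
    ggml_cgraph * build_llama() { return nullptr; }  // placeholder body
    ggml_cgraph * build_phi()   { return nullptr; }  // formerly build_phi2(); used by Phi-1, Phi-1.5, Phi-2
};

static ggml_cgraph * build_graph(llm_build_ctx & llm, llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_LLAMA: return llm.build_llama();
        case LLM_ARCH_PHI:   return llm.build_phi();
    }
    return nullptr;
}

int main() {
    llm_build_ctx llm;
    return build_graph(llm, LLM_ARCH_PHI) ? 1 : 0;  // sketch only: both builders return nullptr
}
```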
@teleprint-me
Contributor Author

teleprint-me commented Dec 24, 2023

@slaren @ebeyabraham Do you know why the warning llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ). is popping up? I haven't had time to look into it, but I was planning on digging into it either tomorrow or Monday.

@slaren
Collaborator

slaren commented Dec 24, 2023

I think it means that some of the tokens cannot be tokenized, but are not tagged as special. It's probably not a big deal, but @staviq may know more about this.

@ggerganov
Owner

ggerganov commented Dec 24, 2023

It seems this change would break existing models due to "phi2" -> "phi". Is it worth it?

@teleprint-me
Contributor Author

teleprint-me commented Dec 24, 2023

@ggerganov

Yes, it's true that this change would break existing conversions and quants. However, I'd like to highlight why I believe it's a valuable modification.

All three models - Phi-1, Phi-1.5, and Phi-2 - share the same architecture and differ primarily in their number of layers. That similarity is exactly what makes a unified implementation attractive.

Since the architecture is shared, any other models created in the future using the PhiForCausalLM architecture will be compatible as well. It's probably better to break things now rather than later down the line.

By accommodating Phi-1, Phi-1.5, and Phi-2, we establish a unified implementation that can adapt to future Microsoft and other PhiForCausalLM model releases. It isn't guaranteed future-proofing, but this forward-looking approach minimizes the effort required for future updates and helps keep llama.cpp versatile and adaptable.

Adding support for Phi-1, Phi-1.5, and Phi-2 enhances llama.cpp's usability, accessibility, and adaptability. It's a worthwhile enhancement that promotes diversity in hardware usage and fosters innovation in AI research.

This change not only benefits current users but also sets a foundation for accommodating potential future models with greater ease. It's a valuable addition to llama.cpp's capabilities.

llama.cpp Outdated
Comment on lines 2123 to 2128
```cpp
// backwards compatibility with pre-#4552
// TODO: remove after Mar 2024
if (arch_name == "phi2") {
    arch_name = "phi";
}
```
Owner

@teleprint-me Could you give this a test and see if it solves backwards compatibility with "phi2"?

I'm a bit worried that if we don't handle the old name we'll get lots of complaints and issues

Owner

Ah nvm, it's actually not going to work because there are other parameters like phi2.context_length
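
(For readers following along: a minimal illustration, not llama.cpp's actual loader code, of why the rename alone falls short. Per-architecture GGUF metadata keys embed the architecture string as a prefix, as the phi.* keys in the log above show, so a loader normalized to "phi" would not find the phi2.*-prefixed keys in older files.)

```cpp
// Hypothetical helper showing the arch-prefixed key scheme; files written
// before this PR contain keys like "phi2.context_length".
#include <iostream>
#include <string>

static std::string kv_key(const std::string & arch, const std::string & suffix) {
    return arch + "." + suffix;
}

int main() {
    std::string old_key = kv_key("phi2", "context_length");  // present in pre-#4552 GGUF files
    std::string new_key = kv_key("phi",  "context_length");  // what a "phi"-normalized loader asks for
    std::cout << old_key << " != " << new_key << "\n";       // lookup misses unless every key is remapped
}
```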

Contributor Author

teleprint-me Dec 27, 2023


@ggerganov Yeah, that's why I broke it. I wish I had caught this sooner, but I've been preoccupied with a bunch of other stuff and I'm multitasking the best I can.

I actually started working on the conversion scripts too and I still have a bunch of other stuff, but this seemed like it needed attention sooner than later.

If you have a better idea, I'm open to it. I went with this because it seemed like the most pragmatic approach. I prefer simplicity and if making a simple choice will break something, then that's what I'll go with.

ggerganov force-pushed the phi-1 branch 2 times, most recently from 44ac215 to b0583f7 on December 27, 2023 at 16:46
@ggerganov
Owner

I've given this some more thought and prefer not to merge the change. It's more likely to cause issues with broken support for existing models than anything, so I think it is not worth it. Thanks for the effort though

@ggerganov ggerganov closed this Jan 9, 2024
@teleprint-me
Contributor Author

teleprint-me commented Jan 9, 2024

@ggerganov I think there's another way to do this without breaking the existing models. I can adapt the code accordingly. That is, if you're open to it?

@teleprint-me teleprint-me deleted the phi-1 branch January 10, 2024 00:01
@walter-cavinaw

I think this is quite important. Phi 1.5 and Phi 2 are the best small models, and right now it's not simple to convert them to GGUF. Phi 1.5 is quite useful on embedded systems because it strikes a good balance between quality and performance on a small 4-core CPU. @teleprint-me are you considering another way to make these changes?

@teleprint-me
Contributor Author

@walter-cavinaw It was accepted in #4847. Use convert-hf-to-gguf.py to convert the models; Phi-1, Phi-1.5, and Phi-2 all work.
