Consolidate support for Phi-1, Phi-1.5, and Phi-2 models #4552

Closed
wants to merge 16 commits

Conversation

teleprint-me
Contributor

Overview

This pull request introduces changes to llama.cpp for unified handling of the Phi model variants (Phi-1, Phi-1.5, Phi-2). The modifications simplify architecture handling, tensor mapping, and computational-graph construction for these models; a minimal sketch of the resulting mappings follows the list below.

Changes

  • Replaced LLM_ARCH_PHI2 with LLM_ARCH_PHI across the codebase to create a singular reference for all Phi models.
  • Updated the architecture names mapping to change from "phi2" to "phi", ensuring consistency in architecture identification.
  • Adjusted the tensor names mapping to reflect the consolidated Phi model architecture, enabling correct tensor processing regardless of the specific Phi variant.
  • Modified hyperparameter loading logic to include Phi models with 24 layers, categorizing them as MODEL_1B. This addition caters to the different layer counts found in Phi model variants.
  • Updated the tensor loading sections in the code to utilize the new unified architecture enumeration, ensuring proper tensor instantiation.
  • Renamed the build_phi2() function to build_phi(), aligning it with the unified architecture name and ensuring appropriate computational graph construction for all Phi models.
  • Adjusted graph construction calls within the code to use the updated build_phi() function, maintaining functionality and integration across different Phi model variants.
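To make the shape of these changes concrete, here is a small, self-contained sketch in the spirit of llama.cpp's architecture tables. The identifiers LLM_ARCH_PHI, "phi", and the 24-layer/1B pairing come from the list above; the surrounding structure is illustrative, not a verbatim diff:

```cpp
// Illustrative sketch only: a single PHI architecture id, one "phi" name string,
// and layer-count-based sizing, as described in the list above.
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_PHI /* replaces LLM_ARCH_PHI2 */ };

static const std::map<llm_arch, std::string> arch_names = {
    { LLM_ARCH_LLAMA, "llama" },
    { LLM_ARCH_PHI,   "phi"   },  // previously "phi2"
};

// Phi-1 and Phi-1.5 have 24 transformer blocks, so a 24-layer Phi model is
// labeled 1B here; other variants would add their own cases.
static std::string phi_model_type(uint32_t n_layer) {
    switch (n_layer) {
        case 24: return "1B";
        default: return "unknown";
    }
}

int main() {
    std::cout << arch_names.at(LLM_ARCH_PHI) << " -> " << phi_model_type(24) << "\n";  // prints "phi -> 1B"
}
```

In llama.cpp itself the name lives in the architecture-name table and the sizing in the hyperparameter-loading switch, but the mapping expresses the same idea.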

Impact

These changes make llama.cpp more flexible when working with the various Phi models. Consolidating them under a single architecture enumeration and updating the relevant sections of the code improves the maintainability and clarity of the codebase, and the unified approach makes future Phi-related extensions or modifications easier.

Testing

The changes have been tested with Phi-1 and Phi-1.5 models, successfully converting and running inference. The results indicate that the unified handling approach is effective and does not introduce any regressions in the functionality for these models.

15:36:08 | ~/Valerie/llama.cpp
(.venv) git:(phi-1 | Δ) λ python convert-hf-to-gguf.py stash/models/microsoft/phi-1_5
Loading model: phi-1_5
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
gguf: Adding 50000 merge(s).
gguf: Setting special token type bos to 50256
gguf: Setting special token type eos to 50256
gguf: Setting special token type unk to 50256
Exporting model to 'stash/models/microsoft/phi-1_5/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model.bin'
/mnt/valerie/llama.cpp/.venv/lib/python3.11/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
token_embd.weight, n_dims = 2, torch.float16 --> float16
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi
llama_model_loader: - kv   1:                               general.name str              = Phi
llama_model_loader: - kv   2:                         phi.context_length u32              = 2048
llama_model_loader: - kv   3:                       phi.embedding_length u32              = 2048
llama_model_loader: - kv   4:                    phi.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                            phi.block_count u32              = 24
llama_model_loader: - kv   6:                   phi.attention.head_count u32              = 32
llama_model_loader: - kv   7:                phi.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:           phi.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                   phi.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - type  f32:  147 tensors
llama_model_loader: - type  f16:   98 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 1.42 B
llm_load_print_meta: model size       = 2.64 GiB (16.01 BPW) 
llm_load_print_meta: general.name     = Phi
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.09 MiB
llm_load_tensors: mem required  = 2706.37 MiB
................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_build_graph: non-view tensors processed: 582/582
llama_new_context_with_model: compute buffer total size = 159.19 MiB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0


Question: What is the role of ribosomes in cellular biology?
Answer: Ribosomes are responsible for synthesizing proteins, which are essential for various cellular processes. They act as protein factories within cells and play a crucial role in maintaining the overall functionality of living organisms.
 [end of text]

llama_print_timings:        load time =     131.71 ms
llama_print_timings:      sample time =       6.35 ms /    41 runs   (    0.15 ms per token,  6453.64 tokens per second)
llama_print_timings: prompt eval time =     143.38 ms /    17 tokens (    8.43 ms per token,   118.56 tokens per second)
llama_print_timings:        eval time =    2548.84 ms /    40 runs   (   63.72 ms per token,    15.69 tokens per second)
llama_print_timings:       total time =    2711.62 ms
Log end

Looking forward to your feedback and suggestions on these changes.

- Created the `initialize_writer` function to set up GGUF writer with model metadata
- Included validation for file type and architecture
- Default hyperparameter values sourced from MixFormerSequentialConfig
- Function annotations and documentation added for clarity
- Prepared groundwork for MixFormer architecture integration
- Replaced LLM_ARCH_PHI2 with LLM_ARCH_PHI to unify the handling of different Phi model variants (Phi-1, Phi-1.5, Phi-2).
- Updated architecture names map to reflect the consolidated architecture name from "phi2" to "phi".
- Adjusted the tensor names mapping to use the new architecture name "phi" for consistent tensor loading and processing.
- Modified hyperparameter loading to include a case for 24 layers under LLM_ARCH_PHI, classifying it as MODEL_1B. This change accommodates different layer counts for various Phi model variants.
- Updated tensor loading sections to use the new architecture enum, ensuring proper tensor creation based on the model architecture.
- Renamed build_phi2() to build_phi() in the graph building section, aligning with the new architecture name and ensuring correct computational graph construction for Phi models.
- Adjusted graph construction calls to use the renamed build_phi() function, ensuring seamless integration and functionality for different Phi model variants.

These changes aim to streamline the handling of the various Phi models within `llama.cpp`, enhancing its ability to work with these models while maintaining code clarity and consistency. A minimal sketch of the renamed graph-builder dispatch follows.
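Below is a hypothetical, self-contained sketch of that dispatch. Only the LLM_ARCH_PHI and build_phi() identifiers come from the actual change; the context struct and placeholder bodies are invented for illustration:

```cpp
// Hypothetical sketch of routing all Phi variants through a single builder.
// Only the LLM_ARCH_PHI and build_phi() names reflect the real change;
// the rest is placeholder scaffolding.
struct ggml_cgraph;  // opaque stand-in for the real ggml graph type

enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_PHI };

struct llm_build_ctx {
    ggml_cgraph * build_llama() { return nullptr; }  // placeholder body
    ggml_cgraph * build_phi()   { return nullptr; }  // formerly build_phi2(); used by Phi-1, Phi-1.5, Phi-2
};

static ggml_cgraph * build_graph(llm_build_ctx & llm, llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_LLAMA: return llm.build_llama();
        case LLM_ARCH_PHI:   return llm.build_phi();
    }
    return nullptr;
}

int main() {
    llm_build_ctx llm;
    return build_graph(llm, LLM_ARCH_PHI) ? 1 : 0;  // sketch only: both builders return nullptr
}
```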
@teleprint-me
Contributor Author

teleprint-me commented Dec 24, 2023

@slaren @ebeyabraham Do you know why the warning llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ). is popping up? I haven't had time to look into it, but I was planning on digging into it either tomorrow or Monday.

@slaren
Collaborator

slaren commented Dec 24, 2023

I think it means that some of the tokens cannot be tokenized, but are not tagged as special. It's probably not a big deal, but @staviq may know more about this.

@ggerganov
Owner

ggerganov commented Dec 24, 2023

It seems this change would break existing models due to "phi2" -> "phi". Is it worth it?

@teleprint-me
Contributor Author

teleprint-me commented Dec 24, 2023

@ggerganov

Yes, it's true that this change would break existing conversions and quants. However, I'd like to highlight why I believe it's a valuable modification.

All three models - Phi-1, Phi-1.5, and Phi-2 - share the same architecture and differ primarily in their number of layers. That similarity is exactly what makes a unified implementation attractive.

Since the architecture is shared, any other models created in the future using the PhiForCausalLM architecture will be compatible as well. It's probably better to break things now rather than later down the line.

By accommodating Phi-1, Phi-1.5, and Phi-2, we establish a unified implementation that can adapt to future Microsoft and other PhiForCausalLM model releases. It isn't guaranteed future-proofing, but this forward-looking approach minimizes the effort required for future updates and helps keep llama.cpp versatile and adaptable.

Adding support for Phi-1, Phi-1.5, and Phi-2 enhances llama.cpp's usability, accessibility, and adaptability. It's a worthwhile enhancement that promotes diversity in hardware usage and fosters innovation in AI research.

This change not only benefits current users but also sets a foundation for accommodating potential future models with greater ease. It's a valuable addition to llama.cpp's capabilities.

llama.cpp Outdated
Comment on lines 2123 to 2128
```cpp
// backwards compatibility with pre-#4552
// TODO: remove after Mar 2024
if (arch_name == "phi2") {
    arch_name = "phi";
}
```
Owner

@teleprint-me Could you give this a test and see if it solves backwards compatibility with "phi2"?

I'm a bit worried that if we don't handle the old name we'll get lots of complaints and issues

Owner

Ah nvm, it's actually not going to work because there are other parameters like phi2.context_length
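
(For readers following along: a minimal illustration, not llama.cpp's actual loader code, of why the rename alone falls short. Per-architecture GGUF metadata keys embed the architecture string as a prefix, as the phi.* keys in the log above show, so a loader normalized to "phi" would not find the phi2.*-prefixed keys in older files.)

```cpp
// Hypothetical helper showing the arch-prefixed key scheme; files written
// before this PR contain keys like "phi2.context_length".
#include <iostream>
#include <string>

static std::string kv_key(const std::string & arch, const std::string & suffix) {
    return arch + "." + suffix;
}

int main() {
    std::string old_key = kv_key("phi2", "context_length");  // present in pre-#4552 GGUF files
    std::string new_key = kv_key("phi",  "context_length");  // what a "phi"-normalized loader asks for
    std::cout << old_key << " != " << new_key << "\n";       // lookup misses unless every key is remapped
}
```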

Contributor Author

teleprint-me Dec 27, 2023


@ggerganov Yeah, that's why I broke it. I wish I had caught this sooner, but I've been preoccupied with a bunch of other stuff and I'm multitasking the best I can.

I actually started working on the conversion scripts too and I still have a bunch of other stuff, but this seemed like it needed attention sooner than later.

If you have a better idea, I'm open to it. I went with this because it seemed like the most pragmatic approach. I prefer simplicity and if making a simple choice will break something, then that's what I'll go with.

ggerganov force-pushed the phi-1 branch 2 times, most recently from 44ac215 to b0583f7 on December 27, 2023 at 16:46
@ggerganov
Owner

I've given this some more thought and prefer not to merge the change. It's more likely to cause issues with broken support for existing models than anything, so I think it is not worth it. Thanks for the effort though

@ggerganov ggerganov closed this Jan 9, 2024
@teleprint-me
Contributor Author

teleprint-me commented Jan 9, 2024

@ggerganov I think there's another way to do this without breaking the existing models. I can adapt the code accordingly. That is, if you're open to it?

@teleprint-me teleprint-me deleted the phi-1 branch January 10, 2024 00:01
@walter-cavinaw

I think this is quite important. Phi 1.5 and Phi 2 are the best small models, and right now it's not simple to convert them to GGUF. Phi 1.5 is quite useful on embedded systems because it strikes a good balance between quality and performance on a small 4-core CPU. @teleprint-me are you considering another way to make these changes?

@teleprint-me
Contributor Author

@walter-cavinaw It was accepted in #4847. Use convert-hf-to-gguf.py to convert the models; Phi-1, Phi-1.5, and Phi-2 all work.
