[ChatGlm] Adds support for the ChatGLM model #27883
Conversation
Co-authored-by: Xunkai <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hi~ o( ̄▽ ̄)ブ Could you tell me what your plans are? May I know when this PR can be merged? I have PRs that depend on this feature.
@younesbelkada and I just came back from holidays; we are hoping for the end of the week, maybe a bit later!
"ChatGlmModel is using ChatGlmSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, " | ||
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.' |
should be split into two lines
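For illustration, one way to split the long literal is to rely on Python's implicit string concatenation so each line stays within the length limit; a standalone sketch using the stdlib logger rather than the library's own logging utilities:

```python
import logging

logger = logging.getLogger(__name__)

# Keep each string literal short; implicit concatenation joins them into one message.
logger.warning(
    "ChatGlmModel is using ChatGlmSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` "
    "does not support `output_attentions=True`. Falling back to the manual attention implementation, "
    "but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. "
    'This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
```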
cc @ArthurZucker could you give a first pass? 🙏
A few nits here and there, ready otherwise! cc @younesbelkada
@@ -0,0 +1,55 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Headers need an update
Tips:

- TODO conversion tips if needed
to remove?
@@ -0,0 +1,60 @@
# Copyright 2023 EleutherAI and The HuggingFace Inc. team. All rights reserved.
same here
rot_shape = cos.shape

# In the original ChatGLM repository the query and key states are manually
# reshaped into a shape `batch_size, num_q_heads, seq_len, head_dim // 2, 2` changing the order
don't you think that we can do this when converting the checkpoints?
+1, we can move this complexity into the conversion script, and simplify the implementation here.
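To make the suggestion concrete, here is a minimal sketch of how the conversion script could de-interleave the rotary dimensions once, so the modeling code can use the standard half-rotation (`rotate_half`) layout. This is a hypothetical helper with illustrative names; a real script would also need to account for ChatGLM applying RoPE only to part of each head.

```python
import torch


def deinterleave_rotary_rows(w: torch.Tensor, num_heads: int, head_dim: int) -> torch.Tensor:
    """Reorder the output rows of a query/key projection weight so that the
    interleaved rotary layout (pairs (x0, x1), (x2, x3), ...) becomes the
    half-split layout (all even indices, then all odd indices) expected by a
    standard `rotate_half`-style RoPE implementation.

    `w` has shape (num_heads * head_dim, hidden_size). Applying the same
    permutation to both the query and key projections leaves q·k unchanged.
    """
    hidden_size = w.shape[-1]
    w = w.view(num_heads, head_dim // 2, 2, hidden_size)  # group rows into rotary pairs
    w = w.transpose(1, 2)                                  # (num_heads, 2, head_dim // 2, hidden_size)
    return w.reshape(num_heads * head_dim, hidden_size)    # even indices first, then odd, per head
```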
if not self.multi_query_attention:
    qkv_size = 3 * self.hidden_size
    self.num_key_value_heads = 1
else:
    self.num_key_value_heads = config.multi_query_group_num
    qkv_size = self.projection_size + 2 * self.hidden_size_per_attention_head * self.num_key_value_heads
we should just have a general implementation! supporting GQA means you can have MQA, MHA and GQA
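For context, the GQA-generic pattern used by other models in the library boils down to a single `num_key_value_heads` value in the config plus a `repeat_kv`-style helper, so MHA, GQA, and MQA all share one code path. Roughly:

```python
import torch


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand key/value states from (batch, num_kv_heads, seq_len, head_dim) to
    (batch, num_kv_heads * n_rep, seq_len, head_dim). With
    n_rep = num_attention_heads // num_key_value_heads this covers
    MHA (n_rep == 1), GQA, and MQA (num_key_value_heads == 1) uniformly.
    """
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, seq_len, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)
```

With `num_attention_heads` and `num_key_value_heads` both exposed on the config, the projection sizes fall out without special-casing any of the three attention variants.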
An alternative in my mind: we could probably split the QKV matrix in the conversion script as well, so that the inference code here does not need to handle splitting the fused QKV matrix. Separate Q, K, V projections are also friendlier to some specific platforms.
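A minimal sketch of that conversion-time split, assuming the fused weight is laid out as [q | k | v] along the output dimension (hypothetical helper; a real ChatGLM checkpoint may also carry a fused bias that needs the same treatment):

```python
from typing import Tuple

import torch


def split_fused_qkv(
    qkv_weight: torch.Tensor,
    num_attention_heads: int,
    num_key_value_heads: int,
    head_dim: int,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Split a fused QKV projection weight of shape
    (q_size + 2 * kv_size, hidden_size) into separate q/k/v weights,
    assuming the rows are ordered query first, then key, then value.
    """
    q_size = num_attention_heads * head_dim
    kv_size = num_key_value_heads * head_dim
    q, k, v = torch.split(qkv_weight, [q_size, kv_size, kv_size], dim=0)
    return q, k, v
```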
agreed!
I might have time to work on the addition next week, with a new cool feature that will help ease the integration! 🤗
Hey Arthur! How's everything going?
GLM has already released the GLM-4 model (slightly different from ChatGLM-3 in the tokenizer, but otherwise basically the same). Could we resume the integration effort? Thanks!
else:
    raise ValueError(f"Unknown RoPE scaling type {scaling_type}")

def _split_heads(self, fused_qkv: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
we don't need the if/else here
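If the remark refers to the MQA/MHA branching around the head split, a branch-free version driven only by the head counts could look roughly like this (standalone sketch with illustrative names, not the PR's actual method):

```python
from typing import Tuple

import torch


def split_heads(
    fused_qkv: torch.Tensor,
    num_heads: int,
    num_key_value_heads: int,
    head_dim: int,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Split (batch, seq_len, (num_heads + 2 * num_key_value_heads) * head_dim)
    into query/key/value of shape (batch, n_heads, seq_len, head_dim) each,
    with no special-casing of MHA vs. MQA vs. GQA.
    """
    batch, seq_len, _ = fused_qkv.shape
    q, k, v = fused_qkv.split(
        [num_heads * head_dim, num_key_value_heads * head_dim, num_key_value_heads * head_dim], dim=-1
    )
    q = q.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
    k = k.view(batch, seq_len, num_key_value_heads, head_dim).transpose(1, 2)
    v = v.view(batch, seq_len, num_key_value_heads, head_dim).transpose(1, 2)
    return q, k, v
```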
I think #33823 should fix this!
What does this PR do?
Drafts the support of ChatGLM.