Implement efficient packing without cross-contamination attention #4224
Conversation
Should we consider implementing this with varlen_flash_atten?
@@ -33,6 +33,9 @@ def run_sft(
    dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)

    if data_args.efficient_packing:
        configure_packing(model.config, model_args)
could we do configure_packing in llamafactory.model.patcher?
Sure, I just edited it
src/llamafactory/extras/constants.py
Outdated
@@ -66,6 +66,21 @@

SUPPORTED_CLASS_FOR_S2ATTN = {"llama"}

SUPPORTED_CLASS_FOR_MULTIPACK = [
is it "efficient_packing" rather than "multipack"?
yes, I just fixed it.
Hi @AlongWY, the models in transformers already use flash_attn_varlen_func by default when an attention_mask is passed. I just made a slight change to the attention_mask when packing sequences and returned the indices, cu_seqlens, and max_seqlen_in_batch corresponding to the modified attention_mask.
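For illustration, a minimal sketch of the kind of helper this describes (the function name and exact logic are assumptions, not the PR's code): it reads a packed attention mask whose values label each sequence within a pack (1, 2, 3, ...) with 0 for padding, and derives the indices, cu_seqlens, and max_seqlen_in_batch that flash_attn_varlen_func consumes.

```python
import torch
import torch.nn.functional as F

def get_unpad_data(attention_mask: torch.Tensor):
    """attention_mask: (batch, seq_len); 0 = padding, 1, 2, ... = sequence id within each pack."""
    seqlens = []
    for row in attention_mask:
        counts = torch.bincount(row[row > 0])       # tokens per packed sequence in this row
        seqlens.extend(counts[counts > 0].tolist())
    seqlens_in_batch = torch.tensor(seqlens, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = int(seqlens_in_batch.max())
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
    return indices, cu_seqlens, max_seqlen_in_batch

mask = torch.tensor([[1, 1, 2, 2, 2, 0],
                     [1, 1, 1, 2, 2, 0]])
_, cu_seqlens, max_len = get_unpad_data(mask)
# cu_seqlens -> tensor([0, 2, 5, 8, 10], dtype=torch.int32), max_len -> 3
```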
Hi @hiyouga
hi @chuan298
)
transformers.modeling_attn_mask_utils._prepare_4d_causal_attention_mask = (  # pylint: disable=protected-access
    patched_prepare_4d_causal_attention_mask
)
_prepare_4d_causal_attention_mask has never been used in Llama's forward pass, so this patch will not affect training
We need to construct the 4d attention mask during get_dataset
Oh, in an older version of transformers, I saw code using _prepare_4d_causal_attention_mask in eager mode. I just checked again, and it has since been removed and modified. For sdpa and eager mode, we need to convert the attention mask to something like the following:
For example, if batch = 3 and seqlen = 6, the old attention_mask is:
[
[1, 1, 2, 2, 2, 0],
[1, 1, 1, 2, 2, 0],
[1, 1, 1, 1, 1, 1]
]
Convert to new 4D-attention mask:
[
[
[
[0, -inf, -inf, -inf, -inf, -inf],
[0, 0, -inf, -inf, -inf, -inf],
[-inf, -inf, 0, -inf, -inf, -inf],
[-inf, -inf, 0, 0, -inf, -inf],
[-inf, -inf, 0, 0, 0, -inf],
[-inf, -inf, -inf, -inf, -inf, 0]
]
],
[
[
[0, -inf, -inf, -inf, -inf, -inf],
[0, 0, -inf, -inf, -inf, -inf],
[0, 0, 0, -inf, -inf, -inf],
[-inf, -inf, -inf, 0, -inf, -inf],
[-inf, -inf, -inf, 0, 0, -inf],
[-inf, -inf, -inf, -inf, -inf, 0]
]
],
[
[
[0, -inf, -inf, -inf, -inf, -inf],
[0, 0, -inf, -inf, -inf, -inf],
[0, 0, 0, -inf, -inf, -inf],
[0, 0, 0, 0, -inf, -inf],
[0, 0, 0, 0, 0, -inf],
[0, 0, 0, 0, 0, 0]
]
]
]
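A minimal sketch of this conversion (the function name and details are illustrative, not the PR's final code): it builds the block-diagonal causal mask shown above, using the dtype's minimum value in place of -inf and letting padding positions attend to themselves so the softmax stays finite.

```python
import torch

def prepare_4d_packed_mask(attention_mask_2d: torch.Tensor, dtype: torch.dtype = torch.float32):
    """attention_mask_2d: (batch, seq_len); 0 = padding, 1, 2, ... = sequence id within each pack."""
    bsz, seq_len = attention_mask_2d.size()
    # tokens may attend to each other only if they belong to the same (non-zero) sequence id
    same_seq = attention_mask_2d[:, :, None] == attention_mask_2d[:, None, :]
    non_pad = (attention_mask_2d[:, :, None] != 0) & (attention_mask_2d[:, None, :] != 0)
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=attention_mask_2d.device))
    allowed = same_seq & non_pad & causal
    # let every position (including padding) attend to itself so softmax stays finite
    allowed |= torch.eye(seq_len, dtype=torch.bool, device=attention_mask_2d.device)
    mask_4d = torch.full((bsz, 1, seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    mask_4d.masked_fill_(allowed[:, None, :, :], 0.0)
    return mask_4d

mask = torch.tensor([[1, 1, 2, 2, 2, 0],
                     [1, 1, 1, 2, 2, 0],
                     [1, 1, 1, 1, 1, 1]])
print(prepare_4d_packed_mask(mask)[0, 0])  # matches the first block above, with the dtype minimum in place of -inf
```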
I will fix that part right now
oops, I am fixing it
We are designing a new data collator for the SFT Trainer that converts attention masks with indices into 4D attention masks with the correct dtype
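A hedged sketch of what such a collator might look like (class and field names are illustrative, not the PR's actual code): it pads as usual and then replaces the packed 2D mask with the 4D block-diagonal mask, cast to the model's compute dtype.

```python
from dataclasses import dataclass

import torch
from transformers import DataCollatorForSeq2Seq

@dataclass
class DataCollatorWith4DAttentionMask(DataCollatorForSeq2Seq):
    compute_dtype: torch.dtype = torch.float32  # dtype expected by the attention implementation

    def __call__(self, features, return_tensors=None):
        batch = super().__call__(features, return_tensors)
        # attention_mask here holds sequence indices (1, 2, ...) with 0 for padding;
        # prepare_4d_packed_mask is the helper sketched after the mask example above
        batch["attention_mask"] = prepare_4d_packed_mask(batch["attention_mask"], self.compute_dtype)
        return batch
```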
Thank you, I have learned a lot by looking at the way you design your system
glad to hear any valuable advice from you about the implementation
This PR is now ready to be merged. Patches for MoE models are left for future work. We will update the patch for eager/sdpa attention soon.
Thanks for doing this! One quick question: are position ids re-initialized for packed examples?
Almost all models use RoPE instead of absolute positional embeddings, so we don't need to reinitialize position ids
Hi, after using this method on Qwen2-7B-Instruct, the results are far worse than training without packing. Do you have any suggestions for resolving this?
Thank you for reporting the issue. I will test it again and get back to you as soon as possible.
In theory, neat_packing should be at least better than packing, since packing suffers from cross-contamination. Are you sure your experimental results were not mixed up?
After multiple experiments, we can confirm that fine-tuning on the COIG-CQIA dataset
@hiyouga @chuan298 Although I don't have a machine to debug with llamafactory yet, I ran the verification script from https://github.com/MeetKai/functionary/blob/main/functionary/train/packing/monkey_patch_packing.py and it fails on starcoderv2. Could you take a look and check whether the current implementation of this method is incorrect?
@bao-xiaoyi This PR is based on axolotl and functionary; see the first line of the PR conversation and the comments in the files.
@bao-xiaoyi I looked at the starcoderv2 source code, and it no longer has
In my humble opinion, flash_attn seems to have been unified into a single file, so the current implementation does not work with recent versions of transformers, as shown in the code.
But for eager and sdpa attention, the results are still not good.
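For reference, a sketch of what adapting the patch to newer transformers might look like (the module and function names are taken from recent transformers releases, not from this PR, and should be verified against the installed version): newer releases route flash attention through a shared modeling_flash_attention_utils module, so the override would target its _get_unpad_data instead of patching each model file.

```python
import transformers.modeling_flash_attention_utils as flash_utils

def apply_packing_patch():
    # replace the library's unpadding helper with the packed-mask version
    # (get_unpad_data is the helper sketched earlier in this thread)
    flash_utils._get_unpad_data = get_unpad_data
```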
starcoderv2 does indeed have this problem, but I tested deepseek-v2-lite and it was fine; the average loss difference was very small.
Hi @YeQiuO @bao-xiaoyi I'm sorry for the late reply.
I found that the results of neat_packing were relatively higher compared to no packing. I then reviewed the latest version of transformers and quickly implemented changes to resemble DataCollatorWithPadding and the
The results achieved with this were equivalent to no packing, so I believe there have been changes in the latest version of transformers, and I will update the code as soon as possible.
@chuan298 I think we should adopt the same token batch size and number of training steps for a fair comparison? I find the model trained with neat_packing is underfitted.
https://research.ibm.com/blog/hugging-face-training-flash-attention
What does this PR do?
Update 15/6/2024: Add packing support for eager and sdpa attention
Fixes #2289
Implement efficient packing without cross-contamination attention
Taking inspiration from repositories such as axolotl and functionary, I implemented sequence packing more effectively, enabling the model to learn samples more efficiently without attending to other samples within the same pack. For now, this implementation only supports SFT with flash_attention_2.
Example training config:
Before submitting