Replies: 14 comments 1 reply
-
Is this LoRA training? Maybe the LoRA adapter was not loaded. Also check whether the model ended up loaded on the CPU.
-
@Nihaokai No, it really is running on the GPU. Maybe the machine is simply still too underpowered.
-
Have the LoRA weights been merged?
-
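@Nihaokai I don't know how to merge them. I only selected the checkpoint from the current training run and exported directly; judging by the size of the export, I assumed the weights had been merged automatically. Because loading the model takes so long, I went straight to export, so I'm not sure whether I did something wrong. I have also noticed that the trained model has lost some of its abilities, and I don't know whether this is the cause.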
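For readers wondering what "merging" LoRA weights means, here is a toy sketch in plain Python. It is not LLaMA-Factory's export code; the matrices, `alpha`, `r`, and every function name are made up for illustration. It only shows the usual LoRA identity: folding the scaled low-rank update `(alpha / r) * B @ A` into the base weight gives the same output as running the base weight and the adapter side by side.

```python
# Toy illustration of LoRA merging with plain Python lists (no ML libraries).
# All names and numbers are hypothetical and chosen only for this example.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(w, x):
    """Compute W @ x for a vector x given as a flat list."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Base weight (2x2) and a rank-1 adapter: B is 2x1, A is 1x2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]
B = [[2.0], [4.0]]
alpha, r = 2, 1
x = [1.0, 3.0]

merged = merge_lora(W, A, B, alpha, r)

# Unmerged path: base output plus the scaled adapter output B @ (A @ x).
base_out = matvec(W, x)
adapter_out = matvec(B, matvec(A, x))
unmerged_out = [b + (alpha / r) * a for b, a in zip(base_out, adapter_out)]

# Merged path: a single matrix-vector product with the folded weights.
merged_out = matvec(merged, x)

print(merged_out == unmerged_out)
```

If an exported model behaves exactly like the base model, one thing worth checking is whether the adapter was actually applied before export, since a merge with no adapter loaded leaves the base weights unchanged.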
-
The training set is too small, which makes overfitting likely.
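A quick way to see how small and repetitive a dataset is before training is to count the examples and the distinct prompts. The sketch below uses only the standard library and the two records posted in this thread; `summarize_dataset` and the `MIN_EXAMPLES` threshold are hypothetical helpers for illustration, not part of LLaMA-Factory, and the threshold value is arbitrary.

```python
import json
from collections import Counter

# Arbitrary illustrative threshold, not a recommendation from any library.
MIN_EXAMPLES = 50

def summarize_dataset(records):
    """Summarize a list of Alpaca-style dicts (instruction / input / output)."""
    prompts = Counter((r["instruction"], r.get("input", "")) for r in records)
    return {
        "num_examples": len(records),
        "num_distinct_prompts": len(prompts),
        "max_repeats_of_one_prompt": max(prompts.values()) if prompts else 0,
    }

# The two records shown in the original post.
raw = """
[
  {"instruction": "你是谁?", "input": "",
   "output": "您好,我是H-lin重新训练过的智慧小助手,有什么可以帮到您吗?"},
  {"instruction": "你是谁?", "input": "",
   "output": "您好,我是H-lin重新训练过的智慧小助手,很高兴认识您,有什么可以帮到您吗?"}
]
"""

records = json.loads(raw)
summary = summarize_dataset(records)
too_small = summary["num_examples"] < MIN_EXAMPLES

print(summary, "too small:", too_small)
```

Here both records share a single prompt, so the model only ever sees one question phrased one way.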
-
@Nihaokai Thanks. Until now I have only used the WebUI visual interface; I will try the scripts instead. Thank you very much for the guidance.
-
Could anyone tell me why I get the following error? (crying)
-
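@linkeusen The error indicates that CUDA is either not installed or too old. I recommend using a conda virtual environment and following the PyTorch installation instructions on the official website, which will install CUDA inside the virtual environment as well.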
-
But nvidia-smi shows CUDA 12.1.
-
@linkeusen nvcc -V shows the CUDA toolkit version that is actually installed; nvidia-smi shows the highest CUDA version the driver supports.
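To compare the two numbers side by side, something like the following sketch can help. It uses only the standard library, runs `nvcc -V` and `nvidia-smi` if they are on the PATH, and extracts the version strings with regular expressions. The helper names are hypothetical, and the patterns assume the usual output wording ("release X.Y" for nvcc, "CUDA Version: X.Y" for nvidia-smi), which may differ between versions.

```python
import re
import shutil
import subprocess

def parse_nvcc_release(text):
    """Extract the toolkit version from `nvcc -V` output, e.g. 'release 12.1'."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None

def parse_nvidia_smi_cuda(text):
    """Extract the driver's maximum supported CUDA version from `nvidia-smi` output."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", text)
    return match.group(1) if match else None

def run_or_none(cmd):
    """Run a command and return its stdout, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                check=True, timeout=30)
        return result.stdout
    except (subprocess.SubprocessError, OSError):
        return None

def report():
    """Collect both versions; a value of None means the tool was not usable."""
    nvcc_out = run_or_none(["nvcc", "-V"])
    smi_out = run_or_none(["nvidia-smi"])
    return {
        "toolkit (nvcc -V)": parse_nvcc_release(nvcc_out) if nvcc_out else None,
        "driver max (nvidia-smi)": parse_nvidia_smi_cuda(smi_out) if smi_out else None,
    }

if __name__ == "__main__":
    print(report())
```

A `None` for the toolkit entry would mean nvcc is missing from the active environment even though the driver is fine. If PyTorch is installed, `torch.version.cuda` reports the CUDA version that PyTorch build was compiled against, which is a third number worth comparing.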
-
|
-
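@linkeusen Yes. As for the cause, it may be a problem with the CUDA-related packages. I would still recommend installing everything in a conda environment.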
-
I did install it in a conda environment. Thank you for your help. I suspect the awq version may be wrong, so I will try that next.
-
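@linkeusen "After training, the test results are the same as before training." I have run into this problem too. What did your troubleshooting turn up? I would appreciate it if you could share.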
-
Reminder
System Info
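I7-27000, 32 GB RAM, RTX 3060 GPU (12 GB dedicated + 15 GB shared memory), Windows 11
Current problems:
(1) After training, the test results are the same as before training.
(2) Loading the model through LLaMA Board is very slow (with 10 epochs the model cannot be loaded at all and the program simply crashes); when the checkpoint path is empty, loading is very fast.
1. Model being trained: llama-3-chinese-8b-instruct-v3
2. The dataset is a JSON file I created myself, with the following content: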
[
{
"instruction": "你是谁?",
"input": "",
"output": "您好,我是H-lin重新训练过的智慧小助手,有什么可以帮到您吗?"
},{
"instruction": "你是谁?",
"input": "",
"output": "您好,我是H-lin重新训练过的智慧小助手,很高兴认识您,有什么可以帮到您吗?"
}
]
3. Learning rates tried: 5e-5 and 1e-4; epochs: 3 or 10; compute type: bf16
4. Training finishes very quickly:
{'train_runtime': 113.0455, 'train_samples_per_second': 0.265, 'train_steps_per_second': 0.088, 'train_loss': 0.6034916400909424, 'epoch': 10.0, 'num_input_tokens_seen': 2984}
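5. Loading the model is very slow and sometimes fails entirely. I start the UI with llamafactory-cli webui; when loading fails, the service simply exits without reporting an error.
Could someone more experienced tell me where my problem lies? I just want to get one training run to succeed so that the model follows my instructions. Please give me some guidance when you have time.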
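As a rough sanity check, the training log in item 4 can be unpacked with a little arithmetic. The sketch below multiplies the reported rates by the runtime to estimate the total number of optimizer steps and samples processed. The rates in the log are rounded, so the results are approximations, and the variable names are made up for this example.

```python
# Selected values copied from the training log in item 4.
log = {
    "train_runtime": 113.0455,
    "train_samples_per_second": 0.265,
    "train_steps_per_second": 0.088,
    "epoch": 10.0,
    "num_input_tokens_seen": 2984,
}

# Rate multiplied by runtime gives an estimated total (the log rounds the rates).
total_steps = round(log["train_runtime"] * log["train_steps_per_second"])
total_samples = round(log["train_runtime"] * log["train_samples_per_second"])

# Spread the totals over the reported number of epochs.
samples_per_epoch = total_samples / log["epoch"]
steps_per_epoch = total_steps / log["epoch"]
tokens_per_sample = log["num_input_tokens_seen"] / total_samples

print(f"optimizer steps in total : ~{total_steps}")
print(f"samples per epoch        : ~{samples_per_epoch:.0f}")
print(f"steps per epoch          : ~{steps_per_epoch:.0f}")
print(f"tokens per sample        : ~{tokens_per_sample:.0f}")
```

Under these assumptions the run amounts to roughly 10 weight updates over about 3 examples per epoch. That may help explain both why training finishes so quickly and why the model's answers change so little, although it does not rule out other causes such as the adapter not being loaded at inference time.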
Reproduction
I am a beginner; everything was done through the configuration UI.
Expected behavior
No response
Others
No response