
Commit

update
hjh0119 committed Jul 8, 2024
1 parent 153d18e commit e5a9e5e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion docs/source/LLM/命令行参数.md
@@ -235,7 +235,7 @@ RLHF parameters inherit the sft parameters; in addition, the following parameters are added:
- `--loss_type`: the loss type, default `'sigmoid'`.
- `--sft_beta`: whether to add the sft loss in DPO, default 0.1, supporting the interval $[0, 1)$; the final loss is `(1-sft_beta)*KL_loss + sft_beta * sft_loss` (see the sketch after this list).
- `--simpo_gamma`: the reward margin term in the SimPO algorithm; the paper recommends a value of 0.5-1.5, default 1.0.
- `--cpo_alpha`: the nll loss mixed into the CPO loss, default 1.0.
- `--cpo_alpha`: the coefficient of the nll loss in the CPO loss, default 1.0; a mixed nll loss is used in SimPO to improve training stability.
- `--desirable_weight`: the loss weight $\lambda_D$ for desirable responses in the KTO algorithm, default 1.0.
- `--undesirable_weight`: the loss weight $\lambda_U$ for undesirable responses in the KTO paper, default 1.0. Let $n_D$ and $n_U$ denote the number of desirable and undesirable examples in the dataset; the paper recommends keeping $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$ (see the sketch after this list).
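
The following is a minimal, self-contained Python sketch of the two formulas above: the `sft_beta` loss mix used in DPO and the KTO weight-ratio recommendation. The helper names (`mix_dpo_loss`, `kto_ratio_ok`) are hypothetical illustrations, not part of the library's API.

```python
# Illustrative sketch of the sft_beta loss mix and the KTO weight-ratio check.
# The helper names below are hypothetical and do not exist in the library itself.

def mix_dpo_loss(kl_loss: float, sft_loss: float, sft_beta: float = 0.1) -> float:
    """Final loss = (1 - sft_beta) * KL_loss + sft_beta * sft_loss."""
    assert 0.0 <= sft_beta < 1.0, "sft_beta must lie in [0, 1)"
    return (1.0 - sft_beta) * kl_loss + sft_beta * sft_loss


def kto_ratio_ok(desirable_weight: float, undesirable_weight: float,
                 n_desirable: int, n_undesirable: int) -> bool:
    """KTO recommendation: lambda_D * n_D / (lambda_U * n_U) should lie in [1, 4/3]."""
    ratio = (desirable_weight * n_desirable) / (undesirable_weight * n_undesirable)
    return 1.0 <= ratio <= 4.0 / 3.0


# Example: with the default weights of 1.0, 1200 desirable vs. 1000 undesirable
# examples give a ratio of 1.2, which falls inside the recommended [1, 4/3] range.
print(mix_dpo_loss(kl_loss=0.8, sft_loss=1.5))    # 0.87
print(kto_ratio_ok(1.0, 1.0, 1200, 1000))         # True
```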

