Why is ref_model in DPO training not fixed to the initial post-SFT model, but instead initialized from the policy model after each batch's parameter update? #1801