Full-finetune DPO single device recipe #2082

SalmanMohammadi · 2024-11-27T14:45:13Z

This should be straightforward. The main issue I see coming up is with compile, similar to what happens when we attempt to compile the reference and policy models in our single device PPO recipe. Since the SelfAttentionLayer block is inlined and shared across the models, we're going to hit recompiles due to param.requires_grad differing between the trainable policy and the frozen reference. This might be acceptable in this case, since the recompiles won't be as severe as with PPO in its current state (#2066).
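To make the concern concrete, here is a minimal sketch of the policy/reference setup, not the recipe's actual code. `TinyBlock` and `build_policy_and_reference` are hypothetical names standing in for the shared layer and the model setup. It only shows that two instances of the same module class end up with different `requires_grad` flags, which is the difference the recompiles are attributed to above; it does not call `torch.compile` itself.

```python
from typing import Tuple

import torch
from torch import nn


class TinyBlock(nn.Module):
    """Hypothetical stand-in for a layer class shared by both models."""

    def __init__(self, dim: int = 8) -> None:
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))


def build_policy_and_reference(dim: int = 8) -> Tuple[nn.Module, nn.Module]:
    """Return a trainable policy and a frozen copy to use as the reference."""
    policy = TinyBlock(dim)
    reference = TinyBlock(dim)
    # Assumption: the reference starts from the same weights as the policy.
    reference.load_state_dict(policy.state_dict())
    reference.eval()
    for param in reference.parameters():
        param.requires_grad_(False)  # the reference is never updated
    return policy, reference


policy, reference = build_policy_and_reference()

# Same module class, different requires_grad on the parameters.
policy_flags = {p.requires_grad for p in policy.parameters()}
reference_flags = {p.requires_grad for p in reference.parameters()}
```

Both models run the same `forward` code, so any compiled artifact specialised on the parameters of one is a candidate for recompilation when it sees the other.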

We might want to offer some kind of customization around the choice of reference policy model. The only constraint I can think of here is that the reference and policy models must share a tokenizer; otherwise users should be able to freely experiment here.
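One way the tokenizer constraint could be checked up front is sketched below. This is an illustration under assumptions, not an existing torchtune utility: `vocab_mismatches` is a hypothetical helper, and each tokenizer is assumed to be reducible to a plain token-to-id mapping. The check matters because both models score the same token id sequences, so the ids must mean the same thing to each.

```python
from typing import Dict, List


def vocab_mismatches(
    policy_vocab: Dict[str, int], reference_vocab: Dict[str, int]
) -> List[str]:
    """List tokens whose ids differ, or which exist in only one vocabulary."""
    problems = []
    for token in sorted(set(policy_vocab) | set(reference_vocab)):
        policy_id = policy_vocab.get(token)  # None if the token is missing
        reference_id = reference_vocab.get(token)
        if policy_id != reference_id:
            problems.append(
                f"{token!r}: policy={policy_id}, reference={reference_id}"
            )
    return problems


# Toy vocabularies, made up for illustration only.
shared = {"<bos>": 0, "<eos>": 1, "hello": 2}
shifted = {"<bos>": 0, "<eos>": 1, "hello": 3}

same_result = vocab_mismatches(shared, dict(shared))
diff_result = vocab_mismatches(shared, shifted)
```

An empty list means the two vocabularies agree; a recipe could raise an error on a non-empty list before any training starts.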


SalmanMohammadi mentioned this issue Nov 27, 2024

RLHF Tracker #2081

Open

SalmanMohammadi self-assigned this Nov 27, 2024

SalmanMohammadi added the enhancement New feature or request label Nov 27, 2024
