This should be straightforward. The main issue I see coming up is with compile, similar to how we attempt to compile the reference and policy models in our single-device PPO recipe. Since the `SelfAttentionLayer` block is inlined and shared across the models, we're going to hit recompiles due to `param.requires_grad`, which differs between the trainable policy and the frozen reference. This might be acceptable in this case, since the recompiles won't be as severe as with PPO in its current state (#2066).
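To make the recompile concern concrete, here is a toy sketch in plain Python. It is not how `torch.compile` is implemented; `ToyParam`, `ToyCompileCache`, and `layer_params` are hypothetical names invented for illustration. It only assumes that compiled graphs are cached per shared code object and guarded on each parameter's `requires_grad` flag.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyParam:
    """Hypothetical stand-in for a model parameter."""
    name: str
    requires_grad: bool


@dataclass
class ToyCompileCache:
    """Toy per-code-object compile cache (an assumption, not torch.compile)."""
    entries: dict = field(default_factory=dict)
    compile_count: int = 0

    def run(self, code_id, params):
        # Guard key: the shared code object plus every requires_grad flag.
        guard = (code_id, tuple(p.requires_grad for p in params))
        if guard not in self.entries:
            # Guard miss: a new specialization has to be compiled.
            self.compile_count += 1
            self.entries[guard] = f"graph_{self.compile_count}"
        return self.entries[guard]


def layer_params(trainable):
    # Same layer structure; only the requires_grad flags differ.
    return [
        ToyParam("q_proj.weight", trainable),
        ToyParam("k_proj.weight", trainable),
    ]


cache = ToyCompileCache()
policy = layer_params(True)       # trainable policy model
reference = layer_params(False)   # frozen reference model

for _ in range(3):
    cache.run("SelfAttentionLayer.forward", policy)
    cache.run("SelfAttentionLayer.forward", reference)

print(cache.compile_count)
```

In this toy model the shared layer code is specialized once per distinct `requires_grad` pattern and then reused, which is the sense in which the extra recompiles could be tolerable here.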
We might want to offer some kind of customization around the choice of reference model. The only constraint I can think of here is that the reference and policy models must share a tokenizer; beyond that, users should be able to experiment freely.
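A minimal sketch of how the shared-tokenizer constraint could be validated up front. This assumes each tokenizer can expose its vocabulary as a plain token-to-id mapping; `tokenizer_mismatches` and its arguments are hypothetical and not part of any existing API. The check matters because both models score the same token ids, so the ids have to mean the same thing to each of them.

```python
def tokenizer_mismatches(policy_vocab, reference_vocab,
                         policy_special=None, reference_special=None):
    """Return a list of human-readable mismatches; an empty list means compatible.

    Hypothetical helper: the vocab arguments are plain dicts mapping token
    strings to integer ids, and the special-token arguments are optional dicts
    such as {"bos": 0, "eos": 1}.
    """
    problems = []
    if len(policy_vocab) != len(reference_vocab):
        problems.append(
            f"vocab size differs: {len(policy_vocab)} vs {len(reference_vocab)}"
        )
    # Every token must exist in both vocabularies with the same id.
    for token in sorted(policy_vocab.keys() | reference_vocab.keys()):
        policy_id = policy_vocab.get(token)
        reference_id = reference_vocab.get(token)
        if policy_id != reference_id:
            problems.append(f"token {token!r}: id {policy_id} vs {reference_id}")
    if (policy_special or {}) != (reference_special or {}):
        problems.append("special tokens differ")
    return problems


policy_vocab = {"<s>": 0, "hello": 1, "world": 2}
same_vocab = dict(policy_vocab)
swapped_vocab = {"<s>": 0, "hello": 2, "world": 1}

ok = tokenizer_mismatches(policy_vocab, same_vocab)
bad = tokenizer_mismatches(policy_vocab, swapped_vocab)
```

Running a check like this at recipe setup would let a mismatched reference model fail fast with a clear message rather than silently producing meaningless log-probabilities.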