Method description
Soft Actor-Critic (SAC) is a popular reinforcement learning algorithm that meets or exceeds the performance of PPO on a variety of tasks. Oddly, it is not used for LLM post-training, and I have not been able to find a satisfactory explanation as to why. I intend to do a research project investigating how SAC performs for RL-based LLM post-training. The hope is that SAC's entropy maximization results in better exploration, more varied responses, and perhaps greater robustness to jailbreaking, since the model would be trained on more varied experience. If this hypothesis holds, it would allow the community to create more interesting and robust LLMs that maintain alignment.
The SAC algorithm is given in this paper.
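For quick reference, the core of the method is the maximum-entropy objective below (α is the entropy temperature, H the policy entropy). This is the standard form from the SAC literature, restated here only to make the exploration argument concrete:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```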
I plan to first implement and evaluate the algorithm as written, comparing it against PPO on training time, task performance, robustness to jailbreaks, and output diversity.
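On the output-diversity axis, a minimal sketch of the kind of metric I have in mind is a distinct-n score over sampled responses. The function below is illustrative only; the name, whitespace tokenization, and choice of n are my own assumptions, not part of any existing benchmark.

```python
from collections import Counter

def distinct_n(responses, n=2):
    """Fraction of unique n-grams across a batch of sampled responses.

    A rough proxy for output diversity: higher means the model repeats
    itself less across samples for the same prompts.
    """
    total, unique = 0, Counter()
    for text in responses:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        unique.update(ngrams)
        total += len(ngrams)
    return len(unique) / total if total else 0.0

# Example: score outputs from two differently tuned models on the same prompts.
outputs_a = ["the cat sat on the mat", "the cat sat on the rug"]
outputs_b = ["a tabby dozed on the mat", "the kitten curled up by the fire"]
print(distinct_n(outputs_a), distinct_n(outputs_b))
```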
Then, I will try to improve the algorithm. Notably, SAC (as written) requires 4 additional models, which would be prohibitively expensive in VRAM for most users, since each would carry a large set of parameters. Switching from clipped double Q-learning to Double DQN, or using regularization techniques like CQL, may maintain performance while cutting this down to 1 or 2 additional models (see the sketch below).
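To make the model-count point concrete, here is a rough sketch of the clipped double-Q soft target next to a single-critic variant. This is illustrative code under my own assumptions (a discrete, token-level formulation with per-token Q-values over the vocabulary; all names are mine, not from the paper or any library). The clipped version needs two critics plus two target critics alongside the policy, while the single-critic variant keeps only one of each and would have to lean on something like a CQL-style conservative penalty to control overestimation.

```python
import torch

def clipped_double_q_target(reward, done, q1_next, q2_next, next_log_probs, alpha, gamma):
    """Standard SAC target (discrete actions): two critics + two target critics.

    q1_next, q2_next: [batch, vocab] target-critic values at the next state.
    next_log_probs:   [batch, vocab] log pi(a' | s') from the current policy.
    """
    next_probs = next_log_probs.exp()
    soft_v = (next_probs * (torch.min(q1_next, q2_next) - alpha * next_log_probs)).sum(dim=-1)
    return reward + gamma * (1.0 - done) * soft_v

def single_critic_target(reward, done, q_next, next_log_probs, alpha, gamma):
    """Variant with one critic + one target critic (2 extra models instead of 4).

    Drops the min over two critics, so overestimation would need to be handled
    another way, e.g. a CQL-style conservative penalty on the critic loss.
    """
    next_probs = next_log_probs.exp()
    soft_v = (next_probs * (q_next - alpha * next_log_probs)).sum(dim=-1)
    return reward + gamma * (1.0 - done) * soft_v
```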
If I find that SAC or my variations offer a useful improvement along any of the dimensions described, I'll offer a pull request.
Open source status
The method implementation is available
The model weights are available
The training datasets are available
Provide useful links for the implementation
No response