Method description
Soft Actor-Critic (SAC) is a popular reinforcement learning algorithm that meets or exceeds the performance of PPO on a variety of tasks. Oddly, it is not used for LLM post-training, and I have not been able to find a satisfactory explanation as to why. I intend to do a research project investigating how SAC performs for RL-based LLM post-training. The hope is that SAC's entropy maximization results in better exploration, more varied responses, and perhaps greater robustness to jailbreaking, since the model would be trained on more varied experience. If this hypothesis holds, it would allow the community to create more interesting and robust LLMs that maintain alignment.
The SAC algorithm is given in this paper.
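For quick reference, the core of the method is the maximum-entropy objective below (α is the entropy temperature, H the policy entropy). This is the standard form from the SAC literature, restated here only to make the exploration argument concrete:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```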
I plan to first implement and evaluate the algorithm as written, comparing it against PPO on training time, task performance, robustness to jailbreaks, and output diversity.
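On the output-diversity axis, a minimal sketch of the kind of metric I have in mind is a distinct-n score over sampled responses. The function below is illustrative only; the name, whitespace tokenization, and choice of n are my own assumptions, not part of any existing benchmark.

```python
from collections import Counter

def distinct_n(responses, n=2):
    """Fraction of unique n-grams across a batch of sampled responses.

    A rough proxy for output diversity: higher means the model repeats
    itself less across samples for the same prompts.
    """
    total, unique = 0, Counter()
    for text in responses:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        unique.update(ngrams)
        total += len(ngrams)
    return len(unique) / total if total else 0.0

# Example: score outputs from two differently tuned models on the same prompts.
outputs_a = ["the cat sat on the mat", "the cat sat on the rug"]
outputs_b = ["a tabby dozed on the mat", "the kitten curled up by the fire"]
print(distinct_n(outputs_a), distinct_n(outputs_b))
```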
Then, I will try to improve the algorithm. Notably, SAC (as written) requires 4 additional models, which would be prohibitively expensive in VRAM for most users, since each would carry a large set of parameters. Switching from clipped double Q-learning to Double DQN, or using regularization techniques like CQL, may maintain performance while cutting this down to 1 or 2 additional models (see the sketch below).
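To make the model-count point concrete, here is a rough sketch of the clipped double-Q soft target next to a single-critic variant. This is illustrative code under my own assumptions (a discrete, token-level formulation with per-token Q-values over the vocabulary; all names are mine, not from the paper or any library). The clipped version needs two critics plus two target critics alongside the policy, while the single-critic variant keeps only one of each and would have to lean on something like a CQL-style conservative penalty to control overestimation.

```python
import torch

def clipped_double_q_target(reward, done, q1_next, q2_next, next_log_probs, alpha, gamma):
    """Standard SAC target (discrete actions): two critics + two target critics.

    q1_next, q2_next: [batch, vocab] target-critic values at the next state.
    next_log_probs:   [batch, vocab] log pi(a' | s') from the current policy.
    """
    next_probs = next_log_probs.exp()
    soft_v = (next_probs * (torch.min(q1_next, q2_next) - alpha * next_log_probs)).sum(dim=-1)
    return reward + gamma * (1.0 - done) * soft_v

def single_critic_target(reward, done, q_next, next_log_probs, alpha, gamma):
    """Variant with one critic + one target critic (2 extra models instead of 4).

    Drops the min over two critics, so overestimation would need to be handled
    another way, e.g. a CQL-style conservative penalty on the critic loss.
    """
    next_probs = next_log_probs.exp()
    soft_v = (next_probs * (q_next - alpha * next_log_probs)).sum(dim=-1)
    return reward + gamma * (1.0 - done) * soft_v
```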
If I find that SAC or my variations offer a useful improvement along any of the dimensions described, I'll offer a pull request.
Open source status
The method implementation is available
The model weights are available
The training datasets are available
Provide useful links for the implementation
No response