
[bug] objective/entropy < 0 when using rlootrainer and ppotrainer #2496

Open
macheng6 opened this issue Dec 17, 2024 · 1 comment
Labels
🙋 help from community wanted Open invitation for community members to contribute 🏋 PPO Related to PPO ❓ question Seeking clarification or more information 🏋 RLOO Related to RLOO

Comments

@macheng6

mean_entropy = (-logprobs).sum(1).mean()

objective/entropy can go negative because, earlier in the code, the padding positions of logprobs are filled with 1.0:

INVALID_LOGPROB = 1.0
logprobs = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)

Each padded position then contributes -1.0 to the entropy sum. I don't understand why INVALID_LOGPROB is set to 1.0; wouldn't it work fine if it were set to 0?
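A minimal sketch of the artifact described above, using hypothetical toy tensors (the shapes and values are illustrative, not taken from the TRL source). Each padded position filled with 1.0 contributes -1.0 to the entropy sum, which can push the reported statistic below zero; filling with 0.0 instead leaves the sum unaffected by padding:

```python
import torch

# Toy batch: 2 sequences, max length 4; the last two positions of the
# second sequence are padding (hypothetical values for illustration).
logprobs = torch.tensor([[-0.5, -1.0, -0.2, -0.3],
                         [-0.7, -0.4,  0.0,  0.0]])
padding_mask = torch.tensor([[False, False, False, False],
                             [False, False, True,  True]])

# Current behavior: padding filled with 1.0.
INVALID_LOGPROB = 1.0
filled = torch.masked_fill(logprobs, padding_mask, INVALID_LOGPROB)

# Each padded position contributes -1.0 to the per-sequence sum, so the
# reported "entropy" is biased downward and can go negative.
mean_entropy = (-filled).sum(1).mean()  # -> 0.55 here

# Alternative: fill padding with 0.0 so it drops out of the sum
# (equivalent to masking before summing, for this statistic).
masked = torch.masked_fill(logprobs, padding_mask, 0.0)
masked_entropy = (-masked).sum(1).mean()  # -> 1.55 here
```

With longer padded spans, the -1.0-per-token bias easily dominates and drives the logged entropy negative, even though true entropy is non-negative.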

@asparius
Contributor

This has been noted previously in #2281. I believe this was introduced in PPOv2, which was a replication of the OpenAI TL;DR paper; that code also uses INVALID_LOGPROB = 1.0, which does not break training because it cancels out in the KL reward. Perhaps @vwxyzjn can explain why this was used instead of a masked_mean version.
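A quick sketch of the cancellation argument, again with hypothetical toy values: because the policy and reference logprobs are both masked_fill-ed with the same constant, their difference (the per-token KL estimate) is exactly zero at padded positions, so the fill value never reaches the reward:

```python
import torch

INVALID_LOGPROB = 1.0
padding_mask = torch.tensor([[False, False, True, True]])

# Policy and reference logprobs, both filled with the same constant
# at padded positions (illustrative values).
logprobs = torch.masked_fill(
    torch.tensor([[-0.5, -1.0, -2.0, -3.0]]), padding_mask, INVALID_LOGPROB)
ref_logprobs = torch.masked_fill(
    torch.tensor([[-0.6, -0.9, -1.5, -2.5]]), padding_mask, INVALID_LOGPROB)

# Per-token KL estimate: at padded positions this is 1.0 - 1.0 == 0.0,
# so the choice of fill value cancels out in the KL reward.
kl = logprobs - ref_logprobs
```

The cancellation holds for any constant, which is consistent with training being unaffected; the entropy logging is the one place where the constant leaks through, since it is summed rather than differenced.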

@qgallouedec qgallouedec added 🙋 help from community wanted Open invitation for community members to contribute ❓ question Seeking clarification or more information 🏋 PPO Related to PPO 🏋 RLOO Related to RLOO labels Dec 20, 2024