Bug of PPO #1072
Comments
it should be like this:
Why does the negative value cause a failure in the actor loss?
I drew the loss plot and the reward plot; when there is a very small negative value, such as 1e-10, the loss becomes extremely large compared to normal, and the reward stops increasing.
Sorry for the late reply. What you mentioned might be caused by some numerical issues in tf.minimum if I understood correctly. Could you please print out an example case and paste it here? I'm a bit confused by your description since you mentioned both a large negative value (-1e10) and a small positive value (1e-10). A case showing how it causes a large loss value would be great.
ratio = tf.exp(pi.log_prob(action) - old_pi.log_prob(action))
surr = ratio * adv
...
loss = -tf.reduce_mean(tf.minimum(surr, tf.clip_by_value(ratio, 1. - self.epsilon, 1. + self.epsilon) * adv))
It should use ratio in tf.minimum rather than surr, because surr = ratio * adv and adv can contain negative values, so the result of tf.minimum may contain a value like -1e10 and cause the actor loss to fail.
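For illustration only (not from the original thread; the ratio and advantage values below are hypothetical), here is a minimal, runnable sketch that evaluates the clipped surrogate as written above with a negative advantage, so the behavior of tf.minimum can be inspected directly:

# Minimal sketch with hypothetical values: inspect the clipped PPO surrogate
# (tf.minimum over surr = ratio * adv and the clipped-ratio term) when adv < 0.
import tensorflow as tf

epsilon = 0.2
ratio = tf.constant([0.5, 1.0, 50.0])   # hypothetical probability ratios
adv = tf.constant([1.0, -1.0, -1.0])    # note the negative advantages

surr = ratio * adv
clipped = tf.clip_by_value(ratio, 1. - epsilon, 1. + epsilon) * adv

# Clipped objective as in the snippet above: elementwise minimum, then negated mean.
loss = -tf.reduce_mean(tf.minimum(surr, clipped))

print("surr:         ", surr.numpy())     # [ 0.5  -1.  -50. ]
print("clipped * adv:", clipped.numpy())  # [ 0.8  -1.   -1.2]
print("loss:         ", loss.numpy())     # ~16.83
# With adv < 0 and ratio >> 1 + epsilon, tf.minimum keeps the unclipped term
# (e.g. 50 * -1 = -50), so the negated mean becomes a large positive loss.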