adjust query of reward during training #256
Conversation
- before: the query returned the mean of all units' rewards
- now: the reward is queried per unit, which is better
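A minimal sketch of what this changes, assuming episode rewards are logged per learning unit; the names and data layout here are illustrative, not the actual ASSUME query:

```python
import numpy as np

# hypothetical episode rewards, keyed by unit id
rewards = {"pp_1": [3.0, 4.0], "pp_2": [1.0, 2.0], "pp_3": [5.0, 6.0]}

# before: a single scalar, the mean over all rewards of all units
overall_mean = np.mean([r for unit in rewards.values() for r in unit])  # 3.5

# now: one value per unit, so each unit's learning progress stays visible
per_unit_mean = {uid: float(np.mean(r)) for uid, r in rewards.items()}
# {'pp_1': 3.5, 'pp_2': 1.5, 'pp_3': 5.5}
```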
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```diff
@@           Coverage Diff           @@
##             main     #256   +/-   ##
=======================================
  Coverage   78.44%   78.45%
=======================================
  Files          39       39
  Lines        4259     4260    +1
=======================================
+ Hits         3341     3342    +1
  Misses        918      918
```
I am not quite sure if that is game-theoretically the smartest choice. We aim to have the highest sum of all rewards, not the highest average per unit.
Why the highest sum of all rewards? We want each unit to perform as well as possible. In the current approach, one unit performing really well outshines all the other units which haven't learned anything. For example, nuclear can earn a lot without much effort and has a huge reward, while the others didn't learn much, and their reward is lost.
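A toy illustration of this point, with made-up numbers: a single highly profitable unit can dominate the summed reward and hide that the remaining units have learned nothing:

```python
# hypothetical episode profits per unit
rewards = {"nuclear": 500.0, "lignite": 0.4, "gas": 0.3, "wind": 0.1}

total = sum(rewards.values())   # 500.8 -- the aggregate looks healthy
# but the per-unit view shows three of four units earning almost nothing
laggards = [u for u, r in rewards.items() if r < 1.0]
print(total, laggards)          # 500.8 ['lignite', 'gas', 'wind']
```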
Also, I have learned that taking the max reward is not anywhere close to the equilibrium point. We should introduce a mechanism in the future which checks changes in rewards per unit and exits if no changes in behavior are observed for some period of time.
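A sketch of such a convergence check, stopping once no unit's average reward has moved for a while; all names here are hypothetical, not part of the ASSUME codebase:

```python
def has_converged(
    reward_history: list[dict[str, float]],
    patience: int = 10,
    tol: float = 1e-3,
) -> bool:
    """True if no unit's average reward moved by more than `tol`
    over the last `patience` evaluation rounds."""
    if len(reward_history) < patience + 1:
        return False
    recent = reward_history[-(patience + 1):]
    for unit_id in recent[0]:
        values = [snapshot[unit_id] for snapshot in recent]
        if max(values) - min(values) > tol:
            return False
    return True
```

The training loop would then call `has_converged(history)` after each evaluation round and break early instead of always running the full number of episodes.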
Mhhh, I do not see that. According to game theory, the Nash Equilibrium is when the overall welfare is the highest, since we have a fixed demand that equals the production rent. If the overall welfare (so the absolute sum) is higher when the nuclear plant earns a shit ton of money and the rest do not earn anything, then this is the Nash Equilibrium, regardless of the fairness of the result.
@kim-mskw I don't agree with this definition. Maybe it is the case for some particular designs, but not for a general market setup. The NE is when no one deviates from their policy, so ultimately we should have such a condition for MADRL setups. But for now I believe the average reward of agents is a better representation than the sum of all rewards.
@nick-harder after our bilateral talk I thought about that a lot. You are right: the Nash Equilibrium (or one of multiple) is not the state where the sum of all profits/rewards is maximal, but neither is it the state where the average profits/rewards of all units are the highest. Both are approximations. Frankly, I could not find evidence in the literature hinting at which metric to use in multi-agent reinforcement learning. With the mean we just divide the sum by the number of agents right now, so I came to the conclusion that it should not make any difference anyhow. Hence, my initial thought that it needed to be the sum was wrong.
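The equivalence is easy to verify: with a fixed number of units n, the mean is just the sum divided by the constant n, so both metrics rank any two training states identically (arbitrary numbers below):

```python
rewards_a = [10.0, 2.0, 1.0]  # per-unit rewards in training state A
rewards_b = [5.0, 4.0, 3.0]   # state B, same number of units

n = len(rewards_a)
# dividing by the same constant n never changes which state ranks higher
assert (sum(rewards_a) > sum(rewards_b)) == (sum(rewards_a) / n > sum(rewards_b) / n)
```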