Hi, I have recently been trying to reproduce your work and am a little confused while implementing MF-AC. According to the algorithm, the MF value of Eq. (10) needs to be calculated at some point, and it seems this involves many computations to enumerate all possible mean-field actions and their probabilities. I took a look at your MF-AC implementation in the battle game, but it appears to me (please correct me if I am wrong) that the MF values are substituted with the returns from the sampled trajectory. Could you explain more about how to calculate the MF value of Eq. (10), for both MF-AC and MF-Q? Thanks.
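For concreteness, here is how I currently read Eq. (10): the expectation runs only over the agent's own discrete actions under the Boltzmann policy, with the mean action ā taken from the sampled transition rather than enumerated. A minimal numpy sketch of that reading (function and variable names are mine, not from your repo):

```python
import numpy as np

def mf_value(q_values, beta=1.0):
    """Hypothetical sketch of the MF value v^MF(s') for one agent.

    q_values : shape (n_actions,), Q(s', a, a_bar) evaluated at the agent's
               own actions, with the mean action a_bar taken from the sample.
    beta     : Boltzmann temperature.
    """
    q_values = np.asarray(q_values, dtype=float)
    # Boltzmann policy pi(a | s', a_bar) over the agent's own actions
    logits = beta * q_values
    logits -= logits.max()                      # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()
    # Expectation of Q under pi gives the MF value
    return float(np.dot(pi, q_values))

# Usage: Q-values for 5 own actions at the sampled mean action
v = mf_value([0.1, 0.5, -0.2, 0.3, 0.0], beta=0.5)
```

If this reading is right, the per-step cost is just one pass over the agent's own actions, not an enumeration of all mean-field actions.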
It just occurred to me that the sampled-trajectory return is an unbiased estimator of the MF value, which would make it work for a REINFORCE-like AC. But I am still confused about how to calculate it for an off-policy method like MF-Q.
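To make the off-policy part of the question concrete, here is the one-step target I would expect MF-Q to compute from a replay-buffer sample, with no trajectory return involved (again just a sketch under my reading; names are hypothetical, and ā' would be the mean action stored with the transition):

```python
import numpy as np

def mfq_target(reward, next_q_values, done, gamma=0.95, beta=1.0):
    """Hypothetical one-step MF-Q target: y = r + gamma * v^MF(s').

    next_q_values : Q(s', a, a_bar') for each own action a, with the mean
                    action a_bar' read from the stored transition.
    """
    next_q_values = np.asarray(next_q_values, dtype=float)
    logits = beta * next_q_values
    logits -= logits.max()                        # numerical stability
    pi = np.exp(logits) / np.exp(logits).sum()    # Boltzmann policy
    v_next = float(np.dot(pi, next_q_values))     # MF value of Eq. (10)
    return reward + gamma * (1.0 - done) * v_next

# Usage with a single sampled transition from the replay buffer
y = mfq_target(reward=1.0,
               next_q_values=[0.2, 0.4, -0.1],
               done=0.0)
```

Is this roughly what the MF-Q update does, or does it estimate the MF value differently?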