Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is the evluation procedure different? #57

Open
guydav opened this issue Sep 11, 2019 · 8 comments
Open

Is the evluation procedure different? #57

guydav opened this issue Sep 11, 2019 · 8 comments

Comments

@guydav
Copy link
Contributor

guydav commented Sep 11, 2019

Hi Kai,

In the Rainbow paper, the evaluation procedure is described as

The average scores of the agent are evaluated during training, every 1M steps in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play).

However, the code as written tests for a fixed number of episodes. Am I missing anything? Or is this the procedure from the data-efficient Rainbow paper (I couldn't find a detailed description there).

Thanks!

@guydav
Copy link
Contributor Author

guydav commented Sep 11, 2019

Another question which I'll tack onto here -- the default value for target-update parameter is 8000, which matches Table 1 in the Rainbow paper, which reports it as 32k frames.

Do you have a sense of why the data-efficient Rainbow paper, in Table 2 (Appendix E), reports the update period for both Rainbow and the data-efficient Rainbow as being every 2000 updates?

@Kaixhin
Copy link
Owner

Kaixhin commented Sep 12, 2019

Ah honestly using a fixed number of episodes is something I came up with (as it makes sense, keeps statistics easy, and also works across other environments), and I completely overlooked that evaluation detail. My default - 10 episodes - could run for max ~2x as long as DeepMind's procedure, but I feel like 5 episodes may be a bit too little to get a good estimate? So I'm in favour of keeping what I've done, but maybe noting in the readme that this differs from the original procedure - what do you think?

As noted in the footnote of the data-efficient paper, the target network update period is reported with respect to the online network update, which is only every 4 steps in the original and hence 8000 steps for the target network update.

@guydav
Copy link
Contributor Author

guydav commented Sep 12, 2019

I think that makes sense. I don't know why you'd evaluate over a fixed number of frames rather than episodes. You could make a TODO to eventually implement their evaluation procedure too? It wouldn't be hard at all but might take a bit of someone's time.

Another realization regarding the target-update bit. These 32K frames are not, actually, 32K unique frames, right? If I understand correctly, we repeat every action 4 times, add the max pool of the last two action frames to the state (and drop the oldest frame from the state), and pass the 4-frame state buffer to the agent.

In other words, every pair of agent actions shares three of the four frames in the environment state, correct?

(also, which paper does the max pool of the last two frames of the action repetition come from, if at all? I'm just trying to trace all of these implementation details.)

@Kaixhin
Copy link
Owner

Kaixhin commented Sep 12, 2019

OK I've added a TODO with 2188966 .

You are correct. I can't remember where this was first reported in a paper, but I've spent years trying to replicate DeepMind's results, and I did a lot of digging into their Atari wrapper repositories and the DQN source code released with the Nature paper to get these implementation details.

@guydav
Copy link
Contributor Author

guydav commented Sep 12, 2019

Thanks for the clarification. There appears to be so much voodoo around the implementation details that make it quite hard to know when you can trust your results.

It's interesting that the original Rainbow paper frames it as updating ever 32K frames, which, while strictly true, is far fewer than that in actual games frames given the overlap.

@guydav
Copy link
Contributor Author

guydav commented Sep 19, 2019

@Kaixhin -- it seems that most papers also evaluate on either no-op starts or human starts. Did you ever take a stab at implementing either?

@Kaixhin
Copy link
Owner

Kaixhin commented Sep 19, 2019

No-op starts are still used during evaluation. Haven't tried human starts, but have no idea where you would get them from (presumably internal to DeepMind).

@guydav
Copy link
Contributor Author

guydav commented Sep 20, 2019

Ah, I see, in env.reset(). That makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants