diff --git a/docs/build/doctrees/environment.pickle b/docs/build/doctrees/environment.pickle
index 579174b..1cc018d 100644
Binary files a/docs/build/doctrees/environment.pickle and b/docs/build/doctrees/environment.pickle differ
diff --git a/docs/build/doctrees/rst/blog.doctree b/docs/build/doctrees/rst/blog.doctree
index 14d9e8c..9b36338 100644
Binary files a/docs/build/doctrees/rst/blog.doctree and b/docs/build/doctrees/rst/blog.doctree differ
diff --git a/docs/build/doctrees/rst/ocean.doctree b/docs/build/doctrees/rst/ocean.doctree
index 7413c8d..49eecd1 100644
Binary files a/docs/build/doctrees/rst/ocean.doctree and b/docs/build/doctrees/rst/ocean.doctree differ
diff --git a/docs/build/html/_images/ocean.png b/docs/build/html/_images/ocean.png
new file mode 100644
index 0000000..2eff103
Binary files /dev/null and b/docs/build/html/_images/ocean.png differ
diff --git a/docs/build/html/_sources/rst/blog.rst.txt b/docs/build/html/_sources/rst/blog.rst.txt
index cfea71b..08f602d 100644
--- a/docs/build/html/_sources/rst/blog.rst.txt
+++ b/docs/build/html/_sources/rst/blog.rst.txt
@@ -11,6 +11,39 @@
+🐡🌊 An Ocean of Environments for Learning Pufferfish
+#####################################################
+
+Ocean is a small suite of environments that train from scratch in 30 seconds and render in a terminal. Each environment is a sanity check for a common implementation bug. Use Ocean as a quick verification test whenever you make small code changes.
+
+.. image:: ../resource/ocean.png
+   :width: 100%
+   :align: center
+
+**Memory:** The agent is shown one binary token at a time and must recite them back after a pause. Do not make the sequence too long or you start testing credit assignment.
+
+**Stochasticity:** The agent is rewarded for learning a particular nondeterministic action distribution. Do not use an architecture with memory or the agent can solve the task without stochasticity.
+
+**Exploration:** The agent is rewarded for guessing a specific binary sequence. Do not tune your entropy coefficients higher than you would use in your actual environments, since that is the point of the test.
+
+**Bandit:** The agent is rewarded for solving a multiarmed bandit problem. This environment is included for historical importance. Any reasonable implementation should solve the default setting.
+
+**Squared:** The agent is rewarded for moving to targets that spawn around the edges of a square. There are settings to test memory, exploration, and stochasticity separately or jointly to help you prod at deeper issues with your implementation.
+
+This project is heavily inspired by BSuite, a DeepMind project with similar if more benchmarky goals. BSuite was a bit too heavy for my liking and didn’t fit the niche of a quick and portable verification suite.
+
+I had a few issues designing these. The memory task is apparently a standard RNN copying task (I would be surprised if it weren’t). But it’s a bit different in an RL context because you still have to learn credit assignment. I don’t think there is a way to fully isolate learning only memory outside of a simple 1-step problem. Try increasing the memory sequence length or delay and you will quickly find that the problem gets harder to learn.
+
+The exploration environment is the only one that just worked. You can increase the password length and the problem gets harder to learn at about the rate you would expect. It’s just a guess and check, so once you happen to get the password right once, the goal is to learn from that single instance as much as possible. Any prioritized replay would trivialize the problem.
+
+The stochastic environment took the longest. Initially, I was looking for one where the optimal policy was still stochastic and nontrivial even if the agent had memory. I could not figure out how to make one of these, and Twitter seems to think it’s impossible. They’re probably right, though you might be able to alter the setup conditions a bit, still test for the same thing, and have something that works better. For now, this is a quick and consistent test.
+
+I wrote the bandit environment earlier in the project, and it seems kind of useful, so I left it in the release. Probably a good idea to have at least some version of a problem this historically important easily accessible in PufferLib.
+
+I wrote Squared over the summer. I’m rather fond of it as a test environment, since it is fairly scalable. You spawn at the center of a square and targets spawn around the outside. You get a reward the first time you hit each target and are teleported to the center whenever you hit a target. This means that the optimal policy is stochastic: you place equal probability on moving towards each target and then deterministically move towards the target you have selected. It’s interesting because the optimal policy is stochastic in some states and deterministic in others. You can also turn the problem into a memory test by using a recurrent network. In any event, it’s similar to the bandit problem in that it combines elements of the simpler tests, but it’s a bit more tunable and interpretable.
+
+Let me know if you have other ideas for useful test environments. Lately, I’ve landed on either very simple or very complex environments as being the most useful for research. Many of the tasks in the middle (looking at you Atari) are too slow to be useful as quick tests and too simple to test interesting ideas.
+
 PufferLib 0.5: A Bigger EnvPool for Growing Puffers
 ###################################################
diff --git a/docs/build/html/_sources/rst/ocean.rst.txt b/docs/build/html/_sources/rst/ocean.rst.txt
index 41746b0..3277f1c 100644
--- a/docs/build/html/_sources/rst/ocean.rst.txt
+++ b/docs/build/html/_sources/rst/ocean.rst.txt
@@ -4,6 +4,8 @@
 🌊 Ocean is PufferLib's suite of first-party environments. They are small and can be trained from scratch in 30 seconds to 2 minutes. Use Ocean as a sanity check for your training code instead of overnighting heavier runs.
+.. image:: /resource/ocean.png
+
 Squared
 *******
diff --git a/docs/build/html/genindex.html b/docs/build/html/genindex.html
index fae9b89..bbb6059 100644
--- a/docs/build/html/genindex.html
+++ b/docs/build/html/genindex.html
@@ -246,7 +246,8 @@