Ocean blog post

PufferAI · Jan 14, 2024 · 04d9319
1 parent 22e6bf4
commit 04d9319
Showing 18 changed files with 112 additions and 55 deletions.
diff --git a/docs/build/doctrees/environment.pickle b/docs/build/doctrees/environment.pickle
diff --git a/docs/build/doctrees/rst/blog.doctree b/docs/build/doctrees/rst/blog.doctree
diff --git a/docs/build/doctrees/rst/ocean.doctree b/docs/build/doctrees/rst/ocean.doctree
diff --git a/docs/build/html/_images/ocean.png b/docs/build/html/_images/ocean.png
diff --git a/docs/build/html/_sources/rst/blog.rst.txt b/docs/build/html/_sources/rst/blog.rst.txt
@@ -11,6 +11,39 @@
      </video>
    </center>
 
+🐡🌊 An Ocean of Environments for Learning Pufferfish
+#####################################################
+
+Ocean is a small suite of environments that train from scratch in 30 seconds and render in a terminal. Each environment is a sanity check for a common implementation bug. Use Ocean as a quick verification test whenever you make small code changes.
+
+.. image:: ../resource/ocean.png
+   :width: 100%
+   :align: center
+
+**Memory:** The agent is shown one binary token at a time and must recite them back after a pause. Do not make the sequence too long or you start testing credit assignment.
+
+**Stochasticity:** The agent is rewarded for learning a particular nondeterministic action distribution. Do not use an architecture with memory or the agent can solve the task without stochasticity.
+
+**Exploration:** The agent is rewarded for guessing a specific binary sequence. Do not tune your entropy coefficients higher than you would use in your actual environments, since that is the point of the test.
+
+**Bandit:** The agent is rewarded for solving a multiarmed bandit problem. This environment is included for historical importance. Any reasonable implementation should solve the default setting.
+
+**Squared:** The agent is rewarded for moving to targets that spawn around the edges of a square. There are settings to test memory, exploration, and stochasticity separately or jointly to help you prod at deeper issues with your implementation.
+
+This project is heavily inspired by BSuite, a DeepMind project with similar if more benchmarky goals. BSuite was a bit too heavy for my liking and didn’t fit the niche of a quick and portable verification suite.
+
+I had a few issues designing these. The memory task is apparently a standard RNN copying task (I would be surprised if it weren’t). But it’s a bit different in an RL context because you still have to learn credit assignment. I don’t think there is a way to fully isolate learning only memory outside of a simple 1-step problem. Try increasing the memory sequence length or delay and you will quickly find that the problem gets harder to learn.
+
+The exploration environment is the only one that just worked. You can increase the password length and the problem gets harder to learn at about the rate you would expect. It’s just a guess and check, so once you happen to get the password right once, the goal is to learn from that single instance as much as possible. Any prioritized replay would trivialize the problem.
+
+The stochastic environment took the longest. Initially, I was looking for one where the optimal policy was still stochastic and nontrivial even if the agent had memory. I could not figure out how to make one of these, and Twitter seems to think it’s impossible. They’re probably right, though you might be able to alter the setup conditions a bit, still test for the same thing, and have something that works better. For now, this is a quick and consistent test.
+
+I wrote the bandit environment earlier in the project, and it seems kind of useful, so I left it in the release. Probably a good idea to have at least some version of a problem this historically important easily accessible in PufferLib.
+
+I wrote Squared over the summer. I’m rather fond of it as a test environment, since it is fairly scalable. You spawn at the center of a square and targets spawn around the outside. You get a reward the first time you hit each target and are teleported to the center whenever you hit a target. This means that the optimal policy is stochastic: you place equal probability on moving towards each target and then deterministically move towards the target you have selected. It’s interesting because the optimal policy is stochastic in some states and deterministic in others. You can also turn the problem into a memory test by using a recurrent network. In any event, it’s similar to the bandit problem in that it combines elements of the simpler tests, but it’s a bit more tunable and interpretable.
+
+Let me know if you have other ideas for useful test environments. Lately, I’ve landed on either very simple or very complex environments as being the most useful for research. Many of the tasks in the middle (looking at you Atari) are too slow to be useful as quick tests and too simple to test interesting ideas.
+
 PufferLib 0.5: A Bigger EnvPool for Growing Puffers
 ###################################################
 

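Review note on the hunk above: the exploration task the post describes (guess a specific binary sequence, "just a guess and check") can be sketched in a few lines. This is a minimal illustration, not PufferLib's implementation. The class name `PasswordSketch`, the helper `random_success_rate`, the observation (the number of bits guessed so far), and the single sparse reward at the end of the episode are all assumptions made for this example.

```python
import random


class PasswordSketch:
    """Hypothetical stand-in for the exploration task; not PufferLib's code."""

    def __init__(self, password=(1, 0, 1, 1)):
        self.password = tuple(password)
        self.guess = []

    def reset(self):
        self.guess = []
        return len(self.guess)  # assumed observation: bits guessed so far

    def step(self, action):
        self.guess.append(int(action))
        done = len(self.guess) == len(self.password)
        # Assumed reward: 1.0 only when the full guess matches the password.
        reward = 1.0 if done and tuple(self.guess) == self.password else 0.0
        return len(self.guess), reward, done


def random_success_rate(length, episodes=20000, seed=0):
    """Fraction of episodes in which uniform random guessing finds the password."""
    rng = random.Random(seed)
    env = PasswordSketch(password=[rng.randint(0, 1) for _ in range(length)])
    wins = 0.0
    for _ in range(episodes):
        env.reset()
        done = False
        while not done:
            _, reward, done = env.step(rng.randint(0, 1))
        wins += reward
    return wins / episodes
```

Under these assumptions a uniformly random guesser succeeds with probability 2^-n for a password of length n, which is one way to read the post's remark that the problem gets harder "at about the rate you would expect" as the password grows.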
diff --git a/docs/build/html/_sources/rst/ocean.rst.txt b/docs/build/html/_sources/rst/ocean.rst.txt
@@ -4,6 +4,8 @@
 
 🌊 Ocean is PufferLib's suite of first-party environments. They are small and can be trained from scratch in 30 seconds to 2 minutes. Use Ocean as a sanity check for your training code instead of overnighting heavier runs.
 
+.. image:: /resource/ocean.png
+
 Squared
 *******
 

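Review note on Squared: the blog post in this commit says the optimal policy at the center is stochastic, with equal probability on each target. A small calculation supports this under a simplifying assumption that is mine, not the source's: the agent is memoryless, each center-to-edge trip picks target i independently with probability p_i, and movement steps are ignored. The helper name `expected_targets_hit` is hypothetical.

```python
def expected_targets_hit(probs, trips):
    """Expected number of distinct targets reached in `trips` independent trips.

    Target i is missed on every trip with probability (1 - p_i) ** trips,
    so it is reached at least once with probability 1 - (1 - p_i) ** trips.
    """
    return sum(1.0 - (1.0 - p) ** trips for p in probs)


# Three candidate policies at the center of a square with four targets.
uniform = expected_targets_hit([0.25] * 4, trips=4)
skewed = expected_targets_hit([0.7, 0.1, 0.1, 0.1], trips=4)
fixed = expected_targets_hit([1.0, 0.0, 0.0, 0.0], trips=4)
print(uniform, skewed, fixed)
```

Because only the first hit on each target is rewarded, a policy that always heads for the same target collects a single reward, while spreading probability evenly gives the highest expected count in this simplified model. An agent with memory could instead visit the targets in a fixed order, which is consistent with the post's note that a recurrent network turns the task into a memory test.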
diff --git a/docs/build/html/genindex.html b/docs/build/html/genindex.html
@@ -246,7 +246,8 @@
 </ul>
 <p class="caption" role="heading"><span class="caption-text">Blog</span></p>
 <ul>
-<li class="toctree-l1"><a class="reference internal" href="rst/blog.html">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
+<li class="toctree-l1"><a class="reference internal" href="rst/blog.html">🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
+<li class="toctree-l1"><a class="reference internal" href="rst/blog.html#pufferlib-0-5-a-bigger-envpool-for-growing-puffers">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
 <li class="toctree-l1"><a class="reference internal" href="rst/blog.html#pufferlib-0-4-ready-to-take-on-bigger-fish">PufferLib 0.4: Ready to Take on Bigger Fish</a></li>
 <li class="toctree-l1"><a class="reference internal" href="rst/blog.html#pufferlib-0-2-ready-to-take-on-the-big-fish">PufferLib 0.2: Ready to Take on the Big Fish</a></li>
 </ul>

diff --git a/docs/build/html/index.html b/docs/build/html/index.html
@@ -248,7 +248,8 @@
 </ul>
 <p class="caption" role="heading"><span class="caption-text">Blog</span></p>
 <ul>
-<li class="toctree-l1"><a class="reference internal" href="rst/blog.html">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
+<li class="toctree-l1"><a class="reference internal" href="rst/blog.html">🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
+<li class="toctree-l1"><a class="reference internal" href="rst/blog.html#pufferlib-0-5-a-bigger-envpool-for-growing-puffers">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
 <li class="toctree-l1"><a class="reference internal" href="rst/blog.html#pufferlib-0-4-ready-to-take-on-bigger-fish">PufferLib 0.4: Ready to Take on Bigger Fish</a></li>
 <li class="toctree-l1"><a class="reference internal" href="rst/blog.html#pufferlib-0-2-ready-to-take-on-the-big-fish">PufferLib 0.2: Ready to Take on the Big Fish</a></li>
 </ul>
@@ -321,7 +322,8 @@ <h1>Index<a class="headerlink" href="#index" title="Permalink to this heading">#
 <div class="toctree-wrapper compound">
 <p class="caption" role="heading"><span class="caption-text">Blog</span></p>
 <ul>
-<li class="toctree-l1"><a class="reference internal" href="rst/blog.html">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a><ul>
+<li class="toctree-l1"><a class="reference internal" href="rst/blog.html">🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
+<li class="toctree-l1"><a class="reference internal" href="rst/blog.html#pufferlib-0-5-a-bigger-envpool-for-growing-puffers">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a><ul>
 <li class="toctree-l2"><a class="reference internal" href="rst/blog.html#the-simulation-crisis">The Simulation Crisis</a></li>
 <li class="toctree-l2"><a class="reference internal" href="rst/blog.html#the-solution">The Solution</a></li>
 <li class="toctree-l2"><a class="reference internal" href="rst/blog.html#experiments">Experiments</a></li>

diff --git a/docs/build/html/objects.inv b/docs/build/html/objects.inv
diff --git a/docs/build/html/rst/api.html b/docs/build/html/rst/api.html
@@ -248,7 +248,8 @@
 </ul>
 <p class="caption" role="heading"><span class="caption-text">Blog</span></p>
 <ul>
-<li class="toctree-l1"><a class="reference internal" href="blog.html">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
+<li class="toctree-l1"><a class="reference internal" href="blog.html">🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
+<li class="toctree-l1"><a class="reference internal" href="blog.html#pufferlib-0-5-a-bigger-envpool-for-growing-puffers">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
 <li class="toctree-l1"><a class="reference internal" href="blog.html#pufferlib-0-4-ready-to-take-on-bigger-fish">PufferLib 0.4: Ready to Take on Bigger Fish</a></li>
 <li class="toctree-l1"><a class="reference internal" href="blog.html#pufferlib-0-2-ready-to-take-on-the-big-fish">PufferLib 0.2: Ready to Take on the Big Fish</a></li>
 </ul>

diff --git a/docs/build/html/rst/blog.html b/docs/build/html/rst/blog.html
@@ -6,7 +6,7 @@
 <link rel="index" title="Index" href="../genindex.html" /><link rel="search" title="Search" href="../search.html" /><link rel="prev" title="Squared" href="ocean.html" />
 
     <!-- Generated with Sphinx 5.0.0 and Furo 2023.03.27 -->
-        <title>PufferLib 0.5: A Bigger EnvPool for Growing Puffers - PufferLib 0.6.0 documentation</title>
+        <title>🐡🌊 An Ocean of Environments for Learning Pufferfish - PufferLib 0.6.0 documentation</title>
       <link rel="stylesheet" type="text/css" href="../_static/pygments.css" />
     <link rel="stylesheet" type="text/css" href="../_static/styles/furo.css?digest=fad236701ea90a88636c2a8c73b44ae642ed2a53" />
     <link rel="stylesheet" type="text/css" href="../_static/design-style.1e8bd061cd6da7fc9cf755528e8ffc24.min.css" />
@@ -201,7 +201,7 @@
           <svg class="theme-icon-when-light"><use href="#svg-sun"></use></svg>
         </button>
       </div>
-      <label class="toc-overlay-icon toc-header-icon" for="__toc">
+      <label class="toc-overlay-icon toc-header-icon no-toc" for="__toc">
         <div class="visually-hidden">Toggle table of contents sidebar</div>
         <i class="icon"><svg><use href="#svg-toc"></use></svg></i>
       </label>
@@ -248,7 +248,8 @@
 </ul>
 <p class="caption" role="heading"><span class="caption-text">Blog</span></p>
 <ul class="current">
-<li class="toctree-l1 current current-page"><a class="current reference internal" href="#">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
+<li class="toctree-l1 current current-page"><a class="current reference internal" href="#">🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
+<li class="toctree-l1"><a class="reference internal" href="#pufferlib-0-5-a-bigger-envpool-for-growing-puffers">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
 <li class="toctree-l1"><a class="reference internal" href="#pufferlib-0-4-ready-to-take-on-bigger-fish">PufferLib 0.4: Ready to Take on Bigger Fish</a></li>
 <li class="toctree-l1"><a class="reference internal" href="#pufferlib-0-2-ready-to-take-on-the-big-fish">PufferLib 0.2: Ready to Take on the Big Fish</a></li>
 </ul>
@@ -279,7 +280,7 @@
               <svg class="theme-icon-when-light"><use href="#svg-sun"></use></svg>
             </button>
           </div>
-          <label class="toc-overlay-icon toc-content-icon" for="__toc">
+          <label class="toc-overlay-icon toc-content-icon no-toc" for="__toc">
             <div class="visually-hidden">Toggle table of contents sidebar</div>
             <i class="icon"><svg><use href="#svg-toc"></use></svg></i>
           </label>
@@ -291,7 +292,24 @@
     <source src="../_static/banner.mp4" type="video/mp4">
     Your browser does not support this video.
   </video>
-</center><section id="pufferlib-0-5-a-bigger-envpool-for-growing-puffers">
+</center><section id="an-ocean-of-environments-for-learning-pufferfish">
+<h1>🐡🌊 An Ocean of Environments for Learning Pufferfish<a class="headerlink" href="#an-ocean-of-environments-for-learning-pufferfish" title="Permalink to this heading">#</a></h1>
+<p>Ocean is a small suite of environments that train from scratch in 30 seconds and render in a terminal. Each environment is a sanity check for a common implementation bug. Use Ocean as a quick verification test whenever you make small code changes.</p>
+<a class="reference internal image-reference" href="../_images/ocean.png"><img alt="../_images/ocean.png" class="align-center" src="../_images/ocean.png" style="width: 100%;" /></a>
+<p><strong>Memory:</strong> The agent is shown one binary token at a time and must recite them back after a pause. Do not make the sequence too long or you start testing credit assignment.</p>
+<p><strong>Stochasticity:</strong> The agent is rewarded for learning a particular nondeterministic action distribution. Do not use an architecture with memory or the agent can solve the task without stochasticity.</p>
+<p><strong>Exploration:</strong> The agent is rewarded for guessing a specific binary sequence. Do not tune your entropy coefficients higher than you would use in your actual environments, since that is the point of the test.</p>
+<p><strong>Bandit:</strong> The agent is rewarded for solving a multiarmed bandit problem. This environment is included for historical importance. Any reasonable implementation should solve the default setting.</p>
+<p><strong>Squared:</strong> The agent is rewarded for moving to targets that spawn around the edges of a square. There are settings to test memory, exploration, and stochasticity separately or jointly to help you prod at deeper issues with your implementation.</p>
+<p>This project is heavily inspired by BSuite, a DeepMind project with similar if more benchmarky goals. BSuite was a bit too heavy for my liking and didn’t fit the niche of a quick and portable verification suite.</p>
+<p>I had a few issues designing these. The memory task is apparently a standard RNN copying task (I would be surprised if it weren’t). But it’s a bit different in an RL context because you still have to learn credit assignment. I don’t think there is a way to fully isolate learning only memory outside of a simple 1-step problem. Try increasing the memory sequence length or delay and you will quickly find that the problem gets harder to learn.</p>
+<p>The exploration environment is the only one that just worked. You can increase the password length and the problem gets harder to learn at about the rate you would expect. It’s just a guess and check, so once you happen to get the password right once, the goal is to learn from that single instance as much as possible. Any prioritized replay would trivialize the problem.</p>
+<p>The stochastic environment took the longest. Initially, I was looking for one where the optimal policy was still stochastic and nontrivial even if the agent had memory. I could not figure out how to make one of these, and Twitter seems to think it’s impossible. They’re probably right, though you might be able to alter the setup conditions a bit, still test for the same thing, and have something that works better. For now, this is a quick and consistent test.</p>
+<p>I wrote the bandit environment earlier in the project, and it seems kind of useful, so I left it in the release. Probably a good idea to have at least some version of a problem this historically important easily accessible in PufferLib.</p>
+<p>I wrote Squared over the summer. I’m rather fond of it as a test environment, since it is fairly scalable. You spawn at the center of a square and targets spawn around the outside. You get a reward the first time you hit each target and are teleported to the center whenever you hit a target. This means that the optimal policy is stochastic: you place equal probability on moving towards each target and then deterministically move towards the target you have selected. It’s interesting because the optimal policy is stochastic in some states and deterministic in others. You can also turn the problem into a memory test by using a recurrent network. In any event, it’s similar to the bandit problem in that it combines elements of the simpler tests, but it’s a bit more tunable and interpretable.</p>
+<p>Let me know if you have other ideas for useful test environments. Lately, I’ve landed on either very simple or very complex environments as being the most useful for research. Many of the tasks in the middle (looking at you Atari) are too slow to be useful as quick tests and too simple to test interesting ideas.</p>
+</section>
+<section id="pufferlib-0-5-a-bigger-envpool-for-growing-puffers">
 <h1>PufferLib 0.5: A Bigger EnvPool for Growing Puffers<a class="headerlink" href="#pufferlib-0-5-a-bigger-envpool-for-growing-puffers" title="Permalink to this heading">#</a></h1>
 <p>This is what reinforcement learning does to your CPU utilization:</p>
 <figure class="align-default">
@@ -502,47 +520,8 @@ <h2>Next Steps<a class="headerlink" href="#next-steps" title="Permalink to this
 
       </footer>
     </div>
-    <aside class="toc-drawer">
-
+    <aside class="toc-drawer no-toc">
 
-      <div class="toc-sticky toc-scroll">
-        <div class="toc-title-container">
-          <span class="toc-title">
-            On this page
-          </span>
-        </div>
-        <div class="toc-tree-container">
-          <div class="toc-tree">
-            <ul>
-<li><a class="reference internal" href="#">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a><ul>
-<li><a class="reference internal" href="#the-simulation-crisis">The Simulation Crisis</a></li>
-<li><a class="reference internal" href="#the-solution">The Solution</a></li>
-<li><a class="reference internal" href="#experiments">Experiments</a></li>
-<li><a class="reference internal" href="#technical-details-and-gotchas">Technical Details and Gotchas</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#pufferlib-0-4-ready-to-take-on-bigger-fish">PufferLib 0.4: Ready to Take on Bigger Fish</a><ul>
-<li><a class="reference internal" href="#emulation">Emulation</a></li>
-<li><a class="reference internal" href="#vectorization">Vectorization</a></li>
-<li><a class="reference internal" href="#puffertank">PufferTank</a></li>
-<li><a class="reference internal" href="#policies">Policies</a></li>
-<li><a class="reference internal" href="#error-handling">Error Handling</a></li>
-<li><a class="reference internal" href="#miscellaneous">Miscellaneous</a></li>
-</ul>
-</li>
-<li><a class="reference internal" href="#pufferlib-0-2-ready-to-take-on-the-big-fish">PufferLib 0.2: Ready to Take on the Big Fish</a><ul>
-<li><a class="reference internal" href="#problem-statement">Problem Statement</a></li>
-<li><a class="reference internal" href="#cleanrl-demos">CleanRL Demos</a></li>
-<li><a class="reference internal" href="#pufferlib-emulation">PufferLib Emulation</a></li>
-<li><a class="reference internal" href="#pufferlib-vectorization">PufferLib Vectorization</a></li>
-<li><a class="reference internal" href="#next-steps">Next Steps</a></li>
-</ul>
-</li>
-</ul>
-
-          </div>
-        </div>
-      </div>
 
 
     </aside>

diff --git a/docs/build/html/rst/landing.html b/docs/build/html/rst/landing.html
@@ -248,7 +248,8 @@
 </ul>
 <p class="caption" role="heading"><span class="caption-text">Blog</span></p>
 <ul>
-<li class="toctree-l1"><a class="reference internal" href="blog.html">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
+<li class="toctree-l1"><a class="reference internal" href="blog.html">🐡🌊 An Ocean of Environments for Learning Pufferfish</a></li>
+<li class="toctree-l1"><a class="reference internal" href="blog.html#pufferlib-0-5-a-bigger-envpool-for-growing-puffers">PufferLib 0.5: A Bigger EnvPool for Growing Puffers</a></li>
 <li class="toctree-l1"><a class="reference internal" href="blog.html#pufferlib-0-4-ready-to-take-on-bigger-fish">PufferLib 0.4: Ready to Take on Bigger Fish</a></li>
 <li class="toctree-l1"><a class="reference internal" href="blog.html#pufferlib-0-2-ready-to-take-on-the-big-fish">PufferLib 0.2: Ready to Take on the Big Fish</a></li>
 </ul>