Commit: 0.5 stable

jsuarez5341 committed Dec 28, 2023
1 parent c16697d commit 366d4ad
Showing 32 changed files with 874 additions and 1,086 deletions.
Binary file modified docs/build/doctrees/environment.pickle
Binary file modified docs/build/doctrees/index.doctree
Binary file modified docs/build/doctrees/rst/api.doctree
Binary file modified docs/build/doctrees/rst/blog.doctree
Binary file modified docs/build/doctrees/rst/landing.doctree
2 changes: 1 addition & 1 deletion docs/build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 4bbe4e1bdc45b7d096f5e7a0a5eb5873
+config: 6be11b3893c1b9feee4dcb2d0620068b
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added docs/build/html/_images/0-5_blog_envpool.png
Binary file added docs/build/html/_images/0-5_blog_header.png
101 changes: 17 additions & 84 deletions docs/build/html/_sources/rst/api.rst.txt
@@ -9,7 +9,7 @@ Emulation

Wrap your environments for broad compatibility. Supports passing creator functions, classes, or env objects. The API of the returned PufferEnv is the same as Gym/PettingZoo.

-.. autoclass:: pufferlib.emulation.GymPufferEnv
+.. autoclass:: pufferlib.emulation.GymnasiumPufferEnv
   :members:
   :undoc-members:
   :noindex:
@@ -19,101 +19,29 @@ Wrap your environments for broad compatibility
   :undoc-members:
   :noindex:

-Registry
-########
+Environments
+############

-make_env functions and policies for included environments.
+All included environments expose make_env and env_creator functions. make_env is the one you want most of the time; env_creator exposes e.g. class interfaces for environments that support them, so you can pass around static references.

-Atari
-*****
-
-.. automodule:: pufferlib.registry.atari
-   :members:
-   :undoc-members:
-   :noindex:
+Additionally, all environments expose a Policy class with a baseline model. Note that not all environments have *custom* policies; the default simply flattens observations before applying a linear layer. Atari, Procgen, Neural MMO, NetHack/MiniHack, and Pokemon Red currently have reasonable policies.

+The PufferLib Squared environment is used as an example below. Everything is exposed through __init__, so you can call these methods via e.g. pufferlib.environments.squared.make_env.

-Butterfly
-*********
-
-.. automodule:: pufferlib.registry.butterfly
+.. automodule:: pufferlib.environments.squared.environment
   :members:
   :undoc-members:
   :noindex:


-Classic Control
-***************
-
-.. automodule:: pufferlib.registry.classic_control
-   :members:
-   :undoc-members:
-   :noindex:
-
-Crafter
-*******
-
-.. automodule:: pufferlib.registry.crafter
-   :members:
-   :undoc-members:
-   :noindex:
-
-Griddly
-*******
-
-.. automodule:: pufferlib.registry.griddly
-   :members:
-   :undoc-members:
-   :noindex:
-
-MAgent
-******
-
-.. automodule:: pufferlib.registry.magent
-   :members:
-   :undoc-members:
-   :noindex:
-
-MicroRTS
-********
-
-.. automodule:: pufferlib.registry.microrts
-   :members:
-   :undoc-members:
-   :noindex:
-
-NetHack
-*******
-
-.. automodule:: pufferlib.registry.nethack
-   :members:
-   :undoc-members:
-   :noindex:
-
-Neural MMO
-**********
-
-.. automodule:: pufferlib.registry.nmmo
-   :members:
-   :undoc-members:
-   :noindex:
-
-Procgen
-*******
-
-.. automodule:: pufferlib.registry.procgen
+.. autoclass:: pufferlib.environments.squared.torch.Policy
   :members:
   :undoc-members:
   :noindex:

Models
######

-PufferLib model API and default policies
+PufferLib default policies and an optional model API. These are not required to use PufferLib.

.. automodule:: pufferlib.models
   :members:
@@ -150,7 +78,7 @@ Wrap your PyTorch policies for use with CleanRL
   :undoc-members:
   :noindex:

-Recurrence requires you to subclass our base policy instead. See the default policies for examples.
+Wrap your PyTorch policies for use with CleanRL, adding an LSTM. This requires our policy API. It's pretty simple -- see the default policies for examples.

.. autoclass:: pufferlib.frameworks.cleanrl.RecurrentPolicy
   :members:
@@ -160,9 +88,14 @@ Recurrence requires you to subclass our base policy
RLlib Binding
#############

-Wrap your policies for use with RLlib (WIP)
+Wrap your policies for use with RLlib (Shelved until RLlib is more stable)

.. automodule:: pufferlib.frameworks.rllib
   :members:
   :undoc-members:
-   :noindex:
+   :noindex:
+
+SB3 Binding
+###########
+
+Coming soon!
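
In practice, the new API described in this file reads roughly as follows (a sketch based on the names in the diff above; the env_creator keyword and the exact signatures are assumptions):

.. code-block:: python

    # Sketch of the 0.5 API surface. GymnasiumPufferEnv, make_env, and
    # pufferlib.environments.squared are named in the docs above; the
    # env_creator keyword and exact signatures are assumptions.
    import gymnasium
    import pufferlib.emulation
    import pufferlib.environments.squared

    # Included environments expose make_env through __init__
    env = pufferlib.environments.squared.make_env()
    obs, info = env.reset()

    # Or wrap your own environment; creator functions, classes, and env
    # objects are all supported per the description above
    wrapped = pufferlib.emulation.GymnasiumPufferEnv(
        env_creator=lambda: gymnasium.make('CartPole-v1'))
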
67 changes: 67 additions & 0 deletions docs/build/html/_sources/rst/blog.rst.txt
@@ -11,6 +11,73 @@
</video>
</center>

PufferLib 0.5: A Bigger EnvPool for Growing Puffers
###################################################

This is what reinforcement learning does to your CPU utilization.

.. figure:: ../_static/0-5_blog_header.png

You wouldn’t pack a box this way, right? With PufferLib 0.5, we are releasing a Python implementation of EnvPool to solve this problem. **TL;DR: ~20% performance improvement across most workloads, up to 2x for complex environments, and native multiagent support.**

.. figure:: ../_static/0-5_blog_envpool.png

If you just want the enhancements, you can pip install -U pufferlib. But if you’d like to see a bit behind the curtain, read on!

The Simulation Crisis
*********************

You want to do some RL research, so you install Atari. Say it runs at 1000 steps/second on 1 core and 5000 steps/second on 6 cores. Now, you decide you want to work on a more interesting environment and happen upon Neural MMO, a brilliant project that must have been developed by a truly fantastic team. It runs at 1500 steps/second – faster than Atari! So you scale it up to 6 cores and it runs at … 1800 steps per second. What gives?

The problem is that environments do not all take the same time to simulate, and even when they do, many modern CPUs have cores that run at different speeds. Parallelization overhead is mostly the sum of:

- Launching/synchronization overhead. This is roughly 0.1 ms per process and is linear in the number of processes. At ~100 steps per second, you can ignore it. At >10,000 steps/second, it is the main limiting factor.
- Environment variance. This is governed by the ratio std/mean of the environment simulation time and grows with the number of processes, because a synchronous step waits for the slowest worker (see the sketch below). For 24 processes, 10% std is 20% overhead and 100% std is 300% overhead.
- Different core speeds. Many modern CPUs, especially Intel desktop series processors, feature additional cores that are ~20% slower than the main cores.
- Model latency. This is the time taken to transfer observations to the GPU, run the model, and transfer actions back to the CPU. It is not technically part of multiprocessing overhead, but naive implementations will leave CPUs idle during model inference.

As a rule of thumb, simple RL environments have < 10% variance because the code is always simulating roughly the same thing. Complex environments, especially ones with variable numbers of agents, can have > 100% variance because different code runs depending on the current state. On the other hand, if your environment has 100 agents, you are effectively running 100x fewer simulations for the same data, so launching/synchronization overhead is lower.
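
To put rough numbers on the variance term, here is a quick Monte Carlo sketch (an illustration, not code from this release; the Gaussian delay model and all parameter names are assumptions):

.. code-block:: python

    # Illustrative only: estimate synchronization overhead when a vectorized
    # step must wait for the slowest of num_workers environments.
    import random

    def variance_overhead(num_workers, std_frac, mean=1.0, trials=10_000):
        """Monte Carlo estimate of E[max step time] / mean - 1."""
        total = 0.0
        for _ in range(trials):
            times = [max(0.0, random.gauss(mean, std_frac * mean))
                     for _ in range(num_workers)]
            total += max(times)
        return total / (trials * mean) - 1.0

    for std_frac in (0.1, 0.5, 1.0):
        overhead = variance_overhead(24, std_frac)
        print(f'24 workers, {std_frac:.0%} std: ~{overhead:.0%} overhead')

With a Gaussian model this reproduces roughly the 20% figure for 10% std; heavier-tailed delay distributions push the 100% std case toward the larger overheads quoted above.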

The Solution
************

Run multiple environments per process if your environment runs faster than ~2,000 steps/second with variance under ~10%. This reduces the impact of launching/synchronization overhead and also reduces variance, because you are summing over samples. In PufferLib, we typically enable this only for environments above ~5,000 steps/second because of interactions with the optimizations below.

Simulate multiple buffers of environments so that one buffer is running while your model is processing observations from the other. This technique was introduced by https://github.com/alex-petrenko/sample-factory and does not speed up simulation, but it allows you to interleave simulations from two sets of environments. It’s a good trick, but it is superseded by the final optimization, which is faster and simpler.

Run a pool of environments and sample from the first ones to finish stepping. For example, if you want a batch of 24 observations, you might run 64 environments. At each step, the 24 for which you have computed actions are going to take a while to simulate, but you can still select the fastest 24 from the other 64-24=40 environments. This technique was introduced by https://github.com/sail-sg/envpool and is massively effective, but the original implementation is only for specific C/C++ environments. PufferLib’s implementation is in Python, so it is slower, but it works for arbitrary Python environments and includes native multiagent support.
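
A minimal sketch of the pooling idea follows (hypothetical names, assuming a Gymnasium-style env_fn; PufferLib's real implementation additionally handles multiagent environments, the envs-per-worker batching described above, and much more careful data transfer):

.. code-block:: python

    # Envpool-style pooling sketch: run more environments than the batch size
    # and fill each batch with the first workers to finish stepping.
    import multiprocessing as mp
    from multiprocessing.connection import wait

    def worker(remote, env_fn):
        env = env_fn()
        obs, _ = env.reset()
        remote.send(obs)  # initial observation
        while True:
            obs, reward, terminated, truncated, info = env.step(remote.recv())
            if terminated or truncated:
                obs, _ = env.reset()
            remote.send(obs)

    class MinimalPool:
        def __init__(self, env_fn, num_envs=64, batch_size=24):
            self.batch_size = batch_size
            self.remotes = []
            for _ in range(num_envs):
                parent, child = mp.Pipe()
                mp.Process(target=worker, args=(child, env_fn), daemon=True).start()
                self.remotes.append(parent)

        def recv(self):
            # Gather observations from the first batch_size workers to finish
            obs, ready, pending = [], [], list(self.remotes)
            while len(ready) < self.batch_size:
                for remote in wait(pending):
                    obs.append(remote.recv())
                    ready.append(remote)
                    pending.remove(remote)
                    if len(ready) == self.batch_size:
                        break
            return obs, ready

        def send(self, actions, ready):
            # Step only the environments that were just observed
            for remote, action in zip(ready, actions):
                remote.send(action)

The driver loop is then obs, ready = pool.recv() followed by pool.send(policy(obs), ready). Environments that are still simulating simply skip batches until they finish.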

Experiments
***********

To evaluate the performance of different backends, I am using an i9-13900K (24 cores) on a max-specced Maingear desktop running a minimal Debian 12 install. We test 9 simulated environments spanning 1e-2 to 1e-4 seconds of mean delay per step with 0-100% delay std. For each environment, we spawn 1, 6, 24, 96, and 192 processes for each backend tested (Gymnasium's and PufferLib's serial and multiprocessing implementations, plus PufferLib's pool). We also have Ray implementations compatible with our pooling code, but that will be a separate post. Additionally, the PufferLib implementations sweep over (1, 2, 4) environments per process, and PufferLib pool computes 24 observations at a time. We do not consider model latency, which can yield another 2x relative performance for pooling on specific workloads.
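
The simulated environments look roughly like this (an assumed stand-in, not the exact benchmark code; DelayEnv and its parameters are illustrative):

.. code-block:: python

    # Stand-in for the benchmark environments: each step busy-waits for a
    # random delay with the given mean and standard deviation.
    import random
    import time

    import gymnasium
    import numpy as np

    class DelayEnv(gymnasium.Env):
        observation_space = gymnasium.spaces.Box(-1, 1, shape=(1,), dtype=np.float32)
        action_space = gymnasium.spaces.Discrete(2)

        def __init__(self, delay_mean=1e-3, delay_std_frac=0.5):
            self.delay_mean = delay_mean
            self.delay_std = delay_std_frac * delay_mean

        def reset(self, seed=None, options=None):
            return np.zeros(1, dtype=np.float32), {}

        def step(self, action):
            delay = max(0.0, random.gauss(self.delay_mean, self.delay_std))
            end = time.process_time() + delay  # busy-wait; see the gotchas below
            while time.process_time() < end:
                pass
            return np.zeros(1, dtype=np.float32), 0.0, False, False, {}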

.. figure:: ../_static/0-5_blog_envpool.png

The figure shows 9 sets of bars, one per environment, with 5 groups per environment, one for each number of processes. The serial Gymnasium/PufferLib experiments match in all cases. The best PufferLib settings are 10-20% faster than the best Gymnasium settings for all workloads and can be up to 2x faster for environments with a high standard deviation in important cases (for instance, you may not want to run 192 copies of heavy environments). Again, this is before even considering the time saved by interleaving with the model forward pass.

All of the implementations start to dip ~10% at 1,000 steps/second and ~50% at 10,000 steps/second. To make absolutely sure that this overhead is unavoidable, I reimplemented the entire pool architecture as minimally as possible, without any of the environment wrapper or data transfer overhead:

::

    SPS: 10734.36 envs_per_worker: 1 delay_mean: 0 delay_std: 0 num_workers: 1 batch_size: 1 sync: False
    SPS: 11640.42 envs_per_worker: 1 delay_mean: 0 delay_std: 0 num_workers: 1 batch_size: 1 sync: True
    SPS: 32715.65 envs_per_worker: 1 delay_mean: 0 delay_std: 0 num_workers: 6 batch_size: 6 sync: False
    SPS: 27635.31 envs_per_worker: 1 delay_mean: 0 delay_std: 0 num_workers: 6 batch_size: 6 sync: True
    SPS: 22681.48 envs_per_worker: 1 delay_mean: 0 delay_std: 0 num_workers: 24 batch_size: 6 sync: False
    SPS: 26183.73 envs_per_worker: 1 delay_mean: 0 delay_std: 0 num_workers: 24 batch_size: 24 sync: False
    SPS: 30120.75 envs_per_worker: 1 delay_mean: 0 delay_std: 0 num_workers: 24 batch_size: 6 sync: True

As it turns out, Python’s multiprocessing caps around 10,000 steps per second per worker. There is still room for improvement by running more environments per process, but at this speed, small optimizations to the data processing code start to matter much more.
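
You can reproduce a per-worker ceiling of this order with a bare Pipe round-trip benchmark (a sketch, not the harness that produced the numbers above; results vary by machine):

.. code-block:: python

    # How fast can a single worker echo small messages over a Pipe?
    # Approximates the per-worker multiprocessing ceiling discussed above.
    import multiprocessing as mp
    import time

    def echo(remote):
        while True:
            remote.send(remote.recv())

    if __name__ == '__main__':
        parent, child = mp.Pipe()
        mp.Process(target=echo, args=(child,), daemon=True).start()
        steps = 100_000
        start = time.perf_counter()
        for i in range(steps):
            parent.send(i)
            parent.recv()
        elapsed = time.perf_counter() - start
        print(f'{steps / elapsed:,.0f} round trips/second')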

Technical Details and Gotchas
*****************************

PufferLib's vectorization library is extremely concise: around 800 lines for serial, multiprocessing, and ray backends with support for PufferLib's Gymnasium and PettingZoo wrappers. Adding envpool only required changing around 100 lines of code, but it took a lot of performance testing:

- Don't use multiprocessing.Queue. There's no fast way to poll which processes are done. Instead, use multiprocessing.Pipe and poll with selectors, as sketched below. I have not seen noticeable overhead from this in any of my tests.
- Don't use time.sleep(), as this will trigger context switching, or time.time(), as this will include time spent on other processes. Use time.process_time() if you want an equal slice per core, or count to ~150M/second (time it on your machine) if you want a fixed amount of work.
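
A sketch of the Pipe-plus-selectors pattern named above (Unix-style; the register/unregister flow is illustrative, not PufferLib's exact code):

.. code-block:: python

    # Poll worker pipes with selectors instead of Queue: Connection.fileno()
    # lets one selector watch many pipes without blocking on any single worker.
    import multiprocessing as mp
    import random
    import selectors
    import time

    def worker(remote):
        while True:
            msg = remote.recv()
            end = time.process_time() + random.uniform(0.0, 0.01)
            while time.process_time() < end:  # busy-wait stand-in for sim work
                pass
            remote.send(msg)

    if __name__ == '__main__':
        selector = selectors.DefaultSelector()
        for i in range(8):
            parent, child = mp.Pipe()
            mp.Process(target=worker, args=(child,), daemon=True).start()
            selector.register(parent, selectors.EVENT_READ)
            parent.send(i)

        results = []  # collect the first 4 workers to finish
        while len(results) < 4:
            for key, _ in selector.select():
                selector.unregister(key.fileobj)
                results.append(key.fileobj.recv())
                if len(results) == 4:
                    break
        print(results)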

The ray backend was extremely easy to implement thanks to ray.wait(). It is unfortunately too slow for most environments, but I wish standard multiprocessing used the Ray API, if not the architecture. The library itself has some cleanup issues that can cause crashes during heavy performance tests, which is why results are not included in this post.

There’s one other thing I want to mention for people looking at the code. I was doing some experimental procedural stuff testing different programming paradigms, so the actual class interfaces are in __init__. It’s pretty much equivalent to one subclass per backend.

PufferLib 0.4: Ready to Take on Bigger Fish
###########################################
