Releases: vwxyzjn/cleanrl
v0.4.3
v0.4.2
v0.4.1
CleanRL v0.4.0
What's new in the 0.4.0 release
- Added a contribution guide: https://github.com/vwxyzjn/cleanrl/blob/master/CONTRIBUTING.md. We welcome contributions of new algorithms and new games to the Open RL Benchmark (http://benchmark.cleanrl.dev/).
- Added tables of benchmark results with standard deviations, generated by https://github.com/vwxyzjn/cleanrl/blob/master/benchmark/plots.py.
Atari Results
gym_id | apex_dqn_atari_visual | c51_atari_visual | dqn_atari_visual | ppo_atari_visual |
---|---|---|---|---|
BeamRiderNoFrameskip-v4 | 2936.93 ± 362.18 | 13380.67 ± 0.00 | 7139.11 ± 479.11 | 2053.08 ± 83.37 |
QbertNoFrameskip-v4 | 3565.00 ± 690.00 | 16286.11 ± 0.00 | 11586.11 ± 0.00 | 17919.44 ± 383.33 |
SpaceInvadersNoFrameskip-v4 | 1019.17 ± 356.94 | 1099.72 ± 14.72 | 935.40 ± 93.17 | 1089.44 ± 67.22 |
PongNoFrameskip-v4 | 19.06 ± 0.83 | 18.00 ± 0.00 | 19.78 ± 0.22 | 20.72 ± 0.28 |
BreakoutNoFrameskip-v4 | 364.97 ± 58.36 | 386.10 ± 21.77 | 353.39 ± 30.61 | 380.67 ± 35.29 |
MuJoCo Results
gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
---|---|---|---|
Reacher-v2 | -6.25 ± 0.54 | -6.65 ± 0.04 | -7.86 ± 1.47 |
Pusher-v2 | -44.84 ± 5.54 | -59.69 ± 3.84 | -44.10 ± 6.49 |
Thrower-v2 | -137.18 ± 47.98 | -80.75 ± 12.92 | -58.76 ± 1.42 |
Striker-v2 | -193.43 ± 27.22 | -269.63 ± 22.14 | -112.03 ± 9.43 |
InvertedPendulum-v2 | 1000.00 ± 0.00 | 443.33 ± 249.78 | 968.33 ± 31.67 |
HalfCheetah-v2 | 10386.46 ± 265.09 | 9265.25 ± 1290.73 | 1717.42 ± 20.25 |
Hopper-v2 | 1128.75 ± 9.61 | 3095.89 ± 590.92 | 2276.30 ± 418.94 |
Swimmer-v2 | 114.93 ± 29.09 | 103.89 ± 30.72 | 111.74 ± 7.06 |
Walker2d-v2 | 1946.23 ± 223.65 | 3059.69 ± 1014.05 | 3142.06 ± 1041.17 |
Ant-v2 | 243.25 ± 129.70 | 5586.91 ± 476.27 | 2785.98 ± 1265.03 |
Humanoid-v2 | 877.90 ± 3.46 | 6342.99 ± 247.26 | 786.83 ± 95.66 |
PyBullet Results
gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
---|---|---|---|
MinitaurBulletEnv-v0 | -0.17 ± 0.02 | 7.73 ± 5.13 | 23.20 ± 2.23 |
MinitaurBulletDuckEnv-v0 | -0.31 ± 0.03 | 0.88 ± 0.34 | 11.09 ± 1.50 |
InvertedPendulumBulletEnv-v0 | 742.22 ± 47.33 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
InvertedDoublePendulumBulletEnv-v0 | 5847.31 ± 843.53 | 5085.57 ± 4272.17 | 6970.72 ± 2386.46 |
Walker2DBulletEnv-v0 | 567.61 ± 15.01 | 2177.57 ± 65.49 | 1377.68 ± 51.96 |
HalfCheetahBulletEnv-v0 | 2847.63 ± 212.31 | 2537.34 ± 347.20 | 2347.64 ± 51.56 |
AntBulletEnv-v0 | 2094.62 ± 952.21 | 3253.93 ± 106.96 | 1775.50 ± 50.19 |
HopperBulletEnv-v0 | 1262.70 ± 424.95 | 2271.89 ± 24.26 | 2311.20 ± 45.28 |
HumanoidBulletEnv-v0 | -54.45 ± 13.99 | 937.37 ± 161.05 | 204.47 ± 1.00 |
BipedalWalker-v3 | 66.01 ± 127.82 | 78.91 ± 232.51 | 272.08 ± 10.29 |
LunarLanderContinuous-v2 | 162.96 ± 65.60 | 281.88 ± 0.91 | 215.27 ± 10.17 |
Pendulum-v0 | -238.65 ± 14.13 | -345.29 ± 47.40 | -1255.62 ± 28.37 |
MountainCarContinuous-v0 | -1.01 ± 0.01 | -1.12 ± 0.12 | 93.89 ± 0.06 |
Other Results
gym_id | ppo | dqn |
---|---|---|
CartPole-v1 | 500.00 ± 0.00 | 182.93 ± 47.82 |
Acrobot-v1 | -80.10 ± 6.77 | -81.50 ± 4.72 |
MountainCar-v0 | -200.00 ± 0.00 | -142.56 ± 15.89 |
LunarLander-v2 | 46.18 ± 53.04 | 144.52 ± 1.75 |
- Added experimental support for Apex-DQN, which is significantly faster than DQN. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari_visual.py. In the game of Breakout, Apex-DQN takes less than 4 hours to reach an episode reward of around 360, whereas DQN took 25 hours to reach the same level.
- Our implementation is a little different from the original: in PyTorch's ecosystem there isn't a well-maintained distributed prioritized experience replay buffer such as https://github.com/deepmind/reverb, so instead we split a single prioritized replay buffer of size 100,000 into two prioritized replay buffers of size 50,000, each owned by a `data-processor` sub-process that prepares data for the `worker`. This is a work-around and a bit of a hack, but according to our benchmark it works well empirically and is fast enough; a simplified sketch of the layout is shown below.
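To make that work-around concrete, here is a simplified, self-contained sketch of the process layout. The names, queue sizes, and the uniform (rather than prioritized) sampling are illustrative assumptions, not the actual code in apex_dqn_atari_visual.py.

```python
import multiprocessing as mp
import random
from collections import deque


def data_processor(transition_queue, batch_queue, buffer_size=50_000, batch_size=32):
    """Owns half of the replay capacity and prepares batches for the learner.
    A deque with uniform sampling stands in for the prioritized buffer."""
    buffer = deque(maxlen=buffer_size)
    while True:
        # Drain newly collected transitions from the actors.
        while not transition_queue.empty():
            buffer.append(transition_queue.get())
        # Push a ready-made batch once enough data has accumulated.
        if len(buffer) >= batch_size:
            batch_queue.put(random.sample(buffer, batch_size))


if __name__ == "__main__":
    transition_queue = mp.Queue()
    batch_queue = mp.Queue(maxsize=8)

    # Two data processors, each holding half of the 100,000-transition capacity.
    processors = [
        mp.Process(target=data_processor, args=(transition_queue, batch_queue), daemon=True)
        for _ in range(2)
    ]
    for p in processors:
        p.start()

    # Actor side: push (s, a, r, s', done) transitions.
    # Learner side: consume prepared batches and update the Q-network.
    for step in range(10_000):
        transition_queue.put((step, 0, 0.0, step + 1, False))
        if not batch_queue.empty():
            batch = batch_queue.get()  # the learner would compute TD errors here
```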
- Benchmarked learning curves for Atari: metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Atari.
- Supported `CarRacing-v0` with PPO in the Experimental Domains. It is our first example with a pixel observation space and a continuous action space. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/experiments/ppo_car_racing.py.
- During our experiments, we found that normalizing observations and rewards has a huge impact on PPO's performance, probably due to the large range of rewards provided by `CarRacing-v0` (e.g., you get a -100 reward on dying, and PPO is anecdotally sensitive to such large rewards); a minimal normalization sketch is shown below.
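For reference, here is a minimal sketch of the kind of observation and reward normalization we mean. The wrapper name and the running mean/std bookkeeping are illustrative assumptions, not the exact code in ppo_car_racing.py; it uses the old gym API (4-tuple `step`) of that era.

```python
import gym
import numpy as np


class RunningMeanStd:
    """Tracks a running mean and variance with a parallel (Welford-style) update."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total


class NormalizeObsAndReward(gym.Wrapper):  # hypothetical wrapper name
    """Normalizes observations with a running mean/std and scales rewards by the
    running std of the discounted return."""

    def __init__(self, env, gamma=0.99, clip=10.0, eps=1e-8):
        super().__init__(env)
        self.obs_rms = RunningMeanStd(env.observation_space.shape)
        self.ret_rms = RunningMeanStd(())
        self.ret, self.gamma, self.clip, self.eps = 0.0, gamma, clip, eps

    def _normalize_obs(self, obs):
        self.obs_rms.update(obs[None].astype(np.float64))
        obs = (obs - self.obs_rms.mean) / np.sqrt(self.obs_rms.var + self.eps)
        return np.clip(obs, -self.clip, self.clip)

    def reset(self, **kwargs):
        self.ret = 0.0
        return self._normalize_obs(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)  # old gym 4-tuple API
        self.ret = self.ret * self.gamma + reward
        self.ret_rms.update(np.array([self.ret]))
        reward = np.clip(reward / np.sqrt(self.ret_rms.var + self.eps), -self.clip, self.clip)
        if done:
            self.ret = 0.0
        return self._normalize_obs(obs), reward, done, info


# env = NormalizeObsAndReward(gym.make("CarRacing-v0"))
```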
Release of Open RL Benchmark @ 0.3.0
See https://streamable.com/cq8e62 for a demo.
A significant amount of effort was put into making Open RL Benchmark (http://benchmark.cleanrl.dev/). It provides benchmarks of popular deep reinforcement learning algorithms in 34+ games with an unprecedented level of transparency, openness, and reproducibility.
In addition, the legacy `common.py` is deprecated in favor of single-file implementations.
CleanRL 0.2.1 with SAC and a video recording feature added
We've made the SAC algorithm work for both continuous and discrete action spaces, with the following papers as primary references:
https://arxiv.org/abs/1801.01290
https://arxiv.org/abs/1812.05905
https://arxiv.org/abs/1910.07207
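As a rough illustration of the discrete-action case (arXiv:1910.07207), the actor loss can take the expectation over actions in closed form instead of using the reparameterization trick. The function below is an assumed minimal sketch, not the code in our implementation.

```python
import torch


def discrete_sac_actor_loss(logits, q_values, alpha):
    """Actor loss for SAC with a categorical policy.

    logits, q_values: tensors of shape (batch, num_actions)
    alpha: entropy temperature (float or scalar tensor)
    """
    probs = torch.softmax(logits, dim=-1)
    log_probs = torch.log_softmax(logits, dim=-1)
    # J_pi = E_s[ sum_a pi(a|s) * (alpha * log pi(a|s) - Q(s, a)) ],
    # i.e. the expectation over actions is computed exactly.
    return (probs * (alpha * log_probs - q_values)).sum(dim=-1).mean()
```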
My personal thanks to everyone who participated in the monthly dev cycle and, in particular, to @dosssman, who implemented SAC for discrete action spaces.
Additional improvements include:
- Support `gym.wrappers.Monitor` to automatically record the agent's performance at certain episodes (by default episodes 1, 2, 9, 28, 65, ..., 1000, 2000, 3000) and integrate with wandb (so cool, see the screenshot below); a minimal usage sketch follows this list. #4
- Use the same replay buffer from minimalRL for DQN and SAC. #5
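A rough usage sketch of the recording feature, assuming the old `gym.wrappers.Monitor` API of that era; the project name and environment are placeholders, not the exact setup from #4.

```python
import gym
import wandb

# monitor_gym=True tells wandb to pick up the video files written by Monitor.
wandb.init(project="cleanrl-demo", monitor_gym=True)

env = gym.make("CartPole-v1")
# With video_callable=None, Monitor falls back to its default "capped cubic"
# schedule, i.e. episodes 1, 2, 9, 28, 65, ... and then every 1000th episode.
env = gym.wrappers.Monitor(env, "videos", video_callable=None, force=True)

obs = env.reset()
for _ in range(5_000):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()
```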
Initial Release
This is the initial release 🙌🙌
Working on more algorithms and bug fixes for the 1.0 release :) Comments and PRs are more than welcome.