Releases: vwxyzjn/cleanrl

v0.2.1

13 Apr 18:24
Add PPG with the IMPALA CNN for Procgen

v0.4.3

11 Apr 22:40
Pre-release
Hotfix

v0.4.2

11 Apr 22:30
Fix setup.py

v0.4.1

11 Apr 22:28
Include versioneer.py

CleanRL v0.4.0

24 Sep 02:50

What's new in the 0.4.0 release

Atari Results

| gym_id | apex_dqn_atari_visual | c51_atari_visual | dqn_atari_visual | ppo_atari_visual |
| --- | --- | --- | --- | --- |
| BeamRiderNoFrameskip-v4 | 2936.93 ± 362.18 | 13380.67 ± 0.00 | 7139.11 ± 479.11 | 2053.08 ± 83.37 |
| QbertNoFrameskip-v4 | 3565.00 ± 690.00 | 16286.11 ± 0.00 | 11586.11 ± 0.00 | 17919.44 ± 383.33 |
| SpaceInvadersNoFrameskip-v4 | 1019.17 ± 356.94 | 1099.72 ± 14.72 | 935.40 ± 93.17 | 1089.44 ± 67.22 |
| PongNoFrameskip-v4 | 19.06 ± 0.83 | 18.00 ± 0.00 | 19.78 ± 0.22 | 20.72 ± 0.28 |
| BreakoutNoFrameskip-v4 | 364.97 ± 58.36 | 386.10 ± 21.77 | 353.39 ± 30.61 | 380.67 ± 35.29 |

MuJoCo Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| Reacher-v2 | -6.25 ± 0.54 | -6.65 ± 0.04 | -7.86 ± 1.47 |
| Pusher-v2 | -44.84 ± 5.54 | -59.69 ± 3.84 | -44.10 ± 6.49 |
| Thrower-v2 | -137.18 ± 47.98 | -80.75 ± 12.92 | -58.76 ± 1.42 |
| Striker-v2 | -193.43 ± 27.22 | -269.63 ± 22.14 | -112.03 ± 9.43 |
| InvertedPendulum-v2 | 1000.00 ± 0.00 | 443.33 ± 249.78 | 968.33 ± 31.67 |
| HalfCheetah-v2 | 10386.46 ± 265.09 | 9265.25 ± 1290.73 | 1717.42 ± 20.25 |
| Hopper-v2 | 1128.75 ± 9.61 | 3095.89 ± 590.92 | 2276.30 ± 418.94 |
| Swimmer-v2 | 114.93 ± 29.09 | 103.89 ± 30.72 | 111.74 ± 7.06 |
| Walker2d-v2 | 1946.23 ± 223.65 | 3059.69 ± 1014.05 | 3142.06 ± 1041.17 |
| Ant-v2 | 243.25 ± 129.70 | 5586.91 ± 476.27 | 2785.98 ± 1265.03 |
| Humanoid-v2 | 877.90 ± 3.46 | 6342.99 ± 247.26 | 786.83 ± 95.66 |

PyBullet Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| MinitaurBulletEnv-v0 | -0.17 ± 0.02 | 7.73 ± 5.13 | 23.20 ± 2.23 |
| MinitaurBulletDuckEnv-v0 | -0.31 ± 0.03 | 0.88 ± 0.34 | 11.09 ± 1.50 |
| InvertedPendulumBulletEnv-v0 | 742.22 ± 47.33 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
| InvertedDoublePendulumBulletEnv-v0 | 5847.31 ± 843.53 | 5085.57 ± 4272.17 | 6970.72 ± 2386.46 |
| Walker2DBulletEnv-v0 | 567.61 ± 15.01 | 2177.57 ± 65.49 | 1377.68 ± 51.96 |
| HalfCheetahBulletEnv-v0 | 2847.63 ± 212.31 | 2537.34 ± 347.20 | 2347.64 ± 51.56 |
| AntBulletEnv-v0 | 2094.62 ± 952.21 | 3253.93 ± 106.96 | 1775.50 ± 50.19 |
| HopperBulletEnv-v0 | 1262.70 ± 424.95 | 2271.89 ± 24.26 | 2311.20 ± 45.28 |
| HumanoidBulletEnv-v0 | -54.45 ± 13.99 | 937.37 ± 161.05 | 204.47 ± 1.00 |
| BipedalWalker-v3 | 66.01 ± 127.82 | 78.91 ± 232.51 | 272.08 ± 10.29 |
| LunarLanderContinuous-v2 | 162.96 ± 65.60 | 281.88 ± 0.91 | 215.27 ± 10.17 |
| Pendulum-v0 | -238.65 ± 14.13 | -345.29 ± 47.40 | -1255.62 ± 28.37 |
| MountainCarContinuous-v0 | -1.01 ± 0.01 | -1.12 ± 0.12 | 93.89 ± 0.06 |

Other Results

| gym_id | ppo | dqn |
| --- | --- | --- |
| CartPole-v1 | 500.00 ± 0.00 | 182.93 ± 47.82 |
| Acrobot-v1 | -80.10 ± 6.77 | -81.50 ± 4.72 |
| MountainCar-v0 | -200.00 ± 0.00 | -142.56 ± 15.89 |
| LunarLander-v2 | 46.18 ± 53.04 | 144.52 ± 1.75 |
  • Added experimental support for Apex-DQN, which is significantly faster than DQN. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari_visual.py. In Breakout, Apex-DQN takes less than 4 hours to reach an episode reward of around 360, whereas DQN took 25 hours to reach the same score.
    • Our implementation differs slightly from the original. PyTorch's ecosystem does not have a well-maintained distributed prioritized experience replay buffer such as https://github.com/deepmind/reverb, so instead of a single prioritized replay buffer of size 100,000 we run two prioritized replay buffers of size 50,000 in separate data-processor subprocesses that prepare batches for the learner. This is a workaround, but in our benchmarks it performs well empirically and is fast enough. A sketch of the idea follows.
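
Below is a minimal sketch of this split-buffer workaround, not the actual code in apex_dqn_atari_visual.py: uniform sampling stands in for the prioritized sampling, and the queue and process layout is this example's own.

```python
import multiprocessing as mp
import random
from collections import deque

def data_processor(transition_queue, batch_queue, capacity=50_000, batch_size=32):
    # Each subprocess owns half of the replay memory (50,000 of 100,000).
    buffer = deque(maxlen=capacity)
    while True:
        buffer.append(transition_queue.get())  # block until a transition arrives
        while not transition_queue.empty():    # drain whatever else is waiting
            buffer.append(transition_queue.get())
        if len(buffer) >= batch_size:
            batch_queue.put(random.sample(buffer, batch_size))

if __name__ == "__main__":
    transition_queues = [mp.Queue(), mp.Queue()]
    batch_queue = mp.Queue(maxsize=8)  # bounded so the processors don't run ahead
    for q in transition_queues:
        mp.Process(target=data_processor, args=(q, batch_queue), daemon=True).start()
    # Actors round-robin transitions across the two buffers; the learner
    # consumes ready-made batches from batch_queue.
    for step in range(1000):
        transition_queues[step % 2].put(("obs", 0, 0.0, "next_obs", False))
    print(len(batch_queue.get()))  # a prepared batch of 32 transitions
```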
Benchmarked Learning Curves (Atari)

Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Atari.
  • Supported CarRacing-v0 with PPO in the experimental domains. It is our first example with a pixel observation space and a continuous action space. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/experiments/ppo_car_racing.py.
    • During our experiments, we found that normalizing observations and rewards has a huge impact on PPO's performance, probably due to the large range of rewards in CarRacing-v0 (e.g. the agent receives a -100 reward when it dies, and PPO is anecdotally sensitive to such large rewards). A sketch of this kind of normalization follows.
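
A minimal sketch of such running-statistics normalization, assuming the old-style Gym API (4-tuple step); the wrapper name, clip value, and discount are this example's choices rather than the exact code in ppo_car_racing.py.

```python
import gym
import numpy as np

class RunningMeanStd:
    """Running mean/variance with parallel (Welford-style) batch updates."""
    def __init__(self, shape=()):
        self.mean = np.zeros(shape, np.float64)
        self.var = np.ones(shape, np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(0), x.var(0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total

class NormalizeObsAndReward(gym.Wrapper):
    """Standardizes observations; scales rewards by the std of returns."""
    def __init__(self, env, gamma=0.99, clip=10.0):
        super().__init__(env)
        self.obs_rms = RunningMeanStd(env.observation_space.shape)
        self.ret_rms = RunningMeanStd(())
        self.gamma, self.clip, self.ret = gamma, clip, 0.0

    def reset(self, **kwargs):
        self.ret = 0.0
        return self._norm_obs(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.ret = self.ret * self.gamma + reward  # discounted return estimate
        self.ret_rms.update(np.array([self.ret]))
        reward = np.clip(reward / np.sqrt(self.ret_rms.var + 1e-8),
                         -self.clip, self.clip)
        if done:
            self.ret = 0.0
        return self._norm_obs(obs), reward, done, info

    def _norm_obs(self, obs):
        self.obs_rms.update(obs[None].astype(np.float64))
        return np.clip((obs - self.obs_rms.mean) / np.sqrt(self.obs_rms.var + 1e-8),
                       -self.clip, self.clip)

env = NormalizeObsAndReward(gym.make("CarRacing-v0"))
```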


Release of Open RL Benchmark @ 0.3.0

01 Aug 22:57

See https://streamable.com/cq8e62 for a demo

A significant amount of effort went into the making of Open RL Benchmark (http://benchmark.cleanrl.dev/). It provides benchmarks of popular deep reinforcement learning algorithms across 34+ games with an unprecedented level of transparency, openness, and reproducibility.

In addition, the legacy common.py is deprecated in favor of single-file implementations.

CleanRL 0.2.1 with SAC and a video recording feature

09 Jan 22:00

We've made the SAC algorithm work for both continuous and discrete action spaces, with primary references from the following papers (a sketch of the discrete-action actor loss follows the list):

https://arxiv.org/abs/1801.01290
https://arxiv.org/abs/1812.05905
https://arxiv.org/abs/1910.07207
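
For the discrete case, a hedged sketch (assuming PyTorch; `actor`, `qf1`, and `qf2` are this example's names for a categorical policy head and twin Q-networks that output one value per action): with a categorical policy, the expectation in the arXiv:1910.07207 actor objective can be computed exactly rather than sampled.

```python
import torch

def discrete_sac_actor_loss(actor, qf1, qf2, obs, alpha):
    # actor(obs) -> logits over discrete actions: [batch, n_actions]
    log_probs = torch.log_softmax(actor(obs), dim=-1)
    probs = log_probs.exp()
    # Twin Q-networks output per-action values; take the elementwise min.
    min_q = torch.min(qf1(obs), qf2(obs))  # [batch, n_actions]
    # Exact expectation over the policy: E_a[alpha * log pi(a|s) - min Q(s, a)]
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
```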

My personal thanks to everyone who participated in the monthly dev cycle and, in particular, to @dosssman, who implemented SAC with discrete action spaces.

Additional improvements include:

  • Support for gym.wrappers.Monitor to automatically record the agent's performance at certain episodes (the default schedule is episodes 1, 2, 9, 28, 65, ..., then every 1000th: 1000, 2000, 3000, ...) and integration with wandb (so cool, see the screenshot below). #4 A sketch of the setup follows this list.
  • Use of the same replay buffer from minimalRL for DQN and SAC. #5
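
A minimal sketch of that setup (the project name and video directory here are this example's own):

```python
import gym
import wandb

# monitor_gym=True lets wandb pick up the .mp4 files that
# gym.wrappers.Monitor saves and attach them to the run.
wandb.init(project="cleanrl.benchmark", monitor_gym=True)

env = gym.make("CartPole-v1")
# The default schedule records the cubic episodes 1, 2, 9, 28, 65, ...
# and then every 1000th episode.
env = gym.wrappers.Monitor(env, "videos/", force=True)

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```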

https://app.wandb.ai/cleanrl/cleanrl.benchmark

[screenshot of the wandb integration]

Initial Release

07 Oct 03:13
Pre-release

This is the initial release 🙌🙌

Working on more algorithms and bug fixes for the 1.0 release :) Comments and PRs are more than welcome.