Releases: vwxyzjn/cleanrl

v0.2.1

13 Apr 18:24
Add PPG with the IMPALA CNN for Procgen

v0.4.3

11 Apr 22:40
Pre-release
Hotfix

v0.4.2

11 Apr 22:30
Fix setup.py

v0.4.1

11 Apr 22:28
Include versioneer.py

CleanRL v0.4.0

24 Sep 02:50

What's new in the 0.4.0 release

Atari Results

| gym_id | apex_dqn_atari_visual | c51_atari_visual | dqn_atari_visual | ppo_atari_visual |
| --- | --- | --- | --- | --- |
| BeamRiderNoFrameskip-v4 | 2936.93 ± 362.18 | 13380.67 ± 0.00 | 7139.11 ± 479.11 | 2053.08 ± 83.37 |
| QbertNoFrameskip-v4 | 3565.00 ± 690.00 | 16286.11 ± 0.00 | 11586.11 ± 0.00 | 17919.44 ± 383.33 |
| SpaceInvadersNoFrameskip-v4 | 1019.17 ± 356.94 | 1099.72 ± 14.72 | 935.40 ± 93.17 | 1089.44 ± 67.22 |
| PongNoFrameskip-v4 | 19.06 ± 0.83 | 18.00 ± 0.00 | 19.78 ± 0.22 | 20.72 ± 0.28 |
| BreakoutNoFrameskip-v4 | 364.97 ± 58.36 | 386.10 ± 21.77 | 353.39 ± 30.61 | 380.67 ± 35.29 |

MuJoCo Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| Reacher-v2 | -6.25 ± 0.54 | -6.65 ± 0.04 | -7.86 ± 1.47 |
| Pusher-v2 | -44.84 ± 5.54 | -59.69 ± 3.84 | -44.10 ± 6.49 |
| Thrower-v2 | -137.18 ± 47.98 | -80.75 ± 12.92 | -58.76 ± 1.42 |
| Striker-v2 | -193.43 ± 27.22 | -269.63 ± 22.14 | -112.03 ± 9.43 |
| InvertedPendulum-v2 | 1000.00 ± 0.00 | 443.33 ± 249.78 | 968.33 ± 31.67 |
| HalfCheetah-v2 | 10386.46 ± 265.09 | 9265.25 ± 1290.73 | 1717.42 ± 20.25 |
| Hopper-v2 | 1128.75 ± 9.61 | 3095.89 ± 590.92 | 2276.30 ± 418.94 |
| Swimmer-v2 | 114.93 ± 29.09 | 103.89 ± 30.72 | 111.74 ± 7.06 |
| Walker2d-v2 | 1946.23 ± 223.65 | 3059.69 ± 1014.05 | 3142.06 ± 1041.17 |
| Ant-v2 | 243.25 ± 129.70 | 5586.91 ± 476.27 | 2785.98 ± 1265.03 |
| Humanoid-v2 | 877.90 ± 3.46 | 6342.99 ± 247.26 | 786.83 ± 95.66 |

PyBullet Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| MinitaurBulletEnv-v0 | -0.17 ± 0.02 | 7.73 ± 5.13 | 23.20 ± 2.23 |
| MinitaurBulletDuckEnv-v0 | -0.31 ± 0.03 | 0.88 ± 0.34 | 11.09 ± 1.50 |
| InvertedPendulumBulletEnv-v0 | 742.22 ± 47.33 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
| InvertedDoublePendulumBulletEnv-v0 | 5847.31 ± 843.53 | 5085.57 ± 4272.17 | 6970.72 ± 2386.46 |
| Walker2DBulletEnv-v0 | 567.61 ± 15.01 | 2177.57 ± 65.49 | 1377.68 ± 51.96 |
| HalfCheetahBulletEnv-v0 | 2847.63 ± 212.31 | 2537.34 ± 347.20 | 2347.64 ± 51.56 |
| AntBulletEnv-v0 | 2094.62 ± 952.21 | 3253.93 ± 106.96 | 1775.50 ± 50.19 |
| HopperBulletEnv-v0 | 1262.70 ± 424.95 | 2271.89 ± 24.26 | 2311.20 ± 45.28 |
| HumanoidBulletEnv-v0 | -54.45 ± 13.99 | 937.37 ± 161.05 | 204.47 ± 1.00 |
| BipedalWalker-v3 | 66.01 ± 127.82 | 78.91 ± 232.51 | 272.08 ± 10.29 |
| LunarLanderContinuous-v2 | 162.96 ± 65.60 | 281.88 ± 0.91 | 215.27 ± 10.17 |
| Pendulum-v0 | -238.65 ± 14.13 | -345.29 ± 47.40 | -1255.62 ± 28.37 |
| MountainCarContinuous-v0 | -1.01 ± 0.01 | -1.12 ± 0.12 | 93.89 ± 0.06 |

Other Results

| gym_id | ppo | dqn |
| --- | --- | --- |
| CartPole-v1 | 500.00 ± 0.00 | 182.93 ± 47.82 |
| Acrobot-v1 | -80.10 ± 6.77 | -81.50 ± 4.72 |
| MountainCar-v0 | -200.00 ± 0.00 | -142.56 ± 15.89 |
| LunarLander-v2 | 46.18 ± 53.04 | 144.52 ± 1.75 |
  • Added experimental support for Apex-DQN, which is significantly faster than DQN. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari_visual.py. In Breakout, Apex-DQN takes less than 4 hours to reach an episode reward of around 360, whereas DQN took 25 hours to reach the same score.
    • Our implementation differs slightly from the original. PyTorch's ecosystem does not have a well-maintained distributed prioritized experience replay buffer such as https://github.com/deepmind/reverb, so instead of a single prioritized replay buffer of size 100,000 we run two prioritized replay buffers of size 50,000 in separate data-processor subprocesses that prepare batches for the learner. This is a workaround, but in our benchmarks it performs well empirically and is fast enough. A sketch of the idea follows.
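
Below is a minimal sketch of this split-buffer workaround, not the actual code in apex_dqn_atari_visual.py: uniform sampling stands in for the prioritized sampling, and the queue and process layout is this example's own.

```python
import multiprocessing as mp
import random
from collections import deque

def data_processor(transition_queue, batch_queue, capacity=50_000, batch_size=32):
    # Each subprocess owns half of the replay memory (50,000 of 100,000).
    buffer = deque(maxlen=capacity)
    while True:
        buffer.append(transition_queue.get())  # block until a transition arrives
        while not transition_queue.empty():    # drain whatever else is waiting
            buffer.append(transition_queue.get())
        if len(buffer) >= batch_size:
            batch_queue.put(random.sample(buffer, batch_size))

if __name__ == "__main__":
    transition_queues = [mp.Queue(), mp.Queue()]
    batch_queue = mp.Queue(maxsize=8)  # bounded so the processors don't run ahead
    for q in transition_queues:
        mp.Process(target=data_processor, args=(q, batch_queue), daemon=True).start()
    # Actors round-robin transitions across the two buffers; the learner
    # consumes ready-made batches from batch_queue.
    for step in range(1000):
        transition_queues[step % 2].put(("obs", 0, 0.0, "next_obs", False))
    print(len(batch_queue.get()))  # a prepared batch of 32 transitions
```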
Benchmarked Learning Curves (Atari)

Metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Atari.
  • Supported CarRacing-v0 with PPO in the experimental domains. It is our first example with a pixel observation space and a continuous action space. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/experiments/ppo_car_racing.py.
    • During our experiments, we found that normalizing observations and rewards has a huge impact on PPO's performance, probably due to the large range of rewards in CarRacing-v0 (e.g. the agent receives a -100 reward when it dies, and PPO is anecdotally sensitive to such large rewards). A sketch of this kind of normalization follows.
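
A minimal sketch of such running-statistics normalization, assuming the old-style Gym API (4-tuple step); the wrapper name, clip value, and discount are this example's choices rather than the exact code in ppo_car_racing.py.

```python
import gym
import numpy as np

class RunningMeanStd:
    """Running mean/variance with parallel (Welford-style) batch updates."""
    def __init__(self, shape=()):
        self.mean = np.zeros(shape, np.float64)
        self.var = np.ones(shape, np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(0), x.var(0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total

class NormalizeObsAndReward(gym.Wrapper):
    """Standardizes observations; scales rewards by the std of returns."""
    def __init__(self, env, gamma=0.99, clip=10.0):
        super().__init__(env)
        self.obs_rms = RunningMeanStd(env.observation_space.shape)
        self.ret_rms = RunningMeanStd(())
        self.gamma, self.clip, self.ret = gamma, clip, 0.0

    def reset(self, **kwargs):
        self.ret = 0.0
        return self._norm_obs(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.ret = self.ret * self.gamma + reward  # discounted return estimate
        self.ret_rms.update(np.array([self.ret]))
        reward = np.clip(reward / np.sqrt(self.ret_rms.var + 1e-8),
                         -self.clip, self.clip)
        if done:
            self.ret = 0.0
        return self._norm_obs(obs), reward, done, info

    def _norm_obs(self, obs):
        self.obs_rms.update(obs[None].astype(np.float64))
        return np.clip((obs - self.obs_rms.mean) / np.sqrt(self.obs_rms.var + 1e-8),
                       -self.clip, self.clip)

env = NormalizeObsAndReward(gym.make("CarRacing-v0"))
```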


Release of Open RL Benchmark @ 0.3.0

01 Aug 22:57

See https://streamable.com/cq8e62 for a demo

A significant amount of effort went into the making of Open RL Benchmark (http://benchmark.cleanrl.dev/). It provides benchmarks of popular deep reinforcement learning algorithms across 34+ games with an unprecedented level of transparency, openness, and reproducibility.

In addition, the legacy common.py is deprecated in favor of single-file implementations.

CleanRL 0.2.1 with SAC and a video recording feature

09 Jan 22:00

We've made the SAC algorithm work for both continuous and discrete action spaces, with primary references from the following papers (a sketch of the discrete-action actor loss follows the list):

https://arxiv.org/abs/1801.01290
https://arxiv.org/abs/1812.05905
https://arxiv.org/abs/1910.07207
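
For the discrete case, a hedged sketch (assuming PyTorch; `actor`, `qf1`, and `qf2` are this example's names for a categorical policy head and twin Q-networks that output one value per action): with a categorical policy, the expectation in the arXiv:1910.07207 actor objective can be computed exactly rather than sampled.

```python
import torch

def discrete_sac_actor_loss(actor, qf1, qf2, obs, alpha):
    # actor(obs) -> logits over discrete actions: [batch, n_actions]
    log_probs = torch.log_softmax(actor(obs), dim=-1)
    probs = log_probs.exp()
    # Twin Q-networks output per-action values; take the elementwise min.
    min_q = torch.min(qf1(obs), qf2(obs))  # [batch, n_actions]
    # Exact expectation over the policy: E_a[alpha * log pi(a|s) - min Q(s, a)]
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
```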

My personal thanks to everyone who participated in the monthly dev cycle and, in particular, to @dosssman, who implemented SAC with discrete action spaces.

Additional improvements include:

  • Support for gym.wrappers.Monitor to automatically record the agent's performance at certain episodes (the default schedule is episodes 1, 2, 9, 28, 65, ..., then every 1000th: 1000, 2000, 3000, ...) and integration with wandb (so cool, see the screenshot below). #4 A sketch of the setup follows this list.
  • Use of the same replay buffer from minimalRL for DQN and SAC. #5
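
A minimal sketch of that setup (the project name and video directory here are this example's own):

```python
import gym
import wandb

# monitor_gym=True lets wandb pick up the .mp4 files that
# gym.wrappers.Monitor saves and attach them to the run.
wandb.init(project="cleanrl.benchmark", monitor_gym=True)

env = gym.make("CartPole-v1")
# The default schedule records the cubic episodes 1, 2, 9, 28, 65, ...
# and then every 1000th episode.
env = gym.wrappers.Monitor(env, "videos/", force=True)

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```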

https://app.wandb.ai/cleanrl/cleanrl.benchmark

[screenshot of the wandb integration]

Initial Release

07 Oct 03:13
Pre-release

This is the initial release 🙌🙌

Working on more algorithms and bug fixes for the 1.0 release :) Comments and PRs are more than welcome.