Added experimental support for Apex-DQN, which is significantly faster than DQN. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari_visual.py. In the game of Breakout, Apex-DQN takes less than 4 hours to reach an episode reward of around 360, whereas DQN took 25 hours to reach the same level.
Our implementation is a little different from the original. First, PyTorch's ecosystem does not have a well-maintained distributed prioritized experience replay buffer such as https://github.com/deepmind/reverb. So instead, we split a single prioritized replay buffer of size 100,000 into two prioritized replay buffers of size 50,000, each owned by a data-processor sub-process that prepares data for the worker. This is a workaround and somewhat of a hack, but according to our benchmark it works well empirically and is fast enough. A rough sketch of the setup follows.
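The snippet below is a minimal sketch of this split-buffer workaround, not the actual code in `apex_dqn_atari_visual.py`: two data-processor sub-processes each own half of the replay capacity and independently prepare sampled batches. For brevity, a plain uniform replay buffer stands in for the prioritized one, and all names (`data_processor`, queue sizes, batch size) are hypothetical.

```python
import random
from collections import deque
from multiprocessing import Process, Queue

BUFFER_SIZE = 50_000  # each data processor holds half of the 100,000 total capacity
BATCH_SIZE = 64


def data_processor(transition_queue: Queue, batch_queue: Queue):
    """Own one half of the replay memory and keep the learner fed with batches."""
    buffer = deque(maxlen=BUFFER_SIZE)
    while True:
        # Drain any new transitions produced by the actors.
        while not transition_queue.empty():
            buffer.append(transition_queue.get())
        # Once enough data is stored, sample a batch and hand it to the learner
        # (put() blocks when the learner falls behind, which throttles sampling).
        if len(buffer) >= BATCH_SIZE:
            batch_queue.put(random.sample(buffer, BATCH_SIZE))


if __name__ == "__main__":
    transition_queues = [Queue(), Queue()]   # one inbound queue per data processor
    batch_queue = Queue(maxsize=4)           # shared queue of prepared batches
    processors = [
        Process(target=data_processor, args=(q, batch_queue), daemon=True)
        for q in transition_queues
    ]
    for p in processors:
        p.start()
    # Actors would round-robin transitions into transition_queues, and the
    # learner would consume ready-made batches from batch_queue in its loop.
```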
During our experiments, we found that normalizing observations and rewards has a huge impact on PPO's performance, probably because of the large range of rewards in CarRacing-v0 (e.g., the agent receives a -100 reward when it dies, and PPO is anecdotally sensitive to such large rewards). A sketch of this kind of normalization is shown below.
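The wrapper below is an illustrative sketch of the kind of normalization we mean (running mean/std normalization of observations, plus reward scaling and clipping), not the exact wrapper used in the repo; the class names and the clipping threshold are assumptions.

```python
import gym
import numpy as np


class RunningMeanStd:
    """Track a running mean and variance via incremental batch updates."""
    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total


class NormalizeObsAndReward(gym.Wrapper):
    """Normalize observations to roughly N(0, 1) and scale/clip rewards."""
    def __init__(self, env, gamma=0.99, clip_reward=10.0):
        super().__init__(env)
        self.obs_rms = RunningMeanStd(env.observation_space.shape)
        self.ret_rms = RunningMeanStd(())
        self.gamma = gamma
        self.clip_reward = clip_reward
        self.ret = 0.0

    def _normalize_obs(self, obs):
        self.obs_rms.update(obs[None].astype(np.float64))
        return (obs - self.obs_rms.mean) / np.sqrt(self.obs_rms.var + 1e-8)

    def reset(self, **kwargs):
        self.ret = 0.0
        return self._normalize_obs(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Scale rewards by the std of the discounted return, then clip, so that
        # outliers such as the -100 reward on death do not dominate the update.
        self.ret = self.gamma * self.ret + reward
        self.ret_rms.update(np.array([self.ret]))
        reward = np.clip(reward / np.sqrt(self.ret_rms.var + 1e-8),
                         -self.clip_reward, self.clip_reward)
        if done:
            self.ret = 0.0
        return self._normalize_obs(obs), float(reward), done, info
```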