Still under construction...
- Python 2.7
- TensorFlow >= 0.8.0
- NumPy >= 1.10.0
- OpenAI Gym
- matplotlib
Run
python gym_experiment.py
to train a softmax policy (without bias) using vanilla policy gradient on the CartPole task. The return is noisy from episode to episode, but it trends upward until it reaches the maximum (200).
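
For readers who want to see the idea without opening the script, here is a minimal NumPy sketch of a bias-free softmax policy and one vanilla policy gradient (REINFORCE) update. It is not the code in gym_experiment.py. The function names, the discount factor `gamma`, and the learning rate `lr` are all assumptions chosen for illustration.

```python
import numpy as np

def action_probs(theta, state):
    """Softmax over linear scores. theta has shape (state_dim, n_actions); no bias term."""
    logits = state @ theta
    logits = logits - logits.max()  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def grad_log_prob(theta, state, action):
    """Gradient of log pi(action | state) with respect to theta."""
    probs = action_probs(theta, state)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(state, one_hot - probs)

def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=0.01):
    """One gradient ascent step from a single episode (gamma and lr are assumed values)."""
    returns = discounted_returns(rewards, gamma)
    grad = np.zeros_like(theta)
    for s, a, g in zip(states, actions, returns):
        grad += g * grad_log_prob(theta, s, a)
    return theta + lr * grad
```

For CartPole, `theta` would have shape (4, 2): four observation components and two actions. Sampling an action with `np.random.choice(2, p=action_probs(theta, state))` and calling `reinforce_update` once per episode gives the basic training loop. No baseline is subtracted in this sketch, which is one reason the return curve is noisy.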