Collection of my implementations of reinforcement learning algorithms
- DQN
  - Use negative reward to penalise the terminal state
  - Let TensorFlow do as much batch processing as possible (I was running inference sequentially for each sample in a training batch, which wasted a lot of time)
  - During the Q target update, use the network's current weights for `Q_s(t+1)`, instead of the weights from when that observation was recorded.
  - Provide the full action space to training! `MSE(q_update, max(prediction))` is wrong, because the `max(prediction)` can come from a different action than the one recorded in the experience and used for the Q update. See the sketch below this list.
  - Smoothed performance over episodes (the lighter blue line is unsmoothed):
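
A minimal sketch of the batched target update these notes describe, assuming a tf.keras `model` that maps a state batch to Q-values for every action; function and variable names are illustrative, not the repo's actual API:

```python
import numpy as np

def dqn_batch_update(model, states, actions, rewards, next_states, dones, gamma=0.99):
    """One training step on a replay batch (NumPy arrays; `dones` is 0/1 per transition)."""
    q_current = model.predict(states, verbose=0)       # (batch, n_actions), one batched pass
    q_next = model.predict(next_states, verbose=0)     # uses the network's *current* weights

    targets = q_current.copy()
    # No bootstrap from terminal states; the terminal penalty arrives via `rewards`.
    bootstrap = gamma * q_next.max(axis=1) * (1.0 - dones)
    # Only the action actually taken gets a new target; the other entries keep the
    # network's own predictions, so the MSE over the full action space ignores them.
    targets[np.arange(len(actions)), actions] = rewards + bootstrap

    model.train_on_batch(states, targets)
```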
- REINFORCE with continuous action
  - Parametrise the mean and standard deviation of a normal distribution (see the sketch below)
  - mean is a linear model; standard deviation is `exp(linear)`
  - Does not seem to converge as of episode 1000 :( Although the solution given here does not converge either 🤷
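
A NumPy sketch of that parametrisation, assuming a state feature vector and illustrative weight names (`w_mu`, `w_sigma`); the exponential keeps the standard deviation positive:

```python
import numpy as np

def sample_action(state, w_mu, w_sigma):
    mu = state @ w_mu                   # mean: linear in the state features
    sigma = np.exp(state @ w_sigma)     # std: exp(linear), always positive
    return np.random.normal(mu, sigma), mu, sigma

def log_prob_grads(state, action, mu, sigma):
    # Gradients of log N(action | mu, sigma) w.r.t. the two weight vectors,
    # plugged into the REINFORCE update: w += alpha * G_t * grad.
    grad_w_mu = (action - mu) / sigma**2 * state
    grad_w_sigma = ((action - mu)**2 / sigma**2 - 1.0) * state
    return grad_w_mu, grad_w_sigma
```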
- Actor-critic with CartPole
  - Important: use a powerful enough function approximator for the value critic (see the sketch below)
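
As a rough illustration of "powerful enough", a small MLP critic along these lines; the layer sizes are an assumption, not the repo's settings:

```python
import tensorflow as tf

# State-value critic for CartPole (4-dimensional observation). A couple of
# hidden layers usually gives enough capacity for useful TD targets.
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),   # V(s)
])
critic.compile(optimizer="adam", loss="mse")
```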
- REINFORCE with CartPole
- Linear function approximation with Mountain Car, with my own tile encoding implementation (see the sketch below)
  - Q-learning
  - Sarsa
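
A minimal sketch of tile coding for Mountain Car's two-dimensional state (position, velocity); the repo has its own implementation, so the tiling counts, offsets, and bounds here are illustrative only:

```python
import numpy as np

def tile_indices(state, n_tilings=8, tiles_per_dim=8,
                 low=(-1.2, -0.07), high=(0.6, 0.07)):
    """Return one active tile index per tiling for a 2-D state."""
    low, high = np.asarray(low), np.asarray(high)
    scaled = (np.asarray(state) - low) / (high - low)          # normalise to [0, 1]
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)               # shift each tiling slightly
        coords = np.floor((scaled + offset) * tiles_per_dim).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim - 1)
        # Flatten (tiling, x, y) into one feature index.
        active.append(t * tiles_per_dim ** 2 + coords[0] * tiles_per_dim + coords[1])
    return active

def q_value(weights, state, action):
    # Linear approximation: Q(s, a) is the sum of the active tiles' weights;
    # Sarsa / Q-learning then nudge only those weights by alpha * TD error.
    return sum(weights[action, i] for i in tile_indices(state))
```

Here `weights` would be an array of shape `(n_actions, n_tilings * tiles_per_dim**2)`.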
- Monte Carlo Prediction & Control with Exploring Starts
  - Reproduced the blackjack solution from the Sutton book (see the sketch below)
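
A hedged sketch of Monte Carlo control with exploring starts in the shape used for blackjack; `generate_episode` (which starts from a random state-action pair and then follows the greedy policy) is an assumed interface, not the repo's actual code:

```python
from collections import defaultdict
import numpy as np

def mc_exploring_starts(generate_episode, n_actions, n_episodes=500_000, gamma=1.0):
    Q = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    policy = {}
    for _ in range(n_episodes):
        episode = generate_episode(policy)   # list of (state, action, reward)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            # First-visit check (states rarely repeat within a blackjack episode anyway).
            if (state, action) not in [(s, a) for s, a, _ in episode[:t]]:
                counts[state][action] += 1
                # Incremental mean of observed returns.
                Q[state][action] += (G - Q[state][action]) / counts[state][action]
                policy[state] = int(np.argmax(Q[state]))   # greedy improvement
    return Q, policy
```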
- Policy evaluation & iteration, value iteration
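
And a sketch of value iteration over a tabular model; the transition table layout `P[s][a] -> [(prob, next_state, reward, done), ...]` mirrors Gym's toy-text environments and is an assumption here:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Back up each action's expected return, then keep the best one.
            q = [sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged value function.
    policy = [int(np.argmax([sum(p * (r + gamma * V[s2] * (not done))
                                 for p, s2, r, done in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy
```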