- With Value Iteration and Policy Iteration
- Q-learning
- Double-Q-learning
- SARSA (State-Action-Reward-State-Action)
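To make the difference between the three temporal-difference methods in the list concrete, here is a minimal sketch of their tabular update rules. It assumes a NumPy Q-table indexed as `Q[state, action]`; the function names and the default values of `alpha` and `gamma` are illustrative choices, not taken from the code behind this post. Terminal-state handling is left out for brevity.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max-value) action in the next state.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action the agent actually takes next.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def double_q_update(Q_a, Q_b, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=None):
    # Two tables: one selects the best next action, the other evaluates it.
    # A coin flip decides which table is updated on this step.
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:
        Q_a, Q_b = Q_b, Q_a  # swap roles; updates still hit the caller's arrays
    best = np.argmax(Q_a[s_next])
    target = r + gamma * Q_b[s_next, best]
    Q_a[s, a] += alpha * (target - Q_a[s, a])
```

The only difference between Q-learning and SARSA is the bootstrap term (`max` over next actions versus the value of the sampled next action), while Double Q-learning decouples action selection from action evaluation to reduce overestimation.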
You can find the code for the results below here. In it, we collect the rewards from 5 runs and plot them together to look for patterns.
We see a common pattern: the rewards are poor initially, but as the number of episodes increases, the agent improves and the reward approaches an asymptote.
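The procedure of collecting per-episode rewards over several runs can be sketched as follows. This is not the code behind the results in this post: the 6-state chain environment, the hyperparameters, and all function names (`step`, `run_q_learning`, `collect_runs`, `plot_runs`) are hypothetical, chosen only to keep the example self-contained. It uses tabular Q-learning with an epsilon-greedy policy.

```python
import numpy as np

N_STATES = 6        # hypothetical chain: states 0..5, goal at state 5
ACTIONS = (-1, +1)  # move left / move right

def step(state, action_idx):
    # Deterministic move, clipped to the ends of the chain.
    next_state = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
    return next_state, reward, done

def run_q_learning(n_episodes=200, alpha=0.1, gamma=0.95,
                   epsilon=0.1, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, len(ACTIONS)))
    returns = np.zeros(n_episodes)  # total reward per episode
    for ep in range(n_episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = int(rng.integers(len(ACTIONS)))  # explore
            else:
                # Exploit, breaking ties between equal Q-values at random.
                best = np.flatnonzero(Q[state] == Q[state].max())
                a = int(rng.choice(best))
            next_state, reward, done = step(state, a)
            # No bootstrapping from a terminal state.
            target = reward + (0.0 if done else gamma * Q[next_state].max())
            Q[state, a] += alpha * (target - Q[state, a])
            returns[ep] += reward
            state = next_state
            if done:
                break
    return returns

def collect_runs(n_runs=5, **kwargs):
    # One row per run, one column per episode; each run gets its own seed.
    return np.stack([run_q_learning(seed=seed, **kwargs) for seed in range(n_runs)])

def plot_runs(rewards):
    # Imported here so the rest of the sketch works without matplotlib.
    import matplotlib.pyplot as plt
    for i, run in enumerate(rewards):
        plt.plot(run, alpha=0.6, label=f"run {i + 1}")
    plt.plot(rewards.mean(axis=0), color="black", linewidth=2, label="mean")
    plt.xlabel("Episode")
    plt.ylabel("Total reward")
    plt.legend()
    plt.show()

rewards = collect_runs(n_runs=5, n_episodes=200)
```

Calling `plot_runs(rewards)` draws the five runs together with their mean. Plotting several seeds on one figure helps separate a genuine learning trend from the noise of a single run.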