I believe there is a flaw in the QLearningAgent implementation in reinforcement.py, possibly resulting from how run_single_trial is written.
I was testing this with the 4x3 environment problem given in 17.1. Upon reaching a terminal state (TERMINAL?(s1) == True), the __call__ function returns None, which causes run_single_trial to exit. If run_single_trial is then called again in a loop for multiple trials (i.e. for _ in range(N): run_single_trial(agent_program, mdp)), the next call to QLearningAgent.__call__ has s1 as the initial state ((1, 1) in the 4x3 environment), r1 as that state's reward (-0.04), s as the terminal state from the previous trial (either (4, 2) or (4, 3), so TERMINAL?(s) == True), and a == None. The terminal-state update then sets Q[s, None] = r1 = -0.04 instead of the actual terminal value of +1 or -1, which results in an incorrect policy. Simply changing line 93 to Q[s, None] = r fixes the issue and learns a correct policy.
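To make the failure mode concrete, here is a minimal, self-contained sketch of the agent as I understand it. This is not the repo's exact code: the percept format (state, reward), the plain random exploration, the fixed alpha, and the elif structure are simplifications, and actions_in_state is assumed to return [None] for terminal states (as I believe the 4x3 GridMDP does). The comments mark the buggy terminal update and the proposed one-line fix:

```python
import random
from collections import defaultdict

class QLearningAgentSketch:
    """Minimal sketch of the agent described above -- not the repo's exact code.
    Percepts are (state, reward) pairs, exploration is a plain random choice,
    and the Nsa counts / learning-rate schedule are left out."""

    def __init__(self, terminals, actions_in_state, gamma=0.9, alpha=0.1):
        self.Q = defaultdict(float)                 # Q[(state, action)] -> value
        self.terminals = terminals                  # e.g. {(4, 2), (4, 3)}
        self.actions_in_state = actions_in_state    # assumed to return [None] for terminals
        self.gamma, self.alpha = gamma, alpha
        self.s = self.a = self.r = None             # previous state, action, reward

    def __call__(self, percept):
        s1, r1 = percept                            # current state and its reward
        Q, s, a, r = self.Q, self.s, self.a, self.r

        if s in self.terminals:
            # Buggy behaviour: Q[s, None] = r1. By this point run_single_trial has
            # already restarted, so r1 is the new trial's first reward (-0.04 at
            # (1, 1)), not the terminal reward. Using the stored r records +1 / -1.
            Q[s, None] = r
        elif s is not None:
            Q[s, a] += self.alpha * (r + self.gamma *
                                     max(Q[s1, a1] for a1 in self.actions_in_state(s1)) -
                                     Q[s, a])
        if s1 in self.terminals:
            self.s, self.a, self.r = s1, None, r1   # remember the terminal state and reward
            return None                             # run_single_trial exits on None
        self.s, self.r = s1, r1
        self.a = random.choice(self.actions_in_state(s1))   # exploration policy elided
        return self.a
```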
I recognize this does not match the pseudocode in the book (Figure 21.8), and I am not certain whether the mismatch is simply due to how run_single_trial is implemented. A better fix may be available that more closely matches the pseudocode from 21.8; one possibility is sketched below.
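For example (again only a sketch, building on the class above and not tested against the repo), the terminal update could be moved to s1 before the trial ends, so the +1 / -1 reward is recorded while it is still in hand and the stale terminal state never leaks into the next trial:

```python
class QLearningAgentAlt(QLearningAgentSketch):
    """Variant of the sketch above: update Q[s1, None] at the trial boundary
    and reset the stored (s, a, r) so the next trial starts with a clean slate."""

    def __call__(self, percept):
        s1, r1 = percept
        Q, s, a, r = self.Q, self.s, self.a, self.r

        if s is not None:
            # Regular update; for a terminal s1 the max runs over [None], so the
            # terminal value recorded in earlier trials gets backed up here.
            Q[s, a] += self.alpha * (r + self.gamma *
                                     max(Q[s1, a1] for a1 in self.actions_in_state(s1)) -
                                     Q[s, a])
        if s1 in self.terminals:
            Q[s1, None] = r1                     # record the +1 / -1 here
            self.s = self.a = self.r = None      # forget the terminal state
            return None                          # run_single_trial exits on None
        self.s, self.r = s1, r1
        self.a = random.choice(self.actions_in_state(s1))   # exploration elided
        return self.a
```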