I believe there is a flaw in the QLearningAgent implementation in reinforcement.py, possibly resulting from how run_single_trial is written.
I was testing this with the 4x3 environment problem given in 17.1. Upon reaching a terminal state (TERMINAL?(s1) == True), the __call__ function returns None, which causes run_single_trial to exit. If run_single_trial is then called again in a loop for multiple trials (i.e. for _ in range(N): run_single_trial(agent_program, mdp)), the next call to QLearningAgent.__call__ has s1 as the initial state ((1, 1) in the 4x3 environment), r1 as that state's reward (-0.04), s as the terminal state from the previous trial (either (4, 2) or (4, 3), so TERMINAL?(s) == True), and a == None. The terminal-state update then sets Q[s, None] = r1 = -0.04 instead of the actual terminal value of +1 or -1, which results in an incorrect policy. Simply changing line 93 to Q[s, None] = r fixes the issue and learns a correct policy.
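To make the failure mode concrete, here is a minimal, self-contained sketch of the agent as I understand it. This is not the repo's exact code: the percept format (state, reward), the plain random exploration, the fixed alpha, and the elif structure are simplifications, and actions_in_state is assumed to return [None] for terminal states (as I believe the 4x3 GridMDP does). The comments mark the buggy terminal update and the proposed one-line fix:

```python
import random
from collections import defaultdict

class QLearningAgentSketch:
    """Minimal sketch of the agent described above -- not the repo's exact code.
    Percepts are (state, reward) pairs, exploration is a plain random choice,
    and the Nsa counts / learning-rate schedule are left out."""

    def __init__(self, terminals, actions_in_state, gamma=0.9, alpha=0.1):
        self.Q = defaultdict(float)                 # Q[(state, action)] -> value
        self.terminals = terminals                  # e.g. {(4, 2), (4, 3)}
        self.actions_in_state = actions_in_state    # assumed to return [None] for terminals
        self.gamma, self.alpha = gamma, alpha
        self.s = self.a = self.r = None             # previous state, action, reward

    def __call__(self, percept):
        s1, r1 = percept                            # current state and its reward
        Q, s, a, r = self.Q, self.s, self.a, self.r

        if s in self.terminals:
            # Buggy behaviour: Q[s, None] = r1. By this point run_single_trial has
            # already restarted, so r1 is the new trial's first reward (-0.04 at
            # (1, 1)), not the terminal reward. Using the stored r records +1 / -1.
            Q[s, None] = r
        elif s is not None:
            Q[s, a] += self.alpha * (r + self.gamma *
                                     max(Q[s1, a1] for a1 in self.actions_in_state(s1)) -
                                     Q[s, a])
        if s1 in self.terminals:
            self.s, self.a, self.r = s1, None, r1   # remember the terminal state and reward
            return None                             # run_single_trial exits on None
        self.s, self.r = s1, r1
        self.a = random.choice(self.actions_in_state(s1))   # exploration policy elided
        return self.a
```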
I recognize this does not match the pseudocode in the book (Figure 21.8), and I am not certain whether the mismatch is simply due to how run_single_trial is implemented. A better fix may be available that more closely matches the pseudocode from 21.8; one possibility is sketched below.
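For example (again only a sketch, building on the class above and not tested against the repo), the terminal update could be moved to s1 before the trial ends, so the +1 / -1 reward is recorded while it is still in hand and the stale terminal state never leaks into the next trial:

```python
class QLearningAgentAlt(QLearningAgentSketch):
    """Variant of the sketch above: update Q[s1, None] at the trial boundary
    and reset the stored (s, a, r) so the next trial starts with a clean slate."""

    def __call__(self, percept):
        s1, r1 = percept
        Q, s, a, r = self.Q, self.s, self.a, self.r

        if s is not None:
            # Regular update; for a terminal s1 the max runs over [None], so the
            # terminal value recorded in earlier trials gets backed up here.
            Q[s, a] += self.alpha * (r + self.gamma *
                                     max(Q[s1, a1] for a1 in self.actions_in_state(s1)) -
                                     Q[s, a])
        if s1 in self.terminals:
            Q[s1, None] = r1                     # record the +1 / -1 here
            self.s = self.a = self.r = None      # forget the terminal state
            return None                          # run_single_trial exits on None
        self.s, self.r = s1, r1
        self.a = random.choice(self.actions_in_state(s1))   # exploration elided
        return self.a
```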