remarks:

If we choose $Q$ to be the optimal value function, $Q=Q^{*}$, then $Q^{*}(s,a)=r(s,a)+\lambda E_{s'}[\max_{b}Q^{*}(s',b)]$.
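The fixed-point equation for $Q^{*}$ can be checked numerically. The following is a minimal sketch, assuming a small known MDP: the two-state model `P`, the rewards `R`, and the function names are all hypothetical and chosen only for illustration. It applies the right-hand side of the equation repeatedly and then measures how far the result is from satisfying it.

```python
# Hypothetical 2-state, 2-action MDP (numbers invented for illustration).
# P[s][a] maps next state -> probability; R[s][a] is the immediate reward r(s,a).
P = {
    0: {0: {0: 1.0}, 1: {0: 0.5, 1: 0.5}},
    1: {0: {0: 1.0}, 1: {1: 1.0}},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
LAM = 0.9  # discount factor, the symbol lambda in the text


def backup(Q, s, a):
    """Right-hand side of the equation: r(s,a) + lambda * E_{s'}[max_b Q(s',b)]."""
    expected_max = sum(p * max(Q[s2].values()) for s2, p in P[s][a].items())
    return R[s][a] + LAM * expected_max


def q_value_iteration(sweeps=500):
    """Apply the backup to every (s,a) pair repeatedly, starting from Q = 0."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(sweeps):
        Q = {s: {a: backup(Q, s, a) for a in P[s]} for s in P}
    return Q


def bellman_residual(Q):
    """Largest gap between Q(s,a) and its backup; near 0 when Q is close to Q^*."""
    return max(abs(Q[s][a] - backup(Q, s, a)) for s in P for a in P[s])
```

Calling `bellman_residual(q_value_iteration())` should give a value very close to zero, since the backup is a contraction with factor $\lambda<1$. Unlike Q-learning, this sketch assumes the transition probabilities are known.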
Q-learning is an off-policy algorithm: we do not control the policy $\pi$ that performs the actions. In general, an off-policy algorithm does not control the actions that are performed.
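One way to see the off-policy property is in the shape of a single update. The sketch below uses the standard tabular Q-learning rule; the function name, the dictionary representation of $Q$, and the step size `alpha` are assumptions made for illustration. The transition $(s,a,r,s')$ is handed in from outside, and the target takes a maximum over actions in $s'$, so it does not depend on which action the acting policy chooses next.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, lam=0.9):
    """One tabular Q-learning step on an observed transition (s, a, r, s_next).

    Q maps (state, action) pairs to values. alpha is a step size
    (an assumption of this sketch); lam is the discount factor.
    """
    # The target uses the best action in s_next, whatever the acting policy does.
    target = r + lam * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Usage: the transition could come from any policy, e.g. a logged trajectory.
Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```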
In an on-policy algorithm we choose the actions ourselves, so we can give every action a small positive probability and thereby reach all of the MDP. An on-policy algorithm can also hope to collect rewards that get closer to the optimal ones.

In the on-policy case, $\pi$ is $\epsilon$-greedy with respect to the current $Q$. This gives the SARSA algorithm.
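The two ingredients of this remark can be sketched as follows: an $\epsilon$-greedy choice with respect to the current $Q$, and the standard tabular SARSA update. The function names, the dictionary representation of $Q$, and the step size `alpha` are assumptions made for illustration, not taken from the notes.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, lam=0.9):
    """One tabular SARSA step on (s, a, r, s_next, a_next).

    a_next is the action the epsilon-greedy policy actually takes in s_next,
    so the target follows the acting policy rather than a max over actions.
    alpha is a step size (an assumption of this sketch); lam is the discount.
    """
    target = r + lam * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Usage: choose the next action with the current Q, then update with it.
Q = defaultdict(float)
actions = [0, 1]
a = epsilon_greedy(Q, 0, actions, epsilon=0.1)
a_next = epsilon_greedy(Q, 1, actions, epsilon=0.1)
sarsa_update(Q, s=0, a=a, r=1.0, s_next=1, a_next=a_next)
```

Because every action keeps probability at least $\epsilon$ divided by the number of actions, each action remains possible in every state that is visited.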

Yishay Mansour
2000-01-07