
### Remarks

- If we choose Q to be the optimal function, *Q* = *Q*^{*}, then the greedy policy π(*s*) = argmax_{a} *Q*^{*}(*s*, *a*) is optimal.
- The Q-Learning algorithm is off-policy, since we do not control the policy that performs the actions. In general, an off-policy algorithm does not control the actions it takes.
- For an on-policy algorithm we can give every action a small probability (e.g. ε-greedy), so that we reach all the states of the MDP.
For an on-policy algorithm we can hope that the rewards obtained get closer to the optimal ones.
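The exploration idea in the last remark can be sketched as an ε-greedy action selection rule; the function name and the dictionary representation of Q are illustrative assumptions, not from the original notes:

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """With probability eps pick a uniformly random action (exploration),
    otherwise pick the action maximizing Q(state, a) (exploitation).
    Q is assumed to be a dict mapping (state, action) -> value."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

Because every action keeps probability at least ε/|A|, every reachable state of the MDP is visited infinitely often in the limit.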

For on-policy learning the update is

*Q*(*s*_{t}, *a*_{t}) ← (1 − α)*Q*(*s*_{t}, *a*_{t}) + α[*r*_{t} + γ*Q*(*s*_{t+1}, *a*_{t+1})],

where *a*_{t+1} is the action actually chosen by the current policy with respect to Q at the moment. Thus we get the SARSA
algorithm.
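The update above can be sketched as a single SARSA step; the function name, the dictionary form of Q, and the default parameter values are assumptions for illustration:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """One SARSA step. The target uses the action a_next actually taken
    by the current policy (on-policy), not a max over actions as in
    Q-Learning. Q is a dict mapping (state, action) -> value."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q
```

Replacing `Q[(s_next, a_next)]` by `max_a Q[(s_next, a)]` would recover the off-policy Q-Learning update.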

*Yishay Mansour*

*2000-01-07*