
## Q-learning

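Let us consider the Value Iteration (VI) algorithm from lecture 6. It is defined through the nonlinear
operator *L*: in every iteration of the algorithm we apply *L*,

*V*_{n+1}=*LV*_{n}, or explicitly:

$$V_{n+1}(s) = \max_{a} \Big\{ r(s,a) + \lambda \sum_{s'} P(s'|s,a)\, V_n(s') \Big\},$$

where *r*(*s*,*a*) is the immediate reward and λ is the discount factor.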
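As an illustration, the iteration *V*_{n+1}=*LV*_{n} can be sketched in Python. This is a minimal sketch, not the lecture's own code: the two-state MDP, the names `R`, `P`, `LAMBDA`, `apply_L` and `value_iteration`, and the stopping tolerance are all assumptions made up for the example.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
LAMBDA = 0.9  # discount factor (assumed value)

# R[s][a]: immediate reward r(s, a)
R = [[0.0, 1.0],
     [2.0, 0.0]]
# P[s][a][s2]: transition probability P(s2 | s, a)
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]


def apply_L(V):
    """Apply the operator L once: max over actions of reward plus discounted expected value."""
    return [
        max(
            R[s][a] + LAMBDA * sum(P[s][a][s2] * V[s2] for s2 in STATES)
            for a in ACTIONS
        )
        for s in STATES
    ]


def value_iteration(tol=1e-10, max_iters=10_000):
    """Iterate V <- LV until successive iterates differ by less than tol."""
    V = [0.0 for _ in STATES]
    for _ in range(max_iters):
        V_next = apply_L(V)
        if max(abs(x - y) for x, y in zip(V, V_next)) < tol:
            return V_next
        V = V_next
    return V
```

For this toy MDP the fixed point can be checked by hand: staying in state 1 earns 2 per step, so *V*(1) = 2/(1-0.9) = 20, and *V*(0) = 1 + 0.9 · 20 = 19.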

Let us rewrite the equation slightly. We define a new function *Q* based on VI:

$$Q_{n+1}(s,a) = r(s,a) + \lambda \sum_{s'} P(s'|s,a)\, V_n(s')$$

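The iterations of VI are now:

$$V_{n+1}(s) = \max_{a} Q_{n+1}(s,a)$$

Expressed in terms of the *Q* function only, we have:

$$Q_{n+1}(s,a) = r(s,a) + \lambda \sum_{s'} P(s'|s,a) \max_{a'} Q_n(s',a')$$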
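The same iteration written purely in terms of *Q* can be sketched as follows. Again this is only an illustrative sketch under assumptions: the toy MDP and the names `q_iteration_step` and `q_iteration` are hypothetical, and the model (`R`, `P`) is assumed to be fully known.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
LAMBDA = 0.9  # discount factor (assumed value)

R = [[0.0, 1.0],
     [2.0, 0.0]]                    # R[s][a]: immediate reward
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]      # P[s][a][s2]: P(s2 | s, a)


def q_iteration_step(Q):
    """One step: Q'(s,a) = r(s,a) + lambda * sum_s2 P(s2|s,a) * max_a2 Q(s2,a2)."""
    return [
        [
            R[s][a] + LAMBDA * sum(P[s][a][s2] * max(Q[s2]) for s2 in STATES)
            for a in ACTIONS
        ]
        for s in STATES
    ]


def q_iteration(tol=1e-10, max_iters=10_000):
    """Repeat the step until the largest change in any entry is below tol."""
    Q = [[0.0 for _ in ACTIONS] for _ in STATES]
    for _ in range(max_iters):
        Q_next = q_iteration_step(Q)
        diff = max(
            abs(Q_next[s][a] - Q[s][a]) for s in STATES for a in ACTIONS
        )
        Q = Q_next
        if diff < tol:
            break
    return Q
```

Taking the maximum over actions of the resulting table recovers the value function of VI, which is the sense in which the two iterations are equivalent.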

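We write the iteration with a step size $\alpha_n$:

$$Q_{n+1}(s,a) = (1-\alpha_n)\, Q_n(s,a) + \alpha_n \Big[ r(s,a) + \lambda \sum_{s'} P(s'|s,a) \max_{a'} Q_n(s',a') \Big]$$

(In lecture 7 we learned that it converges to the right value.)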

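Up to this point the iterations are equivalent to VI. Instead of taking the expectation of the
value of the next step, we take a sample of the next step. We assume that we are in state *s* and
take action *a*; the next state *s*^{'} is distributed according to
*P*(*s*^{'}|*s*,*a*). Finally, with step size $\alpha_n$, we get:

$$Q_{n+1}(s,a) = (1-\alpha_n)\, Q_n(s,a) + \alpha_n \Big[ r(s,a) + \lambda \max_{a'} Q_n(s',a') \Big]$$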

**Figure 9.1:** Algorithm for Q-learning
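The sampled update can be sketched in Python as a tabular learner. This is a minimal sketch and not a transcription of Figure 9.1: the toy MDP, the uniformly random choice of actions, the `step_size` callable and the names `sample_next_state` and `q_learning` are assumptions introduced for illustration. The learner only uses `P` to draw a sample *s*^{'}; it never takes the expectation.

```python
import random

# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
LAMBDA = 0.9  # discount factor (assumed value)

R = [[0.0, 1.0],
     [2.0, 0.0]]                    # R[s][a]: immediate reward
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]      # P[s][a][s2]: P(s2 | s, a)


def sample_next_state(s, a, rng):
    """Draw the next state s' from P(. | s, a)."""
    return rng.choices(STATES, weights=P[s][a])[0]


def q_learning(num_steps, step_size, seed=0):
    """Tabular Q-learning along a single trajectory.

    step_size(n) returns the step size for the n-th visit to a (state, action) pair.
    """
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in STATES]
    visits = [[0 for _ in ACTIONS] for _ in STATES]
    s = 0
    for _ in range(num_steps):
        a = rng.choice(ACTIONS)            # assumed behaviour: uniform exploration
        s_next = sample_next_state(s, a, rng)
        visits[s][a] += 1
        alpha = step_size(visits[s][a])
        # Sampled target: reward plus discounted best Q-value at the sampled state.
        target = R[s][a] + LAMBDA * max(Q[s_next])
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
        s = s_next
    return Q


# Constant step size chosen only because this toy MDP is deterministic.
Q = q_learning(num_steps=50_000, step_size=lambda n: 0.5)
```

The constant step size is a convenience for this deterministic example. With random transitions one would typically pass a decaying schedule instead, in line with the convergence conditions referred to above.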

*Yishay Mansour*

*2000-01-07*