Q-learning

Let us consider the Value Iteration (VI) algorithm from lecture 6, which is described by the nonlinear operator $L$. In every iteration of the algorithm we apply $L$:
$V_{n+1} = LV_{n}$, or explicitly:

$V_{n+1}(s) = \max_{a \in A_{s}}\{r(s,a) + \lambda\sum_{s^{'}\in S} P(s^{'}\vert s,a)V_{n}(s^{'})\}$ .
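The VI step can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the name `value_iteration_step`, the array layout, and the two-state MDP are hypothetical, and every action is assumed to be available in every state.

```python
import numpy as np

def value_iteration_step(V, r, P, lam):
    """One application of the operator L, i.e. V_{n+1} = L V_n.

    V   : shape (S,)      current estimate V_n
    r   : shape (S, A)    immediate rewards r(s, a)
    P   : shape (S, A, S) transition probabilities P(s' | s, a)
    lam : discount factor lambda, 0 <= lam < 1
    """
    # r(s,a) + lambda * sum_{s'} P(s'|s,a) V_n(s') for every pair (s, a)
    backup = r + lam * (P @ V)
    # maximize over actions (assumption: A_s is the same for all s)
    return backup.max(axis=1)

# Hypothetical 2-state, 2-action MDP, used only for illustration.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
lam = 0.5

V = np.zeros(2)
for _ in range(100):
    V = value_iteration_step(V, r, P, lam)
```

For this small example the fixed point can be checked by hand: $V(1) = 2 + 0.5V(1)$ gives $V(1) = 4$, and then $V(0) = 1 + 0.5 \cdot 4 = 3$.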

Let us refine the equation somewhat. We define a new function $Q$ based on VI:

$Q^{n+1}(s,a) = r(s,a) + \lambda\sum_{s^{'}\in S}P(s^{'}\vert s,a)V_{n}(s^{'})$ .

The iterates of VI are now $V_{n}(s) = \max_{a \in A_{s}}\{Q^{n}(s,a)\}$ . Expressed in terms of the Q function only, we have:

$Q^{n+1}(s,a) = r(s,a) + \lambda\sum_{s^{'}\in S}P(s^{'}\vert s,a)\max_{b \in A_{s^{'}}}\{Q^{n}(s^{'},b)\}$ .
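The same iteration written purely in terms of Q can be sketched as follows. As before, this is an illustrative sketch and not code from the lecture: `q_iteration_step` and the small MDP are hypothetical, and all actions are assumed available in all states.

```python
import numpy as np

def q_iteration_step(Q, r, P, lam):
    """Q^{n+1}(s,a) = r(s,a) + lambda * sum_{s'} P(s'|s,a) * max_b Q^n(s',b)."""
    V = Q.max(axis=1)          # V_n(s') = max_b Q^n(s', b)
    return r + lam * (P @ V)   # shape (S, A)

# Hypothetical 2-state, 2-action MDP, used only for illustration.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
lam = 0.5

Q = np.zeros((2, 2))
for _ in range(100):
    Q = q_iteration_step(Q, r, P, lam)

V = Q.max(axis=1)  # recover the value function from Q
```

Taking the maximum of the converged Q over actions gives back the value function that VI computes for the same MDP.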

We write the iteration in $\alpha$-notation, that is, as a weighted average of the old estimate and the new one.
(In lecture 7 we learned that it converges to the right value.)

$Q^{n+1}(s,a) = (1-\alpha)Q^{n}(s,a) + \alpha[ r(s,a) + \lambda\sum_{s^{'}\in S}P(s^{'}\vert s,a)\max_{b \in A_{s^{'}}}\{Q^{n}(s^{'},b)\}]$
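As a quick check on the $\alpha$-notation, write $H$ for the operator on the right-hand side of the Q iteration. The symbol $H$ is introduced here only for illustration and does not come from the lecture.

```latex
\begin{align*}
(HQ)(s,a) &= r(s,a) + \lambda\sum_{s' \in S} P(s' \vert s,a)\max_{b \in A_{s'}} Q(s',b), \\
Q^{n+1} &= (1-\alpha)Q^{n} + \alpha HQ^{n}, \\
\alpha = 1 &: \quad Q^{n+1} = HQ^{n}, \\
Q = (1-\alpha)Q + \alpha HQ &\iff \alpha(Q - HQ) = 0 \iff Q = HQ \quad (\alpha > 0).
\end{align*}
```

So for any $0 < \alpha \le 1$ the averaged iteration has the same fixed point as the unaveraged one, and $\alpha = 1$ gives back the unaveraged iteration exactly.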

Until now the iterations are equivalent to VI. Now, instead of taking the expectation of the value of the next state, we take a sample of the next state. Assume that we are in state $s$ and take action $a$; the next state $s^{'}$ is distributed according to $P(s^{'}\vert s,a)$. Replacing the sum over $s^{'}$ by this single sample, we finally get

$Q^{n+1}(s,a) = (1-\alpha)Q^{n}(s,a) + \alpha[ r(s,a) + \lambda\max_{b \in A_{s^{'}}}Q^{n}(s^{'},b)]$
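The sampled update can be sketched as a tabular loop. This is a minimal sketch under several assumptions that are not taken from the lecture: actions are chosen uniformly at random, the step size $\alpha$ is held constant, all actions are available in all states, and the function name `q_learning` and the small stochastic MDP are hypothetical. It is not a reproduction of the pseudocode in Figure 9.1.

```python
import numpy as np

def q_learning(r, P, lam, alpha=0.02, steps=50000, seed=0):
    """Tabular Q-learning driven by sampled transitions.

    r : shape (S, A)    rewards r(s, a)
    P : shape (S, A, S) transition probabilities, used only to simulate samples
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = r.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        # assumption: explore by picking an action uniformly at random
        a = rng.integers(n_actions)
        # sample the next state s' ~ P(. | s, a) instead of summing over s'
        s_next = rng.choice(n_states, p=P[s, a])
        # sampled target: r(s,a) + lambda * max_b Q(s', b)
        target = r[s, a] + lam * Q[s_next].max()
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next
    return Q

# Hypothetical 2-state, 2-action MDP with stochastic transitions.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
lam = 0.5

Q = q_learning(r, P, lam)
```

Note that the update itself never reads $P$; the array is used only to simulate the environment. With a constant $\alpha$ the estimates keep fluctuating around the fixed point of the expected update rather than settling exactly on it.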

**Figure 9.1:** Algorithm for Q-LEARNING
[pseudocode box not recoverable; it ends with "end Q-LEARNING"]

Yishay Mansour
2000-01-07