Next: About this document ... Up: No Title Previous: SARSA

Convergence proof

The general idea is writing our algorithm as:

$Q \leftarrow (1-\alpha)Q + \alpha(HQ+w)$

where:

$(HQ)(s,a) = r(s,a)+\lambda\sum_{s^{'}}P(s^{'}\vert s,a)^{max}_{b}Q(s^{'},b)$

We will show that H is a contracting mapping (operator):
$\vert\vert HQ_{1} - HQ_{2}\vert\vert = \vert\vert\lambda P[MQ_{1} - MQ_{2}]\vert\vert$
where (MQ₁)(s)=^max_aQ₁(s,a). Note that MQ₁=V₁. We have,
$\vert\vert HQ_{1} - HQ_{2}\vert\vert \leq \lambda\vert\vert V_{1} - V_{2}\vert\vert\leq \lambda\vert\vert Q_{1} - Q_{2}\vert\vert$ ,
Therefore H is $\lambda$ contracting.
The parameter w is "noise" in the sample whose expectation is 0. By using the contracting property of H we show convergence. For convergence we require:

$Q^{n+1}(s,a)=(1-\alpha_{n}(s,a))Q^{n}(s,a) + % \alpha_{n}(s,a)[r(s,a)+\lambda^{max}_{b}Q^{n}(s^{'},b)]$
Let T^s,a be the time we are in state s and make action a
- for $t \in T^{s,a}$ $\alpha_{t}(s,a)>0$
- and for $t \notin T^{s,a}$ $\alpha_{t}(s,a)=0$
F_t is history untill time t
Thus the algorithm can be written as:

$Q^{n+1}(s,a)=(1-\alpha_{n}(s,a))Q^{n}(s,a) + \alpha_{n}(s,a)[(HQ^{(n)}(s,a) + % w_{n}(s,a)]$

where: $w_{n}(s,a)=\lambda[^{max}_{b}Q^{(n)}(s^{'},b)-% \sum_{s^{''}}P(s^{''}\vert s,a)^{max}_{c}Q^{(n)}(s^{''},c)]$
because s^' is distributed according to P(.|s,a) (the policy is stationary) then we have: $E[w_{n}(s,a)]=\lambda[E[^{max}_{b}Q^{(n)}(s^{'},b)]-E[^{max}_{c}Q^{(n)}(s^{''},c)]] = 0$
Therefore the expectation of "noise" is 0.

Theorem 9.1 (A general theorem for Stochastic processes)
Let: $r_{t+1}(i)=(1-\alpha_{t}(i))r_{t}(i)+\alpha_{t}(i)[(Hr_{t})(i)+w_{t}(i)]$

Given that:

1.: E[w_t(i)|F_t]=0
2.: $E[w_{t}^{2}(i)\vert F_{t}] \leq A+B\vert\vert r_{t}\vert\vert^{2}$ for some A,B constants.
3.: the steps $\alpha_{t}(i)$ are such that: $\sum\alpha_{t}(i)=\infty$ and $\sum\alpha_{t}^{2}(i)<\infty$
4.: the mapping H (operator) is contracting

Then $r_{t} \rightarrow r^{*}$ with probability 1 .
(where R^* is the fixed point of Hr^*=r^*)

In the case of Q-Learning it means that HQ^*=Q^* and Q^* is the optimal value function,
(this is the same operator as in Value Iteration). Thus we achieve by the convergence of Q-Learning the optimal policy value.

Theorem 9.2 If all r(s,a) are non-negative and bounded and Q⁰(s,a)=0 then
$Q^{(n)} \rightarrow Q^{*}$

For the theorem to be true we assumed that we sample each (s,a) infinite number of times,
thus we could look at each sequence separately.

How to assure that we visit all (s,a) ?

1.: If we follow the greedy policy of Q⁽ⁿ⁾ it may be not possible to check all pairs (s,a)
even if we reach all states $s \in S$
2.: We will follow from a state s the $\varepsilon$ -greedy policy. With probability $1-% \varepsilon$ we choose the greedy action from s,
and do a random policy from s with probability $\varepsilon$ .
3.: Choose the policy by:

$\Pi^{n}(s,a)= \frac{exp{-Q(s,a)/T}}{\sum_{b}exp{-Q(s,b)/T}}$
T determines if we do $\varepsilon$ -greedy action (T $\rightarrow 0$ ) or random action (T $\rightarrow \infty$ ). We can look at this as if we part the time into sections of explorations(random) and of explotation(greedy).

On all 3 methods we can decide if we want to exploit the knowledge we have or get more information about the MDP

The next theorem is a weaker version of the main theorem.

Theorem 9.3 Let $r_{t+1}=(1-\frac{1}{t})r_{t}+\frac{1}{t}(Hr_{t}+w_{t})$
Assume that:

1.: E[w_t]=0 and $\vert w_{t}\vert\leq 1$
2.: H is a contracting mapping
3.: $\vert r_{t}\vert \leq D_{0}$

Then $r_{t} \rightarrow r^{*}$ such that Hr^* = r^*

Proof:Since H is contracting,
$\vert\vert Hr-Hr^{*}\vert\vert\leq \beta \vert\vert r-r^{*}\vert\vert$ for some $\beta < 1$ .
Whithout loss of generality, assume that r^*=0,
therefore $\vert\vert Hr\vert\vert\leq \beta \vert\vert r\vert\vert$ .

Define: $D_{k+1} = (\beta+2\varepsilon)D_{k}$ such that $\beta + 2\varepsilon<1$ .
We show by induction that exists a time t_k such that for every $t \geq t_{k}$ then $\vert\vert r_{t}\vert\vert \leq D_{k}$

We bound r_t for t>t_k. since $\vert\vert r_{t}\vert\vert \leq D_{k}$ then $\vert\vert Hr_{t}\vert\vert \leq \beta D_{k}$ (the induction hypothesis).
We have: $r_{t+1}=\frac{t-1}{t}r_{t}+\frac{1}{t}(Hr_{t}+w_{t})=\frac{t-l-1}{t}r_{t-% l}+\frac{1}{t}\sum_{i=t-l}^{t}Hr_{i}+\frac{1}{t}\sum_{i=t-l}^{t}w_{i}$ .
choose l=t-t_k then : $\vert\vert r_{i}\vert\vert\leq D_{k}$ and $\vert\vert Hr_{i}\vert\vert\leq \beta D_{k}$ .
Since E[w_i]=0 and $\vert w_{i}\vert\leq 1$ we have that: $\frac{1}{t}\sum_{i=t-% l}^{t}w_{i}\rightarrow 0$ . Therefore $\frac{1}{t}\sum_{i=t-l}^{t}w_{i} \leq \varepsilon D_{k}$ for all but a finite number of t.

We continue: $r_{t} \leq \frac{t_{k}-1}{t}r_{t_{k}}+\frac{1}{t}\sum_{i=t-% l}^{t}Hr_{i}+\frac... ...frac{t_{k}}{t}D_{k}+\frac{t-% t_{k}}{t}\beta D_{k}+\frac{t-t_{k}}{t}\varepsilon$

For $t \geq t_{k}/\varepsilon$ we get:
$r_{t} \leq \varepsilon D_{k} + \beta D_{k}+\varepsilon D_{k} = % (\beta+2\varepsilon)D_{k}=D_{k+1}$

Therefore exists t_k+1 such that $r_{t}\leq D_{k+1}$
Thus when $t\rightarrow \infty$ then $\vert\vert r_{t}\vert\vert \rightarrow r^{*}=0$ . $\Box$

Next: About this document ... Up: No Title Previous: SARSA

Yishay Mansour
2000-01-07