Convergence proof

The general idea is writing our algorithm as:

where:

We will show that H is a contracting mapping (operator):

where (MQ1)(s)=maxaQ1(s,a). Note that MQ1=V1. We have,
,
Therefore H is contracting.
The parameter w is "noise" in the sample whose expectation is 0. By using the contracting property of H we show convergence. For convergence we require:

• Let Ts,a be the time we are in state s and make action a
• for
• and for
• Ft is history untill time t
• Thus the algorithm can be written as:

where:
• because s' is distributed according to P(.|s,a) (the policy is stationary) then we have:
Therefore the expectation of "noise" is 0.

Theorem 9.1 (A general theorem for Stochastic processes)
Let:

Given that:
1.
E[wt(i)|Ft]=0
2.
for some A,B constants.
3.
the steps are such that: and
4.
the mapping H (operator) is contracting
Then with probability 1 .
(where R* is the fixed point of Hr*=r*)

In the case of Q-Learning it means that HQ*=Q* and Q* is the optimal value function,
(this is the same operator as in Value Iteration). Thus we achieve by the convergence of Q-Learning the optimal policy value.

Theorem 9.2   If all r(s,a) are non-negative and bounded and Q0(s,a)=0 then

For the theorem to be true we assumed that we sample each (s,a) infinite number of times,
thus we could look at each sequence separately.

How to assure that we visit all (s,a) ?
1.
If we follow the greedy policy of Q(n) it may be not possible to check all pairs (s,a)
even if we reach all states
2.
We will follow from a state s the -greedy policy. With probability we choose the greedy action from s,
and do a random policy from s with probability .
3.
Choose the policy by:

T determines if we do -greedy action (T ) or random action (T ). We can look at this as if we part the time into sections of explorations(random) and of explotation(greedy).

On all 3 methods we can decide if we want to exploit the knowledge we have or get more information about the MDP

The next theorem is a weaker version of the main theorem.

Theorem 9.3   Let
Assume that:
1.
E[wt]=0 and
2.
H is a contracting mapping
3.

Then such that Hr* = r*

Proof:Since H is contracting,
for some .
Whithout loss of generality, assume that r*=0,
therefore .

Define: such that .
We show by induction that exists a time tk such that for every then

We bound rt for t>tk. since then (the induction hypothesis).
We have: .
choose l=t-tk then : and .
Since E[wi]=0 and we have that: . Therefore for all but a finite number of t.

We continue:

For we get:

Therefore exists tk+1 such that
Thus when then .