Next: About this document ...
Up: No Title
Previous: SARSA
Convergence proof
The general idea is writing our algorithm as:
where:
We will show that H is a contracting mapping (operator):
where
(MQ_{1})(s)=^{max}_{a}Q_{1}(s,a). Note that
MQ_{1}=V_{1}. We have,
,
Therefore H is
contracting.
The parameter w is "noise" in the sample whose expectation is 0.
By using the contracting property of H we show convergence.
For convergence we require:

 Let T^{s,a} be the time we are in state s and make action a
 F_{t} is history untill time t
 Thus the algorithm can be written as:
where:
 because s^{'} is distributed according to P(.s,a) (the policy is stationary) then
we have:
Therefore the expectation of "noise" is 0.
Theorem 9.1 (A general theorem for Stochastic processes)
Let:
Given that:
 1.

E[w_{t}(i)F_{t}]=0
 2.

for some A,B constants.
 3.
 the steps
are such that:
and
 4.
 the mapping H (operator) is contracting
Then
with probability 1 .
(where
R^{*} is the fixed point of
Hr^{*}=
r^{*})
In the case of QLearning it means that
HQ^{*}=Q^{*} and Q^{*} is the optimal value
function,
(this is the same operator as in Value Iteration).
Thus we achieve by the convergence of QLearning the optimal policy value.
Theorem 9.2
If all r(s,a) are nonnegative and bounded and
Q^{0}(
s,
a)=0 then
For the theorem to be true we assumed that we sample each (s,a) infinite number of times,
thus we could look at each sequence separately.
How to assure that we visit all (s,a) ?
 1.
 If we follow the greedy policy of Q^{(n)} it may be not possible to check all pairs
(s,a)
even if we reach all states
 2.
 We will follow from a state s the
greedy policy. With probability
we choose the greedy action from s,
and do a random policy from s with probability
.
 3.
 Choose the policy by:
T determines if we do
greedy action (T
)
or random action (T
).
We can look at this as if we part the time into sections of explorations(random) and of
explotation(greedy).
On all 3 methods we can decide if we want to exploit the knowledge we have or get more
information about the MDP
The next theorem is a weaker version of the main theorem.
Theorem 9.3
Let
Assume that:
 1.

E[w_{t}]=0 and
 2.
 H is a contracting mapping
 3.

Then
such that
Hr^{*} =
r^{*}
Proof:Since H is contracting,
for some .
Whithout loss of generality, assume that r^{*}=0,
therefore
.
Define:
such that
.
We show by induction that exists a time t_{k} such that for every
then
We bound r_{t} for t>t_{k}.
since
then
(the induction
hypothesis).
We have:
.
choose l=tt_{k} then :
and
.
Since
E[w_{i}]=0 and
we have that:
.
Therefore
for all but a finite
number of t.
We continue:
For
we get:
Therefore exists t_{k+1} such that
Thus when
then
.
Next: About this document ...
Up: No Title
Previous: SARSA
Yishay Mansour
20000107