
Large State Space

When we have a large state space and cannot compute $V^*(s)$ exactly, we would like to build an approximation function $\hat{V}(s,r)$, parameterized by a vector $r$. Let

\begin{displaymath}
\epsilon \;=\; \min_{r} \Vert \hat{V}(\cdot,r) - V^* \Vert_{\infty}
\end{displaymath}

be the minimal distance between the family $\{\hat{V}(\cdot,r)\}$ and $V^*$. Clearly $\epsilon$ is a lower bound on the error of any approximation algorithm that outputs a function from this family; if $\epsilon$ is large, we cannot expect a good approximation, regardless of the learning process. We will later show error bounds of the form

\begin{displaymath}
O\!\left(\frac{\epsilon}{1-\gamma}\right)
\qquad \mbox{or} \qquad
O\!\left(\frac{\epsilon}{(1-\gamma)^{2}}\right).
\end{displaymath}

These bounds might seem disappointing, since as $\gamma \to 1$ they diverge. On the other hand, if we enrich our architecture (enlarge the family $\{\hat{V}(\cdot,r)\}$), $\epsilon$ decreases, and for a fixed $\gamma$ the bound goes to 0 as $\epsilon \to 0$.
If we approximate $Q^*(s,a)$ by $\hat{Q}(s,a,r)$, the greedy policy is

\begin{displaymath}
\pi(s) \;=\; \arg\max_{a}\, \hat{Q}(s,a,r).
\end{displaymath}
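As an illustration (not from the original notes), here is a minimal Python sketch of this greedy rule; q_hat, r, and the action set are hypothetical placeholders standing in for whatever approximation architecture is used.

```python
def greedy_action(q_hat, s, r, actions):
    """Return the greedy action argmax_a q_hat(s, a, r)."""
    return max(actions, key=lambda a: q_hat(s, a, r))

# Toy linear family q_hat(s, a, r) = r[a] * s, purely illustrative.
q_hat = lambda s, a, r: r[a] * s
print(greedy_action(q_hat, s=2.0, r=[0.1, 0.5, 0.3], actions=[0, 1, 2]))  # -> 1
```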
If instead we have an approximation $\hat{V}(s,r)$, then the greedy policy is

\begin{displaymath}
\pi(s) \;=\; \arg\max_{a}\, \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \hat{V}(s',r) \Big].
\end{displaymath}
To evaluate this policy we also have to approximate the expectation $\sum_{s'} P(s' \mid s,a)\, \hat{V}(s',r)$, so additional errors arise from this approximation as well.
If each state has only a few successor states $s'$, we can compute the expectation exactly; otherwise we approximate it by taking samples. Sampling yields a stochastic policy: each time we draw different samples, we may choose a different action, unlike the case of $\hat{Q}$, where the greedy policy is deterministic (a sketch of such sampled action selection follows below). The next theorem relates errors in the value function to differences in the return.
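A minimal Python sketch of the sampled variant, assuming hypothetical helpers sample_next (draws $s' \sim P(\cdot \mid s,a)$), reward, and v_hat; none of these names come from the notes. Running it twice on the same state may return different actions, which is exactly the stochasticity described above.

```python
import random

def sampled_greedy_action(s, actions, sample_next, reward, v_hat,
                          gamma=0.9, n=10):
    """Pick the action maximizing a sampled estimate of
    R(s,a) + gamma * E[v_hat(s')], using n successor samples per action."""
    def estimate(a):
        successors = [v_hat(sample_next(s, a)) for _ in range(n)]
        return reward(s, a) + gamma * sum(successors) / n
    return max(actions, key=estimate)

# Toy example: noisy successors, so repeated calls may pick different actions.
sample_next = lambda s, a: s + a + random.gauss(0, 1)  # hypothetical dynamics
reward = lambda s, a: 0.0
v_hat = lambda s: s                                    # assumed value estimate
print(sampled_greedy_action(0.0, [0, 1], sample_next, reward, v_hat, n=2))
```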

Theorem 11.1   Consider a discounted problem with discount factor $\gamma$. If $V$ satisfies

\begin{displaymath}
\Vert V - V^* \Vert_{\infty} \;\le\; \epsilon
\end{displaymath}

and $\pi$ is a greedy policy based on $V$, then

\begin{displaymath}
\Vert V^{\pi} - V^* \Vert_{\infty} \;\le\; \frac{2\gamma\epsilon}{1-\gamma}.
\end{displaymath}

Furthermore, there exists $\epsilon_{0} > 0$ such that for every $\epsilon < \epsilon_{0}$ the greedy policy $\pi$ is an optimal policy.
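For concreteness (numbers chosen here for illustration, not from the original notes): with $\gamma = 0.9$ and $\epsilon = 0.1$ the theorem gives

\begin{displaymath}
\Vert V^{\pi} - V^* \Vert_{\infty} \;\le\; \frac{2 \cdot 0.9 \cdot 0.1}{1-0.9} \;=\; 1.8 ,
\end{displaymath}

so even small value-function errors can translate into a much larger loss in return when $\gamma$ is close to 1.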

Proof:
Consider the operators

\begin{displaymath}
(T_{\pi} V)(s) \;=\; R(s,\pi(s)) + \gamma \sum_{s'} P(s' \mid s,\pi(s))\, V(s')
\end{displaymath}

and

\begin{displaymath}
(T V)(s) \;=\; \max_{a}\, \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big].
\end{displaymath}

Since $\pi$ is greedy with respect to $V$, we have $T_{\pi} V = T V$. Both operators are $\gamma$-contractions in the maximum norm, with fixed points $V^{\pi}$ and $V^{*}$, respectively. Then

\begin{displaymath}
\Vert V^{\pi} - V^{*} \Vert_{\infty}
\;\le\; \Vert T_{\pi} V^{\pi} - T_{\pi} V \Vert_{\infty} + \Vert T V - T V^{*} \Vert_{\infty}
\;\le\; \gamma \Vert V^{\pi} - V \Vert_{\infty} + \gamma \Vert V - V^{*} \Vert_{\infty}
\;\le\; \gamma \Vert V^{\pi} - V^{*} \Vert_{\infty} + 2\gamma\epsilon .
\end{displaymath}

This implies that

\begin{displaymath}
\Vert V^{\pi} - V^{*} \Vert_{\infty} \;\le\; \frac{2\gamma\epsilon}{1-\gamma}.
\end{displaymath}

For the second part:
since we have a finite number of policies, there exists $\delta > 0$ such that for every non-optimal policy $\pi'$,

\begin{displaymath}
\Vert V^{\pi'} - V^{*} \Vert_{\infty} \;\ge\; \delta .
\end{displaymath}

For any $\epsilon$ such that

\begin{displaymath}
\frac{2\gamma\epsilon}{1-\gamma} \;<\; \delta ,
\end{displaymath}

the greedy policy $\pi$ must be optimal. Therefore we can choose

\begin{displaymath}
\epsilon_{0} \;=\; \frac{\delta(1-\gamma)}{2\gamma} .
\end{displaymath}
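As a sanity check (not part of the original notes), the following Python sketch builds a small random MDP, perturbs $V^*$ by at most $\epsilon$, and verifies that the return of the resulting greedy policy stays within the bound of Theorem 11.1. The MDP and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

# A small random MDP: P[a, s, s'] transition probabilities, R[s, a] rewards.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def backup(V):
    """One-step lookahead values: (s, a) -> R(s,a) + gamma * E[V(s')]."""
    return R + gamma * np.einsum('ast,t->sa', P, V)

def policy_value(pi):
    """Exact V^pi by solving (I - gamma * P_pi) V = R_pi."""
    Ppi = P[pi, np.arange(nS)]
    Rpi = R[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * Ppi, Rpi)

# Value iteration to approximate V* to high accuracy.
V_star = np.zeros(nS)
for _ in range(1000):
    V_star = backup(V_star).max(axis=1)

eps = 0.05
V = V_star + rng.uniform(-eps, eps, nS)   # any V with ||V - V*|| <= eps
pi = backup(V).argmax(axis=1)             # greedy policy based on V
gap = np.abs(policy_value(pi) - V_star).max()
print(gap <= 2 * gamma * eps / (1 - gamma))   # True: the bound holds
```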

Yishay Mansour
2000-01-11