
Optimality Equations

For a finite horizon we gave the following equations:
\[ v_n(s) = \max_{a \in A_s} \Big\{ r(s,a) + \sum_{j \in S} P(j \mid s,a)\, v_{n-1}(j) \Big\}. \]

For a discounted infinite horizon we have similar equations. We will show that the optimality equations are:
\[ v(s) = \max_{a \in A_s} \Big\{ r(s,a) + \lambda \sum_{j \in S} P(j \mid s,a)\, v(j) \Big\}. \]
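
The finite-horizon recursion can be sketched as a backward induction on a toy MDP; the two states, two actions, rewards, and transition probabilities below are all invented for illustration.

```python
# A sketch of the finite-horizon backward recursion
#   v_n(s) = max_a { r(s,a) + sum_j P(j|s,a) v_{n-1}(j) }
# on a toy two-state MDP; all numbers here are invented for illustration.
S = [0, 1]
A = ["stay", "go"]
r = {0: {"stay": 1.0, "go": 0.0}, 1: {"stay": 0.0, "go": 2.0}}
P = {  # P[s][a][j] = P(j | s, a)
    0: {"stay": {0: 1.0, 1: 0.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 0.7, 1: 0.3}},
}

def backward_induction(horizon):
    """Compute v_horizon starting from v_0 = 0."""
    v = {s: 0.0 for s in S}
    for _ in range(horizon):
        v = {s: max(r[s][a] + sum(P[s][a][j] * v[j] for j in S) for a in A)
             for s in S}
    return v
```

With one step, `backward_induction(1)` simply picks the best immediate reward in each state; each further step looks one transition deeper.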

First we show that maximizing over deterministic and stochastic decision rules yields the same value.

Theorem 5.4   For all $v \in V$ and $s \in S$,
\[ \max_{d \in D^{MD}} \{ r_d(s) + \lambda (P_d v)(s) \} = \max_{d \in D^{MR}} \{ r_d(s) + \lambda (P_d v)(s) \}, \]
where $D^{MD}$ is the set of deterministic Markovian decision rules and $D^{MR}$ the set of stochastic ones.

Proof:
Since $D^{MD} \subseteq D^{MR}$, the right side of the equality is at least as large as the left.
Now we show that the left side is at least as large as the right.
For $a \in A_s$ and $v \in V$ we define
\[ w_s(a) = r(s,a) + \lambda \sum_{j \in S} P(j \mid s,a)\, v(j). \]

Fix a state $s$. For a stochastic rule $d \in D^{MR}$ that chooses action $a$ with probability $q_{d(s)}(a)$, the value $r_d(s) + \lambda (P_d v)(s)$ is a weighted average of the $w_s(a)$, and we have
\[ r_d(s) + \lambda (P_d v)(s) = \sum_{a \in A_s} q_{d(s)}(a)\, w_s(a) \le \max_{a \in A_s} w_s(a). \]

Hence
\[ r_d(s) + \lambda (P_d v)(s) \le \max_{d' \in D^{MD}} \{ r_{d'}(s) + \lambda (P_{d'} v)(s) \} \]
for any $d \in D^{MR}$.
This shows that the left hand side in the theorem is at least as large as the right hand side, which completes the proof.
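
The averaging argument can also be checked numerically: the value of a stochastic rule at a state is a convex combination of the $w_s(a)$, so it never exceeds the best deterministic choice. The values $w_s(a)$ below are made up for illustration.

```python
# Numeric check of the averaging argument (the values w_s(a) are invented):
# a stochastic rule's value sum_a q(a) w_s(a) never exceeds max_a w_s(a).
import random

w = {"a1": 4.0, "a2": 7.5, "a3": 6.0}  # hypothetical w_s(a)

def stochastic_value(q):
    """Value of a stochastic rule choosing action a with probability q[a]."""
    return sum(q[a] * w[a] for a in w)

random.seed(0)
for _ in range(1000):
    raw = {a: random.random() for a in w}
    total = sum(raw.values())
    q = {a: x / total for a, x in raw.items()}  # a random distribution
    assert stochastic_value(q) <= max(w.values()) + 1e-12
```

A deterministic rule is the special case where $q$ puts all its mass on one action, so the maximum itself is attained.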

Let us define the non-linear operator $L$:
\[ (Lv)(s) = \max_{a \in A_s} \Big\{ r(s,a) + \lambda \sum_{j \in S} P(j \mid s,a)\, v(j) \Big\}, \]
or, in vector notation, $Lv = \max_{d \in D^{MD}} \{ r_d + \lambda P_d v \}$.

Therefore we can state the optimality equation as
\[ Lv = v. \]
It still remains to be shown that these are indeed optimality equations, and that the fixed point of the operator $L$ is the optimal return value.
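
As a sketch, the operator $L$ is a single discounted Bellman backup; here it is written out on a toy MDP (the states, actions, rewards, and transitions are invented for illustration, and `lam` stands for $\lambda$).

```python
# A minimal sketch of the operator L as one discounted Bellman backup.
# The two-state MDP and lam (the discount factor lambda) are invented.
lam = 0.9
S = [0, 1]
A = ["stay", "go"]
r = {0: {"stay": 1.0, "go": 0.0}, 1: {"stay": 0.0, "go": 2.0}}
P = {  # P[s][a][j] = P(j | s, a)
    0: {"stay": {0: 1.0, 1: 0.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 0.7, 1: 0.3}},
}

def L(v):
    """(Lv)(s) = max_a { r(s,a) + lam * sum_j P(j|s,a) v(j) }."""
    return {s: max(r[s][a] + lam * sum(P[s][a][j] * v[j] for j in S)
                   for a in A)
            for s in S}
```

Applied to $v = 0$, the backup just returns the best immediate reward in each state.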

Theorem 5.5   Let $v^*_\lambda$ be the optimal return value with discount parameter $\lambda$.
1.
If $v \ge Lv$ then $v \ge v^*_\lambda$.
2.
If $v \le Lv$ then $v \le v^*_\lambda$.
3.
If $Lv = v$ then $v = v^*_\lambda$.

Proof: We start by proving (1).
Since $v \ge Lv$,
\[ v \ge \max_{d \in D^{MD}} \{ r_d + \lambda P_d v \}; \]

this implies that for any decision rule $d$,
\[ v \ge r_d + \lambda P_d v. \]

For a given policy $\pi = (d_1, d_2, \ldots)$, iterating this inequality gives
\[ v \ge r_{d_1} + \lambda P_{d_1} v \ge r_{d_1} + \lambda P_{d_1} r_{d_2} + \lambda^2 P_{d_1} P_{d_2} v \ge \cdots \ge \sum_{k=1}^{n} \lambda^{k-1} P_\pi^{k-1} r_{d_k} + \lambda^n P_\pi^n v, \]
where $P_\pi^{k} = P_{d_1} P_{d_2} \cdots P_{d_k}$. The return of $\pi$ is
\[ v_\pi = \sum_{k=1}^{\infty} \lambda^{k-1} P_\pi^{k-1} r_{d_k}. \]

By subtraction,
\[ v - v_\pi \ge \lambda^n P_\pi^n v - \sum_{k=n+1}^{\infty} \lambda^{k-1} P_\pi^{k-1} r_{d_k}. \]

Since $0 \le \lambda < 1$ and the rewards are bounded, $|r(s,a)| \le R_{\max}$, we have that for any $\epsilon > 0$ there exists an $N > 0$ such that for all $n > N$
\[ \| \lambda^n P_\pi^n v \| \le \frac{\epsilon}{2} \quad \text{and} \quad \Big\| \sum_{k=n+1}^{\infty} \lambda^{k-1} P_\pi^{k-1} r_{d_k} \Big\| \le \frac{\lambda^n R_{\max}}{1-\lambda} \le \frac{\epsilon}{2}. \]

Hence, for a large enough $n$ we have
\[ v \ge v_\pi - \epsilon. \]

Since this holds for every $\epsilon > 0$, we derive that
\[ v \ge v_\pi \]
for every policy $\pi$.
Thus
\[ v \ge \sup_\pi v_\pi = v^*_\lambda. \]
(If we assume that there is a state $s$ such that $v(s) < v^*_\lambda(s)$, we pick a policy $\pi$ with $v_\pi(s) > v(s)$ and reach a contradiction.)
We now prove (2).
Since $v \le Lv$, there exists a decision rule $d$ such that
\[ v \le r_d + \lambda P_d v, \]
i.e., $(I - \lambda P_d) v \le r_d$.

By the theorem on the return of a stationary policy, $v_{d^\infty} = (I - \lambda P_d)^{-1} r_d$, and since $(I - \lambda P_d)^{-1} = \sum_{k=0}^{\infty} \lambda^k P_d^k$ has non-negative entries, applying it to both sides preserves the inequality:
\[ v \le (I - \lambda P_d)^{-1} r_d = v_{d^\infty}. \]

Hence
\[ v \le v_{d^\infty} \le v^*_\lambda. \]
Part (3) follows immediately from parts (1) and (2).
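
Part (3) can be illustrated numerically: since $\lambda < 1$, iterating $L$ from $v = 0$ converges, and the limit satisfies $Lv = v$. The toy two-state MDP below (and `lam` for $\lambda$) is invented for illustration.

```python
# Sketch illustrating part (3): iterating L from v = 0 converges to a
# fixed point v with Lv = v, which is then the optimal discounted return.
# The toy MDP below and lam (the discount factor lambda) are invented.
lam = 0.9
S = [0, 1]
A = ["stay", "go"]
r = {0: {"stay": 1.0, "go": 0.0}, 1: {"stay": 0.0, "go": 2.0}}
P = {
    0: {"stay": {0: 1.0, 1: 0.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 0.7, 1: 0.3}},
}

def L(v):
    return {s: max(r[s][a] + lam * sum(P[s][a][j] * v[j] for j in S)
                   for a in A)
            for s in S}

v = {s: 0.0 for s in S}
for _ in range(500):   # lam < 1, so the iterates converge geometrically
    v = L(v)

w = L(v)
gap = max(abs(w[s] - v[s]) for s in S)  # should be ~0 at the fixed point
```

In the same setting, parts (1) and (2) say that any $v$ with $v \ge Lv$ stays above this limit, and any $v$ with $v \le Lv$ stays below it.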

Yishay Mansour
1999-11-24