Optimality Equations

For a finite horizon we gave the following equations:

\begin{eqnarray*}
v_{n}(s) = \max_{a \in A_{s}}\{r(s,a) + \sum_{j\in S}p(j\mid s,a)v_{n+1}(j)\}\ .
\end{eqnarray*}
For a discounted infinite horizon we have similar equations. We will show that the optimality equations are:

\begin{eqnarray*}
v(s) = \max_{a \in A_{s}}\{r(s,a) + \lambda\sum_{j\in S}p(j\mid s,a)v(j)\}\ .
\end{eqnarray*}

First we show that maximizing over deterministic policies and maximizing over stochastic policies yield the same value.

Theorem 5.4   For all $\vec{v}$ and $1 > \lambda \geq 0$,

\begin{eqnarray*}
\max_{d\in \Pi^{MD}}\{r_{d} + \lambda P_{d}\vec{v}\} =
\max_{d\in\Pi^{MR}}\{r_{d} + \lambda P_{d}\vec{v}\}\ .
\end{eqnarray*}

Proof: Since $\Pi^{MD} \subseteq \Pi^{MR}$, the right-hand side of the equality is at least as large as the left.
We now show that the left-hand side is at least as large as the right.
For $\pi \in \Pi^{MR}$ and $\vec{v}$ we define

\begin{eqnarray*}
\forall s\in S \ \ \ \ w_{s}(a) = r(s,a) + \lambda\sum_{j\in S} p(j\mid s,a)v(j)\ .
\end{eqnarray*}

Fix a state $s\in S$, and let $q_{\pi}(a)$ denote the probability that $\pi$ chooses action $a$ in $s$. The value of $\pi$ at $s$ is a weighted average of the $w_{s}(a)$, so

\begin{eqnarray*}
\max_{a\in A_{s}}\{w_{s}(a)\} \geq \sum_{a\in A_{s}}q_{\pi}(a)w_{s}(a)\ .
\end{eqnarray*}


Since this holds in every state,

\begin{eqnarray*}
\max_{d\in \Pi^{MD}}\{r_{d} + \lambda P_{d}\vec{v} \} \geq r_{\pi}
+ \lambda P_{\pi}\vec{v}
\end{eqnarray*}

for any $\pi \in \Pi^{MR}$.
This shows that the left hand side in the theorem is at least as large as the right hand side, which completes the proof. $\Box$

Let us define the non-linear operator L:

\begin{eqnarray*}
L\vec{v} = \max_{d\in \Pi^{MD}}\{\vec{r}_{d} + \lambda P_{d}\vec{v}\}\ .
\end{eqnarray*}

Therefore we can state the optimality equation as

\begin{eqnarray*}
\vec{v} = \max_{d\in \Pi^{MD}}\{\vec{r}_{d} + \lambda P_{d}\vec{v}\} = L\vec{v}\ .
\end{eqnarray*}
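A minimal numerical sketch of the operator $L$: one application is a single "Bellman backup", and iterating it converges to a vector satisfying $v = Lv$. The MDP below is hypothetical, with illustrative numbers:

```python
import numpy as np

# Hypothetical MDP (illustrative numbers).
# P[a][s][j] = p(j | s, a),  r[s][a] = r(s, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [2.0, 0.5]])
lam = 0.9                  # discount factor lambda

def L(v):
    """Lv = max over deterministic decision rules of r_d + lam * P_d v."""
    return (r + lam * np.einsum('asj,j->sa', P, v)).max(axis=1)

v = np.zeros(2)
for _ in range(500):       # repeated application of L
    v = L(v)

# v now satisfies the optimality equation v = Lv (up to numerical error)
assert np.allclose(v, L(v))
print(v)
```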

It still remains to be shown that these indeed are optimality equations, and that the fixed point of the operator is the optimal return value.

Theorem 5.5   Let $v_{\lambda}^{*}$ be the optimal return value with parameter $1 > \lambda \geq 0$. Then:
(1) If $v \geq Lv$ then $v \geq v_{\lambda}^{*}$.
(2) If $v \leq Lv$ then $v \leq v_{\lambda}^{*}$.
(3) If $Lv = v$ then $v = v_{\lambda}^{*}$.

Proof: We start by proving (1). By assumption,

\begin{eqnarray*}
v & \geq & \max_{d\in \Pi^{MD}}\{r_{d} + \lambda P_{d}v\} \\
  & = & \max_{d\in \Pi^{MR}}\{r_{d} + \lambda P_{d}v\}\ \ \ \ (by\ Theorem\ 5.4)
\end{eqnarray*}

This implies that for any policy d,

\begin{eqnarray*}
v & \geq & r_{d} + \lambda P_{d}v \\
  & \geq & r_{d} + \lambda P_{d}r_{d} + (\lambda P_{d})^{2}v \\
  & \geq & \cdots \geq \sum_{i=0}^{n-1}(\lambda P_{d})^{i}r_{d} + (\lambda P_{d})^{n}v \ \ \ \ (where\ P_{d}^{0} = I)
\end{eqnarray*}

For a given policy d,

\begin{eqnarray*}
v_{\lambda}^{d} = \sum_{i=0}^{\infty}(\lambda P_{d})^{i}r_{d}\ .
\end{eqnarray*}

By subtraction

\begin{eqnarray*}
v - v_{\lambda}^{d} \geq (\lambda P_{d})^{n}v -
\sum_{i=n}^{\infty}(\lambda P_{d})^{i}r_{d}\ .
\end{eqnarray*}

Since $\Vert\lambda^{n}P_{d}^{n}v\Vert \leq \lambda^{n}\Vert v\Vert$ and $\lambda < 1$, for any $\epsilon > 0$ there exists an N > 0 such that for all n > N we have $\Vert\lambda^{n}P_{d}^{n}v\Vert \leq \frac{\epsilon}{2}$.
Since $\vert\vec{r}_d\vert < M$ we can write:

\begin{eqnarray*}
-\frac{\lambda^{n}M}{1-\lambda}\cdot\vec{1} \leq
\sum_{k=n}^{\infty}\lambda^{k}P_{d}^{k}r_{d} \leq
\frac{\lambda^{n}M}{1-\lambda}\cdot\vec{1}\ .
\end{eqnarray*}

For a large enough n we have

\begin{eqnarray*}
\Vert\sum_{k=n}^{\infty}\lambda^{k}P_{d}^{k}r_{d}\Vert \leq
\frac{\epsilon}{2}\ .
\end{eqnarray*}

Combining the two bounds, for large enough n,

\begin{eqnarray*}
\forall s\in S \ \ \ v(s) \geq v_{\lambda}^{d}(s) - \epsilon\ ,
\end{eqnarray*}

and for all d

\begin{eqnarray*}
v \geq v_{\lambda}^{d} - \epsilon\ .
\end{eqnarray*}


Taking the maximum over all $d \in \Pi^{MR}$,

\begin{eqnarray*}
v \geq \max_{d\in \Pi^{MR}} \{ v_{\lambda}^{d}\} - \epsilon =
v_{\lambda}^{*} - \epsilon\ .
\end{eqnarray*}

As this is true $\forall \epsilon > 0$,

\begin{eqnarray*}
v \geq v_{\lambda}^{*}\ .
\end{eqnarray*}

(If there were a state s with $v(s) < v_{\lambda}^{*}(s)$, we could pick $\epsilon = \frac{v_{\lambda}^{*}(s) - v(s)}{2}$ and reach a contradiction.)
We now prove (2). Since $v \leq Lv$ and the maximum in $Lv$ is attained by some deterministic rule, there exists a policy d such that

\begin{eqnarray*}
v \leq r_{d} + \lambda P_{d}v\ .
\end{eqnarray*}

Since $(I - \lambda P_{d})^{-1} = \sum_{i=0}^{\infty}(\lambda P_{d})^{i}$ is nonnegative, multiplying by it preserves the inequality; hence by theorem [*],

\begin{eqnarray*}
v \leq (I - \lambda P_{d})^{-1}r_{d} = v_{\lambda}^{d}\ .
\end{eqnarray*}


Therefore

\begin{eqnarray*}
v \leq \max_{d} \{ v_{\lambda}^{d}\} \leq v_{\lambda}^{*}\ .
\end{eqnarray*}

Part (3) follows immediately from parts (1) and (2). $\Box$
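As a numerical sanity check of the identity used in the proof of (2), namely $v_{\lambda}^{d} = \sum_{i=0}^{\infty}(\lambda P_{d})^{i}r_{d} = (I-\lambda P_{d})^{-1}r_{d}$, one can compare the truncated series against the closed form. The fixed policy below is hypothetical (illustrative numbers):

```python
import numpy as np

# Hypothetical fixed policy d (illustrative numbers).
P_d = np.array([[0.9, 0.1],    # P_d[s][j] = p(j | s, d(s))
                [0.2, 0.8]])
r_d = np.array([1.0, 0.5])     # r_d[s] = r(s, d(s))
lam = 0.9

# Truncated series  sum_{i=0}^{n-1} (lam P_d)^i r_d
v_series = np.zeros(2)
term = r_d.copy()              # the i = 0 term, since P_d^0 = I
for _ in range(1000):
    v_series += term
    term = lam * (P_d @ term)  # next term of the series

# Closed form  (I - lam P_d)^{-1} r_d
v_closed = np.linalg.solve(np.eye(2) - lam * P_d, r_d)

assert np.allclose(v_series, v_closed)
print(v_closed)
```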

Yishay Mansour