Optimality Equations

For a finite horizon we gave the following equations:

\begin{eqnarray*}
v_{n}(s) = \max_{a \in A_{s}}\{r(s,a) + \sum_{j\in S}p(j\mid s,a)v_{n+1}(j)\}\ .
\end{eqnarray*}
For a discounted infinite horizon we have similar equations. We will show that the optimality equations are:

\begin{eqnarray*}
v(s) = \max_{a \in A_{s}}\{r(s,a) + \lambda\sum_{j\in S}p(j\mid s,a)v(j)\}\ .
\end{eqnarray*}

First we show that maximizing over deterministic policies and maximizing over stochastic policies yield the same value.

Theorem 5.4   For all $\vec{v}$ and $1 > \lambda \geq 0$,

\begin{eqnarray*}
\max_{d\in \Pi^{MD}}\{r_{d} + \lambda P_{d}\vec{v}\} =
\max_{d\in\Pi^{MR}}\{r_{d} + \lambda P_{d}\vec{v}\}\ .
\end{eqnarray*}

Proof: Since $\Pi^{MD} \subseteq \Pi^{MR}$, the right-hand side of the equality is at least as large as the left.
We now show that the left-hand side is at least as large as the right.
For $\pi \in \Pi^{MR}$ and $\vec{v}$ we define

\begin{eqnarray*}
\forall s\in S \ \ \ \ w_{s}(a) = r(s,a) + \lambda\sum_{j\in S} p(j\mid s,a)v(j)\ .
\end{eqnarray*}

Fix a state $s\in S$, and let $q_{\pi}(a)$ denote the probability that $\pi$ chooses action $a$ in $s$. The value of $\pi$ at $s$ is a weighted average of the $w_{s}(a)$, so

\begin{eqnarray*}
\max_{a\in A_{s}}\{w_{s}(a)\} \geq \sum_{a\in A_{s}}q_{\pi}(a)w_{s}(a)\ .
\end{eqnarray*}


Since this holds in every state,

\begin{eqnarray*}
\max_{d\in \Pi^{MD}}\{r_{d} + \lambda P_{d}\vec{v} \} \geq r_{\pi}
+ \lambda P_{\pi}\vec{v}
\end{eqnarray*}

for any $\pi \in \Pi^{MR}$.
This shows that the left hand side in the theorem is at least as large as the right hand side, which completes the proof. $\Box$

Let us define the non-linear operator L:

\begin{eqnarray*}
L\vec{v} = \max_{d\in \Pi^{MD}}\{\vec{r}_{d} + \lambda P_{d}\vec{v}\}\ .
\end{eqnarray*}

Therefore we can state the optimality equation as

\begin{eqnarray*}
\vec{v} = \max_{d\in \Pi^{MD}}\{\vec{r}_{d} + \lambda P_{d}\vec{v}\} = L\vec{v}\ .
\end{eqnarray*}
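A minimal numerical sketch of the operator $L$: one application is a single "Bellman backup", and iterating it converges to a vector satisfying $v = Lv$. The MDP below is hypothetical, with illustrative numbers:

```python
import numpy as np

# Hypothetical MDP (illustrative numbers).
# P[a][s][j] = p(j | s, a),  r[s][a] = r(s, a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [2.0, 0.5]])
lam = 0.9                  # discount factor lambda

def L(v):
    """Lv = max over deterministic decision rules of r_d + lam * P_d v."""
    return (r + lam * np.einsum('asj,j->sa', P, v)).max(axis=1)

v = np.zeros(2)
for _ in range(500):       # repeated application of L
    v = L(v)

# v now satisfies the optimality equation v = Lv (up to numerical error)
assert np.allclose(v, L(v))
print(v)
```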

It still remains to be shown that these indeed are optimality equations, and that the fixed point of the operator is the optimal return value.

Theorem 5.5   Let $v_{\lambda}^{*}$ be the optimal return value with parameter $1 > \lambda \geq 0$. Then:
(1) If $v \geq Lv$ then $v \geq v_{\lambda}^{*}$.
(2) If $v \leq Lv$ then $v \leq v_{\lambda}^{*}$.
(3) If $Lv = v$ then $v = v_{\lambda}^{*}$.

Proof: We start by proving (1). By assumption,

\begin{eqnarray*}
v & \geq & \max_{d\in \Pi^{MD}}\{r_{d} + \lambda P_{d}v\} \\
  & = & \max_{d\in \Pi^{MR}}\{r_{d} + \lambda P_{d}v\}\ \ \ \ (by\ Theorem\ 5.4)
\end{eqnarray*}

This implies that for any policy d,

\begin{eqnarray*}
v & \geq & r_{d} + \lambda P_{d}v \\
  & \geq & r_{d} + \lambda P_{d}r_{d} + (\lambda P_{d})^{2}v \\
  & \geq & \cdots \geq \sum_{i=0}^{n-1}(\lambda P_{d})^{i}r_{d} + (\lambda P_{d})^{n}v \ \ \ \ (where\ P_{d}^{0} = I)
\end{eqnarray*}

For a given policy d,

\begin{eqnarray*}
v_{\lambda}^{d} = \sum_{i=0}^{\infty}(\lambda P_{d})^{i}r_{d}\ .
\end{eqnarray*}

By subtraction

\begin{eqnarray*}
v - v_{\lambda}^{d} \geq (\lambda P_{d})^{n}v -
\sum_{i=n}^{\infty}(\lambda P_{d})^{i}r_{d}\ .
\end{eqnarray*}

Since $\Vert\lambda^{n}P_{d}^{n}v\Vert \leq \lambda^{n}\Vert v\Vert$ and $\lambda < 1$, for any $\epsilon > 0$ there exists an N > 0 such that for all n > N we have $\Vert\lambda^{n}P_{d}^{n}v\Vert \leq \frac{\epsilon}{2}$.
Since $\vert\vec{r}_d\vert < M$ we can write:

\begin{eqnarray*}
-\frac{\lambda^{n}M}{1-\lambda}\cdot\vec{1} \leq
\sum_{k=n}^{\infty}\lambda^{k}P_{d}^{k}r_{d} \leq
\frac{\lambda^{n}M}{1-\lambda}\cdot\vec{1}\ .
\end{eqnarray*}

For a large enough n we have

\begin{eqnarray*}
\Vert\sum_{k=n}^{\infty}\lambda^{k}P_{d}^{k}r_{d}\Vert \leq
\frac{\epsilon}{2}\ .
\end{eqnarray*}

Combining the two bounds, for large enough n,

\begin{eqnarray*}
\forall s\in S \ \ \ v(s) \geq v_{\lambda}^{d}(s) - \epsilon\ ,
\end{eqnarray*}

and for all d

\begin{eqnarray*}
v \geq v_{\lambda}^{d} - \epsilon\ .
\end{eqnarray*}


Taking the maximum over all $d \in \Pi^{MR}$,

\begin{eqnarray*}
v \geq \max_{d\in \Pi^{MR}} \{ v_{\lambda}^{d}\} - \epsilon =
v_{\lambda}^{*} - \epsilon\ .
\end{eqnarray*}

As this is true $\forall \epsilon > 0$,

\begin{eqnarray*}
v \geq v_{\lambda}^{*}\ .
\end{eqnarray*}

(If there were a state s with $v(s) < v_{\lambda}^{*}(s)$, we could pick $\epsilon = \frac{v_{\lambda}^{*}(s) - v(s)}{2}$ and reach a contradiction.)
We now prove (2). Since $v \leq Lv$ and the maximum in $Lv$ is attained by some deterministic rule, there exists a policy d such that

\begin{eqnarray*}
v \leq r_{d} + \lambda P_{d}v\ .
\end{eqnarray*}

Since $(I - \lambda P_{d})^{-1} = \sum_{i=0}^{\infty}(\lambda P_{d})^{i}$ is nonnegative, multiplying by it preserves the inequality; hence by theorem [*],

\begin{eqnarray*}
v \leq (I - \lambda P_{d})^{-1}r_{d} = v_{\lambda}^{d}\ .
\end{eqnarray*}


Therefore

\begin{eqnarray*}
v \leq \max_{d} \{ v_{\lambda}^{d}\} \leq v_{\lambda}^{*}\ .
\end{eqnarray*}

Part (3) follows immediately from parts (1) and (2). $\Box$
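As a numerical sanity check of the identity used in the proof of (2), namely $v_{\lambda}^{d} = \sum_{i=0}^{\infty}(\lambda P_{d})^{i}r_{d} = (I-\lambda P_{d})^{-1}r_{d}$, one can compare the truncated series against the closed form. The fixed policy below is hypothetical (illustrative numbers):

```python
import numpy as np

# Hypothetical fixed policy d (illustrative numbers).
P_d = np.array([[0.9, 0.1],    # P_d[s][j] = p(j | s, d(s))
                [0.2, 0.8]])
r_d = np.array([1.0, 0.5])     # r_d[s] = r(s, d(s))
lam = 0.9

# Truncated series  sum_{i=0}^{n-1} (lam P_d)^i r_d
v_series = np.zeros(2)
term = r_d.copy()              # the i = 0 term, since P_d^0 = I
for _ in range(1000):
    v_series += term
    term = lam * (P_d @ term)  # next term of the series

# Closed form  (I - lam P_d)^{-1} r_d
v_closed = np.linalg.solve(np.eye(2) - lam * P_d, r_d)

assert np.allclose(v_series, v_closed)
print(v_closed)
```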

Yishay Mansour