next up previous
Next: Approximate Value Iteration Up: Approximate Policy Iteration Previous: Solving the Least-Squares Problem

   
Evaluation Of Approximate Policy Iteration

In this section will make two assumptions on our approximation. (Both of them are rather strong.) Assume the algorithm outputs $V_1,\pi_1,V_2,...$ such that
1.
$\forall k \vert\vert V_k - V^{\pi_{k}} \vert\vert < \epsilon$
2.
$\forall k \vert\vert L_{\pi_{k+1}}V_{k} - LV_k\vert\vert < \delta$
For some positive scalars $\epsilon$ and $\delta$.

Theorem 11.2   The sequence of policies $\pi_k$ generated by the approximate policy iteration algorithm satisfies

\begin{displaymath}\lim_{k \rightarrow \infty}sup\vert\vert V^{\pi_{k}} - V^{*} ...
...t\vert \leq \frac{\delta + 2 \lambda \epsilon}{(1 - \lambda)^2}\end{displaymath}

Proof:We will first show that a policy generated by the policy update can not be much worse than the current policy. We will prove it by using the following lemmas

Lemma 11.3   Let $\pi$ be some policy and V is a value function satisfies $\vert\vert V - V^\pi\vert\vert \leq \epsilon$for some $\epsilon \geq 0$.
Let $\bar{\pi}$ be a policy that satisfies:

\begin{displaymath}(L_{\bar{\pi}}V)(s) \geq (LV)(s) - \delta \end{displaymath}

for some $\delta \geq 0$then

\begin{displaymath}V^{\bar{\pi}}(s) \geq V^{\pi}(s) - \frac{\delta + 2\lambda \epsilon}{(1 - \lambda)}\end{displaymath}

Proof:Let

\begin{displaymath}\beta = \max_s\{V^{\pi}(s) - V^{\bar{\pi}}(s)\}\end{displaymath}

and

\begin{displaymath}\vec{1} = (1,1,....1)^t\end{displaymath}

so

\begin{displaymath}V^{\bar{\pi}} \geq V^{\pi} - \beta * \vec{1}\end{displaymath}

Hence we will have

\begin{displaymath}V^{\bar{\pi}} = L_{\bar{\pi}}V^{\bar{\pi}}\end{displaymath}



\begin{displaymath}\geq L_{\pi}(V^{\pi} - \beta * \vec{1}) = L_{\bar{\pi}}V - (\lambda \beta)*\vec{1}\end{displaymath}


Therefore we derive,
1.
$V^{\bar{\pi}} - L_{\bar{\pi}}V + (\lambda \beta)*\vec{1} \geq 0$
>From the hypothesis of the lemma we have,
2.
$0 \geq -L_{\bar{\pi}}V +LV - \delta *\vec{1} \geq -L_{\bar{\pi}}V +L_{\pi}V - \delta *\vec{1}$
Using inequality 1, we obtain

\begin{displaymath}V^{\pi} - V^{\bar{\pi}} \leq V^{\pi} - L_{\bar{\pi}}V^{\pi} + (\lambda \beta)*\vec{1}\end{displaymath}

Using inequality 2, we obtain

\begin{displaymath}V^{\pi} - V^{\bar{\pi}} \leq V^{\pi} - L_{\bar{\pi}}V^{\pi} +...
...da \beta)*\vec{1} + L_{\bar{\pi}}V - L_{\pi}V + \delta *\vec{1}\end{displaymath}


\begin{displaymath}= (L_{\bar{\pi}}V - L_{\bar{\pi}}V^{\pi}) + (V^{\pi} - L_{\pi}V) +(\delta+\beta\lambda)*\vec{1}.. \end{displaymath}


\begin{displaymath}\leq \lambda\vert\vert V^{\pi} - V\vert\vert*\vec{1} + \lambd...
...V - V^{\pi}\vert\vert*\vec{1} + (\delta + \beta\lambda)*\vec{1}\end{displaymath}


\begin{displaymath}= 2*\lambda*\epsilon + \delta + \beta\lambda\end{displaymath}

Thus we conclude the following:

\begin{displaymath}\vert\vert V^{\pi} - V^{\bar{\pi}}\vert\vert = \beta \leq 2*\lambda*\epsilon + \delta + \beta\lambda\end{displaymath}


\begin{displaymath}\beta \leq \frac{2\lambda\epsilon + \delta}{1 - \lambda}\end{displaymath}

and the desired result follows.
$\Box$$\beta_k$ be

\begin{displaymath}\beta_k = \max_s(V^{\pi_{k+1}} - V^{\pi_{k}})\end{displaymath}

We apply the lemma above with $\pi = \pi_{k}$ and $\bar{\pi} = \pi_{k+1}$
As a result we have

\begin{displaymath}\beta_k \leq \frac{2\lambda\epsilon + \delta}{1 - \lambda}\end{displaymath}

We now let $\gamma_k$ to be distance from the optimum. $\gamma_k = \max_s(V^*(s) - V^{\pi_k}(s))$

Lemma 11.4   $\forall k$ $\gamma_{k+1} \leq \lambda\gamma_k + \lambda\beta_k + \delta + 2\lambda\epsilon$

Proof:By the definition of $\gamma_k$:

\begin{displaymath}V^{\pi_k} \geq V^* - \gamma_k.\end{displaymath}


\begin{displaymath}LV^{\pi_k} \geq L(V^* - \gamma_k) = LV^* - \lambda\gamma_k*\vec{1} \end{displaymath}

we then have (assumption 1)

\begin{displaymath}L_{\pi_{k+1}}V^{\pi_{k}} \geq L_{\pi_{k+1}}(V_k - \epsilon) = L_{\pi_{k+1}}V_k - \lambda\epsilon*\vec{1}\end{displaymath}

(assumption 2)

\begin{displaymath}L_{\pi_{k+1}}V^{\pi_{k}} \geq LV_k - \delta*\vec{1} - \lambda\epsilon*\vec{1}\end{displaymath}

(assumption 1)

\begin{displaymath}L_{\pi_{k+1}}V^{\pi_{k}} \geq L(V^{\pi_k} - \epsilon) +(-\del...
...silon)*\vec{1} = LV^{\pi_k} +(-\delta-2\lambda\epsilon)*\vec{1}\end{displaymath}

(definition of $\gamma_k$)

\begin{displaymath}L_{\pi_{k+1}}V^{\pi_{k}} \geq L(V^* - \gamma_k) +(-\delta-2\l...
...1} = V^* + (-\delta-2\lambda\epsilon + \lambda\gamma_k)*\vec{1}\end{displaymath}

Thus
(fixed pint of the operator)

\begin{displaymath}V^{\pi_{k+1}}= L_{\pi_{k+1}} V^{\pi_{k+1}}\end{displaymath}

(definition of $\beta_k$)

\begin{displaymath}V^{\pi_{k+1}} \geq L_{\pi_{k+1}}( V^{\pi_{k}} - \beta_k) = L_{\pi_{k+1}}V^{\pi_{k}} -\lambda \beta_k*\vec{1})\end{displaymath}

(using the previous)

\begin{displaymath}V^{\pi_{k+1}} \geq V^* +(-\delta-2\lambda\epsilon + \lambda\gamma_k -\lambda \beta_k )*\vec{1}\end{displaymath}

This implies that,

\begin{displaymath}V^* - V^{\pi_{k+1}} \leq ( \delta +2\lambda\epsilon + \lambda\gamma_k +\lambda\beta_k)*\vec{1}.\end{displaymath}

$\Box$will use the former lemma and the $\beta_k$ definition to prove the theorem Since one can easily see from the equation

\begin{displaymath}\beta_k \leq \frac{\delta + 2\epsilon\lambda}{1 - \lambda}\end{displaymath}

that bounding is not dependent on k. Therefore, we can define the following

\begin{displaymath}\alpha = \frac{\delta + 2\epsilon\lambda}{1 - \lambda}\lambda...
...2\epsilon\lambda = \frac{\delta + 2\epsilon\lambda}{1- \lambda}\end{displaymath}

while $\alpha$ is constant, which is not dependent on K. Then we have

\begin{displaymath}\gamma_{k+1} \leq \lambda\gamma_k + \alpha\end{displaymath}

By opening the recursion we will get

\begin{displaymath}\lambda^k\gamma_1 + \sum_{i=0}^{k-1}\lambda^i\alpha \end{displaymath}

By taking the limit superior of the equation as $k \rightarrow \infty $ to obtain

\begin{displaymath}\lim_{k \rightarrow \infty}\gamma_k = \frac{\alpha }{(1 - \lambda)^2} = \frac{\delta + 2\epsilon\lambda }{(1 - \lambda)^2}\end{displaymath}

which proves the theorem $\Box$


next up previous
Next: Approximate Value Iteration Up: Approximate Policy Iteration Previous: Solving the Least-Squares Problem
Yishay Mansour
2000-01-11