
   
Large State Space

When the state space is large and we cannot compute the value function V*(s) exactly, we would like to build an approximation function $\tilde{V}(s,r) $, where r is a parameter vector. Let

\begin{displaymath}\epsilon = \min_{r} \{ \vert\vert\tilde{V}(s,r) - V^*\vert\vert \} \end{displaymath}

be the minimal distance between $\tilde{V}(s,r) $ and V*. Clearly, $\epsilon$ is a lower bound on the error of any learning algorithm that uses this architecture: if $\epsilon$ is large, we cannot expect a good approximation, regardless of the learning process. We will show later error bounds of the form $\frac{\epsilon}{(1-\lambda)^2}$ or $\frac{\epsilon}{1-\lambda}$.
These bounds might seem disappointing, since when $\lambda \rightarrow 1$ they diverge. On the other hand, if we enrich our architecture (enlarge the family of functions parameterized by r), then $\epsilon \rightarrow 0$, and for a fixed $\lambda$ the bound goes to 0.
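For example, with hypothetical values $\lambda = 0.9$ and $\epsilon = 0.1$ (chosen here for illustration only), the two bounds evaluate to

\begin{displaymath}\frac{\epsilon}{1-\lambda} = 1 \qquad \mbox{and} \qquad \frac{\epsilon}{(1-\lambda)^2} = 10, \end{displaymath}

so for a fixed $\lambda$ both bounds scale linearly with $\epsilon$, while for a fixed $\epsilon$ they blow up as $\lambda \rightarrow 1$.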
If we approximate Q*(s,a) by $\tilde{Q}(s,a,r)$, the greedy policy $\pi$ is

\begin{displaymath}\pi{(s,r)} = argmax_{a\in A_s} \{ \tilde{Q}(s,a,r) \} \end{displaymath}


If we have $\tilde{V}(s,r) $, then the greedy policy $\pi$ is,

\begin{displaymath}\pi{(s,r)} = argmax_{a\in A_s} \{ r(s,a) + \lambda E_{s^{'}}[\tilde{V}(s^{'},r)] \} \end{displaymath}


We have to approximate $E_{s^{'}}[\tilde{V}(s^{'},r)]$, so this introduces an additional approximation error.
If there are only a few successor states s', we can compute the expectation exactly; otherwise we approximate it by sampling. The resulting policy is stochastic, since each time we draw a different sample we may choose a different action, unlike the case of $\tilde{Q}$, where the greedy policy is deterministic. Both variants are sketched in the code below.
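As a minimal illustration (not part of the original notes), the following Python sketch shows both greedy policies; Q_tilde, V_tilde, reward, and sample_next_state are assumed, user-supplied functions. The version based on $\tilde{Q}$ is deterministic, while the version based on $\tilde{V}$ estimates the expectation over s' from samples and is therefore stochastic.

import numpy as np

def greedy_action_from_Q(Q_tilde, s, actions, r):
    # Deterministic greedy policy: pick the action maximizing the
    # approximate Q-value at state s.
    return max(actions, key=lambda a: Q_tilde(s, a, r))

def greedy_action_from_V(V_tilde, reward, sample_next_state, s, actions,
                         r, lam, n_samples=10, rng=None):
    # Stochastic greedy policy: the expectation E_{s'}[V_tilde(s', r)] is
    # estimated from samples, so repeated calls may return different actions.
    rng = rng if rng is not None else np.random.default_rng()
    def estimated_value(a):
        samples = [V_tilde(sample_next_state(s, a, rng), r)
                   for _ in range(n_samples)]
        return reward(s, a) + lam * float(np.mean(samples))
    return max(actions, key=estimated_value)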

The next theorem relates errors in the value function to differences in the return.

Theorem 11.1   Consider a discounted problem with discount factor $\lambda$. If V satisfies

\begin{displaymath}\epsilon = \vert\vert V^* - V \vert\vert , \end{displaymath}

and $\pi$ is a greedy policy based on V, then

\begin{displaymath}\vert\vert V^{\pi} - V^* \vert\vert \leq \frac{2\lambda\epsilon}{1-\lambda} \end{displaymath}

Furthermore, there exists $\epsilon_{0} > 0$ such that for every $\epsilon < \epsilon_{0}$ the policy $\pi$ is an optimal policy.

Proof:
Consider the operators,

\begin{displaymath}L_{\pi}V = r_{\pi} + \lambda P_{\pi}V\end{displaymath}

and

\begin{displaymath}LV = \max_{\pi} \{r_{\pi} + \lambda P_{\pi}V\}.\end{displaymath}

Since $\pi$ is greedy with respect to V we have $L_{\pi}V = LV$, and in addition $V^{\pi} = L_{\pi}V^{\pi}$ and $V^* = LV^*$. Therefore,

\begin{eqnarray*}
\vert\vert V^{\pi} - V^* \vert\vert & = & \vert\vert L_{\pi}V^{\pi} - LV^* \vert\vert \\
& \leq & \vert\vert L_{\pi}V^{\pi} - L_{\pi}V \vert\vert + \vert\vert L_{\pi}V - LV^* \vert\vert \\
& = & \vert\vert L_{\pi}V^{\pi} - L_{\pi}V \vert\vert + \vert\vert LV - LV^* \vert\vert \\
& \leq & \lambda \vert\vert V^{\pi} - V \vert\vert + \lambda \vert\vert V - V^* \vert\vert \\
& \leq & \lambda \vert\vert V^{\pi} - V^* \vert\vert + \lambda \vert\vert V^* - V \vert\vert + \lambda \vert\vert V - V^* \vert\vert
\end{eqnarray*}


Since $\vert\vert V^* - V \vert\vert = \vert\vert V - V^* \vert\vert = \epsilon$, rearranging gives $(1-\lambda)\vert\vert V^{\pi} - V^* \vert\vert \leq 2\lambda\epsilon$, and therefore

\begin{displaymath}\vert\vert V^{\pi} - V^* \vert\vert \leq \frac{2\lambda\epsilon}{1-\lambda} \end{displaymath}

For the second part, since there is only a finite number of policies, there exists $\delta > 0$ such that

\begin{displaymath}\delta = \min_{\pi\neq\pi^*} \{ \vert\vert V^{\pi} - V^* \vert\vert \} \end{displaymath}

For any $\epsilon$ such that,

\begin{displaymath}\delta > \frac{2\lambda\epsilon}{1-\lambda},\end{displaymath}


we have that

\begin{displaymath}\pi = \pi^*.\end{displaymath}

Therefore we can choose $\epsilon_{0} = \frac{\delta(1 - \lambda)}{2\lambda}.$ $\Box$
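As a sanity check (a sketch added here, not part of the original notes), the bound of Theorem 11.1 can be verified numerically on a small randomly generated MDP; all names and numbers below are ad hoc.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, lam = 5, 3, 0.9

# Random MDP: P[a, s, s'] are transition probabilities, R[s, a] rewards.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

# Value iteration to approximate V* to high accuracy.
V_star = np.zeros(nS)
for _ in range(2000):
    V_star = np.max(R + lam * np.einsum('ast,t->sa', P, V_star), axis=1)

# Perturb V* by at most eps and take the greedy policy with respect to V.
eps = 0.05
V = V_star + rng.uniform(-eps, eps, nS)
pi = np.argmax(R + lam * np.einsum('ast,t->sa', P, V), axis=1)

# Evaluate V^pi exactly: V^pi = (I - lam * P_pi)^{-1} r_pi.
P_pi = P[pi, np.arange(nS), :]
r_pi = R[np.arange(nS), pi]
V_pi = np.linalg.solve(np.eye(nS) - lam * P_pi, r_pi)

gap = np.max(np.abs(V_pi - V_star))
bound = 2 * lam * np.max(np.abs(V - V_star)) / (1 - lam)
print(f"||V^pi - V*|| = {gap:.4f}  <=  2*lam*eps/(1-lam) = {bound:.4f}")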



 
Yishay Mansour
2000-01-11