
   
Finding the Constants $m_D$ and $l_D$

For parameters $w$, $m_D$, and $l_D$ such that:
1.
$\varepsilon = w \cdot \sum_{i=1}^{l_D}{\gamma^i} + \frac{\gamma^{l_D-1}}{1-\gamma} \left\Vert v_1 - v_0\right\Vert$
2.
$\delta = {l_D} \cdot \vert S\vert \cdot \vert A\vert \cdot 2e^{- 2 {m_D} {w^2}}$
we get, by the above result, that with probability at least $1-\delta$ we have reached an $\varepsilon$-optimal policy, i.e., $\Pr( \left\Vert\widehat{Q}_{l_D} - Q^*\right\Vert < \varepsilon ) > 1-\delta$.
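As a reminder of where condition (2) comes from (assuming, as in the previous section, that each estimate is an average of $m_D$ samples of a quantity bounded in $[0,1]$): Hoeffding's inequality bounds the deviation of a single such average $\widehat{X}$ (notation introduced here only for illustration), and a union bound over all $l_D \cdot \vert S\vert \cdot \vert A\vert$ estimates gives the expression for $\delta$ in (2):

\begin{displaymath}
\Pr\left( \left\vert \widehat{X} - E[\widehat{X}] \right\vert \ge w \right) \le 2e^{- 2 {m_D} {w^2}}
\end{displaymath}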
So, let's find such parameters $w$, $m_D$, and $l_D$. Note that $m_D$ and $l_D$ must be natural numbers, but it suffices to treat them as real values, since rounding up does not change the asymptotic complexity. From (1) we get:

\begin{eqnarray*}w &=& \frac{ \varepsilon - \frac{\gamma^{l_D-1}}{1-\gamma} \left\Vert v_1- v_0\right\Vert }{ \sum_{i=1}^{l_D}{\gamma^i} }\\
&=& \frac{ (1-\gamma)(\varepsilon - \frac{\gamma^{l_D-1}}{1-\gamma}\left\Vert v_1- v_0\right\Vert) }{ \gamma-\gamma^{{l_D}+1} }\\
\end{eqnarray*}
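The second equality uses the closed form of the finite geometric sum, stated here for completeness:

\begin{displaymath}
\sum_{i=1}^{l_D}{\gamma^i} = \frac{ \gamma-\gamma^{{l_D}+1} }{ 1-\gamma }
\end{displaymath}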


In order to use the Chernoff bound, we must have $w>0$:

\begin{eqnarray*}w>0 &\Leftrightarrow& \frac{ (1-\gamma)\varepsilon - \gamma^{l_D-1}\left\Vert v_1- v_0\right\Vert }{ \gamma-\gamma^{{l_D}+1} } > 0\\
&\Leftrightarrow& \gamma^{l_D-1} < \frac{ (1-\gamma)\varepsilon }{ \left\Vert v_1- v_0\right\Vert }\\
&\Leftrightarrow& l_D > \log_{\gamma}{(\frac{ (1-\gamma)\varepsilon }{ \left\Vert v_1-
v_0\right\Vert })} + 1\\
\end{eqnarray*}
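The last equivalence takes $\log_{\gamma}$ of both sides; since $0<\gamma<1$, $\log_{\gamma}$ is a decreasing function, so the inequality reverses direction:

\begin{displaymath}
0 < x < y \;\Leftrightarrow\; \log_{\gamma}{x} > \log_{\gamma}{y}
\end{displaymath}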


And we get the constraint:

\begin{displaymath}l_D > \log_{\gamma}{(\frac{ (1-\gamma)\varepsilon }{ \left\Vert v_1-
v_0\right\Vert })} + 1
\end{displaymath} (14)

This constraint is fulfilled by setting $l_D =
\log_{\gamma}{(\frac{ (1-\gamma)\varepsilon }{ \left\Vert v_1-
v_0\right\Vert })} + f$, for any $f>1$; in particular, for $f = \log_{\gamma}{2}$ ( $\log_{\gamma}{2}>1$ since $0<\gamma<1$), we get:

\begin{displaymath}
l_D = {\log_{\gamma}}{\frac{ (1-\gamma)\varepsilon }{ \left\Vert v_1-
v_0\right\Vert }} + \log_{\gamma}{2} = {\log_{\gamma}}{\frac{ 2(1-\gamma)\varepsilon }{ \left\Vert v_1-
v_0\right\Vert }}
\end{displaymath} (15)
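By the definition of $\log_{\gamma}$, this choice of $l_D$ satisfies the identity below, noted only to make the later substitution of $w$ and $l_D$ into $m_D$ easier to follow:

\begin{displaymath}
\gamma^{l_D} = \frac{ 2(1-\gamma)\varepsilon }{ \left\Vert v_1- v_0\right\Vert }
\end{displaymath}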

From (2) we get:

\begin{displaymath}\ln\delta = \ln{(l_D\cdot \vert S\vert\cdot \vert A\vert\cdot 2e^{- 2 {m_D} {w^2}})} = \ln{(2 l_D \vert S\vert\cdot \vert A\vert)} - 2 {m_D}{w^2}
\end{displaymath} (16)

Hence:

\begin{displaymath}m_D = \frac{ \ln{(2 l_D \vert S\vert \cdot \vert A\vert)} - \ln{\delta} }{ 2{w^2} } = \frac{ \ln{(\frac{2 l_D \vert S\vert \cdot \vert A\vert}{\delta})} } { 2{w^2}
}
\end{displaymath} (17)

By substituting the $w$ obtained from (1) and the $l_D$ from (15), we get:

\begin{eqnarray*}
m_D &=& \frac{ \ln{(\frac{2 l_D \vert S\vert \cdot \vert A\vert}{\delta})} }{ 2 (\frac{ (1-\gamma)(\varepsilon - \frac{\gamma^{l_D-1}}{1-\gamma}\left\Vert v_1- v_0\right\Vert) }{ \gamma-\gamma^{{l_D}+1} })^2 }\\
&=& \frac{ \ln{(\frac{2 \vert S\vert \cdot \vert A\vert \log_{\gamma}{(\frac{ 2(1-\gamma)\varepsilon }{ \left\Vert v_1-
v_0\right\Vert })} }{\delta})} }
{ 2 {(1-\gamma)}^4 \varepsilon^2 }\\
\end{eqnarray*}


Now, let's bound the number of calls to PS(M), i.e., the product $m_D \cdot l_D$:
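As a rough sketch of this multiplication (treating $\gamma$ and $\left\Vert v_1- v_0\right\Vert$ as constants absorbed into the $O(\cdot)$, and using $l_D = O(\ln{\frac{1}{\varepsilon}})$ from (15) together with the expression for $m_D$ above):

\begin{eqnarray*}
m_D \cdot l_D &=& O\left( \frac{ \ln{(\frac{ \vert S\vert \cdot \vert A\vert \ln{\frac{1}{\varepsilon}} }{\delta})} }{ \varepsilon^2 } \cdot \ln{\frac{1}{\varepsilon}} \right)\\
&=& O\left( \frac{1}{\varepsilon^2} \cdot \ln{\frac{\vert S\vert\cdot \vert A\vert}{\delta}} \cdot \ln{\frac{1}{\varepsilon}} + \frac{1}{\varepsilon^2} \cdot \ln{\frac{1}{\varepsilon}} \cdot \ln\ln{\frac{1}{\varepsilon}} \right)
\end{eqnarray*}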


So the number of calls to PS(M) that ensures, with probability at least $1-\delta$, that the policy returned by the Phased Q-Learning algorithm is within $\varepsilon$ distance from the optimal policy is $O( \frac{1}{\varepsilon^2} \cdot
\ln{\frac{\vert S\vert\cdot \vert A\vert}{\delta}} \cdot \ln{\frac{1}{\varepsilon}} + \frac{1}{\varepsilon^2} \cdot \ln{\frac{1}{\varepsilon}} \cdot \ln\ln{\frac{1}{\varepsilon}} )$
$\Box$
Yishay Mansour
2000-05-30