next up previous
Next: Putting It All Together Up: Proof for Phased Q-Learning Previous: Notations

   
Bounding $\left\Vert \widehat{Q}_{l_D} - Q^*\right\Vert$

After lD iterations, the algorithm Phased-Q-Learning returns a policy within $\varepsilon$ of the optimal policy, if

\begin{displaymath}\left\Vert \widehat{Q}_{l_D} - Q^*\right\Vert \leq \varepsilon
\end{displaymath} (5)

(due to the definition of Q*).
By the triangle inequality:

\begin{displaymath}\left\Vert \widehat{Q}_l - Q^*\right\Vert \leq \left\Vert \widehat{Q}_l - Q_l \right\Vert + \left\Vert Q_l - Q^* \right\Vert
\end{displaymath} (6)

Let's bound each of this values separately:

Claim 7.1  

\begin{displaymath}\forall (s,a)\in S\times A,\ 0\leq i\leq l_D\ \\
\Pr( \lef...
...widehat{v}_i(j)
\right\Vert\geq w ) \leq 2e^{-2{m_D}{w^2}}
\end{displaymath} (7)

Proof:Note that

Therefore, from Chernoff Inequality we receive that $\forall w>0$

\begin{displaymath}\forall s\in S,\ a\in A,\ 0\leq i\leq l_D\ \\
\Pr( \left\V...
...widehat{v}_i(j)
\right\Vert\geq w ) \leq 2e^{-2{m_D}{w^2}}
\end{displaymath} (8)

$\Box$

Claim 7.2   Bounding $\left\Vert \widehat{Q}_l - Q_l \right\Vert$:

Proof:First by induction let's see that:

\begin{displaymath}\left\Vert \widehat{Q}_l - Q_l \right\Vert \leq \max_{(s,a)\i...
...{l-i}(j) \right\Vert \cdot
\sum_{i=1}^{l}\gamma^i \right\}
\end{displaymath} (9)

Induction's Base: For l=0 both sides of the equation are zero, and the inequality holds.
Induction's Step: Assume the claim for l-1 and prove it for l:


Now, after we've accomplished the bound for $\left\Vert \widehat{Q}_l - Q_l \right\Vert$, let's show that:

\begin{displaymath}\Pr( \left\Vert \widehat{Q}_l - Q_l \right\Vert \geq w \cdot ...
...cdot \vert S\vert \cdot \vert A\vert \cdot 2e^{-2{m_D}{w^2}}
\end{displaymath} (10)



$\Box$

Claim 7.3   Bounding $\left\VertQ_l - Q^*\right\Vert$:

\begin{displaymath}\left\VertQ_l - Q^*\right\Vert \leq \frac{\gamma^{l-1}}{1-\gamma} \left\Vertv_1 - v_0\right\Vert
\end{displaymath} (11)

Proof:We saw in class that:

\begin{displaymath}\forall l \left\Vert v_l - v^* \right\Vert
\leq \frac{\gamma^l}{1-\gamma} \left\Vertv_1 - v_0\right\Vert
\end{displaymath} (12)

so,

\begin{eqnarray*}\left\VertQ_l - Q^*\right\Vert &=& \max_{(s,a) \in S\times A}\l...
...frac{\gamma^{l-1}}{1-\gamma} \left\Vertv_1 - v_0\right\Vert\\
\end{eqnarray*}


and the bound is received. $\Box$


next up previous
Next: Putting It All Together Up: Proof for Phased Q-Learning Previous: Notations
Yishay Mansour
2000-05-30