
   
Example 1

This example is an expansion of example 2 given in lecture 3.
We will first examine the values obtained under the different return functions using two specific policies:
1.
$\pi_1$ - always chooses $a_{11}$ when in state $s_1$
2.
$\pi_2$ - always chooses $a_{12}$ when in state $s_1$

  
Figure: Infinite Horizon Example
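The original figure is not reproduced here; the following sketch records the MDP it appears to depict, reconstructed from the calculations below (the transition probabilities, and the name $a_{21}$ for the single action in $s_2$, are our assumptions):

# MDP of the example, as reconstructed from the calculations below:
#   s1, a11: reward  5, stay in s1 w.p. 1/2, move to s2 w.p. 1/2
#   s1, a12: reward 10, move to s2 w.p. 1
#   s2, a21: reward -1, stay in s2 forever ("a21" is a hypothetical name)
R = {("s1", "a11"): 5, ("s1", "a12"): 10, ("s2", "a21"): -1}
P = {("s1", "a11"): {"s1": 0.5, "s2": 0.5},
     ("s1", "a12"): {"s2": 1.0},
     ("s2", "a21"): {"s2": 1.0}}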

Let us start by calculating $V^{\pi}_N$:

\begin{eqnarray*}V^{\pi_2}_N & = & 10 - (N-2) = 12 - N\\
V^{\pi_1}_N & = & 5 + 5(1-\frac{1}{2^{N-2}}) - (N-2) + (1-\frac{1}{2^{N-2}})\\
& = & 13 - N - \frac{6}{2^{N-2}}
\end{eqnarray*}


As $N \rightarrow \infty$, the gap between the two policies tends to 1 in favor of $\pi_1$.
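To make the derivation concrete, here is a minimal Python sketch that recomputes $V^{\pi}_N$ by backward induction under the dynamics sketched above; the convention that rewards accrue on the first $N-1$ steps is an assumption read off from the closed forms:

from fractions import Fraction as F

def v_pi1(n):
    # V^{pi_1}_N(s_1): always a11 in s1 (reward 5, stay in s1 w.p. 1/2).
    v1, v2 = F(0), F(0)                 # terminal values of s1, s2
    for _ in range(n - 1):              # rewards accrue on the first N-1 steps
        v1, v2 = 5 + F(1, 2) * v1 + F(1, 2) * v2, v2 - 1
    return v1

def v_pi2(n):
    # V^{pi_2}_N(s_1): always a12 in s1 (reward 10, jump to s2).
    v1, v2 = F(0), F(0)
    for _ in range(n - 1):
        v1, v2 = 10 + v2, v2 - 1
    return v1

for n in (2, 5, 10, 20):
    assert v_pi2(n) == 12 - n                        # matches 12 - N
    assert v_pi1(n) == 13 - n - F(6, 2 ** (n - 2))   # matches 13 - N - 6/2^{N-2}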
The three suggested return functions evaluate as follows (a numeric check of all three appears after the list):
1.
The expected sum of the immediate rewards: under both policies the process eventually reaches $s_2$ and accumulates a reward of $-1$ forever, so the sum diverges:

\begin{displaymath}V^{\pi_1}(s_1) = V^{\pi_2}(s_1) = -\infty\end{displaymath}

2.
Expected average reward:

\begin{eqnarray*}g^{\pi_2}(s_1) & = & \lim_{N \rightarrow \infty} \frac{12-N}{N} = -1\\
g^{\pi_1}(s_1) & = & \lim_{N \rightarrow \infty} \frac{13 - N - \frac{6}{2^{N-2}}}{N} = -1
\end{eqnarray*}
Note that the average reward criterion does not distinguish between the two policies.


3.
Expected discounted sum:

\begin{eqnarray*}V^{\pi_2}_\lambda (s_1) & = & 10 + \sum_{i=1}^\infty \lambda^i(-1) = 10 - \frac{\lambda}{1-\lambda}
\end{eqnarray*}
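As a quick numeric sanity check, the sketch below evaluates all three criteria from the closed forms derived above; the discount factor $\lambda = 0.9$ is an illustrative choice of ours, not from the notes:

v1 = lambda n: 13 - n - 6 / 2 ** (n - 2)   # closed form for V^{pi_1}_N
v2 = lambda n: 12 - n                      # closed form for V^{pi_2}_N
lam = 0.9                                  # hypothetical discount factor

# 1. Expected sum of immediate rewards: diverges to -infinity for both.
print([v2(n) for n in (10, 100, 1000)])             # [2, -88, -988]

# 2. Expected average reward: V_N / N approaches -1 for both policies.
print(v1(500) / 500, v2(500) / 500)                 # both near -1

# 3. Expected discounted sum for pi_2: series agrees with 10 - lam/(1-lam).
series = 10 + sum(lam ** i * (-1) for i in range(1, 10000))
print(series, 10 - lam / (1 - lam))                 # both ~ 1.0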



