
   
Example: Running Policy Iteration Algorithm

Consider the MDP in Figure [*].
Let:

\begin{eqnarray*}\lambda& = &0.95
\end{eqnarray*}


Step 1:

\begin{eqnarray*}{d_0}({s_1})& = &a_{12}\\
{d_0}({s_2})& = &a_{21}
\end{eqnarray*}


Policy Evaluation:

\begin{eqnarray*}{v_0}({s_1})& = &10 + 0.95 v_0({s_2})\\
{v_0}({s_2})& = &-1 + 0.95 v_0({s_2})\\
& \Downarrow &\\
{v_0}({s_1}) = -9 &, &\ {v_0}({s_2}) = -20
\end{eqnarray*}
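This evaluation step amounts to solving the linear system $(I - \lambda P_{d_0})v_0 = r_{d_0}$. The following is a minimal NumPy sketch of that computation, assuming the rewards and transition probabilities implied by the equations above (action $a_{12}$: reward 10 with a deterministic move to $s_2$; action $a_{21}$: reward $-1$, staying in $s_2$); the variable names are illustrative only, not part of the original example.

import numpy as np

lam = 0.95

# Assumed model under the fixed policy d_0: d_0(s1) = a12, d_0(s2) = a21.
r_d0 = np.array([10.0, -1.0])         # r(s1, a12), r(s2, a21)
P_d0 = np.array([[0.0, 1.0],          # a12: s1 -> s2 with probability 1
                 [0.0, 1.0]])         # a21: s2 -> s2 with probability 1

# Policy evaluation: solve (I - lam * P_d0) v = r_d0.
v0 = np.linalg.solve(np.eye(2) - lam * P_d0, r_d0)
print(v0)   # approximately [ -9. -20.]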


Policy Improvement:

\begin{eqnarray*}MAX\{ 5 + 0.475v_0({s_1}) + 0.475v_0({s_2}),\ 10 + 0.95v_0({s_2}) \}
= MAX\{ -8.775,\ -9 \} = -8.775
\end{eqnarray*}


Therefore:

\begin{eqnarray*}d_1({s_1})& = &a_{11}\\
d_1({s_2})& = &a_{21}
\end{eqnarray*}
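The improvement at $s_1$ is a greedy one-step lookahead with respect to $v_0$; at $s_2$ only $a_{21}$ is available, so nothing changes there. A minimal sketch of the lookahead, again under the assumed model (action $a_{11}$: reward 5, transitions 0.5/0.5 to $s_1$/$s_2$):

lam, v0 = 0.95, (-9.0, -20.0)                       # values from the evaluation step
q_a11 = 5.0 + lam * (0.5 * v0[0] + 0.5 * v0[1])     # = -8.775
q_a12 = 10.0 + lam * v0[1]                          # = -9.0
print(max([('a11', q_a11), ('a12', q_a12)], key=lambda t: t[1]))   # ('a11', -8.775)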


Policy Evaluation:

\begin{eqnarray*}{v_1}({s_1})& = &5 + 0.475 v_1({s_1}) + 0.475 v_1({s_2})\\
{v_1}({s_2})& = &-1 + 0.95 v_1({s_2})\\
&\Downarrow&\\
v_1({s_1}) = -8.571 & , & \ v_1({s_2}) = -20
\end{eqnarray*}


The next policy improvement step yields $d_2 = d_1$, so the algorithm terminates and outputs $d_1$ as the optimal policy.
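For completeness, the whole run can be reproduced with a short policy iteration loop. This is only a sketch under the same assumed two-state model (the rewards and transition probabilities are inferred from the equations above, not taken from the figure itself), and the helper names evaluate and improve are ours:

import numpy as np

lam = 0.95
states = ['s1', 's2']
# Assumed MDP: for each state, the available actions with (reward, transition row).
model = {
    's1': {'a11': (5.0,  np.array([0.5, 0.5])),
           'a12': (10.0, np.array([0.0, 1.0]))},
    's2': {'a21': (-1.0, np.array([0.0, 1.0]))},
}

def evaluate(policy):
    # Policy evaluation: solve (I - lam * P_d) v = r_d for the fixed policy d.
    P = np.array([model[s][policy[s]][1] for s in states])
    r = np.array([model[s][policy[s]][0] for s in states])
    return np.linalg.solve(np.eye(len(states)) - lam * P, r)

def improve(v):
    # Policy improvement: greedy one-step lookahead with respect to v.
    return {s: max(model[s], key=lambda a: model[s][a][0] + lam * model[s][a][1] @ v)
            for s in states}

policy = {'s1': 'a12', 's2': 'a21'}      # d_0
while True:
    v = evaluate(policy)
    new_policy = improve(v)
    if new_policy == policy:             # d_{n+1} = d_n: terminate
        break
    policy = new_policy

print(policy, v)   # {'s1': 'a11', 's2': 'a21'}, v approximately [-8.571, -20.]

The loop performs exactly the two iterations worked out above and stops once the improvement step no longer changes the policy.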

Yishay Mansour
1999-12-18