
   
Policy Iteration Algorithm

Input: an MDP and a discount factor $\lambda\in[0,1)$
1.
Initialize: $d_0\in\Pi^{MD}$, $n\leftarrow 0$

2.
(policy evaluation)
Find $v_n$ (the value of $d_n$) by solving the linear system:
$(I - \lambda{P_{d_n}})v = r_{d_n}$

3.
(policy improvement)
Choose the next policy $d_{n+1}$ to be greedy with respect to $v_n$:
$d_{n+1}\in \arg\max_{d\in\Pi^{MD}}\{ r_d + \lambda P_d v_n \}$
Set $d_{n+1} = d_n$ whenever $d_n$ attains the maximum.
4.
If $d_{n+1} = d_n$, stop;
otherwise set $n\leftarrow n+1$ and return to step 2.
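
The four steps above can be sketched in code for a finite MDP. This is a minimal illustration and not part of the original notes: the function name policy_iteration, the array layout (P of shape actions x states x states, r of shape states x actions), and the use of NumPy are all assumptions made for the example. The maximization is done state by state, which is equivalent to maximizing over $\Pi^{MD}$ componentwise.

```python
import numpy as np

def policy_iteration(P, r, lam):
    """Hypothetical sketch of the algorithm above for a finite MDP.

    P   : array (A, S, S), P[a, s, t] = probability of moving s -> t under a
    r   : array (S, A), r[s, a] = immediate reward (assumed layout)
    lam : discount factor, assumed to lie in [0, 1)
    Returns the final deterministic policy d and its value v.
    """
    n_actions, n_states, _ = P.shape
    states = np.arange(n_states)

    # Step 1: start from an arbitrary deterministic policy d_0.
    d = np.zeros(n_states, dtype=int)

    while True:
        # Step 2 (policy evaluation): solve (I - lam * P_d) v = r_d.
        P_d = P[d, states]            # row s is P[d[s], s, :]
        r_d = r[states, d]
        v = np.linalg.solve(np.eye(n_states) - lam * P_d, r_d)

        # Step 3 (policy improvement): greedy policy with respect to v.
        q = r + lam * np.einsum("ast,t->sa", P, v)   # shape (S, A)
        d_next = q.argmax(axis=1)

        # Keep d_n in every state where it already attains the maximum,
        # so ties cannot make the policy oscillate.
        keep = np.isclose(q[states, d], q.max(axis=1))
        d_next = np.where(keep, d, d_next)

        # Step 4: stop once the policy no longer changes.
        if np.array_equal(d_next, d):
            return d, v
        d = d_next
```

The tie-breaking line implements the rule "set $d_{n+1} = d_n$ whenever possible" up to floating-point tolerance; without it, two equally good actions could alternate and the stopping test in step 4 might never fire.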



Yishay Mansour
1999-12-18