Evaluating Policy Reward

Let $\pi$ be a policy.
We would like to calculate the reward of policy $\pi$.
For simplicity, we assume that there exists a state $s_0$ in the MDP such that $s_0$ has a reward of 0 and
$\mathrm{Prob}(s_0 \mid s_0, a) = 1$ for every action $a$, i.e., $s_0$ is absorbing.
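Also, we assume that each policy reaches state $s_0$ within a finite number of steps with probability 1.
Under these assumptions, we can assume each run is finite: once a run enters $s_0$ it stays there and collects no further reward, so the run can be cut off at that point.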
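
As an illustration of these assumptions, the following is a minimal Python sketch, not part of the original notes. It assumes a hypothetical three-state MDP in which state 0 plays the role of $s_0$, rewards are attached to states, and the policy is deterministic. The names `TRANSITIONS`, `REWARD`, `POLICY`, `run_total_reward`, and `estimate_reward` are invented for this example. Each simulated run stops when it reaches state 0, and the reward of the policy is estimated by averaging the total reward over many runs.

```python
import random

# Hypothetical toy MDP (an assumption for illustration only).
# TRANSITIONS[state][action] is a list of (next_state, probability) pairs.
TRANSITIONS = {
    0: {"stay": [(0, 1.0)]},              # Prob(s_0 | s_0, a) = 1
    1: {"go": [(0, 0.5), (2, 0.5)]},
    2: {"go": [(0, 0.5), (1, 0.5)]},
}
REWARD = {0: 0.0, 1: 1.0, 2: 2.0}         # s_0 has a reward of 0
POLICY = {0: "stay", 1: "go", 2: "go"}    # a fixed deterministic policy


def run_total_reward(start, rng):
    """Simulate one run from `start`; sum state rewards until s_0 is reached."""
    state, total = start, 0.0
    while state != 0:                     # the run ends on entering s_0
        total += REWARD[state]
        next_states, probs = zip(*TRANSITIONS[state][POLICY[state]])
        state = rng.choices(next_states, weights=probs)[0]
    return total


def estimate_reward(start, n_runs=10_000, seed=0):
    """Average the total reward of `n_runs` independent simulated runs."""
    rng = random.Random(seed)
    return sum(run_total_reward(start, rng) for _ in range(n_runs)) / n_runs
```

For example, `estimate_reward(1)` should come out close to 8/3, the exact expected total reward from state 1 in this toy example (obtained by solving $V_1 = 1 + \tfrac{1}{2} V_2$ and $V_2 = 2 + \tfrac{1}{2} V_1$). The `while` loop terminates with probability 1 because every non-absorbing state moves to state 0 with probability 1/2 at each step.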
*Yishay Mansour*

*1999-12-16*