
Evaluating Policy Reward

Let $\pi$ be a policy. We would like to calculate the reward of policy $\pi$ starting from a state $s$, denoted $V^{\pi}(s)$.

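For simplicity, we assume that there exists a state $s_0$ in the MDP such that $s_0$ has a reward of 0 and $\mathrm{Prob}(s_0 \mid s_0, a) = 1$ for every action $a$. In other words, $s_0$ is an absorbing state: once it is entered, it is never left and no further reward is collected.
We also assume that every policy reaches state $s_0$ within a finite number of steps with probability 1.
Under these assumptions, we can treat each run as finite: it ends when $s_0$ is reached.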
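
To make the assumptions concrete, here is a minimal Python sketch of sampling one finite run. Everything in it is hypothetical and chosen only for illustration: the three-state MDP, its rewards and transition probabilities, the fixed policy, and the name `sample_run`. It also assumes that the reward of a run is the undiscounted sum of the rewards of the states visited, which the text above does not specify.

```python
import random

S0 = 0  # the absorbing state s0: reward 0, self-loop with probability 1

# Hypothetical toy MDP (an assumption for illustration, not from the notes).
REWARD = {0: 0.0, 1: 1.0, 2: 2.0}
TRANSITIONS = {
    # (state, action) -> list of (next_state, probability)
    (0, "a"): [(0, 1.0)],            # Prob(s0 | s0, a) = 1
    (1, "a"): [(0, 0.5), (2, 0.5)],
    (2, "a"): [(0, 0.5), (1, 0.5)],
}
POLICY = {0: "a", 1: "a", 2: "a"}    # a fixed deterministic policy pi


def sample_run(start, rng):
    """Follow the policy from `start` until s0 is reached.

    Returns (sum of rewards collected, number of steps taken).
    The loop ends with probability 1 because s0 is reached with
    probability 1 from every state of this toy MDP.
    """
    state, total, steps = start, 0.0, 0
    while state != S0:
        total += REWARD[state]
        action = POLICY[state]
        next_states, probs = zip(*TRANSITIONS[(state, action)])
        state = rng.choices(next_states, weights=probs)[0]
        steps += 1
    return total, steps


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_run(1, rng))
```

Stopping the loop at $s_0$ loses nothing under the stated assumptions, since every later step would stay in $s_0$ and add a reward of 0.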


Yishay Mansour