
Evaluating Policy Reward

Let $\pi$ be a policy. We would like to calculate the reward of policy $\pi$, that is, the expected reward $V^{\pi}(s)$ obtained when starting from state $s$ and following $\pi$.

For simplicity, we assume that there exists a state $s_0$ in the MDP such that $s_0$ has a reward of 0 and $\mathrm{Prob}(s_0 \mid s_0, a) = 1$ for all $a \in A_{s_0}$.
We also assume that every policy reaches state $s_0$ within a finite number of steps with probability 1.
Under these assumptions, we can treat each run as finite: once $s_0$ is reached, the run stays there and no further reward is accumulated.
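
To make the assumptions concrete, the following is a minimal sketch of how a single finite run could be simulated. The toy MDP, the names `TRANSITIONS`, `REWARDS`, `POLICY` and `run_episode`, and the choice to sum undiscounted rewards along the run are all illustrative assumptions, not part of the original notes. State 0 plays the role of $s_0$: it has reward 0 and returns to itself with probability 1.

```python
import random

ABSORBING = 0  # plays the role of s_0 (assumed toy encoding)

# Hypothetical toy MDP: TRANSITIONS[state][action] = [(next_state, probability), ...]
TRANSITIONS = {
    0: {"stay": [(0, 1.0)]},           # Prob(s_0 | s_0, a) = 1
    1: {"go": [(0, 0.5), (2, 0.5)]},
    2: {"go": [(0, 0.5), (1, 0.5)]},
}
REWARDS = {0: 0.0, 1: 1.0, 2: 2.0}     # s_0 has a reward of 0
POLICY = {0: "stay", 1: "go", 2: "go"}  # a fixed deterministic policy pi


def run_episode(start, rng):
    """Follow POLICY from `start` until s_0 is reached.

    Returns (total reward collected, number of steps taken).
    The loop ends with probability 1 because every non-absorbing
    state moves to s_0 with probability 0.5 at each step.
    """
    state, total, steps = start, 0.0, 0
    while state != ABSORBING:
        total += REWARDS[state]
        next_states, probs = zip(*TRANSITIONS[state][POLICY[state]])
        state = rng.choices(next_states, weights=probs)[0]
        steps += 1
    return total, steps


if __name__ == "__main__":
    rng = random.Random(0)
    total, steps = run_episode(1, rng)
    print(f"run from state 1: total reward {total}, length {steps}")
```

Because the reward at $s_0$ is 0, stopping the simulation on arrival at $s_0$ loses nothing: the total collected up to that point equals the total of the infinite run.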

The Naive Approach
First Visit
Every Visit
Example

Yishay Mansour
1999-12-16