
Every Visit

We can hope to improve the approximation by using every appearance of state $s$ in each run, rather than only the first one.

for each $s\in S$ :

Run $\pi$ from $s$ for $m$ times, where the $i$-th run is $T_i$.
For each run $T_i$ and each appearance of $s$ in it, let $r(s,T_i,j)$ be the reward of $\pi$ in run $T_i$ from the $j$-th appearance of $s$ in $T_i$ until the run ends (reaching state $s_0$).
Let the reward of policy $\pi$ starting from $s$ be: $\hat{V}^{\pi}(s) = Avg(r(s,T_{i},j))$, where the average is taken over all runs $i$ and all appearances $j$.
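
The averaging step above can be sketched in Python. This is only an illustrative sketch under assumptions not stated in the notes: each run is represented as a list of (state, immediate reward) pairs that stops just before the terminal state $s_0$, the reward of a run segment is the plain sum of its immediate rewards, and the function name every_visit_value is hypothetical.

```python
def every_visit_value(s, runs):
    """Every-visit estimate of the reward of the policy starting from s.

    runs: list of runs T_i, each a list of (state, reward) pairs, where
    reward is the immediate reward collected at that step (assumed format).
    """
    samples = []  # one entry r(s, T_i, j) per appearance j of s in run T_i
    for run in runs:
        reward_to_end = 0.0
        # Walk the run backwards so that reward_to_end is always the
        # total reward from the current step until the run ends.
        for state, reward in reversed(run):
            reward_to_end += reward
            if state == s:
                samples.append(reward_to_end)
    if not samples:
        raise ValueError("state never appears in the given runs")
    return sum(samples) / len(samples)


# Two hand-made runs starting from state "a" (made-up numbers).
runs = [
    [("a", 1.0), ("b", 0.0), ("a", 2.0)],  # "a" appears twice: rewards 3 and 2
    [("a", 3.0)],                          # "a" appears once: reward 3
]
print(every_visit_value("a", runs))  # average of 3, 2 and 3
```

Walking each run backwards keeps a running total, so all appearances of $s$ in a run are handled in one pass.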

The problem is that the random variables $r(s,T_i,j)$ are dependent for different $j$'s within the same run, since they all include the reward of the final part of that run.
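
To see the dependence concretely, assume (this notation is not in the notes) that the reward of a run segment is the sum of immediate rewards $r_t$, and that $t_j$ denotes the time step of the $j$-th appearance of $s$ in $T_i$. Then consecutive samples from the same run satisfy

$r(s,T_i,j) = \sum_{t=t_j}^{t_{j+1}-1} r_t + r(s,T_i,j+1)$

so $r(s,T_i,j+1)$ is contained in $r(s,T_i,j)$ and the two cannot be treated as independent samples.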

Yishay Mansour