
First Visit

We can improve the approximation by updating the reward estimates of many states within each single run.

for each $s \in S$:

Run $\pi$ from $s$ for $m$ times, where the $i$-th run is $T_i$.
For each run $T_i$ and each state $s$ appearing in it, let $r(s,T_i)$ be the reward accumulated by $\pi$ in run $T_i$ from the first appearance of $s$ in $T_i$ until the run ends (reaching the terminal state $s_0$).
Let the estimated reward of policy $\pi$ starting from $s$ be: $\hat{V}^{\pi}(s) = Avg_{i : s \in T_{i}}(r(s,T_{i}))$
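The procedure above can be sketched in Python as follows. This is a minimal sketch, not the notes' own code: the trajectory interface run_policy (returning a list of (state, reward) pairs for a run of $\pi$ until the terminal state) and the toy two-state chain are assumptions introduced here for illustration.

```python
def first_visit_mc(run_policy, states, m):
    """First-visit Monte Carlo estimate of V^pi.

    run_policy(s) is assumed to return a trajectory [(state, reward), ...]
    produced by following pi from s until the terminal state s0.
    """
    totals = {s: 0.0 for s in states}
    counts = {s: 0 for s in states}
    for s in states:
        for _ in range(m):                       # m independent runs from s
            traj = run_policy(s)
            # reward-to-go from each position: suffix sums of the rewards
            togo = [0.0] * len(traj)
            suffix = 0.0
            for i in range(len(traj) - 1, -1, -1):
                suffix += traj[i][1]
                togo[i] = suffix
            seen = set()
            for i, (state, _) in enumerate(traj):
                if state not in seen:            # first appearance of state in T_i
                    seen.add(state)
                    totals[state] += togo[i]     # r(state, T_i)
                    counts[state] += 1
    return {s: totals[s] / counts[s] for s in states if counts[s]}

# Hypothetical deterministic chain: a -> b -> s0, reward 1 per step.
def run_policy(s):
    if s == 'a':
        return [('a', 1.0), ('b', 1.0)]
    return [('b', 1.0)]

V = first_visit_mc(run_policy, ['a', 'b'], m=3)
print(V)  # {'a': 2.0, 'b': 1.0}
```

Note that the runs started from 'a' also supply first-visit samples for 'b', which is exactly how this method extracts more samples per run than the naive approach.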

The random variables $r(s,T_i)$, for a given state $s$ and different runs $T_i$, are independent, since different runs are independent. The improvement over the naive approach comes from increasing the number of samples without increasing the number of runs, which gives a smaller estimation error.

Yishay Mansour