The Naive Approach

Next: First Visit Up: Evaluating Policy Reward Previous: Evaluating Policy Reward

The Naive Approach

for each $s\in S$ :
Run $\pi$ from s for m times, where the i-th run is T_i.
Let r_i be the reward of T_i.
Estimate the reward of policy $\pi$ starting from s, by :

$\hat{V^{\pi}}(s) = Avg(r_{i})= \frac{1}{m}\sum_{i=1}^{m}r_{i}$

The variables r_i are independent since the runs T_i are independent. By Chernoff's theorem :

$Prob[\vert\hat{V^{\pi}(s)}-V^{\pi}(s)\vert\geq \gamma]\leq 2e^{-2m\gamma^{2}}$

This implies that :

$Prob[\exists s \in S : \vert\hat{V^{\pi}(s)}-V^{\pi}(s)\vert \geq \gamma] \leq ... ...}-V^{\pi}(s)\vert \geq \gamma] \leq \vert S\vert 2e^{-2m\gamma^{2}} = \epsilon$ .

for $\gamma = \sqrt{\frac{ln(\vert S\vert)/\epsilon}{2m}}$ the above holds.

Yishay Mansour
1999-12-16