The Naive Approach
For each state $s \in S$:
Run $\pi$ from $s$ $m$ times, where the $i$-th run is $T_i$.
Let $r_i$ be the reward of $T_i$.
Estimate the reward of policy $\pi$ starting from $s$ by:
$\hat{V}^{\pi}(s) = \frac{1}{m} \sum_{i=1}^{m} r_i$
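The steps above can be sketched in a few lines of Python, taking the estimate to be the sample mean of the $r_i$. This is only an illustration: `naive_estimate`, `toy_run`, and the three-state toy problem are hypothetical names and data, not part of the notes, and `toy_run` merely stands in for one run $T_i$ of the policy.

```python
import random

def naive_estimate(run_episode, s, m, rng):
    """Average the rewards r_1..r_m of m independent runs T_1..T_m from s."""
    rewards = [run_episode(s, rng) for _ in range(m)]
    return sum(rewards) / m

def toy_run(s, rng):
    # Hypothetical stand-in for one run of the policy from state s:
    # returns reward 1 with probability 0.3 + 0.1 * s, else 0.
    return 1.0 if rng.random() < 0.3 + 0.1 * s else 0.0

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# One estimate per state, each from its own m = 2000 runs.
estimates = {s: naive_estimate(toy_run, s, 2000, rng) for s in range(3)}
```

Each state gets its own batch of $m$ runs, so the total cost grows as $m$ times the number of states.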
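The variables $r_i$ are independent since the runs $T_i$ are independent, and each has expectation $V^{\pi}(s)$, the expected reward of $\pi$ from $s$. By Chernoff's theorem, if every $r_i$ lies in $[0,1]$:
$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^{m} r_i - V^{\pi}(s)\right| \ge \lambda\right] \le 2e^{-2\lambda^2 m}$
This implies that for $m \ge \frac{1}{2\lambda^2}\ln\frac{2}{\delta}$, with probability at least $1-\delta$ the estimate is within $\lambda$ of $V^{\pi}(s)$.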
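To get a feel for how many runs such a bound demands, here is a small sketch. It assumes rewards normalized to $[0,1]$ and the Hoeffding form of the Chernoff bound, $2e^{-2\lambda^2 m}$; both the constants and the helper names `runs_needed` and `failure_bound` are assumptions made for illustration.

```python
import math

def runs_needed(lam, delta):
    """Smallest m with 2*exp(-2*lam^2*m) <= delta (rewards assumed in [0,1])."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * lam ** 2))

def failure_bound(m, lam):
    """Upper bound on the probability that the mean of m runs is off by >= lam."""
    return 2.0 * math.exp(-2.0 * lam ** 2 * m)

# Accuracy lam = 0.1 with confidence 1 - delta = 0.95.
m = runs_needed(0.1, 0.05)  # 185 runs per state
```

The number of runs grows as $1/\lambda^2$ but only logarithmically in $1/\delta$, so tightening the accuracy is far more expensive than tightening the confidence.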