First Visit

For each state $s$:

1. Run the policy $\pi$ from $s$ for $m$ times, where the $i$-th run is $T_i$.
2. For each run $T_i$ and state $s$ in it, let $r(s, T_i)$ be the reward of $\pi$ in run $T_i$ from the first appearance of $s$ in $T_i$ until the run ends (reaching state $s_0$).
3. Let the estimated reward of policy $\pi$ starting from $s$ be:

$$\hat{V}^{\pi}(s) = \frac{1}{m}\sum_{i=1}^{m} r(s, T_i)$$

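The three steps above can be sketched in Python. The chain MDP below (states, rewards, and the `run_policy` helper) is a hypothetical toy example, not from the source: each state moves deterministically one step toward the terminal state $s_0 = 0$ with reward 1 per step, so the true value of state $k$ is $k$.

```python
def run_policy(s):
    """Simulate one run T_i of the policy from state s.
    Returns the trajectory as a list of (state, reward) pairs."""
    traj = []
    while s != 0:                 # s0 = 0 is the terminal state
        traj.append((s, 1))       # reward 1 for each step taken
        s -= 1                    # deterministic transition toward s0
    return traj

def first_visit_reward(s, traj):
    """r(s, T_i): total reward from the first appearance of s to the end."""
    for idx, (state, _) in enumerate(traj):
        if state == s:
            return sum(r for _, r in traj[idx:])
    return None                   # s did not appear in this run

def first_visit_estimate(s, m):
    """Step 3: average r(s, T_i) over m independent runs started from s."""
    samples = [first_visit_reward(s, run_policy(s)) for _ in range(m)]
    return sum(samples) / m

print(first_visit_estimate(2, 10))  # deterministic chain: every sample is 2
```

Since the toy chain is deterministic, the estimate is exact; with stochastic transitions or rewards, the average over the $m$ runs converges to $V^{\pi}(s)$ as $m$ grows.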
The random variables $r(s, T_i)$, for a given state $s$ and different $T_i$'s, are independent, since different runs are independent. The improvement comes from increasing the number of samples without increasing the number of runs, which gives a smaller estimation error.
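To illustrate why the number of samples grows without extra runs: under the first-visit rule, a single trajectory contributes one sample $r(s, T_i)$ for every distinct state it passes through. A sketch using a made-up `(state, reward)` trajectory (the states and rewards are illustrative assumptions):

```python
def first_visit_samples(traj):
    """Map each state in a run to its reward from its first appearance
    until the end of the run."""
    # suffix sums: total reward from position idx to the end of the run
    total_from = [0] * (len(traj) + 1)
    for idx in range(len(traj) - 1, -1, -1):
        total_from[idx] = traj[idx][1] + total_from[idx + 1]
    samples = {}
    for idx, (state, _) in enumerate(traj):
        if state not in samples:          # keep the first visit only
            samples[state] = total_from[idx]
    return samples

traj = [('a', 1), ('b', 0), ('a', 2), ('c', 3)]   # one hypothetical run
print(first_visit_samples(traj))   # {'a': 6, 'b': 5, 'c': 3}
```

One run of length four here yields a first-visit sample for three different states, so $m$ runs provide roughly $m$ samples per visited state at the simulation cost of $m$ runs.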