Every Visit

For each state $s$:

1. Run the policy $\pi$ from $s$ for $m$ times, where the $i$-th run is $T_i$.
2. For each run $T_i$ and each appearance of state $s$ in it, let $r(s, T_i, j)$ be the reward accumulated in run $T_i$ from the $j$-th appearance of $s$ in $T_i$ until the run ends (reaching the terminal state $s_0$).
3. Let the estimated reward of policy $\pi$ starting from $s$ be the average over all appearances of $s$:

$$\hat{V}^{\pi}(s) = \frac{1}{\sum_{i=1}^{m} k_i} \sum_{i=1}^{m} \sum_{j=1}^{k_i} r(s, T_i, j),$$

where $k_i$ is the number of appearances of $s$ in run $T_i$.

The problem is that the random variables $r(s, T_i, j)$ are dependent for different $j$: within a single run, the returns from successive appearances of $s$ overlap, since the return from the $j$-th appearance contains the return from the $(j+1)$-th appearance as a suffix. For example, if a run visits $s$ twice before terminating, the first visit's return includes everything counted in the second visit's return, so the two are not independent samples.
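The procedure above can be sketched in code. This is a minimal illustration, not the source's implementation: the interface names (`step`, `reward`, `terminal`) and the demo chain below are assumptions chosen for a self-contained example.

```python
import random

def every_visit_mc(start, step, reward, terminal, m, seed=0):
    """Every-visit Monte Carlo estimate of the policy's value at `start`.

    `step(s, rng)` samples the next state under the (fixed) policy,
    `reward(s, s2)` is the reward of the transition, and each run stops
    on reaching `terminal` (the state s_0 in the notes above).
    """
    rng = random.Random(seed)
    returns = []  # one entry per appearance of `start`, over all m runs
    for _ in range(m):
        # Generate one run T_i from `start` until the terminal state.
        traj, s = [], start
        while s != terminal:
            s2 = step(s, rng)
            traj.append((s, reward(s, s2)))
            s = s2
        # r(start, T_i, j): reward from the j-th appearance of `start`
        # until the run ends, computed with a single backward pass.
        tail = 0.0
        for state, r in reversed(traj):
            tail += r
            if state == start:
                returns.append(tail)
    # Average over every appearance of `start`, as in step 3.
    return sum(returns) / len(returns)
```

As a usage sketch, a chain that stays in state `'A'` with probability 1/2 (reward 1 per step) and otherwise terminates has expected total reward 2 from `'A'`; note that the estimate uses overlapping per-visit returns, which is exactly the dependence discussed above.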