
   
Conclusions:

1.
To calculate the ratio we do not need any knowledge of the model,
only of the two policies we use (the transition probabilities cancel in the ratio).
2.
Weighting by the ratio, $\frac{D_{2}(x)}{D_{1}(x)} F(x)$, is exactly the case of Importance Sampling.
3.
Conclusions 1 and 2 together imply that samples generated by one policy can be used to estimate quantities for another policy.
4.
Conclusion 3 explains why Q-learning can work.
5.
The variance of the weighted samples must be bounded; otherwise the estimate can have a large error.
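
As an illustration of conclusions 2 and 3, the following is a minimal sketch of importance sampling on a toy example. The distributions `D1` and `D2`, the function `F`, and the helper `importance_sampling_estimate` are assumptions made up for this sketch and are not taken from the lecture. Samples are drawn from `D1` only, and each sample is weighted by `D2(x) / D1(x)` to estimate the expectation of `F` under `D2`.

```python
import random


def importance_sampling_estimate(f, d1, d2, samples):
    """Estimate E_{x ~ D2}[f(x)] using samples that were drawn from D1.

    d1 and d2 map each outcome x to its probability under D1 and D2.
    Every sampled x must satisfy d1[x] > 0.
    """
    total = 0.0
    for x in samples:
        weight = d2[x] / d1[x]  # the ratio D2(x) / D1(x)
        total += weight * f(x)
    return total / len(samples)


# Hypothetical toy distributions over the outcomes {0, 1, 2}.
D1 = {0: 0.5, 1: 0.3, 2: 0.2}  # distribution we can sample from
D2 = {0: 0.2, 1: 0.3, 2: 0.5}  # distribution we want to evaluate


def F(x):
    """Hypothetical quantity of interest (for example, a return)."""
    return float(x)


rng = random.Random(0)  # fixed seed so the run is repeatable
outcomes = list(D1)
samples = rng.choices(outcomes, weights=[D1[x] for x in outcomes], k=100_000)

estimate = importance_sampling_estimate(F, D1, D2, samples)
exact = sum(D2[x] * F(x) for x in D2)  # exact expectation under D2

print("importance sampling estimate:", estimate)
print("exact expectation under D2:  ", exact)
```

The weights `D2(x) / D1(x)` become large wherever `D1(x)` is small relative to `D2(x)`, which inflates the variance of the estimate; this is the concern raised in conclusion 5.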


Yishay Mansour
2000-01-07