Conclusions:


1. To calculate the ratio we need no knowledge of the model, only of the two policies we use.
2. With samples $x$ drawn from $D_{1}$, weighting $F(x)$ by the ratio gives $\frac{D_{2}(x)}{D_{1}(x)}F(x)$, whose expectation equals that of $F$ under $D_{2}$. This is the case of Importance Sampling.
3. Conclusions 1 and 2 imply that samples generated by one policy can be used to estimate quantities for another policy.
4. Conclusion 3 explains why Q-learning can work.
5. The variance of the weighted estimate must be kept bounded to avoid large errors.
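
The conclusions above can be illustrated with a minimal sketch. It is not taken from the lecture: the toy distributions `d1` and `d2`, the function `f`, the two tabular policies and the helper names are all hypothetical, chosen only to make the example self-contained. The first helper computes the ratio for a trajectory from the two policies alone (the transition probabilities appear in both numerator and denominator and cancel); the second reweights samples drawn from $D_{1}$ to estimate the expectation of $F$ under $D_{2}$.

```python
import random


def trajectory_ratio(trajectory, pi_target, pi_behavior):
    """Product of pi_target(a|s) / pi_behavior(a|s) over (state, action) pairs.

    Only the two policies are consulted; the model's transition
    probabilities cancel in the ratio and are never needed.
    """
    ratio = 1.0
    for state, action in trajectory:
        ratio *= pi_target[state][action] / pi_behavior[state][action]
    return ratio


def importance_sampling_estimate(samples, d1, d2, f):
    """Estimate E_{x ~ D2}[F(x)] from samples that were drawn from D1."""
    return sum(d2[x] / d1[x] * f(x) for x in samples) / len(samples)


# Hypothetical toy distributions over {0, 1, 2} (assumption, not from the notes).
d1 = {0: 0.5, 1: 0.3, 2: 0.2}   # sampling distribution D1
d2 = {0: 0.2, 1: 0.3, 2: 0.5}   # target distribution D2


def f(x):
    return float(x * x)


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = rng.choices(list(d1), weights=list(d1.values()), k=100_000)

estimate = importance_sampling_estimate(samples, d1, d2, f)
exact = sum(d2[x] * f(x) for x in d2)  # expectation of F under D2

# Hypothetical one-state policies with two actions.
pi_behavior = {"s0": {"a": 0.5, "b": 0.5}}
pi_target = {"s0": {"a": 0.9, "b": 0.1}}
ratio = trajectory_ratio([("s0", "a"), ("s0", "b")], pi_target, pi_behavior)

# Largest single-sample weight: a rough indicator of how large the variance can get.
max_weight = max(d2[x] / d1[x] for x in d1)
```

When `d1` assigns very small probability to outcomes that `d2` favours, `max_weight` grows and the estimate becomes noisy, which is the concern raised in conclusion 5.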

Yishay Mansour
2000-01-07