The conclusion from the above equivalence is that if we can compute the ratio
between the probabilities that the two distributions assign to a sample, we are
able to "transform" samples from distribution D1 into samples from distribution D2.
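Here the two distributions are the distributions over runs induced by two
different policies.
Each sample Tj is a run on the model using the sampling policy,
Tj = s1, a1, r1, s2, a2, r2, ...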
The probability of generating Tj is actually a product of two independent
factors: a policy-dependent probability on the actions and a model-dependent
probability.
We calculate the ratio of the probabilities of obtaining the same Tj under
different policies. The important fact is that the ratio does not depend on the
model, but only on the policies: the model-dependent factor is the same under
both policies, so it cancels. Therefore we can compute the ratio without the model.
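The cancellation can be written out as a short sketch. The symbols here are assumptions, since the original notation is not recoverable: \pi_1 and \pi_2 stand for the two policies, P for the model's transition-and-reward probability, \mu for the initial state distribution, and T for the length of the run. It also assumes that each policy picks its action from the current state only and that \pi_2 gives positive probability to every action in the run.

```latex
\begin{align*}
\Pr[T_j \mid \pi]
  &= \mu(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)
     \prod_{t=1}^{T} P(r_t, s_{t+1} \mid s_t, a_t), \\[4pt]
\frac{\Pr[T_j \mid \pi_1]}{\Pr[T_j \mid \pi_2]}
  &= \frac{\mu(s_1) \prod_{t} \pi_1(a_t \mid s_t) \prod_{t} P(r_t, s_{t+1} \mid s_t, a_t)}
          {\mu(s_1) \prod_{t} \pi_2(a_t \mid s_t) \prod_{t} P(r_t, s_{t+1} \mid s_t, a_t)}
   = \prod_{t=1}^{T} \frac{\pi_1(a_t \mid s_t)}{\pi_2(a_t \mid s_t)}.
\end{align*}
```

Every factor involving \mu or P appears in both the numerator and the denominator, so only the action probabilities of the two policies remain.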
Therefore the ratio is a product of per-step factors X, one for each action in
the run. If a policy is the random policy, then we simply have a uniform distribution on the
runs. Consistent with .
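The model-free ratio described above could be computed along the following lines. This is a minimal sketch, not the notes' own code: the names `run_ratio`, `weighted_return_estimate`, `pi_eval` and `pi_sample` are hypothetical, policies are assumed to be stored as nested dictionaries (state, then action, then probability), and the return of a run is assumed to be the undiscounted sum of its rewards.

```python
from typing import Dict, List, Tuple

Policy = Dict[str, Dict[str, float]]   # state -> action -> probability (assumed layout)
Run = List[Tuple[str, str, float]]     # [(s1, a1, r1), (s2, a2, r2), ...]


def run_ratio(run: Run, pi_eval: Policy, pi_sample: Policy) -> float:
    """Ratio Pr[run | pi_eval] / Pr[run | pi_sample], using the policies only."""
    ratio = 1.0
    for state, action, _reward in run:
        p_sample = pi_sample[state].get(action, 0.0)
        if p_sample == 0.0:
            raise ValueError("the sampling policy could not have produced this run")
        # One per-step factor; the model's probabilities never appear.
        ratio *= pi_eval[state].get(action, 0.0) / p_sample
    return ratio


def weighted_return_estimate(runs: List[Run], pi_eval: Policy, pi_sample: Policy) -> float:
    """Average of ratio * (undiscounted return) over runs sampled with pi_sample."""
    total = 0.0
    for run in runs:
        total += run_ratio(run, pi_eval, pi_sample) * sum(r for _, _, r in run)
    return total / len(runs)


# Toy example: the sampling policy is the uniform random policy over two actions.
pi_random = {"s1": {"left": 0.5, "right": 0.5}, "s2": {"left": 0.5, "right": 0.5}}
pi_eval = {"s1": {"left": 0.9, "right": 0.1}, "s2": {"left": 0.2, "right": 0.8}}
run = [("s1", "left", 0.0), ("s2", "right", 1.0)]

ratio = run_ratio(run, pi_eval, pi_random)   # (0.9 / 0.5) * (0.8 / 0.5), about 2.88
```

With the random policy as the sampling policy, every per-step denominator is the same constant, so the ratio is determined by the evaluated policy's action probabilities alone.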