
Policy Sampling

The conclusion from the above equivalence is that if we can compute $\frac{D_{2}(X)}{D_{1}(X)}$, then we are
able to "transform" samples from distribution $D_{1}$ into samples from distribution $D_{2}$.
We produce samples $T_{1}, \ldots, T_{m}$, with each $T_{j} \in \{S,A,R\}^{*}$, using policy $\pi_{1}$. Each sample $T_{j}$ is a run on the model using policy $\pi_{1}$, i.e. $T_{j} = s_{1},a_{1},r_{1},s_{2},a_{2},r_{2},\ldots$
The probability of generating $T_{j}$ is a product of two independent probabilities: a policy-dependent probability of the actions and a model-dependent probability of the transitions.

$Prob[T_{j}] = \prod_{i=1}^{\vert T_{j}\vert}(\pi_{1}(s_{i},a_{i})*Prob(s_{i+1}\vert s_{i},a_{i})) =
(\prod_{i=1}^{\vert T_{j}\vert}\pi_{1}(s_{i},a_{i}))*(\prod_{i=1}^{\vert T_{j}\vert}Prob(s_{i+1}\vert s_{i},a_{i}))$
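
The factorization can be checked numerically on a toy example. The following is a minimal Python sketch, not part of the original notes: the two-state, two-action model, the dictionaries `policy` and `model`, and the function `trajectory_factors` are all hypothetical and chosen only for illustration. Rewards are left out, as in the formula.

```python
import math

# Hypothetical toy problem (assumption, for illustration only).
# policy[s][a]            plays the role of pi_1(s, a)
# model[(s, a)][s_next]   plays the role of Prob(s_next | s, a)
policy = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
model = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.3, 1: 0.7},
}


def trajectory_factors(steps, policy, model):
    """Return (policy factor, model factor) for a run.

    steps is a list of (s, a, s_next) triples, one per time step.
    """
    policy_part = math.prod(policy[s][a] for s, a, _ in steps)
    model_part = math.prod(model[(s, a)][s_next] for s, a, s_next in steps)
    return policy_part, model_part


# One run: s1=0, a1=0, s2=0, a2=1, s3=1, a3=1, s4=0
steps = [(0, 0, 0), (0, 1, 1), (1, 1, 0)]

# Probability computed step by step, mixing policy and model terms.
joint = 1.0
for s, a, s_next in steps:
    joint *= policy[s][a] * model[(s, a)][s_next]

policy_part, model_part = trajectory_factors(steps, policy, model)
print(joint, policy_part * model_part)
```

Both printed values agree up to floating-point rounding, which is the content of the second equality above.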

We calculate the ratio of the probabilities of generating the same $T_{j}$ under different policies:

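$\frac{Prob_{\pi_{2}}[T_{j}]}{Prob_{\pi_{1}}[T_{j}]} =
\frac {(\prod_{i=1}^{\vert T_{j}\vert}\pi_{2}(s_{i},a_{i}))*(\prod_{i=1}^{\vert T_{j}\vert}Prob(s_{i+1}\vert s_{i},a_{i}))}{(\prod_{i=1}^{\vert T_{j}\vert}\pi_{1}(s_{i},a_{i}))*(\prod_{i=1}^{\vert T_{j}\vert}Prob(s_{i+1}\vert s_{i},a_{i}))} =
\prod_{i=1}^{\vert T_{j}\vert}\frac{\pi_{2}(s_{i},a_{i})}{\pi_{1}(s_{i},a_{i})}$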

The important fact is that the ratio does not depend on the model, but only on the policies. Therefore we can compute it without the model.
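
To make the model-free computation concrete, here is a minimal Python sketch of the ratio for a single run. It is an illustration under stated assumptions, not the notes' own algorithm: the function name `trajectory_ratio`, the representation of a policy as a nested dictionary `pi[s][a]`, and the example numbers are all hypothetical. It also assumes $\pi_{1}(s_{i},a_{i}) > 0$ for every step of the run, which holds whenever the run was actually generated by $\pi_{1}$.

```python
def trajectory_ratio(trajectory, pi_1, pi_2):
    """Ratio Prob_{pi_2}[T] / Prob_{pi_1}[T] for one run T.

    trajectory is a list of (s, a, r) triples. Only the states and
    actions are used; no transition probabilities are needed.
    """
    ratio = 1.0
    for s, a, _r in trajectory:
        ratio *= pi_2[s][a] / pi_1[s][a]
    return ratio


# Hypothetical policies over two states and two actions (assumption).
pi_1 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
pi_2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}

# A run s1,a1,r1,s2,a2,r2,... sampled (by assumption) with pi_1.
trajectory = [(0, 0, 1.0), (1, 1, 0.0), (0, 1, 2.0)]

ratio = trajectory_ratio(trajectory, pi_1, pi_2)
print(ratio)
```

Note that the function never receives the transition probabilities, so it can be applied to runs collected from a model that is unknown.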

Input: Computation:


Yishay Mansour