Policy Sampling

The conclusion from the above equivalence is that if we can compute $\frac{D_{2}(X)}{D_{1}(X)}$, then we
are able to "transform" samples from distribution $D_{1}$ into samples from distribution $D_{2}$.
We produce $m$ samples $T_{1},\ldots,T_{m}$, each $T_{j} \in \{S,A,R\}^{*}$, using policy $\pi_{1}$. Each sample $T_{j}$ is a run on the model using policy $\pi_{1}$, i.e., $T_{j} = s_{1},a_{1},r_{1},s_{2},a_{2},r_{2},\ldots$
The probability of generating $T_{j}$ is a product of two factors: a policy-dependent probability of the actions and a model-dependent probability of the transitions.

$Prob[T_{j}] = \prod_{i=1}^{\vert T_{j}\vert}\left(\pi_{1}(s_{i},a_{i}) \cdot Prob(s_{i+1}\vert s_{i},a_{i})\right) = \left(\prod_{i=1}^{\vert T_{j}\vert}\pi_{1}(s_{i},a_{i})\right) \cdot \left(\prod_{i=1}^{\vert T_{j}\vert}Prob(s_{i+1}\vert s_{i},a_{i})\right)$

We calculate the ratio of the probabilities of generating the same $T_{j}$ under two different policies:

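$\frac{Prob_{\pi_{2}}[T_{j}]}{Prob_{\pi_{1}}[T_{j}]} = \frac{\left(\prod_{i=1}^{\vert T_{j}\vert}\pi_{2}(s_{i},a_{i})\right) \cdot \left(\prod_{i=1}^{\vert T_{j}\vert}Prob(s_{i+1}\vert s_{i},a_{i})\right)}{\left(\prod_{i=1}^{\vert T_{j}\vert}\pi_{1}(s_{i},a_{i})\right) \cdot \left(\prod_{i=1}^{\vert T_{j}\vert}Prob(s_{i+1}\vert s_{i},a_{i})\right)} = \prod_{i=1}^{\vert T_{j}\vert}\frac{\pi_{2}(s_{i},a_{i})}{\pi_{1}(s_{i},a_{i})}$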

The important fact is that the ratio does not depend on the model, but only on the policies: the transition probabilities appear in both the numerator and the denominator and cancel. Therefore we can compute it without knowing the model.
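
The ratio can be computed from a recorded run and the two policies alone. The following is a minimal sketch, assuming each policy is stored as a table mapping a (state, action) pair to a probability; the names `importance_weight`, `pi1`, `pi2` and the two-state example are illustrative assumptions, not part of the original notes.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Step = Tuple[State, Action, float]          # one (s_i, a_i, r_i) triple of a run
Policy = Dict[Tuple[State, Action], float]  # (s, a) -> probability of choosing a in s


def importance_weight(run: List[Step], pi1: Policy, pi2: Policy) -> float:
    """Return Prob_pi2[run] / Prob_pi1[run] as a product of per-step policy ratios.

    No transition probabilities are needed, because they cancel in the ratio.
    """
    weight = 1.0
    for s, a, _r in run:
        p1 = pi1.get((s, a), 0.0)
        if p1 == 0.0:
            # A run sampled from pi1 cannot contain an action pi1 never takes.
            raise ValueError(f"pi1 assigns zero probability to {(s, a)!r}")
        weight *= pi2.get((s, a), 0.0) / p1
    return weight


# Hypothetical policies over two states and two actions (assumed for illustration).
pi1 = {("s1", "left"): 0.5, ("s1", "right"): 0.5,
       ("s2", "left"): 0.5, ("s2", "right"): 0.5}
pi2 = {("s1", "left"): 0.75, ("s1", "right"): 0.25,
       ("s2", "left"): 0.25, ("s2", "right"): 0.75}

run = [("s1", "left", 0.0), ("s2", "right", 1.0)]
weight = importance_weight(run, pi1, pi2)   # (0.75 / 0.5) * (0.75 / 0.5) = 2.25
print(weight)
```

The rewards are carried along in each step but do not enter the weight; only the actions chosen in the visited states matter.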

EXAMPLE 2

Input:

Policy $\pi_{1}$.
Policy $\pi_{2}$, which is deterministic.

Computation:

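Because $\pi_{2}$ is deterministic, $\pi_{2}(s_{i},a_{i}) \in \{0,1\}$.
Therefore the ratio is a product of factors $X$, with $X \in \{0,\frac{1}{\pi_{1}(s,a)}\}$.
If $\pi_{1}$ is the random policy, then we simply have a uniform distribution over the runs consistent with $\pi_{2}$: every run that deviates from $\pi_{2}$ gets weight 0, and, when the random policy chooses uniformly among the same number of actions in every state, all consistent runs of the same length get the same weight.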
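
The deterministic case can be checked numerically. The sketch below assumes a uniform random sampling policy over two actions and a deterministic target policy stored as a state-to-action table; `deterministic_weight`, `uniform`, `target` and the example runs are hypothetical names and data chosen for illustration.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Step = Tuple[State, Action, float]  # one (s_i, a_i, r_i) triple of a run


def deterministic_weight(
    run: List[Step],
    pi1: Dict[Tuple[State, Action], float],
    pi2_action: Dict[State, Action],
) -> float:
    """Weight of a run sampled from pi1 when the target policy pi2 is deterministic.

    Each factor is 0 (the action differs from pi2's choice) or 1 / pi1(s, a).
    """
    weight = 1.0
    for s, a, _r in run:
        if pi2_action[s] != a:
            return 0.0            # run is inconsistent with pi2
        weight /= pi1[(s, a)]
    return weight


states = ["s1", "s2"]
actions = ["left", "right"]
# Random policy: uniform over the actions in every state (an assumption here).
uniform = {(s, a): 1.0 / len(actions) for s in states for a in actions}
target = {"s1": "left", "s2": "right"}

consistent = [("s1", "left", 0.0), ("s2", "right", 1.0), ("s1", "left", 0.0)]
inconsistent = [("s1", "left", 0.0), ("s2", "left", 0.0), ("s1", "left", 0.0)]

w_consistent = deterministic_weight(consistent, uniform, target)      # 2 * 2 * 2
w_inconsistent = deterministic_weight(inconsistent, uniform, target)  # 0
print(w_consistent, w_inconsistent)
```

With two actions per state, every consistent run of length 3 gets the same weight $2^{3}$, so after weighting, the samples that survive are spread evenly over the runs consistent with $\pi_{2}$.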

Yishay Mansour
2000-01-07