Another way to look at the MC algorithm

In lecture 7 we discussed the Monte-Carlo (MC) method for evaluating
the reward of a policy.

This method performs a number of experiments (runs) and uses the average
of the observed rewards to evaluate the policy reward.

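Another way to express the evaluation is as an incremental update:

$$V_{n+1}(s) = V_n(s) + \alpha_n \left( R_n(s) - V_n(s) \right),$$

where $R_n(s)$ is the total reward of the *n*-th run, starting from the
first visit to *s* (given that *s* was visited in this run). With step
size $\alpha_n = 1/n$, $V_{n+1}(s)$ is exactly the average of the first
*n* observed returns.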
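The averaging performed by MC can be computed one run at a time. The following is a minimal sketch, assuming the running-average update V ← V + (R_n − V)/n; the function name `mc_incremental` and the synthetic Gaussian returns are illustrative assumptions, not part of the lecture.

```python
import random


def mc_incremental(returns):
    """Fold the returns of successive runs into a running estimate.

    Applies V <- V + (1/n) * (R_n - V), which reproduces the plain
    average of the returns seen so far.
    """
    v = 0.0
    for n, r in enumerate(returns, start=1):
        v += (r - v) / n
    return v


# Hypothetical returns of 1000 runs (synthetic data, for illustration only).
random.seed(0)
sample_returns = [random.gauss(5.0, 2.0) for _ in range(1000)]

incremental = mc_incremental(sample_returns)
batch_average = sum(sample_returns) / len(sample_returns)

# The two estimates agree up to floating-point rounding.
print(abs(incremental - batch_average) < 1e-9)
```

The incremental form needs only the current estimate and the run counter, so the individual returns do not have to be stored.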

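Note that $E[R_n(s)] = V^\pi(s)$, the value of *s* under the evaluated
policy $\pi$.

We can therefore rewrite this formula as follows:

$$V_{n+1}(s) = (1-\alpha_n) V_n(s) + \alpha_n \left[ V^\pi(s) + \left( R_n(s) - V^\pi(s) \right) \right],$$

which has the general form

$$V_{n+1} = (1-\alpha_n) V_n + \alpha_n \left( H V_n + w_n \right),$$

where *H* is some nonlinear operator, $w_n$ is "noise", and
$E[w_n] = 0$.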
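To illustrate the role of the zero-mean noise, the sketch below simulates an iteration of the form V_{n+1} = (1 − α_n) V_n + α_n (H V_n + w_n) with α_n = 1/n. It assumes, for the MC case, that H maps every estimate to the true value; `TRUE_VALUE`, the Gaussian noise, and the function name `noisy_iteration` are hypothetical choices made only for this example.

```python
import random


def noisy_iteration(h, v0, sample_noise, steps):
    """Run V <- (1 - a_n) * V + a_n * (h(V) + w_n) with a_n = 1/n."""
    v = v0
    for n in range(1, steps + 1):
        alpha = 1.0 / n
        v = (1.0 - alpha) * v + alpha * (h(v) + sample_noise())
    return v


TRUE_VALUE = 5.0  # hypothetical value of the state under the policy

random.seed(1)
estimate = noisy_iteration(
    h=lambda v: TRUE_VALUE,                       # assumed operator H for MC
    v0=0.0,
    sample_noise=lambda: random.gauss(0.0, 2.0),  # zero-mean noise w_n
    steps=20000,
)

# The noise averages out, so the estimate lands close to TRUE_VALUE.
print(abs(estimate - TRUE_VALUE) < 0.2)
```

Because the noise has zero mean, its contribution shrinks as the step sizes decay, and the iterate is driven toward the value supplied by the operator.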

Recall the operator that was introduced to compute the return of a
policy. We have already shown that it is a contracting operator.
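The contraction property can be checked numerically on a small example. This is only a sketch: it assumes the operator has the standard discounted form v ↦ r + γ P v, and the two-state rewards `r`, transition matrix `p`, discount `gamma`, and test vectors are all made up for illustration.

```python
def apply_policy_operator(v, r, p, gamma):
    """Apply v -> r + gamma * P v for a fixed policy (assumed form)."""
    states = range(len(v))
    return [r[s] + gamma * sum(p[s][t] * v[t] for t in states) for s in states]


def max_norm_distance(u, v):
    """Distance between two value vectors in the max norm."""
    return max(abs(a - b) for a, b in zip(u, v))


# Hypothetical two-state example.
r = [1.0, 0.0]                   # immediate rewards under the policy
p = [[0.9, 0.1], [0.5, 0.5]]     # transition probabilities (rows sum to 1)
gamma = 0.8                      # discount factor

u = [0.0, 10.0]
v = [4.0, -3.0]

before = max_norm_distance(u, v)
after = max_norm_distance(
    apply_policy_operator(u, r, p, gamma),
    apply_policy_operator(v, r, p, gamma),
)

# Applying the operator shrinks the distance by at least a factor of gamma.
print(after <= gamma * before)
```

The reward vector cancels in the difference, so the shrinkage comes entirely from the discount factor and the fact that each row of the transition matrix sums to one.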

*Yishay Mansour*

*2000-01-06*