** Next:** Computational Complexity
** Up:** Return Estimation of a
** Previous:** Return Estimation of a

###

Dynamic Programming Algorithm for

1.

2. Repeat
until t=1

The next lemma states the correction of the algorithm.

**Proof:**We will use a reversed induction to show that
equals the expected return of
given history *h*_{t}.

*Basis*: for t=N the lemma is obviously true (by the definition of V).

*Induction Step*:
can be written as:

(1)

*X*_{t}=*s*_{t} is known from *h*_{t}

*Y*_{t}=*d*_{t}(*h*_{t}) is known from
since the policy is deterministic.

(2)
and at stage 2 of the algorithm we calculate exactly that expectation. Therefore, by using the induction hypothesis, it holds also for *t*. This implies that at the end, we have computed the expected return of
from the start state, which is by definition
.the policy was a stochastic one (
), we would change stage 2 of the algorithm and get:

(2)

*Yishay Mansour*

*1999-11-15*