2. Repeat until t=1
The next lemma states the correction of the algorithm.
Proof:We will use a reversed induction to show that equals the expected return of given history ht.
Basis: for t=N the lemma is obviously true (by the definition of V).
Induction Step: can be written as:
Xt=st is known from ht
Yt=dt(ht) is known from since the policy is deterministic.
(2) and at stage 2 of the algorithm we calculate exactly that expectation. Therefore, by using the induction hypothesis, it holds also for t. This implies that at the end, we have computed the expected return of from the start state, which is by definition .the policy was a stochastic one ( ), we would change stage 2 of the algorithm and get: