## The Optimality Principle

Let us define $u_t^*(h_t) = \sup_{\pi \in \Pi^{HR}} u_t^{\pi}(h_t)$, the best achievable return from time $t$ onward given history $h_t$. For simplicity of the discussion we assume that both $S$ and $A$ are finite. The optimality equations (also known as Bellman equations) are:

$$u_t(h_t) = \max_{a \in A} \left\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a) \, u_{t+1}(h_t, a, j) \right\},$$

where $u_N(h_N) = r_N(s_N)$ for $h_N = (h_{N-1}, a_{N-1}, s_N)$.
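The optimality equations can be solved numerically by working backwards from $t = N$. Below is a minimal sketch in Python/NumPy, assuming a Markov model (rewards and transitions depend on the history only through the current state, so $u_t$ can be indexed by state); the 2-state, 2-action, horizon-3 instance and all variable names are made up for illustration:

```python
import numpy as np

N, S, A = 3, 2, 2                       # horizon, |S|, |A| (toy values)
rng = np.random.default_rng(0)
r = [rng.uniform(0, 1, size=(S, A)) for _ in range(N - 1)]          # r_t(s, a)
P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(N - 1)]  # p_t(j | s, a)
r_final = rng.uniform(0, 1, size=S)     # terminal reward r_N(s)

u = [None] * N
u[N - 1] = r_final                      # basis: u_N(h_N) = r_N(s_N)
for t in range(N - 2, -1, -1):          # backward induction
    q = r[t] + P[t] @ u[t + 1]          # q[s, a] = r_t(s, a) + sum_j p_t(j|s,a) u_{t+1}(j)
    u[t] = q.max(axis=1)                # optimality equation: maximize over actions

print(u[0])                             # optimal return from each initial state
```

Each loop iteration applies one optimality equation: form the action values $q$ and take the max over actions.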

Note that by replacing the max operation in the equation with the action taken by a policy $\pi$, we get the equation for computing the return of policy $\pi$:

$$u_t^{\pi}(h_t) = r_t(s_t, d_t(h_t)) + \sum_{j \in S} p_t(j \mid s_t, d_t(h_t)) \, u_{t+1}^{\pi}(h_t, d_t(h_t), j).$$
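This observation gives a policy-evaluation recursion: fix the policy's action at each step instead of maximizing. A minimal Python/NumPy sketch on a made-up 2-state, 2-action, horizon-3 Markov model (all names here are illustrative, not from the text):

```python
import numpy as np

N, S, A = 3, 2, 2                       # horizon, |S|, |A| (toy values)
rng = np.random.default_rng(1)
r = [rng.uniform(0, 1, size=(S, A)) for _ in range(N - 1)]          # r_t(s, a)
P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(N - 1)]  # p_t(j | s, a)
r_final = rng.uniform(0, 1, size=S)     # terminal reward r_N(s)
d = [rng.integers(0, A, size=S) for _ in range(N - 1)]  # a fixed policy: d_t(s)

u_pi = r_final.copy()                   # basis: u^pi_N = r_N
for t in range(N - 2, -1, -1):
    idx = np.arange(S)
    # u^pi_t(s) = r_t(s, d_t(s)) + sum_j p_t(j | s, d_t(s)) u^pi_{t+1}(j)
    u_pi = r[t][idx, d[t]] + P[t][idx, d[t], :] @ u_pi

print(u_pi)                             # return of the fixed policy from each state
```

The only change relative to the optimality equations is that the policy's action `d[t][s]` replaces the max over actions.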

**Theorem 3.2**   Let $u_t$ be the solution of the optimality equations for $t = 1$ up to $N - 1$. Then:

(1) $u_t(h_t) = u_t^*(h_t)$ for every $h_t \in H_t$ and every $t$;

(2) $u_1(s_1) = v_N^*(s_1)$ for every $s_1 \in S$.

**Proof:** First, we will show by induction that $u_t(h_t) \geq u_t^*(h_t)$ for $t = 1, \ldots, N$ and for every $h_t \in H_t$. Then we exhibit a specific policy $\pi^*$ for which $u_t^{\pi^*}(h_t) = u_t(h_t)$, and this will conclude the proof.

Basis: When $t = N$, by definition $u_N^{\pi}(h_N) = r_N(s_N)$ for every policy $\pi$, and therefore $u_N(h_N) = u_N^*(h_N)$.

Induction Step: Let us assume that $u_t(h_t) \geq u_t^*(h_t)$ for $t = n + 1, \ldots, N$ and every $h_t \in H_t$. Let $\pi' = (d_1', \ldots, d_{N-1}')$ be a deterministic policy (that performs the decision rule $d_n'$ on step $n$). For $t = n$:

$$\begin{aligned}
u_n^{\pi'}(h_n) &= r_n(s_n, d_n'(h_n)) + \sum_{j \in S} p_n(j \mid s_n, d_n'(h_n)) \, u_{n+1}^{\pi'}(h_n, d_n'(h_n), j) \\
&\leq r_n(s_n, d_n'(h_n)) + \sum_{j \in S} p_n(j \mid s_n, d_n'(h_n)) \, u_{n+1}(h_n, d_n'(h_n), j) \\
&\leq \max_{a \in A} \left\{ r_n(s_n, a) + \sum_{j \in S} p_n(j \mid s_n, a) \, u_{n+1}(h_n, a, j) \right\} = u_n(h_n).
\end{aligned}$$

Now consider an arbitrary policy $\pi$ that, for history $h_n$, chooses action $a$ with probability $q(a)$. By the induction assumption,

$$u_n^{\pi}(h_n) = \sum_{a \in A} q(a) \left[ r_n(s_n, a) + \sum_{j \in S} p_n(j \mid s_n, a) \, u_{n+1}^{\pi}(h_n, a, j) \right] \leq \max_{a \in A} \left\{ r_n(s_n, a) + \sum_{j \in S} p_n(j \mid s_n, a) \, u_{n+1}(h_n, a, j) \right\} = u_n(h_n).$$

Therefore $u_n(h_n) \geq u_n^{\pi}(h_n)$ for every policy $\pi$, and in particular for the optimal policy, so $u_n(h_n) \geq u_n^*(h_n)$. It is left to show that there exists a policy $\pi^*$ for which $u_n^{\pi^*}(h_n) = u_n(h_n)$. Let us define $\pi^* = (d_1^*, \ldots, d_{N-1}^*)$ in the following manner:

$$d_n^*(h_n) \in \arg\max_{a \in A} \left\{ r_n(s_n, a) + \sum_{j \in S} p_n(j \mid s_n, a) \, u_{n+1}(h_n, a, j) \right\}.$$

We will now show that $u_n(h_n) = u_n^{\pi^*}(h_n)$ using reversed induction on $n$.

Basis: For $n = N$ this is obviously true, since $u_N(h_N) = r_N(s_N) = u_N^{\pi^*}(h_N)$.

Induction Step:

$$\begin{aligned}
u_n(h_n) &= r_n(s_n, d_n^*(h_n)) + \sum_{j \in S} p_n(j \mid s_n, d_n^*(h_n)) \, u_{n+1}(h_n, d_n^*(h_n), j) \\
&= r_n(s_n, d_n^*(h_n)) + \sum_{j \in S} p_n(j \mid s_n, d_n^*(h_n)) \, u_{n+1}^{\pi^*}(h_n, d_n^*(h_n), j) \\
&= u_n^{\pi^*}(h_n).
\end{aligned}$$

The second equality is by the induction hypothesis, that $u_{n+1}(h_{n+1}) = u_{n+1}^{\pi^*}(h_{n+1})$ for $h_{n+1} \in H_{n+1}$. Therefore $u_n(h_n) = u_n^{\pi^*}(h_n)$, and since $u_n(h_n) \geq u_n^*(h_n) \geq u_n^{\pi^*}(h_n)$, it is clear that $u_n(h_n) = u_n^*(h_n)$, as was required.
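The constructive half of the proof translates directly into code: record a greedy action for each optimality equation, then check that evaluating the resulting policy reproduces $u$. A minimal Python/NumPy sketch on a made-up 2-state, 2-action, horizon-3 Markov instance (all names illustrative):

```python
import numpy as np

N, S, A = 3, 2, 2                       # horizon, |S|, |A| (toy values)
rng = np.random.default_rng(2)
r = [rng.uniform(0, 1, size=(S, A)) for _ in range(N - 1)]          # r_t(s, a)
P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(N - 1)]  # p_t(j | s, a)
r_final = rng.uniform(0, 1, size=S)     # terminal reward r_N(s)

# Backward induction, recording the greedy decision rule d*_t at each step.
u = [None] * N
d_star = [None] * (N - 1)
u[N - 1] = r_final
for t in range(N - 2, -1, -1):
    q = r[t] + P[t] @ u[t + 1]          # action values q[s, a]
    d_star[t] = q.argmax(axis=1)        # d*_t(s): an action attaining the max
    u[t] = q.max(axis=1)

# Evaluate pi* = (d*_1, ..., d*_{N-1}); the claim u_n = u^{pi*}_n says this
# recursion reproduces u.
u_pi = r_final.copy()
idx = np.arange(S)
for t in range(N - 2, -1, -1):
    u_pi = r[t][idx, d_star[t]] + P[t][idx, d_star[t], :] @ u_pi

print(np.allclose(u_pi, u[0]))
```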