
Dynamic Programming Algorithm for $\pi\in\Pi^{HD}$

1. $t \leftarrow N$

$\forall h_N=(h_{N-1},a_{N-1},s_N) : U^{\pi}_N(h_N)\leftarrow r_N(s_N)$

2. Repeat $t\leftarrow t-1$ until $t=1$:

$\forall h_t=(h_{t-1},a_{t-1},s_t) \in H_t$

$U^{\pi}_t(h_t)\leftarrow r_t(s_t,d_t(h_t)) + \sum_{j\in S} P_t(j\vert s_t,d_t(h_t))*U^{\pi}_{t+1}(h_t,d_t(h_t),j)$
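The two stages above can be sketched in code. The following is a minimal illustrative implementation, not from the source: it computes the same backward-induction recurrence, written recursively over histories, assuming a hypothetical toy encoding of the MDP (per-step rewards `r[t][(s, a)]`, terminal rewards `r_N[s]`, transitions `P[t][(s, a)]` given as a dict of next-state probabilities, and a deterministic history-dependent policy `d(t, h)`); all names are illustrative.

```python
def evaluate(h, t, N, r, r_N, P, d):
    """U_t(h): expected return of the deterministic policy d from history h."""
    s_t = h[-1]                      # current state is the last entry of the history
    if t == N:                       # stage 1: terminal values U_N(h_N) = r_N(s_N)
        return r_N[s_t]
    a = d(t, h)                      # the action d_t(h_t) chosen by the policy
    # stage 2: U_t(h_t) = r_t(s_t, a) + sum_j P_t(j | s_t, a) * U_{t+1}(h_t, a, j)
    return r[t][(s_t, a)] + sum(
        p * evaluate(h + (a, j), t + 1, N, r, r_N, P, d)
        for j, p in P[t][(s_t, a)].items()
    )

# Toy instance (illustrative): states {0, 1}, one action, horizon N = 2.
N = 2
r = {1: {(0, 0): 1.0}}                       # r_1(s=0, a=0) = 1
r_N = {0: 0.0, 1: 1.0}                       # terminal reward equals the state index
P = {1: {(0, 0): {0: 0.5, 1: 0.5}}}          # fair coin between the two states
d = lambda t, h: 0                           # the policy always plays action 0
V = evaluate((0,), 1, N, r, r_N, P, d)       # 1.0 + 0.5*0.0 + 0.5*1.0 = 1.5
```

The recursion visits only the histories reachable from the start state, whereas the algorithm above sweeps all of $H_t$; on those histories the computed values coincide.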

The next lemma states the correctness of the algorithm.

Lemma 3.1   For $\pi\in\Pi^{HD}$, $U^{\pi}_1(s)=V^{\pi}_N(s)$.

Proof: We use backward induction to show that $U^{\pi}_{t}(h_{t})$ equals the expected return of $\pi$ given the history $h_t$.

Basis: for $t=N$ the claim holds directly by the definition of $V$.

Induction Step: $U^{\pi}_{t}(h_{t})$ can be written as:

(1) $U^{\pi}_t(h_t)= E^{\pi}_{h_t}[r_t(X_t,Y_t)+U^{\pi}_{t+1}(h_t,Y_t,X_{t+1})]$

$X_t=s_t$ is known from $h_t$.

$Y_t=d_t(h_t)$ is known from $\pi$, since the policy is deterministic.

(2) $U^{\pi}_t(h_t) = r_t(s_t,d_t(h_t)) + E^{\pi}_{h_t}[U^{\pi}_{t+1}(h_t,d_t(h_t),X_{t+1})]$

Stage 2 of the algorithm computes exactly this expectation, so by the induction hypothesis the claim holds for $t$ as well. This implies that at the end we have computed the expected return of $\pi$ from the start state, which is by definition $V^{\pi}_{N}(s)$.$\Box$

If the policy were stochastic ( $\pi\in\Pi^{HR}$), we would change stage 2 of the algorithm and get:

(2) $U^{\pi}_t(h_t)\leftarrow \sum_{a\in A} q_{h_t}(a)\{r_t(s_t,a) + \sum_{j\in S} P_t(j\vert s_t,a)*U^{\pi}_{t+1}(h_t,a,j)\}$
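In code, the only change for the stochastic case is averaging the one-step recurrence over the action distribution $q_{h_t}$. The following is a minimal illustrative sketch, not from the source, assuming a hypothetical toy encoding of the MDP (per-step rewards `r[t][(s, a)]`, terminal rewards `r_N[s]`, transitions `P[t][(s, a)]` as next-state probabilities) with the policy given as `q(t, h)`, a dict of action probabilities; all names are illustrative.

```python
def evaluate_stochastic(h, t, N, r, r_N, P, q):
    """U_t(h): expected return of a stochastic history-dependent policy q."""
    s_t = h[-1]                      # current state is the last entry of the history
    if t == N:                       # terminal values, as in the deterministic case
        return r_N[s_t]
    # average the one-step recurrence over actions drawn from q_{h_t}
    return sum(
        pa * (r[t][(s_t, a)] + sum(
            pj * evaluate_stochastic(h + (a, j), t + 1, N, r, r_N, P, q)
            for j, pj in P[t][(s_t, a)].items()))
        for a, pa in q(t, h).items()
    )

# Toy instance (illustrative): two actions mixed half-and-half at state 0.
N = 2
r = {1: {(0, 0): 1.0, (0, 1): 2.0}}
r_N = {0: 0.0, 1: 1.0}
P = {1: {(0, 0): {0: 0.5, 1: 0.5}, (0, 1): {0: 1.0}}}
q = lambda t, h: {0: 0.5, 1: 0.5}
V = evaluate_stochastic((0,), 1, N, r, r_N, P, q)
# 0.5*(1.0 + 0.5*0.0 + 0.5*1.0) + 0.5*(2.0 + 1.0*0.0) = 1.75
```

A deterministic policy is recovered as the special case where `q(t, h)` puts probability 1 on a single action.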

Yishay Mansour