next up previous
Next: The Optimality Principal Up: Return Estimation of a Previous: Dynamic Programming Algorithm for

Computational Complexity

Let K be the number of states and L the number of actions. Then |Ht|=Kt+1 Lt and the computation time would be: $\sum_{t=0}^{N-1} K\vert H_t\vert \leq K^2 \sum_{t=0}^{N-1} (KL)^t$

The above value is for a general history dependent policy. If the policy is Markovian, then |Ht|=K and we get a computation time of K2(N-1)

Yishay Mansour