Algorithms - optimal control
CLAIM: A policy p is optimal if and only if at each state s:
Vp(s) = max_a Qp(s,a) (Bellman Eq.)
PROOF: Assume there is a state s and action a s.t. Vp(s) < Qp(s,a).
Then the strategy of performing a at state s (the first time)
and following p afterwards is better than p.
This is true each time we visit s, so the policy that
performs action a at state s is better than p.
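
EXAMPLE: A minimal sketch of the improvement step used in the proof. The two-state deterministic MDP, the discount factor GAMMA, and all function names below are assumptions made up for illustration; the slide does not fix a reward criterion. The code evaluates a policy p, lists every pair (s, a) with Qp(s,a) > Vp(s), and shows that switching to such an action yields a policy that is at least as good at every state.

```python
# Hypothetical toy MDP (not from the source): 2 states, 2 actions, deterministic.
GAMMA = 0.5                           # assumed discount factor
NEXT = [[0, 1], [1, 0]]               # NEXT[s][a]   = successor state
REWARD = [[0.0, 1.0], [2.0, 0.0]]     # REWARD[s][a] = immediate reward
STATES = range(2)
ACTIONS = range(2)


def evaluate(policy, sweeps=200):
    """Approximate Vp by repeated sweeps of the fixed-policy backup."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [REWARD[s][policy[s]] + GAMMA * v[NEXT[s][policy[s]]] for s in STATES]
    return v


def q_values(v):
    """Qp(s,a): take action a once, then follow p (whose value function is v)."""
    return [[REWARD[s][a] + GAMMA * v[NEXT[s][a]] for a in ACTIONS] for s in STATES]


def bellman_violations(policy, tol=1e-9):
    """All pairs (s, a) with Qp(s,a) > Vp(s), i.e. where the Bellman Eq. fails."""
    v = evaluate(policy)
    q = q_values(v)
    return [(s, a) for s in STATES for a in ACTIONS if q[s][a] > v[s] + tol]


p = [0, 0]                            # initial policy: always take action 0
violations = bellman_violations(p)    # action 1 at state 0 beats p there
s, a = violations[0]
better = list(p)
better[s] = a                         # perform a at s every time s is visited

print("violations of p:", violations)
print("Vp:", evaluate(p), "-> V of improved policy:", evaluate(better))
print("violations of improved policy:", bellman_violations(better))
```

In this toy instance the improved policy has no remaining violations, so it satisfies the Bellman Eq. at every state; in general the improvement step may need to be repeated.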