POMDP - Belief State Algorithm
Given a history of actions and observations,
we compute a posterior distribution over the state
we are in; this distribution is the belief state.
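The belief update itself is the standard Bayes-filter step: predict the next-state distribution under the chosen action, then reweight by the likelihood of the received observation. A minimal sketch for a discrete POMDP, assuming transition and observation probabilities stored as NumPy arrays (the array layout and the function name are illustrative, not from the source):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes-filter belief update for a discrete POMDP (illustrative sketch).

    b: current belief over states, shape (S,)
    a: action index
    o: observation index
    T: transition model, shape (A, S, S), T[a, s, s2] = P(s2 | s, a)
    O: observation model, shape (A, S, O), O[a, s2, o] = P(o | s2, a)
    """
    predicted = b @ T[a]                 # predicted next-state distribution P(s2 | b, a)
    unnormalized = O[a][:, o] * predicted  # reweight by observation likelihood
    return unnormalized / unnormalized.sum()  # renormalize to a distribution
```

For example, with a two-state POMDP, a "listen" action that leaves the state unchanged, and an observation that is correct 85% of the time, a uniform belief sharpens to (0.85, 0.15) after one matching observation.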
States: distributions over S (the states of the POMDP).
Actions: as in the POMDP.
Transitions: the belief update, i.e., the posterior distribution over the next state given the action and observation.
We can then perform planning and learning on the belief-state MDP.
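As a small illustration of acting on the belief-state MDP, one can evaluate actions directly against the belief, e.g. a one-step greedy rule that maximizes expected immediate reward under the current belief (full planning would also account for future beliefs; the reward layout and function name here are assumptions for the sketch):

```python
import numpy as np

def greedy_action(b, R):
    """One-step greedy rule on the belief-state MDP (illustrative sketch).

    b: belief over states, shape (S,)
    R: reward model, shape (A, S), R[a, s] = r(s, a)

    Returns the action maximizing the expected immediate reward
    sum_s b(s) * r(s, a).
    """
    expected_rewards = R @ b   # one expected reward per action
    return int(np.argmax(expected_rewards))
```

This is only the myopic special case; proper POMDP planning (e.g. value iteration over beliefs) trades off immediate reward against information gathered for future decisions.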