Learning - Model Free: Temporal Differences - Policy Evaluation
Assume that at state s_t we performed action a_t, received reward
r_t, and moved to state s_{t+1}.
Our "estimation error" is r_t + γ V(s_{t+1}) − V(s_t), and we update:
V_{t+1}(s_t) = V_t(s_t) + α [r_t + γ V_t(s_{t+1}) − V_t(s_t)]
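The update above can be sketched in code. The following is a minimal illustration, not from the text: it runs TD(0) policy evaluation on a hypothetical 3-state deterministic chain (s0 → s1 → s2, terminal s2, reward 1 on entering s2), with names like td0_update and evaluate_chain chosen for the example.

```python
GAMMA = 0.9   # discount factor (the "γ" in the update rule)
ALPHA = 0.1   # learning rate (the "α" in the update rule)

def td0_update(V, s, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """Apply one TD(0) update: V(s) <- V(s) + α [r + γ V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]   # the estimation error
    V[s] += alpha * td_error

def evaluate_chain(episodes=2000):
    # Hypothetical chain MDP: (state, reward, next_state) per step;
    # s2 is terminal, so V[2] stays 0.
    transitions = [(0, 0.0, 1), (1, 1.0, 2)]
    V = [0.0, 0.0, 0.0]
    for _ in range(episodes):
        for s, r, s_next in transitions:
            td0_update(V, s, r, s_next)
    return V

V = evaluate_chain()
# V[1] approaches 1.0 (the immediate reward) and
# V[0] approaches γ * V[1] = 0.9, the chain's true values.
```

Repeated sweeps drive V(s_t) toward the fixed point where the expected TD error is zero, which is the true value function of the policy on this chain.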
Note that for the correct value function we have: