Learning - Model Free: Temporal Differences - Policy Evaluation
Assume that at state s_t we performed action a_t, received reward
r_t, and moved to state s_{t+1}.
Our "estimation error" is r_t + γ V(s_{t+1}) − V(s_t), and we update:
V_{t+1}(s_t) = V_t(s_t) + α [r_t + γ V_t(s_{t+1}) − V_t(s_t)]
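The update above can be sketched in code. The following is a minimal illustration, not from the text: it runs TD(0) policy evaluation on a hypothetical 3-state deterministic chain (s0 → s1 → s2, terminal s2, reward 1 on entering s2), with names like td0_update and evaluate_chain chosen for the example.

```python
GAMMA = 0.9   # discount factor (the "γ" in the update rule)
ALPHA = 0.1   # learning rate (the "α" in the update rule)

def td0_update(V, s, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """Apply one TD(0) update: V(s) <- V(s) + α [r + γ V(s') - V(s)]."""
    td_error = r + gamma * V[s_next] - V[s]   # the estimation error
    V[s] += alpha * td_error

def evaluate_chain(episodes=2000):
    # Hypothetical chain MDP: (state, reward, next_state) per step;
    # s2 is terminal, so V[2] stays 0.
    transitions = [(0, 0.0, 1), (1, 1.0, 2)]
    V = [0.0, 0.0, 0.0]
    for _ in range(episodes):
        for s, r, s_next in transitions:
            td0_update(V, s, r, s_next)
    return V

V = evaluate_chain()
# V[1] approaches 1.0 (the immediate reward) and
# V[0] approaches γ * V[1] = 0.9, the chain's true values.
```

Repeated sweeps drive V(s_t) toward the fixed point where the expected TD error is zero, which is the true value function of the policy on this chain.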
Note that for the correct value function we have: