
## Return Estimation of a Given Policy

Let *π* be a given policy. We would like to calculate its value *V*^{π}.
An inefficient way of calculating *V*^{π} is to enumerate all possible histories, compute the return of each one, and take the probability-weighted sum, i.e., the expectation. An alternative method is to calculate the return of the last step first and recurse backwards through the previous steps, until reaching the first step.
Let us define a set of variables *V*_{t}^{π}(*h*_{t}) to represent the expected return starting in time *t* until time *N*, where *h*_{t} denotes the history (previous states and actions) up to time *t*. We can write *V*_{t}^{π}(*h*_{t}) as:

*V*_{t}^{π}(*h*_{t}) = *E*^{π}[ Σ_{i=t}^{N} *r*(*s*_{i}, *a*_{i}) | *h*_{t} ]

where
*h*_{t} = (*h*_{t-1}, *a*_{t-1}, *s*_{t})
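As a concrete illustration of the inefficient method, the following sketch enumerates every trajectory of a small MDP and takes the probability-weighted sum of returns. The two states, transition probabilities, rewards, policy, and horizon below are all invented for this example, not taken from the notes:

```python
import itertools

# A tiny made-up MDP: two states {0, 1}, two actions, horizon N = 3.
P = {  # P[s][a][s2] = transition probability from s to s2 under action a
    0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.3, 1: 0.7}, "go": {0: 0.6, 1: 0.4}},
}
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 2.0, "go": 0.5}}  # r(s, a)
pi = {0: "go", 1: "stay"}  # a fixed deterministic policy to evaluate

def brute_force_value(s0, N=3):
    """Enumerate every length-N trajectory from s0, weight the return
    of each by its probability under P and pi, and sum them up."""
    total = 0.0
    for tail in itertools.product(P, repeat=N):  # iterating a dict yields states
        seq = (s0,) + tail                       # full state trajectory
        prob, ret = 1.0, 0.0
        for t in range(N):
            a = pi[seq[t]]
            ret += R[seq[t]][a]                  # reward collected at step t
            prob *= P[seq[t]][a][seq[t + 1]]     # probability of the transition
        total += prob * ret                      # weighted sum = expectation
    return total
```

The cost is the number of trajectories, |*S*|^{N}, which is exactly why this approach is inefficient compared to the backward recursion below.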

We can define *V*_{t}^{π} as a function of *V*_{t+1}^{π} as follows:

*V*_{t}^{π}(*h*_{t}) = *E*^{π}[ *r*(*s*_{t}, *a*_{t}) + *V*_{t+1}^{π}(*h*_{t+1}) | *h*_{t} ]

Note that *s*_{t} is known from the history, and *a*_{t} is known if the policy is deterministic. *s*_{t+1} is unknown for any policy, since the environment is stochastic.

*The idea*: Calculate the values of *V*_{t}^{π} backwards, from *t* = *N* down to *t* = 1, using a dynamic programming algorithm.
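The backward calculation can be sketched in code. For a Markov policy in an MDP, the dependence on the full history collapses to the current state, so the recursion operates on *V*_{t}^{π}(*s*) rather than *V*_{t}^{π}(*h*_{t}). The MDP below (states, probabilities, rewards, policy) is made up for illustration:

```python
# A tiny made-up MDP: two states, a fixed deterministic policy pi.
P = {  # P[s][a][s2] = transition probability from s to s2 under action a
    0: {"stay": {0: 0.9, 1: 0.1}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.3, 1: 0.7}, "go": {0: 0.6, 1: 0.4}},
}
R = {0: {"stay": 0.0, "go": 1.0}, 1: {"stay": 2.0, "go": 0.5}}  # r(s, a)
pi = {0: "go", 1: "stay"}  # the policy being evaluated

def dp_value(N):
    """Evaluate pi by dynamic programming: start from V_N = 0 and apply
    V_t(s) = r(s, pi(s)) + E[V_{t+1}(s')] backwards to t = 0."""
    V = {s: 0.0 for s in P}  # V_N: nothing is collected after the horizon
    for _ in range(N):       # one backward step per time index
        V = {s: R[s][pi[s]]
                + sum(p * V[s2] for s2, p in P[s][pi[s]].items())
             for s in P}
    return V                 # V_0(s): expected N-step return starting at s
```

Each backward step costs O(|*S*|²) for a deterministic Markov policy, so the whole evaluation is polynomial in the horizon, in contrast to enumerating all |*S*|^{N} histories.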

*Yishay Mansour*

*1999-11-15*