
## Example 1

The first example is a Markovian decision problem with a single time step. This implies that N=2, and r(s') is the value of the final state s'.

When the system starts at state s_0, the goal of the agent is to choose an action that maximizes the value v(a) = E[R_1(s_0, a) + r(s')].

We will choose a deterministic policy that selects an action as follows: v(a) = E[R_1(s_0, a)] + E[r(x_2)], where x_2 is a random variable describing the state reached after action a. Writing the expectation explicitly,

v(a) = E[R_1(s_0, a)] + sum_{s'} P(s' | s_0, a) r(s').

The agent will choose an action a* that maximizes v(a), i.e.

a* = argmax_a v(a) = argmax_a { E[R_1(s_0, a)] + sum_{s'} P(s' | s_0, a) r(s') }.
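As a small sketch of this computation, the one-step values v(a) and the maximizing action a* can be computed directly; the actions, transition probabilities P and rewards below are illustrative assumptions, not values from the text:

```python
# One-step MDP: v(a) = E[R_1(s_0, a)] + sum_{s'} P(s'|s_0, a) * r(s')
# All numbers below are made-up illustrative values.

R1 = {"left": 1.0, "right": 0.0}   # expected immediate reward E[R_1(s_0, a)]
P = {                              # transition probabilities P(s' | s_0, a)
    "left":  {"s1": 0.8, "s2": 0.2},
    "right": {"s1": 0.1, "s2": 0.9},
}
r = {"s1": 0.0, "s2": 5.0}         # terminal value r(s')

def v(a):
    """One-step value: immediate reward plus expected terminal value."""
    return R1[a] + sum(P[a][sp] * r[sp] for sp in P[a])

a_star = max(P, key=v)             # a* = argmax_a v(a)
print(a_star, v(a_star))           # here: right 4.5
```

With these numbers, v(left) = 1.0 + 0.2*5 = 2.0 and v(right) = 0.0 + 0.9*5 = 4.5, so the agent picks a* = right.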

In the above example, there is no stochastic rule that is better than the deterministic policy presented. If there were, such a stochastic rule would have an action distribution q(a), and the return of such a policy would be sum_a q(a) v(a). But since we chose a* so that v(a*) >= v(a) for any action a, we get sum_a q(a) v(a) <= sum_a q(a) v(a*) = v(a*). (In simpler words, the best deterministic choice is always at least as good as any average.)

Yishay Mansour
1999-11-15