
Example 1

The first example is a Markovian decision problem with a single decision epoch. This means that N=2, $T=\{1,2\}$, and $r_2(s')$ is the terminal value of the final state $s'$.

When the system starts at state $s_0\in S$, the goal of the agent is to choose an action $a\in A$ that maximizes $u(a)=E[r_1(s_0,a) + r_2(s')]$, where $s'$ is the (random) state reached after taking action $a$.

We will choose a deterministic policy $\pi$ that selects an action maximizing $u(a)= E[r_1(s,a)] + E[r_2(x_2)]$, where $x_2$ is a random variable describing the state reached after action $a$. Writing this explicitly, $u(a) = r_1(s,a) + \sum_{j \in S} P(j\vert s,a)\, r_2(j)$.
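The computation above can be sketched directly. The following is a minimal illustration on a hypothetical two-state, two-action problem; the states, actions, rewards, and transition probabilities are invented for the example and do not come from the text.

```python
# Hypothetical one-epoch decision problem (illustrative values only).
S = ["s1", "s2"]          # states
A = ["a1", "a2"]          # actions
s = "s1"                  # start state

# Immediate reward r_1(s, a)
r1 = {("s1", "a1"): 1.0, ("s1", "a2"): 0.5}

# Transition probabilities P(j | s, a)
P = {("s1", "a1"): {"s1": 0.2, "s2": 0.8},
     ("s1", "a2"): {"s1": 0.9, "s2": 0.1}}

# Terminal reward r_2(j)
r2 = {"s1": 0.0, "s2": 2.0}

def u(a):
    """u(a) = r_1(s,a) + sum over j of P(j|s,a) * r_2(j)."""
    return r1[(s, a)] + sum(P[(s, a)][j] * r2[j] for j in S)

# The agent picks an action attaining the maximum of u.
a_star = max(A, key=u)
```

With these numbers, $u(a_1) = 1 + 0.8\cdot 2 = 2.6$ and $u(a_2) = 0.5 + 0.1\cdot 2 = 0.7$, so the maximizing action is $a_1$.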

The agent will choose an action $a^*$ that maximizes $u(a)$, i.e. $a^* \in \{a' \mid u(a') = \max_{a \in A} u(a)\}$.

In the above example, there is no stochastic rule that is better than the deterministic policy presented. Suppose such a stochastic rule existed; it would choose actions according to some distribution $q(a)$, and its return would be $\sum_a q(a)u(a)$. But since we chose $a^*$ so that $u(a) \leq u(a^*)$ for every action $a$, we get $\sum_a q(a)u(a) \leq \sum_a q(a)u(a^*) = u(a^*)$ (in simpler words, the best deterministic choice is always at least as good as any average).
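The averaging argument can be checked numerically. The values $u(a)$ and the distribution $q$ below are hypothetical, chosen only to illustrate the inequality:

```python
# Hypothetical action values u(a) and a stochastic rule q (illustrative only).
u = {"a1": 2.6, "a2": 0.7}           # deterministic values u(a)
a_star = max(u, key=u.get)           # best deterministic action a*

q = {"a1": 0.3, "a2": 0.7}           # any distribution over actions
mixed = sum(q[a] * u[a] for a in u)  # return of the stochastic rule

# A weighted average of u-values can never exceed the maximum u-value.
assert mixed <= u[a_star]
```

Here the mixture attains $0.3\cdot 2.6 + 0.7\cdot 0.7 = 1.27 \leq 2.6 = u(a^*)$, as the argument predicts; the same inequality holds for any choice of $q$.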

Yishay Mansour