Let's define a state machine with two states s_{1}, s_{2}.
Formally, in the example we have:
time: t = 1, ..., N
states: S = {s_{1}, s_{2}}
actions: A_{s_{1}} = {a_{11}, a_{12}}, A_{s_{2}} = {a_{21}}
reward: r_{t}(s_{1},a_{11}) = 5, r_{t}(s_{1},a_{12}) = 10, r_{t}(s_{2},a_{21}) = -1
final reward: r_{N}(s_{1}) = 0, r_{N}(s_{2}) = 0
transition function: P_{t}(s_{1}|s_{1},a_{11}) = 0.5, P_{t}(s_{2}|s_{1},a_{11}) = 0.5
P_{t}(s_{2}|s_{1},a_{12}) = 1, P_{t}(s_{2}|s_{2},a_{21}) = 1
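Since the rewards and transitions above carry no real dependence on t, the example can be sketched as plain Python dictionaries; all identifier names here are my own, not from the notes:

```python
# The example MDP as plain Python dicts (identifier names are mine).
# Rewards r_t(s, a) are time-independent in this example.
states = ["s1", "s2"]
actions = {"s1": ["a11", "a12"], "s2": ["a21"]}
reward = {("s1", "a11"): 5.0, ("s1", "a12"): 10.0, ("s2", "a21"): -1.0}
final_reward = {"s1": 0.0, "s2": 0.0}
# P[(s, a)][s_next] = P_t(s_next | s, a); each row must sum to 1.
P = {("s1", "a11"): {"s1": 0.5, "s2": 0.5},
     ("s1", "a12"): {"s2": 1.0},
     ("s2", "a21"): {"s2": 1.0}}
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in P.values())
```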
Now we want to maximize the return. First we compare two deterministic policies:
- pi_{1}: always choose a_{11} when in state s_{1}.
- pi_{2}: always choose a_{12} when in state s_{1}.
(In state s_{2} the only available action is a_{21}, so both policies choose it there.)
Recall that
V_{N}^{\pi}(s) = E^{\pi}[\sum_{t=1}^{N-1} r_{t}(s_{t},a_{t}) + r_{N}(s_{N}) | s_{1} = s]
is the expected return of policy \pi in N time steps.
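This expected return can be evaluated by backward recursion: start from V_{N}(s) = r_{N}(s) and step backwards through time. A minimal sketch for the two-state example above (the helper name `policy_value` and the policy dicts are my own):

```python
# Evaluate V^pi_N for a deterministic policy by backward recursion
# (identifier names are mine, not from the notes).
reward = {("s1", "a11"): 5.0, ("s1", "a12"): 10.0, ("s2", "a21"): -1.0}
final_reward = {"s1": 0.0, "s2": 0.0}
P = {("s1", "a11"): {"s1": 0.5, "s2": 0.5},
     ("s1", "a12"): {"s2": 1.0},
     ("s2", "a21"): {"s2": 1.0}}

def policy_value(policy, N):
    """Expected return of a deterministic policy over N time steps
    (N - 1 actions, then the final reward r_N)."""
    V = dict(final_reward)                 # V_N(s) = r_N(s)
    for _ in range(N - 1):                 # steps t = N-1, ..., 1
        V = {s: reward[(s, policy[s])]
                + sum(p * V[s_next]
                      for s_next, p in P[(s, policy[s])].items())
             for s in V}
    return V

pi1 = {"s1": "a11", "s2": "a21"}   # always a11 in s1
pi2 = {"s1": "a12", "s2": "a21"}   # always a12 in s1
print(policy_value(pi1, 2))  # {'s1': 5.0, 's2': -1.0}
print(policy_value(pi2, 2))  # {'s1': 10.0, 's2': -1.0}
```

For N = 2 the second policy is clearly better from s_{1}: it collects 10 instead of 5, and the final rewards are all zero.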
Let's try to determine the best policy. We do this by first looking at the reward gained in the last step, for N=2:
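For N = 2 there is a single action before the final reward, so the best last-step choice in each state simply maximizes the immediate reward plus the expected final reward. A quick check (identifier names are my own):

```python
# Last-step (N = 2) comparison for the two-state example
# (identifier names are mine, not from the notes).
reward = {("s1", "a11"): 5.0, ("s1", "a12"): 10.0, ("s2", "a21"): -1.0}
final_reward = {"s1": 0.0, "s2": 0.0}
actions = {"s1": ["a11", "a12"], "s2": ["a21"]}
P = {("s1", "a11"): {"s1": 0.5, "s2": 0.5},
     ("s1", "a12"): {"s2": 1.0},
     ("s2", "a21"): {"s2": 1.0}}

best = {}
for s, acts in actions.items():
    # q[a] = r(s, a) + E[r_N(s_next)]
    q = {a: reward[(s, a)]
            + sum(p * final_reward[s_next]
                  for s_next, p in P[(s, a)].items())
         for a in acts}
    best[s] = max(q, key=q.get)
print(best)  # {'s1': 'a12', 's2': 'a21'}
```

In s_{1} the last-step choice is a_{12}, since reward 10 beats 5 and the final rewards are zero everywhere.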
For N=3, we choose one action and then follow the optimal policy for N=2 on the remaining steps.
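This backward-induction step can be sketched as follows: at each step, compute one-step Q-values against the optimal values of the shorter horizon, then take the maximizing action. All names are my own. Notably, for N = 3 the first action in s_{1} becomes a_{11} (value 5 + 0.5·10 + 0.5·(−1) = 9.5), beating the greedy a_{12} (value 10 − 1 = 9):

```python
# Backward induction for the two-state example (identifier names are mine).
reward = {("s1", "a11"): 5.0, ("s1", "a12"): 10.0, ("s2", "a21"): -1.0}
final_reward = {"s1": 0.0, "s2": 0.0}
actions = {"s1": ["a11", "a12"], "s2": ["a21"]}
P = {("s1", "a11"): {"s1": 0.5, "s2": 0.5},
     ("s1", "a12"): {"s2": 1.0},
     ("s2", "a21"): {"s2": 1.0}}

def backward_induction(N):
    """Optimal values and per-step optimal policy for horizon N
    (N - 1 actions followed by the final reward r_N)."""
    V = dict(final_reward)                  # V_N(s) = r_N(s)
    policy = []
    for _ in range(N - 1):                  # t = N-1, ..., 1
        Q = {s: {a: reward[(s, a)]
                    + sum(p * V[s_next]
                          for s_next, p in P[(s, a)].items())
                 for a in actions[s]}
             for s in actions}
        pi_t = {s: max(Q[s], key=Q[s].get) for s in actions}
        V = {s: Q[s][pi_t[s]] for s in actions}
        policy.append(pi_t)
    policy.reverse()                        # policy[0] is the first decision rule
    return V, policy

V3, pi3 = backward_induction(3)
print(V3, pi3[0]["s1"])  # first action in s1 is 'a11', not the greedy 'a12'
```

This illustrates why the optimal policy is generally non-stationary over a finite horizon: a_{11} is worth taking early because with probability 0.5 it keeps us in s_{1}, where the large reward 10 is still available later.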