** Next:** Equality of schemes using
** Up:** Forward Updates Versus Backward
** Previous:** Forward Updates Versus Backward

### Algorithm

The algorithm is as follows:
1. Define an arbitrary initial value function $V(s)$, and set $e(s) = 0$ for every state $s$.
2. Repeat for every run: set the initial state $s$.
3. Repeat for every step of the run:
   - Choose action $a$ according to policy $\pi$.
   - Perform action $a$, receive the next state $s'$ and the immediate reward $r$.
   - Set $\Delta = r + \gamma V(s') - V(s)$ and $e(s) \leftarrow e(s) + 1$.
   - For every state $u$: set $V(u) \leftarrow V(u) + \alpha\,\Delta\, e(u)$ and $e(u) \leftarrow \lambda\gamma\, e(u)$.
   - Set $s \leftarrow s'$.

   Repeat steps 2 and 3 until the final state is reached (for every run).
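The steps above can be sketched in code (a minimal sketch; the environment interface `step`, the `policy` callable, and all parameter values are illustrative assumptions, not from the notes):

```python
def td_lambda_episode(states, step, policy, V, alpha=0.1, gamma=0.9, lam=0.5):
    """One run of TD(lambda) with eligibility traces (backward updates).

    `step(s, a)` returns (next_state, reward, done) and `policy(s)` returns
    an action -- both are assumed interfaces, not part of the notes.
    """
    e = {s: 0.0 for s in states}        # step 1: e(s) = 0 for every state
    s = states[0]                       # step 2: initial state
    done = False
    while not done:                     # step 3: every step of the run
        a = policy(s)                   # choose action according to policy
        s_next, r, done = step(s, a)    # perform it, observe s' and r
        delta = r + gamma * V[s_next] - V[s]   # TD error
        e[s] += 1.0                     # bump the current state's trace
        for u in states:                # weight each state's update by its trace
            V[u] += alpha * delta * e[u]
            e[u] *= lam * gamma         # decay all traces
        s = s_next
    return V
```

With `lam=0` only the current state's trace is nonzero at update time, so the loop recovers TD(0); with `lam=1` the traces accumulate the discounted contributions of all past visits.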

Using this scheme, we update previously visited states according to the new immediate reward and the state we have reached. The weight of the update is smaller for states visited further in the past and larger for more recently visited states.

Let us examine different values of $\lambda$.

For $\lambda = 0$: $e(s) = 1$ iff $s$ is the current state, and $e(s) = 0$ otherwise. So we will update only the current state according to the received reward, and we once again have the TD(0) scheme.
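Substituting $\lambda = 0$ into the per-step update makes this reduction explicit (a short derivation in the algorithm's notation):

```latex
% With \lambda = 0 every trace is multiplied by \lambda\gamma = 0, so after
% the increment e(s) = 1 and e(u) = 0 for every u \neq s.  The update
V(u) \leftarrow V(u) + \alpha\,\bigl[r + \gamma V(s') - V(s)\bigr]\, e(u)
% therefore collapses to the single TD(0) update
V(s) \leftarrow V(s) + \alpha\,\bigl[r + \gamma V(s') - V(s)\bigr]
```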

For $\lambda = 1$: $e(s)$ is the contribution of this step to state $s$, i.e., if $s$ is visited in the run at steps $t_1, \ldots, t_k$ then

$$e(s) = \sum_{i=1}^{k} \gamma^{t - t_i},$$

where $\gamma^{t - t_i}$ is the contribution of the current step $t$ to step $t_i$.
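This identity can be checked numerically (a small sketch; the helper names, `visit_steps`, and the parameter values are illustrative assumptions):

```python
def trace_value(visit_steps, t, gamma):
    """Closed form: each visit at step t_i contributes gamma**(t - t_i)."""
    return sum(gamma ** (t - ti) for ti in visit_steps)

def simulate_trace(visit_steps, t, gamma, lam=1.0):
    """Run the backward-view trace update up to step t: add 1 on a visit,
    decay by gamma * lam at the end of every step before t."""
    e = 0.0
    for step in range(t + 1):
        if step in visit_steps:
            e += 1.0          # e(s) <- e(s) + 1 when s is visited
        if step < t:
            e *= gamma * lam  # e(s) <- lambda * gamma * e(s)
    return e
```

For $\lambda = 1$ the simulated trace matches the sum $\sum_i \gamma^{t - t_i}$ exactly.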

Although TD(1) resembles MC, there is a minor difference: MC waits for the end of the run to perform its updates, while TD(1) performs updates online. By the end of the run both methods give the same value function, but the functions computed by TD(1) will hopefully be more accurate during the run, since they are updated online.

*Yishay Mansour*

*2000-01-06*