
## MDP description

1. The discount factor is set to 1 (the game is episodic: every play ends in a terminal state, so no discounting is needed).
2. Immediate rewards:
   - (a) In non-terminal states the immediate rewards are equal to 0.
   - (b) In a winning terminal state the immediate reward equals 1.
   - (c) In a losing terminal state the immediate reward equals 0.

I.e., the temporal difference is *V*(*s*_{t+1}) - *V*(*s*_{t}) for all
non-terminal states.
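
To make the reward structure concrete, here is a minimal Python sketch; the `is_terminal`/`is_win` flags are hypothetical placeholders, not TD-Gammon's actual board representation:

```python
def reward(is_terminal, is_win):
    """Immediate reward of the MDP above: 1 for a winning terminal
    state, 0 for a losing terminal state and for all other states."""
    return 1.0 if (is_terminal and is_win) else 0.0

def td_difference(V, s_t, s_next):
    """The temporal difference V(s_{t+1}) - V(s_t) for a non-terminal
    state s_t; the discount factor is 1, so it does not appear."""
    return V(s_next) - V(s_t)
```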
On every move the parameters are changed in the direction of the
TD. Generally, assume we have a function
*F*(*s*,*r*) = *V*_{r}(*s*),
which gives each state *s* a value according to *r* (*r* is
actually a program which gets a state *s* as an input). We will
update *r* by the derivative of *V*_{r}(*s*)
with respect to *r*, i.e., according to the vector
∇_{r}*V*_{r}(*s*) = (∂*V*_{r}(*s*)/∂*r*_{1}, ..., ∂*V*_{r}(*s*)/∂*r*_{k}).
Updating *r* in this direction will hopefully change the
value of *V*_{r}(*s*) in the "right" direction. TD tries to minimize
the difference between two succeeding states: assuming that
*V*_{r}(*s*_{t+1}) > *V*_{r}(*s*_{t}), we would like to "strengthen" the
weight of the action taken, and so we update in the direction of
∇_{r}*V*_{r}(*s*_{t}).
For example, if *r* is a table with one entry per state, i.e.
*V*_{r}(*s*) = *r*_{s},
then
∇_{r}*V*_{r}(*s*) = *e*_{s}, the unit vector whose only non-zero coordinate is the *s*'th,
and the update will occur only
in the *s*'th entry of *r*.
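
A minimal sketch of this tabular special case (the table size, the particular states, and the step size `alpha` are illustrative assumptions):

```python
import numpy as np

n_states = 5
r = np.array([0.0, 0.1, 0.2, 0.5, 0.4])  # the table: V_r(s) = r[s]
alpha = 0.1                               # learning rate (illustrative)

s_t, s_next = 2, 3                        # two succeeding states
td = r[s_next] - r[s_t]                   # V_r(s_{t+1}) - V_r(s_t)

# grad_r V_r(s_t) is the unit vector e_{s_t}, so only the s_t-th
# entry of the table changes:
r += alpha * td * np.eye(n_states)[s_t]
print(r)  # only r[2] moved, from 0.2 towards 0.5
```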
TD-Gammon updates *r* while running the system,
where the current policy is the greedy policy with respect to the
function *V*_{r}(*s*). Specifically,
*r*_{t+1} = *r*_{t} + α[*V*_{r_{t}}(*s*_{t+1}) - *V*_{r_{t}}(*s*_{t})]∇_{r}*V*_{r_{t}}(*s*_{t}),
where α > 0 is the learning rate.
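
The same rule for a general differentiable *V*_{r}, sketched with a toy linear value function (the feature vectors and step size are assumptions chosen for illustration; TD-Gammon itself used a neural network for *V*_{r}):

```python
import numpy as np

def V(r, phi):
    """Value of a state whose feature vector is phi, under parameters r."""
    return float(r @ phi)

def td_step(r, phi_t, phi_next, alpha=0.1):
    """One update r <- r + alpha (V_r(s_{t+1}) - V_r(s_t)) grad_r V_r(s_t).
    For a linear V_r, the gradient grad_r V_r(s_t) is just phi_t."""
    td = V(r, phi_next) - V(r, phi_t)
    return r + alpha * td * phi_t

r = np.array([0.0, 0.5, 0.1])
r = td_step(r, phi_t=np.array([1.0, 0.0, 1.0]),
               phi_next=np.array([0.0, 1.0, 0.0]))
```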
TD-Gammon simply "learns" by playing against itself, i.e., it updates
*r* during these games.
After about 300,000 games the system achieved a very good
playing skill (comparable to other backgammon-playing programs).
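
Schematically, the learn-while-playing loop looks as follows; the game here is a trivial single-agent stand-in (a race to position N), not backgammon, and a tabular value function replaces TD-Gammon's network, so the two-player self-play aspect is elided:

```python
import numpy as np

N = 10                    # terminal (winning) position of the toy game
r = np.zeros(N + 1)       # tabular V_r, one entry per position
alpha = 0.1

for game in range(1000):  # TD-Gammon played on the order of 300,000 games
    s = 0
    while s < N:
        # greedy policy: choose the move whose successor has the
        # highest estimated value (a win is worth 1 by definition)
        value = lambda x: 1.0 if x == N else r[x]
        s_next = max((min(s + step, N) for step in (1, 2)), key=value)
        # TD update performed during play, as in the rule above
        r[s] += alpha * (value(s_next) - r[s])
        s = s_next
```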

*Yishay Mansour*

*2000-01-17*