Choosing the parameters for F(r,s)

Next: TD-Gammon Up: TD-Gammon Previous: methods of encoding

Choosing the parameters for F(r,s)

If it is possible, we would like to find a vector r such that:

$\begin{displaymath}\forall s:\ F(r,s) = V^{\pi}(s)\end{displaymath}$

In most cases, it isn't possible since there aren't sufficient computing resources. Therefore, we should discuss the distance between F and $V^{\pi}$ . One way to measure this distance is to calculate the minimum square error (MSE):

$\begin{displaymath}^{\min}_{\ r}E_{s}[(F(r,s)-V^{\pi}(s))^{2}]\end{displaymath}$

If it is possible to encode $V^{\pi}$ using F(r,s), this minimum equals to 0. Otherwise, we try to get as close as possible to $V^{\pi}(s)$ . For linear functions it is sometimes possible to compute $\overrightarrow{r}$ which minimizes the MSE, but generally, it is a difficult task. Let's look at the problem in the following manner: we try to minimize

$\begin{displaymath}G(r,s)=E_{s}[(F(r,s)-V^{\pi}(s))^{2}]\ .\end{displaymath}$

Finding the global minimum cannot always be done efficiently, and it is much easier to find a local minimum (and hope it is good enough).To find a local minimum, we can follow the derivative of G(r,s). As long as the derivative is non zero, we have a direction, in which G(r,s) is descending. When the value of the derivative equals to zero, we are done. We now only need to establish that this is a minimum (and not a maximum or an inflection point). The reason that this is not a maximum value is that the function descends to this point. However, we could be at an inflection point, and remain there (because the derivative could be fixed on zero in that neighborhood).

$\begin{displaymath}\overrightarrow{r_{t+1}} = \overrightarrow{r_{t}}-\frac{\alp... ...)-V_{t}(s_{t})]\cdot\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t})\end{displaymath}$

Recall that:

$\begin{displaymath}\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t}) = (\frac{\partia... ...(s_{t}),\ldots,\frac{\partial}{\partial r_{k}}V_{t}(s_{t}))\ .\end{displaymath}$

Of course, we don't have access to $V^{\pi}(s_{t})$ , but rather to a sample of it v_t (which could be noisy). Thus, we have:

$\begin{displaymath}\overrightarrow{r_{t+1}} = \overrightarrow{r_{t}} + \alpha[v_{t}-V_{t}(s_{t})]\cdot\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t})\end{displaymath}$

For example, v_t could be a Monte-Carlo estimate, i.e. the total reward from time t onwards R_t. In the case of backgammon, v_t denotes whether we won or lost the game. Similarly,

$\begin{displaymath}\overrightarrow{r_{t+1}} = \overrightarrow{r_{t}} + \alpha[... ...}-V_{t}(s_{t})]\cdot\nabla_{\overrightarrow{r_{t}}}V_{t}(s_{t})\end{displaymath}$

Next: TD-Gammon Up: TD-Gammon Previous: methods of encoding

Yishay Mansour
2000-01-17