Next: TD(λ) algorithm Up: No Title Previous: Online Versus Off-Line Updates

## Differences between TD(0) and MC

If we run TD(0) and MC for long enough (and give them enough data), both algorithms converge to the correct value function. However, on a finite input the two algorithms can give different results.

To understand the basic difference between MC and TD(0), we study the following example.
Assume that we performed a series of runs and obtained the following results:
1. (A,0) (B,0)
2. (B,1)
3. (B,1)
4. (B,1)
5. (B,1)
6. (B,1)
7. (B,1)
8. (B,0)
We can create an infinite sample from this finite sample as follows: each time we need to sample a run, we choose one of the eight runs uniformly at random.
We want to estimate V(A) and V(B) using this finite sample.
For V(B) we have 8 runs: 6 of them give return 1 and 2 of them give return 0. Our estimate is therefore $V(B) = 6/8 = 3/4$.
How should we estimate V(A)? There are two reasonable options.
1. We have one run visiting state A, and its return is 0, so we estimate V(A) = 0.
2. We have one run visiting state A, and this run moves from state A to state B. So we can assume that we always move from state A to state B, which implies the estimate $V(A) = V(B) = 3/4$.
Clearly the first approach corresponds to MC. We now show that TD(0) implements the second approach. That is, TD(0) evaluation is in fact done using the Markovian model produced by inspecting the input: for every state s and action a, the next-state distribution is identical to the one observed in the sample.
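The two estimates can be reproduced numerically. Below is a minimal Python sketch (the data layout, step size, and sweep count are illustrative choices, not from the text) that computes the MC averages and runs batch TD(0) sweeps over the eight runs, with undiscounted returns:

```python
from collections import defaultdict

# The eight runs from the example: lists of (state, reward) pairs.
runs = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# --- MC estimate: average the observed return from each visit ---
returns = defaultdict(list)
for run in runs:
    for i, (s, _) in enumerate(run):
        returns[s].append(sum(r for _, r in run[i:]))  # undiscounted return
mc = {s: sum(g) / len(g) for s, g in returns.items()}

# --- batch TD(0): sweep the fixed sample until the values converge ---
V = defaultdict(float)
alpha = 0.01                 # small step size (illustrative)
for _ in range(20_000):      # enough sweeps to converge on this sample
    for run in runs:
        for i, (s, r) in enumerate(run):
            v_next = V[run[i + 1][0]] if i + 1 < len(run) else 0.0
            V[s] += alpha * (r + v_next - V[s])

print(mc["A"], mc["B"])                     # 0.0 0.75  (option 1)
print(round(V["A"], 2), round(V["B"], 2))   # 0.75 0.75 (option 2)
```

MC leaves V(A) at the single observed return 0, while TD(0) bootstraps V(A) from its estimate of V(B) and converges to 3/4.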

Claim 8.2   Given a finite sample of runs, if we run TD(0) until it converges (each time sampling one of the runs uniformly at random), then the resulting estimate $\hat{V}$ equals the return of the policy in the empirical model derived from the sample.

Definition
Given a set of runs, the empirical model is defined as follows:

$$\hat{p}(s' \mid s, a) = \frac{\#(s,a,s')}{\#(s,a)}$$

and

$$\hat{r}(s,a) = \frac{1}{\#(s,a)} \sum_{t:\, s_t = s,\, a_t = a} r_t,$$

where $\#(s,a)$ is the number of times in the sample that we performed action a in state s and $\#(s,a,s')$ is the number of times that, after performing action a in state s, we reached state s'.
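As a sketch of this definition, the counts $\#(s,a)$ and $\#(s,a,s')$ and the resulting transition probabilities can be computed directly; the function name and trajectory format below are illustrative assumptions, not notation from the text:

```python
from collections import Counter

def empirical_model(trajectories):
    """Estimate p_hat(s' | s, a) from observed (s, a, s') triples."""
    n_sa, n_sas = Counter(), Counter()
    for traj in trajectories:
        for s, a, s2 in traj:
            n_sa[(s, a)] += 1        # #(s, a): times action a was taken in s
            n_sas[(s, a, s2)] += 1   # #(s, a, s'): times that led to s'
    # p_hat(s' | s, a) = #(s, a, s') / #(s, a)
    return {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}

p_hat = empirical_model([[("A", "go", "B")],
                         [("A", "go", "B")],
                         [("A", "go", "C")]])
print(p_hat[("A", "go", "B")])  # 2/3: B was reached in 2 of 3 tries
```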

It is easy to see from the definition of TD(0) that the updates depend only on successive pairs of states $(s_t, s_{t+1})$, together with the rewards observed between them.

The number of updates for state $s$ in which the next state is $s'$ is distributed according to $\#(s,a,s')$, and so at convergence:

$$\hat{V}(s) = \hat{r}(s,a) + \sum_{s'} \frac{\#(s,a,s')}{\#(s,a)}\, \hat{V}(s').$$

These are the Bellman equations for the model with $\hat{p}$ and $\hat{r}$, which is exactly the empirical model.
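For the eight-run sample above, this fixed point can be checked by hand; here is a short sketch (undiscounted, as in the example):

```python
# Value function of the empirical model built from the eight runs.
r_hat_B = 6 / 8      # average reward observed in state B: (0+1*6+0)/8
r_hat_A = 0          # the single visit to A had reward 0
V_B = r_hat_B        # every run terminates one step after B
V_A = r_hat_A + V_B  # the empirical model moves from A to B with probability 1
print(V_A, V_B)      # 0.75 0.75
```

This matches the TD(0) estimate from the second approach, as Claim 8.2 asserts.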
This is the conceptual difference between MC and TD(0):
MC uses only the return of the run from its initial state (received at the end of the run), while TD(0) assumes that there is an underlying Markovian model and makes use of this assumption.

Yishay Mansour
2000-01-06