Next: TD(λ) algorithm Up: No Title Previous: Online Versus Off-Line Updates

## Differences between TD(0) and MC

If we run TD(0) and MC for long enough (and give them enough data), both algorithms converge to the correct value function. However, on a finite input the two algorithms can give different results.

To understand the basic difference between MC and TD(0), we study the following example.
Assume that we performed a series of runs and obtained the following results:
1. (A,0) (B,0)
2. (B,1)
3. (B,1)
4. (B,1)
5. (B,1)
6. (B,1)
7. (B,1)
8. (B,0)
We can create an infinite sample from this finite sample as follows: each time we need to sample a run, we choose one of the eight runs uniformly at random.
We want to estimate V(A) and V(B) using this finite sample.
For V(B) we have 8 runs: 6 of them give return 1 and 2 of them give return 0. Our estimate is therefore $V(B) = 6/8 = 3/4$.
How should we estimate V(A)? There are two reasonable options.
1. We have one run visiting state A, and its return is 0, so we estimate V(A) = 0.
2. We have one run visiting state A, and this run moves from state A to state B. So we can assume that we always move from state A to state B, which implies the estimate $V(A) = V(B) = 3/4$.
Clearly the first approach corresponds to MC. We now show that TD(0) implements the second approach. That is, TD(0) evaluation is in fact done using the Markovian model produced by inspecting the input: for every state s and action a, the next-state distribution is identical to the one observed in the sample.
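The two estimates can be reproduced numerically. Below is a minimal Python sketch (the data layout, step size, and sweep count are illustrative choices, not from the text) that computes the MC averages and runs batch TD(0) sweeps over the eight runs, with undiscounted returns:

```python
from collections import defaultdict

# The eight runs from the example: lists of (state, reward) pairs.
runs = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# --- MC estimate: average the observed return from each visit ---
returns = defaultdict(list)
for run in runs:
    for i, (s, _) in enumerate(run):
        returns[s].append(sum(r for _, r in run[i:]))  # undiscounted return
mc = {s: sum(g) / len(g) for s, g in returns.items()}

# --- batch TD(0): sweep the fixed sample until the values converge ---
V = defaultdict(float)
alpha = 0.01                 # small step size (illustrative)
for _ in range(20_000):      # enough sweeps to converge on this sample
    for run in runs:
        for i, (s, r) in enumerate(run):
            v_next = V[run[i + 1][0]] if i + 1 < len(run) else 0.0
            V[s] += alpha * (r + v_next - V[s])

print(mc["A"], mc["B"])                     # 0.0 0.75  (option 1)
print(round(V["A"], 2), round(V["B"], 2))   # 0.75 0.75 (option 2)
```

MC leaves V(A) at the single observed return 0, while TD(0) bootstraps V(A) from its estimate of V(B) and converges to 3/4.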

Claim 8.2   Given a finite sample of runs, if we run TD(0) until it converges (each time sampling one of the runs uniformly at random), then the resulting estimate $\hat{V}$ equals the return of the policy in the empirical model derived from the sample.

Definition
Given a set of runs, the empirical model is defined as follows:

$$\hat{p}(s' \mid s, a) = \frac{\#(s,a,s')}{\#(s,a)}$$

and

$$\hat{r}(s,a) = \frac{1}{\#(s,a)} \sum_{t:\, s_t = s,\, a_t = a} r_t,$$

where $\#(s,a)$ is the number of times in the sample that we performed action a in state s and $\#(s,a,s')$ is the number of times that, after performing action a in state s, we reached state s'.
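As a sketch of this definition, the counts $\#(s,a)$ and $\#(s,a,s')$ and the resulting transition probabilities can be computed directly; the function name and trajectory format below are illustrative assumptions, not notation from the text:

```python
from collections import Counter

def empirical_model(trajectories):
    """Estimate p_hat(s' | s, a) from observed (s, a, s') triples."""
    n_sa, n_sas = Counter(), Counter()
    for traj in trajectories:
        for s, a, s2 in traj:
            n_sa[(s, a)] += 1        # #(s, a): times action a was taken in s
            n_sas[(s, a, s2)] += 1   # #(s, a, s'): times that led to s'
    # p_hat(s' | s, a) = #(s, a, s') / #(s, a)
    return {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}

p_hat = empirical_model([[("A", "go", "B")],
                         [("A", "go", "B")],
                         [("A", "go", "C")]])
print(p_hat[("A", "go", "B")])  # 2/3: B was reached in 2 of 3 tries
```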

It is easy to see from the definition of TD(0) that the updates depend only on successive pairs of states $(s_t, s_{t+1})$, together with the rewards observed between them.

The number of updates for state $s$ in which the next state is $s'$ is distributed according to $\#(s,a,s')$, and so at convergence:

$$\hat{V}(s) = \hat{r}(s,a) + \sum_{s'} \frac{\#(s,a,s')}{\#(s,a)}\, \hat{V}(s').$$

These are the Bellman equations for the model with $\hat{p}$ and $\hat{r}$, which is exactly the empirical model.
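For the eight-run sample above, this fixed point can be checked by hand; here is a short sketch (undiscounted, as in the example):

```python
# Value function of the empirical model built from the eight runs.
r_hat_B = 6 / 8      # average reward observed in state B: (0+1*6+0)/8
r_hat_A = 0          # the single visit to A had reward 0
V_B = r_hat_B        # every run terminates one step after B
V_A = r_hat_A + V_B  # the empirical model moves from A to B with probability 1
print(V_A, V_B)      # 0.75 0.75
```

This matches the TD(0) estimate from the second approach, as Claim 8.2 asserts.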
This is the conceptual difference between MC and TD(0):
MC uses only the return of the run from its initial state (received at the end of the run), while TD(0) assumes that there is an underlying Markovian model and makes use of this assumption.

Yishay Mansour
2000-01-06