Next: UPGMA Up: Distance Matrix Methods Previous: Distance between DNA Sequences

     
Least Squares Methods

One of the more statistically justified methods to approximate a distance matrix is the least squares approach. Our goal is to find a tree $T$ whose leaves are the $n$ given species and whose predicted distances $d_{ij}$ between the species minimize the following expression:

 \begin{displaymath}
SSQ(T) \equiv \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} (D_{ij} - d_{ij})^2
\end{displaymath} (9.6)


where $D_{ij}$ is the observed distance between species $i$ and $j$, and $w_{ij}$ are given weights. $SSQ(T)$ measures the discrepancy between the observed distances $D_{ij}$ and the distances $d_{ij}$ predicted by $T$. The weights are usually either all 1, or $w_{ij}=\frac{1}{D_{ij}^2}$.
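As a small illustration of expression (9.6), the following Python sketch evaluates the sum of squares for a given pair of matrices. The function names `ssq` and `inverse_square_weights`, and the nested-list matrix layout, are assumptions made for this example only. The inverse-square weighting also assumes that every off-diagonal observed distance is nonzero.

```python
def ssq(D, d, weights=None):
    """Weighted sum of squares over all ordered pairs i != j, as in (9.6).

    D: observed distances, d: distances predicted by a tree,
    weights: optional matrix w; if None, all weights are taken to be 1.
    """
    n = len(D)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the inner sum in (9.6) skips j == i
            w = 1.0 if weights is None else weights[i][j]
            total += w * (D[i][j] - d[i][j]) ** 2
    return total


def inverse_square_weights(D):
    """w_ij = 1 / D_ij^2 for i != j; the diagonal is set to 0 and never used."""
    n = len(D)
    return [[0.0 if i == j else 1.0 / D[i][j] ** 2 for j in range(n)]
            for i in range(n)]


# Toy data (made up for illustration): only the pair (0, 2) disagrees, by 1.
D_obs = [[0, 3, 5], [3, 0, 6], [5, 6, 0]]
d_pred = [[0, 3, 4], [3, 0, 6], [4, 6, 0]]

print(ssq(D_obs, d_pred))
print(ssq(D_obs, d_pred, inverse_square_weights(D_obs)))
```

Because the double sum runs over ordered pairs, each unordered pair of species contributes twice, so a single discrepancy of 1 yields an unweighted value of 2.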

Problem 9.10   Least Squares Tree.
INPUT: The distance $D_{ij}$ between species $i$ and $j$, for each $1 \leq i,j \leq n$, and a corresponding set of weights $w_{ij}$.
QUESTION: Find the phylogenetic tree T, with the species as its leaves, that minimizes SSQ(T).
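The problem statement leaves implicit how a tree predicts distances. A minimal sketch, assuming the common convention that $d_{ij}$ is the sum of branch lengths along the path between leaves $i$ and $j$ (an assumption of this example, not something stated above): the function name `tree_distances` and the edge-list representation are hypothetical.

```python
from collections import defaultdict


def tree_distances(edges, leaves):
    """Path-length distances between all leaves of a tree.

    edges: list of (u, v, branch_length) tuples describing the tree.
    leaves: list of leaf names (the species).
    Returns a dict of dicts, dist[a][b] = summed branch length from a to b.
    """
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))

    dist = {}
    for src in leaves:
        # Depth-first traversal from src; in a tree every node is reached
        # by exactly one path, so the first visit gives the path length.
        seen = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nbr, length in adj[node]:
                if nbr not in seen:
                    seen[nbr] = seen[node] + length
                    stack.append(nbr)
        dist[src] = {leaf: seen[leaf] for leaf in leaves}
    return dist


# Toy star tree (made up): internal node "x" joined to leaves A, B, C.
edges = [("x", "A", 1.0), ("x", "B", 2.0), ("x", "C", 3.0)]
pred = tree_distances(edges, ["A", "B", "C"])
print(pred["A"]["B"], pred["A"]["C"], pred["B"]["C"])
```

Under this convention, a candidate tree with branch lengths determines every $d_{ij}$, and $SSQ(T)$ can then be evaluated against the observed $D_{ij}$.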

In general, finding the least squares tree is an NP-complete problem [2]. We will discuss two polynomial-time heuristics: UPGMA and Neighbor-Joining. We have already studied these algorithms in lecture #5, where we used them to iteratively add one more string to a growing multiple alignment, thus obtaining a progressive alignment.
Itshack Pe`er
1999-02-18