Next: UPGMA Up: Distance Based Methods Previous: Distance between DNA Sequences

   
Least Squares Methods

One of the more statistically justified ways to approximate a distance matrix by a tree is the least squares approach. In this formulation we are given, for each pair of species $i$ and $j$, the measured distance $D_{i,j}$ between them and a weight $w_{i,j}$ that intuitively quantifies the accuracy of this measurement. Our goal is to find a tree $T$, whose leaves are the $n$ given species, that predicts distances $d_{i,j}$ between the species such that the following expression is minimized:

 \begin{displaymath}
SSQ(T) \equiv \sum_{i=1}^{n} \sum_{j \neq i} w_{ij} (D_{ij} - d_{ij})^2
\end{displaymath} (6)


The SSQ measures the discrepancy between the observed distances $D_{i,j}$ and the distances $d_{i,j}$ predicted by $T$. The weights $w_{i,j}$ are usually either all set to 1 or chosen as $w_{i,j}=\frac{1}{D_{i,j}^2}$.
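To make the definition concrete, here is a minimal Python sketch that evaluates the SSQ for a given pair of matrices. The function names `ssq` and `inverse_square_weights` and the 3-species matrices are hypothetical, chosen only for illustration; the sketch assumes the matrices are given as nested lists and that $D_{i,j} \neq 0$ for $i \neq j$ when the inverse-square weights are used.

```python
def ssq(D, d, weights=None):
    """Weighted sum of squared differences over all ordered pairs i != j.

    D       -- observed distances (n x n nested list)
    d       -- distances predicted by a tree (n x n nested list)
    weights -- optional n x n nested list; None means all weights are 1
    """
    n = len(D)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # the sum runs over j != i only
            w = 1.0 if weights is None else weights[i][j]
            total += w * (D[i][j] - d[i][j]) ** 2
    return total


def inverse_square_weights(D):
    """Weights w[i][j] = 1 / D[i][j]^2 (assumes D[i][j] != 0 for i != j)."""
    n = len(D)
    return [[0.0 if i == j else 1.0 / D[i][j] ** 2 for j in range(n)]
            for i in range(n)]


# Hypothetical example data: observed and predicted distances for 3 species.
D = [[0.0, 3.0, 5.0],
     [3.0, 0.0, 4.0],
     [5.0, 4.0, 0.0]]
d = [[0.0, 2.0, 5.0],
     [2.0, 0.0, 4.0],
     [5.0, 4.0, 0.0]]

unweighted = ssq(D, d)                             # all weights equal to 1
weighted = ssq(D, d, inverse_square_weights(D))    # w = 1 / D^2
```

Because the double sum visits every ordered pair, each unordered pair of species contributes twice; this constant factor does not change which tree minimizes the expression.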


\begin{problem}Least Squares Tree.\\
{\bf INPUT:} The distance $D_{i,j}$\space ...
...s length, with the species as its leaves,
that minimize $SSQ(T)$ .
\end{problem}

Again, a "small" version of this problem is formulated for a given tree: the topology is fixed, and we only try to minimize SSQ by choosing the branch lengths. In general, the "large" problem of finding the least squares tree is NP-complete [2]. We will discuss two polynomial-time heuristics, UPGMA and Neighbor-Joining. We have already studied these algorithms in lecture #5, where we used them to iteratively add one string at a time to a growing multiple alignment, thus obtaining a progressive alignment.
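One way to see why the "small" problem is tractable is to write it as ordinary weighted linear least squares. The notation below is an assumption introduced only for this sketch: $b$ is the vector of branch lengths of the fixed tree, $A$ is a 0/1 matrix with one row per pair of species and one column per branch, where the entry is 1 exactly when that branch lies on the path between the two species, $D$ is the vector of observed distances in the same row order, and $W$ is the diagonal matrix of the weights $w_{i,j}$. The predicted distances are then $d = Ab$, and

\begin{displaymath}
SSQ(T) = (D - Ab)^{T} W (D - Ab), \qquad A^{T} W A \, b = A^{T} W D
\end{displaymath}

where the second equality is the system of normal equations satisfied by any unconstrained minimizer $b$. This sketch ignores the requirement that branch lengths be non-negative; the unconstrained solution may contain negative entries, in which case that constraint has to be handled separately.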


Peer Itsik
2001-01-01