
   
UPGMA

Now that we can assign branch lengths to a given tree, as demonstrated above, we still need to minimize SSQ(T) over the possible tree topologies. UPGMA, the Unweighted Pair Group Method with Arithmetic mean [13], is a heuristic algorithm that usually generates satisfactory results. The algorithm iteratively joins the two nearest clusters (groups of species) until only one cluster is left.

UPGMA algorithm:
Let $d$ be the distance function between species. We define the distance $D_{i,j}$ between two clusters of species $C_i$ and $C_j$ as follows:

\begin{displaymath}D_{i,j} = \frac{1}{n_i n_j} \sum_{p\in C_i} \sum_{q\in C_j}d(p,q)
\end{displaymath}

where $n_i = |C_i|$ and $n_j = |C_j|$.
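The iterative joining described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the lecture: the names `upgma`, `names` and `dist`, the tuple representation of the tree, and the choice to place each new node at height $D_{i,j}/2$ are all assumptions made for the example. Cluster distances are taken to be the arithmetic mean of the pairwise species distances.

```python
from itertools import combinations


def upgma(names, dist):
    """Illustrative UPGMA sketch (names and data layout are assumptions).

    names: list of species labels.
    dist:  symmetric matrix (list of lists), dist[i][j] = d(names[i], names[j]).
    Returns the tree as nested tuples: a leaf is its label, an internal
    node is (left_subtree, right_subtree, height_above_leaves).
    """
    # Each cluster id maps to (subtree, number of species, height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Cluster distances D, keyed by the unordered pair of cluster ids.
    D = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)

    while len(clusters) > 1:
        # Find the two nearest clusters (full rescan, for simplicity).
        pair = min(D, key=D.get)
        i, j = sorted(pair)
        d_ij = D.pop(pair)
        (t_i, n_i, _), (t_j, n_j, _) = clusters.pop(i), clusters.pop(j)

        # Distance from the merged cluster to every remaining cluster k:
        # the size-weighted mean of the two old distances, which equals
        # the mean over all species pairs between the two clusters.
        for k in clusters:
            d_ik = D.pop(frozenset((i, k)))
            d_jk = D.pop(frozenset((j, k)))
            D[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

        # Assumed convention: the new node sits at half the cluster distance.
        height = d_ij / 2
        clusters[next_id] = ((t_i, t_j, height), n_i + n_j, height)
        next_id += 1

    (tree, _, _), = clusters.values()
    return tree
```

For clarity the sketch rescans all cluster pairs in every iteration, so it makes no attempt to be as efficient as possible; it is meant only to show the order of operations.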

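Complexity: The time and space complexity of UPGMA is $O(n^2)$, since there are $n-1$ iterations, with $O(n)$ work in each one. This bound requires that the closest pair of clusters be found with $O(n)$ work per iteration; an implementation that rescans all pairs of clusters in every iteration takes $O(n^3)$ time.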
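The $O(n)$ cost of updating the distances in one iteration can be seen from a short derivation. It assumes that $D$ is the arithmetic mean of the pairwise distances between two clusters and that $C_i$ and $C_j$ have just been merged into a cluster $(ij)$. For any other cluster $C_k$, with $n_k = |C_k|$:

\begin{displaymath}
D_{(ij),k} = \frac{1}{(n_i + n_j)\, n_k} \sum_{p\in C_i \cup C_j} \sum_{q\in C_k} d(p,q)
= \frac{n_i n_k D_{i,k} + n_j n_k D_{j,k}}{(n_i + n_j)\, n_k}
= \frac{n_i D_{i,k} + n_j D_{j,k}}{n_i + n_j}
\end{displaymath}

Each new distance is therefore obtained in constant time from two existing ones, and there are fewer than $n$ other clusters to update.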

A clocklike, or ultrametric, tree is a rooted tree in which the total branch length from the root to every leaf is the same. In other words, there is a ``molecular clock'' that ticks at a constant pace (i.e., the mutation rate is identical for all species), and all the observed species are an equal number of ticks away from the root (see also page [*]). If the minimum of the least squares problem is 0, and there is a molecular clock (i.e., the optimal tree is clocklike), then UPGMA is guaranteed to return the optimal solution. In fact, UPGMA implicitly assumes the existence of an ultrametric tree, which explains why the new node, (ij), is placed at the mean of the two nodes that were joined to create it, as shown in figure 8.8. It is therefore not surprising that for substantially non-clocklike trees the algorithm might give seriously misleading results.


  
Figure 8.8: A clocklike tree, showing the clustering (ab) of the two nodes a and b by UPGMA and by the Neighbor-Joining algorithm.
\includegraphics{lec08_figs/clocktree.ps}
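Whether a set of distances is consistent with a clocklike tree can be tested directly on the distance matrix. The sketch below uses the standard three-point condition, which is not stated in the text above and is brought in here as an assumption: distances fit an ultrametric tree exactly when, for every three species, the two largest of the three pairwise distances are equal. The function name `is_ultrametric` and the tolerance `tol` are illustrative choices.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Three-point condition check on a symmetric distance matrix.

    Returns True if, for every triple of species, the two largest of the
    three pairwise distances are equal (up to the tolerance tol).
    """
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances; only the top two must agree.
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True
```

A matrix that fails this check is one for which UPGMA's clock assumption does not hold exactly, so its output should be read with corresponding caution.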

Another assumption that UPGMA makes is additivity: in the ``real'' tree, the distance between two species is the sum of the branch lengths along the path between the corresponding leaves.

There are two corollaries of additivity that the next algorithm will use (see figure 8.9).


  
Figure 8.9: $d_{i,j}+d_{k,l} \le d_{i,k}+d_{j,l} = d_{i,l}+d_{j,k}$.
\includegraphics{lec08_figs/additivitycor.ps}
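The relation in figure 8.9 is commonly called the four-point condition: for any four leaves, of the three ways to pair them up and sum the two distances, the two largest sums are equal. A minimal Python sketch of a check for this condition follows; the function name `satisfies_four_point`, the matrix layout, and the tolerance `tol` are assumptions made for the example.

```python
from itertools import combinations


def satisfies_four_point(dist, tol=1e-9):
    """Four-point condition check on a symmetric distance matrix.

    For every four species i, j, k, l, form the three pairing sums and
    require the two largest to be equal (up to the tolerance tol).
    """
    n = len(dist)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((dist[i][j] + dist[k][l],
                       dist[i][k] + dist[j][l],
                       dist[i][l] + dist[j][k]))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

Distances read off a tree with positive branch lengths pass this check whether or not the tree is clocklike, which is why additivity is a weaker assumption than the molecular clock.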


Peer Itsik
2001-01-01