
   
Multiple Alignment

 
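\begin{dfn}{\rm
A {\em multiple alignment} of strings $S_1, S_2, \dots, S_k$ is a series of strings with blanks $S'_1, S'_2, \dots, S'_k$ such that:
\begin{enumerate}
\item $|S'_1| = |S'_2| = \dots = |S'_k|$.
\item $S'_j$ is an extension of $S_j$, obtained by insertion of blanks.
\end{enumerate}} \end{dfn}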


  
Figure 3.3: A multiple alignment of ACBCBD, CADDB and ACABCD.

We are interested in finding a common alignment of several sequences because such multiple similarity suggests a common structure of the protein product, a common function, or a common evolutionary source. A multiple alignment carries more information than a pairwise one, since a protein can be matched against a whole family of proteins rather than against a single other protein.

The best multiple alignment of $r$ sequences is calculated using an $r$-dimensional hyper-cube $D$, where $D(j_1,j_2,\dots,j_r)$ is the best (minimal) cost of aligning the prefixes of lengths $j_1,j_2,\dots,j_r$ of the sequences $x_1,x_2,\dots,x_r$, respectively.
We define

\begin{displaymath}D(0,0,\dots,0) = 0 \end{displaymath}


and we calculate

\begin{displaymath}D(j_1,j_2,\dots,j_r) = \min_{\epsilon \in \{0,1\}^r,\, \epsilon \neq 0} \{ D(j_1-\epsilon_1, j_2-\epsilon_2, \dots, j_r-\epsilon_r) +
\rho(\epsilon_1 x_{j_1},\dots,\epsilon_r x_{j_r})\}\end{displaymath}


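where $\rho$ is the cost function for a single column, and $\epsilon_i x_{j_i}$ stands for the $j_i$-th character of $x_i$ if $\epsilon_i = 1$ and for a blank if $\epsilon_i = 0$. The size of the hyper-cube is $O(\prod^{r}_{j=1}n_j)$, where $n_j$ is the length of $x_j$, and the computation of each entry considers $2^r - 1$ others.
If $n_1=n_2=\dots=n_r=n$, the space complexity is $O(n^r)$ and the time complexity is $O(2^r n^r)$.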
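The recurrence above can be sketched in Python as follows. This is a minimal illustration and not the lecture's own implementation: the blank symbol `GAP`, the function names, and the choice of $\rho$ (a unit-cost sum over all pairs of rows in a column) are assumptions made only for this example.

```python
from itertools import combinations, product

GAP = "-"  # assumed blank symbol


def column_cost(column):
    """Assumed rho: sum over all pairs of rows in one column.

    A pair costs 0 when both symbols are equal (two blanks included)
    and 1 otherwise.
    """
    return sum(a != b for a, b in combinations(column, 2))


def multiple_alignment_cost(seqs):
    """Fill the r-dimensional table D and return D(n_1, ..., n_r)."""
    r = len(seqs)
    # All non-zero 0/1 vectors epsilon: which sequences contribute a
    # character (1) or a blank (0) to the last column.
    moves = [e for e in product((0, 1), repeat=r) if any(e)]
    D = {(0,) * r: 0}
    # product() yields cells in lexicographic order, so every predecessor
    # (componentwise smaller cell) is filled before it is needed.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if not any(cell):
            continue  # the origin is already initialised to 0
        best = None
        for e in moves:
            prev = tuple(j - d for j, d in zip(cell, e))
            if min(prev) < 0:
                continue  # this move would step outside the hyper-cube
            column = [s[j - 1] if d else GAP for s, j, d in zip(seqs, cell, e)]
            cand = D[prev] + column_cost(column)
            if best is None or cand < best:
                best = cand
        D[cell] = best
    return D[tuple(len(s) for s in seqs)]
```

For two sequences this reduces to the usual unit-cost pairwise alignment. A call such as `multiple_alignment_cost(["ACBCBD", "CADDB", "ACABCD"])` fills a table of $7 \cdot 6 \cdot 7$ cells, each examining up to $2^3 - 1$ predecessors.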

Several useful measures are known for the divergence of a set of aligned strings, that is, for the total distance between them.

Carrillo and Lipman [3] found a heuristic method for accelerating the search for the best multiple alignment. The method is based on the property that if the strings are relatively similar, the alignment path stays close to the main diagonal, so not all the values in the multi-dimensional cube need to be calculated. We now detail this algorithm.

Assuming an upper bound on the cost of the best alignment, we discard alignments that are known a priori to be more expensive than this bound.

Let $A$ be an alignment of strings $x_1,x_2, \dots , x_r$. Denote by $A_{i,j}$ the pair of rows in $A$ containing only $x_i$ and $x_j$, and by $c(A_{i,j})$ the cost of this pairwise alignment. Denote by $c(A)$ the total cost of $A$, and suppose we define $c(A)=\sum_{i<j}c(A_{i,j})$. Let $A^*$ be the optimal alignment (the one with the minimal cost), and suppose we know that $c(A^*) \leq c'$. Therefore, for every pair $u<v$,

\begin{displaymath}c' \geq c(A^*) = \sum_{i<j}c(A^{*}_{i,j}) = c(A^{*}_{u,v}) + \sum_{i<j \, (i,j)\neq(u,v)} c(A^{*}_{i,j}) \geq
c(A^{*}_{u,v}) + \sum_{i<j \, (i,j)\neq(u,v)} D(x_i,x_j)\end{displaymath}


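where $D(x,y)$ is the optimal (minimal) cost of aligning strings $x$ and $y$. The last inequality holds because each $A^{*}_{i,j}$ is a pairwise alignment of $x_i$ and $x_j$, so its cost is at least $D(x_i,x_j)$. It follows that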

\begin{displaymath}c(A^{*}_{u,v}) \leq c' - \sum_{i<j \, (i,j)\neq(u,v)} D(x_i,x_j)\end{displaymath}
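To make the cost $c(A)=\sum_{i<j}c(A_{i,j})$ used in this derivation concrete, here is a small Python sketch that scores a given alignment. The unit costs (0 for equal characters, 1 for a mismatch or a character against a blank), the `-` blank symbol, and the function names are assumptions of this example; columns in which both rows hold a blank are skipped, since they do not belong to the induced pairwise alignment.

```python
from itertools import combinations


def induced_pair_cost(row_i, row_j, gap="-"):
    """Cost c(A_ij) of the pairwise alignment induced by two rows of A."""
    total = 0
    for a, b in zip(row_i, row_j):
        if a == gap and b == gap:
            continue  # column vanishes in the projection onto rows i, j
        total += 0 if a == b else 1  # assumed unit cost
    return total


def sum_of_pairs_cost(rows):
    """Total cost c(A): sum of c(A_ij) over all pairs i < j."""
    return sum(induced_pair_cost(a, b) for a, b in combinations(rows, 2))
```

For the toy alignment with rows `AC-B`, `A-CB` and `ACCB`, the three induced pairwise costs are 2, 1 and 1, so `sum_of_pairs_cost` returns 4.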


$A^{*}_{u,v}$ is the projection of $A^*$ on the $uv$-plane. By calculating $D(x_i,x_j)$ for each $i$ and $j$, we can find $B(u,v) = c' - \sum_{i<j \, (i,j)\neq(u,v)} D(x_i,x_j)$, which bounds $c(A^{*}_{u,v})$ from above.
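The bounds $B(u,v)$ can be computed for all pairs at once, as in the following sketch. It assumes unit-cost pairwise alignment (edit distance) for $D(x_i,x_j)$; the function names and the 0-based pair indices are choices of this example only, and the upper bound $c'$ is supplied by the caller.

```python
from itertools import combinations


def pairwise_cost(x, y):
    """Optimal unit-cost pairwise alignment cost D(x, y), row by row."""
    prev = list(range(len(y) + 1))  # aligning the empty prefix of x
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,             # a against a blank
                           cur[j - 1] + 1,          # b against a blank
                           prev[j - 1] + (a != b)))  # a against b
        prev = cur
    return prev[-1]


def projection_bounds(seqs, upper_bound):
    """Return {(u, v): B(u, v)} for all u < v (0-based indices)."""
    D = {(i, j): pairwise_cost(seqs[i], seqs[j])
         for i, j in combinations(range(len(seqs)), 2)}
    total = sum(D.values())
    # B(u, v) = c' minus the optimal costs of all pairs other than (u, v).
    return {uv: upper_bound - (total - d) for uv, d in D.items()}
```

For the strings `AB`, `AB` and `B` the pairwise costs are 0, 1 and 1, so with $c' = 3$ the bounds are $B = 1$ for the first pair and $B = 2$ for the other two.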

Now, consider a cell $(i_1 ,i_2 ,\dots ,i_u=s ,\dots ,i_v=t ,\dots ,i_r)$ whose projection to the $uv$-plane is $(s,t)$. If the best alignment $A^*$ passes through this cell, then its projection $A^*_{u,v}$ passes through $(s,t)$, and its cost satisfies $best^{(u,v)}_{s,t} \leq c(A^*_{u,v}) \leq B(u,v)$, where $best^{(u,v)}_{s,t}$ is the optimal cost of a pairwise alignment of $x_u$ and $x_v$ through $(s,t)$ in the $uv$-plane, and hence a lower bound on the cost of any such alignment. We can compute it as:

\begin{displaymath}best^{(u,v)}_{i,j} = D(x_{u,1} x_{u,2} \dots x_{u,i-1} , x_{v,1} x_{v,2} \dots x_{v,j-1}) + d(x_{u,i}, x_{v,j})
+ D(x_{u,i+1}\dots x_{u,n_u} , x_{v,j+1} \dots x_{v,n_v})\end{displaymath}


where $d(\kappa_1,\kappa_2)$ is the cost of matching the characters $\kappa_1$ and $\kappa_2$, and $x_{u,i}$ denotes the $i$-th character of $x_u$.
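All the values $best^{(u,v)}_{i,j}$ for one pair of strings can be obtained from two pairwise tables, one over prefixes and one over suffixes. The sketch below assumes unit costs for $d$ and for blanks; the suffix table is obtained by running the prefix computation on the reversed strings, which is an implementation choice of this example and not something stated in the text.

```python
def prefix_table(x, y):
    """F[i][j] = optimal unit-cost alignment of x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 or j == 0:
                F[i][j] = i + j  # only blanks can be inserted
            else:
                F[i][j] = min(F[i - 1][j] + 1,
                              F[i][j - 1] + 1,
                              F[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    return F


def best_through(x, y):
    """Return {(i, j): best_ij} for 1-based positions i in x and j in y.

    best_ij is the cost of the cheapest alignment that matches the
    i-th character of x against the j-th character of y.
    """
    F = prefix_table(x, y)
    # R[a][b] = cost of aligning the last a characters of x with the
    # last b characters of y (prefix table of the reversed strings).
    R = prefix_table(x[::-1], y[::-1])
    n, m = len(x), len(y)
    return {(i, j): F[i - 1][j - 1] + (x[i - 1] != y[j - 1]) + R[n - i][m - j]
            for i in range(1, n + 1) for j in range(1, m + 1)}
```

Both tables take $O(n_u n_v)$ time, after which each $best^{(u,v)}_{i,j}$ is a constant-time lookup.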

Therefore, if $best^{(u,v)}_{s,t} > B(u,v)$, then the best alignment $A^*$ cannot pass through the cell
$(i_1 ,i_2 ,\dots ,i_u=s ,\dots ,i_v=t ,\dots ,i_r)$ for any $i_1,i_2,\dots,i_{u-1},i_{u+1},\dots,i_{v-1},i_{v+1},\dots,i_r$, and these cells can be discarded from the computation.
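The discarding test can be written as a short predicate over the cells of the hyper-cube. In this sketch the inputs `best` and `bounds` are hypothetical dictionaries assumed to have been computed beforehand: `best[(u, v)][(s, t)]` holds $best^{(u,v)}_{s,t}$ and `bounds[(u, v)]` holds $B(u,v)$, with 0-based string indices $u<v$. Projections with no entry in `best` (for example those with $s=0$ or $t=0$) are conservatively never used to discard a cell.

```python
from itertools import combinations


def cell_survives(cell, best, bounds):
    """Return False if some uv-projection rules the cell out.

    cell   -- tuple (i_1, ..., i_r) of positions in the r strings
    best   -- assumed input: best[(u, v)][(s, t)] as defined in the text
    bounds -- assumed input: bounds[(u, v)] = B(u, v)
    """
    for u, v in combinations(range(len(cell)), 2):
        s, t = cell[u], cell[v]
        # Missing entries default to -infinity, so they never discard.
        lower = best[(u, v)].get((s, t), float("-inf"))
        if lower > bounds[(u, v)]:
            return False  # the optimal alignment cannot pass through here
    return True
```

Only cells for which `cell_survives` returns `True` need to be filled when computing $D$; how many cells remain depends on how tight the bound $c'$ is.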


Itshack Pe`er
1999-01-10