BLOSUM - BLOcks SUbstitution Matrix

The BLOSUM matrix is another amino acid substitution matrix, first calculated by Henikoff and Henikoff [5]. Only blocks of amino acid sequences that differ little from one another are used to calculate it. These blocks are called conserved blocks (see Figure 3.2). One reason for this restriction is that a multiple alignment of all the sequences is required, and such an alignment is easier to construct when the sequences are similar. Another reason is that the matrix is meant to measure the probability that one amino acid changes into another, whereas the differences between distant sequences may also include insertions and deletions of amino acids. Moreover, we are mainly interested in the conservation of regions within protein families, where the sequences are quite similar, and we therefore restrict our examination to such regions.

**Figure 3.2:** Alignment of several sequences. The conserved blocks are marked.

The first stage in building the BLOSUM matrix is to eliminate sequences that are identical in more than x% of their amino acid positions. This prevents the result from being biased in favor of a particular protein. The elimination is done either by removing sequences from the block, or by finding a cluster of similar sequences and replacing it with a single new sequence that represents the cluster. The matrix built from blocks with no more than x% similarity is called BLOSUM-x (e.g., the matrix built using sequences with no more than 50% similarity is called BLOSUM-50).
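The removal variant of this stage can be sketched as follows. This is a minimal illustration, not the procedure of [5]: it assumes the block is given as equal-length strings, measures similarity as the percentage of identical positions, and keeps rows greedily in input order. The names `percent_identity` and `filter_block` are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions at which two block rows agree."""
    if len(a) != len(b):
        raise ValueError("block rows must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def filter_block(rows: list[str], x: float) -> list[str]:
    """Greedily keep a row only if it is at most x% identical to every kept row.

    Assumption: rows are processed in input order, so the result can depend
    on that order. Replacing a cluster by a representative is not modeled.
    """
    kept: list[str] = []
    for row in rows:
        if all(percent_identity(row, k) <= x for k in kept):
            kept.append(row)
    return kept
```

With `x = 50`, a row that agrees with an already kept row in three of four positions (75%) is dropped, while a row sharing no positions with any kept row is retained.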

The second stage is to count the pairs of amino acids in each column of the multiple alignment. For example, a column containing the amino acids AABACA (as in the first column of the block in Figure 3.2) has four A's, one B and one C, and therefore contains 6 AA pairs, 4 AB pairs, 4 AC pairs, and one BC pair. From these counts we calculate the probability $q_{i,j}$ that a pair of amino acids in the same column is $A_i$ and $A_j$, as well as the probability $p_i$ that a given amino acid is $A_i$.
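The counting stage can be sketched in a few lines. The sketch assumes one particular convention, which the text leaves open: $q_{i,j}$ is treated as a probability over ordered pairs, so an unordered count for two different amino acids is split evenly between $(i,j)$ and $(j,i)$, and $p_i$ is taken as the sum of $q_{i,j}$ over $j$. The function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block: list[str]) -> Counter:
    """Count unordered residue pairs, column by column, in an aligned block."""
    counts: Counter = Counter()
    for column in zip(*block):                 # one tuple per alignment column
        for a, b in combinations(column, 2):   # every pair of rows in the column
            counts[tuple(sorted((a, b)))] += 1
    return counts


def frequencies(counts: Counter) -> tuple[dict, dict]:
    """Convert pair counts into ordered-pair probabilities q and residue probabilities p."""
    total = sum(counts.values())
    q: dict = {}
    for (a, b), n in counts.items():
        if a == b:
            q[(a, a)] = n / total
        else:
            # Assumed convention: split the unordered count between (a, b) and (b, a).
            q[(a, b)] = q[(b, a)] = n / (2 * total)
    p: dict = {}
    for (a, _), value in q.items():
        p[a] = p.get(a, 0.0) + value           # p_i is the row sum of q
    return q, p
```

For the single column AABACA, `count_pairs(["A", "A", "B", "A", "C", "A"])` reproduces the counts 6, 4, 4 and 1 given above, and under this convention `p` equals the plain residue frequencies in the column (4/6 for A, 1/6 each for B and C).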

In the third stage the log-odds ratio is calculated as $s_{i,j} = \log_2 \frac{q_{i,j}}{p_i p_j}$. The final result is $2 s_{i,j}$ rounded to the nearest integer; this value is stored in the $(i,j)$ entry of the BLOSUM-x matrix.
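A minimal sketch of this stage, assuming the probabilities are supplied as dictionaries: `q` maps ordered pairs of amino acids to $q_{i,j}$ and `p` maps each amino acid to $p_i$. The function name `blosum_scores` is hypothetical, and pairs that were never observed (for which the logarithm is undefined) are simply left out.

```python
import math


def blosum_scores(q: dict, p: dict) -> dict:
    """Return round(2 * log2(q_ij / (p_i * p_j))) for every pair present in q."""
    scores = {}
    for (a, b), q_ab in q.items():
        if q_ab <= 0:
            continue                      # log-odds undefined for unobserved pairs
        s = math.log2(q_ab / (p[a] * p[b]))
        scores[(a, b)] = round(2 * s)     # half-bit units, nearest integer
    return scores
```

As a toy example, with $p_A = 2/3$, $p_B = 1/6$, $q_{A,A} = 0.4$ and $q_{A,B} = 4/30$, the ratios are 0.9 and 1.2, so the stored entries are 0 for (A, A) and 1 for (A, B).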

More sequences are examined in computing the BLOSUM matrices than in computing the PAM matrices. Moreover, the sequences used are restricted to a specific degree of similarity, and therefore the two sets of matrices differ.

Two matrices are compared by calculating the ratio between the number of pairs of similar sequences detected by the first matrix but not by the second, and the number of pairs missed by the first but detected by the second. According to this comparison, BLOSUM-62 performs better than the other BLOSUM-x matrices and better than the PAM-x matrices.
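The comparison ratio can be written as a small set computation. This sketch assumes each matrix's detections are available as a set of sequence-pair identifiers; the function name and the handling of a zero denominator are choices made here for illustration, not part of the original evaluation.

```python
def detection_ratio(found_by_first: set, found_by_second: set) -> float:
    """Ratio of pairs found only by the first matrix to pairs found only by the second."""
    only_first = len(found_by_first - found_by_second)
    only_second = len(found_by_second - found_by_first)
    if only_second == 0:
        # Assumed convention: infinite advantage, or 1.0 when neither finds anything extra.
        return float("inf") if only_first else 1.0
    return only_first / only_second
```

A ratio above 1 means the first matrix detects more pairs that the second misses than the other way around.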