BLOSUM - BLOcks SUbstitution Matrix

The BLOSUM matrix is another amino acid substitution matrix, first calculated by *Henikoff* and *Henikoff* [5]. It is computed only from blocks of amino acid sequences
that differ little from one another. These blocks are called *conserved blocks* (see figure 3.2).
One reason for this restriction is that a multiple alignment of all these sequences is needed, and such an alignment is easier
to construct when the sequences are more similar. Another reason is that the purpose of the matrix is to measure
the probability that one amino acid changes into another, whereas the changes between distant sequences may also include
insertions and deletions of amino acids. Moreover, we are more interested in the conservation of
regions inside protein families, where sequences are quite similar, and therefore we restrict our examination to such regions.

The first stage of building the BLOSUM matrix is eliminating sequences that are identical in more than *x*% of
their amino acid sequence. This is done to avoid biasing the result in favor of a certain protein. The elimination
is done either by removing sequences from the block, or by finding a cluster of similar sequences and replacing it by a new sequence
that represents the cluster. The matrix built from blocks with no more than *x*% similarity is called BLOSUM-*x* (e.g. the matrix
built using sequences with no more than 50% similarity is called BLOSUM-50).
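
The first elimination option (removing sequences from the block) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure of [5]: the function names `percent_identity` and `filter_block`, the greedy order of removal, and the toy block are all hypothetical, and block segments are assumed to be ungapped and of equal length.

```python
def percent_identity(s, t):
    """Percentage of positions at which two equal-length segments agree."""
    if len(s) != len(t):
        raise ValueError("block segments must have equal length")
    matches = sum(a == b for a, b in zip(s, t))
    return 100.0 * matches / len(s)


def filter_block(block, x):
    """Greedily keep a segment only if it is at most x% identical
    to every segment that has already been kept."""
    kept = []
    for seq in block:
        if all(percent_identity(seq, k) <= x for k in kept):
            kept.append(seq)
    return kept


# Toy block (hypothetical data): the 2nd and 3rd segments are
# more than 62% identical to the 1st and are therefore dropped.
block = ["ACDEF", "ACDEF", "ACDEY", "WCHKF"]
filtered = filter_block(block, 62)
```

The result of the greedy variant depends on the order of the segments; the cluster-based option described above avoids this by replacing each group of similar sequences with one representative.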

The second stage is counting the pairs of amino acids in each column of the multiple alignment. For example, in a column
with the acids AABACA (as in the first column in the block in figure 3.2), there are 6 AA pairs, 4 AB pairs,
4 AC pairs, and 1 BC pair, for a total of 15 = 6·5/2 pairs.
From these counts, the probability *q*_{i,j} that a pair of amino acids in the same column is *A*_{i} and *A*_{j} is calculated,
as well as the probability *p*_{i} that an amino acid is *A*_{i}.
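
The counting stage can be sketched in Python as follows. This is a minimal illustration, assuming that pairs are unordered and that *p*_{i} is derived from the pair frequencies (each pair contributes half of its frequency to each of its two members); the function names `count_pairs` and `pair_and_residue_frequencies` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_pairs(column):
    """Count unordered amino acid pairs in one alignment column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        pairs[tuple(sorted((a, b)))] += 1
    return pairs


def pair_and_residue_frequencies(columns):
    """Return (q, p): pair frequencies q[(a, b)] and residue frequencies p[a]."""
    total = Counter()
    for column in columns:
        total.update(count_pairs(column))
    n_pairs = sum(total.values())
    q = {pair: count / n_pairs for pair, count in total.items()}
    # Each pair contributes half of its frequency to each member,
    # so an identical pair (a, a) contributes its full frequency to a.
    p = Counter()
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2
    return q, dict(p)


# The example column from the text.
counts = count_pairs("AABACA")
q, p = pair_and_residue_frequencies(["AABACA"])
```

For the column AABACA this reproduces the counts given in the text (6 AA, 4 AB, 4 AC, 1 BC), and the derived *p* equals the plain residue frequencies of the column (4/6 for A, 1/6 each for B and C).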

In the third stage the *log-odds ratio* is calculated as
*s*_{i,j} = log_2( *q*_{i,j} / (*p*_{i} *p*_{j}) ).
The final result is 2*s*_{i,j} rounded to the nearest integer; this value is stored in the (*i*,*j*) entry of the BLOSUM-*x* matrix.
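
The third stage can be sketched in Python as follows. This is a hedged illustration, not a definitive implementation: it assumes the score is a base-2 logarithm of the observed pair frequency over the frequency expected by chance, and that *q* is keyed by unordered pairs, so for *i* ≠ *j* the chance expectation is taken as 2 *p*_{i} *p*_{j} (both orders of the pair), which corresponds to *p*_{i} *p*_{j} per ordered pair. The function name `log_odds_scores` and the toy frequencies are hypothetical.

```python
import math


def log_odds_scores(q, p):
    """Return the rounded value 2 * s for every observed unordered pair.

    q: dict mapping sorted pair tuples (a, b) to pair frequencies
    p: dict mapping each residue to its frequency
    Pairs that were never observed (q = 0) are not in q and get no score.
    """
    scores = {}
    for (a, b), q_ab in q.items():
        # An unordered mixed pair covers both orders (a, b) and (b, a).
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        s = math.log2(q_ab / expected)
        scores[(a, b)] = round(2 * s)
    return scores


# Toy frequencies taken from the single column AABACA.
q = {("A", "A"): 6 / 15, ("A", "B"): 4 / 15,
     ("A", "C"): 4 / 15, ("B", "C"): 1 / 15}
p = {"A": 4 / 6, "B": 1 / 6, "C": 1 / 6}
scores = log_odds_scores(q, p)
```

A single column is far too little data to give meaningful scores; the toy input only shows the mechanics. In practice the frequencies are accumulated over all columns of all blocks before the logarithm is taken.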

More sequences are examined in computing the BLOSUM matrices than in computing the PAM matrices. Moreover, the sequences are selected for a specific kind of resemblance (conserved blocks), and therefore the two sets of matrices differ.

The efficiency of two matrices is compared by calculating the ratio between the number of pairs of similar sequences
discovered by the first matrix but not by the second, and the number of pairs missed by the first but
found by the second. According to this comparison, BLOSUM-62 was found to perform better
than the other BLOSUM-*x* matrices and than the PAM-*x* matrices.
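
The comparison ratio can be sketched in Python as follows, assuming that the pairs of similar sequences found with each matrix are available as sets. The function name `relative_efficiency`, the handling of an empty denominator, and the pair labels are hypothetical and serve only as an illustration.

```python
def relative_efficiency(found_by_first, found_by_second):
    """Ratio of pairs found only by the first matrix to pairs found
    only by the second; a value above 1 favors the first matrix."""
    only_first = found_by_first - found_by_second
    only_second = found_by_second - found_by_first
    if not only_second:
        # Assumption: report infinity when the second matrix finds
        # nothing the first one misses, and 1.0 when the sets agree.
        return float("inf") if only_first else 1.0
    return len(only_first) / len(only_second)


# Hypothetical sets of detected sequence pairs.
first = {("p1", "p2"), ("p1", "p3"), ("p2", "p4")}
second = {("p1", "p2"), ("p3", "p4")}
ratio = relative_efficiency(first, second)
```

Pairs found by both matrices cancel out of the ratio, so only the disagreements between the two matrices influence the result.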