
Using codon frequencies

In the model described above, the probability of a codon occurring depends on the preceding codon. We now consider a simpler model in which successive codons are independent. Let $f_{abc}$ denote the frequency with which the codon $abc$ occurs in coding regions. Given a coding sequence $a_1,b_1,c_1,a_2,b_2,c_2,\ldots,a_{n+1},b_{n+1}$ with an unknown reading frame, the probability of observing the $n$ codons of the reading frame that starts with $a_1b_1c_1$ is

\begin{displaymath}p_1=f_{a_1b_1c_1}\times
f_{a_2b_2c_2}\times\ldots\times f_{a_nb_nc_n} \end{displaymath}

Similarly, the probabilities of observing the $n$ codons in the second and third reading frames are:

\begin{eqnarray*}
p_2&=&f_{b_1c_1a_2}\times f_{b_2c_2a_3}\times\ldots\times f_{b_nc_na_{n+1}}\\
p_3&=&f_{c_1a_2b_2}\times f_{c_2a_3b_3}\times\ldots\times f_{c_na_{n+1}b_{n+1}}
\end{eqnarray*}
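The three products above can be sketched in Python as follows. This is only an illustration: the function name `frame_log_probs` and the table `toy_freq` are hypothetical, the frequencies in `toy_freq` are made up rather than measured, and the code sums logarithms instead of multiplying frequencies (an implementation choice, not something stated in the text) so that long windows do not underflow.

```python
import math
from itertools import product

def frame_log_probs(seq, freq, n):
    """Return [log p1, log p2, log p3] for n codons read in each frame.

    seq  : DNA string over A, C, G, T with at least 3*n + 2 bases
    freq : dict mapping each codon to a strictly positive frequency
    """
    if len(seq) < 3 * n + 2:
        raise ValueError("need at least 3*n + 2 bases to read n codons in every frame")
    logs = []
    for offset in range(3):            # offset 0, 1, 2 <-> frames 1, 2, 3
        total = 0.0
        for k in range(n):
            start = offset + 3 * k
            codon = seq[start:start + 3]
            total += math.log(freq[codon])   # log of a product = sum of logs
        logs.append(total)
    return logs

# Hypothetical toy table (NOT real codon usage): all codons equally
# frequent except "GCT", which is ten times more frequent.
toy_freq = {"".join(c): 1.0 for c in product("ACGT", repeat=3)}
toy_freq["GCT"] = 10.0
norm = sum(toy_freq.values())
toy_freq = {codon: value / norm for codon, value in toy_freq.items()}

log_p1, log_p2, log_p3 = frame_log_probs("GCTGCTGCTGC", toy_freq, 3)
```

With this toy table the first frame reads GCT three times, so `log_p1` is the largest of the three values.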


Let $P_i$ denote the probability that the $i$th reading frame is the coding frame, given that the region is coding. Assuming the three frames are equally likely a priori, $P_i$ can be calculated as follows:

\begin{displaymath}P_i=\frac{p_i}{p_1+p_2+p_3}
\end{displaymath}
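A minimal sketch of this normalization, assuming the three frame probabilities are supplied as logarithms (a convention chosen here to avoid underflow, not one prescribed by the text). The name `frame_posteriors` is hypothetical. Subtracting the largest log value before exponentiating leaves the ratio $p_i/(p_1+p_2+p_3)$ unchanged.

```python
import math

def frame_posteriors(log_p):
    """Map [log p1, log p2, log p3] to [P1, P2, P3], P_i = p_i / (p1 + p2 + p3)."""
    m = max(log_p)
    # exp(x - m) rescales every p_i by the same factor exp(-m),
    # which cancels between numerator and denominator.
    scaled = [math.exp(x - m) for x in log_p]
    total = sum(scaled)
    return [s / total for s in scaled]

P1, P2, P3 = frame_posteriors([math.log(0.2), math.log(0.1), math.log(0.1)])
```

Here the first frame is twice as probable as each of the others, so it receives half of the total probability.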

The above computation can be used in a search algorithm as follows: slide a window of $n$ codons along the sequence, and compute $P_i$ for each start position of the window. The Codon Preference program, which is part of the GCG library, implements this method.
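The sliding-window search can be sketched as below. This is not the GCG implementation, only an illustration of the procedure just described. The names `window_frame_posteriors` and `toy_freq` are hypothetical, the toy frequencies are invented, and the default step of three bases is an assumption made so that frame $i$ of every window refers to the same frame of the whole sequence.

```python
import math
from itertools import product

def window_frame_posteriors(seq, freq, n, step=3):
    """Slide a window of n codons along seq; return a list of (start, [P1, P2, P3])."""
    results = []
    window_len = 3 * n + 2             # enough bases for n codons in each frame
    for start in range(0, len(seq) - window_len + 1, step):
        log_p = []
        for offset in range(3):
            begin = start + offset
            log_p.append(sum(
                math.log(freq[seq[begin + 3 * k: begin + 3 * k + 3]])
                for k in range(n)
            ))
        m = max(log_p)                 # rescale before exponentiating
        weights = [math.exp(x - m) for x in log_p]
        total = sum(weights)
        results.append((start, [w / total for w in weights]))
    return results

# Hypothetical toy table (NOT real codon usage): "GCT" is ten times
# more frequent than every other codon.
toy_freq = {"".join(c): 1.0 for c in product("ACGT", repeat=3)}
toy_freq["GCT"] = 10.0
norm = sum(toy_freq.values())
toy_freq = {codon: value / norm for codon, value in toy_freq.items()}

scores = window_frame_posteriors("GCT" * 6 + "GC", toy_freq, n=3)
```

Each entry of `scores` pairs a window start position with the three values $P_1, P_2, P_3$ for that window; plotting them against the start position gives one curve per reading frame.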
  
Figure: Results of codon preference program [1].


The figure above shows the plot of $\log(P/(1-P))$, the log-odds score, for each of the three reading frames. Each point represents the score for a 25-codon window centered on it. The actual genes are plotted as rectangles at the bottom. We can see that in the reading frame matching the upper plot, the genes are clearly recognized.
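The plotted score can be computed from a frame probability $P$ as in the sketch below. The function name `log_odds` and the clamping constant `eps` are assumptions introduced here to keep the score finite when $P$ is exactly 0 or 1; they are not taken from the Codon Preference program.

```python
import math

def log_odds(p, eps=1e-12):
    """Return log(p / (1 - p)), clamping p to [eps, 1 - eps] to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

score = log_odds(0.75)   # positive: this frame is more likely coding than not
```

The score is zero when $P = 1/2$, positive when the frame is more likely than not to be the coding frame, and negative otherwise.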
  
Figure: Results of codon preference program - 3rd position bias [1].


The figure above shows the output of a program that uses only the third-position bias information. Both methods depend on the accuracy of the codon frequency statistics collected from genes that have already been identified. They will also have difficulty detecting genes acquired by horizontal gene transfer, and coping with other causes of heterogeneity.
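Since the methods rest on codon statistics from known genes, the following sketch shows one way such a table $f_{abc}$ might be estimated. The function name `estimate_codon_freq` is hypothetical, the input is assumed to be a list of in-frame coding sequences, and the pseudocount is an added assumption (not part of the method described here) that keeps every frequency positive so its logarithm is defined.

```python
from collections import Counter
from itertools import product

def estimate_codon_freq(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences.

    Every one of the 64 codons starts with `pseudocount` observations
    (an assumption of this sketch), so no frequency is zero.
    """
    codons = ["".join(c) for c in product("ACGT", repeat=3)]
    valid = set(codons)
    counts = Counter()
    for seq in coding_seqs:
        usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
        for k in range(0, usable, 3):
            codon = seq[k:k + 3]
            if codon in valid:                # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(codons)
    return {c: (counts[c] + pseudocount) / total for c in codons}

freq = estimate_codon_freq(["ATGGCTGCT"])
```

A table estimated from few genes, or from genes unlike those being searched for, will give unreliable scores, which is the dependence noted above.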
Peer Itsik
2000-12-25