
   
Finding Long ORFs

One way to distinguish coding regions from non-coding regions is to look at the frequency of stop codons. Under a uniform random model of the sequence, 3 of the 64 possible codons are stop codons, so a stop codon is expected about once every $64/3\approx 21$ codons. An average protein is much longer: it is coded for by about 1000bp (base pairs), i.e. roughly 330 codons. Each coding region contains only one in-frame stop codon, the one that terminates it. Therefore, one way to detect coding regions is to look for long runs of codons with no stop codon, known as long open reading frames (ORFs). The algorithm based on this idea scans the DNA sequence, looking for long ORFs in all three reading frames. After detecting a stop codon, it scans backward, searching for a start codon. This algorithm fails to detect very short genes, and it also will not identify overlapping long ORFs on opposite strands.
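
The scan described above can be sketched in Python. This is only an illustration under stated assumptions, not the implementation used in the lecture: the start codon is assumed to be ATG, the stop codons are assumed to be TAA, TAG and TGA, and the names `find_long_orfs`, `stop_free_probability` and the threshold `min_codons` are hypothetical. Only the given strand is scanned, so ORFs on the opposite strand are not reported.

```python
# Assumed codon sets (standard genetic code); not specified in the text above.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def stop_free_probability(k):
    """Probability that k codons contain no stop codon under the uniform random model."""
    return (61 / 64) ** k


def find_long_orfs(seq, min_codons=100):
    """Return (frame, start, end) for each ORF with at least min_codons codons.

    start is the index of the start codon, end is the index just past the
    stop codon, and the codon count excludes the stop codon itself.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        # First position after the previous stop codon in this frame.
        segment_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] not in STOP_CODONS:
                continue
            # Scan backward from the stop codon, keeping the most
            # upstream start codon in this stop-free stretch.
            start = None
            for back in range(pos - 3, segment_start - 1, -3):
                if seq[back:back + 3] == START_CODON:
                    start = back
            if start is not None and (pos - start) // 3 >= min_codons:
                orfs.append((frame, start, pos + 3))
            segment_start = pos + 3
    return orfs


if __name__ == "__main__":
    # Toy sequence: one 6-codon ORF (ATG plus five GCT codons) in frame 2.
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
    print(find_long_orfs(toy, min_codons=5))
    # Chance that a random stretch as long as an average gene has no stop codon.
    print(stop_free_probability(333))
```

Keeping the most upstream start codon is a design choice made here for simplicity; the true start of a gene may be a later ATG within the same stretch. The threshold `min_codons` trades sensitivity for specificity: under the random model a stop-free run of a few hundred codons is very unlikely, which is why long ORFs are treated as candidate genes.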

Itshack Pe'er
1999-02-03