
Markov Sequence Models

Several models for distinguishing coding regions from non-coding regions use Markov chains. These models exploit statistical differences between coding and non-coding regions.
A popular model examines windows of 6 consecutive bases in the DNA sequence, which makes it a 5th-order Markov model. We prepare two probability tables in advance, one for coding regions and one for non-coding regions. Each table has 4^6 = 4096 entries: for each 6-tuple of bases, the table records the probability of observing the 6th base given that the 5 preceding bases appeared in the window. Given a sequence, we estimate the likelihood of it being coding using these two tables.
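The table-based scoring described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the implementation used by any particular gene finder: the function names `train_table` and `log_odds`, the add-one pseudocount, and the use of a log-odds ratio to compare the two tables are all choices made here for illustration, and the training sequences are made-up toy data.

```python
import math
from collections import defaultdict
from itertools import product

ORDER = 5        # 5th-order model: condition on the 5 preceding bases
BASES = "ACGT"


def train_table(sequences, order=ORDER, pseudocount=1.0):
    """Estimate P(6th base | 5 preceding bases) from training sequences.

    Returns a dict with one entry per 6-tuple (4**6 = 4096 entries).
    The pseudocount is an assumption added here so that unseen
    6-tuples do not get probability zero.
    """
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for seq in sequences:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous symbols such as N.
            if set(context + base) <= set(BASES):
                counts[context][base] += 1
    table = {}
    for context in map("".join, product(BASES, repeat=order)):
        row = counts[context]
        total = sum(row.values()) + pseudocount * len(BASES)
        for base in BASES:
            table[context + base] = (row[base] + pseudocount) / total
    return table


def log_odds(seq, coding, noncoding, order=ORDER):
    """Sum of log(P_coding / P_noncoding) over every 6-base window.

    Positive values favour the coding table, negative values the
    non-coding table.
    """
    score = 0.0
    for i in range(order, len(seq)):
        hexamer = seq[i - order:i + 1]
        if hexamer in coding:  # ignores windows with ambiguous bases
            score += math.log(coding[hexamer] / noncoding[hexamer])
    return score


# Toy training data (hypothetical, for illustration only).
coding_table = train_table(["ATGGCC" * 50])
noncoding_table = train_table(["ATATTA" * 50])
print(log_odds("ATGGCCATGGCC", coding_table, noncoding_table))
```

In practice the two tables would be estimated from large sets of annotated coding and non-coding sequences, and the score would be computed over a sliding window along the genome.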
This model does not take any reading frame information into account and is therefore called a homogeneous model. A non-homogeneous model is a model that has different tables for the 3 possible reading frames. The problem with such models when dealing with eukaryotic genomes is that the exons are sometimes too short and that splice junctions (donor and acceptor sites) are hard to detect.
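A frame-dependent variant can be sketched along the same lines. The sketch below keeps three tables, one per codon position of the predicted base, and scores a sequence under each of the three possible reading frames. The names `train_frame_tables` and `frame_log_likelihoods`, the pseudocount, and the assumption that every training sequence starts at the first position of a codon are all illustrative choices, not part of the model description above.

```python
import math
from collections import defaultdict
from itertools import product

ORDER = 5
BASES = "ACGT"


def train_frame_tables(coding_seqs, order=ORDER, pseudocount=1.0):
    """Build three tables, one per codon position (0, 1, 2).

    Table k holds P(base | 5 preceding bases) for bases that sit at
    codon position k. Assumes (for this sketch) that every training
    sequence starts at codon position 0.
    """
    counts = [defaultdict(lambda: dict.fromkeys(BASES, 0.0)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set(BASES):
                counts[i % 3][context][base] += 1
    tables = []
    for k in range(3):
        table = {}
        for context in map("".join, product(BASES, repeat=order)):
            row = counts[k][context]
            total = sum(row.values()) + pseudocount * len(BASES)
            for base in BASES:
                table[context + base] = (row[base] + pseudocount) / total
        tables.append(table)
    return tables


def frame_log_likelihoods(seq, tables, order=ORDER):
    """Log-likelihood of seq under each of the 3 frame hypotheses.

    Hypothesis f assumes that seq[0] sits at codon position f, so the
    base at index i is scored with table (i + f) % 3.
    """
    scores = []
    for f in range(3):
        total = 0.0
        for i in range(order, len(seq)):
            hexamer = seq[i - order:i + 1]
            table = tables[(i + f) % 3]
            if hexamer in table:  # ignores windows with ambiguous bases
                total += math.log(table[hexamer])
        scores.append(total)
    return scores


# Toy example (hypothetical data): the best-scoring frame is the argmax.
tables = train_frame_tables(["ATGGCC" * 50])
scores = frame_log_likelihoods("TGGCCATGGCCA", tables)
print(scores.index(max(scores)))
```

Because each of the three tables is estimated from only a third of the training positions, and because a short exon contributes few windows, the frame scores become noisy for short exons, which is one way to see the difficulty mentioned above.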

Peer Itsik
2000-12-25