Next: GENSCAN Up: Gene Finding Previous: Gene Structure in Eukaryotes

   
A Probabilistic Model of Gene Structure

As we have seen, a Hidden Markov Model (HMM) is a Markov chain in which the states are not directly observable. Instead, the output of the current state is observable. The output symbol for each state is randomly chosen from a finite output alphabet according to some probability distribution. A Generalized Hidden Markov Model (GHMM) generalizes the HMM as follows: in a GHMM, the output of a state need not be a single symbol; it may be a string of finite length. For a given current state, both the length of the output string and the output string itself are randomly chosen according to some probability distribution. The probability distribution need not be the same for all states. For example, one state might use a weight matrix model for generating the output string, while another might use an HMM. Formally a GHMM is described by a set of four parameters:
  
Figure 7.9: GHMM model describing the eukaryotic gene. E states correspond to exons, while I states correspond to introns.
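
The generative view of a GHMM described above can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the two state names, the length distributions, the base compositions and the helper names (draw, sample_parse) are all invented for illustration and are not part of the gene model in the figure. Each visited state first draws a length, then emits a string of that length from its own model, and only then moves to the next state.

```python
import random

# Toy GHMM. Every state name and number below is an illustrative assumption.
INITIAL = {"intergenic": 1.0, "exon": 0.0}
TRANSITIONS = {
    "intergenic": {"exon": 1.0},
    "exon": {"intergenic": 1.0},
}
# Length distribution of each state: duration d -> probability of d.
LENGTHS = {
    "intergenic": {2: 0.5, 3: 0.5},
    "exon": {3: 0.7, 6: 0.3},
}
# Output model of each state, here a simple i.i.d. base composition.
EMISSIONS = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def draw(dist, rng):
    """Draw one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def sample_parse(num_states, rng):
    """Visit `num_states` states; return the (state, duration) list and the output."""
    parse, pieces = [], []
    state = draw(INITIAL, rng)
    for _ in range(num_states):
        d = draw(LENGTHS[state], rng)  # the length of the output string comes first
        segment = "".join(draw(EMISSIONS[state], rng) for _ in range(d))
        parse.append((state, d))
        pieces.append(segment)
        state = draw(TRANSITIONS[state], rng)  # then move to the next state
    return parse, "".join(pieces)


parse, seq = sample_parse(4, random.Random(0))
```

An observer sees only seq; the list of (state, duration) pairs is hidden. An ordinary HMM is the special case in which every duration equals 1.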

The probabilistic model for gene structure suggested by Burge and Karlin [1] is based on a GHMM (see figure 7.9). The states of the GHMM correspond to the different functional units of a gene, such as promoter regions, exons and introns. The transitions between the states ensure that the order in which the model visits the various states is biologically consistent. The states for an intron and an internal exon are subdivided according to phase, that is, the offset relative to the codon frame. For $0\leq i\leq 2$, the state $I_i$ (respectively, $E_i$) corresponds to introns (exons) starting $i$ positions after a codon starts. Note that the only transition from $I_i$ to any internal exon state is to $E_i$. Also note that the model is divided into two symmetric halves. The upper half of the figure (states with a "+" superscript) models a gene on the forward strand, and the lower half models a gene on the backward strand of the genomic sequence. If the parameters (like $\pi$, $a_{i,j}$, etc.) are suitably determined, then the model can be used for gene structure prediction in the following manner.

Definition 7.1   A parse $\Phi$ of a sequence $S$ is an ordered sequence of states $(q_1,\dots ,q_t)$ with an associated duration $d_i$ for each state $q_i$. The length of $\Phi$ is $L=\sum_{i=1}^{t} d_i$.

Suppose we are given a DNA sequence $S$ and a parse $\Phi_i$, both of length $L$. The conditional probability of the parse $\Phi_i$, given that the generated sequence is $S$, can be computed as:

\begin{displaymath}P(\Phi_i\vert S)=\frac{P(\Phi_i,S)}{P(S)}=\frac{P(\Phi_i,S)}{\sum\limits_{\Phi_j \mbox{ is a parse of length } L}P(\Phi_j,S)}
\end{displaymath}

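Let $S_k$ be the segment of $S$ produced by $q_k$, and let $P(S_k\vert q_k,d_k)$ be the probability of generating $S_k$ by the sequence generation model of state $q_k$ with length $d_k$. Writing $\pi_{q_1}$ for the initial probability of state $q_1$, $T_{q_{k-1}q_k}$ for the transition probability from $q_{k-1}$ to $q_k$, and $f_{q_k}(d_k)$ for the probability that state $q_k$ has duration $d_k$, the joint probability of the parse and the sequence is:

\begin{displaymath}P(\Phi_i,S)=\pi_{q_1}f_{q_1}(d_1)P(S_1\vert q_1,d_1)\prod_{k=2}^{t}T_{q_{k-1}q_k}f_{q_k}(d_k)P(S_k\vert q_k,d_k)
\end{displaymath}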
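
As an illustration of how such a product over the states of a parse is evaluated, here is a minimal sketch. It assumes that the factor for the first state uses an initial probability $\pi_{q_1}$ in place of a transition probability, and it uses a hypothetical two-state model (N for non-coding, E for exon) whose names and numbers are invented for the example. The per-state sequence model is a plain i.i.d. base composition standing in for whatever model a real state would use.

```python
import math

# Hypothetical two-state toy model; all names and numbers are assumptions.
PI = {"N": 1.0, "E": 0.0}                    # initial probabilities pi_q
T = {"N": {"E": 1.0}, "E": {"N": 1.0}}       # transition probabilities T_{q,q'}
F = {"N": {1: 0.5, 2: 0.5}, "E": {3: 1.0}}   # length distributions f_q(d)
BASE = {                                     # i.i.d. sequence model per state
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def segment_prob(state, segment):
    """P(S_k | q_k, d_k) under the toy i.i.d. model of `state`."""
    return math.prod(BASE[state][base] for base in segment)


def joint_prob(parse, seq):
    """P(Phi, S) for parse = [(q_1, d_1), ..., (q_t, d_t)]."""
    if sum(d for _, d in parse) != len(seq):
        raise ValueError("parse and sequence must have the same length")
    prob, pos, prev = 1.0, 0, None
    for state, d in parse:
        # The first state is entered with pi, every later one with a transition.
        enter = PI[state] if prev is None else T[prev].get(state, 0.0)
        prob *= enter * F[state].get(d, 0.0) * segment_prob(state, seq[pos:pos + d])
        pos += d
        prev = state
    return prob


p = joint_prob([("N", 1), ("E", 3), ("N", 2)], "AGCGTT")
```

For sequences of realistic length the product underflows, so a practical version would sum logarithms instead of multiplying probabilities.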

The most probable parse, $\Phi_{opt}$, can be computed by a Viterbi-like algorithm, and $P(S)$ can be computed by a forward-like algorithm.
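
Both computations can be sketched as dynamic programs over (end position, last state) pairs. The code below is a simplified illustration, not the GENSCAN implementation: it works on a hypothetical two-state toy model with invented parameters, uses plain probabilities rather than logarithms, and tries every allowed duration of the last segment at each position.

```python
import math

# Hypothetical toy GHMM; all names and numbers are illustrative assumptions.
PI = {"N": 1.0, "E": 0.0}
T = {"N": {"E": 1.0}, "E": {"N": 1.0}}
F = {"N": {1: 0.5, 2: 0.5}, "E": {3: 1.0}}
BASE = {
    "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def segment_prob(state, segment):
    """P(S_k | q_k, d_k) under the toy i.i.d. model of `state`."""
    return math.prod(BASE[state][b] for b in segment)


def forward(seq):
    """P(S): the sum of P(Phi, S) over all parses Phi of length len(seq)."""
    n = len(seq)
    # alpha[j][q]: probability of seq[:j], summed over parses whose last state q ends at j
    alpha = [{q: 0.0 for q in PI} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for q in PI:
            for d, fd in F[q].items():
                if d > j:
                    continue
                emit = fd * segment_prob(q, seq[j - d:j])
                if d == j:  # q is the first state of the parse
                    enter = PI[q]
                else:
                    enter = sum(alpha[j - d][p] * T[p].get(q, 0.0) for p in PI)
                alpha[j][q] += enter * emit
    return sum(alpha[n].values())


def viterbi(seq):
    """Most probable parse Phi_opt and its joint probability P(Phi_opt, S)."""
    n = len(seq)
    # best[j][q] = (probability, (previous state, duration of q))
    best = [{q: (0.0, None) for q in PI} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for q in PI:
            for d, fd in F[q].items():
                if d > j:
                    continue
                emit = fd * segment_prob(q, seq[j - d:j])
                if d == j:
                    cand = (PI[q] * emit, (None, d))
                else:
                    p = max(PI, key=lambda r: best[j - d][r][0] * T[r].get(q, 0.0))
                    cand = (best[j - d][p][0] * T[p].get(q, 0.0) * emit, (p, d))
                if cand[0] > best[j][q][0]:
                    best[j][q] = cand
    q = max(PI, key=lambda r: best[n][r][0])
    prob = best[n][q][0]
    if prob == 0.0:
        return [], 0.0  # no parse of this length has positive probability
    parse, j = [], n
    while q is not None:  # follow the back-pointers to recover the parse
        _, (p, d) = best[j][q]
        parse.append((q, d))
        j, q = j - d, p
    return parse[::-1], prob


p_s = forward("AGCGTT")
best_parse, best_prob = viterbi("AGCGTT")
posterior = best_prob / p_s  # P(Phi_opt | S)
```

On this toy input only two parses have positive probability, and the better one receives a posterior of about 0.8. The running time grows with the sequence length times the number of allowed durations, which is why bounding or restricting the durations matters in practice.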
Itshack Pe'er
1999-02-03