Next: Signal models Up: GENSCAN Previous: Transition probabilities

State length distributions

Different functional units of a gene have vastly different lengths. For example, an average internal exon is about 150 bp long, while introns on the order of 1 kbp are not uncommon. Thus, in our probabilistic model of gene structure, different states need different length distributions.
Intron lengths are known to vary dramatically with the C+G content category. For example, the mean intron length for category I (< 43% C+G) of the training set is 2069 bp, as opposed to only 518 bp for category IV (> 57% C+G) (see Figure [*]). Thus, the program uses separate distributions for intron states in each category.
The training set shows quite different length distributions for initial, internal and terminal exons, so a separate distribution is used for each. Note that the length of an internal exon must be consistent with the phases of its adjacent introns. For example, if the preceding state is I2 and the succeeding state is I1, then the generated internal exon length (for state E2 in this case) must be 3n+2 for some integer n. Therefore, n is first generated randomly according to the length distribution, and a string of length 3n+2 is then generated according to the string-generating model for that state.
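The phase constraint on internal exon lengths can be illustrated with a small sketch. It assumes that an intron of phase k is preceded by k bases of an incomplete codon, which reproduces the 3n+2 example above; the function names and the placeholder sampler for n are hypothetical and are not taken from GENSCAN itself.

```python
import random


def exon_length_remainder(phase_in: int, phase_out: int) -> int:
    """Remainder (mod 3) that an internal exon length must have.

    Assumption: an intron of phase k is preceded by k bases of an
    incomplete codon, so the exon first completes the codon left open
    by the preceding intron and ends with phase_out extra bases.
    """
    return (phase_out - phase_in) % 3


def sample_internal_exon_length(phase_in, phase_out, sample_n, rng):
    """Draw n from the length distribution, then return 3n + r."""
    r = exon_length_remainder(phase_in, phase_out)
    n = sample_n(rng)  # n comes from the (here hypothetical) length distribution
    return 3 * n + r


def placeholder_sample_n(rng):
    """Stand-in for the real length distribution (illustrative only)."""
    return rng.randint(10, 90)


rng = random.Random(0)
# Preceding state I2, succeeding state I1: the length must be 3n+2.
length = sample_internal_exon_length(2, 1, placeholder_sample_n, rng)
print(length, length % 3)
```

In a real model, the placeholder sampler would be replaced by the length distribution estimated from the training set.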
For the 5' UTR and 3' UTR states, geometric distributions with mean values of 769 bp and 457 bp, respectively, are used.
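As a rough illustration of drawing UTR lengths from a geometric distribution with a given mean, the sketch below uses inverse-transform sampling. It assumes the geometric distribution is defined on lengths 1, 2, 3, ... with parameter p = 1/mean; the notes do not state which parameterization the program uses, and the function name is hypothetical.

```python
import math
import random


def sample_geometric_length(mean_bp: float, rng: random.Random) -> int:
    """Sample a length >= 1 from a geometric distribution with the given mean.

    Assumption: P(L = k) = (1 - p)**(k - 1) * p for k = 1, 2, ...,
    so that the mean is 1/p.
    """
    p = 1.0 / mean_bp
    u = rng.random()  # uniform in [0, 1), hence 1 - u lies in (0, 1]
    return 1 + int(math.log(1.0 - u) / math.log(1.0 - p))


rng = random.Random(0)
five_prime_utr = sample_geometric_length(769, rng)  # 5' UTR, mean 769 bp
three_prime_utr = sample_geometric_length(457, rng)  # 3' UTR, mean 457 bp
print(five_prime_utr, three_prime_utr)
```

A geometric length distribution is what a state with a fixed self-transition probability produces, which is why only the mean needs to be specified.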
Peer Itsik
2000-12-25