Next: A Probabilistic Model of Up: Gene Finding Previous: Detecting Promoter Regions

   
Gene Structure in Eukaryotes

The gene structure and the gene expression mechanism in eukaryotes are far more complicated than in prokaryotes. In typical eukaryotes, the region of the DNA coding for a protein is usually not continuous. Instead, it is composed of alternating stretches of exons and introns. During transcription, both exons and introns are transcribed onto the RNA, in their linear order. Thereafter, a process called splicing takes place, in which the intron sequences are excised from the RNA and discarded. The remaining RNA segments, the ones corresponding to the exons, are ligated to form the mature RNA strand.

A typical multi-exon gene has the following structure (as illustrated in figure 7.5). It starts with the promoter region, which is followed by a transcribed but non-coding region called the 5' untranslated region (5' UTR). Then follows the initial exon, which contains the start codon. Following the initial exon there is an alternating series of introns and internal exons, followed by the terminal exon, which contains the stop codon. It is followed by another non-coding region called the 3' UTR. The gene ends with a polyadenylation (polyA) signal, which directs the addition of a run of adenine nucleotides (the polyA tail) to the end of the transcript. The exon-intron boundaries (i.e., the splice sites) are signalled by specific short (2bp long) sequences. The 5' end of an intron (equivalently, the 3' end of the exon preceding it) is called the donor site, and the 3' end of an intron (equivalently, the 5' end of the exon following it) is called the acceptor site.

The problem of gene identification is complicated in the case of eukaryotes by the vast variation that is found in gene structure. To appreciate this variation, we shall consider some statistics from the available genomic data. On average, a vertebrate gene is around 30Kb long, out of which the coding region is only about 1Kb long. The average coding region consists of 6 exons, each about 150bp long. Huge deviations from the average are observed. For example, the gene called dystrophin is 2.4Mb long.
Blood coagulation factor VIII has 26 exons whose sizes vary from 69bp to 3106bp; the gene as a whole spans about 186Kb, and its introns are as long as 32.4Kb. Intron number 22 produces 2 transcripts unrelated to this gene, one for each strand. An average 5' UTR is 750bp long, but it can be longer and span several exons (for example, in the MAGE family). On average, the 3' UTR is about 450bp long, but examples exist where its length exceeds 5Kb (e.g., the gene for Kallmann's syndrome).
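
The splicing step described in this section can be illustrated with a minimal Python sketch. It is not part of the original notes and rests on several assumptions: the toy sequence and the exon coordinates are invented, the transcript is written in the DNA alphabet (T instead of U), and the 2bp splice-site signals are taken to be the canonical GT at the donor site and AG at the acceptor site, which the text above does not name. The function names `splice`, `introns_between` and `has_canonical_sites` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based start, end-exclusive


def splice(transcript: str, exons: List[Interval]) -> str:
    """Excise the introns by joining the exon intervals in linear order."""
    return "".join(transcript[start:end] for start, end in exons)


def introns_between(exons: List[Interval]) -> List[Interval]:
    """Each intron is the gap between two consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def has_canonical_sites(transcript: str, intron: Interval) -> bool:
    """Assumed signals: intron starts with GT (donor), ends with AG (acceptor)."""
    start, end = intron
    return transcript[start:start + 2] == "GT" and transcript[end - 2:end] == "AG"


# Invented toy transcript: initial exon, one intron, terminal exon.
initial_exon = "ATGGCC"        # contains the start codon ATG
intron_seq = "GTAAGTTTTCAG"    # begins with GT, ends with AG
terminal_exon = "GATTGA"       # contains the stop codon TGA
transcript = initial_exon + intron_seq + terminal_exon

exons = [(0, 6), (18, 24)]
introns = introns_between(exons)
mature = splice(transcript, exons)

print(mature)
print(introns, [has_canonical_sites(transcript, i) for i in introns])
```

In a real gene-finding setting the exon coordinates are the unknowns, so a sketch like this only shows how a candidate set of coordinates would be turned into a mature sequence and checked against the splice-site signals.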
Itshack Pe'er
1999-02-03