Next: The HCS Algorithm Up: cDNA Clustering Previous: Motivation

The Experimental Problem

Recall that a gene is transcribed into mRNA, which is then translated into a protein. To check which genes are expressed in a given tissue, we use cDNA, a reverse transcript of the mRNA that is more stable than the mRNA itself. There exist methods that enable us to extract cDNA in large quantities from the tissue, so that we can see, at a given moment, which cDNA molecules exist in the tissue (details are omitted). In fact, we sample cDNA molecules from the tissue. The more a gene is expressed, the more samples of its matching cDNA we will find.

The sample we have obtained contains about 100,000 cDNA fragments, each between 500 and 2,500 base pairs long, with an average of around 1,200. Reverse transcription of mRNA uses a poly-T primer that hybridizes to the poly-A tail of the mRNA. All cDNA fragments derived from the same gene will thus share a common start. Since reverse transcription may stop abruptly, the lengths of such fragments may vary. We can now formulate the problem we face:
\begin{problem}Determining gene expression. \\ *
{\bf INPUT:} Unsequenced cDNA...
... {\bf GOAL:} Find which genes are present, and in what abundance.
\end{problem}
The simple solution is to sequence all the cDNA fragments we have extracted from the tissue. This is both wasteful and slow. We have extracted a very large quantity of cDNA, and many fragments come from the same genes. Sequencing all of them will mean sequencing the same genes over and over again.
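To make the redundancy argument concrete, the following is a minimal sketch, not the method used in these notes. It assumes that each sampled fragment comes from a gene with probability proportional to that gene's expression level. The gene names, the expression values, and the helper functions `sample_fragments` and `redundancy` are all hypothetical and serve only as illustration.

```python
import random
from collections import Counter


def sample_fragments(expression, n_fragments, seed=0):
    """Draw gene labels for n_fragments cDNA fragments.

    Assumption: a fragment originates from a gene with probability
    proportional to that gene's expression level.
    """
    rng = random.Random(seed)
    genes = list(expression)
    weights = [expression[g] for g in genes]
    return rng.choices(genes, weights=weights, k=n_fragments)


def redundancy(fragments):
    """Fraction of fragments that repeat a gene already seen in the sample."""
    return 1 - len(set(fragments)) / len(fragments)


# Hypothetical expression levels in arbitrary units (not data from the text).
expression = {"geneA": 50.0, "geneB": 10.0, "geneC": 1.0}

fragments = sample_fragments(expression, n_fragments=1000)
counts = Counter(fragments)

# Fragment counts per gene act as an estimate of relative abundance.
for gene, count in counts.most_common():
    print(gene, count)

# Nearly every fragment repeats a gene that was already sampled.
print("redundant fraction:", redundancy(fragments))
```

Under these assumptions the per-gene counts follow the expression levels, and almost all fragments are repeats of a gene already seen. Sequencing every fragment would therefore spend most of its effort on genes that have already been identified.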
Peer Itsik
2001-01-31