Preface: CpG islands

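For each pair of symbols $s, t$, the transition probability is

$$a_{st} = P(x_i = t \mid x_{i-1} = s) \tag{6.1}$$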

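We assume that $X = (x_1, \ldots, x_L)$ is a random process with a memory of length 1, i.e., the value of the random variable $x_i$ depends only on its predecessor $x_{i-1}$:

$$P(x_i = s_i \mid x_1 = s_1, \ldots, x_{i-1} = s_{i-1}) = P(x_i = s_i \mid x_{i-1} = s_{i-1}) = a_{s_{i-1} s_i} \tag{6.2}$$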

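The probability of the whole sequence $X = (x_1, \ldots, x_L)$ is therefore the probability $p(x_1)$ of its first symbol times the product of the transition probabilities $a_{x_{i-1} x_i}$ along the sequence:

$$P(X) = p(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i} \tag{6.3}$$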

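We can also add fictitious begin and end symbols $x_0$ and $x_{L+1}$ to simplify the formula, setting $a_{x_0 s} = p(s)$, where $p(s)$ is the background probability of the symbol $s$:

$$P(X) = \prod_{i=1}^{L} a_{x_{i-1} x_i} \tag{6.4}$$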
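As a minimal sketch of how the sequence probability of a first-order Markov chain can be evaluated, the following Python function multiplies the initial probability by the transition probabilities along the sequence. It works in log space, since a product of many probabilities underflows quickly. The tables `INITIAL` and `TRANSITIONS` and the function name `log_probability` are hypothetical, and the numbers are illustrative values, not estimates from real data.

```python
import math

# Hypothetical background (initial) probabilities p(s); uniform for illustration.
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical transition probabilities a[s][t] = P(x_i = t | x_{i-1} = s).
# Each row sums to 1; the values are illustrative, not estimated from data.
TRANSITIONS = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
}


def log_probability(seq, initial=INITIAL, transitions=TRANSITIONS):
    """Return log P(X) = log p(x_1) + sum over i >= 2 of log a[x_{i-1}][x_i]."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    logp = math.log(initial[seq[0]])
    # Walk over consecutive pairs (x_{i-1}, x_i) and add each log transition.
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(transitions[prev][cur])
    return logp
```

For example, `log_probability("ACG")` adds the logarithms of $p(A)$, $a_{AC}$ and $a_{CG}$ from the tables above.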

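Let $a^+_{st}$ be the transition probabilities inside CpG islands and $a^-_{st}$ the transition probabilities outside them. The log-likelihood ratio score of a sequence $X = (x_1, \ldots, x_L)$ is

$$\mathrm{Score}(X) = \log \frac{P(X \mid \text{CpG island})}{P(X \mid \text{non CpG island})} = \sum_{i=1}^{L} \log \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}} \tag{6.5}$$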

The higher this score, the more likely it is that the sequence $X$ is a CpG island.
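A score of this kind can be sketched in Python as a sum of per-transition log ratios between an "inside island" model and an "outside island" model. The tables `PLUS` and `MINUS` and the function name `log_odds_score` are hypothetical, and the probabilities are illustrative values chosen so that the C to G transition is more frequent inside islands. The sketch also assumes that both models share the same initial distribution, so the term for the first symbol cancels and only transitions contribute.

```python
import math

# Hypothetical transition probabilities inside CpG islands (rows sum to 1).
PLUS = {
    "A": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
}

# Hypothetical transition probabilities outside CpG islands (rows sum to 1).
MINUS = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.25, "C": 0.25, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
}


def log_odds_score(seq, plus=PLUS, minus=MINUS):
    """Sum log(a_plus / a_minus) over consecutive symbol pairs (natural log).

    Assumes equal initial probabilities in both models, so the first
    symbol contributes nothing to the score.
    """
    return sum(
        math.log(plus[prev][cur] / minus[prev][cur])
        for prev, cur in zip(seq, seq[1:])
    )
```

With these illustrative tables, a CG-rich string such as `"CGCG"` gets a positive score and an AT-rich string such as `"ATAT"` gets a negative one.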

The main disadvantage of this algorithm is that we have no information about the lengths of the islands, while the algorithm suggested above assumes that those islands are at least $l$ nucleotides long, where $l$ is the window length. Should we use a value of $l$ which is too large, the CpG islands would be short substrings of our windows, and the score we give those windows may not be high enough. A better approach to such problems is described in the following section.
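The sensitivity to the window length can be illustrated with a small sketch that scores every window of a fixed length with a log-odds score. Everything here is an assumption made for illustration: the tables `PLUS` and `MINUS` hold made-up probabilities, `window_scores` and `window_len` (the window length $l$) are hypothetical names, and the test sequence is artificial.

```python
import math

# Hypothetical transition probabilities inside (PLUS) and outside (MINUS)
# CpG islands; illustrative values only, each row sums to 1.
PLUS = {
    "A": {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "G": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "T": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
}
MINUS = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
    "C": {"A": 0.3, "C": 0.3, "G": 0.1, "T": 0.3},
    "G": {"A": 0.25, "C": 0.25, "G": 0.3, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.3, "T": 0.3},
}


def window_scores(seq, window_len):
    """Return (start, score) for every window of length window_len in seq."""
    results = []
    for start in range(len(seq) - window_len + 1):
        window = seq[start:start + window_len]
        # Log-odds score of the window: sum over its consecutive symbol pairs.
        score = sum(
            math.log(PLUS[prev][cur] / MINUS[prev][cur])
            for prev, cur in zip(window, window[1:])
        )
        results.append((start, score))
    return results


# Artificial sequence: an 8-nucleotide CG-rich stretch inside AT-rich flanks.
SEQUENCE = "ATATATAT" + "CGCGCGCG" + "ATATATAT"

short_windows = window_scores(SEQUENCE, 8)
long_windows = window_scores(SEQUENCE, 24)
best_start, best_score = max(short_windows, key=lambda pair: pair[1])
```

With a window of 8 the highest-scoring window coincides with the CG-rich stretch and has a positive score, while the single window of length 24 is dominated by the flanks and scores negative, so the island would be missed.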