Sample input files for Ron Shamir's CS workshop 0368-3500-07 (Fall 2007-8)

Input #1:   input1.fasta
The first input file is small - 100 sequences of length 200 (the real input files will be much larger, of course).
It contains a pair of simple (string) motifs of length 6 with a significant order bias:
    motif A: ACCTTT   ,   motif B: GGGAAG
There should be 15 occurrences of the pair A->B (i.e., A upstream of B) with a gap of 10-20 bases between them,
plus 5 additional occurrences on the reverse-complement strand.
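
A minimal sketch of how the ordered-pair counts above could be checked. It is not part of the workshop material, and all function names are hypothetical. It assumes that the gap is the number of bases between the end of motif A and the start of motif B, and that a pair on the reverse-complement strand is an A->B pair found after reverse-complementing the whole sequence.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def reverse_complement(seq):
    """Reverse-complement a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_hits(seq, motif):
    """Return (start, end) of every hit of the motif, overlapping hits included."""
    # The lookahead allows overlapping hits; group 1 holds the hit itself.
    pattern = "(?=(" + motif + "))"
    return [(m.start(1), m.end(1)) for m in re.finditer(pattern, seq)]


def count_ordered_pairs(seq, motif_a, motif_b, min_gap, max_gap):
    """Count hit pairs with A upstream of B and min_gap..max_gap bases between them."""
    a_hits = motif_hits(seq, motif_a)
    b_hits = motif_hits(seq, motif_b)
    return sum(
        1
        for _, a_end in a_hits
        for b_start, _ in b_hits
        if min_gap <= b_start - a_end <= max_gap
    )


def count_both_strands(seq, motif_a, motif_b, min_gap, max_gap):
    """Return (plus_count, minus_count) of A->B pairs on the two strands."""
    plus = count_ordered_pairs(seq, motif_a, motif_b, min_gap, max_gap)
    minus = count_ordered_pairs(
        reverse_complement(seq), motif_a, motif_b, min_gap, max_gap
    )
    return plus, minus
```

To apply it to the first sample, one could open input1.fasta, pass the file handle to parse_fasta, and sum count_both_strands(seq, "ACCTTT", "GGGAAG", 10, 20) over all sequences. Chance occurrences in the background sequence may shift the totals slightly from the planted counts.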

Input #2:   input2.fasta
The second sample is larger - 2,000 sequences of length 1,000.
It contains several pairs of motifs of length 8 with an order bias:

  1. A pair of string motifs:     motif A: TAAAAAAT   ,   motif B: CCCCGGGG
    This pair appears on both strands with a gap of length 20-50.
    There are 54 occurrences of the pair A->B with the above gap lengths, and 13 occurrences of the reverse order B->A.
  2. A pair of consensus motifs:     motif A: TA[ACT]AA[AG]AT   ,   motif B: GGAA[AT]TTT
    This pair appears only on the "+" (i.e., original) strand with a gap of length 10-30.
    There are 160 occurrences of the pair A->B.
    This pair is also localized (i.e., its hits aren't distributed uniformly along the sequences).
  3. Another pair of consensus motifs:     motif A: GAGA[CG][AT]CC   ,   motif B: CTATACC[CG]
    This pair appears on both strands with a gap of length 40-45 (the sequence of the gap is quite conserved).
    There are roughly 370 occurrences of the pair A->B.
    A third motif (AACGTTCC) appears 60-80 bases upstream of this pair (only on the "+" strand).
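
The bracket notation used for the consensus motifs above (e.g., [ACT] for "A, C or T") happens to be valid regular-expression syntax, so such a motif can be passed directly to a regex engine. The following is a minimal sketch, with hypothetical function names, of reverse-complementing a consensus motif so that the "-" strand can be scanned without reverse-complementing the sequences themselves. It assumes that motifs contain only the letters A, C, G, T and bracketed sets of them.

```python
import re

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def tokenize_consensus(pattern):
    """Split a motif such as 'TA[ACT]AA' into one set of allowed bases per position."""
    tokens = re.findall(r"\[([ACGT]+)\]|([ACGT])", pattern)
    return [set(bracketed or single) for bracketed, single in tokens]


def format_consensus(positions):
    """Write per-position base sets back in the bracket notation."""
    parts = []
    for bases in positions:
        if len(bases) == 1:
            parts.append(next(iter(bases)))
        else:
            parts.append("[" + "".join(sorted(bases)) + "]")
    return "".join(parts)


def reverse_complement_consensus(pattern):
    """Return the consensus motif that matches the reverse-complement strand."""
    positions = tokenize_consensus(pattern)
    flipped = [{COMPLEMENT[b] for b in bases} for bases in reversed(positions)]
    return format_consensus(flipped)
```

For example, reverse_complement_consensus("TA[ACT]AA[AG]AT") gives "AT[CT]TT[AGT]TA", which can then be searched on the "+" strand with re.finditer to locate "-" strand hits of motif A.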