Overview
========

CrisprDetector is a Python program that identifies CRISPR repeats from raw sequence data. The input must be in FASTA format.

The available code is a proof of concept (POC) of the CRISPR detection algorithm for raw short-read sequence data. In particular, it includes neither the Turtle tool for efficient k-mer counting nor the heuristics for filtering metagenomic data sets. Information on k-mers that appear multiple times in the same read is also discarded in this implementation.

Execution Instructions
======================

Unzip the CrisprDetector zip file.
Run the CrisprDetector.py file using the following command (the code is written for Python 2.X):

    python CrisprDetector.py INPUT-FILE-PATH [options]

using the following notation:

    INPUT-FILE-PATH    path to the FASTA input file

Options may include the following parameters:

    -k     length of the k-mer used (default is 23)
    -t     threshold for frequent k-mers (default is 45)
    -o     minimum overlap required between reads, as a fraction of the read length (default is 0.3)
    -sd    length of read boundaries that may contain k-mers in spacer-edge overlaps, as a fraction of the read length (default is 0.4)
    -n     minimum number of spacers in an array (default is 2)
    -r     minimum length of a repeat (default is 21)
    -R     maximum length of a repeat (default is 60)
    -s     minimum length of a spacer (default is 15)
    -S     maximum length of a spacer (default is 75)
    -bn    upper bound on the number of isolated nodes to sample from the overlap graph (default is 50)
    -be    upper bound on the number of edges to sample for each node in the overlap graph (default is 250)

For example:

    python CrisprDetector.py /home/machine/crispr/SRR123456.fasta
    python CrisprDetector.py /home/machine/crispr/SRR123456.fasta -k 22 -t 40

A list of additional constants is documented in the code.

Output
======

Results are printed to the screen and written to an auto-generated log file ("log-crispr.txt").
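
Example: Frequent k-mer Counting (Illustrative)
===============================================

To make the roles of the -k and -t options more concrete, the following is a minimal sketch of how frequent k-mers might be collected from a set of reads. It is not taken from CrisprDetector.py. The function names read_fasta and frequent_kmers are hypothetical, the "at least t reads" comparison is an assumption, and counting each k-mer once per read is only one possible reading of the note in the Overview about repeated k-mers within a read.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def frequent_kmers(reads, k=23, threshold=45):
    """Return a dict of k-mers that occur in at least `threshold` reads.

    Assumption: a k-mer is counted once per read, so repeated
    occurrences inside a single read do not raise its count.
    """
    counts = Counter()
    for read in reads:
        # The set removes duplicate k-mers within the same read.
        kmers_in_read = set(read[i:i + k] for i in range(len(read) - k + 1))
        counts.update(kmers_in_read)
    return dict((kmer, n) for kmer, n in counts.items() if n >= threshold)
```

With an open FASTA file handle f, the two helpers could be combined as frequent_kmers(seq for _, seq in read_fasta(f)). The sketch keeps all counts in memory, which is why a dedicated counter such as Turtle would be preferable for large data sets.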