Tel-Aviv University

School of Computer Science

Spring 2009


Computational Genomics

  0382.3102

http://www.cs.tau.ac.il/~bchor/CG06/comp-genom.html

Instructor: Prof. Benny Chor

Teaching Assistant:  Dr. Igor Ulitsky

Lectures: Wednsdays 9:00-12:00, Schreiber 008

Recitations: Thursdays 12:00-13:00, Schreiber 006

 


Contact Info:
 

 

 

      Email

Phone

Office

OfficeHours

Instructor

Prof. Benny Chor

benny AT cs.tau.ac.il

640-5977

Schreiber  223

     e-appointment 

Teaching Assistant

Dr. Igor Ulitsky

Ulitskyi AT gmail.com

640-5394

Schreiber  011

     e-appointment 

 

New Course Bulletin Board

Date posted

Contents

March 22
Assignment 1 has been posted (see link below)
April 11
Assignment 2 has been posted (see link below)
May 19
Assignment 3 has been posted (see link below)

 

Course Outline

The Human Genome Project has completed the sequencing of the human DNA.  Other genome projects of various organisms (bacteria, fungi, worms, rice, wheat, flies, chicken, mice and primates) are either completed or in the process of advanced sequencing.  New bio-technologies produce huge amounts of biological data, whose manual analysis not possible.  These advances have pushed for a new scientific discipline of bioinformatics, which integrates computer science, statistics, mathematics, and various biological fields.  In this introductory course we concentrate on the algorithmic aspects of various computational tasks with biological origins. We will explore their computational complexity, and ways to solve them even if they are computationally intractable, such as approximation algorithms and heuristics.

Course topics include comparisons and alignments of biological sequences; Suffix trees and their uses for analyzing very long sequences (like complete genomes); Reconstruction of evolutionary trees using various optimization criteria, such as maximum parsimony, maximum likelihood, and distance criteria; Physical mapping of genomes; Genome alignment; Probabilistic models, such as PSSM (Position Specific Scoring Matrix) and HMM (Hidden Markov Models); HMM algorithms (Viterby, Expectation Maximization); Biological applications of the probabilistic models (e.g. gene prediction, protein classification, and sequence alignment); Analyzing biological networks; DNA microarrays and their uses; Clustering gene expression data; Classification using SVM (Support Vector Machines), and applications to clinical diagnosis; Metagenomics.

 

  

 

 

Tantative Course Plan (subject to changes):

Lecture

Date

 Subject

 Reference(s) 

Recitation

1

4/3/09

 Introduction and administrativia.
Pairwise Sequence Alignment: global & local.

 Kanehisa, ch. 3 
 Gusfield, ch. 11.

 Durbin et. al., ch. 2.

 

  no recitation this week

2

11/3

Similarity and distance; Affine gap penalties; Fasta and Blast heuristics.

Durbin et. al., ch. 2.

 Gusfield, ch. 11

More pairwise sequence allignments
3
18/3

Scoring functions: PAM and BLOSUM

Multiple sequence alignment

Durbin et. al., 2.7

Gusfield, ch. 11,15

MSA

 

4
25/3

Suffix trees - Introduction (Dan Geiger pdf)

Suffix trees and arrays - (Fernndez-Baca ppt)

Phylogenetic trees based on whole genomes

Gusfield, ch. 5,6,7 

Testing multiple hypotheses

Suffix trees

5

1/4

(no kidding)

Phylogenetic trees reconstruction:

Distance methods

Maximum parsimony

Durbin et. al.,7.1, 7.2, 7.4

Gusfield, ch. 17 

ML and AML in phylogenetic trees
6
21/4

Maximum parsimony cont., including proof of hardness.

Maximum likelihood (1)

Maximum likelihood (2) - a status report

 

Durbin et. al., ch. 7 and 8

Gusfield, ch. 17 

 

PSSMs

Gene Finding

7
6/5

1) Introduction to discrete Markov Models

2) Hidden Markov models (HMM): the three problems

3) Baum's Forward-Backwards Algorithm (board presentation, following Rabiner's tutorial).

Rabiner's Tutorial

Survey from Univ. of Leeds

Durbin et al. , ch. 3

 

HMMs for gene finding

8
13/5

Viterbi's algorithm (board presentation, following Rabiner's tutorial)..

Using EM to optimize HMM parameters (board presentation, following Rabiner's tutorial).

Determining protein families by domains: PSSMs, HMMs, and the Pfam database

(see ppt from Leuven)

Dan Geiger (Technion) EM slides

Bilmes' gentle tutorial to the EM algorithm

Durbin et al., ch. 3

 

EM (expectation maximization) and ML

(maximum likelihood)

9
20/5

Random graphs basics:

Introduction to biological networks

No recitation: Student's day
10
27/5
Intro to DNA microarrays

 

No recitation: Shavuot vacation

11
3/6
More microarray analysis: Clustering; Diagnosis.

 

Differential expression

12
10/6

Pluto's intro to machine learning.

Differential gene expression and class discovery (based on a presentation of Dr. Zohar Yakhini)

Linear separability of gene expression pairs (based on joint work with (Giora Unger)

Pluto's intro to machine learning (3.5GB file): Based on ch. 1 from Schlkopf and Smola's book
13
17/6

DNA sequencing

Pizza, pasta, and beer

 

Prerequisites

The course is officially opened to 3rd year students of the bioinformatics track, to 3rd year and graduate Computer Science students. The algorithms course (formerly efficiency of computations) is a prerequisite, and the computability course (computational

models) should at least be taken concur8rently with this course. Students from other disciplines are welcome to take the course, but are encouraged to first contact the instructor or teaching assistant. Some background in algorithms, probability, and programming is required.(In the previous times this course was given by this instructor, a number of life science students completed it successfully).
 
 
Administratrivia

 

Following last years’ successful experience, in addition to a 3 hours weekly lecture, there will also be a 1 hour weekly recitation, given by the teaching assistant. Attending this recitation is absolutely optional, but is strongly encouraged.

BTW this is also the case regarding the lectures: If you can understand the material, answer the home assignments (on your own) and pass the exam in flying colors without attending classes, then surely windsurfing or boogy-boarding is a better use of your time.)

Grades in the course will be based on five-six problem sets (20-30% of total grade) and a final exam (70-80%). Intelligent participation in class may yield a small (unspecified, but surely non-negative) bonus. 

Homework submission in singles or pairs (no triplets, quartets, quintets etc.). There will be a total of 3-5 problem sets, involving both "dry" assignments and "wet" ones. The wet parts will require understanding and running existing software, but not writing any code. Homework should be done and written independently. External sources (books, journal articles, web pages) can be used but should be clearly quoted. Research done at the Technion has shown a significant correlation between independent work and final performance.

Moed A from 2005/6 can be found here 

 

Problem Sets

Assignment 1  Posted March 22nd. Due April 5th. 

Assignment 2   Posted April 11th. Due May 1st (except wet problem, due later). 

Assignment 3   Posted May 19th. Due June 7th. 

Assignment 4  

 

Books

Biological Sequence Analysis, R. Durbin et al. , Cambridge University Press, 1998.

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, D. Gusfield, Cambridge University Press, 1997.

Post-genome Informatics, M. Kanehisa , Oxford University Press, 2000.

Introduction to Computational Molecular Biology, J. Meidanis and J. Setubal, Brooks/Cole Pub Co., 1997.

Computational Molecular Biology : An Algorithmic Approach, P. Pevzner, MIT Press, 2000.

Introduction to Bioinformatics Algorithims, Neil C. Jones, Pavel A. Pevzner, MIT Press, 2004.

 Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Bernhard Schlkopf and Alex Smola, MIT Press, 2002.    Available on-line: Preface, Section 1.1, Section 1.2, and more.

Biology Background Books

The Cartoon Guide to Genetics, by Larry Gonick and Mark Wheelis, Harper Perennial, 1991 (I have not read this book, but I was told it is gives a good popular overview. Some students complained the book spends way too much time before getting down to relevant business).


 

Other On-line Courses

 

Dan Geiger course: Algorithms in Computational Biology , Technion, Spring 2008.

Lior Pachter course: Algebraic Statistics for Computational Biology, UC Berkeley, Fall 2004.
Ron Shamir course notes: Computational Genomics,
Tel-Aviv University, Fall 2003/4

Nir Friedman's course slides: Computational Molecular Biology the Hebrew University, Jerusalem, Fall 1999.
Ruzzo and Tompa's course notes: Algorithms in Molecular Biology University of Washington, Seattle

 

Other Interesting Resources (not directly course related)
 

    Mother Jones: Alternative views on some world affairs

    The Ig Nobel Prizes

    Snails are faster than ADSL  (see also here, and pay special attention to the apparatus)