Tel-Aviv University

School of Computer Science

Fall 2005-2006


Computational Genomics

  0382.3102

http://www.cs.tau.ac.il/~bchor/CG06/comp-genom.html

Benny Chor , Eytan Ruppin, and Tomer Shlomi

Lectures: Wednsdays 11:00-14:00, Kaploon 324

Recitations: Thursdays 14:00-15:00, Interdisciplinary Building

 


Contact Info:
 

 

 

      Email

Phone

Office

OfficeHours

Instructor

Prof. Benny Chor

benny AT cs.tau.ac.il

640-5977

Schreiber  223

     e-appointment 

Instructor

Prof. Eytan Ruppin

ruppin AT cs.tau.ac.il

640-6528

Schreiber  118

     e-appointment 

Teaching Assistant

Mr. Tomer Shlomi

shlomito AT post.tau.ac.il

640-5369

Schreiber  19מ

     e-appointment 

 

Course Outline

The Human Genome Project has completed the sequencing of the human DNA.  Other genome projects of various organisms (bacteria, fungi, worms, rice, wheat, flies, chicken, mice and primates) are either completed or in the process of advanced sequencing.  New bio-technologies produce huge amounts of biological data, whose manual analysis not possible.  These advances have pushed for a new scientific discipline of bioinformatics, which integrates computer science, statistics, mathematics, and various biological fields.  In this introductory course we concentrate on the algorithmic aspects of various computational tasks with biological origins. We will explore their computational complexity, and ways to solve them even if they are computationally intractable, such as approximation algorithms and heuristics.

Course topics include comparisons and alignments of biological sequences; Suffix trees and their uses for analyzing very long sequences (like complete genomes); Reconstruction of evolutionary trees using various optimization criteria, such as maximum parsimony, maximum likelihood, and distance criteria; Physical mapping of genomes; Genome alignment; Probabilistic models, such as PSSM (Position Specific Scoring Matrix) and HMM(Hidden Markov Models); HMM algorithms (Viterby, Expectation Maximization);Biological applications of the probabilistic models (e.g. gene prediction, protein classification, and sequence alignment); Analyzing biological networks; DNA microarrays and their uses; Clustering gene expression data; Classification using SVM (Support Vector Machines), and applications to clinical diagnosis.

 

Final exam: Closed book except for 4 double-sided pages of your choice.   

 

 

Tantative Course Plan (subject to changes):

Lecture

Date

 Subject

 Reference(s) 

Recitation

1

2/11/05

 Introduction and administrativia.
Pairwise Sequence Alignment: global & local.

 Kanehisa, ch. 3 
 Gusfield, ch. 11.

Durbin et. al., ch. 2.

 

        ______

2

9/11

Pairwise Sequence Alignment: Linear space algorithm; FASTA and BLAST heuristics;

PAM and BLOSSUM

Durbin et. al., 2.5, 2.7

Gusfield, ch. 11

 

Gap Penalties

3

16/11

Multiple sequence alignment.

Genome Alignment (no slides)

Introduction to suffix trees and suffix arrays: Dan Geiger (pdf), Fernndez-Baca slides (ppt).

 Gusfield, ch. 15 
 

Gusfield, ch. 5,6,7 

 

RNA Structure Analysis

4

23/11

Phylogenetic trees based on whole genomes.

Intro to phylogenetic trees reconstruction.

Durbin et al., 7.1, 7.2

 Gusfield, ch. 17.

Multiple Sequence Alignment: Family Representations and
Progressive Alignment Methods

 

5

30/11

Reconstructing phylogenetic (evolutionary) trees:

Neighbour joining.

Maximum parsimony (MP), big and small. Fitch algorithm

 

    

      Cancelled

6

7/12

Maximum likelihood estimate (MLE); ML inference of trees.

ML and AML properties. Large, small, and tiny versions. Hardness results.

 

 Solutions to assignment 1

(recitation in Schreiber 309)

7

14/12

1) Maximum likelihood (ML and AML): Felsenstein "tiny ML" algorithm, Pupko et. al "tiny AML" algorithm.

2) PSSM: A Simple model for sequence motives.

3) Introduction to discrete Markov Models.

 

Weighted parsimony.

Factor 2 approximation to parsimony.

TSP: Approximation and simulated annealing heuristic.

8

21/12

Hidden Markov models (HMM): the three problems

Baum's Forward-Backwards Algorithm; Viterbi's algorithm.

Rabiner's Tutorial

Survey from Univ. of Leeds

Durbin et al. , ch. 3

ML vs. MP.

ML examples.

9

            28/12 

Hidden Markov models (HMM): Parameter estimation using the EM algorithm.

ML vs MAP (maximum apostriori probability);

HMMs for gene predictions.

Durbin et al. , ch. 3 and 4

Profile HMM to identify protein families.

10

    4/1/2006   

 Introduction to biological networks 1

 

Solutions: Assignment 2

 

        11

    11/1

Introduction to metabolic networks 2

 

Pair HMM

         12

      19/1

 Intro to DNA chips and an animation;

 

 

 Network Motives

         

         13

 

      26/1

Pluto's Introduction to Machine Learning.

Support vector machines and kernels.

SVMs for string problems and protein classification.

Disease diagnosis by SVM on gene expression data.

Linear separability.

Schlkopf-Smola,1.1,1.2

 Solutions: Assignment 3.

 

 Linear programming.

         14

      3/2

.

 

Exam examples

 

Prerequisites

The course is "officially'' opened to 3rd year students of the bioinformatics track, to 3rd year and graduate Computer Science students. The algorithms course (“efficiency of computations”) is a prerequisite, and the computability course (“computational models”) should at least be taken concurrently with this course. Students from other disciplines are welcome to take the course, but are encouraged to first contact the instructor or teaching assistant. Some background in algorithms, probability, and programming is required.(In the previous times this course was given by this instructor, a number of life science students completed it successfully).
 
 
Administratrivia

 

  • Following last year's successful experience, in addition to a 3 hours weekly lecture, there will also be a 1 hour weekly recitation, given by the teaching assistant. Attending this recitation is absolutely optional, but is strongly encouraged.

BTW this is also the case regarding the lectures: If you can understand the material, answer the home assignments (on your own) and pass the exam in flying colors without attending classes, then surely windsurfing or boogy-boarding is a better use of your time.)

  • Grades in the course will be based on five-six problem sets (20-25% of total grade) and a final exam (75-80%). Intelligent participation in class may yield a small (unspecified, but surely non-negative) bonus. 

      Homework submission in singles or pairs (no triplets, quartets, quintets etc.). There will be a total of 5-6 problem sets, involving both "dry" assignments and "wet" ones. The wet parts will require understanding and running existing software, but not writing any code. Homework should be done independently. External sources (books, journal articles, web pages) can be used but should be clearly quoted. Research done at the Technion has shown a significant correlation between independent work and final performance.

 

Problem Sets

General Submission Guidelines

 

Assignment 1   (due 24/11/05)

Assignment 2   (due 22/12/05)

Assignment 3   (due 25/11/06)

Assignment 4   (not there yet)

 

Books

  • Biological Sequence Analysis, R. Durbin et al. , Cambridge University Press, 1998.
  • Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, D. Gusfield, Cambridge University Press, 1997.
  • Post-genome Informatics, M. Kanehisa , Oxford University Press, 2000.
  • Introduction to Computational Molecular Biology, J. Meidanis and J. Setubal, Brooks/Cole Pub Co., 1997.
  • Computational Molecular Biology : An Algorithmic Approach, P. Pevzner, MIT Press, 2000.
  • Introduction to Bioinformatics Algorithims, Neil C. Jones, Pavel A. Pevzner, MIT Press, 2004.
  •  Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Bernhard Schlkopf and Alex Smola, MIT Press, 2002.    Available on-line: Preface, Section 1.1, Section 1.2, and more.

Biology Background Books

  • The Cartoon Guide to Genetics, by Larry Gonick and Mark Wheelis, Harper Perennial, 1991 (I have not read this book, but I was told it is gives a good popular overview. Some students complained the book spends way too much time before getting down to relevant business).


 

Other On-line Courses

Lior Pachter course: Algebraic Statistics for Computational Biology, UC Berkeley, Fall 2004.
Ron Shamir course notes: Computational Genomics,
Tel-Aviv UniversityFall2003/4

Martin Tompa's course notes: Computational Biology, (CSE 527), University of Washington, Seattle, Winter 2000.

Nir Friedman's course slides: Computational Molecular Biology the Hebrew University, Jerusalem, Fall 1999.
Karp, Ruzzo and Tompa's course notes: Algorithms in Molecular Biology University of Washington, Seattle, Winter 1998. 

 

Other Interesting Resources (not course related)