Workshop in Bioinformatics

(0368-3500-34)

Instructor :  Prof. Benny Chor (benny@cs.tau.ac.il), Schreiber, room 223.
Teaching Assistants : Tamir Tuller (tamirtul@post.tau.ac.il), Schreiber, room M20 (basement),                                                                                             and Giora Unger (giorau@post.tau.ac.il), Schreiber, room 012.
Where : Schreiber, Room 008.
When : Mondays 11:00-13:00, Spring 2004.

Students of the bioinformatics track will have higher priority in registering to this workshop.




 

Goals & scope

  1. Analysis, design and implementation of combinatorial optimization algorithms with bioinformatics relevance.
  2. Some contemporary problems in comparative genomics, DNA chips analysis, phylogenetic analysis, and more.
  3. Getting acquainted with publicly available Bioinformatics databases and using them.
  4. Conducting supervised research in computational biology.
  5. Efficient implementation of algorithms in C, C++, Java or Matlab.

The workshop will provide hands on experience in implementing various optimization techniques, developing statistical tools for analyzing biological data, and using public biological databases. Most (but not all)  projects have both research and implementation aspects. We offer projects in various current subjects in computational biology, including analysis of gene expression data and DNA microarrays, analysis of gene structure, understanding genomic pathways, construction of phylogenetic trees and networks, etc.

The workshop also includes four lectures on various topics in software engineering, given by the Computer Science system staff. These will include subjects relevant to ant medium to large scale software project. Students are expected to utilize these tools and techniques in the final projects and documentation. Attending these four lectures is mandatory.

An integral part of the workshop is presenting an outline of the project, as well as the completed project,
to all participants (not just the course staff). It is mandatory to physically attend class during these entire parts of the workshop,
which are expected to take place on March 29th (outlines presentation) and during the last three meetings (final projects presentation). 
Naturally, your projects should be completed by then, and submitted no later than the last day of the semester.

Projects will be performed in groups consisting of at most two students (if the number of students is odd, one triplet may be allowed).

In the first two weeks, each group must choose two preferred projects (a blue one and a red one) and send them to Giora (giorau@post.tau.ac.il).
We will try to match choices with availability, but in case of collisions, assignments will be based on the time at which the request reaches Giora's mailbox.


Tentative Timetable  (subject to changes)

 

March 1

Lecture: Introduction to BioInformatics. Administratrivia.

March 8

Lecture:  Projects' Overview

March 15

Project selection completed.

March 15

Lecture on CVS by Alon Shalita from the CS system staff  (slides, summery)

March 22

Lecture on parallel processing and Condor by Edward Aronovich from the CS system staff (slides)

March 29

Outlines of all planned projects presented by students in class   (10 minutes per group)

April  19

Lecture on databases and MySQL by Alon Shalita from the CS system staff

 May 5

 Lecture on databases, MySQL, and packaging by Alon Shalita from the CS system staff

 May 31

 Highly recommended (free, registration required): Annual Israeli BioInformatics day in the Crowne Plaza Hotel, Jerusalem

June 7

Presentation of completed projects in class   (20 minutes per group). Will  take place 11-13 and 14-16.

June 11

Last date for projects submission, including a detailed presentation of the projects to the course staff.

 

Assigned Projects

Giora Unger

 

 

Yaara Azaria, Maya Mograbi, Daniela Raijman

Extending linear separability tests from pairs to triplets and quadruples,
and allowing violations

Eran Ophir, Amir Segall

 MAD MEX - motif finding

Udi Altshuler, Elad Mazor

Extending linear separability tests from SVM to additional classifiers

Tamir Tuller

 

Tal Peled, Uri Zonnens

Simultaneous identification of duplication and lateral transfer

Igor Ulitsky, Dudu Burstein

A phylogenetic tree construction algorithm for complete genomes

Tal Tamir, Lior Gad

Gene prediction by spectral rotation measure and LZW compression algorithm

 Erez Makabi, Eyal Megran

 Constructing Bayes nets

 

Suggested Workshop Projects (List Subject to Change)





  1. Phylogenetic Trees and Networks

1.       Simultaneous identification of duplication and lateral transfer

Overview
This project is based on recent work by M.Hallet, J.Lagergren and A.Tofigh. The work introduce a combinatorial model that incorporate duplication events as well as lateral gene transfer event. The goal is to explain the deference between a gene tree an a species tree. In the project you will implement two algorithms from the article and apply it on synthetic and biologic data.



References

M.Hallet, J.Lagergren and A.Tofigh, Simultaneous identification of duplications and lateral transfer, 2004.


What to do

1. Reading and understanding the article of  M.Hallet, J.Lagergren and A.Tofigh.
2. Implementation of two algorithms from the article.
3. Applying the methods on synthetic and biologic data.

 

2.       Using Hadamard transform and the LogDeterminant for construction of phylogenetic trees.

Overview

The spectral analysis of sequences and distances data is Hadamard transform based method for phylogenetic tree construction. The LogDeterminant transform provides distances which allows the correct tree to be recovered consistency when sequences differ markedly in nucleotide frequencies. In this project you will implement and analyze those two methods for phylogenetic tree construction.

               References

 

              Steel,M.., Lockhart,P.J. and Penny,D. (1993) Confidence in evolutionary trees from biological sequence data. Nature, 364, 440–442.

M.D.Hendy. Spectral analysis of phylogenetic data. 1993.

M.D.Hendy, D.Penny, M.A.Steel. A discrete fourier analysis for evolutionary trees. 1993.

 

Additional (optional) References

              Lockhart,P.J., Steel,M.A., Hendy,M.D. and Penny,D. Recovering evolutionary trees under a more realistic model of sequence

              evolution. Mol. Biol. Evol., 11, 605–612, 1994

Steel,M., Huson,D. and Lockhart,P.J. (2000) Invariable sites models and their use in phylogeny reconstruction. Syst. Biol., 49, 225–232

D.Penny, M.Hasegawa, P.J.Waddel, M.D.Hendy. Mammalian evolution: timing and implications from using the LogDeterminant Transform for proteins of differing amino acid composition. 1999.


What to do

1. Reading and understanding the references
2. Implementation of two methods for reconstruction of phylogenetic tree, partial code will be given by Tamir.
3. Applying the methods on synthetic and biologic data, analyzing the results and checking the number of maximum points reached by those methods on biological data.

 

 

 

 

3.       A phylogenetic tree construction algorithm for complete genomes.

Overview
In this project you will implement an new LZW based algorithm for generating a phylogenetic tree, this algorithm is different from other known method in two main point: it doesn’t restricted to any nucleotide - substitution model and it can be applied to complete genomes.

References

Http://www.Data-compression.com/vq.html.

More material will be given by  Tamir.


What to do

1. Implementation of some LZW based algorithms for finding distances between genomes.
2. Using Matlab or any other public software for generating a phylogenetic tree from the distances matrix found in stage 1.
3. Download complete genomes from NCBI, applying and checking the method on those genomes.
4. More precise details will be given by Tamir.
.

 

  1. Gene Expression Data and DNA Microarrays
    1. DNA microarray data normalization tool

      Overview
      Gene expression data extracted from DNA microarrays undergo various manipulations, normalizations etc., before being used by algorithmic tools. There are numerous approaches and methods for normalizing such data, depending on the type of microarray (e.g. cDNA or Affymetrix chips) and other factors. In this project, you’ll write a platform that would facilitate several options for normalizing already-processed gene expression data.

      References
      O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520-525, 2001.

      S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87, 2002.

      M. Dettling and P. Buhlmann. Boosting for tumor classification with gene expression data. Bioinformatics, 19(9):1061-9, Jun 2003.

      Giora Unger's M.Sc. thesis dissertation – “Linear Separability and Classifiability of Gene Expression Datasets”.

      What to do
      1. Implement a normalization platform, that would support:
          a. Imputation of missing data for cDNA datasets, using the kNN (k-Nearest-Neighbours) method.
             This method is presented in Troyanskaya et. al, is explained clearly in Dudoit et. al (section 3.2), and was used also in Dettling & Buhlmann.
          b. Normalizing Affymetrix datasets, in a manner similar to that described in Dettling & Buhlmann (for the “Leukemia” and
              “Estrogen and Nodal” datasets) and in Giora Unger’s M.Sc. thesis (appendix B). A skeleton code for that can be provided  by Giora.
      2. Support convenient user interface
      3. Demonstrate the operation of the platform on several publicly available datasets (to be defined later with Giora)
      4. Bonus: implement additional methods for missing data imputation, as described in Troyanskaya et. al.
       
       
    2. Extending linear separability tests from SVM to additional classifiers (AdaBoost, naive Bayes, geometric Bayes etc.)
      Overview
      One of the most promising (and challenging) applications of gene expression data is classification of cancer tumors. In this project you’ll extend an existing classification tool, based on pairs of genes that exhibit certain unexpected geometric characteristics.

      References
      Giora Unger's M.Sc. thesis dissertation – “Linear Separability and Classifiability of Gene Expression Datasets”.

      What to do
      1. Read Giora’s M.Sc. thesis. Note, that basically 3 classifiers are used there – SVM, simple majority and weighted voting.
      2. Study standard classification algorithms – AdaBoost, NaiveBayes, Geometric Bayes, Distance from Average and another one of your choice.
      3. Integrate these classifiers into the code (to be provided be Giora).
      4. Demonstrate the results.

      Note: using existing, ready made code for the classifiers is highly encouraged, so that most of the work would be integrating them into the existing platform.
       
       
    3. Extending linear separability tests from pairs to triplets and quadruples.
      Overview
      As briefly introduced in class, and detailed in Giora’s M.Sc. thesis, in many gene expression datasets an interesting geometric phenomenon was discovered: Many pairs of genes exhibit linear separability. In this project, you’ll extend the separability test carried out there to testing larger groups of genes – triplets and quadruples.

      References
      Giora's M.Sc. thesis dissertation – “Linear Separability and Classifiability of Gene Expression Datasets”.

      What to do
      1. Read Giora’s M.Sc. thesis. Note, that basically 3 classifiers are used there – SVM, simple majority and weighted voting.
      2. Understand the algorithm and the relevant code (to be provided be Giora).
      3. Implement the test for triplets and quadruples.
      4. More precise details – to be finalized with Giora at a later stage.



    4. A tool for adjoining and unifying gene expression datasets
      Overview
      One of the prominent problems with current gene expression datasets is that different laboratories generate data in different formats. It is often desired to “merge” several datasets together, thus creating one larger dataset. In this project you’ll write a tool for adjoining and unifying gene expression datasets. In addition, automatic tools for generating annotations based on genes’ IDs are sorely missing, making it much harder to infer biological data from algorithmic outputs.

      References
      CAMDA 2003 contest web page - http://www.camda.duke.edu/camda03/datasets/

      What to do
      1. Write a tool for unifying datasets.
      2. Write a tool for generating a well-known IDs for a dataset (e.g.: accession number and locus-link), based on
          Affymetrix probe IDs – to be elaborated by Giora at a later stage.
      3. Download the four datasets from the CAMDA 2003 site.
      4. Unify these four datasets (possibly only three of them – to be discussed with Giora at a later stage)
      5. Demonstrate the ID extraction tool on several datasets (to be provided by Giora)





  1. Analysis of Gene Structure, Introns and Exons
    1. Gene prediction by spectral rotation measure and LZW compression algorithm.

Overview
One of the major problems in bioinforamtics is gene finding. In this project you will write a gene prediction software. Your algorithm will combine two methods, one is based on FFT and the other is based on the LZW compression algorithm.


References

D.Kotlar, Y.Lavner. Gene prediction by spectral rotation (SR) measure : a new method for identifying protein-coding regions, Genome Research 13:1930-1937, 2003

Http://www.Data-compression.com/vq.html.


What to do
 

1. Implement an LZW based algorithm for gene prediction.
2. Implement a combined LZW/FFT based algorithm for gene prediction by using neuron network/SVD or any other learning method (after discussing with the Tamir), the code for the FFT method will be given by the instructor.
3. Checking the performance of the method on several genomes and compare it to other methods.

  1.  Understanding Genomic Pathways
    1. Efficient implementation of data structures and algorithms for Bayesian Networks

Overview
DNA hybridization array simultaneously measure expression level for thousands of genes. This measurements provide a “snapshot” of the transciption level within the cell, one of the methods for discovering interactions between genes base on multiple expression measurements is based on Bayesian network, which is a graph based model of joint multivariate probability distribution that capture properties of conditional dependence between variables. In this project you will write a code for learning a Bayesian network according to a known algorithm and you’ll check it on synthetic and real data

Remark: If needed, this project can be expanded to two projects, so that two groups may work (independently) on them.


References

D. Margarities and S. Thrun. Bayesian network induction via local Neighborhoods, 1999.

N.Friedman, M.Linial, I.Nachman, D.pe’er. Using Bayesian network for analyze expression data, 2000.


What to do

 1. Write a code for a known Bayesian network inferring algorithm.
2. Test the performances of the algorithm on synthetic data, which you will sample from a known Bayesian network, and on real biological data.
3. Improving the performance of the algorithm on biological data (for more details please contact Tamir).




 

    1. Tools for inferring pathways based on gene expression data
      Overview
      One of the key issues in computational biology in understanding complex biological systems using computational tools. A prominent example is inferring biological pathways based on various types of data, gene expression data being one of them. In this project you'll develop several tools for testing the validity of a given function, in light of data (input and output), for assessing the quality of the function and for expnading and improving it, and so on.

      References
      To be provided later.

      What to do

      This project will require larger amounts of research, and might be split into two projects, possibly of two groups working separately but interacting. The precise tasks to be implemented would be defined later.