(0368-3500-34)

Teaching Assistant : Amos Tanay, Computational Genomics Lab, Schrieber 011.

Where : Shenkar (Physics), Rm 114

When : Mondays 11:00-13:00, Spring 2003.

Mailing list at http://listserv.tau.ac.il/archives/cs0368-3500-32.html

The workshop will provide hands on experience in implementing
various optimization techniques

and applying them to combinatorial optimization problems with computational biology origins.

Many real world problems are NP hard or even hard to approximate, so often one employs heuristics

with no mathematical guarantee of performance, that "behave well" on practical problem instances.

Computational Biology offer some of the hottest (and most difficult) optimization problem in science today,

frequently involving huge solution spaces and multi goals. We shall learn and then apply some of the best

heuristics to study questions in areas such as analyzing gene expression data sets from DNA chips, phylogenetic

tree reconstruction, multiple sequence alignment, gene finding, RNA secondary structure prediction, and others.

and applying them to combinatorial optimization problems with computational biology origins.

Many real world problems are NP hard or even hard to approximate, so often one employs heuristics

with no mathematical guarantee of performance, that "behave well" on practical problem instances.

Computational Biology offer some of the hottest (and most difficult) optimization problem in science today,

frequently involving huge solution spaces and multi goals. We shall learn and then apply some of the best

heuristics to study questions in areas such as analyzing gene expression data sets from DNA chips, phylogenetic

tree reconstruction, multiple sequence alignment, gene finding, RNA secondary structure prediction, and others.

An integral part of the workshop is giving a presentation of the projects to all participants (not just the course

staff). It is mandatory to be physically present during this entire part of the workshop, which is expected to take

during the last three meetings (May 19, May 26, June 2nd). Naturally, your projects will have to be completed

by that time.

Goals, scope :

- Analysis, design and implementation of combinatorial optimization algorithms
- Using hybrid algorithms for real world problem solving
- Some contemporary problems in comparative genomics, DNA chips analysis and more.
- Building efficient local search procedures
- Efficient implementation of algorithms in C++ and STL.

March 23 |
Project(s) preferences submitted |

March 31 |
Project selection confirmed |

April 7 |
A two pages plan (general approach and software design)
submitted |

April 28 |
Progress report submitted. |

May 19- June 2 |
Presentation given in class. Final reports submitted. |

Liat greenshtein and Taly bukra |
The signature algorithm for gene expression analysis |

Sharon Mograbi and Oren Bahat |
Regulatory motif search in cancer data |

Shai Ophir |
Linear motif-expression model optimization using GA |

Ido Liberti and Yariv Eisenberg |
Implementation of SVM learning with a small number
of features |

Support vectore machines (SVMs) are one of the most effective and widely used approaches to supervised learning.

In the basic schenario, we are given two sets of points in some Euclidean space, and wish to find a linear

separator, or a hyperplane, separating the two sets. If such separation is possible, we are further interested in an optimal

one, where optimality is defined in terms of an induced margin. Given two such sets of points, the optimization problem

is a quadratic optimazation problem. There are some solutions available on the web, but they tend to be slower and

produce numerical instabilities when the dimension is

on data from various biological sources. See the excellent book Learning with Kernels: Support Vector Machines,

Regularization, Optimization and Beyond, by Bernhard Schölkopf and Alex Smola,

Support vectore machines (SVMs) are one of the most effective and widely used approaches to supervised learning.

In the basic schenario, we are given two sets of points in some Euclidean space, and wish to find a linear

separator, or a hyperplane, separating the two sets. If such separation is possible, we are further interested in an optimal

one, where optimality is defined in terms of an induced margin. Now in many cases most data sets will not be separable,

so in such cases it makes sense to first filter the data and check for linear separability, and apply the relatively heavy SVM

code only in positive cases.

Your task is to write a fast separability checker. It will work in low dimension (2 to 10) but should be able to check 10 million

sets of size (50,50) in approximately 1-2 minutes. If interested, please approach the instructor for more details.

The goal in this project is to model gene expression in a set of different conditions as a linear combination of transcription factor activities, which are detected via appearances of specific DNA signals upstream each of the genes. Specifically, we are given a set of genes, each with a

small set of sequence motifs (6-mers, 7-mers) fully determine expression. The idea is that the expression of each gene is a linear combination of the activities of those k-mers that

are present in its promoter. We do not know in advance which motifs are important and what are their activities. To find these, we perform a genetic algorithm to find the motifs

and least square minimization to find the activities given the motifs. The project will use yeast data for testing a hopefully efficient implementation of the algorithm.

Ref: Bussemaker et al. Nature genetics, 2001.

The idea in this project is to find significant similarities in two complete genomes (human and mouse). We know that human and mouse genomes are rather conserved but

simple alignment of their sequences is not practical due to many rearrangement of the genome which permute parts of the two sequence with respect to their common ancestor. A simple

strategy for identifying conserved regions in mouse and human is using local alignment, but such strategy possibly lose much structure. Our goal here is to examine several possibilities

for improved alignment of conserved regions. The development should be theoretical in part, with prototyping to prove the concepts.

Ref: Mouse genome issue, Nature 2002.

This project will be dedicated for analysis of cancer gene expression and searching for DNA motifs in promoter of co-regulated genes. The idea is to search exhaustively for DNA

motifs that characterize significantly up or down regulated cancer genes and then generalize them using a simple EM algorithm. The exhaustive search will use hashing for efficient

screening of all k-mers with possible degeneracy. The EM algorithm will use high scoring exact motif and generalize them into a PSSM (simple Markov model) such that the mean

expression of genes with the motif will be optimized. We will test the algorithm on clinical data from lung and breast cancer studies.

Ref: Lung cancer paper

Breast cancer paper

Bussemaker paper

The signature algorithm is a simple method for finding dependency patterns (or biclusters) in gene expression data. The algorithm is a two phase process using as input

a collection of genes and a matrix of gene expression. The idea is to start with some interesting set of genes and refine it by zooming on all condition which somehow characterize it.

The main technique is the analysis of mean expression in the set of genes or conditions and the comparison of it to an expected normal distribution. The goal of this project is

to test the signature algorithm with real and synthetic data and present detail analysis of its behavior (analysis here is as important then the simple implementation).

Ref: Ihmels et al. Nature Genetics 2002

Recent experimental techniques in biology have led to the collection of large number of interaction among different biological factors including

genes and proteins. An interesting global view of these data collections stress properties of the interaction graphs and seeks for simple, yet deep,

phenomenon in its structure. A popular model in that respect is the scale free model (or power law generative model which was successful used to model the internet). A different approach was recently suggested by Shen-Orr et al. The idea is to search for over-represented network motifs (small isomorphic subgraphs) and view the graph as a combination of such subgraphs. In this project we will search for such motifs in protein network

graphs available on the yeast system. To find network motifs we count their appearence compared to a random graph model preserving some characteristics of the graph. Apart from programming the motif search algorithm efficiently, you will have to analyze your results carefully, making sure motifs are not artifacts.

Ref: Shen-Orr et al. Nature Genetics 2002

Milo et al. Science 2002

This project is based on a recent work by D. Kotlar and Y. Lavner, where spectral methods and FFT are used for gene finding in

eukaryots. You will have to implement the algorithm, apply it to several genomes, and compare the results to other methods and to

published literature.

If interested, please contact the instructor for more details.