Workshop in Computer Science

Tel Aviv University School of Computer Science

Fall 2011-12

0368-3500-22

http://www.cs.tau.ac.il/~rshamir/workshop/11/

Workshop instructor: Prof. Ron Shamir
Lab instructor: David Amar (davidama AT post.tau.ac.il)

Workshop: Tuesday 14-16, Schreiber 7; Lab: Tuesday 16-19

Presentation from the first meeting
Presentation from the second meeting
Links to the papers from the second lecture:
Popescu and Yona 2005
Kharchenko et al. 2006
Kharchenko et al. 2005
Chen and Vitkup 2006

Stage 1 data
Stage 1 pathways training set
Explanation of what your executable should look like
Congratulations to Shahar and Nofar for having the best predictions on Stage 1 test data

Stage 2 metabolic dependencies data
Stage 2 pathways training set
An example for a paper that used graphs for pathway extension
Congratulations to Yael and Michal G for having the best predictions on Stage 2 test data

Stage 3 protein interaction graph
Stage 3 pathways training set

Background and motivation: Researchers in biotechnology and bioengineering often try to use organisms to produce a specific compound. For example, in 2009 a research group from UCLA engineered cyanobacteria to convert greenhouse gas into liquid fuel. To do this, one must understand the relevant pathway very well, including exactly which genes take part in it.

Other labs try to enhance the level of compounds called carotenoids in plants. These compounds are used by plants for production of vitamin A, a vital factor in human nutrition. Vitamin A deficiency is a major problem in child nutrition in developing countries: approximately 250,000 to 500,000 malnourished children in the developing world go blind each year from vitamin A deficiency, and about half of them die within a year of becoming blind. Since plants are the only source for vitamin A, the problem of enhancing the level of carotenoids in plants is extremely important.

These types of problems require finding candidate genes that can alter a given biological process. To do this, one first has to answer a seemingly simple question: are there additional genes, not yet known to us, that are involved in the process, and if so, which are they? This problem is complex because our biological information is incomplete: usually only a few dozen genes are known to be related to the tested process, while the process may involve a hundred genes or more, out of the tens of thousands of genes in the organism. The challenge is to combine incomplete data from multiple sources in the best way in order to find the most likely candidate genes.

Figure: A pathway of interest (center) and candidate genes that may be missing from it (blue). One missing gene (red) interacts with several pathway genes. The goal is to find this gene from the large candidate set (all the blue genes).

Prerequisites: The workshop is open to all 3rd year students in computer science. No biological background is assumed. If the workshop is oversubscribed, preference will be given to students in the bioinformatics track. Knowledge of Java is required.

Workshop description: Given a pathway (a set of genes that are known to take part in one biological process), the goal is to rank other candidate genes by how likely they are to belong to the pathway, using background data. You will have to do the ranking on several pathways, using different types of data sets that contain information about genes and their relations. Use of some of the data sets is obligatory, and the others are optional. The goal is to use all the obligatory data sets and possibly some optional ones in order to rank all candidates for each specific pathway. The predictive power of candidate methods will be evaluated using cross-validation, a widely used statistical technique. The work will be done in stages, building the predictors gradually.
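To make the cross-validation idea concrete, here is a minimal sketch of one possible variant, leave-one-out: each known pathway gene is hidden in turn, mixed in with the candidates, and the rank it receives is recorded. The class name, the Scorer interface, and the neighbor-count scorer are all assumptions made for illustration; they are not part of the workshop specification, and the actual evaluation procedure used in the workshop may differ.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative leave-one-out evaluation of a pathway-membership scorer. */
class LeaveOneOutEvaluation {

    /** Hypothetical scoring interface: a higher score means "more likely in the pathway". */
    interface Scorer {
        double score(String gene, Set<String> pathway);
    }

    /** Records an undirected interaction between genes a and b. */
    static void addEdge(Map<String, Set<String>> interactions, String a, String b) {
        interactions.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        interactions.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Example scorer (an assumption): number of interaction partners inside the pathway. */
    static Scorer neighborCount(Map<String, Set<String>> interactions) {
        return (gene, pathway) -> {
            int hits = 0;
            for (String n : interactions.getOrDefault(gene, Collections.<String>emptySet())) {
                if (pathway.contains(n)) {
                    hits++;
                }
            }
            return hits;
        };
    }

    /**
     * Hides each pathway gene in turn and computes the rank (1 = best) that the
     * scorer assigns to it among the candidates. Returns the mean of these ranks.
     */
    static double meanHeldOutRank(Set<String> pathway, Set<String> candidates, Scorer scorer) {
        if (pathway.isEmpty()) {
            throw new IllegalArgumentException("pathway must not be empty");
        }
        double rankSum = 0;
        for (String heldOut : pathway) {
            Set<String> reduced = new HashSet<>(pathway);
            reduced.remove(heldOut);
            double heldOutScore = scorer.score(heldOut, reduced);
            int rank = 1;
            for (String c : candidates) {
                // Ties count against the held-out gene, giving a pessimistic rank.
                if (scorer.score(c, reduced) >= heldOutScore) {
                    rank++;
                }
            }
            rankSum += rank;
        }
        return rankSum / pathway.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> g = new HashMap<>();
        addEdge(g, "p1", "p2");
        addEdge(g, "p2", "p3");
        addEdge(g, "p1", "p3");
        addEdge(g, "c1", "p1");
        Set<String> pathway = new HashSet<>(Arrays.asList("p1", "p2", "p3", "p4"));
        Set<String> candidates = new HashSet<>(Arrays.asList("c1", "c2"));
        System.out.println("mean held-out rank: "
                + meanHeldOutRank(pathway, candidates, neighborCount(g)));
    }
}
```

A lower mean rank suggests that the scorer tends to recover hidden pathway genes. Other summaries, such as the fraction of held-out genes ranked in the top k, could be computed from the same loop.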

Stage 1: Prediction using only one type of data set, due 6/12/11.

Stage 2: Prediction using two types of data, due 10/1/12.

Stage 3: A final prediction system using at least three different types of data, due 20/3/12.

The work will be done in pairs or individually. We shall have 2-3 introductory meetings at the beginning of the semester to provide the necessary background.

Problem input
(1) The pathway: A set P of genes that are known to take part in a specific process or pathway (e.g., photosynthesis light reactions or ribosome activation).
(2) Supporting data: Obligatory and optional data sets that contain descriptors on genes and their relations.
(3) The candidates: A set S of genes (disjoint from P) to be ranked.

Output
Ranking of the candidates. A ranked list of the genes in S, with a score for each gene.
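
The input and output described above can be sketched with a very simple baseline: score each candidate in S by the number of genes in P it interacts with, then sort by score. This is only an illustration under stated assumptions. The class and method names are hypothetical, the interaction graph is assumed to be undirected and given as an adjacency map, and the scoring rule is a naive baseline rather than a recommended method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative baseline: ranks candidates by their number of interactions with pathway genes. */
class NeighborCountRanker {

    /** A candidate gene with its score (hypothetical type, not part of the workshop specification). */
    static final class ScoredGene {
        final String gene;
        final double score;

        ScoredGene(String gene, double score) {
            this.gene = gene;
            this.score = score;
        }

        @Override
        public String toString() {
            return gene + "\t" + score;
        }
    }

    /** Records an undirected interaction between genes a and b. */
    static void addEdge(Map<String, Set<String>> interactions, String a, String b) {
        interactions.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        interactions.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Returns the candidates in S sorted from most to least likely pathway member. */
    static List<ScoredGene> rank(Set<String> pathway, Set<String> candidates,
                                 Map<String, Set<String>> interactions) {
        List<ScoredGene> ranked = new ArrayList<>();
        for (String candidate : candidates) {
            int hits = 0;
            for (String neighbor
                    : interactions.getOrDefault(candidate, Collections.<String>emptySet())) {
                if (pathway.contains(neighbor)) {
                    hits++;
                }
            }
            ranked.add(new ScoredGene(candidate, hits));
        }
        // Highest score first; ties are broken by gene name so the output is deterministic.
        ranked.sort(Comparator.comparingDouble((ScoredGene s) -> s.score).reversed()
                .thenComparing(s -> s.gene));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> interactions = new HashMap<>();
        addEdge(interactions, "c1", "p1");
        addEdge(interactions, "c1", "p2");
        addEdge(interactions, "c2", "p1");
        Set<String> pathway = new HashSet<>(Arrays.asList("p1", "p2", "p3"));
        Set<String> candidates = new HashSet<>(Arrays.asList("c1", "c2", "c3"));
        // Prints one "gene<TAB>score" line per candidate, best first.
        for (ScoredGene s : rank(pathway, candidates, interactions)) {
            System.out.println(s);
        }
    }
}
```

Keeping the score next to each gene, rather than returning only the order, matches the requested output and makes it easier to combine scores from several data types in later stages.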

In each stage, a training set of pathways will be given for developing the methods. Upon delivery of the results and the software at the end of the stage, the software will be run on an additional test set of new pathways.

Software: The algorithms will be implemented in Java and tested on Linux.

Grading:

15% for the performance on stage 1 (5% for the accuracy on the training pathways and 10% for the accuracy on the test pathways)
25% for the performance on stage 2 (10% for the accuracy on the training pathways and 15% for the accuracy on the test pathways)
30% for the performance on stage 3 (15% for the accuracy on the training pathways and 15% for the accuracy on the test pathways)
20% for the implementation (modularity, clarity, documentation, efficiency)
10% for the final report and presentation
5% bonus in each stage to the group with the most accurate results
5% bonus for use of optional data types (provided or suggested by the group)
5% penalty for not meeting any stage deadline
