Duke Center for Bioinformatics & Computational Biology

STA 293B/BGT 08: Genomic Expression Analysis: Technology, Computation & Analysis


Project

Students will be assessed for the course based on a mini-project summarised in a paper to be handed in at the end of semester. Students will identify individual project topics. Some examples (not intended to be at all limiting) include:

development of gene expression analysis using course data sets or other data sets, either using custom software or student developed tools;
reviews of published research on expression analysis, including reviews and critiques of statistical and computational methods;
studies exploring different methods of expression analysis using specific data sets;
development and computer implementation of published or new methodology for analysis of expression data (e.g., new clustering methods or regression methods).

Within this general scope, more specific project ideas:

A review and critique of the MIT leukemia study (Golub et al). This should include some original data analysis as well as discussion of the goals, methods and results in the paper. This study uses a specific method of screening genes to select a set for the follow-on analysis. You might reanalyse using some of the methods from the class.
A review and critique of one of the other papers on the web site here, or another you have identified (and have approved for this purpose by the instructors), with original data analysis and exploration, as mentioned above.
Further exploration of the breast cancer data set. Exploration of relationships among smallish subsets of genes found "interesting" in connection with ER or Nodal status, using regression methods as explored in class. Development of these methods for each of, say, a set of 30 or 40 such genes using regression variable selection. Development of graphical "network" representation of results. Some web/literature search to identify information on genes in the selected group to provide plausible biological rationales for any empirical findings.
As above, exploring the MIT leukemia data. Identify subsets of genes (say 50 or 100) implicated in the ALL/AML distinction, via raw correlations. Explore regression models to describe/explain variation in expression levels of the most highly related gene, and then all in a smaller subset of, say, 40 or so. Explore issues similar to those described above for the breast cancer study.
As above, using one of the other course data sets, your own or a colleague's data, or other data from public/web resources, depending on your interests.
As above, using SVD/PCA decompositions to identify smaller subsets of genes that score most highly (positive or negative) on factor loadings. For example, with breast cancer, the 6th factor in the full data set is most highly related to the ER discrimination. What about the other factors? For example, factor 2? Look at highly weighted genes on factor 2. Explore regression models to understand the empirical relationships, as above.
Binary regression model analyses of breast cancer nodal status, or revised analyses of ER status using different methods to select subsets of genes. What results do you get using all the data? Other ways of "throwing out genes" before you fit the model?
Binary regression model analyses of MIT leukemia data to explore models that discriminate ALL/AML leukemias. Again, gene subset selection is up to you. What results do you get using all the data? Other ways of "throwing out genes" before you fit the model?
Related analyses of E2F data.
Use of SVD/PCA analysis methods to try to identify clusters of genes implicated in the cell cycle data.
Exploration of any of the above data sets, using selected subsets of genes, using either Matlab clustering via k-means or some of the commercial software packages for clustering. Example questions include: what genes are grouped together in clustering breast cancer data, and how do the clusters compare with results from SVD/PCA analyses? Do we find an "ER" cluster? What else shows up? How does selection of subsets of genes affect this?
Bayesian networks/graphical models. Development of custom software to build Bayes nets (sets of regression models, exactly as explored in class) for subsets of genes. May address the issue of modelling genes "expressed" versus "unexpressed." There is software (in Matlab and other tools) out there on the web for some simpler graphical models. Biological interpretation and identification of known relationships among genes?
Other topics can be suggested and will be OK once approved by instructors.
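
As an illustration of the "raw correlations" screening step mentioned in several of the ideas above, here is a minimal sketch in Python. It is not the method used in class or in any of the papers: it simply ranks genes by the absolute Pearson correlation between each gene's expression profile and a 0/1 class label (for example ALL versus AML, or ER negative versus positive). The function names `pearson` and `screen_genes`, the gene names and the toy numbers are all made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant profile carries no information about the label.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_genes(expression, labels, top_k):
    """Return the top_k (gene, correlation) pairs, ranked by |correlation|.

    expression: dict mapping gene name -> list of expression levels,
                one value per sample (hypothetical layout).
    labels:     list of 0/1 class labels, one per sample.
    """
    scored = [(gene, pearson(values, labels))
              for gene, values in expression.items()]
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored[:top_k]


# Toy data, invented for illustration: 3 genes measured on 6 samples.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # higher in class 1
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],  # unrelated to class
    "gene_c": [5.0, 4.8, 5.1, 1.0, 1.2, 0.9],  # lower in class 1
}

top = screen_genes(expression, labels, top_k=2)
for gene, r in top:
    print(gene, round(r, 3))
```

With real data, `top_k` would be something like the 50 or 100 genes suggested above, and the selected subset would then feed into the regression, SVD/PCA or clustering analyses. Ranking on the absolute value keeps genes that are strongly negatively related to the class label as well as those that are positively related.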

Report: Your final report will be handed in for assessment at the end of semester. It should be no more than 10 pages long, with additional material in appendices that will likely include a lot of graphs summarising various analyses, among other things. The report should be written as a scientific document: a one-paragraph abstract summarising the work; a short introductory section describing the data, problem context and goals; a section describing in detail the work you performed, including the computational and statistical aspects; a section describing the results and any biological interpretations; a short final section identifying aspects of the work that might be improved or extended; and the appendices.