STA 293B/BGT 08: Genomic Expression Analysis:
Technology, Computation & Analysis
Project
Students will be assessed for the course on the basis of
a mini-project, summarised in a paper to be handed in at the end of the semester. Students
will identify individual project topics. Some examples (not intended to be
at all limiting) include:
- development of gene expression analysis using course data sets or other
data sets, using either custom software or student-developed tools;
- reviews of published research on expression analysis, including reviews and
critiques of statistical and computational methods;
- studies exploring different methods of expression analysis using specific data sets;
- development and computer implementation of published or new methodology for
analysis of expression data (e.g., new clustering methods, regression methods, etc.).
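As an illustration of the last kind of project, here is a minimal sketch of k-means clustering applied to expression profiles, written in Python using only the standard library. The function names `sq_dist` and `kmeans`, their parameters and the toy profiles are assumptions made for illustration; they are not course code or course data, and a real project would more likely use an established implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, k, n_iter=50, seed=0):
    """Cluster profiles (one list of floats per gene) into k groups.

    Returns (labels, centers). Initial centers are k profiles drawn at random.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in profiles
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the mean of its assigned profiles.
        for j in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up toy data: six "genes" measured on three samples, in two obvious groups.
toy_profiles = [
    [0.0, 0.1, 0.2],
    [0.1, 0.0, 0.1],
    [0.2, 0.1, 0.0],
    [5.0, 5.1, 4.9],
    [5.2, 5.0, 5.1],
    [4.9, 5.0, 5.2],
]
labels, centers = kmeans(toy_profiles, k=2)
print(labels)
```

The cluster numbering depends on the random initialisation, so only the grouping of genes is meaningful, not which group is called 0 or 1.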
Within this general scope, more specific project ideas include:
- A review and critique of the MIT leukemia study (Golub et al.). This should include
some original data analysis as well as discussion of the goals, methods and results in the
paper. This study uses a specific method of screening genes to select a set for
the follow-on analysis. You might reanalyse the data using some of the methods from the class.
- A review and critique of one of the other papers on the web site here, or another
you have identified (and have approved for this purpose by the instructors), with
original data analysis and exploration, as mentioned above.
- Further exploration of the breast cancer data set. Exploration of relationships among
smallish subsets of genes found "interesting" in connection with ER or Nodal status,
using regression methods as explored in class. Development of these methods for each of,
say, a set of 30 or 40 such genes using regression variable selection. Development of a
graphical "network" representation of results. Some web/literature search to identify
information on genes in the selected group to provide plausible biological rationales for
any empirical findings.
- As above, exploring the MIT leukemia data. Identify subsets of genes (say 50 or 100)
implicated in the ALL/AML distinction, via raw correlations. Explore regression models to
describe/explain variation in expression levels of the most highly related gene, and
then of all genes in a smaller subset of, say, 40 or so. Explore issues similar to those
described above for the breast cancer study.
- As above, using one of the other course data sets, your own or a colleague's data,
or other data from public/web resources, depending on your interests.
- As above, using SVD/PCA decompositions to identify smaller subsets of genes
that score most highly (positive or negative) on factor loadings. For example,
with breast cancer, the 6th factor in the full data set is most highly related to
the ER discrimination. What about the other factors? For example, factor 2? Look
at highly weighted genes on factor 2. Explore regression models to understand
the empirical relationships, as above.
- Binary regression model analyses of breast cancer nodal status, or revised
analyses of ER status using different methods to select subsets of genes.
What results do you get using all the data? Other ways of "throwing out genes"
before you fit the model?
- Binary regression model analyses of MIT leukemia data to explore models
that discriminate ALL/AML leukemias. Again, gene subset selection is up to you.
What results do you get using all the data? Other ways of "throwing out genes"
before you fit the model?
- Related analyses of E2F data.
- Use of SVD/PCA analysis methods to try to identify clusters of genes implicated
in the cell cycle data.
- Exploration of any of the above data sets, or of selected subsets of
genes, using either Matlab clustering via k-means or some of the commercial
software packages for clustering. Example questions include: what genes
are grouped together in clustering breast cancer data, and how do the clusters
compare with results from SVD/PCA analyses? Do we find an "ER" cluster? What
else shows up? How does selection of subsets of genes affect this?
- Bayesian networks/graphical models. Development of custom software to
build Bayes nets (sets of regression models, exactly as explored in class)
for subsets of genes. May address the issue of modelling genes "expressed"
versus "unexpressed." There is software (in Matlab and other tools) out there
on the web for some simpler graphical models. Biological interpretation and
identification of known relationships among genes?
- Other topics can be suggested and will be OK once approved by instructors.
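The gene-screening step that recurs in the ideas above (ranking genes by their raw correlation with a class label such as ALL/AML) might be sketched as follows. This is a minimal Python sketch using only the standard library; the function names `pearson` and `screen_genes`, the gene names and the expression values are all hypothetical toy choices, not course data, and this is not necessarily the screening method used in the Golub et al. paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no information about the class
    return sxy / sqrt(sxx * syy)


def screen_genes(expression, class_labels, n_top):
    """Return the n_top genes with the largest absolute correlation.

    expression: dict mapping gene name -> expression values, one per sample
    class_labels: 0/1 class indicator, one per sample (assumed: 0 = ALL, 1 = AML)
    """
    scores = {gene: pearson(values, class_labels) for gene, values in expression.items()}
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return [(gene, scores[gene]) for gene in ranked[:n_top]]


# Made-up toy data: three genes measured on six samples.
class_labels = [0, 0, 0, 1, 1, 1]
toy_expression = {
    "gene_up": [1.0, 1.2, 0.9, 3.1, 2.8, 3.0],    # higher in class 1
    "gene_down": [4.0, 4.2, 3.9, 1.0, 1.1, 0.8],  # lower in class 1
    "gene_flat": [2.0, 2.1, 1.9, 2.0, 2.1, 1.9],  # unrelated to class
}
top = screen_genes(toy_expression, class_labels, n_top=2)
for gene, r in top:
    print(gene, round(r, 3))
```

Ranking by absolute correlation keeps both over- and under-expressed genes; the selected subset could then feed the regression or clustering analyses described above.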
Report: Your final report will be handed in for assessment at the end
of the semester. It should be no more than 10 pages long, with additional material in appendices
that will likely include a lot of graphs summarising various analyses, among other
things. The report should be written as a scientific document: a one-paragraph abstract
summarising the work; a short introductory section describing the data, problem
context and goals; a section describing in detail the work you performed, including
its computational and statistical aspects; a section describing the results and
any biological interpretations; a short final section identifying aspects of the
work that might be improved or extended; and the appendices.