Computational Biology
- Gene Set Enrichment Analysis: A Knowledge-Based Approach for
Interpreting Genome-wide Expression Profiles
Brief description:
Gene expression profiling experiments have been conducted on a wide
variety of cell lines and animal models with the goal of
identifying gene sets whose expression patterns characterize
specific genetic or molecular perturbations. These gene sets contain
candidate players in pathways, or sub-pathways, that are "annotated"
by the experimental perturbation.
A fundamental idea in this work is that such a gene set serves as a
reference base for interrogating other expression data sets. A new
data set in which a specific pathway gene set appears to be
enriched, in the sense that multiple genes in that set show
expression changes, can then be annotated by that reference pathway.
An analogy can be made here with sequence annotation in a
BLAST search: sets of experimentally derived pathways
serve as annotation reference sets for future experiments in the way
that annotated sequences serve as references in a sequence
search. Statistical methods are needed to define computational tools
for such expression-based pathway annotation. Gene Set Enrichment
Analysis (GSEA) provides formal statistical evaluation, with confidence
assessments, for annotating an expression data set by measuring
the overlap of significantly perturbed genes with those in each
pathway in a database of pathways. GSEA has been successfully applied
in a number of basic and clinical studies, including pathway
deregulation in cancer genomics.
URL: GSEA
Paper: Gene Set Enrichment Analysis:
A Knowledge-Based Approach for Interpreting Genome-wide Expression Profiles
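As a rough illustration of annotation by overlap, the following Python sketch scores how surprising the overlap between a list of perturbed genes and a pathway gene set is, using a hypergeometric upper-tail p-value. This is a simplified stand-in and not the GSEA statistic itself; the function name `overlap_pvalue`, the gene labels, and the universe size are assumptions made for the example.

```python
from math import comb

def overlap_pvalue(perturbed, pathway, universe_size):
    """Return (overlap size, upper-tail hypergeometric p-value).

    Assumes every gene in both lists belongs to a common universe of
    `universe_size` genes (hypothetical setup for illustration).
    """
    perturbed, pathway = set(perturbed), set(pathway)
    k = len(perturbed & pathway)   # observed overlap
    n = len(perturbed)             # number of perturbed genes
    K = len(pathway)               # number of pathway genes
    N = universe_size
    # P(overlap >= k) when n genes are drawn at random from N
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return k, tail / comb(N, n)
```

For instance, with 20 genes in total, a 5-gene pathway, and 5 perturbed genes of which 4 fall in the pathway, the p-value is 1/204, or about 0.0049. In practice one would repeat this for each pathway in the database and correct for multiple testing.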
- Analysis of Sample Set Enrichment Scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles
Brief description:
Gene expression profiling experiments in cell lines and animal models
characterized by specific genetic or molecular perturbations have
yielded sets of genes "annotated" by the perturbation. These gene
sets can serve as a reference base for interrogating other expression
data sets. For example, a new data set in which a specific pathway
gene set appears to be enriched, in the sense that multiple genes in
that set show expression changes, can then be annotated by that
reference pathway. We introduce in this paper a formal statistical
method to measure the enrichment of a gene set in each sample of an
expression data set. This allows us to assay the natural variation of pathway activity
in observed gene expression data sets from clinical cancer and other
studies. We validate the method and illustrate the biological
insights it yields on cell line data, mouse models, and
cancer-related datasets. Using oncogenic pathway signatures, we show
that gene sets built from the model systems are indeed enriched in the
model systems. We also use ASSESS for molecular classification
by pathways. This provides an accurate classifier that can be
interpreted at the level of pathways instead of individual
genes. Finally, ASSESS can be used for cross-platform expression
models where data on the same type of cancer are integrated over
different platforms into a space of enrichment scores.
URL: ASSESS
Paper: Analysis of Sample Set Enrichment Scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles
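The idea of one enrichment score per sample can be sketched in Python as follows. Each gene is standardized across samples, genes are ranked within each sample, and an unweighted Kolmogorov-Smirnov style running sum is taken over the ranked list. This is only an illustrative stand-in, not the ASSESS statistic; the function names and the input layout (a dict from gene name to a list of per-sample values) are assumptions.

```python
from statistics import mean, pstdev

def running_sum_score(ranked_genes, gene_set):
    """Signed maximum deviation of an unweighted running sum over a ranking."""
    gene_set = set(gene_set)
    n_hit = sum(1 for g in ranked_genes if g in gene_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for g in ranked_genes:
        # step up on a gene-set member, down otherwise
        running += 1.0 / n_hit if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def per_sample_scores(expr, gene_set):
    """expr maps gene -> list of expression values, one per sample."""
    genes = list(expr)
    n_samples = len(next(iter(expr.values())))
    z = {}
    for g in genes:
        mu, sd = mean(expr[g]), pstdev(expr[g])
        # standardize each gene across samples; constant genes get 0
        z[g] = [(v - mu) / sd if sd > 0 else 0.0 for v in expr[g]]
    scores = []
    for j in range(n_samples):
        # rank genes from most to least over-expressed in sample j
        ranked = sorted(genes, key=lambda g: z[g][j], reverse=True)
        scores.append(running_sum_score(ranked, gene_set))
    return scores
```

The resulting vector of scores, one per sample and per gene set, is the kind of representation that could feed a pathway-level classifier or be compared across platforms.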
Statistics and Machine Learning
- Learning Coordinate Covariances via Gradients
Brief description:
We introduce an algorithm that learns gradients from samples in
the supervised learning framework. An error analysis is given for
the convergence of the gradient estimated by the algorithm to the
true gradient. The utility of the algorithm for variable selection
and for determining variable covariance is illustrated on simulated
data and on two gene expression data sets. For square loss we
provide a very efficient implementation with respect to both memory
and time.
URL: COVIAGRA
Paper: Learning Coordinate Covariances via Gradients
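To make the idea of learning gradients from samples concrete, here is a minimal Python sketch. It estimates a gradient at each sample point by a kernel-weighted local linear fit under square loss, then averages the outer products of those gradients: large diagonal entries suggest salient variables and off-diagonal entries suggest how variables covary. This pointwise fit is a simplified substitute for the algorithm in the paper, and the function names, `bandwidth`, and `ridge` parameters are assumptions.

```python
from math import exp

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def local_gradients(X, y, bandwidth=1.0, ridge=1e-8):
    """Estimate one gradient vector per sample by weighted least squares."""
    d = len(X[0])
    grads = []
    for xi, yi in zip(X, y):
        # normal equations (A + ridge * I) g = b for the local linear fit
        A = [[ridge if r == c else 0.0 for c in range(d)] for r in range(d)]
        b = [0.0] * d
        for xj, yj in zip(X, y):
            diff = [xj[k] - xi[k] for k in range(d)]
            w = exp(-sum(t * t for t in diff) / (2 * bandwidth ** 2))
            for r in range(d):
                b[r] += w * (yj - yi) * diff[r]
                for c in range(d):
                    A[r][c] += w * diff[r] * diff[c]
        grads.append(solve(A, b))
    return grads

def gradient_outer_product(grads):
    """Average of g g^T over all estimated gradients."""
    d, n = len(grads[0]), len(grads)
    return [[sum(g[r] * g[c] for g in grads) / n for c in range(d)]
            for r in range(d)]
```

On data where the response depends only on the first of two variables, the first diagonal entry of the resulting matrix dominates and the second is close to zero, which is the variable selection signal.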
- Estimation of Gradients and Coordinate Covariation in Classification
Brief description:
We introduce an algorithm that simultaneously estimates a
classification function as well as its gradient in the supervised
learning framework. The motivation for the algorithm is to find
salient variables and estimate how they covary. An efficient
implementation with respect to both memory and time is given. The
utility of the algorithm is illustrated on simulated data as well as
a gene expression data set. An error analysis is given for the
convergence of the estimate of the classification function and
its gradient to the true classification function and true gradient.
URL: COVIAGRA
Paper: Estimation of Gradients and Coordinate Covariation in Classification
- Bayesian kernel regression (BAKER)
Brief description:
Kernel models for classification and regression have emerged as widely
applied tools in statistics and machine learning. We discuss a
Bayesian framework and theory for kernel methods, providing new
rationalisation of kernel regression based on non-parametric Bayesian
models. Functional analytic results ensure that such a non-parametric
prior specification induces a class of functions that span the
reproducing kernel Hilbert space corresponding to the selected
kernel. Bayesian analysis of the model allows for direct and formal
inference on the uncertain regression or classification
functions. Adding Bayesian variable selection priors
over kernel bandwidth parameters extends the framework to
automatically address the key practical question of kernel feature
selection. Novel, customised MCMC methods are detailed and used to
implement the analyses in the examples. The practical benefits and
modelling flexibility of the Bayesian kernel framework are illustrated
in both simulated and real data examples that address prediction and
classification inference with high-dimensional data.
URL: BAKER
Paper: Non-parametric Bayesian kernel models
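The link between bandwidth parameters and feature selection can be illustrated with a Gaussian kernel that carries one scale per variable. In the Python sketch below, a scale of zero removes a variable from the kernel entirely, so a prior that places mass at zero on each scale acts as a variable selection prior. The parametrization and the names `ard_gaussian_kernel`, `kernel_matrix`, and `scales` are assumptions for illustration, not the specification used in the paper.

```python
from math import exp

def ard_gaussian_kernel(x, z, scales):
    """Gaussian kernel with one non-negative scale per variable.

    A scale of 0 means the corresponding variable has no influence.
    """
    return exp(-sum(s * (a - b) ** 2 for s, a, b in zip(scales, x, z)))

def kernel_matrix(X, scales):
    """Kernel matrix over the rows of X for the given per-variable scales."""
    return [[ard_gaussian_kernel(xi, xj, scales) for xj in X] for xi in X]
```

With `scales = [1.0, 0.0]`, two points that differ only in the second variable have kernel value 1, so any regression function built from this kernel ignores that variable.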