Computational Biology
  1. Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-wide Expression Profiles

    Brief description:
Gene expression profiling experiments have been conducted on a wide variety of cell lines and animal models with the goal of identifying gene sets whose expression patterns characterize specific genetic or molecular perturbations. These gene sets contain candidate players in pathways, or sub-pathways, that are "annotated" by the experimental perturbation. A fundamental idea in this work is that such a gene set serves as a reference base for interrogating other expression data sets. A new data set in which a specific pathway gene set appears to be enriched, in the sense that multiple genes in that set show expression changes, can then be annotated by that reference pathway. An analogy can be made with sequence annotation in a BLAST search: sets of experimentally derived pathways serve as annotation references for future experiments, much as annotated sequences serve as references in a sequence search. Statistical methods are needed to make such expression-based pathway annotation computational: Gene Set Enrichment Analysis (GSEA) provides formal statistical evaluation, with confidence assessments, for annotating an expression data set by measuring the overlap of significantly perturbed genes with those in each pathway in a pathway database. GSEA has been applied successfully in a number of basic and clinical studies, including studies of pathway deregulation in cancer genomics.

    URL: GSEA

    Paper: Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-wide Expression Profiles
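The set-overlap test described above can be sketched as a hypergeometric tail probability over the intersection of perturbed genes with a pathway gene set. This is a simplified illustration of overlap enrichment, not the full GSEA statistic; the function and variable names are hypothetical:

```python
from math import comb

def overlap_enrichment_pvalue(perturbed, pathway, universe):
    """Hypergeometric upper-tail p-value for the overlap between a set of
    significantly perturbed genes and a pathway gene set, drawn from a
    common gene universe (simplified sketch of set-overlap enrichment)."""
    perturbed, pathway, universe = set(perturbed), set(pathway), set(universe)
    k = len(perturbed & pathway)   # observed overlap
    M = len(universe)              # universe size
    n = len(pathway)               # pathway size
    N = len(perturbed)             # number of perturbed genes
    # P(overlap >= k) under sampling without replacement
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)
```

A small overlap among few perturbed genes yields a p-value near 1; a complete overlap yields 1/C(M, N), the chance of drawing exactly the pathway genes.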

  2. Analysis of Sample Set Enrichment Scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles

    Brief description:
Gene expression profiling experiments in cell lines and animal models characterized by specific genetic or molecular perturbations have yielded sets of genes "annotated" by the perturbation. These gene sets can serve as a reference base for interrogating other expression data sets. For example, a new data set in which a specific pathway gene set appears to be enriched, in the sense that multiple genes in that set show expression changes, can then be annotated by that reference pathway. We introduce in this paper a formal statistical method to measure the enrichment of each individual sample in an expression data set. This allows us to assay the natural variation of pathway activity in observed gene expression data from clinical cancer and other studies. Validation of the method and illustrations of the biological insights gleaned are demonstrated on cell line data, mouse models, and cancer-related datasets. Using oncogenic pathway signatures, we show that gene sets built from the model systems are indeed enriched in the corresponding model system. We apply ASSESS to molecular classification by pathways. This provides an accurate classifier that can be interpreted at the level of pathways instead of individual genes. Finally, ASSESS can be used for cross-platform expression models, in which data on the same type of cancer are integrated over different platforms into a space of enrichment scores.

    URL: ASSESS

    Paper: Analysis of Sample Set Enrichment Scores: assaying the enrichment of sets of genes for individual samples in genome-wide expression profiles
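A minimal sketch of a per-sample enrichment score: the mean expression rank of the gene set within one sample, relative to the overall mean rank. This illustrates the single-sample idea only, not the exact ASSESS statistic; names are hypothetical:

```python
def sample_enrichment_score(expr, gene_set):
    """Per-sample enrichment score for one sample.

    expr     -- dict mapping gene name -> expression value in this sample
    gene_set -- iterable of gene names forming the pathway signature

    Returns the mean rank of set genes (ranked by expression, ascending)
    minus the overall mean rank; positive means the set sits toward the
    top of this sample's expression profile."""
    ranked = sorted(expr, key=lambda g: expr[g])
    rank = {g: i + 1 for i, g in enumerate(ranked)}
    in_set = [rank[g] for g in gene_set if g in rank]
    overall_mean = (len(ranked) + 1) / 2
    return sum(in_set) / len(in_set) - overall_mean
```

Computing this score for every sample turns an expression matrix into a matrix of samples-by-gene-sets, the "space of enrichment scores" in which classification and cross-platform integration can proceed.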

Statistics and Machine Learning
  1. Learning Coordinate Covariances via Gradients

    Brief description:
We introduce an algorithm that learns gradients from samples in the supervised learning framework. An error analysis is given for the convergence of the gradient estimated by the algorithm to the true gradient. The utility of the algorithm for variable selection, and for determining variable covariance, is illustrated on simulated data and on two gene expression data sets. For the square loss we provide an implementation that is very efficient in both memory and time.

    URL: COVIAGRA

    Paper: Learning Coordinate Covariances via Gradients
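The gradient-learning idea can be sketched as a locally weighted linear fit at each sample, followed by the empirical gradient outer product, whose diagonal ranks variable relevance. This is a simplified NumPy illustration of the general approach, not the paper's RKHS-regularised estimator; names and the bandwidth default are assumptions:

```python
import numpy as np

def gradient_outer_product(X, y, bandwidth=1.0):
    """Estimate the gradient at each sample by a Gaussian-weighted local
    linear regression, then average the gradient outer products.

    X -- (n, d) sample matrix, y -- (n,) responses.
    Returns the (d, d) empirical gradient outer product (EGOP) matrix."""
    n, d = X.shape
    G = np.zeros((n, d))
    for i in range(n):
        diff = X - X[i]                                  # displacements from x_i
        w = np.exp(-np.sum(diff**2, axis=1) / (2 * bandwidth**2))
        sw = np.sqrt(w)                                  # weighted least squares
        A = np.c_[np.ones(n), diff]                      # local linear design
        beta, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)
        G[i] = beta[1:]                                  # local slope = gradient
    return G.T @ G / n                                   # averages grad grad^T
```

For a function that is exactly linear in one coordinate and constant in the rest, the EGOP concentrates on the corresponding diagonal entry, which is what makes it usable for variable selection.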

  2. Estimation of Gradients and Coordinate Covariation in Classification

    Brief description:
We introduce an algorithm that simultaneously estimates a classification function and its gradient in the supervised learning framework. The motivation for the algorithm is to find salient variables and to estimate how they covary. An implementation that is efficient in both memory and time is given. The utility of the algorithm is illustrated on simulated data and on a gene expression data set. An error analysis is given for the convergence of the estimates of the classification function and its gradient to the true classification function and true gradient.

    URL: COVIAGRA

    Paper: Estimation of Gradients and Coordinate Covariation in Classification
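Given per-sample gradient estimates of a classification function, from whatever estimator, coordinate saliency and covariation can be read off the empirical gradient outer product: diagonal entries rank variables, off-diagonal entries indicate how pairs of variables covary in their effect on the classifier. A minimal stdlib sketch of this post-processing step (illustrative only; names are hypothetical):

```python
def coordinate_covariation(grads):
    """Empirical gradient outer product from per-sample gradient estimates.

    grads -- list of n gradient vectors, each of length d.
    Returns the d x d matrix C with C[a][b] = mean over samples of
    grad_a * grad_b: C[a][a] measures saliency of variable a, and a
    large |C[a][b]| means variables a and b covary in their effect."""
    n, d = len(grads), len(grads[0])
    return [[sum(g[a] * g[b] for g in grads) / n for b in range(d)]
            for a in range(d)]
```

For instance, if the classification function depends on two coordinates only through their sum, their gradient components move together and the corresponding off-diagonal entry is large and positive.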

  3. Bayesian kernel regression (BAKER)

    Brief description:
Kernel models for classification and regression have emerged as widely applied tools in statistics and machine learning. We discuss a Bayesian framework and theory for kernel methods, providing a new rationalisation of kernel regression based on non-parametric Bayesian models. Functional analytic results ensure that such a non-parametric prior specification induces a class of functions that span the reproducing kernel Hilbert space corresponding to the selected kernel. Bayesian analysis of the model allows direct and formal inference on the uncertain regression or classification functions. Augmenting the model with Bayesian variable-selection priors over kernel bandwidth parameters extends the framework to address, automatically, the key practical question of kernel feature selection. Novel, customised MCMC methods are detailed and used in the example analyses. The practical benefits and modelling flexibility of the Bayesian kernel framework are illustrated in both simulated and real data examples that address prediction and classification inference with high-dimensional data.

    URL: BAKER

    Paper: Non-parametric Bayesian kernel models
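The conjugate-Gaussian core of Bayesian kernel regression can be sketched as Gaussian-process-style posterior inference with a Gaussian kernel: the posterior over the regression function is available in closed form. The paper's MCMC methods and variable-selection priors over bandwidths are not shown here; the function names, the bandwidth, and the noise default are assumptions:

```python
import numpy as np

def bayesian_kernel_regression(X, y, Xnew, bandwidth=1.0, noise=0.1):
    """Posterior mean and variance of a Bayesian kernel regression with a
    Gaussian kernel and Gaussian observation noise (conjugate sketch).

    X -- (n, d) training inputs, y -- (n,) training responses,
    Xnew -- (m, d) test inputs. Returns (mean, var), each of length m."""
    def K(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    Kxx = K(X, X) + noise * np.eye(len(X))    # noisy kernel Gram matrix
    alpha = np.linalg.solve(Kxx, y)           # kernel expansion coefficients
    Ks = K(Xnew, X)
    mean = Ks @ alpha                         # posterior predictive mean
    # posterior predictive variance: k(x,x) minus the explained part
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(Kxx, Ks.T))
    return mean, var
```

The posterior variance is what "direct and formal inference on the uncertain regression function" buys over a point estimate: it shrinks near observed data and grows away from it.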