BFRM Software: Bayesian Factor Regression Modelling

BFRM 2.0

The program can be run directly (Windows, unix or Mac) or from within Matlab or R, taking text file inputs and producing summary text file outputs. The code is self-contained, requires no libraries and has been optimized for speed. A script file interface is available so BFRM can be run from a command window as well as from within Matlab or R.

Running: The command line calls are:

on Windows, bfrm.exe parameters.txt > printout.txt
on unix, ./bfrm64 parameters.txt > printout.txt
(or the 32-bit version)

where parameters.txt is the required input file of parameters for the particular analysis (described below and fully detailed on the Inputs page), and printout.txt records a running commentary of the analysis and its iterations. If the output is not redirected to a printout.txt file, it is written to the screen.

Running in Matlab: !bfrm.exe parameters.txt
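The command line calls above can also be scripted from other environments. The following is a minimal Python sketch, not part of BFRM itself: the helper names bfrm_command and run_bfrm are hypothetical, and it assumes that the BFRM binary and parameters.txt sit in the current working directory and that every non-Windows platform uses the ./bfrm64 binary.

```python
import subprocess
import sys


def bfrm_command(params="parameters.txt", platform=sys.platform):
    """Return the argument list for the command line calls shown above."""
    # Assumption: bfrm.exe on Windows, ./bfrm64 everywhere else
    # (swap in the 32-bit binary name if that is the one installed).
    exe = "bfrm.exe" if platform.startswith("win") else "./bfrm64"
    return [exe, params]


def run_bfrm(params="parameters.txt", printout="printout.txt"):
    """Run BFRM and send its running commentary to the printout file."""
    with open(printout, "w") as log:
        # stdout=log plays the role of "> printout.txt" in the shell call.
        result = subprocess.run(bfrm_command(params), stdout=log)
    return result.returncode
```

Calling run_bfrm() from the analysis directory would then leave the commentary in printout.txt and the summary output files alongside the inputs.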

Input files:

parameters.txt - input parameters/setup: each row is a parameter (name, value) pair for a predefined set of parameters, described in detail on the Inputs page. Many can be left as defaults in initial exploratory analyses, but entries related to the specific data set are of course required.
dataset.txt - rows are variables, columns are samples. If response variables are included in the model they should occupy the initial rows, in the order:
- binary responses (if any),
- categorical responses (if any),
- survival responses (if any),
- continuous responses (if any),
followed by rows of the X variables. So, if fitting a factor model to X data alone, then the first row will be the first X variable, and so on.
ymask.txt - Integer indicators of missing and/or censored observations in responses Y, with rows as response variables and columns as samples. Values: 0 for observed, 1 for missing (non-observed) and 2 for censored.
varin.txt - Integer index file of X variables to be included in the initial model.
H.txt - Design matrix: rows are observations and columns are covariates. The first column must be a column of ones (intercept). The columns in order thereafter are fixed covariate vectors for regression or Anova covariates.
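The row ordering of dataset.txt and the coding of ymask.txt can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions rather than a BFRM utility: the function names stack_dataset and make_ymask are hypothetical, and NaN is used only as an in-memory marker for missing responses (how missing values should appear in the written dataset.txt is not covered here).

```python
import numpy as np


def stack_dataset(X, binary=None, categorical=None, survival=None, continuous=None):
    """Stack response rows above the X rows in the order described above.

    Every argument is a (variables x samples) array, or None if absent.
    """
    blocks = [b for b in (binary, categorical, survival, continuous) if b is not None]
    blocks.append(X)
    return np.vstack([np.atleast_2d(np.asarray(b, dtype=float)) for b in blocks])


def make_ymask(Y, censored=None):
    """Code responses as 0 = observed, 1 = missing, 2 = censored."""
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    # Assumption: NaN marks a missing response in the in-memory array.
    mask = np.where(np.isnan(Y), 1, 0)
    if censored is not None:
        mask[np.atleast_2d(censored).astype(bool)] = 2
    return mask
```

Writing the resulting arrays to text files (for example with numpy.savetxt) depends on the delimiter BFRM expects, which should be checked against the Inputs page.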

Output files: Currently, all outputs are simply sets of posterior means (based on the MCMC analysis that delivers Monte Carlo approximations) of most model parameters, latent factors, indicators and so forth. Future revisions of BFRM will add options to output subsamples of MCMC streams, and other posterior summaries.

mAlpha.txt - Posterior mean for the total mass parameter alpha of the Dirichlet process model for the latent factors.
mA.txt - Posterior mean of the coefficient matrix of the regression and factor model. Rows represent variables (in the order in the dataset.txt file) and columns represent parameters. The order of columns is as follows:
- first column: posterior means of intercepts, then
- posterior means of regression/anova parameters (as laid out in H), then
- posterior means of loadings on the latent response factors (if any response variables are in the model), then
- posterior means of loadings on the latent factors (if any).
For example, in a model with 100 X variables, 2 binary Y responses, 2 covariates and 5 latent factors, mA.txt will have 102 rows and 10 columns (1 intercept + 2 covariates + 2 response factors + 5 latent factors).
mF.txt - Posterior means of the values of response factors and latent factors, with rows as factors and columns as samples. The first row is always a row of 1's (intercept).
mPib.txt - Posterior means of the base-rate probability of inclusion in each factor (rho's in the referenced paper). The first element is always 1 (intercept).
mPostPib.txt - Matrix of posterior means for inclusion probabilities. The rows represent variables (Y and X as in dataset.txt) and the columns refer to the intercept, covariates, response factors and latent factors (in that order) - the format and order of columns is precisely the same as that of mA.txt.
mPsi.txt - Posterior means for the residual variances for all variables (Y and X).
mTau.txt - Posterior means of the variances in the prior distributions of factor or effect parameters. The first element is always zero (variance of the intercept), and is followed by values corresponding to the covariate effects, response factors and latent factors, again in the order as in mA.txt.
mVariablesIn.txt - Integer indices of the X variables included in the final model. If evolutionary factor analysis has been performed, these variables are listed in the order of their inclusion in the model analysis (following the initial list varin.txt of those included by choice initially).
mExternalProb.txt - Approximate posterior predicted probabilities of non-zero X variable-factor loadings for all X variables NOT included in the factor model analysis. This is most useful when running an evolutionary factor analysis that terminates with some subset of the full set of X variables, since it provides a posterior assessment of the association between the remaining variables and the estimated factors in the model. The first column is an integer index list of variables "outside" the model, and the remaining columns represent the response and latent factors (same order as in mA.txt).
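The column order shared by mA.txt and mPostPib.txt can be made explicit with a small helper. The Python sketch below is illustrative only (the function name ma_column_layout is hypothetical); it reproduces the arithmetic of the mA.txt example above, where the covariate count excludes the intercept column of H.

```python
def ma_column_layout(n_covariates, n_response_factors, n_latent_factors):
    """Map each block of columns to a 0-based, half-open (start, stop) range."""
    sizes = [
        ("intercept", 1),
        ("covariates", n_covariates),
        ("response_factors", n_response_factors),
        ("latent_factors", n_latent_factors),
    ]
    layout, start = {}, 0
    for name, size in sizes:
        layout[name] = (start, start + size)
        start += size
    return layout


# The mA.txt example above: 2 covariates, 2 response factors, 5 latent factors.
layout = ma_column_layout(2, 2, 5)
n_columns = layout["latent_factors"][1]  # total number of columns: 10
```

With mA.txt loaded as a two-dimensional array A, the latent factor loadings would then be A[:, start:stop] for (start, stop) = layout["latent_factors"].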

Example 1: This is an example concerning a factor analysis of a subset of the breast cancer data analyzed in High-Dimensional Sparse Factor Modelling: Applications in Gene Expression Genomics, by Carvalho et al. The analysis has n=251 observations and p=226 total genes (GeneNames.txt), and the example sets up and runs the evolutionary factor model for all 226 genes, starting with 5 factors. The final output is still a model with 5 factors. The point of this example is to illustrate how BFRM can be used for a fixed set of variables and a pre-specified number of factors. The input files are:

parameters.txt
dataset.txt -- the expression data matrix (p=226 rows, n=251 columns).
H.txt -- the design matrix containing a column of 1's plus 3 control variables.
ymask.txt -- an empty file since no data is missing or held-out of the analysis.
varin.txt -- list of variables in the model (in this case a list from 1 to 226)

Example 2: This example illustrates the use of the factor regression component of BFRM together with the evolutionary analysis. We use the same 226 genes from Example 1 as well as two binary outcome variables: p53 mutation status and ER status. The analysis is set up to start from 25 genes (in varin.txt below) and 1 latent factor. The final output includes 200 genes modelled with 2 response factors and 5 latent factors. This is a modified version of the p53 example described in High-Dimensional Sparse Factor Modelling: Applications in Gene Expression Genomics, by Carvalho et al. The input files are:

parameters.txt
dataset.txt -- the data file (p+2=228 rows, n=251 columns) with the p53 and ER binary responses in the first two rows, and the expression data in the remainder.
H.txt -- there is no H.txt file for this example (it is optional if the model has only an intercept).
ymask.txt -- file indicating the missing observations in the outcome variables.
varin.txt -- the initial set of 25 genes.

Example 3: This is an example concerning sparse multivariate analysis of variance and regression for data arising from a series of oncogene intervention experiments. The primary study was reported in Oncogenic pathway signatures in human cancers as a guide to targeted therapies (Bild et al 2006) and some additional aspects of analysis appear in Sparse Statistical Modelling in Gene Expression Genomics (Lucas et al 2006). The example has n=97 DNA microarray samples with p=8509 genes. The samples comprise two sets of controls created at different times, and nine sets of replicate samples corresponding to the nine treatment groups -- each treatment represents up-regulation of one of nine selected oncogenes. The design layout is:

Sample index	Oncogene
1-10	GFP1 - control
11-15	GFP2 - control
16-26	MYC
26-32	SRC
33-41	Beta Catenin
42-50	E2F3
51-60	RAS
61-69	P63
70-78	AKT alpha
79-88	E2F1
89-97	P110

The design matrix H has a first column of 1's, so that the control group GFP1 defines baseline expression, followed by 10 columns of 0/1 entries: the 1's in column 2 indicate the samples in the second control group GFP2, those in column 3 indicate the samples in the "MYC upregulated" group, and so forth. This Anova layout defines the first 11 columns of H. The remaining 8 columns are assay artifact covariates used as candidate predictors, gene by gene. (The assay artifact method is a very useful general "gene-sample" specific normalisation method for microarrays, here customised to the Affymetrix arrays of this experiment, introduced and developed by this group as reported in the papers above.)
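A design matrix with this structure (intercept column, one 0/1 indicator column per non-baseline group, then any further covariate columns) could be assembled as in the following Python sketch. The function name anova_design and the toy group layout are hypothetical and are not the actual Example 3 files.

```python
import numpy as np


def anova_design(n_samples, groups, extra_covariates=None):
    """Build H: intercept, one 0/1 column per non-baseline group, extra covariates.

    groups: list of (first, last) 1-based inclusive sample ranges, one per
            non-baseline group, in column order.
    extra_covariates: optional (n_samples x k) array appended on the right.
    """
    H = np.zeros((n_samples, 1 + len(groups)))
    H[:, 0] = 1.0  # intercept: baseline expression of the reference group
    for col, (first, last) in enumerate(groups, start=1):
        H[first - 1:last, col] = 1.0  # convert 1-based inclusive range to a slice
    if extra_covariates is not None:
        H = np.hstack([H, np.asarray(extra_covariates, dtype=float)])
    return H


# Toy layout (hypothetical): samples 1-3 baseline, 4-5 group A, 6-8 group B.
H = anova_design(8, [(4, 5), (6, 8)])
```

For Example 3, the ten non-baseline ranges would be taken from the design layout table, and the eight assay artifact covariates would be passed as a 97 x 8 array, giving 11 + 8 = 19 columns.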

The input files are:

parameters.txt
dataset.txt -- the expression data matrix (p=8509 rows, n=97 columns).
H.txt -- the design matrix as described above.
ymask.txt -- an empty file since no data is missing or held-out of the analysis.
varin.txt -- the list 1:8509 to include all genes in the analysis.
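The two simplest input files in these examples, the full index list in varin.txt and the empty ymask.txt, could be generated as in the Python sketch below. The helper names are hypothetical, and writing one index per line is an assumption about the file layout that should be checked against the Inputs page.

```python
from pathlib import Path


def write_varin(path, n_variables):
    """Write the indices 1..n_variables, one per line (assumed layout)."""
    lines = "\n".join(str(i) for i in range(1, n_variables + 1))
    Path(path).write_text(lines + "\n")


def write_empty_ymask(path):
    """Create an empty ymask file, as used when no data is missing or held out."""
    Path(path).write_text("")
```

For Example 3 this would be write_varin("varin.txt", 8509) followed by write_empty_ymask("ymask.txt").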