The program can be run
directly (Windows, unix or Mac) or from within Matlab
or R, taking text file inputs and producing summary
text file outputs. The code is self-contained, requires
no libraries and has been optimized for speed. A script
file interface is available so BFRM can be run from a
command window as well as from within Matlab or
R.
Running: Command line calls are:
 on Windows, bfrm.exe parameters.txt > printout.txt
 on unix, ./bfrm64 parameters.txt > printout.txt
 (or the 32-bit version)
where parameters.txt is the required input
file of parameters for the particular analysis
(described below and fully detailed on the Inputs page), and
printout.txt records a running commentary of
the analysis and its iterations. If the output is not
redirected to a printout.txt file, the commentary is
printed to the screen.
Running in Matlab: !bfrm.exe parameters.txt
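For scripted runs, the command line call can also be wrapped from another language. The following is a minimal Python sketch and not part of BFRM: the helper names build_command and run_bfrm are hypothetical, the executable is assumed to sit in the working directory, and, since the text does not name the Mac executable, any non-Windows platform is treated like unix here. It passes parameters.txt as the only argument and sends the running commentary to printout.txt, mirroring the redirection in the calls above.

```python
import subprocess
import sys


def build_command(parameters="parameters.txt", platform=sys.platform):
    """Return the argument list for a BFRM run on the given platform."""
    # Executable names are the ones quoted in the text; only the 64-bit
    # unix binary is named there, so that is the one used for non-Windows.
    exe = "bfrm.exe" if platform.startswith("win") else "./bfrm64"
    return [exe, parameters]


def run_bfrm(parameters="parameters.txt", printout="printout.txt"):
    """Run BFRM and redirect its commentary to `printout` (like '>')."""
    with open(printout, "w") as log:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(build_command(parameters), stdout=log, check=True)


if __name__ == "__main__":
    # Only show the commands; run_bfrm() needs the BFRM executable present.
    print(build_command(platform="win32"))
    print(build_command(platform="linux"))
```

Calling run_bfrm() from the directory that holds the executable and the input files should then behave like the redirected command line call.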
Input files:
 parameters.txt - input parameters/setup:
each row is a parameter (name, value) pair for a
predefined set of parameters, described in detail on
the Inputs page. Many
can be left as defaults in initial exploratory
analyses, but entries related to the specific data
set are of course required.

 dataset.txt - rows are variables, columns
are samples. If response variables are included in
the model they should be in the initial rows in the
order
 binary responses (if any),
 categorical responses (if any),
 survival responses (if any),
 continuous responses (if any),
followed by rows of the X variables. So, if
fitting a factor model to X data alone, then the
first row will be the first X variable, and so on.
 ymask.txt - Integer indicators of missing
and/or censored observations in responses Y, with
rows as response variables and columns as samples.
Values: 0 for observed, 1 for missing (non-observed)
and 2 for censored.
 varin.txt - Integer index file of X
variables to be included in the initial model.
 H.txt - Design matrix: rows are
observations and columns are covariates. The first
column must be a column of ones (intercept). The
columns in order thereafter are fixed covariate
vectors for regression or Anova covariates.
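The row ordering of dataset.txt and the 0/1/2 coding of ymask.txt can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BFRM code: the function names are hypothetical, files are assumed to be whitespace-delimited with one matrix row per line, and missing responses are represented by None on the Python side only (the text does not say what placeholder value dataset.txt should hold for a missing entry).

```python
def stack_dataset(x_rows, binary=(), categorical=(), survival=(), continuous=()):
    """Stack rows in the documented order: responses first, then X."""
    rows = [list(r) for group in (binary, categorical, survival, continuous)
            for r in group]
    rows += [list(r) for r in x_rows]
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("every row needs one value per sample")
    return rows


def build_ymask(responses, censored=None):
    """Code each response value: 0 observed, 1 missing (None), 2 censored."""
    censored = censored or set()  # set of (row, column) positions
    mask = []
    for i, row in enumerate(responses):
        mask.append([1 if v is None else 2 if (i, j) in censored else 0
                     for j, v in enumerate(row)])
    return mask


def write_matrix(path, rows):
    """Write one matrix row per line (whitespace-delimited: an assumption)."""
    with open(path, "w") as fh:
        for row in rows:
            fh.write(" ".join(str(v) for v in row) + "\n")


if __name__ == "__main__":
    response = [1, 0, None]                   # toy binary response, one missing
    x = [[0.5, 1.2, -0.3], [2.0, 0.1, 0.7]]   # two toy X variables, three samples
    # The 0 in the third position is an arbitrary placeholder for the
    # missing response; ymask marks it as missing.
    print(stack_dataset(x, binary=[[1, 0, 0]]))
    print(build_ymask([response]))
```

Both matrices keep variables in rows and samples in columns, so ymask has one row per response and the same number of columns as the data matrix.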
Output files: Currently, all outputs are simply
sets of posterior means (based on the MCMC analysis
that delivers Monte Carlo approximations) of most model
parameters, latent factors, indicators and so forth.
Future revisions of BFRM will add options to output
subsamples of MCMC streams, and other posterior
summaries.
 mAlpha.txt - Posterior mean for the total
mass parameter alpha of the Dirichlet process model
for the latent factors.
 mA.txt - Posterior mean of the coefficient
matrix of the regression and factor model. Rows
represent variables (in the order in the
dataset.txt file) and columns represent parameters.
The order of columns is as follows:
 first column: posterior means of intercepts,
then
 posterior means of regression/anova
parameters (as laid out in H), then
 posterior means of loadings on the latent
response factors (if any response variables are
in the model), then
 posterior means of loadings on the latent
factors (if any).
For example, in a model with 100 X variables,
2 binary Y responses, 2 covariates and 5 latent
factors, mA.txt will have 102 rows and 10 columns
(1 intercept + 2 covariates + 2 response factors +
5 latent factors).
 mF.txt - Posterior means of the values of
response factors and latent factors, with rows as
factors and columns as samples. The first row is
always a row of 1's (intercept).
 mPib.txt - Posterior means of the
base-rate probability of inclusion in each factor
(rho's in the referenced paper). The first element is
always 1 (intercept).
 mPostPib.txt - Matrix of posterior means
for inclusion probabilities. The rows represent
variables (Y and X as in dataset.txt) and the columns
refer to the intercept, covariates, response factors
and latent factors (in that order); the format and
order of columns is precisely the same as that of
mA.txt.
 mPsi.txt - Posterior means for the
residual variances for all variables (Y and X).
 mTau.txt - Posterior means of the
variances in the prior distributions of factor or
effect parameters. The first element is always zero
(variance of the intercept), and is followed by
values corresponding to the covariate effects,
response factors and latent factors, again in the
order as in mA.txt.
 mVariablesIn.txt - Integer indices of the
X variables included in the final model. If
evolutionary factor analysis has been performed,
these variables are listed in the order of their
inclusion in the model analysis (following the
initial list varin.txt of those included by choice
initially).
 mExternalProb.txt - Approximate posterior
predicted probabilities of nonzero X variable-factor
loadings for all X variables NOT included in the
factor model analysis. This is most useful when
running an evolutionary factor analysis that
terminates with some subset of the full set of X
variables, since it provides a posterior assessment
of the association between the remaining variables
and the estimated factors in the model. The first
column is an integer index list of variables
"outside" the model, and the remaining columns represent the
response and latent factors (same order as in
mA.txt).
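The column ordering shared by mA.txt and mPostPib.txt (and the element order of mTau.txt) can be checked with a small Python sketch. It is only an illustration: ma_layout and the column labels are hypothetical names, and it assumes one response factor per response variable, as in the worked example for mA.txt, with the covariate count excluding the intercept column of H.

```python
def ma_layout(n_x, n_responses, n_covariates, n_latent):
    """Return (rows, cols, column_labels) for mA.txt as described above."""
    labels = ["intercept"]
    labels += [f"covariate_{k + 1}" for k in range(n_covariates)]
    # Assumption: one response factor per response variable.
    labels += [f"response_factor_{k + 1}" for k in range(n_responses)]
    labels += [f"latent_factor_{k + 1}" for k in range(n_latent)]
    rows = n_responses + n_x  # Y rows come first, then X rows
    return rows, len(labels), labels


if __name__ == "__main__":
    rows, cols, labels = ma_layout(n_x=100, n_responses=2,
                                   n_covariates=2, n_latent=5)
    print(rows, cols)  # 102 10, matching the worked example in the text
    print(labels)
```

The label list gives a direct way to name the columns after reading the output files into Matlab, R or any other environment.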
Example 1: This is an example concerning a factor
analysis of a subset of the breast cancer data analyzed
in High-Dimensional
Sparse Factor Modelling: Applications in Gene
Expression Genomics, by Carvalho et al. The
analysis has: n=251 observations, p=226 total genes
(GeneNames.txt),
and the example sets up and runs the evolutionary
factor model for all 226 genes, starting with 5
factors. The final output is still a model with 5
factors. The point of this example is to illustrate how
BFRM can be used for a fixed set of variables and a
pre-specified number of factors. The input files are:
 parameters.txt
 dataset.txt - the expression data matrix (p=226 rows, n=251
columns).
 H.txt - the design
matrix containing a column of 1's plus 3 control
variables.
 ymask.txt - an
empty file since no data is missing or held out of
the analysis.
 varin.txt - the list of variables in the model (in this case a list
from 1 to 226).
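The two simplest input files of this example, varin.txt and the empty ymask.txt, could be generated as in the following Python sketch. The helper names are hypothetical, and writing one index per line is an assumption, since the exact layout of varin.txt is specified on the Inputs page rather than here.

```python
import os
import tempfile


def write_varin(path, indices):
    """Write integer variable indices, one per line (layout is an assumption)."""
    with open(path, "w") as fh:
        for i in indices:
            fh.write(f"{int(i)}\n")


def write_empty_ymask(path):
    """Create an empty ymask.txt, as used when nothing is missing or held out."""
    open(path, "w").close()


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()  # scratch directory for the demonstration
    write_varin(os.path.join(workdir, "varin.txt"), range(1, 227))  # 1..226
    write_empty_ymask(os.path.join(workdir, "ymask.txt"))
    print(sorted(os.listdir(workdir)))
```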
Example 2: This example illustrates the use of
the factor regression component of BFRM together with
the evolutionary analysis. We use the same 226 genes
from Example 1 as well as two binary outcome variables:
p53 mutation status and ER status. The analysis is set
up to start from 25 genes (in varin.txt below) and 1
latent factor. The final output includes 200 genes
modelled with 2 response factors and 5 latent factors.
This is a modified version of the p53 example described
in High-Dimensional
Sparse Factor Modelling: Applications in Gene
Expression Genomics, by Carvalho et al. The input
files are:
 parameters.txt
 dataset.txt - the data file (p+2=228 rows, n=251 columns) with the
p53 and ER binary responses in the first two rows,
and the expression data in the remainder.
 H.txt - there is no H.txt file for this example
(it is optional if the model has only an
intercept).
 ymask.txt - file indicating the missing observations in the
outcome variables.
 varin.txt - the
initial set of 25 genes.
Example 3: This is an example concerning sparse
multivariate analysis of variance and regression for
data arising from a series of oncogene intervention
experiments. The primary study was reported in
Oncogenic pathway signatures in human cancers as a
guide to targeted therapies (Bild et al 2006) and
some additional aspects of analysis appear in Sparse
Statistical Modelling in Gene Expression Genomics
(Lucas et al 2006). The example has n=97 DNA microarray
samples with p=8509 genes. The samples comprise two
sets of controls created at different times, and nine
sets of replicate samples corresponding to the nine
treatment groups; each treatment represents
upregulation of one of nine selected oncogenes. The
design layout is:
Sample index    Oncogene
1-10            GFP1 - control
11-15           GFP2 - control
16-26           MYC
26-32           SRC
33-41           Beta Catenin
42-50           E2F3
51-60           RAS
61-69           P63
70-78           AKT alpha
79-88           E2F1
89-97           P110


The design matrix H has a first column of 1's for
the control group GFP1 defining baseline expression,
and then 10 columns with 0/1 entries, the 1s in column
2 indicating the samples corresponding to samples in
the second control group GFP2, those in column 3
indicating samples in the "MYC upregulated" group, and
so forth. This Anova layout defines the first 11
columns of H. The remaining 8 columns are assay
artifact covariates used as candidate predictors, gene
by gene. (The assay artifact method is a very useful
general "gene-sample" specific normalisation method for
microarrays, here customised to the Affymetrix arrays
of this experiment, introduced and developed by this
group as reported in the papers above.)
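A design matrix with this structure (intercept, one 0/1 column per non-baseline group, then extra covariates) can be assembled as in the Python sketch below. It is a minimal illustration on toy labels, not the code used for the study: anova_design is a hypothetical helper, the group label of each sample must be supplied by the user from the design layout, and the extra covariate values shown are made up.

```python
def anova_design(groups, baseline, extra_covariates=None):
    """Build H rows: intercept, one 0/1 column per non-baseline group, extras."""
    # Non-baseline groups in order of first appearance define the columns.
    levels = []
    for g in groups:
        if g != baseline and g not in levels:
            levels.append(g)
    rows = []
    for i, g in enumerate(groups):
        row = [1] + [1 if g == lev else 0 for lev in levels]
        if extra_covariates is not None:
            row += list(extra_covariates[i])  # e.g. assay artifact covariates
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Toy layout: four samples, baseline control plus two other groups.
    groups = ["GFP1", "GFP1", "GFP2", "MYC"]
    artifacts = [[0.1], [0.2], [0.3], [0.4]]  # made-up covariate values
    for row in anova_design(groups, baseline="GFP1", extra_covariates=artifacts):
        print(row)
```

Rows are observations and columns are covariates, as required for H.txt. With the eleven groups and eight assay artifact covariates of this example, each row would have 1 + 10 + 8 = 19 entries.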
The input files are:
 parameters.txt
 dataset.txt - the expression data matrix (p=8509 rows, n=97
columns).
 H.txt - the design
matrix as described above.
 ymask.txt - an
empty file since no data is missing or held out of
the analysis.
 varin.txt - the
list 1:8509 to include all genes in the
analysis.
