The program can be run
directly (Windows, Unix or Mac) or from within Matlab
or R, taking text file inputs and producing summary
text file outputs. The code is self-contained, requires
no libraries and has been optimized for speed. A script
file interface is available, so BFRM can be run from a
command window as well as from within Matlab or
R.
Running: Command line calls are:
- on Windows, bfrm.exe parameters.txt > printout.txt
- on Unix, ./bfrm64 parameters.txt > printout.txt
(or the 32bit version)
where parameters.txt is the required input
file of parameters for the particular analysis
(described below and fully detailed on the Inputs page), and
printout.txt records a running commentary of
the analysis and its iterations. If the output is not
redirected to a printout.txt file, the commentary is
printed to the screen.
Running in Matlab: !bfrm.exe parameters.txt
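The same call can be driven from a script in another language. The following is a minimal Python sketch, not part of BFRM: the helper names bfrm_command and run_bfrm are illustrative, the executable names are the ones quoted above, and the sketch assumes that any non-Windows platform uses ./bfrm64 from the working directory.

```python
import subprocess
import sys


def bfrm_command(params="parameters.txt", platform=sys.platform):
    """Build the BFRM command line for the given platform string."""
    # Assumption: bfrm.exe on Windows, ./bfrm64 everywhere else.
    exe = "bfrm.exe" if platform.startswith("win") else "./bfrm64"
    return [exe, params]


def run_bfrm(params="parameters.txt", printout="printout.txt"):
    """Run BFRM and redirect its running commentary to `printout`."""
    with open(printout, "w") as log:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(bfrm_command(params), stdout=log, check=True)
```

Calling run_bfrm() plays the same role as the redirected command line calls listed above.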
Input files:
- parameters.txt - input parameters/setup:
each row is a parameter (name, value) pair for a
predefined set of parameters, described in detail on
the Inputs page. Many
can be left as defaults in initial exploratory
analyses, but entries related to the specific data
set are of course required.
- dataset.txt - rows are variables, columns
are samples. If response variables are included in
the model they should be in the initial rows in the
order
- binary responses (if any),
- categorical responses (if any),
- survival responses (if any),
- continuous responses (if any),
followed by rows of the X variables. So, if
fitting a factor model to X data alone, then the
first row will be the first X variable, and so on.
- ymask.txt - 0/1/2 indicators of missing
and/or censored observations in responses Y, with
rows as response variables and columns as samples.
Values: 0 for observed, 1 for missing (non-observed)
and 2 for censored.
- varin.txt - Integer index file of X
variables to be included in the initial model.
- H.txt - Design matrix: rows are
observations and columns are covariates. The first
column must be a column of ones (intercept). The
columns in order thereafter are fixed covariate
vectors for regression or Anova covariates.
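The row ordering of dataset.txt and the coding of ymask.txt can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a BFRM utility: the function names are hypothetical, and the tab delimiter used by write_matrix is a guess, so check the Inputs page for the exact file format.

```python
def stack_dataset(x_rows, binary=(), categorical=(), survival=(), continuous=()):
    """Return dataset rows in the required order: responses first, then X."""
    ordered = (*binary, *categorical, *survival, *continuous, *x_rows)
    return [list(row) for row in ordered]


def build_ymask(n_responses, n_samples, missing=(), censored=()):
    """Build the response mask: 0 observed, 1 missing, 2 censored.

    `missing` and `censored` hold zero-based (response_row, sample) pairs.
    """
    mask = [[0] * n_samples for _ in range(n_responses)]
    for i, j in missing:
        mask[i][j] = 1
    for i, j in censored:
        mask[i][j] = 2
    return mask


def write_matrix(path, rows, sep="\t"):
    """Write one matrix row per line (the delimiter is an assumption)."""
    with open(path, "w") as fh:
        for row in rows:
            fh.write(sep.join(str(v) for v in row) + "\n")


# Toy data: one binary response and two X variables on three samples.
binary = [[1, 0, 1]]
x_rows = [[0.2, 1.1, -0.4], [2.0, 0.3, 0.9]]
dataset = stack_dataset(x_rows, binary=binary)   # response row comes first
ymask = build_ymask(1, 3, missing=[(0, 2)])      # third sample is missing
```

Both matrices keep variables in rows and samples in columns, so write_matrix("dataset.txt", dataset) and write_matrix("ymask.txt", ymask) would write them in the orientation described above.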
Output files: Currently, all outputs are simply
sets of posterior means (based on the MCMC analysis
that delivers Monte Carlo approximations) of most model
parameters, latent factors, indicators and so forth.
Future revisions of BFRM will add options to output
subsamples of MCMC streams, and other posterior
summaries.
- mAlpha.txt - Posterior mean for the total
mass parameter alpha of the Dirichlet process model
for the latent factors.
- mA.txt - Posterior mean of the coefficient
matrix of the regression and factor model. Rows
represent variables (in the order in the
dataset.txt file) and columns represent parameters.
The order of columns is as follows:
- first column: posterior means of intercepts,
then
- posterior means of regression/anova
parameters (as laid out in H), then
- posterior means of loadings on the latent
response factors (if any response variables are
in the model), then
- posterior means of loadings on the latent
factors (if any).
For example, in a model with 100 X variables,
2 binary Y responses, 2 covariates and 5 latent
factors, mA.txt will have 102 rows and 10 columns
(1 intercept + 2 covariates + 2 response factors +
5 latent factors).
- mF.txt - Posterior means of the values of
response factors and latent factors, with rows as
factors and columns as samples. The first row is
always a row of 1's (intercept).
- mPib.txt - Posterior means of the
base-rate probability of inclusion in each factor
(rho's in the referenced paper). The first element is
always 1 (intercept).
- mPostPib.txt - Matrix of posterior means
for inclusion probabilities. The rows represent
variables (Y and X as in dataset.txt) and the columns
refer to the intercept, covariates, response factors
and latent factors (in that order) - the format and
order of columns is precisely the same as that of
mA.txt.
- mPsi.txt - Posterior means for the
residual variances for all variables (Y and X).
- mTau.txt - Posterior means of the
variances in the prior distributions of factor or
effect parameters. The first element is always zero
(variance of the intercept), and is followed by
values corresponding to the covariate effects,
response factors and latent factors, again in the
order as in mA.txt.
- mVariablesIn.txt - Integer indices of the
X variables included in the final model. If
evolutionary factor analysis has been performed,
these variables are listed in the order of their
inclusion in the model analysis (following the
initial list varin.txt of those included by choice
initially).
- mExternalProb.txt - Approximate posterior
predicted probabilities of non-zero X variable-factor
loadings for all X variables NOT included in the
factor model analysis. This is most useful when
running an evolutionary factor analysis that
terminates with some subset of the full set of X
variables, since it provides a posterior assessment
of the association between the remaining variables
and the estimated factors in the model. The first
column is an integer index list of variables
"outside" the model, and the columns represent the
response and latent factors (same order as in
mA.txt).
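Since mA.txt, mPostPib.txt and mTau.txt share one column order (intercept, covariates, response factors, latent factors), it can help to compute the column positions of each block once. The Python sketch below does this for the worked mA.txt example above; the function name coefficient_blocks is hypothetical, and the slices are zero-based.

```python
def coefficient_blocks(n_covariates, n_response_factors, n_latent_factors):
    """Map each block of columns of mA.txt to a zero-based Python slice."""
    blocks = {}
    start = 0
    for name, width in [
        ("intercept", 1),
        ("covariates", n_covariates),
        ("response_factors", n_response_factors),
        ("latent_factors", n_latent_factors),
    ]:
        blocks[name] = slice(start, start + width)
        start += width
    return blocks


# The worked example: 2 covariates, 2 response factors, 5 latent factors.
blocks = coefficient_blocks(2, 2, 5)
n_columns = blocks["latent_factors"].stop          # total number of columns
row = list(range(n_columns))                       # stand-in for one row of mA.txt
latent_loadings = row[blocks["latent_factors"]]    # the last five entries
```

Here n_columns is 10, matching the column count in the example, and the covariate block occupies the second and third columns.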
Example 1: This is an example concerning a factor
analysis of a subset of the breast cancer data analyzed
in High-Dimensional
Sparse Factor Modelling: Applications in Gene
Expression Genomics, by Carvalho et al. The
analysis has n=251 observations and p=226 total genes
(GeneNames.txt),
and the example sets up and runs the evolutionary
factor model for all 226 genes, starting with 5
factors. The final output is still a model with 5
factors. The point of this example is to illustrate how
BFRM can be used for a fixed set of variables and a
pre-specified number of factors. The input files are:
- parameters.txt
- dataset.txt --
the expression data matrix (p=226 rows, n=251
columns).
- H.txt -- the design
matrix containing a column of 1's plus 3 control
variables.
- ymask.txt -- an
empty file since no data is missing or held-out of
the analysis.
- varin.txt --
list of variables in the model (in this case a list
from 1 to 226)
Example 2: This example illustrates the use of
the factor regression component of BFRM together with
the evolutionary analysis. We use the same 226 genes
from Example 1 as well as two binary outcome variables:
p53 mutation status and ER status. The analysis is set
up to start from 25 genes (in varin.txt below) and 1
latent factor. The final output includes 200 genes
modelled with 2 response factors and 5 latent factors.
This is a modified version of the p53 example described
in High-Dimensional
Sparse Factor Modelling: Applications in Gene
Expression Genomics, by Carvalho et al. The input
files are:
- parameters.txt
- dataset.txt --
the data file (p+2=228 rows, n=251 columns) with the
p53 and ER binary responses in the first two rows,
and the expression data in the remainder.
- H.txt -- there is no H.txt file for this example
(it is optional if the model has only an
intercept).
- ymask.txt --
file indicating the missing observations in the
outcome variables.
- varin.txt -- the
initial set of 25 genes.
Example 3: This is an example concerning sparse
multivariate analysis of variance and regression for
data arising from a series of oncogene intervention
experiments. The primary study was reported in
Oncogenic pathway signatures in human cancers as a
guide to targeted therapies (Bild et al 2006) and
some additional aspects of analysis appear in Sparse
Statistical Modelling in Gene Expression Genomics
(Lucas et al 2006). The example has n=97 DNA microarray
samples with p=8509 genes. The samples comprise two
sets of controls created at different times, and nine
sets of replicate samples corresponding to the nine
treatment groups -- each treatment represents
up-regulation of one of nine selected oncogenes. The
design layout is:
Sample index | Oncogene
1-10 | GFP1 - control
11-15 | GFP2 - control
16-26 | MYC
26-32 | SRC
33-41 | Beta Catenin
42-50 | E2F3
51-60 | RAS
61-69 | P63
70-78 | AKT alpha
79-88 | E2F1
89-97 | P110
The design matrix H has a first column of 1's, with
the control group GFP1 defining baseline expression,
and then 10 columns with 0/1 entries: the 1s in column
2 indicate the samples in the second control group
GFP2, those in column 3 indicate the samples in the
"MYC upregulated" group, and so forth. This Anova
layout defines the first 11 columns of H. The remaining
8 columns are assay artifact covariates used as
candidate predictors, gene by gene. (The assay artifact
method is a very useful general "gene-sample" specific
normalisation method for microarrays, here customised
to the Affymetrix arrays of this experiment, introduced
and developed by this group as reported in the papers
above.)
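The Anova part of such a design matrix (a column of 1's followed by one 0/1 indicator column per non-baseline group) can be sketched in Python as follows. The function name anova_design and the six-sample toy layout are hypothetical and chosen only to keep the example small; the 8 assay artifact columns are not constructed here.

```python
def anova_design(n_samples, groups):
    """Build the intercept-plus-indicator part of a design matrix.

    `groups` maps a group label to an inclusive, one-based (first, last)
    sample range. The baseline group gets no indicator column of its own;
    it is represented by the leading column of ones.
    """
    rows = []
    for sample in range(1, n_samples + 1):
        indicators = [int(first <= sample <= last)
                      for first, last in groups.values()]
        rows.append([1] + indicators)
    return rows


# Toy layout (hypothetical): samples 1-2 are baseline controls,
# samples 3-4 a second control group, samples 5-6 a treatment group.
H = anova_design(6, {"control2": (3, 4), "treated": (5, 6)})
```

Rows of H are observations and columns are covariates, as required for H.txt. The experiment described above would pass the ten non-baseline groups of the design layout in place of the two toy groups.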
The input files are:
- parameters.txt
- dataset.txt --
the expression data matrix (p=8509 rows, n=97
columns).
- H.txt -- the design
matrix as described above.
- ymask.txt -- an
empty file since no data is missing or held-out of
the analysis.
- varin.txt -- the
list 1:8509 to include all genes in the
analysis.