BFRM Software: Bayesian Factor Regression Modelling

BFRM 2.0

1. Preliminaries

Before running BFRM 2.0, the user must create a parameter file in the predefined format. This is a text file containing all the information about the model, data, prior specification, etc. (described in detail later). With this parameter file in place, the user can run BFRM 2.0 by typing at the command line:

Bfrm parameter.txt

2. How to generate a default parameter file?

A default parameter file, default.parameters.txt, is provided to be used as a template. The examples on this site give specific parameter files for three selected data analyses.

3. What is the parameter file format?

The parameter file is a text file that can be created with any text editor, with each line representing a parameter name/value pair. Only predefined names are accepted, and the program will report an error and quit if any unknown parameter name is found. Empty lines are accepted and ignored. Any line starting with “#” is treated as a comment and therefore ignored.

On each line the name/value pair takes the format

Parameter name = value

The value can either be a number (integer/double) or a character string.

Parameter names are not case sensitive and white spaces are allowed within the name if that is more convenient to the user. As an example, the following parameter names are valid and represent the same information:

ResponseMaskFile
response mask file
responsemaskfile

The parameters are not ordered and any order convenient to the user will be accepted. If a parameter name appears more than once, the last appearance will be used.

There are default values for all parameters used by BFRM 2.0. If a parameter name is not specified in the parameter file, the default value will automatically be used.
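
The format rules above can be illustrated with a short Python sketch. It is not BFRM's own parser; the function names and the small DEFAULTS table are hypothetical, and the rule that white space is removed from names before comparison is an assumption based on the three equivalent spellings shown earlier.

```python
def normalize_name(name):
    """Lower-case a parameter name and remove all white space from it."""
    return "".join(name.split()).lower()


def convert_value(text):
    """Return an int or float when the text parses as one, else the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_parameters(text, defaults):
    """Apply 'name = value' lines on top of a dictionary of default values."""
    known = {normalize_name(name): name for name in defaults}
    params = dict(defaults)
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comment lines are ignored
        name, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"line {number}: expected 'name = value'")
        key = normalize_name(name)
        if key not in known:
            raise ValueError(f"line {number}: unknown parameter {name.strip()!r}")
        params[known[key]] = convert_value(value.strip())  # last one wins
    return params


# A small subset of parameters. The numeric defaults are those stated in
# this document; the empty string for ResponseMaskFile is only a placeholder.
DEFAULTS = {"BurnIn": 2000, "NMCSamples": 5000, "PrintIteration": 100,
            "ResponseMaskFile": ""}

EXAMPLE = """\
# MCMC settings
burn in = 1000
BurnIn = 3000

response mask file = mask.txt
"""

params = parse_parameters(EXAMPLE, DEFAULTS)
print(params)
```

In this example BurnIn appears twice under different spellings, so the later value is kept, while NMCSamples and PrintIteration fall back to their defaults.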

4. What are the parameters?

The parameters defined in the parameter file are used to indicate the data settings, the model settings, the prior information and print controls. The default parameter file is organized into “sections” as follows:

Nobservations: Integer. The total number of samples (observations) in the dataset.

NVariables: Integer. The total number of variables in the dataset, including X variables (genes) and Y variables (response variables).

NbinaryResponses: Integer. The total number of binary response variables in the model.

NcategoricalResponses: Integer. The total number of categorical response variables in the model.

NsurvivalResponses: Integer. The total number of survival response variables in the model.

NcontinuousResponses: Integer. The total number of continuous response variables in the model.

NDesignVariables: Integer. The number of design covariates, including the intercept but not including any assay-artifact control variables (see next item). If the user wants to fit a model without an intercept, this value should be set to 0. The default value is 1, i.e. just the intercept.

NcontrolVariables: Integer. The total number of “assay-artifact” control covariates to be used in analysis of Affymetrix expression data, based on the housekeeping genes on the array. The default value is 0.

NlatentFactors: Integer. This parameter has two possible interpretations:

if choosing to fit a static factor model to a specified set of variables, this is the number of latent factors in the model;
if choosing the evolutionary variable selection and factor model search method, this is the starting number of latent factors in the model.

The default value for this parameter is 0.

DataFile: String. The name of the file that contains the “data” Y and X (in this order). This must be a flat text file with:

each line representing a variable and each column representing an observation,
tabs separating observations within a line,
all fields numeric; string values of any kind are not allowed,
missing values in the dataset indicated by a specific numeric value (such as 0 or -999); a second input file, discussed later, is used to indicate which observations are missing.
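
As an illustration of this layout, the following Python sketch writes a small data file (Y rows first, then X rows, one variable per line, tab-separated) and checks a file against the expected dimensions. The helper names, the file name example_data.txt and the toy values are hypothetical; the check that the number of lines equals NVariables assumes that every line is either a response or an X variable.

```python
import os
import tempfile


def write_data_file(path, y_rows, x_rows):
    """Write Y rows followed by X rows, one variable per line, tab-separated."""
    with open(path, "w") as handle:
        for row in list(y_rows) + list(x_rows):
            handle.write("\t".join(str(float(v)) for v in row) + "\n")


def check_data_file(path, n_variables, n_observations):
    """Raise ValueError if the file shape or contents break the stated rules."""
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle if line.strip()]
    if len(lines) != n_variables:
        raise ValueError(f"expected {n_variables} lines, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        fields = line.split("\t")
        if len(fields) != n_observations:
            raise ValueError(f"line {number}: expected {n_observations} fields")
        for field in fields:
            float(field)  # raises ValueError on any non-numeric field


# Toy data: one binary response and two X variables on four observations.
# -999 is used here as the numeric code for a missing value.
y_rows = [[1, 0, 1, 0]]
x_rows = [[7.9, 8.1, -999, 8.4],
          [6.2, 6.0, 6.5, 6.1]]

path = os.path.join(tempfile.mkdtemp(), "example_data.txt")
write_data_file(path, y_rows, x_rows)

# Under the assumption above, NVariables = 3 and NObservations = 4 here.
check_data_file(path, n_variables=len(y_rows) + len(x_rows), n_observations=4)
```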

HFile: String. The name of the file that contains the (intercept, design, covariate) data for the analysis. This file is a flat text file with each line representing an observation and each column representing a variable. The columns in the H file must be in the order intercept, design variables, control variables. If NDesignVariables is set to 1, meaning there is no design/control variable other than the intercept, HFile can be omitted.

ResponseMaskFile: String. The name of the text file (mask) indicating the missing and/or censored observations in the responses Y. It is only necessary if at least one response variable has missing or censored (in the survival case) observations. This file is a flat text file with each row representing a response variable and each column representing the status of each observation. Each observation should take value 0 for observed, 1 for missing (non-observed) and 2 for censored.

XmaskFile: String. The name of the text file (mask) indicating the missing observations in the X variables. It is only necessary if there are missing observations in X. This file is a flat text file with each row representing a variable and each column representing the status of each observation. Each observation should take 0 for observed and 1 for missing values.
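
The status codes used by the two mask files can be sketched in Python as follows. The function names are hypothetical, the missing-value code -999 is only the example value mentioned for the data file, and tab separation of the mask entries is an assumption, since the delimiter of the mask files is not stated here.

```python
def x_mask_row(values, missing_code):
    """0 for observed, 1 where the value equals the chosen missing code."""
    return [1 if v == missing_code else 0 for v in values]


def response_mask_row(values, missing_code, censored=None):
    """0 observed, 1 missing, 2 censored (censoring applies to survival)."""
    flags = censored if censored is not None else [False] * len(values)
    row = []
    for value, is_censored in zip(values, flags):
        if value == missing_code:
            row.append(1)
        elif is_censored:
            row.append(2)
        else:
            row.append(0)
    return row


def format_mask(rows):
    """One line per variable; tab separation is assumed, not documented."""
    return "\n".join("\t".join(str(s) for s in row) for row in rows) + "\n"


MISSING = -999

# One survival response: second observation missing, third one censored.
survival_times = [2.3, MISSING, 1.1, 3.0]
censored_flags = [False, False, True, False]
response_mask = [response_mask_row(survival_times, MISSING, censored_flags)]

# One X variable with its second observation missing.
x_mask = [x_mask_row([7.9, MISSING, 8.4, 8.0], MISSING)]

print(format_mask(response_mask), end="")
print(format_mask(x_mask), end="")
```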

5. Prior section

ShapeOfB: Integer. This parameter defines the constraints placed on the factor loadings matrix B. It takes either 0 (no constraint) or 2 (upper triangle of B set to zero) as its value. For identification purposes, 2 is the default value.

NonGaussianFactors: Integer. This parameter indicates whether a Gaussian model (0) or Dirichlet Process (1) model is used to model the latent factors. The default value is 1, which means a DP model will be used.

PriorPsia, PriorPsib: double. Hyper-parameter values for the inverse-Gamma(a,b) prior for elements of Psi, the vector of residual variances for all X variables. The default values are (2,0.005) for Affymetrix data under the standard analysis of RMA (log base 2) expression indices.

PriorSurvivalPpsia,PriorSurvivalPpsiab: double. Hyper-parameter values for the inverse-gamma(a,b) prior for residual variances of an included survival response variable; right censored survival data are modelled as log-normal, linear regressions. The default values are (2,0.5).

PriorRhoMean, PriorRhoN: double. Hyper-parameter values for the Beta(PriorRhoMean*PriorRhoN, (1-PriorRhoMean)*PriorRhoN) prior for the sparsity base rate parameters, the elements of the vector Rho. The default values are (0.001, 200), i.e. PriorRhoMean = 0.001 and PriorRhoN = 200.

PriorPiMean, PriorPiN: double. Hyper-parameter values for the Beta(PriorPiMean*PriorPiN, (1-PriorPiMean)*PriorPiN) prior for the hierarchical components of the prior on non-zero inclusion probabilities. The default values are (0.9, 10.0).
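
The two Beta priors above are specified through a mean and a sample-size-like value N rather than through the usual shape parameters. A minimal Python sketch of the conversion, using the default values quoted above (the function name is hypothetical):

```python
import math


def beta_shapes(mean, n):
    """Return the Beta shape parameters (mean * n, (1 - mean) * n)."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return mean * n, (1.0 - mean) * n


# Defaults for the sparsity base rates Rho: approximately Beta(0.2, 199.8).
rho_a, rho_b = beta_shapes(0.001, 200)

# Defaults for the Pi components: approximately Beta(9, 1).
pi_a, pi_b = beta_shapes(0.9, 10.0)

# The mean of Beta(a, b) is a / (a + b), which recovers the specified mean.
assert math.isclose(rho_a / (rho_a + rho_b), 0.001)
assert math.isclose(pi_a / (pi_a + pi_b), 0.9)
```

In this parameterization N acts as a prior sample size: the larger it is, the more tightly the prior concentrates around the stated mean.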

PriorTauDesigna,PriorTauDesignb: double. Hyper-parameter values for the inverse-Gamma(a,b) prior for the variances Tau of the design/control factor effects. The default values are (5,1).

PriorTauResponseBinarya,PriorTauResponseBinaryb: double. Hyper-parameter values for the inverse-Gamma(a,b) prior for the variances Tau of the binary response factors. The default values are (5,1). This is only necessary if binary responses are present in the model.

PriorTauResponseCategoricala,PriorTauResponseCategoricalb: double. Hyper-parameter values for the inverse-Gamma(a,b) prior for the variances Tau of the categorical response factors. The default values are (5,1). This is only necessary if categorical responses are present in the model.

PriorTauResponseSurvivala,PriorTauResponseSurvivalb: double. Hyper-parameter values for the inverse-Gamma(a,b) prior for the variances Tau of the survival response factors. The default values are (5,1). This is only necessary if survival responses are present in the model.

PriorTauResponseContinuousa,PriorTauResponseContinuousb: double. Hyper-parameter values for the inverse-Gamma(a,b) prior for the variances Tau of the continuous response factors. The default values are (5,1). This is only necessary if continuous responses are present in the model.

PriorTauLatenta,PriorTauLatentb: double. Hyper-parameter values for the inverse-Gamma(a,b) prior of the variances Tau for the latent factors. The default values are (5,1).
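
To get a feel for what the inverse-Gamma(a,b) defaults above imply, the following Python sketch computes the prior mean and mode. It assumes the common shape/scale parameterization, in which the density is proportional to x^(-a-1) exp(-b/x); this document does not state which parameterization BFRM uses, so the numbers should be read with that caveat. The function names are hypothetical.

```python
def inverse_gamma_mean(a, b):
    """Mean b / (a - 1) of an inverse-Gamma with shape a and scale b."""
    if a <= 1:
        raise ValueError("the mean is finite only for a > 1")
    return b / (a - 1)


def inverse_gamma_mode(a, b):
    """Mode b / (a + 1) of an inverse-Gamma with shape a and scale b."""
    return b / (a + 1)


# Default (2, 0.005) for the residual variances Psi of the X variables.
psi_mean = inverse_gamma_mean(2, 0.005)

# Default (5, 1) shared by the Tau priors listed above.
tau_mean = inverse_gamma_mean(5, 1)
tau_mode = inverse_gamma_mode(5, 1)

print(psi_mean, tau_mean, tau_mode)
```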

PriorInterceptMean, PriorInterceptVar: Prior mean and variance for the intercept (baseline level) of X variables. The default values are (8,100) based on the prototype of Affymetrix gene expression X variables.

PriorContinuousMean, PriorContinuousVar: Prior mean and variance for the intercept (baseline) of any continuous response variables. The default values are (0,1) consistent with standardised response data.

PriorSurvivalMean, PriorSurvivalVar: Prior mean and variance for the intercept (baseline) of any survival response variables. The default values are (2,10) consistent with standardised response data.

6. Evolutionary variable and factor model search section

Evol: Integer. This parameter takes either 0 or 1. Setting Evol to 1 activates the evolutionary mode in BFRM. The default value is 0.

EvolVarIn: Integer. This parameter is only necessary if Evol is set to 1. It indicates the number of variables (elements of X) used to initialize the evolutionary analysis.

EvolVarInFile: String. The name of the file containing the indices of the X variables that are included in this initializing set (the first X variable is indexed by 1, and so on). If this file is missing, then the indices default to 1, i.e. only the first X variable is assumed to be in the initial model.

EvolIncludeVariableThreshold: Double. This parameter sets the threshold for bringing a new variable into the model. In considering whether to add in new variables (genes) at a given evolutionary analysis step, variables are ranked according to their approximate posterior probability of inclusion at that stage. One of the two elements of the decision to include some of the most highly ranked variables is then a threshold on this posterior inclusion probability – variables with probabilities below that threshold will not be included. The default value is 0.75.

EvolMaxiumVariablesPerIteration: Integer. This parameter sets the maximum number of variables that can be added to the model at each iteration. The default value is 5. If the most highly ranked A variables currently exceed EvolIncludeVariableThreshold, then the most highly ranked min{ A, EvolMaxiumVariablesPerIteration } are added. This may be zero, which is one way the evolutionary analysis may terminate.
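
The inclusion rule described by the two parameters above can be sketched in Python. This is only an illustration of the stated rule, not BFRM's implementation: the function name and the input dictionary (variable index mapped to approximate posterior inclusion probability) are hypothetical, and the second element of the inclusion decision mentioned above is not modelled.

```python
def select_new_variables(inclusion_probs, threshold=0.75, max_per_iteration=5):
    """Return the indices of the variables to add at one evolutionary step.

    Candidates are ranked by posterior inclusion probability, those that do
    not exceed the threshold are dropped, and at most max_per_iteration of
    the remaining A candidates are kept, i.e. min{A, max_per_iteration}.
    """
    ranked = sorted(inclusion_probs.items(), key=lambda kv: kv[1], reverse=True)
    above = [index for index, prob in ranked if prob > threshold]
    return above[:max_per_iteration]


# Hypothetical probabilities for four candidate variables (1-based indices).
candidates = {1: 0.90, 2: 0.80, 3: 0.70, 4: 0.99}

added = select_new_variables(candidates)
capped = select_new_variables(candidates, max_per_iteration=2)
none_added = select_new_variables({5: 0.10, 6: 0.40})
```

An empty result, as in the last call, corresponds to the case where no variable is added and the evolutionary analysis may terminate.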

EvolIncludeFactorThreshold: Double. This parameter sets the probability threshold for adding a new latent factor to the model. A new latent factor will be added if and only if at least EvolMinumVariablesInFactor variables (genes) have a posterior probability of association with that factor that exceeds this threshold. The default value is 0.75.

EvolMinumVariablesInFactor: Integer. This parameter sets the minimum number of variables (genes) showing significant association with a factor in order that the factor be included in the model. The default value is 5.
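
The factor inclusion rule combines the two parameters above: a probability threshold and a minimum count. A minimal Python sketch of that test, with a hypothetical function name and made-up association probabilities:

```python
def factor_qualifies(association_probs, threshold=0.75, min_variables=5):
    """True if enough variables are associated with the candidate factor.

    association_probs holds, for each variable, its posterior probability of
    association with the factor. The factor qualifies when at least
    min_variables of these probabilities exceed the threshold.
    """
    n_significant = sum(1 for prob in association_probs if prob > threshold)
    return n_significant >= min_variables


# Five of these seven hypothetical probabilities exceed 0.75.
strong = [0.99, 0.95, 0.90, 0.85, 0.80, 0.40, 0.10]

# Only two of these exceed 0.75.
weak = [0.99, 0.80, 0.60, 0.50, 0.30, 0.20, 0.10]

strong_ok = factor_qualifies(strong)
weak_ok = factor_qualifies(weak)
```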

EvolMaximumVariablesPerFactor: Integer. This parameter sets the maximum number of variables that can be weighted on any one factor in the evolutionary inclusion steps. This allows the user to limit the number of variables brought into the model for each factor and hence to explore more effectively other factor dimensions. The default value is 15.

EvolMaximumFactors: Integer. This parameter sets the maximum number of latent factors that the final model can have. The default value is 5.

EvolMaximumVariables: Integer. This parameter sets the maximum number of variables the final model can have. The default value is 100.

7. MCMC section

BurnIn: Integer. The number of burn-in iterations in the MCMC. Default 2000.

NMCSamples: Integer. The number of MCMC iterations. Default 5000.

8. Monitoring section

PrintIteration: Integer. A number defining how often an MCMC iteration is printed to the screen. Default 100.

9. Dirichlet Process parameters

PriorAlphaa, PriorAlphab: doubles. Prior parameters for the Gamma prior for Alpha. Default (1,1).