SSS 2.0


Downloads, Installation and Running SSS: Parallel version


This is the parallel version of SSS. It must be run on a Unix cluster; it reads a specified text input file and writes summary text files as output. The download consists of source code that must be compiled on your cluster, example data and example input files, and R and Matlab files that show how to summarize the output.


Download the SSS zip archive
This includes all the files, as follows:

-----------------
Code and Scripts:
  rmsss.cpp        - main SSS program
  calc.cpp         - helper functions
  marglik.cpp/h    - functions for marginal likelihood calculations/approximations
  Model.cpp/h      - a class defining the regression model
  node.cpp         - functions for creating/storing model neighborhoods
  newrun.cpp/h     - class for uniform random number generation
  run_rmsss_sge.q  - a script file describing how we run the program
                     using the SGE queuing system. In general, details of
                     submitting/running a job will depend on your local setup.
  run_rmsss_pbs.q  - a script file we use to run the program using the PBS queuing system
-----------------
Inputs (Examples directory):
  xdata.txt           - predictor data for examples
  ybinarydata.txt     - response data for binary regression example
  ylineardata.txt     - response data for linear regression example
  ysurvtimedata.txt   - response data for survival regression example
  relapsedata.txt     - observed (1) versus censored (0) data for survival example
  wdata.txt           - indicator data for observations to be used in analysis
  binary.setup.txt    - setup/input file for binary regression example
  linear.setup.txt    - setup/input file for linear regression example
  survival.setup.txt  - setup/input file for survival regression example
-----------------
Matlab:
  examples.m            - commands to load and run the three examples
  binarysummary.m       - commands to summarise, plot aspects of binary example
  linearsummary.m       - commands to summarise, plot aspects of linear example
  survivalsummary.m     - commands to summarise, plot aspects of survival example
  show.m, showtv.m      - Matlab utilities for graphs
  scattertv.m, km.m     - Matlab utilities for graphs
  pairstv.m             - Matlab utility for graphs
  std_rows.m, ranktrf.m - Matlab utilities
-----------------
R:
  examples.r        - commands to load and run the three examples
  binarysummary.r   - commands to summarise, plot aspects of binary example
  linearsummary.r   - commands to summarise, plot aspects of linear example
  show.r, showtv.r  - utilities for graphs
  scattertv.r       - utilities for graphs
-----------------



Compiling and Running SSS

Compiling

  1. Be sure the file permissions are set correctly on all the files.
  2. The program requires the MPI (Message Passing Interface, for parallel computing) library.
  3. Commands for compiling the code (that work on our machines) can be found at the top of the rmsss.cpp file. You may need to change directory names, etc., depending on your setup.

Running

How you run the program depends somewhat on the particular cluster environment in which you are working. If you want to run the program directly from the command line, you might try something like:

mpirun -machinefile nodes -np 21 ./rmsss.exe ./Examples/binary.setup.txt > output.txt

The file "nodes" would contain the names of the machines you wish to use. The value "21" means that you want to use 21 processors: 1 head node that manages communication and controls the algorithm, and 20 additional compute nodes. The file "binary.setup.txt" is an input file described below.
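The processor count in the command above is simply the number of compute nodes plus one head node. As a minimal sketch of that arithmetic, the Python helper below assembles the same command line; the function name `build_mpirun_command` is hypothetical, and the requirement of at least one compute node is our assumption rather than a documented rule.

```python
import shlex


def build_mpirun_command(n_compute, setup_file,
                         machinefile="nodes", executable="./rmsss.exe"):
    """Return the mpirun argument list for 1 head node plus n_compute compute nodes."""
    if n_compute < 1:
        # Assumption: the head node only coordinates, so at least one
        # compute node is needed for any work to be done.
        raise ValueError("need at least one compute node besides the head node")
    n_procs = n_compute + 1  # one extra process for the head node
    return ["mpirun", "-machinefile", machinefile, "-np", str(n_procs),
            executable, setup_file]


cmd = build_mpirun_command(20, "./Examples/binary.setup.txt")
# Print the command with output redirected, as in the example above.
print(shlex.join(cmd) + " > output.txt")
```

With 20 compute nodes this reproduces the `-np 21` value used in the example command.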

If your cluster requires the use of a queuing system, you will need to submit your job through that system. The files "run_rmsss_xxx.q", where xxx is "pbs" or "sge" (two queuing systems we have used), show how we submit jobs on our machines from the command line:

qsub run_rmsss_xxx.q

Make sure that the correct input/setup file (described below) is specified in the run_rmsss_xxx.q file.



Setup file of input parameters

The parameter setup/input file (e.g. binary.setup.txt above) is a flat text file in which each line is a (name, value) pair for one of a predefined set of parameters. The order of parameters in the file is not important. You can comment out a line by adding # at the beginning. Each line is in the format

ParameterName = Value

where ParameterName is one of the names described below and Value is a string or a number, depending on the parameter. When a path is used as a parameter value, any spaces in the path are ignored, so SSS will NOT work with a path that has spaces in it.

See also the description in the README file contained in the download directory.
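To check a setup file before submitting a job, it can help to read it with a few lines of script. The Python sketch below follows the format described above (one `ParameterName = Value` pair per line, `#` at the start of a line for comments, any order). It is not the parser used by SSS itself; the function names and the numeric values in the sample text are hypothetical.

```python
def parse_setup(text):
    """Parse 'ParameterName = Value' lines into a dict; '#' lines are skipped."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out parameter
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a 'Name = Value' line: {raw!r}")
        params[name.strip()] = value.strip()
    return params


def paths_with_spaces(params, keys=("DATAFILE", "RESPONSEFILE",
                                    "WEIGHTSFILE", "CENSORFILE")):
    """Return the file parameters whose value contains a space."""
    return [k for k in keys if " " in params.get(k, "")]


# Hypothetical setup fragment; the numeric values are illustrative only.
example = """\
# MODTYPE = 1
NOBSERVATIONS = 100
NVARIABLES = 50
DATAFILE = ./Examples/xdata.txt
RESPONSEFILE = ./Examples/ybinarydata.txt
MODTYPE = 2
"""

params = parse_setup(example)
print(params["MODTYPE"], paths_with_spaces(params))
```

The `paths_with_spaces` check flags the one case the text warns about: a path containing spaces, which SSS cannot use.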

-----------------
Inputs:
  NOBSERVATIONS  n = total sample size
  NVARIABLES     p = total number of predictor variables
  DATAFILE       tab-delimited predictor data (n rows, p columns)
  RESPONSEFILE   response variable (n values, tab-delimited row or single column)
  WEIGHTSFILE    weight vector (n values of 0/1; 1 indicates samples to be used in model fit)
  CENSORFILE     0/1 indicators of right-censoring (0) versus observed (1) in case of survival data
-----------------
Output:
  OUTFILE        list of selected best models
  ITEROUT        details of models visited at each SSS iteration
  NULLFILE       the score for the null model, i.e. the model with no predictors
  SUMMARYFILE    summary of the models with parameter estimates
  LOOCV          [0/1]: 1 if the scores and parameter estimates for the top models are to be
                 recomputed one at a time, each time holding out a different observation (default 0)
  LOOCVFILE      If LOOCV=1, this is the base filename for the files containing the recomputed scores
                 and parameter estimates. For example, if observation 3 is being held out and the
                 base filename is "loocv", then the file will be called "loocv.y3".
  DEBUGOUT       [0/1]: 1 if iteration information is printed to stdout (default 1)
-----------------
Model/Search:
  MODTYPE        Model Type: 1=linear, 2=binary/logit, 3=Weibull survival (default 2: binary)
  DSTART         Initial Model Size: number of predictors for model to start SSS (default 2)
  PMAX           Maximum Model Size: maximum number of predictors in any model (default 20)
  PRIORMEANP     Prior mean of number of included variables: this is the key sparsity control
                 parameter. PRIORMEANP=v means that each variable is "in the model" with
                 probability v/p. PRIORMEANP must be in the interval (0, NVARIABLES). (default 4.0)
  NBEST          Number of Best Models to be saved/recorded (default 10000)
  ITERS          Total Number of SSS iterations (default 10000)
  ONEVAR         [0/1]: 1 includes all 1-variable models in search (default 1)

See the README file in the download directory for additional optional inputs.
-----------------
Annealing*:
  (replace, innerAnneal1)  annealing parameter for variable replacement (default 0.6)
  (delete, innerAnneal2)   annealing parameter for variable deletion (default 1.0)
  (add, innerAnneal3)      annealing parameter for variable addition (default 0.8)
  (outer, outerAnneal)     annealing parameter for second level model selection (default 0.4)
-----------------
*Annealing parameters can generally be left at the suggested defaults in the example files here. See the paper for additional discussion.
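The PRIORMEANP entry in the table above ties the prior mean model size v to a per-variable inclusion probability v/p, and restricts v to the open interval (0, NVARIABLES). A short Python sketch of that relationship follows; the function name is hypothetical, and the value p = 1000 is chosen only for illustration.

```python
def prior_inclusion_probability(priormeanp, nvariables):
    """Return v/p, the prior probability that any one variable is in the model."""
    if not 0 < priormeanp < nvariables:
        raise ValueError("PRIORMEANP must lie strictly between 0 and NVARIABLES")
    return priormeanp / nvariables


# Default PRIORMEANP = 4.0 with a hypothetical p = 1000 predictors.
prob = prior_inclusion_probability(4.0, 1000)
expected_size = prob * 1000  # prior expected number of included variables
print(prob, expected_size)
```

Multiplying the inclusion probability back by p recovers v, which is why PRIORMEANP acts as the prior mean of the number of included variables.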



Output files

The key output file is the SUMMARYFILE, a flat text file summarising the posterior distributions within and across the models summarised. See the examples in the Matlab (or R) examples file examples.m (or examples.r) and the three summary support .m (or .r) script files for these examples. The SSS search explores linear models with standardized y and x data, so the output summaries for linear models relate to the standardized models with no intercept. In contrast, the binary and survival models include intercepts.

The OUTFILE contains some of the same information as the SUMMARYFILE. Models are ordered by decreasing posterior probability. The first column contains the iteration at which the model was first found; the second column indicates the number of variables in the model, and the remaining columns give the indices of the p variables in the model.
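As a rough illustration of the OUTFILE column layout just described, the Python sketch below splits one row into its three parts. It assumes the columns are separated by whitespace (tabs or spaces) and that any columns beyond the stated model size are padding; the function name and the sample row are hypothetical.

```python
def parse_outfile_row(line):
    """Split one OUTFILE row into (iteration, n_vars, variable indices)."""
    fields = line.split()  # assumption: whitespace-delimited columns
    # int(float(...)) tolerates values written as "3" or "3.0"
    iteration = int(float(fields[0]))  # iteration at which the model was first found
    n_vars = int(float(fields[1]))     # number of variables in the model
    indices = [int(float(f)) for f in fields[2:2 + n_vars]]
    if len(indices) != n_vars:
        raise ValueError("row is shorter than its stated model size")
    return iteration, n_vars, indices


# Hypothetical row: found at iteration 412, three variables.
row = "412\t3\t17\t204\t981"
print(parse_outfile_row(row))
```

Because the file is ordered by decreasing posterior probability, the first row read this way corresponds to the top-scoring model.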

The SUMMARYFILE output has the following information. Each row is one of the "top models", ordered by decreasing posterior probability. The number of columns is defined by the largest model, and entries are NA/NaN for models smaller than the largest. In each model/row the entries are as follows:

Linear regression models:

  • element 1 - dimension of the model = number of predictors p for this model
  • element 2 - log posterior probability of this model (the "score")
  • elements 3:(2+p) - the indices of the p variables in this model
  • elements (3+p):(2+2p) - posterior mode (also the mean) of the regression parameter vector beta (no intercept)
  • elements (3+2p):(2+2p+p*p) - posterior variance matrix of beta in vectorised form (no intercept)
  • final two elements - (s,d), the residual SD estimate (MAP estimate) and the posterior degrees of freedom
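The element positions listed above for linear models are 1-based, so in a 0-based language each start index shifts down by one. The Python sketch below unpacks one row under these assumptions: the row has already been read into a list of floats with NA/NaN entries converted to `float("nan")`, the variance entries themselves are not NaN, and (s,d) are taken as the last two non-NaN values in the row. The function name and the sample numbers are hypothetical.

```python
import math


def parse_linear_row(row):
    """Unpack one SUMMARYFILE row for a linear model into a dict."""
    p = int(row[0])                                  # element 1
    score = row[1]                                   # element 2
    indices = [int(v) for v in row[2:2 + p]]         # elements 3:(2+p)
    beta = row[2 + p:2 + 2 * p]                      # elements (3+p):(2+2p)
    flat = row[2 + 2 * p:2 + 2 * p + p * p]          # elements (3+2p):(2+2p+p*p)
    # The variance matrix is symmetric, so row- vs column-major order is irrelevant.
    variance = [flat[i * p:(i + 1) * p] for i in range(p)]
    # Assumption: (s, d) are the last two non-NaN values after the variance block.
    tail = [v for v in row[2 + 2 * p + p * p:] if not math.isnan(v)]
    s, d = tail[-2], tail[-1]
    return {"p": p, "score": score, "indices": indices, "beta": beta,
            "variance": variance, "s": s, "d": d}


nan = float("nan")
# Hypothetical 2-variable model, padded with NaN as if a larger model set the width.
row = [2, -10.5, 3, 7, 0.4, -0.2, 0.01, 0.001, 0.001, 0.02, 0.9, 50, nan, nan]
model = parse_linear_row(row)
print(model["indices"], model["beta"], model["s"], model["d"])
```

Taking the last two non-NaN values is a deliberate hedge: it gives the same answer whether the NaN padding sits before or after (s,d) in the row.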

Binary (logit) regression models:

  • element 1 - dimension of the model = number of predictors p for this model
  • element 2 - log posterior probability of this model (the "score")
  • elements 3:(2+p) - the indices of the p variables in this model
  • elements (3+p):(3+2p) - posterior mode of the regression parameter vector beta (includes intercept)
  • elements (4+2p):(4+4p+p*p) - estimated posterior variance matrix of beta in vectorised form (includes intercept)

Survival (Weibull) regression models:

  • element 1 - dimension of the model = number of predictors p for this model
  • element 2 - log posterior probability of this model (the "score")
  • elements 3:(2+p) - the indices of the p variables in this model
  • element 3+p - the posterior mode of the Weibull index parameter alpha in this model
  • elements (4+p):(4+2p) - posterior mode of the regression parameter vector beta (includes intercept)
  • elements (5+2p):(8+6p+p*p) - estimated posterior variance matrix of (alpha,beta) (beta includes intercept) in vectorised form
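The end positions in the binary and survival layouts follow from counting entries: beta has p+1 values once the intercept is included, so the binary variance matrix has (p+1)*(p+1) entries, and the survival matrix for (alpha,beta) has (p+2)*(p+2). The Python sketch below rebuilds the 1-based ranges from those counts and compares them with the closed-form positions listed above. The function name `summary_layout` is hypothetical, and MODTYPE codes 2 and 3 are taken from the setup-file description.

```python
def summary_layout(modtype, p):
    """Return 1-based inclusive (start, end) element ranges for one SUMMARYFILE row.

    modtype: 2 = binary/logit, 3 = Weibull survival.
    """
    if modtype not in (2, 3):
        raise ValueError("this sketch covers MODTYPE 2 and 3 only")
    layout = {"dimension": (1, 1), "score": (2, 2), "indices": (3, 2 + p)}
    pos = 3 + p                       # first element after the variable indices
    n_params = p + 1                  # beta including the intercept
    if modtype == 3:
        layout["alpha"] = (pos, pos)  # Weibull index parameter
        pos += 1
        n_params += 1                 # variance matrix also covers alpha
    layout["beta"] = (pos, pos + p)
    pos += p + 1
    layout["variance"] = (pos, pos + n_params * n_params - 1)
    return layout


# Compare with the closed-form positions for a range of model sizes.
for p in range(1, 8):
    binary = summary_layout(2, p)
    assert binary["beta"] == (3 + p, 3 + 2 * p)
    assert binary["variance"] == (4 + 2 * p, 4 + 4 * p + p * p)
    survival = summary_layout(3, p)
    assert survival["alpha"] == (3 + p, 3 + p)
    assert survival["beta"] == (4 + p, 4 + 2 * p)
    assert survival["variance"] == (5 + 2 * p, 8 + 6 * p + p * p)

print(summary_layout(3, 3))
```

To slice a 0-based Python list with one of these ranges, use `row[start - 1:end]`.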

For a more detailed description of all output files, please see the README file in the download.