The program is a parallel computing version of SSS and must be run on a
Unix cluster, taking a specified text input file and producing summary text
file outputs. The download consists of source code that must be compiled
on your cluster, some example data and example input files, and R and Matlab
files providing examples of output summarization.
Download the zip archive, which includes all the files, as follows:
 
Code and Scripts: 
rmsss.cpp 
 main SSS program 

calc.cpp 
 helper functions 

marglik.cpp/h 
 functions for marginal likelihood calculations/approximations 

Model.cpp/h 
 a class defining the regression model 

node.cpp 
 functions for creating/storing model neighborhoods 

newrun.cpp/h 
 class for uniform random number generation 

run_rmsss_sge.q 
 a script file describing how we run the program using the SGE queuing
 system. In general, details of submitting/running a job will depend on
 your local setup. 

run_rmsss_pbs.q 
 a script file we use to run the program using the PBS queuing system 
 
Inputs (Examples directory): 
xdata.txt 
 predictor data for examples 
ybinarydata.txt 
 response data for binary regression example 

ylineardata.txt 
 response data for linear regression example 

ysurvtimedata.txt 
 response data for survival regression example 

relapsedata.txt 
 observed (1) versus censored (0) data for survival example 

wdata.txt 
 indicator data for observations to be used in analysis 

binary.setup.txt 
 setup/input file for binary regression example 

linear.setup.txt 
 setup/input file for linear regression example 

survival.setup.txt 
 setup/input file for survival regression example 
 
Matlab:
 examples.m 
 commands to load and run the three examples 

binarysummary.m 
 commands to summarise, plot aspects of binary example 

linearsummary.m 
 commands to summarise, plot aspects of linear example 

survivalsummary.m 
 commands to summarise, plot aspects of survival example 

show.m, showtv.m 
 matlab utilities for graphs 

scattertv.m, km.m 
 matlab utilities for graphs 

pairstv.m 
 matlab utility for graphs 

std_rows.m, ranktrf.m 
 matlab utilities 
 
R:
 examples.r 
 commands to load and run the three examples 

binarysummary.r 
 commands to summarise, plot aspects of binary example 

linearsummary.r 
 commands to summarise, plot aspects of linear example 

show.r, showtv.r 
 utilities for graphs 

scattertv.r 
 utilities for graphs 
 
Compiling and Running SSS
Compiling
 Be sure all the file privileges are set correctly.
 The program requires the MPI (Message Passing Interface for parallel
computing) library.
 Commands for compiling the code (that work on our machines) can be
found at the top of the rmsss.cpp file. You may need to change
directory names, etc, depending on your setup.
Running:
How you run the program depends somewhat on the particular cluster
environment in which you are working. If you want to run the program
directly from the command line, you might try something like:
mpirun -machinefile nodes -np 21 ./rmsss.exe ./Examples/binary.setup.txt > output.txt
The file "nodes" would contain the names of the machines you wish to
use. The value "21" means that you want to use 21 processors: 1 head node
that manages communication and controls the algorithm and 20 additional
compute nodes. The file "binary.setup.txt" is an input file described below.
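The processor count in the command above is the number of compute nodes plus one head node. As a minimal sketch, the arithmetic and the command line can be assembled as follows. The helper name build_mpirun_command is hypothetical, and the -machinefile and -np flags are an assumption based on common MPI launchers; check the options accepted by mpirun on your own cluster.

```python
import shlex


def build_mpirun_command(machinefile, n_compute, setup_file, exe="./rmsss.exe"):
    """Return the mpirun argument list for 1 head process plus n_compute workers.

    The flag names (-machinefile, -np) are assumed, not taken from the SSS
    sources; adjust them to match your MPI installation.
    """
    if n_compute < 1:
        raise ValueError("need at least one compute process")
    n_procs = n_compute + 1  # 1 head node + n_compute compute nodes
    return ["mpirun", "-machinefile", machinefile, "-np", str(n_procs),
            exe, setup_file]


cmd = build_mpirun_command("nodes", 20, "./Examples/binary.setup.txt")
# Print the command as it would be typed, with stdout redirected to a file.
print(shlex.join(cmd) + " > output.txt")
```

With 20 compute nodes the sketch requests 21 processes, matching the example in the text.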
If your cluster requires the use of a queuing system, you will need
to submit your job under that specific framework. The files "run_rmsss_xxx.q",
where xxx corresponds to "pbs" or "sge" (two particular queuing systems
we have used), provide examples of how we submit jobs on our machines at the
command line:
qsub run_rmsss_xxx.q
Make sure that the correct input/setup file (described below) is
specified in the run_rmsss_xxx.q file.
Setup file of input parameters
The parameter setup/input file (e.g. binary.setup.txt above) is a
flat text file in which each line gives a (name, value) pair for one of a
predefined set of parameters. The order of parameters in the file is
not important. A line can be commented out by adding # at the beginning.
Each line is in the format
ParameterName = Value
where ParameterName is one of the names described below and Value is a
string or numeric value, depending on the parameter. When a path is used
as a parameter value, any spaces in the path are ignored, so SSS will NOT
work with a path that has spaces in it.
See also the description in the README file contained in the
download directory.
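To make the file format concrete, here is a minimal Python sketch that reads "ParameterName = Value" lines and skips # comments. It is not the parser used by SSS. The function names are hypothetical, the values in the sample text are made up for illustration, and treating leading whitespace before # as still being a comment is an assumption.

```python
def parse_setup(text):
    """Parse 'ParameterName = Value' lines; lines starting with '#' are skipped."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out parameter
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a 'Name = Value' line: {raw!r}")
        params[name.strip()] = value.strip()
    return params


def paths_with_spaces(params, keys=("DATAFILE", "RESPONSEFILE",
                                    "WEIGHTSFILE", "CENSORFILE")):
    """Return the file parameters whose value contains a space."""
    return [k for k in keys if " " in params.get(k, "")]


# Hypothetical setup text; the values are for illustration only.
example = """\
NOBSERVATIONS = 100
NVARIABLES = 500
DATAFILE = ./Examples/xdata.txt
# RESPONSEFILE = ./Examples/old response.txt
RESPONSEFILE = ./Examples/ybinarydata.txt
MODTYPE = 2
"""

params = parse_setup(example)
print(params["RESPONSEFILE"])
print(paths_with_spaces(params))
```

Checking for spaces before submitting a job is a cheap way to catch the path restriction described above.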
 
Inputs: 
NOBSERVATIONS 
n = total sample size 

NVARIABLES 
p = total number of predictor variables 

DATAFILE 
tab delimited predictor data (n rows, p columns) 

RESPONSEFILE 
response variable (n values, tab delimited row or single column) 

WEIGHTSFILE 
weight vector (n values of 0/1; 1 indicates samples to be used in
model fit) 

CENSORFILE 
0/1 indicators of right-censoring (0) versus observed (1) in case
of survival data 
 
Output: 
OUTFILE 
list of selected best models 

ITEROUT 
details of models visited at each SSS iteration 

NULLFILE 
contains the score for the null model, i.e. the model with no
predictors 

SUMMARYFILE 
summary of the models with parameter estimates


LOOCV 
[0/1]: 1 if the scores and parameter estimates for the top models
are to be recomputed one at a time, each time holding out a different
observation (default 0) 

LOOCVFILE 
If LOOCV=1, this is the base filename for the files containing the
recomputed scores and parameter estimates. For example, if observation 3
is being held out and the base filename is "loocv", then the file will be
called "loocv.y3". 

DEBUGOUT 
[0/1]: 1 if iteration information is printed to stdout (default 1) 
  
  
  
  
 
Model/Search: 
MODTYPE 
Model Type: 1=linear, 2=binary/logit, 3=Weibull survival
(default 2: binary) 

DSTART 
Initial Model Size: number of predictors for model to start SSS
(default 2) 

PMAX 
Maximum Model Size: maximum number of predictors in any model
(default 20) 

PRIORMEANP 
Prior mean of number of included variables: this is the key sparsity
control parameter. PRIORMEANP=v means that each variable is "in the
model" with probability v/p. PRIORMEANP must be in the interval
(0, NVARIABLES). (default 4.0) 

NBEST 
Number of Best Models to be saved/recorded (default 10000) 

ITERS 
Total Number of SSS iterations (default 10000) 

ONEVAR 
[0/1]: 1 includes all 1-variable models in search (default 1) 




See the README file in the download directory for
additional optional inputs. 
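As a small worked example of the PRIORMEANP rule above, the sketch below computes the prior inclusion probability v/p and checks the allowed interval. The function name is hypothetical, and the value NVARIABLES = 1000 is chosen only for illustration.

```python
def prior_inclusion_probability(prior_mean_p, n_variables):
    """Return v/p, the prior probability that any one variable is in the model."""
    if not 0 < prior_mean_p < n_variables:
        raise ValueError("PRIORMEANP must lie in the open interval (0, NVARIABLES)")
    return prior_mean_p / n_variables


n_variables = 1000   # hypothetical NVARIABLES
prior_mean_p = 4.0   # the documented default for PRIORMEANP

prob = prior_inclusion_probability(prior_mean_p, n_variables)
# Summing the inclusion probability over all variables recovers the prior
# mean model size, up to floating-point rounding.
expected_size = prob * n_variables
print(prob, expected_size)
```

Smaller values of PRIORMEANP lower the per-variable probability and so favour sparser models.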

 
Annealing*: 
(replace, innerAnneal1) 
annealing parameter for variable replacement (default 0.6) 

(delete, innerAnneal2) 
annealing parameter for variable deletion (default 1.0) 

(add, innerAnneal3) 
annealing parameter for variable addition (default 0.8) 

(outer, outerAnneal) 
annealing parameter for second level model selection (default 0.4) 
  
  
 
*Annealing parameters can generally be left at the suggested defaults in
the example files here. See the paper for additional discussion.
Output files
The key output file is the SUMMARYFILE, a flat
text file summarising the posterior distributions within and across the
top models. See the examples in the Matlab (or R) examples file
examples.m (or examples.r) and the three summary support .m
(or .r) script files for these examples. The SSS search explores linear
models with standardized y and x data, and so the output summaries relate
to the standardized models with no intercept. In contrast, the binary and
survival models include intercepts.
The OUTFILE contains some of the same information as the SUMMARYFILE. Models
are ordered by decreasing posterior probability. The first column contains
the iteration at which the model was first found; the second column indicates
the number of variables in the model, and the remaining columns give the
indices of the p variables in the model.
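A minimal sketch of reading one OUTFILE row under the column layout just described. The function name and the sample row are hypothetical, and whitespace-delimited fields are an assumption; check an actual output file or the README for the delimiter and for whether variable indices start at 0 or 1.

```python
def parse_outfile_row(line):
    """Split one OUTFILE row into (iteration, model size, variable indices)."""
    fields = line.split()  # assumes whitespace (space or tab) separated values
    iteration = int(float(fields[0]))  # iteration at which the model was first found
    size = int(float(fields[1]))       # number of variables in the model
    indices = [int(float(tok)) for tok in fields[2:2 + size]]
    if len(indices) != size:
        raise ValueError("row has fewer variable indices than its stated size")
    return iteration, size, indices


row = "17 3 5 42 108"  # hypothetical row, not taken from a real run
print(parse_outfile_row(row))
```

Because rows are ordered by decreasing posterior probability, the first row parsed corresponds to the best model found.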
The SUMMARYFILE output has the following
information. Each row is one of the "top models", ordered by decreasing
posterior probability. The number of columns is defined by the
largest model, and entries are NA/NaN for models smaller than the largest.
In each model/row the entries are as follows:
Linear regression models:
 element 1: dimension of the model = number of predictors p for
this model
 element 2: log posterior probability of this model (the "score")
 elements 3:(2+p): the indices of the p variables in this model
 elements (3+p):(2+2p): posterior mode (also the mean) of the
regression parameter vector beta (no intercept)
 elements (3+2p):(2+2p+p*p): posterior variance matrix of beta in
vectorised form (no intercept)
 final two elements: (s,d), the residual SD estimate (MAP estimate),
and the posterior degrees of freedom
Binary (logit) regression models:
 element 1: dimension of the model = number of predictors p for
this model
 element 2: log posterior probability of this model (the "score")
 elements 3:(2+p): the indices of the p variables in this model
 elements (3+p):(3+2p): posterior mode of the regression parameter
vector beta (includes intercept)
 elements (4+2p):(4+4p+p*p): estimated posterior variance matrix of
beta in vectorised form (includes intercept)
Survival (Weibull) regression models:
 element 1: dimension of the model = number of predictors p for
this model
 element 2: log posterior probability of this model (the "score")
 elements 3:(2+p): the indices of the p variables in this model
 element 3+p: the posterior mode of the Weibull index parameter
alpha in this model
 elements (4+p):(4+2p): posterior mode of the regression parameter
vector beta (includes intercept)
 elements (5+2p):(8+6p+p*p): estimated posterior variance matrix of
(alpha,beta) (beta includes intercept) in vectorised form
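The element ranges above are 1-based, as in Matlab and R. The following Python sketch translates them into 0-based slices for all three model types; it is an illustration of the layout, not the code used in the supplied summary scripts. The function names and sample row are hypothetical, and whitespace-delimited fields are an assumption. The (s,d) pair of the linear model is left out because its position relative to the NA/NaN padding is not specified here. Since a variance matrix is symmetric, the row-wise reshaping gives the same matrix whichever vectorisation order the file uses.

```python
import math

LINEAR, BINARY, SURVIVAL = 1, 2, 3  # MODTYPE codes from the setup file


def to_floats(line):
    """Convert one whitespace-separated row to floats, mapping NA/NaN to nan."""
    return [math.nan if tok in ("NA", "NaN") else float(tok)
            for tok in line.split()]


def parse_summary_row(fields, modtype):
    """Unpack one SUMMARYFILE row (a list of floats) into named parts."""
    p = int(fields[0])                      # element 1: model dimension
    out = {"p": p, "score": fields[1]}      # element 2: log posterior probability
    out["indices"] = [int(v) for v in fields[2:2 + p]]  # elements 3:(2+p)
    pos = 2 + p                             # 0-based offset after the indices
    if modtype == SURVIVAL:
        out["alpha"] = fields[pos]          # element 3+p: Weibull index parameter
        pos += 1
    k = p if modtype == LINEAR else p + 1   # binary/survival beta has an intercept
    out["beta"] = fields[pos:pos + k]
    pos += k
    m = k + 1 if modtype == SURVIVAL else k  # survival matrix covers (alpha, beta)
    flat = fields[pos:pos + m * m]
    if len(flat) != m * m:
        raise ValueError("row is too short for the stated model dimension")
    out["variance"] = [flat[i * m:(i + 1) * m] for i in range(m)]
    return out


# Hypothetical binary-model row with p = 1, padded with NaN at the end.
row = to_floats("1 -12.5 7 0.3 1.2 0.04 0.01 0.01 0.09 NaN NaN")
model = parse_summary_row(row, BINARY)
print(model["indices"], model["beta"], model["variance"])
```

Only the leading entries of each row are read, so NA/NaN padding for models smaller than the largest is ignored.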
For a more detailed description of all output files, please see the
README file in the download.