The program is a parallel computing version of SSS and must be run on a
Unix cluster. It takes a specified text input file and produces summary text
file outputs. The download consists of source code that must be compiled
on your cluster, some example data and example input files, and R and Matlab
files providing examples of output summarization.
Download the zip archive
This includes all the files, as follows:
-----------------
Code and Scripts:
- rmsss.cpp - main SSS program
- calc.cpp - helper functions
- marglik.cpp/h - functions for marginal likelihood calculations/approximations
- Model.cpp/h - a class defining the regression model
- node.cpp - functions for creating/storing model neighborhoods
- newrun.cpp/h - class for uniform random number generation
- run_rmsss_sge.q - a script file describing how we run the program using the
SGE queuing system. In general, details of submitting/running a job will
depend on your local setup.
- run_rmsss_pbs.q - a script file we use to run the program using the PBS
queuing system
-----------------
Inputs (Examples directory):
- xdata.txt - predictor data for examples
- ybinarydata.txt - response data for binary regression example
- ylineardata.txt - response data for linear regression example
- ysurvtimedata.txt - response data for survival regression example
- relapsedata.txt - observed (1) versus censored (0) data for survival example
- wdata.txt - indicator data for observations to be used in analysis
- binary.setup.txt - setup/input file for binary regression example
- linear.setup.txt - setup/input file for linear regression example
- survival.setup.txt - setup/input file for survival regression example
-----------------
Matlab:
- examples.m - commands to load and run the three examples
- binarysummary.m - commands to summarise and plot aspects of binary example
- linearsummary.m - commands to summarise and plot aspects of linear example
- survivalsummary.m - commands to summarise and plot aspects of survival example
- show.m, showtv.m - Matlab utilities for graphs
- scattertv.m, km.m - Matlab utilities for graphs
- pairstv.m - Matlab utility for graphs
- std_rows.m, ranktrf.m - Matlab utilities
-----------------
R:
- examples.r - commands to load and run the three examples
- binarysummary.r - commands to summarise and plot aspects of binary example
- linearsummary.r - commands to summarise and plot aspects of linear example
- show.r, showtv.r - utilities for graphs
- scattertv.r - utilities for graphs
-----------------
Compiling and Running SSS
Compiling
- Be sure all the file privileges are set correctly.
- The program requires the MPI (message passing interface for parallel
computing) library.
- Commands for compiling the code (that work on our machines) can be
found at the top of the rmsss.cpp file. You may need to change
directory names, etc, depending on your setup.
Running:
How you run the program depends somewhat on the particular cluster
environment in which you are working. If you want to run the program
directly from the command line, you might try something like:
mpirun -machinefile nodes -np 21 ./rmsss.exe ./Examples/binary.setup.txt > output.txt
The file "nodes" would contain the names of the machines you wish to
use. The value "21" means that you want to use 21 processors: 1 head node
that manages communication and controls the algorithm and 20 additional
compute nodes. The file "binary.setup.txt" is an input file described below.
If your cluster requires the use of a queuing system, you will need
to submit your job under that specific framework. The files "run_rmsss_xxx.q",
where xxx is "pbs" or "sge" (two particular queuing systems we have used),
show how we submit jobs on our machines at the command line:
qsub run_rmsss_xxx.q
Make sure that the correct input/setup file (described below) is
specified in the run_rmsss_xxx.q file.
Setup file of input parameters
The parameter setup/input file (e.g. binary.setup.txt above) is a
flat text file in which each line gives a (name, value) pair for one of a
predefined set of parameters. The order of parameters in the file is
not important. You can comment out a line by adding # at the beginning.
Each line is in the format
ParameterName = Value
where ParameterName is one of the names described below and Value is a
string or a numeric value, depending on the parameter. When a path is used
as a parameter value, any spaces in the path are ignored, so SSS will NOT
work with a path that has spaces in it.
See also the description in the README file contained in the
download directory.
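The file format described above can be checked before submitting a job. The following is a minimal Python sketch, not part of the SSS distribution and not necessarily how SSS itself reads the file: the function names parse_setup and check_paths are hypothetical, and the numeric values and file locations in the example text are made up for illustration. It reads "ParameterName = Value" lines, skips lines starting with #, and flags path-valued parameters that contain spaces.

```python
def parse_setup(text):
    """Parse 'ParameterName = Value' lines into a dict (illustrative only)."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out parameter
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a 'ParameterName = Value' line: {raw!r}")
        params[name.strip()] = value.strip()
    return params


def check_paths(params, path_keys=("DATAFILE", "RESPONSEFILE",
                                   "WEIGHTSFILE", "CENSORFILE")):
    """Return the path-valued parameter names whose value contains a space."""
    return [k for k in path_keys if " " in params.get(k, "")]


# Hypothetical setup text: the values below are assumptions, not the
# contents of the distributed binary.setup.txt.
example = """\
# order of lines does not matter
NOBSERVATIONS = 100
NVARIABLES = 500
DATAFILE = ./Examples/xdata.txt
RESPONSEFILE = ./Examples/ybinarydata.txt
#WEIGHTSFILE = ./Examples/wdata.txt
MODTYPE = 2
"""

params = parse_setup(example)
print(params)
print("paths containing spaces:", check_paths(params))
```

Values are kept as strings here; converting them to numbers is left to the caller, since which parameters are numeric depends on the parameter.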
-----------------
Inputs:
- NOBSERVATIONS - n = total sample size
- NVARIABLES - p = total number of predictor variables
- DATAFILE - tab delimited predictor data (n rows, p columns)
- RESPONSEFILE - response variable (n values, tab delimited row or single column)
- WEIGHTSFILE - weight vector (n values of 0/1; 1 indicates samples to be used
in model fit)
- CENSORFILE - 0/1 indicators of right-censoring (0) versus observed (1) in
case of survival data
-----------------
Output:
- OUTFILE - list of selected best models
- ITEROUT - details of models visited at each SSS iteration
- NULLFILE - contains the score for the null model -- the model with no
predictors
- SUMMARYFILE - summary of the models with parameter estimates
- LOOCV - [0/1]: 1 if the scores and parameter estimates for the top models
are to be recomputed one at a time, each time holding out a different
observation (default 0)
- LOOCVFILE - If LOOCV=1, this is the base filename for the files containing
the recomputed scores and parameter estimates. For example, if observation 3
is being held out and the base filename is "loocv", then the file will be
called "loocv.y3".
- DEBUGOUT - [0/1]: 1 if iteration information is printed to stdout (default 1)
-----------------
Model/Search:
- MODTYPE - Model Type: 1=linear, 2=binary/logit, 3=Weibull survival
(default 2: binary)
- DSTART - Initial Model Size: number of predictors for model to start SSS
(default 2)
- PMAX - Maximum Model Size: maximum number of predictors in any model
(default 20)
- PRIORMEANP - Prior mean of number of included variables: this is the key
sparsity control parameter (PRIORMEANP=v means that each variable is "in the
model" with probability v/p). PRIORMEANP must be in the interval
(0, NVARIABLES). (default 4.0)
- NBEST - Number of Best Models to be saved/recorded (default 10000)
- ITERS - Total Number of SSS iterations (default 10000)
- ONEVAR - [0/1]: 1 includes all 1-variable models in search (default 1)
See the README file in the download directory for additional optional inputs.
-----------------
Annealing*:
- (replace, innerAnneal1) - annealing parameter for variable replacement
(default 0.6)
- (delete, innerAnneal2) - annealing parameter for variable deletion
(default 1.0)
- (add, innerAnneal3) - annealing parameter for variable addition
(default 0.8)
- (outer, outerAnneal) - annealing parameter for second level model selection
(default 0.4)
-----------------
*Annealing parameters can generally be left at the suggested defaults in
the example files here. See the paper for additional discussion.
Output files
The key output file is the SUMMARYFILE -- a flat
text file summarising the posterior distributions within and across the
models summarised. See the examples in the Matlab (or R) examples file
examples.m (or examples.r) and the three summary support .m
(or .r) script files for these examples. The SSS search explores linear
models with standardized y and x data and so the output summaries relate
to the standardized models with no intercept. In contrast the binary and
survival models include intercepts.
The OUTFILE contains some of the same information as the SUMMARYFILE. Models
are ordered by decreasing posterior probability. The first column contains
the iteration at which the model was first found; the second column indicates
the number of variables in the model, and the remaining columns give the
indices of the p variables in the model.
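As an illustration of the OUTFILE column layout just described, here is a small Python sketch for reading one row. It assumes the columns are separated by whitespace or tabs (the delimiter is not stated here), the function name parse_outfile_line is hypothetical, and the sample row is invented rather than taken from real output.

```python
def parse_outfile_line(line):
    """Split one OUTFILE row into (iteration first found, model size, indices)."""
    fields = line.split()  # assumption: whitespace- or tab-separated columns
    iteration = int(float(fields[0]))
    size = int(float(fields[1]))
    # Take only the first `size` remaining columns, in case shorter
    # models are padded out to the width of the largest model.
    indices = [int(float(f)) for f in fields[2:2 + size]]
    return iteration, size, indices


# Hypothetical row: found at iteration 57, three variables 12, 104 and 388.
iteration, size, indices = parse_outfile_line("57\t3\t12\t104\t388")
print(iteration, size, indices)
```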
The SUMMARYFILE output has the following
information. Each row is one model of the "top models" ordered in decreasing
order of posterior probability. The number of columns is defined by the
largest model, and entries are NA/NaN for models smaller than the largest.
In each model/row the entries are as follows:
Linear regression models:
- element 1 - dimension of the model = number of predictors p for
this model
- element 2 - log posterior probability of this model (the "score")
- elements 3:(2+p) - the indices of the p variables in this model
- elements (3+p):(2+2p) - posterior mode (also the mean) of the
regression parameter vector beta (no intercept)
- elements (3+2p):(2+2p+p*p) - posterior variance matrix of beta in
vectorised form (no intercept)
- final two elements - (s,d), the residual SD estimate (MAP estimate),
and the posterior degrees of freedom
Binary (logit) regression models:
- element 1 - dimension of the model = number of predictors p for
this model
- element 2 - log posterior probability of this model (the "score")
- elements 3:(2+p) - the indices of the p variables in this model
- elements (3+p):(3+2p) - posterior mode of the regression parameter
vector beta (includes intercept)
- elements (4+2p):(4+4p+p*p) - estimated posterior variance matrix of
beta in vectorised form (includes intercept)
Survival (Weibull) regression models:
- element 1 - dimension of the model = number of predictors p for
this model
- element 2 - log posterior probability of this model (the "score")
- elements 3:(2+p) - the indices of the p variables in this model
- element 3+p - the posterior mode of the Weibull index parameter
alpha in this model
- elements (4+p):(4+2p) - posterior mode of the regression parameter
vector beta (includes intercept)
- elements (5+2p):(8+6p+p*p) - estimated posterior variance matrix of
(alpha,beta) (beta includes intercept) in vectorised form
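The element ranges listed above can be turned into a small lookup, which may help when writing your own summaries. The Python sketch below is not part of the distributed Matlab/R scripts; summary_layout and split_row are hypothetical names. Ranges are 1-based and inclusive, as in the lists above. One assumption is made for linear models: (s,d) is taken to sit directly after the variance block rather than at the very end of a padded row. Check this against your own output before relying on it.

```python
def summary_layout(p, modtype):
    """1-based inclusive (start, end) ranges for one SUMMARYFILE row.

    modtype follows MODTYPE: 1=linear, 2=binary/logit, 3=Weibull survival.
    """
    layout = {"dimension": (1, 1), "score": (2, 2), "indices": (3, 2 + p)}
    if modtype == 1:
        layout["beta"] = (3 + p, 2 + 2 * p)              # no intercept
        layout["var"] = (3 + 2 * p, 2 + 2 * p + p * p)   # p*p entries
        # Assumption: (s, d) directly follow the variance block.
        layout["s_d"] = (3 + 2 * p + p * p, 4 + 2 * p + p * p)
    elif modtype == 2:
        layout["beta"] = (3 + p, 3 + 2 * p)              # includes intercept
        layout["var"] = (4 + 2 * p, 4 + 4 * p + p * p)   # (p+1)*(p+1) entries
    elif modtype == 3:
        layout["alpha"] = (3 + p, 3 + p)
        layout["beta"] = (4 + p, 4 + 2 * p)              # includes intercept
        layout["var"] = (5 + 2 * p, 8 + 6 * p + p * p)   # (p+2)*(p+2) entries
    else:
        raise ValueError("modtype must be 1, 2 or 3")
    return layout


def split_row(fields, modtype):
    """Slice one row (a list of values) into named pieces."""
    p = int(float(fields[0]))
    return {name: fields[a - 1:b]
            for name, (a, b) in summary_layout(p, modtype).items()}


# Hypothetical binary-model row with p = 1 (all values invented).
row = ["1", "-10.5", "7", "0.2", "1.3", "0.04", "0.0", "0.0", "0.09"]
pieces = split_row(row, modtype=2)
print(pieces["indices"], pieces["beta"], pieces["var"])
```

The variance block lengths give a quick consistency check: for binary models the range holds (p+1)*(p+1) values, and for survival models (p+2)*(p+2), matching the dimension of beta with intercept and of (alpha,beta) respectively.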
For a more detailed description of all output files, please see the
README file in the download.