SSS Software: Shotgun Stochastic Search in Regression

SSS 2.0

Downloads, Installation and Running SSS: Parallel version

The program is a parallel computing version of SSS and must be run on a Unix cluster. It takes a specified text input file and produces summary text files as output. The download consists of source code that must be compiled on your cluster, some example data and example input files, and R and Matlab files providing examples of output summarization.

Download the SSS zip archive
This includes all the files, as follows:

-----------------
Code and Scripts:	rmsss.cpp	- main SSS program
	calc.cpp	- helper functions
	marglik.cpp/h	- functions for marginal likelihood calculations/approximations
	Model.cpp/h	- a class defining the regression model
	node.cpp	- functions for creating/storing model neighborhoods
	newrun.cpp/h	- class for uniform random number generation
	run_rmsss_sge.q	- a script file describing how we run the program
		using the SGE Queuing system. In general, details of
		submitting/running a job will depend on your local setup.
	run_rmsss_pbs.q	- a script file we use to run the program using the PBS Queuing System
-----------------
Inputs:	xdata.txt	- predictor data for examples
(Examples directory)	ybinarydata.txt	- response data for binary regression example
	ylineardata.txt	- response data for linear regression example
	ysurvtimedata.txt	- response data for survival regression example
	relapsedata.txt	- observed (1) versus censored (0) data for survival example
	wdata.txt	- indicator data for observations to be used in analysis
	binary.setup.txt	- setup/input file for binary regression example
	linear.setup.txt	- setup/input file for linear regression example
	survival.setup.txt	- setup/input file for survival regression example
-----------------
Matlab:	examples.m	- commands to load and run the three examples
	binarysummary.m	- commands to summarise, plot aspects of binary example
	linearsummary.m	- commands to summarise, plot aspects of linear example
	survivalsummary.m	- commands to summarise, plot aspects of survival example
	show.m, showtv.m	- matlab utilities for graphs
	scattertv.m, km.m	- matlab utilities for graphs
	pairstv.m	- matlab utility for graphs
	std_rows.m, ranktrf.m	- matlab utilities
-----------------
R:	examples.r	- commands to load and run the three examples
	binarysummary.r	- commands to summarise, plot aspects of binary example
	linearsummary.r	- commands to summarise, plot aspects of linear example
	show.r, showtv.r	- utilities for graphs
	scattertv.r	- utilities for graphs
-----------------

Compiling and Running SSS

Compiling

Be sure all the file privileges are set correctly.
The program requires the MPI (message passing interface for parallel computing) library.
Commands for compiling the code (that work on our machines) can be found at the top of the rmsss.cpp file. You may need to change directory names, etc, depending on your setup.

Running: How you run the program depends somewhat on the particular cluster environment in which you are working. If you want to run the program directly from the command line, you might try something like:

mpirun -machinefile nodes -np 21 ./rmsss.exe ./Examples/binary.setup.txt > output.txt

The file "nodes" would contain the names of the machines you wish to use. The value "21" means that you want to use 21 processors: 1 head node that manages communication and controls the algorithm, and 20 additional compute nodes. The file "binary.setup.txt" is an input file described below.
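
As a small illustration of the processor count, the following sketch (Python, not part of the SSS download) rebuilds the argument list of the command above from the number of compute processes. The function name `mpirun_command` and its default arguments are hypothetical; only the `mpirun` arguments themselves are taken from the example command, and redirecting output to a file is left to the shell.

```python
def mpirun_command(n_workers, setup_file, machinefile="nodes", exe="./rmsss.exe"):
    """Build the mpirun argument list: one head process plus n_workers compute processes."""
    if n_workers < 1:
        # Assumption: at least one compute process is needed besides the head process.
        raise ValueError("need at least one compute process")
    n_procs = n_workers + 1  # the extra process is the head node that controls the search
    return ["mpirun", "-machinefile", machinefile, "-np", str(n_procs), exe, setup_file]


cmd = mpirun_command(20, "./Examples/binary.setup.txt")
print(" ".join(cmd))
# prints: mpirun -machinefile nodes -np 21 ./rmsss.exe ./Examples/binary.setup.txt
```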

If your cluster requires a queuing system, you will need to submit your job through it. The files "run_rmsss_xxx.q", where xxx is "pbs" or "sge" (two queuing systems we have used), show how we submit jobs on our machines. At the command line:

qsub run_rmsss_xxx.q

Make sure that the correct input/setup file (described below) is specified in the run_rmsss_xxx.q file.

Setup file of input parameters

The parameter setup/input file (e.g. binary.setup.txt above) is a flat text file in which each line gives a (name, value) pair for one of a predefined set of parameters. The order of parameters in the file is not important. You can comment out a line by adding # at the beginning. Each line is in the format

ParameterName = Value
where ParameterName is one of the names described below and Value is a string or a number, depending on the parameter. When a path is used as a parameter value, any spaces in the path are ignored, so SSS will NOT work with a path that contains spaces.

See also the description in the README file contained in the download directory.
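
To check a setup file before submitting a job, the line format described above can be read with a few lines of Python. This is only a sketch of the documented format, not the parser SSS itself uses; the function name `parse_setup` is hypothetical, and the numeric values and paths in the example text are made up for illustration.

```python
def parse_setup(text):
    """Parse 'ParameterName = Value' lines into a dict, skipping '#' comment lines."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out parameter
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a 'name = value' line: {raw!r}")
        value = value.strip()
        if " " in value:
            # The documentation warns that paths containing spaces will not work.
            raise ValueError(f"value for {name.strip()} contains spaces: {value!r}")
        params[name.strip()] = value
    return params


# Illustrative content only: the sizes and paths below are not taken from the download.
example = """\
# binary regression example
NOBSERVATIONS = 100
NVARIABLES = 500
DATAFILE = ./Examples/xdata.txt
RESPONSEFILE = ./Examples/ybinarydata.txt
MODTYPE = 2
# ITERS = 500
"""

params = parse_setup(example)
print(params["MODTYPE"])  # prints: 2
```

All values are returned as strings; converting NOBSERVATIONS, NVARIABLES and similar entries to numbers is left to the caller.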

-----------------
Inputs:	NOBSERVATIONS	n = total sample size
	NVARIABLES	p = total number of predictor variables
	DATAFILE	tab delimited predictor data (n rows, p columns)
	RESPONSEFILE	response variable (n values, tab delimited row or single column)
	WEIGHTSFILE	weight vector (n values of 0/1; 1 indicates samples to be used in model fit)
	CENSORFILE	0/1 indicators of right-censoring (0) versus observed (1) in case of survival data
-----------------
Output:	OUTFILE	list of selected best models
	ITEROUT	details of models visited at each SSS iteration
	NULLFILE	contains the score for the null model -- the model with no predictors
	SUMMARYFILE	summary of the models with parameter estimates
	LOOCV	[0/1]: 1 if the scores and parameter estimates for the top models are to be recomputed
		one at a time, each time holding out a different observation (default 0)
	LOOCVFILE	If LOOCV=1, this is the base filename for the files containing the recomputed scores
		and parameter estimates. For example, if observation 3 is being held out and the
		base filename is "loocv", then the file will be called "loocv.y3".
	DEBUGOUT	[0/1]: 1 if iteration information is printed to stdout (default 1)




-----------------
Model/Search:	MODTYPE	Model Type: 1=linear, 2=binary/logit, 3=Weibull survival (default 2: binary)
	DSTART	Initial Model Size: number of predictors for model to start SSS (default 2)
	PMAX	Maximum Model Size: maximum number of predictors in any model (default 20)
	PRIORMEANP	Prior mean of number of included variables: this is the key sparsity control parameter. PRIORMEANP=v means that each variable is "in the model" with probability v/p. PRIORMEANP must be in the interval (0, NVARIABLES). (default 4.0)
	NBEST	Number of Best Models to be saved/recorded (default 10000)
	ITERS	Total Number of SSS iterations (default 10000)
	ONEVAR	[0/1]: 1 includes all 1-variable models in search (default 1)

	See the README file in the download directory for additional optional inputs.
-----------------
Annealing*:	(replace, innerAnneal1)	annealing parameter for variable replacement (default 0.6)
	(delete, innerAnneal2)	annealing parameter for variable deletion (default 1.0)
	(add, innerAnneal3)	annealing parameter for variable addition (default 0.8)
	(outer, outerAnneal)	annealing parameter for second level model selection (default 0.4)


-----------------

*Annealing parameters can generally be left at the suggested defaults in the example files here. See the paper for additional discussion.
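
To make the PRIORMEANP setting concrete: with p = NVARIABLES predictors and PRIORMEANP = v, each variable has prior inclusion probability v/p. The Python sketch below computes that probability and, under the added assumption that variables are included independently, the log prior probability of one specific model with k variables. The independence assumption and the function names are illustrative and are not taken from the SSS source; see the paper for the prior actually used.

```python
import math


def inclusion_probability(prior_mean_p, n_variables):
    """Prior probability v/p that any single variable is in the model."""
    if not 0 < prior_mean_p < n_variables:
        raise ValueError("PRIORMEANP must lie in the interval (0, NVARIABLES)")
    return prior_mean_p / n_variables


def log_prior_model(k, prior_mean_p, n_variables):
    """Log prior of one specific k-variable model, ASSUMING independent inclusion."""
    pi = inclusion_probability(prior_mean_p, n_variables)
    return k * math.log(pi) + (n_variables - k) * math.log1p(-pi)


# Default PRIORMEANP = 4.0 with a hypothetical p = 1000 predictors.
print(inclusion_probability(4.0, 1000))  # prints: 0.004
# With v/p below 0.5, each extra variable lowers the log prior of a model.
print(log_prior_model(2, 4.0, 1000) > log_prior_model(3, 4.0, 1000))  # prints: True
```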

Output files

The key output file is the SUMMARYFILE -- a flat text file summarising the posterior distributions within and across the top models. See the examples in the Matlab (or R) examples file examples.m (or examples.r) and the summary support .m (or .r) script files for these examples. The SSS search explores linear models with standardized y and x data, so the output summaries for linear models relate to the standardized models with no intercept. In contrast, the binary and survival models include intercepts.

The OUTFILE contains some of the same information as the SUMMARYFILE. Models are ordered by decreasing posterior probability. The first column contains the iteration at which the model was first found; the second column indicates the number of variables in the model, and the remaining columns give the indices of the p variables in the model.
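
The OUTFILE column layout just described can be read as in the following Python sketch, which also counts how often each variable appears among the listed models. It assumes the columns are separated by whitespace, which should be checked against the README; the function name and the three example rows are made up for illustration.

```python
from collections import Counter


def parse_outfile_row(line):
    """Return (iteration first found, model size, variable indices) for one OUTFILE row."""
    fields = line.split()  # assumption: whitespace-separated columns
    iteration = int(float(fields[0]))
    size = int(float(fields[1]))
    # Only the first `size` entries after column 2 are variable indices.
    indices = [int(float(f)) for f in fields[2:2 + size]]
    return iteration, size, indices


# Made-up rows: iteration, number of variables, then the variable indices.
rows = ["12 2 5 17", "40 3 5 9 17", "3 1 5"]
counts = Counter(i for row in rows for i in parse_outfile_row(row)[2])
print(counts.most_common(2))  # prints: [(5, 3), (17, 2)]
```

These are raw counts over the saved models, not posterior inclusion probabilities; weighting by model probability would require the scores in the SUMMARYFILE.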

The SUMMARYFILE output has the following information. Each row is one of the "top models", ordered by decreasing posterior probability. The number of columns is defined by the largest model, and entries are NA/NaN for models smaller than the largest. In each model/row the entries are as follows:

Linear regression models:

element 1 - dimension of the model = number of predictors p for this model
element 2 - log posterior probability of this model (the "score")
elements 3:(2+p) - the indices of the p variables in this model
elements (3+p):(2+2p) - posterior mode (also the mean) of the regression parameter vector beta (no intercept)
elements (3+2p):(2+2p+p*p) - posterior variance matrix of beta in vectorised form (no intercept)
final two elements - (s,d), the residual SD estimate (MAP estimate) and the posterior degrees of freedom

Binary (logit) regression models:

element 1 - dimension of the model = number of predictors p for this model
element 2 - log posterior probability of this model (the "score")
elements 3:(2+p) - the indices of the p variables in this model
elements (3+p):(3+2p) - posterior mode of the regression parameter vector beta (includes intercept)
elements (4+2p):(4+4p+p*p) - estimated posterior variance matrix of beta in vectorised form (includes intercept)

Survival (Weibull) regression models:

element 1 - dimension of the model = number of predictors p for this model
element 2 - log posterior probability of this model (the "score")
elements 3:(2+p) - the indices of the p variables in this model
element 3+p - the posterior mode of the Weibull index parameter alpha in this model
elements (4+p):(4+2p) - posterior mode of the regression parameter vector beta (includes intercept)
elements (5+2p):(8+6p+p*p) - estimated posterior variance matrix of (alpha,beta) (beta includes intercept) in vectorised form
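
The three row layouts above can be unpacked with one small routine. The Python sketch below is not part of the download (the R and Matlab summary scripts are the supported route) and rests on two assumptions: fields are whitespace-separated, and padding entries are written literally as NA or NaN and can be dropped before reading the row. Because a variance matrix is symmetric, the result does not depend on whether it was vectorised by rows or by columns. The function name `parse_summary_row` and the example rows are hypothetical.

```python
def parse_summary_row(line, modtype):
    """Unpack one SUMMARYFILE row; modtype follows MODTYPE (1=linear, 2=binary, 3=survival)."""
    if modtype not in (1, 2, 3):
        raise ValueError("modtype must be 1, 2 or 3")
    # Assumption: padding for smaller models appears as NA/NaN tokens and can be discarded.
    vals = [float(t) for t in line.split() if t.lower() not in ("na", "nan")]
    p = int(vals[0])                                   # element 1: number of predictors
    out = {"p": p, "score": vals[1]}                   # element 2: log posterior probability
    out["indices"] = [int(v) for v in vals[2:2 + p]]   # elements 3:(2+p)
    pos = 2 + p
    if modtype == 3:
        out["alpha"] = vals[pos]                       # Weibull index parameter
        pos += 1
    n_beta = p if modtype == 1 else p + 1              # binary/survival include an intercept
    out["beta"] = vals[pos:pos + n_beta]
    pos += n_beta
    m = n_beta + 1 if modtype == 3 else n_beta         # survival matrix covers (alpha, beta)
    flat = vals[pos:pos + m * m]
    out["cov"] = [flat[i * m:(i + 1) * m] for i in range(m)]
    pos += m * m
    if modtype == 1:
        out["s"], out["d"] = vals[pos], vals[pos + 1]  # residual SD and degrees of freedom
        pos += 2
    if pos != len(vals):
        raise ValueError("row length does not match the documented layout")
    return out


# Made-up rows for illustration only.
linear = parse_summary_row("2 -10.5 3 7 0.5 -0.2 0.04 0.01 0.01 0.09 1.2 30 NA NA", 1)
print(linear["indices"], linear["beta"], linear["s"], linear["d"])
survival = parse_summary_row("1 -20.0 4 1.5 0.3 -0.7 0.1 0 0 0 0.2 0 0 0 0.3", 3)
print(survival["alpha"], len(survival["cov"]))
```

The position of the intercept within beta is not stated here, so the sketch returns beta as written in the file without labelling its entries.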

For a more detailed description of all output files, please see the README file in the download.