mbsc - R-functions for model-based subspace clustering
mbsc is a set of R functions
that fit a multivariate Dirichlet-process mixture model
to identify clusterings based on differences in mean and variance at
subsets of attributes,
as described in this document.
- Old code to perform clusterings based on differences in means only can be found here.
- Code to analyze matrices of binary data can be found here.
- Usage: mbsc(Y,...)
- Data arguments: Y is a matrix. Each row represents the
attribute measurements of a single case.
Missing data are allowed, provided they are missing at random.
- Optional arguments (with defaults):
- groups=rep(1,dim(Y)[1]) : the starting value of the cluster membership
function.
- ps2=cbind(rep(1/2,dim(Y)[2]),apply(Y,2,var)/2) : parameters
of the inverse-gamma prior for the error variance.
- pmu=cbind(apply(Y,2,mean),apply(Y,2,var)) : parameters
of the normal prior for the baseline means.
- peta=c(2,2) : parameters of the inverse-gamma prior for
the average squared magnitude of the mean shifts.
- ptheta=c(1,1) : parameters of the beta prior for the probability of relevance.
- palpha=c(1,1) : parameters of the beta prior for alpha/(1+alpha) .
- pwm = c(2,2) : parameters of the gamma prior for the mean of 1/omega^2.
- pwv = c(2,2) : parameters of the gamma prior for the variance of 1/omega^2.
- nscan=1000 : number of scans of the Markov chain.
- verb=T : whether to print output as the chain progresses.
- odens=max(1,round(nscan/1000)) : how often to save output (once every odens scans).
- seed=1 : random seed.
- ngb=dim(Y)[1] : number of object-specific cluster memberships
to update per scan (using Gibbs sampling).
- nsm=0 : number of split-merge proposals to make per scan.
If the posterior over clusterings appears roughly unimodal,
this can be left at zero. If the chain has trouble mixing
between modes, set it to a positive value.
- plt=F : whether to make a trace plot of the hyperparameters at each saved scan.
- ofile ="CLUSTERS" : R image file within which to save the output.
- saveres=T : whether to save the results in ofile.
- Output: A list with the following objects:
- OUT : values of scan number, log-likelihood, K, alpha, lambda,
eta, a, b, saved every odens scans.
- GROUPS, MU, S2 : posterior samples of the group membership function,
baseline mean and baseline variance, saved every odens scans.
- uRm, uDm, uWm : if ngb+nsm=0 (i.e. you are keeping the grouping fixed),
these are the posterior means of the cluster-specific parameters r, delta, and omega.
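To make the default argument values above concrete, here is a minimal sketch that evaluates them in base R for a small simulated matrix. The matrix Y below is a stand-in invented for illustration, and nothing here calls mbsc itself; it only shows the shapes that the default expressions produce.

```r
# Simulated stand-in for a data matrix: 50 cases (rows), 4 attributes (columns).
set.seed(1)
Y <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)

n <- dim(Y)[1]  # number of cases
m <- dim(Y)[2]  # number of attributes

# Default expressions copied from the argument list above.
groups <- rep(1, n)                                   # every case starts in cluster 1
ps2    <- cbind(rep(1/2, m), apply(Y, 2, var) / 2)    # one row per attribute, 2 columns
pmu    <- cbind(apply(Y, 2, mean), apply(Y, 2, var))  # one row per attribute, 2 columns
nscan  <- 1000
odens  <- max(1, round(nscan / 1000))                 # with nscan = 1000, every scan is saved

print(dim(ps2))
print(dim(pmu))
print(odens)
```

Each of ps2 and pmu has one row per attribute, so a user-supplied replacement would presumably need the same m x 2 layout.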
Installation:
- Download the text files mbsc.r and mbsc.c .
- Start an R session in the directory containing mbsc.r
and type
source("mbsc.r")
- Assuming you have a data matrix Y you want to
cluster, type clustering<-mbsc(Y)
- See the examples below for ideas on how to analyze the output.
Running the MCMC may take a long time, so you might want to
do it in batch mode.
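One possible way to summarize the saved membership samples is a matrix of posterior co-clustering probabilities. The sketch below is not part of mbsc: the helper coclustering is hypothetical, and it assumes that GROUPS holds one row per saved scan and one column per case (transpose with t() if the layout is the other way round). The call to mbsc runs only if mbsc.r has been sourced and a matrix Y exists.

```r
# Posterior co-clustering probabilities from saved membership samples.
# ASSUMPTION: G has one row per saved scan and one column per case.
coclustering <- function(G) {
  G <- as.matrix(G)
  n <- ncol(G)
  P <- matrix(0, n, n)
  for (s in seq_len(nrow(G))) {
    # TRUE where cases i and j share a cluster label in scan s
    P <- P + outer(G[s, ], G[s, ], "==")
  }
  P / nrow(G)  # fraction of saved scans in which each pair is co-clustered
}

# Toy membership samples: 2 saved scans, 3 cases.
G_toy <- rbind(c(1, 1, 2),
               c(1, 2, 2))
P_toy <- coclustering(G_toy)
print(P_toy)

# With the real sampler (skipped unless mbsc.r has been sourced and Y exists).
if (exists("mbsc") && exists("Y")) {
  clustering <- mbsc(Y, nscan = 2000, nsm = 5, verb = FALSE, saveres = FALSE)
  P <- coclustering(clustering$GROUPS)
}
```

Saved as a script (say run_mbsc.R, a hypothetical name), this can be run non-interactively from a shell with R CMD BATCH run_mbsc.R.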
Examples:
- Example 1: A small example (n=50, m=100) using a simulated dataset.
- Example 2: A small example (n=50, m=100) using a simulated dataset, where there is no clustering.
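For readers who want data of the same size to experiment with, the following sketch simulates two matrices with n=50 rows and m=100 columns. These are not the datasets used in the examples above: the number of clusters, the choice of relevant attributes and the sizes of the shifts are all invented for illustration, and treating n as the number of cases is an assumption.

```r
set.seed(1)
n <- 50   # cases (rows); assumed meaning of n
m <- 100  # attributes (columns); assumed meaning of m

# Clustered data: two groups that differ only at a subset of attributes.
z <- rep(1:2, each = n / 2)              # true cluster labels (hypothetical)
Y <- matrix(rnorm(n * m), n, m)          # standard normal baseline
rel_mean <- 1:10                         # attributes with a mean shift (hypothetical)
rel_var  <- 11:15                        # attributes with a variance change (hypothetical)
Y[z == 2, rel_mean] <- Y[z == 2, rel_mean] + 2
Y[z == 2, rel_var]  <- Y[z == 2, rel_var] * 3

# Null data: same dimensions, no clustering at all.
Y0 <- matrix(rnorm(n * m), n, m)

print(dim(Y))
print(dim(Y0))
```

Either matrix can then be passed to the sampler as in the installation steps, for example clustering<-mbsc(Y).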
Feedback: Let me know if you use this package, have suggestions,
or encounter bugs. The more feedback I get, the more I will feel
compelled to improve the software.
email: hoff@stat.washington.edu