Model-based subspace clustering

mbsc - R functions for model-based subspace clustering

mbsc is a set of R functions that fit a multivariate Dirichlet-process mixture model to identify clusterings based on differences in mean and variance at subsets of attributes, as described in this document.

Old code to perform clusterings based on differences in means only can be found here.
Code to analyze matrices of binary data can be found here.

Usage: mbsc(Y,...)
Data arguments: Y is a matrix in which each row holds the attribute measurements for a single case. Data that are missing at random are allowed.
Optional arguments (with defaults):
- groups=rep(1,dim(Y)[1]) : the starting value of the cluster membership function.
- ps2=cbind(rep(1/2,dim(Y)[2]),apply(Y,2,var)/2) : parameters of the inverse-gamma prior for the error variance.
- pmu=cbind(apply(Y,2,mean),apply(Y,2,var)) : parameters of the normal prior for the baseline means.
- peta=c(2,2) : parameters of the inverse-gamma prior for the average squared magnitude of the mean shifts.
- ptheta=c(1,1) : parameters of the beta prior for the probability of relevance.
- palpha=c(1,1) : parameters of the beta prior for alpha/(1+alpha).
- pwm=c(2,2) : parameters of the gamma prior for the mean of 1/omega^2.
- pwv=c(2,2) : parameters of the gamma prior for the variance of 1/omega^2.
- nscan=1000 : number of scans of the Markov chain.
- verb=T : whether to print output as the chain progresses.
- odens=max(1,round(nscan/1000)) : how often (in scans) to save output.
- seed=1 : random seed.
- ngb=dim(Y)[1] : number of object-specific cluster memberships to update per scan (using Gibbs sampling).
- nsm=0 : number of split-merge proposals to make per scan. If your data provide a roughly unimodal clustering this can be set to zero. If the chain is having trouble mixing between modes this should be nonzero.
- plt=F : whether to make a trace plot of the hyperparameters at each saved scan.
- ofile="CLUSTERS" : R image file in which to save the output.
- saveres=T : whether to save the results in ofile.
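
The data-dependent defaults above can be computed by hand, which may help when deciding whether to override them. The following is a minimal sketch on a simulated stand-in matrix; the matrix and the value of nscan are made up for illustration, and only the default expressions themselves come from the argument list above.

```r
# Simulated stand-in for a data matrix: 50 cases (rows), 4 attributes (columns).
set.seed(1)
Y <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)

n <- dim(Y)[1]   # number of cases
m <- dim(Y)[2]   # number of attributes

# Default starting partition: every case starts in cluster 1.
groups <- rep(1, n)

# Default ps2: one row per attribute; column 1 is 1/2,
# column 2 is half the sample variance of that attribute.
ps2 <- cbind(rep(1/2, m), apply(Y, 2, var) / 2)

# Default pmu: one row per attribute; column 1 is the sample mean,
# column 2 is the sample variance of that attribute.
pmu <- cbind(apply(Y, 2, mean), apply(Y, 2, var))

# Default thinning interval for a chosen number of scans (nscan is illustrative).
nscan <- 5000
odens <- max(1, round(nscan / 1000))

print(dim(ps2))
print(dim(pmu))
print(odens)
```

Note that in base R, mean and var return NA for a column containing missing values unless na.rm=TRUE is given, so with incomplete data it may be worth checking these quantities and passing ps2 and pmu explicitly.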
Output: A list with the following objects:
- OUT : values of scan number, log-likelihood, K, alpha, lambda, eta, a, b, saved every odens scans.
- GROUPS, MU, S2 : posterior samples of the group membership function, baseline mean and baseline variance, saved every odens scans.
- uRm, uDm, uWm : if ngb+nsm=0 (i.e. you are keeping the grouping fixed), these are the posterior means of the cluster-specific parameters r, delta, and omega.
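
One common way to summarize saved membership samples is a pairwise co-clustering matrix, which does not depend on how the clusters are labeled from scan to scan. The sketch below assumes the samples are arranged as a matrix with one row per saved scan and one column per case; whether GROUPS is laid out this way should be checked against the actual output (it may need transposing). The helper coclust_prob is hypothetical and not part of mbsc, and the small matrix G is made up for illustration.

```r
# Hypothetical helper (not part of mbsc): fraction of saved scans in which
# each pair of cases shares a cluster label.
# G: matrix of labels, one row per saved scan, one column per case (assumed layout).
coclust_prob <- function(G) {
  n <- ncol(G)
  P <- matrix(0, n, n)
  for (s in seq_len(nrow(G))) {
    # outer(..., "==") is TRUE where two cases share a label in scan s
    P <- P + outer(G[s, ], G[s, ], "==")
  }
  P / nrow(G)
}

# Toy membership samples: 3 saved scans, 4 cases.
G <- rbind(c(1, 1, 2, 2),
           c(1, 1, 1, 2),
           c(2, 2, 1, 1))

P <- coclust_prob(G)
print(round(P, 2))
```

With real output, something like coclust_prob(clustering$GROUPS) could be tried once the layout has been confirmed.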

Installation:

Download the text files mbsc.r and mbsc.c.
Start an R session in the directory containing mbsc.r and type source("mbsc.r")
Assuming you have a data matrix Y that you want to cluster, type clustering<-mbsc(Y)
See the examples below for ideas on how to analyze the output.

Running the MCMC may take a long time, so you might want to do it in batch mode.
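
A batch run might look like the following sketch. The script name run_mbsc.R and the placeholder data are assumptions for illustration; the mbsc arguments used are those listed under Usage. Any step needed to compile or load mbsc.c is not covered here.

```r
# run_mbsc.R (hypothetical file name)
# Run non-interactively from a shell with:  R CMD BATCH run_mbsc.R
# Console output is then written to run_mbsc.Rout.

set.seed(1)

# Placeholder data; replace with your own matrix (rows = cases, columns = attributes).
Y <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

if (file.exists("mbsc.r")) {
  source("mbsc.r")
  # Longer chain than the default, no trace plots, results saved to ofile.
  clustering <- mbsc(Y, nscan = 2000, plt = FALSE,
                     ofile = "CLUSTERS", saveres = TRUE)
} else {
  # Guard so the script fails gracefully when the source file is missing.
  message("mbsc.r not found in ", getwd(), "; skipping the MCMC run.")
}
```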

Examples:

Example 1: A small example (n=50, m=100) using a simulated dataset.
Example 2: A small example (n=50, m=100) using a simulated dataset, where there is no clustering.
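
For experimenting before trying real data, datasets of the same size as in the two examples can be simulated along the following lines. This is only an illustrative sketch and not the simulation used in the examples; the number of relevant attributes, the size of the mean shift, and all variable names are assumptions.

```r
set.seed(1)
n <- 50    # cases
m <- 100   # attributes

# Dataset with clustering: two equal-sized groups that differ in mean
# only at a subset of attributes.
z <- rep(1:2, each = n / 2)   # simulated cluster labels
relevant <- 1:10              # attributes at which group 2 is shifted (assumed)
Y <- matrix(rnorm(n * m), nrow = n, ncol = m)
Y[z == 2, relevant] <- Y[z == 2, relevant] + 2

# Dataset without clustering: pure noise of the same dimensions.
Y0 <- matrix(rnorm(n * m), nrow = n, ncol = m)

# Quick check of the group difference at relevant and irrelevant attributes.
print(mean(Y[z == 2, relevant]) - mean(Y[z == 1, relevant]))
print(mean(Y[z == 2, -relevant]) - mean(Y[z == 1, -relevant]))
```

Either matrix can then be passed to mbsc in place of Y to see how the fitted number of clusters behaves with and without real structure.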

Feedback: Let me know if you use this package, have suggestions, or encounter bugs. The more feedback I get, the more I will feel compelled to improve the software.

email: hoff@stat.washington.edu