Clustered Dirichlet Process Mixture Modelling



Inputs: The CDP software requires a single parameter file to specify hyperparameters, file locations, initial values, etc. This file is called parameters.txt.

To get a default parameters.txt file, open a command prompt in the directory containing the executable and issue the command "cdp -default", where cdp is the name of your CDP executable. Note that the generated file is called default.parameters.txt. You will need to rename it to parameters.txt in order to conduct an analysis.

The format of each line of the file is simply NAME = VALUE, and lines beginning with a # sign are ignored as comments. The order of the entries does not matter, and the values are not case sensitive.
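
As an illustration of the format described above, here is a minimal Python sketch of a reader for NAME = VALUE files. It is not part of CDP; the function name parse_parameters and the sample values are hypothetical. Names are kept exactly as written, since only the values are stated to be case insensitive.

```python
def parse_parameters(text):
    """Parse NAME = VALUE lines, skipping blank lines and # comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected NAME = VALUE, got: {raw!r}")
        params[name.strip()] = value.strip()
    return params


# Illustrative content only; these are not the CDP defaults.
example = """\
# Data section
N = 10
D = 2
DataFile = data.txt
"""
print(parse_parameters(example))
```

All values are returned as strings; converting them to integers or doubles is left to the caller.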

In your own analyses, it is only necessary to specify values for parameters whose values you want to change from the default setting appearing in the default.parameters.txt file (see the Examples page for examples of this). In particular, we recommend using the default settings for the sampling steps (all parameters sampled) and initial values sections (initial values are obtained by sampling from the prior distributions).

The values to be specified are the following:

Data Section

  • N: Integer, the number of observations/data points
  • D: Integer, the dimension of each observation
  • DataFile: String, the path to a file containing the data. Each row of this file should correspond to a single observation. Entries should be separated by spaces, not commas. For example, if you have 10 observations of a 2-dimensional response, the file should have 10 rows with each row consisting of two numbers separated by a space.
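
The data file layout described above can be produced with a few lines of Python. This is a sketch, not a CDP utility: the helper name write_data_file, the temporary output path, and the observations are all made up for illustration. Fixed-point formatting is used on the assumption that plain decimal numbers are the safest input.

```python
import os
import tempfile


def write_data_file(path, rows):
    """Write one observation per row, entries separated by single spaces.

    Returns (N, D): the number of rows written and the entries per row.
    """
    rows = [tuple(float(x) for x in row) for row in rows]
    dims = {len(row) for row in rows}
    if len(dims) != 1:
        raise ValueError("need at least one row, all of the same dimension")
    with open(path, "w") as fh:
        for row in rows:
            fh.write(" ".join(f"{x:.6f}" for x in row) + "\n")
    return len(rows), dims.pop()


# Ten made-up observations of a 2-dimensional response.
rows = [(0.5 * i, 0.1 * i * i) for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "data.txt")
n, d = write_data_file(path, rows)
print(f"N = {n}")
print(f"D = {d}")
```

The two printed values are the ones to enter for N and D, with DataFile set to the path of the written file.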

Prior Section

  • J: Integer, the maximum number of clusters, i.e. the truncation point of the countably infinite top-layer mixture.
  • T: Integer, the maximum number of normal components per cluster (i.e. the truncation points of the countably infinite bottom-layer mixtures)
  • m0: D Doubles separated by spaces, the mean of the normal prior on cluster location parameters m_j (i.e. m_j ~ N(m0,Phi0))
  • phi0: Double, specifies the diagonal entries of the covariance of the normal prior on cluster location parameters m_j (i.e. Phi0 = phi0*I)
  • lambda0: Double, specifies the diagonal entries of the scale matrix parameter of the Wishart prior on cluster shape parameters Phi_j (i.e. Phi_j ~ Wishart(nu0+D, lambda0*I/(nu0+D))). Note that under this parametrization of the Wishart distribution, lambda0*I is the expected value of each Phi_j.
  • nu0: Integer, the positive degrees of freedom of the Wishart prior on cluster shape parameters Phi_j.
  • gamma: Positive Double, part of the normal-inverse-Wishart prior on bottom level component locations and shapes. Specifies how spread out about the cluster location m_j the components within cluster j can be: mu_j,t ~ N(m_j,gamma*Sigma_j,t)
  • nu: Integer, the positive degrees of freedom of the Inverse Wishart prior on bottom level component shape parameters Sigma_j,t (i.e. Sigma_j,t ~ Inv-Wishart(nu+2, nu*Phi_j)). Note that under this parametrization of the Inv-Wishart distribution, Phi_j is the expected value of Sigma_j,t.
  • e0: Double, shape parameter of the gamma prior on top-level DP scale parameter alpha0 (alpha0 ~ Gamma(e0,f0)). Note that higher values of alpha0 result in greater numbers of clusters.
  • f0: Double, scale parameter of the gamma prior on top-level DP scale parameter alpha0 (alpha0 ~ Gamma(e0,f0)). Note that higher values of alpha0 result in greater numbers of clusters.
  • ee: Double, shape parameter of the gamma priors on bottom-level DP scale parameters alpha_j (alpha_j ~ Gamma(ee,ff)). Note that higher values of alpha_j result in greater numbers of normal components in cluster j.
  • ff: Double, scale parameter of the gamma priors on bottom-level DP scale parameters alpha_j (alpha_j ~ Gamma(ee,ff)). Note that higher values of alpha_j result in greater numbers of normal components in cluster j.

MCMC Section

  • burnin: Integer, the number of initial MCMC iterations to be discarded
  • iter: Integer, the number of MCMC iterations to be collected after the burnin phase.
  • seed: Integer, the random number seed (for repeatability)
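
Since only the settings that differ from the defaults need to appear in parameters.txt, a short override file is often enough. The Python sketch below assembles one from the parameter names listed above. The helper name format_parameters and every value shown are hypothetical, chosen only for illustration; they are not the CDP defaults or recommended settings.

```python
def format_parameters(overrides):
    """Render a dict of settings as NAME = VALUE lines.

    List or tuple values (such as m0) are written as space-separated entries.
    """
    lines = ["# Settings that differ from default.parameters.txt"]
    for name, value in overrides.items():
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        lines.append(f"{name} = {value}")
    return "\n".join(lines) + "\n"


# Illustrative values only.
overrides = {
    "N": 10,
    "D": 2,
    "DataFile": "data.txt",
    "J": 5,
    "T": 5,
    "m0": [0.0, 0.0],  # D doubles separated by spaces
    "burnin": 1000,
    "iter": 2000,
    "seed": 42,
}
print(format_parameters(overrides), end="")
```

Saving the printed text as parameters.txt next to the executable gives a file in the expected format.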

The following two sections of the file contain advanced/debugging options that most users will not need to alter. The Sampling Steps section allows you to specify which parameters are to be sampled, and which are to be held at fixed values. The Initial Values section allows you to specify initial values for all sampled parameters. This is useful for extending a previous MCMC run.

Sampling Steps Section

  • samplem: Binary Integer (1|0), specifies whether (1) or not (0) the top level location variables should be sampled.
  • samplePhi: Binary Integer (1|0), specifies whether (1) or not (0) the top level shape variables should be sampled.
  • samplew: Binary Integer (1|0), specifies whether (1) or not (0) the top level component membership variables should be sampled.
  • sampleq: Binary Integer (1|0), specifies whether (1) or not (0) the top level component weights should be sampled.
  • samplealpha0: Binary Integer (1|0), specifies whether (1) or not (0) the top level DP scale parameter should be sampled.
  • samplemu: Binary Integer (1|0), specifies whether (1) or not (0) the bottom level location variables should be sampled.
  • sampleSigma: Binary Integer (1|0), specifies whether (1) or not (0) the bottom level shape variables should be sampled.
  • samplek: Binary Integer (1|0), specifies whether (1) or not (0) the bottom level membership variables should be sampled.
  • samplep: Binary Integer (1|0), specifies whether (1) or not (0) the bottom level component weights should be sampled.
  • samplealpha: Binary Integer (1|0), specifies whether (1) or not (0) the bottom level DP scale parameters should be sampled.

Initial Values Section (see the Outputs page for the proper formatting of these files)

  • Alpha0file: String, file containing an initial value for the top level DP scale parameter.
  • Mfile: String, file containing initial values for the top level location parameters.
  • Phifile: String, file containing initial values for the top level shape parameters.
  • Wfile: String, file containing initial values for the top level membership variables.
  • Qfile: String, file containing initial values for the top level component weights.
  • qVfile: String, file containing initial values for the top level stick breaking parameters.
  • Alphafile: String, file containing initial values for the bottom level DP scale parameters.
  • Mufile: String, file containing initial values for the bottom level component location parameters.
  • Sigmafile: String, file containing initial values for the bottom level component shape parameters.
  • Kfile: String, file containing initial values for the bottom level component membership variables.
  • Pfile: String, file containing initial values for the bottom level component weights.
  • pVfile: String, file containing initial values for the bottom level stick breaking parameters.
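
For readers unfamiliar with how stick breaking parameters relate to component weights, the Python sketch below shows the standard truncated stick-breaking construction, in which weight j equals V_j times the product of (1 - V_l) over all earlier l. It is an assumption that CDP stores the contents of qVfile and pVfile in exactly this convention; the Outputs page is the authority on the file format. The function name stick_breaking_weights and the input values are hypothetical.

```python
def stick_breaking_weights(v):
    """Convert stick breaking fractions V_1..V_J into mixture weights.

    weight_j = V_j * prod_{l < j} (1 - V_l). Under truncation the last
    fraction is set to 1 so that the weights sum to one.
    """
    weights = []
    remaining = 1.0  # length of stick not yet assigned
    for v_j in v:
        weights.append(v_j * remaining)
        remaining *= 1.0 - v_j
    return weights


# Made-up fractions for a truncation at J = 3.
v = [0.5, 0.5, 1.0]
weights = stick_breaking_weights(v)
print(weights)
print(sum(weights))
```

With the last fraction equal to 1, the final component absorbs whatever stick length remains, which is what makes the truncated weights sum to one.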

CDP code developed by: Dan Merl & Quanli Wang

More software from the West group