CDP Software: Clustered Dirichlet Process Mixture Modelling


Outputs: The CDP software creates a number of output files in the directory in which it is run. These are the following:

Sampled Mixture Model Parameters

One important feature of the model is the posterior predictive distribution on top-level components. This distribution is essentially a mixture of T distributions dependent on the sampled values of m, Phi, and q (see discussion paper for more details).

postm.txt: posterior samples of the top-level location variables. For each of the iter MCMC iterations, all J cluster location parameters (i.e. all m_j) are printed. Thus the first J rows contain m_1 ... m_J for iteration 1, the next J rows contain m_1 ... m_J for iteration 2, and so on.
In R, you can load these values into an iter x J x D dimensional array using
m = aperm(array(scan("postm.txt"),c(D,J,iter)),3:1)
postPhi.txt: posterior samples of the top-level shape variables (i.e. Phi_j). Each row consists of a D x D matrix printed out in row major order. The first J rows contain Phi_1 ... Phi_J for iteration 1, the next J rows contain Phi_1 ... Phi_J for iteration 2, and so on.
In R, you can load these values into an iter x J x D x D dimensional array using
Phi = aperm(array(scan("postPhi.txt"),c(D,D,J,iter)),4:1)
postq.txt: posterior samples of the top-level weights. The first row consists of J weights (q_1, q_2, ..., q_J) for iteration 1, and so on.
In R, you can load these values into an iter x J dimensional array using
q = aperm(array(scan("postq.txt"),c(J,iter)),2:1)
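
To see why the aperm() calls above yield the stated index order, the following sketch writes a tiny synthetic file in the postPhi.txt layout and reads it back with the same recipe. The sizes D, J and iter, the temporary file, and the values in it are made up for illustration; only the scan/array/aperm call comes from the text above.

```r
# Hypothetical sizes, chosen only for this illustration
D <- 2; J <- 3; iter <- 4

# Known matrices: truth[i, j, , ] plays the role of Phi_j at iteration i
truth <- array(seq_len(iter * J * D * D), c(iter, J, D, D))

# Write one matrix per row in row-major order, J rows per iteration,
# mimicking the documented layout of postPhi.txt
f <- tempfile(fileext = ".txt")
rows <- NULL
for (i in seq_len(iter)) {
  for (j in seq_len(J)) {
    # t() followed by as.vector() flattens the matrix row by row
    rows <- rbind(rows, as.vector(t(truth[i, j, , ])))
  }
}
write.table(rows, f, row.names = FALSE, col.names = FALSE)

# The recipe from the text: scan, fill a D x D x J x iter array, reverse dims
Phi <- aperm(array(scan(f, quiet = TRUE), c(D, D, J, iter)), 4:1)

dim(Phi)                               # iter, J, D, D
all(Phi[2, 3, , ] == truth[2, 3, , ])  # TRUE: iteration 2, cluster 3
```

After loading, Phi[i, j, , ] is the D x D matrix for cluster j at iteration i, and the same reasoning applies to m and q with fewer dimensions.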

Another important feature of the model is the posterior predictive distribution based on the bottom-level components. This distribution is essentially a mixture of normal distributions, based on the sampled values of mu, Sigma, and p (see discussion paper for more details).

postmu.txt: posterior samples of the bottom-level location variables. For each iteration of MCMC, all J x T component location parameters are printed. The first J x T rows contain mu_1,1, ... mu_1,T, mu_2,1, ... mu_2,T, ... mu_J,1, ... mu_J,T for the first iteration of MCMC, and so on.
In R, you can load these values into an iter x JT x D dimensional array using
mu = aperm(array(scan("postmu.txt"),c(D,J*T,iter)),3:1)
postSigma.txt: posterior samples of the bottom-level shape variables (i.e. all Sigma_j,t). Each row consists of a D x D matrix printed out in row major order. The first J x T rows contain Sigma_1,1, ...Sigma_1,T, Sigma_2,1, ... Sigma_2,T, ... Sigma_J,1, ... Sigma_J,T from the first iteration of MCMC, and so on.
In R, you can load these values into an iter x JT x D x D dimensional array using
Sigma = aperm(array(scan("postSigma.txt"),c(D,D,J*T,iter)),4:1)
postp.txt: posterior samples of the bottom-level component weights. Each row consists of the T weights associated with a particular top-level component. The first J rows (J x T values in total) contain p_1,1 ... p_1,T, p_2,1 ... p_2,T, ... p_J,1 ... p_J,T for the first iteration of MCMC, and so on.
Note: these weights have been scaled so that p_1,1 + p_1,2 + ... + p_J,T = 1. In other words, the values printed in this file are actually p_j,t' where p_j,t' = p_j,t * q_j. This was done to facilitate the interpretation of the predictive distribution at a specific iteration of MCMC as a J x T component mixture of normal distributions.
In R, you can load these values into an iter x JT dimensional array using
p = aperm(array(scan("postp.txt"),c(J*T,iter)),2:1)
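
Because the weights in postp.txt are already scaled by q_j, the predictive density at a single iteration can be evaluated as one J x T component normal mixture. The sketch below assumes that mu, Sigma and p have been loaded as shown above, and that each Sigma_j,t is a covariance matrix (an assumption; check the discussion paper). The helper names dmvn and cdp_density are hypothetical and not part of the CDP software, and synthetic values stand in for real output so that the block runs by itself.

```r
# Multivariate normal density at a single point x, base R only.
# Assumes S is a covariance matrix (not a precision matrix).
dmvn <- function(x, mean, S) {
  d <- length(x)
  R <- chol(S)                                    # S = t(R) %*% R
  z <- backsolve(R, x - mean, transpose = TRUE)   # solves t(R) z = x - mean
  exp(-0.5 * sum(z^2) - sum(log(diag(R))) - 0.5 * d * log(2 * pi))
}

# Mixture density at x for MCMC iteration i, using the scaled weights p
cdp_density <- function(x, i, mu, Sigma, p) {
  K <- dim(mu)[2]                                 # K = J * T components
  dens <- 0
  for (k in seq_len(K)) {
    dens <- dens + p[i, k] * dmvn(x, mu[i, k, ], Sigma[i, k, , ])
  }
  dens
}

# Synthetic stand-ins for the loaded arrays (illustration only)
D <- 2; J <- 2; T <- 2; iter <- 1
mu <- array(0, c(iter, J * T, D))
Sigma <- array(0, c(iter, J * T, D, D))
for (k in seq_len(J * T)) Sigma[1, k, , ] <- diag(D)
p <- array(1 / (J * T), c(iter, J * T))

sum(p[1, ])                              # scaled weights sum to 1
cdp_density(c(0, 0), 1, mu, Sigma, p)    # equals 1 / (2 * pi) here
```

Averaging cdp_density over the iterations would give a Monte Carlo estimate of the posterior predictive density at x.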

Other useful summaries of the model are the observation-specific mean cluster locations and the observation-specific mean component locations. These may be regarded as compressed representations of the original data (see discussion paper for more details).

postxmbar.txt: Mean cluster location for each observation. Each row i contains the mean value of m_{w_i} for observation i, averaged over the course of the MCMC.
In R, you can load these values into an N x D dimensional array using
xm = read.table("postxmbar.txt")
postxmubar.txt: Mean component location for each observation. Each row i contains the mean value of mu_{w_i,k_i} for observation i, averaged over the course of the MCMC.
In R, you can load these values into an N x D dimensional array using
xmu = read.table("postxmubar.txt")

Last Parameter Values

The following files contain the last sampled values of all model parameters. The format of these files is the same as that expected when specifying initial values of the MCMC as described on the Inputs page.

lastm.txt: J x D matrix, with row j representing m_j.
lastPhi.txt: J x (D*D) matrix, with row j representing Phi_j in row-major order.
lastq.txt: A single row of J values, representing q_1 ... q_J.
lastqV.txt: A single row of J values, representing the draws from the Beta distribution from which the weights q are derived.
lastw.txt: A single column of N values w_1 ... w_N, indicating the association of an observation i with a top-level mixture component w_i (i.e. if w_i = 2, observation i is assumed to be a realization from g_2).
lastmu.txt: (J*T) x D dimensional matrix, with row T*(j-1)+t representing mu_j,t.
lastSigma.txt: (J*T) x (D*D) dimensional matrix, with row T*(j-1)+t representing Sigma_j,t.
lastk.txt: A single column of N values k_1 ... k_N, indicating the association of an observation i with mixture component k_i in cluster w_i.
lastp.txt: A J x T matrix, with row j containing the T component weights for top-level mixture j (i.e. p_j,1, ... p_j,T).
Note: unlike the values printed in postp.txt, these values are not preprocessed, so for each j, p_j,1 + p_j,2 + ... + p_j,T = 1.
lastpV.txt: A J x T matrix, with row j containing the T draws from the Beta distribution from which the weights p_j,1 ... p_j,T are derived.
lastalpha.txt: A single row of J values, representing the cluster-specific DP scale parameters alpha_1 ... alpha_J.
lastalpha0.txt: A single value representing the top level DP scale parameter alpha_0.
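
The row indexing used by lastmu.txt and lastSigma.txt, and the relation between lastp.txt, lastq.txt and the scaled weights in postp.txt, can be sketched as follows. The helper names row_index and row_to_matrix and all the values are hypothetical; with real output the matrices would be read from the files instead.

```r
# Hypothetical sizes for illustration
D <- 2; J <- 3; T <- 2

# Row of lastmu.txt / lastSigma.txt that holds component t of cluster j
row_index <- function(j, t, T) T * (j - 1) + t

# Turn one row of lastSigma.txt (D*D values, row-major) into a D x D matrix
row_to_matrix <- function(v, D) matrix(as.numeric(v), nrow = D, byrow = TRUE)

# Synthetic stand-ins; with real output one might use, e.g.,
#   lastSigma <- as.matrix(read.table("lastSigma.txt"))
lastSigma <- matrix(rep(c(1, 0.5, 0.5, 2), each = J * T), nrow = J * T)
lastq <- c(0.5, 0.3, 0.2)
lastp <- matrix(c(0.6, 0.4,
                  0.1, 0.9,
                  0.5, 0.5), nrow = J, byrow = TRUE)

row_index(2, 1, T)                                   # 3
Sigma_21 <- row_to_matrix(lastSigma[row_index(2, 1, T), ], D)

# Scaled weights as in postp.txt: p'_j,t = p_j,t * q_j, ordered cluster by cluster
p_scaled <- as.vector(t(lastp * lastq))
sum(p_scaled)                                        # 1
```

Multiplying the J x T matrix by the length-J vector relies on R recycling the vector down each column, so row j is scaled by q_j.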

CDP code developed by: Dan Merl & Quanli Wang
