A simple way of representing the mass-points as part of the GLM is described, allowing straightforward ML estimation without parametric model assumptions for the omitted variable. The approach is thus competitive with quasi-likelihood approaches.
A GLIM4 implementation is briefly described, and illustrated with a number of examples of overdispersed GLMs.
In the second part the approach in the first part is extended to two-level variance component models in the exponential family. Essentially the same computational approach gives the NPML estimate of the random-effect distribution and so gives full ML estimation of the GLM parameters in the variance component model without any model assumption for the random-effect distribution. The approach is thus competitive with GEE approaches, though it is a conditional rather than a marginal approach.
A GLIM4 implementation is briefly described and illustrated with several examples of variance component and longitudinal analysis with binary data.
The outline of the estimation can be described as follows. The model distribution $g(x)$ is approximated in advance by the mixture model $\sum_i\pi_i h_i(x,\theta_i)$ with a sufficiently large number of components. The model has a hierarchical mixture structure; that is, $h$ is a substructure of $g$. However, the estimation of $a$ and $b$ is not trivial because they are parameters common to all the $h_i$'s. If we adopt the normal distribution as $h_i$ and optimize $a$ and $b$ separately, we can obtain explicit recurrence formulae for $a$ and $b$.
Another aspect of image understanding is attention. If there are too many objects in an image, we should restrict the region of estimation based on some basic features of the image. In that case, the data outside the region become censored data. We can also apply the EM algorithm to this missing value problem. One point that differs from the original EM principle is that the number of missing data is itself missing; namely, the number of censored data related to the objects inside the region is unknown. Therefore, we estimate the number of missing data in each EM step. Suppose the current fitted model is $p(x,\theta_0)$; the number of missing data can then be approximated by $n(1-p)/p$, where $n$ is the number of data points in the region and $p=\int_C p(x,\theta_0)\,dx$ is the integral over the region. This algorithm corresponds to a combination of Hartley's (1958) algorithm and the EM algorithm.
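As a concrete illustration of this idea (a minimal sketch, not the authors' implementation), the following applies it to a one-dimensional normal model observed only on the region $C=(-\infty,c]$: at each iteration the number of censored points is re-estimated as $n(1-p)/p$ and folded into the M-step.

```python
import numpy as np
from scipy.stats import norm

def em_censored_normal(x_obs, c, n_iter=200):
    """EM in the spirit of Hartley (1958): a N(mu, sigma^2) sample is observed
    only on C = (-inf, c]; the number of points falling outside C is unknown
    and is re-estimated at every iteration as n*(1-p)/p."""
    n = len(x_obs)
    mu, sigma = x_obs.mean(), x_obs.std()
    for _ in range(n_iter):
        alpha = (c - mu) / sigma
        p = norm.cdf(alpha)                             # current model mass inside C
        m = n * (1.0 - p) / p                           # estimated number of missing points
        lam = norm.pdf(alpha) / (1.0 - p)               # inverse Mills ratio
        e1 = mu + sigma * lam                           # E[X | X > c]
        e2 = mu**2 + sigma**2 + sigma * lam * (c + mu)  # E[X^2 | X > c]
        # M-step: combine observed data with m "expected" censored points
        mu_new = (x_obs.sum() + m * e1) / (n + m)
        var_new = (np.sum(x_obs**2) + m * e2) / (n + m) - mu_new**2
        mu, sigma = mu_new, np.sqrt(var_new)
    return mu, sigma
```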
The most striking feature of our general results is that the study of the asymptotic behaviour of the estimators is equivalent to the study of the approximation of an operator canonically associated to the family ${\cal P}$, and is also equivalent to a classical approximation problem in functional analysis.
This great generality allows us to prove central limit type theorems for our estimators. We place special emphasis on necessary and sufficient conditions for the estimators to be efficient. In practical cases, the estimators are easy to compute using linear programming. The general theory is shown to be widely applicable through 11 examples, including mixtures of exponentials, uniforms, normals, gammas, etc.
A natural line of attack upon this problem is through the "intrinsic Bayes factor" or "fractional Bayes factor" methodologies. These methodologies were designed for scenarios in which the same difficulty - that standard noninformative priors cannot be used - is encountered. The application of these approaches to the mixture model problem will be discussed, with computational issues being emphasized.
This paper is based on the extended work of Bozdogan (1981, 1983), where the information-theoretic approach via Akaike's (1973) Information Criterion (AIC) was first introduced and proposed for choosing the number of component clusters in mixture-model cluster analysis. This paper therefore considers the problem of choosing the number of component clusters of individuals within the context of the standard mixture of multivariate normal distributions, and presents some new results.
A common problem in all clustering techniques is the difficulty of deciding on the number of clusters present in a given data set, that is, of assessing cluster validity. How do we determine which variables best discriminate between the clusters as we simultaneously estimate the number of component clusters? How do we determine the outliers or extreme observations across the clustering alternatives? These are some fundamental questions confronting practitioners and research workers in classification and clustering.
Our objective here is to identify and describe the class distribution using a sample drawn from the mixture model, and estimate K, the number of clusters such that k

We also give numerical examples based on simulated multivariate normal data sets with a known number of clusters to illustrate the significance of model selection criteria in choosing the number of clusters and the best fitting model. These procedures take into account simultaneously the lack-of-fit of a cluster, the number of parameters, the sample size, and the complexity of the increased number of clusters to achieve the best fit.
One aspect of the immune response involves the recognition of an antigen in the blood by T cells whose surfaces contain receptors which are complementary, in a suitable sense, to part of that specific antigen. The T cells respond by emitting chemical signals, stimulating other cells to replicate, or replicating themselves.
The assay under study seeks to estimate the number of T cells in a blood sample which respond to each of the two test antigens, by measuring cell replication in a set of aliquots from the blood sample. The aliquots are placed into the wells of a microtiter plate. Some of the wells are treated with one of the two test antigens, some are treated with an antigen from tetanus, which serves as a positive control antigen, and some are treated with the chemical PHA, which, in causing nearly all cells to begin replication, serves as a positive control in the assay. A set of wells receives neither antigen nor PHA; these wells serve as negative controls and allow us to estimate the background level of cell replication. Cell replication in the wells is observed by measuring, using a scintillation counter, the amount of radioactively-labelled thymidine which is incorporated into the cells' DNA.
The traditional method of analyzing data from this type of assay involves declaring each well on the plate to be either positive or negative, according to whether its scintillation count exceeds some chosen cutoff. The average number of responding cells per well is then estimated by -log( # negative wells / # wells ). A key problem with this approach can be the objective determination of a suitable cutoff for the counts corresponding to negative wells. In addition, the above estimator for the average number of responding cells per well may be shown to be inconsistent, under quite general assumptions.
We have proposed a new method of analyzing the data from this assay, modelling the scintillation counts as coming from a mixture distribution. A central idea in our approach is that the size of the scintillation count for a well contains information about the number of responding cells it contains, and not just about whether or not it contains any responding cells.
Let y denote a suitable transformation of the scintillation count for a well, and let k denote the (unobserved) number of responding cells in that well. Our model is that k is Poisson distributed, and that y, given k, is normally distributed with mean a + b k and variance sigma^2, where a, b, and sigma^2 are constant across the wells on a plate. We seek estimates of the underlying Poisson means for the group of wells which received no antigen, and for the groups which received one of the test antigens.
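To illustrate how such a model might be fitted, here is a minimal EM sketch for a single group of wells (an illustration only, not the authors' software; the truncation k_max and the crude starting values are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm, poisson

def em_poisson_normal(y, k_max=50, n_iter=200):
    """EM fit of the assay model for one group of wells:
    k ~ Poisson(lam), y | k ~ N(a + b*k, sig^2), with k unobserved."""
    lam, a = 1.0, y.min()
    b = sig = max(y.std(), 1e-3)
    ks = np.arange(k_max + 1)
    for _ in range(n_iter):
        # E-step: posterior weights w[i, k] = P(k | y_i) on a truncated support
        logw = poisson.logpmf(ks, lam) + norm.logpdf(y[:, None], a + b * ks, sig)
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: lam is the average posterior mean of k
        lam = (w @ ks).mean()
        # weighted least squares of y on k gives the intercept a and slope b
        sw, swk, swk2 = w.sum(), (w * ks).sum(), (w * ks**2).sum()
        swy, swky = (w * y[:, None]).sum(), (w * ks * y[:, None]).sum()
        b = (swky - swk * swy / sw) / (swk2 - swk**2 / sw)
        a = (swy - b * swk) / sw
        sig = np.sqrt((w * (y[:, None] - a - b * ks) ** 2).sum() / sw)
    return lam, a, b, sig
```

In practice a, b, and sig would be shared across all well groups on a plate, with a separate Poisson mean for each group, as described above.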
The effectiveness of this new approach has been demonstrated using two dilution series, in which a blood sample is diluted to a sequence of decreasing cell densities, with the assay being applied to each dilution.
Areas for further research include the robustness of the normality assumption and the determination of a suitable transformation of the scintillation counts. We are also looking at other, quite different approaches.
We analyze a data set concerning flows (i.e. drop out, repeat and pass) in the Italian school system over the period 1963 -- 1992. The data takes the form of I related multinomial time series, one for each grade i=1,...,I, with a hierarchy defined by different school types, gender, region etc. To implement inference and prediction for this data we propose a class of conditionally Gaussian dynamic models for the I multinomial time series y_it, i=1,...,I, t=1,...,N. As a prior probability model for the corresponding (transformed) multinomial parameters p_it we use a multivariate normal dynamic linear model.
The I series, p_it, i=1,...,I, are assumed to have similar shapes. This is modeled using a parameterization which chooses one series, say i=1, as the reference series and models all the other series in terms of p_1t plus a shift r_it relative to it: E(p_it | th_t) = E(p_1t | th_t) + r_it, for i=2,...,I. Here th_t is the total parameter vector.
It may happen, due to changes in the school system, political interventions, changes in the labour market, or other outside events, that one particular series deviates from the general trend for some time intervals; the series thus becomes an ``outlier'' over those time intervals. We introduce a sequence of latent indicator variables s_it, with s_it = 1 if series i is an outlier at time t and s_it = 0 otherwise. Whenever the series becomes an outlier, i.e. s_it = 1, we inflate the variance-covariance matrix in the evolution equation for r_it by scaling it with a parameter v_i. The indicator variables s_it are a priori a sequence of stationary Bernoulli random variables with Markovian evolution. The scale parameters v_i, as well as the parameters of the Bernoulli distributions for the s_it series, are assumed to be stochastic, with prior distributions suitably specified in the model.
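To make the role of the indicators concrete, here is a small simulation sketch of one shift series under this structure (the scalar random-walk evolution, the innovation variance W, and the transition probabilities p01 and p10 are illustrative assumptions, not part of the model specification above):

```python
import numpy as np

def simulate_shift_with_outliers(T, W, v, p01, p10, seed=0):
    """Simulate one shift series r_t with a latent two-state Markov outlier
    indicator s_t: the random-walk innovation variance W is inflated by the
    factor v whenever s_t = 1 (the series is an 'outlier' at time t)."""
    rng = np.random.default_rng(seed)
    r = np.zeros(T)
    s = np.zeros(T, dtype=int)
    for t in range(1, T):
        # Markovian evolution of the indicator: P(0 -> 1) = p01, P(1 -> 0) = p10
        p_one = p01 if s[t - 1] == 0 else 1.0 - p10
        s[t] = int(rng.random() < p_one)
        scale = v if s[t] == 1 else 1.0
        r[t] = r[t - 1] + rng.normal(0.0, np.sqrt(scale * W))
    return r, s

# e.g. r, s = simulate_shift_with_outliers(T=100, W=0.01, v=25.0, p01=0.05, p10=0.3)
```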
We implement inference in the proposed model by Markov chain Monte Carlo simulation. The scheme is similar to the one used in current research by Cargnoni, Mueller and West, except for the additional latent variables, s_it, v_i and the Bernoulli parameters, which require additional layers to the Gibbs sampler used there. A description of the modified scheme after the inclusion of the new parameters is given in detail.
(n = 1,N) where g_n is random noise. The data we receive, for each observation point, is the set {t_n} with the labels n removed. Hence at each observation point, we know that each function produces a data point but we do not know which data point corresponds to which function.
Using approximate permutation matrices combined with a mixture model, we infer the functions {y_n} between the observation points {x_k}. We then apply this framework to the problem of band structure determination. This produces dramatically improved densities of states using limited data.
In this talk, an alternative comprehensive framework for mixture problems will be proposed, which allows fixed or variable numbers of components, and also embraces related clustering problems. MCMC samplers will be described for the models, drawing largely on recent work on a new class of reversible Markov chain samplers that can jump between parameter subspaces of differing dimensionality. The methods are flexible and entirely constructive, and are promising generally for computation in Bayesian model determination problems.
Both models and samplers will be compared with existing statistical and computational methodology, and the methods applied to various illustrative examples, including standard datasets and data arising from enzymatic activity, in which the possible components of the mixture have substantive interpretation in terms of genetic polymorphism.
The same problem arises in investigating statistical evidence of boy preference in Asian countries, such as Viet Nam. By boy preference, it is meant here that parents prefer a boy to a girl, and will tend to try to have children until the desired boy is born. Such behavior is known to be prevalent in China, among other countries. There are a number of ways of testing for boy preference; some statistical evidence that this behavior is indeed prevalent in Viet Nam has been shown in Haughton and Haughton (1994). One of the approaches is to build a logistic model for "parity progression" in completed families: for example, one would model the probability that a woman with two children will give birth to an additional one (parity three). If the existence of a son in the family has a negative (statistically significant) effect on the probability of an additional child (when other variables, such as education, income, and so on, are controlled for), this constitutes evidence for boy preference. But, as in the Murdoch and Stern work, perhaps some families do exhibit some boy preference behavior, and others don't. So a single logistic regression model might not be suitable for all women with completed families.
We present here an analysis of data on parity progression and contraceptive use from the Viet Nam Living Standard Survey involving a mixture of two logistic regressions. The question of whether a single logistic regression model is preferable to a model with two logistic regressions will be discussed as well.
The adaptive mixture density estimator (AMDE) constructs a density by placing kernels at all of the observed data. Unlike a fixed-width mixture density estimator (FMDE), which uses kernels of fixed width, an AMDE allows the widths of the kernels to vary from one point to another. Although the AMDE slightly improves the estimation capability of an FMDE, it does not reduce the high computational and memory cost common to FMDEs. To overcome this cost, a robust radial basis function (RBF) based mixture density estimator can be used. In the construction of an RBF density estimator, sequential and batch clustering algorithms are commonly used to determine the parameters associated with the deployed mixtures (e.g., the mean vectors and covariance matrices of Gaussian mixtures). These clustering algorithms perform poorly in the presence of probabilistically occurring outlying data or data with large variations in dynamic range among dimensions, the latter making the clustering highly sensitive to the selection of distance measures. To overcome these difficulties in constructing an RBF density estimator, a statistical data-sphering technique combined with a centroid-splitting generalized Lloyd clustering technique (also known as the LBG algorithm) can be used to construct the robust RBF density estimator.
Although the robust RBF density estimator construction technique can overcome some of the difficulties encountered in using conventional RBF density estimators, it still cannot overcome the drawback that the estimator's performance is too sensitive to the settings of some control parameters, e.g., the number of kernel mixtures used, the locations of kernels, the orientation of kernels, the kernel smoothing parameters, the exclusion threshold radius for data sphering, the size of the training data, etc. We are thus motivated to study the statistical projection pursuit density estimation technique. In contrast to locally tuned mixture kernel methods, where data are analyzed directly in high-dimensional space in the vicinity of the kernel centers, a projection pursuit method globally projects the data onto one- or two-dimensional subspaces and analyzes the projected data in these low-dimensional subspaces to construct the multivariate density. More specifically, projection pursuit first defines some index of interest of a projected configuration (instead of using the variance adopted by principal component analysis) and then uses a numerical optimization technique to find the projections of most interest. The projection index adopted for density estimation is the degree of departure of the projected data from normality.
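A minimal sketch of the projection pursuit step follows, using a simple moment-based departure-from-normality index as a stand-in for the index actually adopted; the function names, the sphering step, and the Nelder-Mead search with random restarts are illustrative choices only.

```python
import numpy as np
from scipy.optimize import minimize

def projection_index(w, X):
    """Moment-based departure-from-normality index of the data projected
    onto direction w (zero when the first four moments are Gaussian)."""
    w = w / np.linalg.norm(w)
    z = X @ w
    z = (z - z.mean()) / z.std()
    skew = np.mean(z**3)
    kurt = np.mean(z**4) - 3.0
    return skew**2 + 0.25 * kurt**2

def most_interesting_direction(X, n_restarts=10, seed=0):
    """Numerically search for the unit direction maximising the index."""
    rng = np.random.default_rng(seed)
    # sphere the data first, as described above
    Xc = X - X.mean(axis=0)
    L = np.linalg.cholesky(np.linalg.inv(np.cov(Xc, rowvar=False)))
    Xs = Xc @ L
    best_w, best_val = None, -np.inf
    for _ in range(n_restarts):
        w0 = rng.standard_normal(X.shape[1])
        res = minimize(lambda w: -projection_index(w, Xs), w0, method="Nelder-Mead")
        if -res.fun > best_val:
            best_val, best_w = -res.fun, res.x / np.linalg.norm(res.x)
    return best_w, best_val
```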
Performance evaluations using training data from mixture Gaussian and mixture Cauchy densities are presented. The results show that the curse of dimensionality and the sensitivity of control parameters have a much more adverse impact on the mixture density estimators than on the projection pursuit density estimator.
We present new methods with the following features:
The methods use the assumption that our object has ``unmixed'' solid volumes of each specific material at the scale of our sampling rate. The mixtures arise from the sampling process and are most evident near boundaries between materials. The schemes differ in how the mixing is parameterized.
The tissue classification techniques treat a voxel as a small region of space that can have different values at different positions. Each voxel may contain several materials. Based on the sampling theorem, we reconstruct a continuous function, F, from the sampled data and calculate a local histogram, H, of F over each voxel. For each voxel we fit parameterized mixture basis functions to the local histogram and use Bayesian techniques to find a set of parameters that best matches the histogram. The parameters describe the mixture of materials within a voxel given the assumptions of the model.
One method is faster and parameterizes each histogram as a linear combination of normal distributions representing pure materials and mixture basis functions representing the regions near boundaries between pure materials.
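A sketch of what such a parameterization might look like for scalar-valued data and two pure materials is given below. The box-filtered-ramp form of the boundary basis and the least-squares fit are assumptions made for illustration; the method described above fits the parameters with Bayesian techniques, and the proportions below are not constrained to sum to one.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import curve_fit

def pure_basis(x, c, s):
    """Histogram basis for a voxel containing one pure material with mean
    value c and noise standard deviation s."""
    return norm.pdf(x, c, s)

def boundary_basis(x, c1, c2, s):
    """Histogram basis for a voxel straddling a boundary between two
    materials: the histogram of a linear ramp from c1 to c2 blurred by
    Gaussian noise (assumes c2 > c1; one common choice of basis)."""
    return (norm.cdf((x - c1) / s) - norm.cdf((x - c2) / s)) / (c2 - c1)

def fit_voxel_histogram(bins, hist, c1, c2, s):
    """Least-squares fit of a voxel histogram as a combination of two
    pure-material bases and one boundary basis (material parameters fixed)."""
    def model(x, a1, a2, a12):
        return (a1 * pure_basis(x, c1, s) + a2 * pure_basis(x, c2, s)
                + a12 * boundary_basis(x, c1, c2, s))
    popt, _ = curve_fit(model, bins, hist, p0=[0.3, 0.3, 0.3], bounds=(0.0, 1.0))
    return popt   # estimated material / mixture proportions for this voxel
```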
A second method is geometrically more accurate and parameterizes the histogram model by the distance from a voxel to a material boundary. When the distance is large, the histogram is approximately a normal distribution for one of the constituent materials. As the distance approaches zero, the histogram model lies between the normal distributions for the constituent materials. The most-likely combination of materials is chosen from the set of possible pairs.
The results of the classification step are tailored to make extraction of surface boundaries between solid object parts more accurate. This framework allows users to more easily create geometric models with internal structure and with a high level of detail. We demonstrate this with a series of geometric models and images created from MRI data. Applications exist in a variety of fields including computer graphics modeling, biological modeling, anatomical and physiological studies, medical diagnosis, CAD/CAM, robotics, and computer animation.
Bayesian model monitoring has been applied to two mixture problems. At the U.S. Census Bureau, computerized record linkage is used in the evaluation of the decennial census. The observed data are a ten-dimensional contingency table which is really a combination of observations from three types of households. The mixture approach is extremely effective in separating these groups. Bayesian methods allow estimation of error rates and comparisons of alternative log-linear models.
The second problem arises in a psychological study. Individuals who were classified as inhibited or uninhibited at 21 months of age have been measured again at 13 years. The goal is to find variables among the new set that delineate the two original groups. The data are modeled as a mixture of multivariate normals with covariance structures suggested by the psychologists. Bayesian methods are important because sample sizes are small and some data are missing, and because model monitoring ideas can be used to assess the groups found using various models.
In this contribution, an alternative approach is presented; in place of the Monte Carlo approximation for the E-step, a stochastic approximation procedure is used, leading to the Stochastic Approximation EM algorithm (SAEM). Given the current guess of the parameters, missing data are simulated from the conditional predictive distribution. These simulated missing data are then used to update the expected value of the complete-data log-likelihood. Moreover, it is proved that, under mild additional conditions, the attractive stationary points correspond to maxima of the incomplete-data likelihood: convergence toward saddle points is avoided with probability one. Simple illustrative examples of applications of the SAEM algorithm and of the associated convergence results are presented to support our findings.
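A minimal SAEM sketch for a two-component normal mixture with common variance is given below (purely illustrative: the 1/m step-size schedule and the starting values are arbitrary choices, and the sketch assumes each component receives at least one simulated label at every iteration).

```python
import numpy as np

def saem_two_normals(x, n_iter=500, seed=0):
    """SAEM for a two-component normal mixture with common variance: the
    E-step expectation is replaced by one simulation of the missing labels
    plus a stochastic-approximation update of the sufficient statistics."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi, mu, sig = 0.5, np.array([x.min(), x.max()]), x.std()
    s = np.zeros((3, 2))          # per-component counts, sums, sums of squares
    for m in range(1, n_iter + 1):
        # simulation step: draw labels from their current conditional distribution
        d0 = pi * np.exp(-0.5 * ((x - mu[0]) / sig) ** 2)
        d1 = (1 - pi) * np.exp(-0.5 * ((x - mu[1]) / sig) ** 2)
        z = rng.random(n) < d1 / (d0 + d1)
        stats = np.array([[(~z).sum(), z.sum()],
                          [x[~z].sum(), x[z].sum()],
                          [(x[~z] ** 2).sum(), (x[z] ** 2).sum()]])
        # stochastic approximation step with decreasing step sizes gamma_m
        gamma = 1.0 / m
        s += gamma * (stats - s)
        # maximisation step based on the smoothed statistics
        pi = s[0, 0] / s[0].sum()
        mu = s[1] / s[0]
        sig = np.sqrt((s[2].sum() - (s[1] ** 2 / s[0]).sum()) / s[0].sum())
    return pi, mu, sig
```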
To calculate the Bayes factor, the parametric model is assigned probability zero under the nonparametric model. However, when we restrict consideration to the finite sample of data that is observed, we find that the joint marginal distribution of the data under the parametric model is assigned positive probability under the MDP model. Since the Bayes factor comparing the two models is calculated based on the finite sample of data actually obtained, we can treat the models as nested models in which the smaller receives positive probability. The Bayes factor is given by a simple expression and can be computed with an evaluation of the posterior under the MDP model.
Due to the extraordinarily large number of components in the mixture induced by the Dirichlet process, the MDP models are fit with modern Markov chain Monte Carlo methods. The output of such a chain may be used in several ways to estimate the Bayes factor between the two models. In the case of a continuous base measure for the Dirichlet process, the simplest estimate would be based on a tabulation of how often the mixture has a number of components equal to the number of data values. Such an estimator performs quite poorly because this event is assigned very small probability under both the prior and the posterior. We are then faced with the classic problem of estimating a small probability, a difficult task even with a long run of the chain. More sophisticated estimates are based on the notion of Rao-Blackwellization. The trick to creating a good estimator is in creating an effective Rao-Blackwellization.
In some instances, the Markov chain used to fit the MDP model may be sped up by integrating over portions of the state space. When this is feasible, fitting the MDP model may take place on a finite-state-space Markov chain. In this case, the tabulation method would count the number of times that a particular, single state is visited. The idea of Rao-Blackwellization as used here is to partition the state space so that expectations can be evaluated over elements of the partition, rather than evaluating the function at a single point. The version of Rao-Blackwellization used here splits the single state of interest, attaching a portion of it to each of the other states in the state space. This technique is being investigated in other circumstances by MacEachern and Peruggia. Preliminary investigations show estimates based on this Rao-Blackwellization to be extremely accurate.
In progeny such as back-crosses or F2, the Mendelian theory of gene segregation gives the theoretical proportions of the mixture, but leads to a non-regular likelihood. The asymptotic distribution of the maximum likelihood ratio test (MLRT) statistic, under the null hypothesis, is obtained by Taylor expansion up to the classical second order. Contiguity arguments can be used to obtain the asymptotic power for well-chosen local alternatives.
The rapid advancement of molecular technology has made it possible to use genetic markers to map genes on the chromosome. Mendelian theory allows one to obtain, for the conditional distribution of the data given the marker information, a mixture of distributions in which the parameters of interest are the location and the effect of the putative gene (called a Quantitative Trait Locus, QTL).
We focus attention on the asymptotic distribution of the MLRT for QTL having small effects, that is, QTL detected with power ranging from 20% to 90%. For these local alternatives, approximate thresholds for dense or sparse maps are obtained, and an unbiased confidence interval for the QTL location is constructed using asymptotically similar statistics.
The HMM output probability (likelihood) P(S | Lambda) is defined to be a function of the model parameters Lambda and the speech inputs S, i.e., P(S | Lambda) = Psi(S, Lambda). Based on the functional dependence of the HMM's likelihood on the model parameters { Lambda } and the inputs { S }, the Baum-Welch inversion of an HMM can be derived. More specifically, the Baum-Welch reestimation of an HMM finds the model parameters Lambda based on a fixed set of speech inputs { S }, while the inversion of an HMM finds speech inputs { S } that optimize some criterion with given model parameters { Lambda }. The Baum-Welch inversion is a dual procedure to the Baum-Welch reestimation algorithm. Both Baum-Welch inversion and reestimation try to maximize the model probability P(S | Lambda); one estimates the inputs { S } while the other estimates the model parameters { Lambda }.
The proposed Baum-Welch HMM inversion is applied to robust speech recognition tasks by moving (reestimating) the input speech features toward the means of the Gaussian mixtures of the hypothesized (or target) class with appropriate constraints. Among the many possible constraints, the {\em robustness bound constraint}, which analytically bounds the allowable movements of LPC cepstral coefficients, is adopted for "robust" Baum-Welch inversion classification. Although the robustness bound constraint mitigates problems caused by the "affine phenomenon", which promotes high similarity of geometrical shape between the newly moved noisy speech features and the Gaussian mixture means of any target class, it cannot prevent the affine phenomenon completely. More specifically, the original temporal and correlated structure of the noisy speech features can be destroyed during the movement guided by the Gaussian mixtures, which do not represent any specific meaningful speech utterance due to the strong temporal and ensemble averaging in Baum-Welch reestimation training. To overcome these problems, scaled robust Baum-Welch HMM inversion, which prescales the cepstral coefficients to compensate for the frequently encountered norm-shrinkage effect, is proposed so that Baum-Welch inversion is used minimally and only where critical.
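To convey the flavour of constrained feature movement, the sketch below takes a single Gaussian mixture in place of the full HMM state sequence and moves a feature vector uphill on its log-density, clipping the displacement to a bound. The step size and bound are illustrative assumptions, not the robustness bound derived in the paper.

```python
import numpy as np

def constrained_inversion_step(x, means, covs, weights, step=0.1, bound=0.5):
    """One illustrative 'inversion' step: move feature vector x (shape (D,))
    uphill on the log-density of a Gaussian mixture (means: (K, D),
    covs: (K, D, D), weights: (K,)), with the displacement norm clipped."""
    diffs = x - means                                   # (K, D)
    prec = np.array([np.linalg.inv(c) for c in covs])   # (K, D, D)
    quad = np.einsum('kd,kde,ke->k', diffs, prec, diffs)
    logdet = np.array([np.linalg.slogdet(c)[1] for c in covs])
    logp = np.log(weights) - 0.5 * (quad + logdet)
    r = np.exp(logp - logp.max())
    r /= r.sum()                                        # component responsibilities
    # gradient of the log mixture density with respect to x
    grad = -np.einsum('k,kde,ke->d', r, prec, diffs)
    move = step * grad
    norm = np.linalg.norm(move)
    if norm > bound:                                    # bound the allowable movement
        move *= bound / norm
    return x + move
```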
Furthermore, it is our observation that the MINIMAX classification technique is good for quickly adjusting the model to the global shape of the test speech, while inversion classification is good for gradually adjusting the test speech to the finer details of the target model. "Batch" and "sequential" combination techniques are proposed to take advantage of both the MINIMAX and the inversion properties. In the batch combination technique, completion of the MINIMAX optimization is followed by the inversion optimization, and the model which yields the highest likelihood is declared the winner. In the sequential combination technique, each iteration consists of one step of MINIMAX optimization and one step of inversion optimization. The latter can be interpreted as an Expectation-Maximization (EM) procedure commonly used for solving maximum likelihood problems with "incomplete" data. The performance of the proposed HMM inversion, in conjunction with HMM reestimation, for robust speech recognition under additive noise corruption and microphone mismatch conditions compares favorably with other noisy speech recognition techniques, such as the projection-based first-order cepstrum normalization (FOCN) and the MINIMAX classification technique.
Even though the model is parametric, the parameters are chiefly used as a device: a predictive approach is taken and the conclusions of the analysis are stated in terms of the predictive distribution of the next observable, unconditional on the number of components in the mixture. Instrumental to its computation is the derivation of the posterior distribution of the number of mixture components, which is of interest in itself in some applications.
The quantities of interest cannot be computed exactly, because they involve sums of $k^n$ terms, with $n$ the data sample size and $k$ the number of mixture components. They are estimated by means of Markov chain Monte Carlo methods. Specifically, the posterior distribution of the unobserved membership vector, specifying which of $k$ components generated each of the $n$ observations, is sampled employing a Gibbs sampling algorithm modified with a Metropolis step designed to swiftly move the sampling chain around the $k!$ modes of the posterior. The marginal density of the observed data, conditional on the number of components, is estimated by applying a novel technique to the simulation results. A subsequent application of Bayes theorem yields an estimate of the posterior distribution of the number of components. Special attention is paid to the variability of the Monte Carlo estimates.
Some examples illustrate the flexibility of the model.
Here we present the application of a density model of an image, based on a Partitioned Mixture Distribution (PMD) network, to the task of texture classification.
The PMD network consists of a number of partially overlapping mixture distributions. However, the main difference between PMD networks and mixture distributions, and what makes the PMD networks attractive in image processing, is that PMD networks scale well to large images.
The problem addressed here is the following: given a training set of N images from K texture classes, train a texture classifier to classify a test image into one of the K texture classes. This is a type of supervised learning problem, as both the training image (input) and the corresponding class membership (output) are available in the training set.
We have here adopted a predictive approach to texture classification. The network is trained using a re-estimation type algorithm, which maximises the log predictive probability of the training image belonging to the given texture class.
To illustrate the effectiveness of partitioned mixture distribution networks in modelling texture statistics, we test the performance of the PMD network on classification of Brodatz textures.
In addition, we demonstrate how the network-type structure of the PMD can be exploited to produce probability images, which can be used as a crude form of anomaly detector.
The normal regression mixture model analysis was performed using software which employs the EM algorithm for parameter estimation and the Gauss-Newton method for standard error estimation.
Key words: Mixture, dose-response, binary outcomes, extra binomial
For the location parameters, we choose a prior that is ``partially proper'' in the sense that the marginal distribution of each component mean is flat, but the conditional distribution, given the means of the other components, is proper. Consequently, no subjective information is required for the overall location of the component means. A similar technique is used to obtain conditionally proper priors for the scale parameters which are marginally equal to the usual reference prior.
To find the posterior density for the number of components in the mixture we use the Schwarz criterion. Although the approximation is not formally justifiable by the methods of Kass and Wasserman, our simulations suggest that it is quite good.
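The Schwarz approximation to the posterior over the number of components can be written in a few lines; a minimal sketch follows (the per-model maximised log-likelihoods, parameter counts, and prior over k are supplied by the user).

```python
import numpy as np

def schwarz_posterior(logliks, n_params, n, prior=None):
    """Approximate posterior over the number of components using the Schwarz
    criterion: log p(data | k) ~ loglik_k - 0.5 * d_k * log(n), where
    logliks[k] is the maximised log-likelihood of the k-th candidate model,
    n_params[k] its number of free parameters, and n the sample size."""
    logliks = np.asarray(logliks, float)
    n_params = np.asarray(n_params, float)
    log_marg = logliks - 0.5 * n_params * np.log(n)
    if prior is not None:
        log_marg = log_marg + np.log(np.asarray(prior, float))
    log_marg -= log_marg.max()          # stabilise before exponentiating
    post = np.exp(log_marg)
    return post / post.sum()
```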
Simulations demonstrate that the method performs remarkably well and preliminary investigations indicate the resulting density estimates are consistent for a class of models much broader than the class of finite normal mixture models.
The models and techniques are then applied to two different problems: biodiversity assessment (investigating the density variation of trees in Duke forest by modeling their locations with gamma Poisson mixture models) and nonparametric multivariate density estimation (generalizing recent work of Wolpert and Lavine).
We propose a modified EM algorithm, in which Markov chain Monte Carlo (MCMC) techniques are employed to carry out the updating procedure. At step $m$, MCMC simulation is performed to generate a sample $\{ \mathbf{S}_{j}^{(m)} \}$ from $p(\mathbf{S} \mid \mathbf{d}, \mathbf{z}^{(m-1)})$. $\mathbf{z}^{(m)}$ is then updated to be $\mathbf{z}^{(m-1)}-\epsilon \cdot \sum_{j} \nabla_{\mathbf{z}} \log p(\mathbf{z}, \mathbf{d} \mid \mathbf{S}_{j}^{(m)})$, where $\epsilon$ is a pre-determined constant. This algorithm is akin to the so-called generalized EM (GEM) algorithm in the sense that at each iteration the predictive density value is merely increased but not necessarily maximised. The asymptotic properties of the EM algorithm can however be shown to remain valid for the GEM algorithm.
There has recently been considerable interest in embedding the principle of mixture models (experts) into the neural network learning paradigm. We present an example in which a mixture distribution network is trained using the proposed algorithm for the purpose of prediction. The results are compared with those obtained using existing methods.
E-mail: azeevi@techunix.technion.ac.il
This paper is not yet available via www or ftp; for further information please contact: Assaf Zeevi, e-mail: azeevi@techunix.technion.ac.il.