A simple way of representing the mass-points as part of the GLM is described, allowing straightforward ML estimation without parametric model assumptions for the omitted variable. The approach is thus competitive with quasi-likelihood approaches.
A GLIM4 implementation is briefly described, and illustrated with a number of examples of overdispersed GLMs.
In the second part the approach in the first part is extended to two-level variance component models in the exponential family. Essentially the same computational approach gives the NPML estimate of the random-effect distribution and so gives full ML estimation of the GLM parameters in the variance component model without any model assumption for the random-effect distribution. The approach is thus competitive with GEE approaches, though it is a conditional rather than a marginal approach.
A GLIM4 implementation is briefly described and illustrated with several examples of variance component and longitudinal analysis with binary data.
The outline of the estimation can be described as follows. The model distribution $g(x)$ is approximated in advance by the mixture model $\sum_i\pi_i h_i(x,\theta_i)$ with a sufficiently large number of components. The model has a hierarchical mixture structure; that is, $h$ is a substructure of $g$. However, the estimation of $a$ and $b$ is not trivial because they are parameters common to all the $h_i$'s. If we adopt the normal distribution as $h_i$ and optimize $a$ and $b$ separately, we can obtain explicit recurrence formulae for $a$ and $b$.
Another aspect of image understanding is attention. If there are too many objects in an image, we should restrict the region of estimation based on some basic features of the image. In that case, the data outside the region become censored data. We can also apply the EM algorithm to this missing value problem. One point that differs from the original EM principle is that the number of missing data is itself missing; namely, the number of censored data related to the objects inside the region is unknown. Therefore, we estimate the number of missing data in each EM step. Suppose the current fitted model is $p(x,\theta_0)$; the number of missing data can then be approximated by $n(1-p)/p$, where $n$ is the number of data points in the region and $p=\int_C p(x,\theta_0)\,dx$ is the integral over the region. This algorithm corresponds to a combination of Hartley's (1958) algorithm and the EM algorithm.
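As a concrete illustration of this idea (a minimal sketch, not the authors' implementation), the following applies it to a one-dimensional normal model observed only on the region $C=(-\infty,c]$: at each iteration the number of censored points is re-estimated as $n(1-p)/p$ and folded into the M-step.

```python
import numpy as np
from scipy.stats import norm

def em_censored_normal(x_obs, c, n_iter=200):
    """EM in the spirit of Hartley (1958): a N(mu, sigma^2) sample is observed
    only on C = (-inf, c]; the number of points falling outside C is unknown
    and is re-estimated at every iteration as n*(1-p)/p."""
    n = len(x_obs)
    mu, sigma = x_obs.mean(), x_obs.std()
    for _ in range(n_iter):
        alpha = (c - mu) / sigma
        p = norm.cdf(alpha)                             # current model mass inside C
        m = n * (1.0 - p) / p                           # estimated number of missing points
        lam = norm.pdf(alpha) / (1.0 - p)               # inverse Mills ratio
        e1 = mu + sigma * lam                           # E[X | X > c]
        e2 = mu**2 + sigma**2 + sigma * lam * (c + mu)  # E[X^2 | X > c]
        # M-step: combine observed data with m "expected" censored points
        mu_new = (x_obs.sum() + m * e1) / (n + m)
        var_new = (np.sum(x_obs**2) + m * e2) / (n + m) - mu_new**2
        mu, sigma = mu_new, np.sqrt(var_new)
    return mu, sigma
```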
The most striking feature of our general results is that the study of the asymptotic behaviour of the estimators is equivalent to the study of the approximation of an operator canonically associated to the family ${\cal P}$, and is also equivalent to a classical approximation problem in functional analysis.
This great generality allows us to prove central limit type theorems for our estimators. We place special emphasis on necessary and sufficient conditions for the estimators to be efficient. In practical cases, the estimators are easy to compute using linear programming. The general theory is shown to be widely applicable through 11 examples, including mixtures of exponentials, uniforms, normals, gammas, etc.
A natural line of attack upon this problem is through the "intrinsic Bayes factor" or "fractional Bayes factor" methodologies. These methodologies were designed for scenarios in which the same difficulty - that standard noninformative priors cannot be used - is encountered. The application of these approaches to the mixture model problem will be discussed, with computational issues being emphasized.
This paper is based on the extended work of Bozdogan (1981, 1983), where the information-theoretic approach via Akaike's (1973) Information Criterion (AIC) was first introduced and proposed for choosing the number of component clusters in mixture-model cluster analysis. This paper therefore considers the problem of choosing the number of component clusters of individuals within the context of the standard mixture of multivariate normal distributions, and presents some new results.
A common problem in all clustering techniques is the difficulty of deciding on the number of clusters present in a given data set, that is, of assessing cluster validity. How do we determine which variables best discriminate between the clusters as we simultaneously estimate the number of component clusters? How do we determine the outliers or extreme observations across the clustering alternatives? These are some fundamental questions confronting practitioners and research workers in classification and clustering.
Our objective here is to identify and describe the class distribution using a sample drawn from the mixture model, and estimate K, the number of clusters such that k

We also give numerical examples based on simulated multivariate normal data sets with a known number of clusters to illustrate the significance of model selection criteria in choosing the number of clusters and the best fitting model. These procedures take into account simultaneously the lack-of-fit of a cluster, the number of parameters, the sample size, and the complexity of the increased number of clusters to achieve the best fit.
One aspect of the immune response involves the recognition of an antigen in the blood by T cells whose surfaces contain receptors which are complementary, in a suitable sense, to part of that specific antigen. The T cells respond by emitting chemical signals, stimulating other cells to replicate, or replicating themselves.
The assay under study seeks to estimate the number of T cells in a blood sample which respond to each of the two test antigens, by measuring cell replication in a set of aliquots from the blood sample. The aliquots are placed into the wells of a microtiter plate. Some of the wells are treated with one of the two test antigens, some are treated with an antigen from tetanus, which serves as a positive control antigen, and some are treated with the chemical PHA, which, in causing nearly all cells to begin replication, serves as a positive control in the assay. A set of wells receives neither antigen nor PHA; these wells serve as negative controls and allow us to estimate the background level of cell replication. Cell replication in the wells is observed by measuring, using a scintillation counter, the amount of radioactively-labelled thymidine which is incorporated into the cells' DNA.
The traditional method of analyzing data from this type of assay involves declaring each well on the plate to be either positive or negative, according to whether its scintillation count exceeds some chosen cutoff. The average number of responding cells per well is then estimated by -log( # negative wells / # wells ). A key problem with this approach can be the objective determination of a suitable cutoff for the counts corresponding to negative wells. In addition, the above estimator for the average number of responding cells per well may be shown to be inconsistent, under quite general assumptions.
We have proposed a new method of analyzing the data from this assay, modelling the scintillation counts as coming from a mixture distribution. A central idea in our approach is that the size of the scintillation count for a well contains information about the number of responding cells it contains, and not just about whether or not it contains any responding cells.
Let y denote a suitable transformation of the scintillation count for a well, and let k denote the (unobserved) number of responding cells in that well. Our model is that k is Poisson distributed, and that y, given k, is normally distributed with mean a + b k and variance sigma^2, where a, b, and sigma^2 are constant across the wells on a plate. We seek estimates of the underlying Poisson means for the group of wells which received no antigen, and for the groups which received one of the test antigens.
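To illustrate how such a model might be fitted, here is a minimal EM sketch for a single group of wells (an illustration only, not the authors' software; the truncation k_max and the crude starting values are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm, poisson

def em_poisson_normal(y, k_max=50, n_iter=200):
    """EM fit of the assay model for one group of wells:
    k ~ Poisson(lam), y | k ~ N(a + b*k, sig^2), with k unobserved."""
    lam, a = 1.0, y.min()
    b = sig = max(y.std(), 1e-3)
    ks = np.arange(k_max + 1)
    for _ in range(n_iter):
        # E-step: posterior weights w[i, k] = P(k | y_i) on a truncated support
        logw = poisson.logpmf(ks, lam) + norm.logpdf(y[:, None], a + b * ks, sig)
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: lam is the average posterior mean of k
        lam = (w @ ks).mean()
        # weighted least squares of y on k gives the intercept a and slope b
        sw, swk, swk2 = w.sum(), (w * ks).sum(), (w * ks**2).sum()
        swy, swky = (w * y[:, None]).sum(), (w * ks * y[:, None]).sum()
        b = (swky - swk * swy / sw) / (swk2 - swk**2 / sw)
        a = (swy - b * swk) / sw
        sig = np.sqrt((w * (y[:, None] - a - b * ks) ** 2).sum() / sw)
    return lam, a, b, sig
```

In practice a, b, and sig would be shared across all well groups on a plate, with a separate Poisson mean for each group, as described above.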
The effectiveness of this new approach has been demonstrated using two dilution series, in which a blood sample is diluted to a sequence of decreasing cell densities, with the assay being applied to each dilution.
Areas for further research include the robustness of the normality assumption and the determination of a suitable transformation of the scintillation counts. We are also looking at other, quite different approaches.
We analyze a data set concerning flows (i.e. drop out, repeat and pass) in the Italian school system over the period 1963 -- 1992. The data takes the form of I related multinomial time series, one for each grade i=1,...,I, with a hierarchy defined by different school types, gender, region etc. To implement inference and prediction for this data we propose a class of conditionally Gaussian dynamic models for the I multinomial time series y_it, i=1,...,I, t=1,...,N. As a prior probability model for the corresponding (transformed) multinomial parameters p_it we use a multivariate normal dynamic linear model.
The I series, p_it, i=1,...,I, are assumed to have similar shapes. This is modeled using a parameterization which chooses one series, say i=1, as the reference series and models all the other series in terms of p_1t plus a shift r_it relative to it: E(p_it | th_t) = E(p_1t | th_t) + r_it, for i=2,...,I. Here th_t is the total parameter vector.
It may happen, due to changes in the school system, political interventions, changes in the labour market, or other outside events, that one particular series deviates from the general trend for some time intervals; the series thus becomes an ``outlier'' over those time intervals. We introduce a sequence of latent indicator variables s_it, with s_it = 1 if series i is an outlier at time t and s_it = 0 otherwise. Whenever the series becomes an outlier, i.e. s_it = 1, we inflate the variance-covariance matrix in the evolution equation for r_it by scaling it with a parameter v_i. The indicator variables s_it are a priori a sequence of stationary Bernoulli random variables with Markovian evolution. The scale parameters v_i, as well as the parameters of the Bernoulli distributions for the s_it series, are assumed to be stochastic, with prior distributions suitably specified in the model.
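To make the role of the indicators concrete, here is a small simulation sketch of one shift series under this structure (the scalar random-walk evolution, the innovation variance W, and the transition probabilities p01 and p10 are illustrative assumptions, not part of the model specification above):

```python
import numpy as np

def simulate_shift_with_outliers(T, W, v, p01, p10, seed=0):
    """Simulate one shift series r_t with a latent two-state Markov outlier
    indicator s_t: the random-walk innovation variance W is inflated by the
    factor v whenever s_t = 1 (the series is an 'outlier' at time t)."""
    rng = np.random.default_rng(seed)
    r = np.zeros(T)
    s = np.zeros(T, dtype=int)
    for t in range(1, T):
        # Markovian evolution of the indicator: P(0 -> 1) = p01, P(1 -> 0) = p10
        p_one = p01 if s[t - 1] == 0 else 1.0 - p10
        s[t] = int(rng.random() < p_one)
        scale = v if s[t] == 1 else 1.0
        r[t] = r[t - 1] + rng.normal(0.0, np.sqrt(scale * W))
    return r, s

# e.g. r, s = simulate_shift_with_outliers(T=100, W=0.01, v=25.0, p01=0.05, p10=0.3)
```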
We implement inference in the proposed model by Markov chain Monte Carlo simulation. The scheme is similar to the one used in current research by Cargnoni, Mueller and West, except for the additional latent variables, s_it, v_i and the Bernoulli parameters, which require additional layers to the Gibbs sampler used there. A description of the modified scheme after the inclusion of the new parameters is given in detail.
(n = 1,N) where g_n is random noise. The data we receive, for each observation point, is the set {t_n} with the labels n removed. Hence at each observation point, we know that each function produces a data point but we do not know which data point corresponds to which function.
Using approximate permutation matrices combined with a mixture model, we infer the functions {y_n} between the observation points {x_k}. We then apply this framework to the problem of band structure determination. This produces dramatically improved densities of states using limited data.
In this talk, an alternative comprehensive framework for mixture problems will be proposed, which allows fixed or variable numbers of components, and also embraces related clustering problems. MCMC samplers will be described for the models, drawing largely on recent work on a new class of reversible Markov chain samplers that can jump between parameter subspaces of differing dimensionality. The methods are flexible and entirely constructive, and are promising generally for computation in Bayesian model determination problems.
Both models and samplers will be compared with existing statistical and computational methodology, and the methods applied to various illustrative examples, including standard datasets and data arising from enzymatic activity, in which the possible components of the mixture have substantive interpretation in terms of genetic polymorphism.
The same problem arises in investigating statistical evidence of boy preference in Asian countries, such as Viet Nam. By boy preference, it is meant here that parents prefer a boy to a girl, and will tend to try to have children until the desired boy is born. Such behavior is known to be prevalent in China, among other countries. There are a number of ways of testing for boy preference; some statistical evidence that this behavior is indeed prevalent in Viet Nam has been shown in Haughton and Haughton (1994). One of the approaches is to build a logistic model for "parity progression" in completed families: for example, one would model the probability that a woman with two children will give birth to an additional one (parity three). If the existence of a son in the family has a negative (statistically significant) effect on the probability of an additional child (when other variables, such as education, income, and so on, are controlled for), this constitutes evidence for boy preference. But, as in the Murdoch and Stern work, perhaps some families do exhibit some boy preference behavior, and others don't. So a single logistic regression model might not be suitable for all women with completed families.
We present here an analysis of data on parity progression and contraceptive use from the Viet Nam Living Standard Survey involving a mixture of two logistic regressions. The question of whether a single logistic regression model is preferable to a model with two logistic regressions will be discussed as well.
The adaptive mixture density estimator (AMDE) constructs a density by placing kernels at all of the observed data. Unlike a fixed-width mixture density estimator (FMDE), which uses kernels of fixed width, an AMDE allows the widths of the kernels to vary from one point to another. Although the AMDE slightly improves the estimation capability of an FMDE, it does not reduce the high computational and memory cost common to FMDEs. To overcome this cost, a robust radial basis function (RBF) based mixture density estimator can be used. In the construction of an RBF density estimator, sequential and batch clustering algorithms are commonly used to determine the parameters associated with the deployed mixtures (e.g., the mean vectors and covariance matrices of Gaussian mixtures). These clustering algorithms perform poorly in the presence of probabilistically occurring outlying data or data with large variations in dynamic range among dimensions, the latter making the clustering highly sensitive to the selection of distance measures. To overcome these difficulties in constructing an RBF density estimator, a statistical data-sphering technique combined with a centroid-splitting generalized Lloyd clustering technique (also known as the LBG algorithm) can be used to construct the robust RBF density estimator.
Although the robust RBF density estimator construction technique can overcome some of the difficulties encountered in using conventional RBF density estimators, it still cannot overcome the drawback that the estimator's performance is too sensitive to the settings of some control parameters, e.g., the number of kernel mixtures used, the locations of kernels, the orientation of kernels, the kernel smoothing parameters, the exclusion threshold radius for data sphering, the size of the training data, etc. We are thus motivated to study the statistical projection pursuit density estimation technique. In contrast to locally tuned mixture kernel methods, where data are analyzed directly in high-dimensional space in the vicinity of the kernel centers, a projection pursuit method globally projects the data onto one- or two-dimensional subspaces and analyzes the projected data in these low-dimensional subspaces to construct the multivariate density. More specifically, projection pursuit first defines some index of interest of a projected configuration (instead of using the variance adopted by principal component analysis) and then uses a numerical optimization technique to find the projections of most interest. The projection index adopted for density estimation is the degree of departure of the projected data from normality.
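A minimal sketch of the projection pursuit step follows, using a simple moment-based departure-from-normality index as a stand-in for the index actually adopted; the function names, the sphering step, and the Nelder-Mead search with random restarts are illustrative choices only.

```python
import numpy as np
from scipy.optimize import minimize

def projection_index(w, X):
    """Moment-based departure-from-normality index of the data projected
    onto direction w (zero when the first four moments are Gaussian)."""
    w = w / np.linalg.norm(w)
    z = X @ w
    z = (z - z.mean()) / z.std()
    skew = np.mean(z**3)
    kurt = np.mean(z**4) - 3.0
    return skew**2 + 0.25 * kurt**2

def most_interesting_direction(X, n_restarts=10, seed=0):
    """Numerically search for the unit direction maximising the index."""
    rng = np.random.default_rng(seed)
    # sphere the data first, as described above
    Xc = X - X.mean(axis=0)
    L = np.linalg.cholesky(np.linalg.inv(np.cov(Xc, rowvar=False)))
    Xs = Xc @ L
    best_w, best_val = None, -np.inf
    for _ in range(n_restarts):
        w0 = rng.standard_normal(X.shape[1])
        res = minimize(lambda w: -projection_index(w, Xs), w0, method="Nelder-Mead")
        if -res.fun > best_val:
            best_val, best_w = -res.fun, res.x / np.linalg.norm(res.x)
    return best_w, best_val
```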
Performance evaluations using training data from mixture Gaussian and mixture Cauchy densities are presented. The results show that the curse of dimensionality and the sensitivity of control parameters have a much more adverse impact on the mixture density estimators than on the projection pursuit density estimator.
We present new methods with the following features:
The methods use the assumption that our object has ``unmixed'' solid volumes of each specific material at the scale of our sampling rate. The mixtures arise from the sampling process and are most evident near boundaries between materials. The schemes differ in how the mixing is parameterized.
The tissue classification techniques treat a voxel as a small region of space that can have different values at different positions. Each voxel may contain several materials. Based on the sampling theorem, we reconstruct a continuous function, F, from the sampled data and calculate a local histogram, H, of F over each voxel. For each voxel we fit parameterized mixture basis functions to the local histogram and use Bayesian techniques to find a set of parameters that best matches the histogram. The parameters describe the mixture of materials within a voxel given the assumptions of the model.
One method is faster and parameterizes each histogram as a linear combination of normal distributions representing pure materials and mixture basis functions representing the regions near boundaries between pure materials.
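A sketch of what such a parameterization might look like for scalar-valued data and two pure materials is given below. The box-filtered-ramp form of the boundary basis and the least-squares fit are assumptions made for illustration; the method described above fits the parameters with Bayesian techniques, and the proportions below are not constrained to sum to one.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import curve_fit

def pure_basis(x, c, s):
    """Histogram basis for a voxel containing one pure material with mean
    value c and noise standard deviation s."""
    return norm.pdf(x, c, s)

def boundary_basis(x, c1, c2, s):
    """Histogram basis for a voxel straddling a boundary between two
    materials: the histogram of a linear ramp from c1 to c2 blurred by
    Gaussian noise (assumes c2 > c1; one common choice of basis)."""
    return (norm.cdf((x - c1) / s) - norm.cdf((x - c2) / s)) / (c2 - c1)

def fit_voxel_histogram(bins, hist, c1, c2, s):
    """Least-squares fit of a voxel histogram as a combination of two
    pure-material bases and one boundary basis (material parameters fixed)."""
    def model(x, a1, a2, a12):
        return (a1 * pure_basis(x, c1, s) + a2 * pure_basis(x, c2, s)
                + a12 * boundary_basis(x, c1, c2, s))
    popt, _ = curve_fit(model, bins, hist, p0=[0.3, 0.3, 0.3], bounds=(0.0, 1.0))
    return popt   # estimated material / mixture proportions for this voxel
```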
A second method is geometrically more accurate and parameterizes the histogram model by the distance from a voxel to a material boundary. When the distance is large, the histogram is approximately a normal distribution for one of the constituent materials. As the distance approaches zero, the histogram model lies between the normal distributions for the constituent materials. The most-likely combination of materials is chosen from the set of possible pairs.
The results of the classification step are tailored to make extraction of surface boundaries between solid object parts more accurate. This framework allows users to more easily create geometric models with internal structure and with a high level of detail. We demonstrate this with a series of geometric models and images created from MRI data. Applications exist in a variety of fields including computer graphics modeling, biological modeling, anatomical and physiological studies, medical diagnosis, CAD/CAM, robotics, and computer animation.
Bayesian model monitoring has been applied to two mixture problems. At the U.S. Census Bureau, computerized record linkage is used in the evaluation of the decennial census. The observed data are a ten-dimensional contingency table which is really a combination of observations from three types of households. The mixture approach is extremely effective in separating these groups. Bayesian methods allow estimation of error rates and comparisons of alternative log-linear models.
The second problem arises in a psychological study. Individuals who were classified as inhibited or uninhibited at 21 months of age have been measured again at 13 years. The goal is to find variables among the new set that delineate the two original groups. The data are modeled as a mixture of multivariate normals with covariance structures suggested by the psychologists. Bayesian methods are important because sample sizes are small and some data are missing, and because model monitoring ideas can be used to assess the groups found using various models.
In this contribution, an alternative approach is presented; in place of the Monte Carlo approximation for the E-step, a stochastic approximation procedure is used, leading to the Stochastic Approximation EM algorithm (SAEM). Given the current guess of the parameters, missing data are simulated from the conditional predictive distribution. These simulated missing data are then used to update the expected value of the complete-data log-likelihood. Moreover, it is proved that, under mild additional conditions, the attractive stationary points correspond to maxima of the incomplete-data likelihood: convergence toward saddle points is avoided with probability one. Simple illustrative examples of applications of the SAEM algorithm and of the associated convergence results are presented to support our findings.
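A minimal SAEM sketch for a two-component normal mixture with common variance is given below (purely illustrative: the 1/m step-size schedule and the starting values are arbitrary choices, and the sketch assumes each component receives at least one simulated label at every iteration).

```python
import numpy as np

def saem_two_normals(x, n_iter=500, seed=0):
    """SAEM for a two-component normal mixture with common variance: the
    E-step expectation is replaced by one simulation of the missing labels
    plus a stochastic-approximation update of the sufficient statistics."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi, mu, sig = 0.5, np.array([x.min(), x.max()]), x.std()
    s = np.zeros((3, 2))          # per-component counts, sums, sums of squares
    for m in range(1, n_iter + 1):
        # simulation step: draw labels from their current conditional distribution
        d0 = pi * np.exp(-0.5 * ((x - mu[0]) / sig) ** 2)
        d1 = (1 - pi) * np.exp(-0.5 * ((x - mu[1]) / sig) ** 2)
        z = rng.random(n) < d1 / (d0 + d1)
        stats = np.array([[(~z).sum(), z.sum()],
                          [x[~z].sum(), x[z].sum()],
                          [(x[~z] ** 2).sum(), (x[z] ** 2).sum()]])
        # stochastic approximation step with decreasing step sizes gamma_m
        gamma = 1.0 / m
        s += gamma * (stats - s)
        # maximisation step based on the smoothed statistics
        pi = s[0, 0] / s[0].sum()
        mu = s[1] / s[0]
        sig = np.sqrt((s[2].sum() - (s[1] ** 2 / s[0]).sum()) / s[0].sum())
    return pi, mu, sig
```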
To calculate the Bayes factor, the parametric model is assigned probability zero under the nonparametric model. However, when we restrict consideration to the finite sample of data that is observed, we find that the joint marginal distribution of the data under the parametric model is assigned positive probability under the MDP model. Since the Bayes factor comparing the two models is calculated based on the finite sample of data actually obtained, we can treat the models as nested models in which the smaller receives positive probability. The Bayes factor is given by a simple expression and can be computed with an evaluation of the posterior under the MDP model.
Due to the extraordinarily large number of components in the mixture induced by the Dirichlet process, the MDP models are fit with modern Markov chain Monte Carlo methods. The output of such a chain may be used in several ways to estimate the Bayes factor between the two models. In the case of a continuous base measure for the Dirichlet process, the simplest estimate would be based on a tabulation of how often the mixture has a number of components equal to the number of data values. Such an estimator performs quite poorly because this event is assigned very small probability under both the prior and the posterior. We are then faced with the classic problem of estimating a small probability, a difficult task even with a long run of the chain. More sophisticated estimates are based on the notion of Rao-Blackwellization. The trick to creating a good estimator is in creating an effective Rao-Blackwellization.
In some instances, the Markov chain used to fit the MDP model may be sped up by integrating over portions of the state space. When this is feasible, fitting the MDP model may take place on a finite-state-space Markov chain. In this case, the tabulation method would count the number of times that a particular, single state is visited. The idea of Rao-Blackwellization as used here is to partition the state space so that expectations can be evaluated over elements of the partition, rather than evaluating the function at a single point. The version of Rao-Blackwellization used here splits the single state of interest, attaching a portion of it to each of the other states in the state space. This technique is being investigated in other circumstances by MacEachern and Peruggia. Preliminary investigations show estimates based on this Rao-Blackwellization to be extremely accurate.
In progeny such as back-crosses or F2, the Mendelian theory of gene segregation gives the theoretical proportions of the mixture, but leads to a non-regular likelihood. The asymptotic distribution of the maximum likelihood ratio test (MLRT) statistic, under the null hypothesis, is obtained by Taylor expansion up to the classical second order. Contiguity arguments can be used to obtain the asymptotic power for well-chosen local alternatives.
The rapid advancement of molecular technology has made it possible to use genetic markers to map genes on the chromosome. Mendelian theory allows one to obtain, for the conditional distribution of the data given the marker information, a mixture of distributions in which the parameters of interest are the location and the effect of the putative gene (called a Quantitative Trait Locus, QTL).
We focus attention on the asymptotic distribution of the MLRT for QTL having small effects, that is, QTL detected with power ranging from 20% to 90%. For these local alternatives, approximate thresholds for dense or sparse maps are obtained, and an unbiased confidence interval for the QTL location is constructed using asymptotically similar statistics.
The HMM output probability (likelihood) P(S | Lambda) is defined to be a function of the model parameters Lambda and the speech inputs S, i.e., P(S | Lambda) = Psi(S, Lambda). Based on the functional dependence of the HMM's likelihood on the model parameters { Lambda } and the inputs { S }, the Baum-Welch inversion of an HMM can be derived. More specifically, the Baum-Welch reestimation of an HMM finds the model parameters Lambda based on a fixed set of speech inputs { S }, while the inversion of an HMM finds speech inputs { S } that optimize some criterion with given model parameters { Lambda }. The Baum-Welch inversion is a dual procedure to the Baum-Welch reestimation algorithm. Both Baum-Welch inversion and reestimation try to maximize the model probability P(S | Lambda); one estimates the inputs { S } while the other estimates the model parameters { Lambda }.
The proposed Baum-Welch HMM inversion is applied to robust speech recognition tasks by moving (reestimating) the input speech features toward the means of the Gaussian mixtures of the hypothesized (or target) class with appropriate constraints. Among the many possible constraints, the {\em robustness bound constraint}, which analytically bounds the allowable movements of LPC cepstral coefficients, is adopted for "robust" Baum-Welch inversion classification. Although the robustness bound constraint mitigates problems caused by the "affine phenomenon", which promotes high similarity of geometrical shape between the newly moved noisy speech features and the Gaussian mixture means of any target class, it cannot prevent the affine phenomenon completely. More specifically, the original temporal and correlated structure of the noisy speech features can be destroyed during the movement guided by the Gaussian mixtures, which do not represent any specific meaningful speech utterance due to the strong temporal and ensemble averaging in Baum-Welch reestimation training. To overcome these problems, scaled robust Baum-Welch HMM inversion, which prescales the cepstral coefficients to compensate for the frequently encountered norm-shrinkage effect, is proposed so that Baum-Welch inversion is used minimally and only where critical.
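To convey the flavour of constrained feature movement, the sketch below takes a single Gaussian mixture in place of the full HMM state sequence and moves a feature vector uphill on its log-density, clipping the displacement to a bound. The step size and bound are illustrative assumptions, not the robustness bound derived in the paper.

```python
import numpy as np

def constrained_inversion_step(x, means, covs, weights, step=0.1, bound=0.5):
    """One illustrative 'inversion' step: move feature vector x (shape (D,))
    uphill on the log-density of a Gaussian mixture (means: (K, D),
    covs: (K, D, D), weights: (K,)), with the displacement norm clipped."""
    diffs = x - means                                   # (K, D)
    prec = np.array([np.linalg.inv(c) for c in covs])   # (K, D, D)
    quad = np.einsum('kd,kde,ke->k', diffs, prec, diffs)
    logdet = np.array([np.linalg.slogdet(c)[1] for c in covs])
    logp = np.log(weights) - 0.5 * (quad + logdet)
    r = np.exp(logp - logp.max())
    r /= r.sum()                                        # component responsibilities
    # gradient of the log mixture density with respect to x
    grad = -np.einsum('k,kde,ke->d', r, prec, diffs)
    move = step * grad
    norm = np.linalg.norm(move)
    if norm > bound:                                    # bound the allowable movement
        move *= bound / norm
    return x + move
```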
Furthermore, it is our observation that the MINIMAX classification technique is good for quickly adjusting the model to the global shape of the test speech, while inversion classification is good for gradually adjusting the test speech to the finer details of the target model. "Batch" and "sequential" combination techniques are proposed to take advantage of both the MINIMAX and the inversion properties. In the batch combination technique, completion of the MINIMAX optimization is followed by the inversion optimization, and the model which yields the highest likelihood is declared the winner. In the sequential combination technique, each iteration consists of one step of MINIMAX optimization and one step of inversion optimization. The latter can be interpreted as an Expectation-Maximization (EM) procedure commonly used for solving maximum likelihood problems with "incomplete" data. The performance of the proposed HMM inversion, in conjunction with HMM reestimation, for robust speech recognition under additive noise corruption and microphone mismatch conditions compares favorably with other noisy speech recognition techniques, such as the projection-based first-order cepstrum normalization (FOCN) and the MINIMAX classification technique.
Even though the model is parametric, the parameters are chiefly used as a device: a predictive approach is taken and the conclusions of the analysis are stated in terms of the predictive distribution of the next observable, unconditional on the number of components in the mixture. Instrumental to its computation is the derivation of the posterior distribution of the number of mixture components, which is of interest in itself in some applications.
The quantities of interest cannot be computed exactly, because they involve sums of $k^n$ terms, with $n$ the data sample size and $k$ the number of mixture components. They are estimated by means of Markov chain Monte Carlo methods. Specifically, the posterior distribution of the unobserved membership vector, specifying which of $k$ components generated each of the $n$ observations, is sampled employing a Gibbs sampling algorithm modified with a Metropolis step designed to swiftly move the sampling chain around the $k!$ modes of the posterior. The marginal density of the observed data, conditional on the number of components, is estimated by applying a novel technique to the simulation results. A subsequent application of Bayes theorem yields an estimate of the posterior distribution of the number of components. Special attention is paid to the variability of the Monte Carlo estimates.
Some examples illustrate the flexibility of the model.
Here we present the application of a density model of an image, based on a Partitioned Mixture Distribution (PMD) network, to the task of texture classification.
The PMD network consists of a number of partially overlapping mixture distributions. However, the main difference between PMD networks and mixture distributions, and what makes the PMD networks attractive in image processing, is that PMD networks scale well to large images.
The problem addressed here is the following: given a training set of N images from K texture classes, train a texture classifier to classify a test image into one of the K texture classes. This is a type of supervised learning problem, as both the training image (input) and the corresponding class membership (output) are available in the training set.
We have here adopted a predictive approach to texture classification. The network is trained using a re-estimation type algorithm, which maximises the log predictive probability of the training image belonging to the given texture class.
To illustrate the effectiveness of partitioned mixture distribution networks in modelling texture statistics, we test the performance of the PMD network on classification of Brodatz textures.
In addition, we demonstrate how the network-type structure of the PMD can be exploited to produce probability images, which can be used as a crude form of anomaly detector.
The normal regression mixture model analysis was performed using software which employs the EM algorithm for parameter estimation and the Gauss-Newton method for standard error estimation.
Key words: Mixture, dose-response, binary outcomes, extra binomial
For the location parameters, we choose a prior that is ``partially proper'' in the sense that the marginal distribution of each component mean is flat, but the conditional distribution, given the means of the other components, is proper. Consequently, no subjective information is required for the overall location of the component means. A similar technique is used to obtain conditionally proper priors for the scale parameters which are marginally equal to the usual reference prior.
To find the posterior density for the number of components in the mixture we use the Schwarz criterion. Although the approximation is not formally justifiable by the methods of Kass and Wasserman, our simulations suggest that it is quite good.
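The Schwarz approximation to the posterior over the number of components can be written in a few lines; a minimal sketch follows (the per-model maximised log-likelihoods, parameter counts, and prior over k are supplied by the user).

```python
import numpy as np

def schwarz_posterior(logliks, n_params, n, prior=None):
    """Approximate posterior over the number of components using the Schwarz
    criterion: log p(data | k) ~ loglik_k - 0.5 * d_k * log(n), where
    logliks[k] is the maximised log-likelihood of the k-th candidate model,
    n_params[k] its number of free parameters, and n the sample size."""
    logliks = np.asarray(logliks, float)
    n_params = np.asarray(n_params, float)
    log_marg = logliks - 0.5 * n_params * np.log(n)
    if prior is not None:
        log_marg = log_marg + np.log(np.asarray(prior, float))
    log_marg -= log_marg.max()          # stabilise before exponentiating
    post = np.exp(log_marg)
    return post / post.sum()
```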
Simulations demonstrate that the method performs remarkably well and preliminary investigations indicate the resulting density estimates are consistent for a class of models much broader than the class of finite normal mixture models.
The models and techniques are then applied to two different problems: biodiversity assessment (investigating the density variation of trees in Duke forest by modeling their locations with gamma Poisson mixture models) and nonparametric multivariate density estimation (generalizing recent work of Wolpert and Lavine).
We propose a modified EM algorithm, in which Markov chain Monte Carlo (MCMC) techniques are employed to carry out the updating procedure. At step $m$, MCMC simulation is performed to generate a sample $\{ \mathbf{S}_{j}^{(m)} \}$ from $p(\mathbf{S} \mid \mathbf{d}, \mathbf{z}^{(m-1)})$. $\mathbf{z}^{(m)}$ is then updated to be $\mathbf{z}^{(m-1)}-\epsilon \cdot \sum_{j} \nabla_{\mathbf{z}} \log p(\mathbf{z}, \mathbf{d} \mid \mathbf{S}_{j}^{(m)})$, where $\epsilon$ is a pre-determined constant. This algorithm is akin to the so-called generalized EM (GEM) algorithm in the sense that at each iteration the predictive density value is merely increased but not necessarily maximised. The asymptotic properties of the EM algorithm can however be shown to remain valid for the GEM algorithm.
There has recently been considerable interest in embedding the principle of mixture models (experts) into the neural network learning paradigm. We present an example in which a mixture distribution network is trained using the proposed algorithm for the purpose of prediction. The results are compared with those obtained using existing methods.
E-mail: azeevi@techunix.technion.ac.il
This paper is not yet available via www or ftp; for further information please contact: Assaf Zeevi, e-mail: azeevi@techunix.technion.ac.il.