Speaker: Jim Berger
Title: Interacting with Automotive Engineers in Predicting Fuel Efficiencies II
One of the interesting features of case studies in statistics is that they involve extensive contact with non-statisticians at all stages of the statistical analysis. Afterwards, however, it is common to report only a stylized version of the process, with many of the most interesting `interactions' suppressed.
Such a case study was ``Bayesian Estimation of Fuel Economy Potential Due to Technology Improvements,'' by Andrews, Berger and Smith, available from my web page (and in the first CMU Case Studies volume). In these two lectures I will try to walk through the case study, but from the perspective of emphasizing the interactions with nonstatisticians along the way. Issues of problem formulation, modeling, and formal prior elicitation will be discussed.
Speaker: Miklos Csuros (Yale University)
Title: Recovering Evolutionary Trees through Harmonic Greedy Triplets
A main area of computational biology is the development and analysis of evolutionary tree reconstruction algorithms. These algorithms have not only enabled the exploration of evolutionary relationships among species but also led to the discovery of new proteins and improved inference of transmission of viral diseases such as AIDS.
Evolutionary tree reconstruction algorithms use a set of aligned sequences (examples include DNA sequences for corresponding genes in different species, sequences of certain regions of HIV in different patients etc.), to build a binary tree with one leaf for each sequence, which models the sequence evolution leading to the observed sequences through a series of mutations on a common ancestral sequence associated with the tree root. A widely used model of evolution assumes that the mutations occur independently in each sequence position, according to the same mutation mechanism. This mechanism consists of broadcasting a symbol from the root towards the leaves, with possible errors occurring on the edges. The errors are determined stochastically according to mutation matrices assigned to the edges.
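As a concrete illustration of this broadcasting model, here is a minimal
Python sketch (not code from the talk; the toy tree, the DNA alphabet, and
the per-edge error probabilities are invented for illustration) that
simulates leaf sequences by mutating a root symbol independently at each
position according to mutation matrices assigned to the edges.

    import numpy as np

    ALPHABET = "ACGT"

    def mutation_matrix(p):
        """Jukes-Cantor-style matrix: stay with prob 1-p, else switch uniformly."""
        k = len(ALPHABET)
        return (1 - p) * np.eye(k) + (p / (k - 1)) * (np.ones((k, k)) - np.eye(k))

    def evolve(node, state, rng, leaves):
        """Recursively broadcast `state` (an index into ALPHABET) down the tree."""
        name, edges = node
        if not edges:                      # leaf: record the observed symbol
            leaves[name] = ALPHABET[state]
            return
        for child, p in edges:             # each edge has its own error probability p
            child_state = rng.choice(len(ALPHABET), p=mutation_matrix(p)[state])
            evolve(child, child_state, rng, leaves)

    # toy 4-leaf binary tree: (name, [(child, edge_error_prob), ...])
    tree = ("root", [(("x", [(("A", []), 0.10), (("B", []), 0.20)]), 0.05),
                     (("y", [(("C", []), 0.15), (("D", []), 0.30)]), 0.05)])

    rng = np.random.default_rng(0)
    L = 10                                 # sequence length; positions evolve i.i.d.
    sequences = {}
    for _ in range(L):
        leaves = {}
        root_state = rng.integers(len(ALPHABET))
        evolve(tree, root_state, rng, leaves)
        for name, symbol in leaves.items():
            sequences[name] = sequences.get(name, "") + symbol
    print(sequences)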
For N sequences of length L, it is a natural desire of
the computer scientist to have an algorithm that builds a tree in
time polynomial in N and L. In addition, since L is not very large in
practical cases [500-2000 chars], it is also desirable that, for any
0 < delta < 1, the algorithm recovers the correct tree with probability
at least 1 - delta from sequences that are as short as possible.
A recently developed version of the basic HGT algorithm runs in
O(N^2) time with the same bounds on sample length requirements as HGT,
and has achieved high success rates in simulated experiments.
The experiments were conducted on biologically motivated trees of
135, 500, and 1895 leaves, with high mutation rates [7%-47%].
Speaker: Sir David Cox
Title:
Statistical Science; Some Reflections on Past, Present, and Future
Keynote address at the Gertrude Cox Statistics Conference, RTI. Information
at http://www.rti.org/gmcoxconf
Speaker: Mark Vangel (National Institute of Standards and Technology and Duke University)
Title:
The Analysis of Interlaboratory Study Data
Arriving at a consensus by combining information is
a common problem in applied statistics. We consider aspects
of this problem relevant to collaborative (or
`interlaboratory') studies, which are particularly common in
analytical chemistry and engineering. Two situations will
be discussed. For the simpler case, measurements are made on
a single material by multiple laboratories. The
laboratories may differ in the precision with which they
make the measurements, as well as in the number of
measurements made. A one-way random normal ANOVA model is
assumed: unbalanced and with unequal within-group variances.
For the second situation, each laboratory makes measurements
on m materials, so a two-way normal mixed model is used,
again with lack of balance and unequal within-laboratory
variances. For both the one- and two-way cases, we are
interested in consensus estimates of material means,
estimates of between-laboratory variability, and
uncertainties in these estimates. A review of some
approaches in common use will be followed by new results
for maximum-likelihood and Bayesian analyses, illustrated
with examples.
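In the notation of one common formulation (my own notation, not
necessarily the speaker's), the one-way model described above is

\[
  x_{ij} = \mu + b_i + e_{ij}, \qquad
  b_i \sim N(0,\sigma^2), \qquad
  e_{ij} \sim N(0,\sigma_i^2), \qquad
  i=1,\dots,p,\ j=1,\dots,n_i,
\]

with consensus mean \mu, between-laboratory variance \sigma^2, and
within-laboratory variances \sigma_i^2 and replicate counts n_i that may
differ across laboratories. Given the variances, the natural consensus
estimate is the weighted mean $\hat\mu = \sum_i w_i \bar x_i / \sum_i w_i$
with $w_i = (\sigma^2 + \sigma_i^2/n_i)^{-1}$; the maximum-likelihood and
Bayesian analyses mentioned above must of course also handle the unknown
variances.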
Speaker: Cancelled, due to hurricane.
Speaker: Rainer Spang (Deutsches Krebsforschungszentrum
Theoretische Bioinformatik)
Place: Bryan Research Building, Room 103
Time: 4:00 - 5:00 p.m.
Title:
Random Sequence Similarity
One of the most frequently used techniques in computer-based molecular
sequence analysis is the search for sequence similarity. Striking
resemblance between amino acid sequences is assumed to have an
evolutionary background and provides clues for determining the function
of genes and proteins. But due to the large number of comparisons
performed, random similarities are frequently observed and are hard to
distinguish from those arising from distant relationships. This talk
gives an overview of the statistical properties of random similarities
between pairs of amino acid sequences. Existing theory for pairwise
sequence comparison is extended to the context of database searches, and
simulation experiments are applied to overcome inherent shortcomings of
purely analytical approaches.
Speaker: Michael Lavine
Title:
Modelling Seedling Mortality
Recruitment of new seedlings to tree populations is probably the
principal limitation on population growth rates, and it controls the
diversity of forests (Watt **, Grubb **, Tilman and Pacala**, Clark et
al. 1998, Hubbell et al. 1999). Rates of seedling establishment are
highly variable in space and time (Beatty **). Once first year
seedlings become established, individuals may reside in the seedling
stage for decades (**).
Rates of seedling input and turnover are unknown, because field
methods are inefficient, and data collection is laborious. With
adequate coverage, the standard practice of marking individual
seedlings in sample plots could provide direct estimates of time- and,
sometimes, age-specific survivorship. But this level of detail goes
beyond the information needed for ecological objectives.
Understanding how recruitment limits population growth or community
dynamics typically does not demand age-specific data. Beyond the
first year, when mortality rates are especially high, vital rates
depend mostly on size, and age can often be ignored. Rates of
seedling input and turnover in the understory rarely require detailed
year-by-year demographic data.
Labor-intensive data collection means that census information is
limited. The vast majority of studies last a year or less, and
sampling tends to be confined to a single stand. Because this
duration is too short and data lack adequate spatial coverage, we now
have a body of studies providing unnecessary detail but failing to
provide confident estimates (Clark et al.\ 1999). There is need for
rapid census methods that can allow for extensive data collection and
estimation of turnover (survivorship).
Here we present a modeling approach that yields survival estimates
based on a rapid census method. Our method distinguishes only between
first year (New) vs.\ older (Old) seedlings, because survival rates are
expected to be similar within these two classes, and because Old
seedlings can usually be identified by presence of bud scale scars.
We apply the method to a five year data set from the southern
Appalachians (Clark et al. 1998). We conducted annual censuses of New
and Old seedlings in 1m$^2$ quadrats arranged along transects.
Individuals were not individually tagged; we simply counted individuals
of each species in the two classes. We use counts of red maple (Acer
rubrum) to demonstrate our analysis.
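One simple way to see how such counts can inform survival (my own
sketch, under the strong assumptions of a closed quadrat and a common
survival probability, not necessarily the model in the paper): if s
denotes annual survival and N_t, O_t are the New and Old counts in year
t, then

\[
  O_{t+1} \mid N_t, O_t, s \;\sim\; \mathrm{Binomial}(N_t + O_t,\, s),
\]

so repeated annual counts of the two classes, without tagging, carry
direct binomial information about s; a full analysis must of course also
accommodate quadrat-to-quadrat variation and new germination.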
Speaker: Susie Bayarri (University of Valencia and Duke University)
Title:
Examples Relating to Conditional Versus Unconditional Inference
One of the main differences between the Bayesian and the common
frequentist approaches to statistics is the
issue of conditioning. Bayesian inferences are conditional on the
observed data alone; unobserved data usually have no role in
the analysis. In contrast,
usual frequentist inference also requires careful consideration of
`what might have happened, but didn't.' The differences are reviewed
through some standard examples, including situations of considerable
practical importance such as sequential testing (for, e.g., clinical
trials). Conditioning issues surrounding P-values are also reviewed,
with it being illustrated that P-values, although ostensibly
conditional, do not accurately reflect actual conditional error rates.
Speaker: Susie Bayarri (University of Valencia and Duke University)
Title:
P-values for Bayesian Model Checking
A method for checking the compatibility of a model with the observed
data that does not require specification of an alternative model is a
most useful tool in the exploratory stages of a statistical analysis.
From a Bayesian perspective, p-values computed either with the prior or
posterior predictive distributions are frequently used for this purpose.
From a frequentist
perspective, plug-in p-values are the usual choice. In this talk, two
new proposals for p-values are introduced, and argued to be superior to
either the common Bayesian or frequentist choices. The proposals allow
use of noninformative prior distributions, avoid incoherencies in
previous proposals, are typically computable by common MCMC methods,
and have optimal asymptotic performance (from a frequentist perspective).
Examples that will be discussed include that of Fisher's exact test.
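For reference, the posterior predictive p-value mentioned above (one of
the quantities the talk argues can be improved upon, not one of the new
proposals) can be computed by simple Monte Carlo. The model, prior,
departure statistic, and data in this Python sketch are illustrative
assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    x_obs = rng.normal(loc=0.0, scale=1.0, size=20)   # pretend observed data
    n = x_obs.size

    # Conjugate normal model: x_i ~ N(theta, 1), theta ~ N(0, 10^2)
    prior_var = 100.0
    post_var = 1.0 / (n + 1.0 / prior_var)
    post_mean = post_var * x_obs.sum()

    def T(x):
        """Departure statistic: largest absolute observation."""
        return np.abs(x).max()

    # Monte Carlo estimate of P(T(x_rep) >= T(x_obs) | x_obs)
    reps = 10000
    theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=reps)
    x_rep = rng.normal(theta_draws[:, None], 1.0, size=(reps, n))
    p_post = np.mean([T(xr) >= T(x_obs) for xr in x_rep])
    print("posterior predictive p-value:", p_post)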
Speaker: Herbie Lee
Title:
Introduction to Neural Networks
Neural networks can be viewed statistically as a method of
nonparametric regression. I will briefly discuss some of the history
of neural networks, and then show how they fit into modern statistics.
Many people claim the parameters are interpretable, but I beg to
differ, and I will show a simple example of noninterpretability. No
introduction is complete without discussing backpropagation, so I will
mention it. Last is a discussion of some proposed priors, including a
noninformative one.
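For concreteness, the single-hidden-layer feed-forward network usually
meant in this setting can be written as the nonparametric regression
function

\[
  f(x) \;=\; \beta_0 + \sum_{j=1}^{k} \beta_j\,
  \psi\!\bigl(\gamma_{j0} + \gamma_j^{\mathsf T} x\bigr),
  \qquad \psi(z) = \tanh(z) \ \text{or} \ (1+e^{-z})^{-1},
\]

with the notation chosen here for illustration. One well-known source of
non-identifiability, for example, is that permuting the hidden units
(and, for tanh, jointly flipping the signs of a unit's weights and its
\beta_j) leaves f unchanged, which already complicates any direct
interpretation of individual parameters.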
Speaker: Herbie Lee
Title:
Issues in Bayesian Neural Network Modeling
The motivation for much of my research on neural networks has been the
issue of model selection (and model averaging). How can we choose the
best size of network and the best subset of explanatory variables?
This question is often ignored in the neural network literature. I
will present an algorithm for model selection that can be used for
neural networks as well as many other problems.
An issue that arises in Bayesian model selection is the estimation of
the marginal probability of the data, which is the normalizing
constant for the posterior. This is an open area of research, and
current methods tend to fail miserably when applied to neural
networks.
Finally, I will present some asymptotic consistency results.
Frequentist neural networks have been shown to be asymptotically
consistent for all nice true regression functions. I extend these
results to the Bayesian context, showing asymptotic consistency of the
posterior.
Speaker: Simon Godsill (Cambridge University)
Title:
Sequential Monte Carlo Inference for Large and Evolving Datasets: the
Particle Filter
When performing Bayesian inference for complex statistical models, it is
standard practice to apply Monte Carlo methods such as importance
sampling or Markov chain Monte Carlo (MCMC). However, in cases where
very large datasets are involved, especially when the data arrive point
by point (sequentially), it will be impractical to apply such
batch-based methods, both computationally and in terms of the computer
memory requirements. Instead it will be necessary to devise methods in
which a representation of the posterior distribution is allowed to
evolve in time as successive data points arrive. I will describe the
formulation of state space dynamical models for evolving datasets and
review the standard methods for dealing with the problem, such as the
classic Kalman filter. I will then go on to describe state of the art
methods based on particle filters. In these, a `cloud' of random
particles is constructed to approximate a posterior distribution. When a
new data point arrives, the cloud is updated using importance sampling
methods. I will show how these methods can be refined by adding in
elements of Markov chain Monte Carlo, and will outline some
applications to time-varying autoregressions and speech data.
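To make the particle cloud idea concrete, here is a minimal bootstrap
particle filter in Python for a toy AR(1)-plus-noise state space model
(illustrative only; the model, settings, and resampling scheme are my
assumptions, and the talk's refinements such as MCMC moves are not
included).

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy model: x_t = 0.9 x_{t-1} + v_t, v_t ~ N(0,1);  y_t = x_t + w_t, w_t ~ N(0,1)
    phi, q, r, T = 0.9, 1.0, 1.0, 100
    x = np.zeros(T); y = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.normal(scale=np.sqrt(q))
        y[t] = x[t] + rng.normal(scale=np.sqrt(r))

    N = 1000                                    # number of particles in the cloud
    particles = rng.normal(scale=2.0, size=N)   # initial cloud
    filtered_means = []
    for t in range(T):
        # propagate the cloud through the state equation
        particles = phi * particles + rng.normal(scale=np.sqrt(q), size=N)
        # importance weights from the likelihood of the new observation
        w = np.exp(-0.5 * (y[t] - particles) ** 2 / r)
        w /= w.sum()
        filtered_means.append(np.sum(w * particles))
        # resample to avoid weight degeneracy
        particles = particles[rng.choice(N, size=N, p=w)]
    print(filtered_means[-5:])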
Speaker: Neil Shephard (University of Oxford)
Place: Social Sciences Building, Room 111
Title:
Analysis of High Dimensional Multivariate Stochastic Volatility Models
This paper is concerned with the fitting and comparison of high dimensional
multivariate time series models with time varying correlations. The
models considered here combine features of the classical factor model with those of the univariate stochastic volatility model. Specifically, a set of
unobserved time-dependent factors, along with an associated loading matrix, are used to model the contemporaneous correlation while,
conditioned on the factors, the noise in each factor and each series is assumed to follow independent three-parameter univariate stochastic
volatility processes. A complete analysis of these models, and their special
cases, is developed that encompasses estimation, filtering and model
choice. The centerpieces of our estimation algorithm (which relies on MCMC
methods) are (1) a reduced blocking scheme for sampling the free
elements of the loading matrix and the factors and (2) a special method for
sampling the parameters of the univariate SV process. The sampling of
the loading matrix (containing typically many hundreds of parameters) is done
via a highly tuned Metropolis-Hastings step. The resulting algorithm
is completely scalable in terms of series and factors and very
simulation-efficient. We also provide methods for estimating the log-likelihood
function and the filtered values of the time-varying volatilities and
correlations. We pay special attention to the problem of comparing one version
of the model with another and for determining the number of factors. For this
purpose we use MCMC methods to find the marginal likelihood
and associated Bayes factors of each fitted model. In sum, these procedures lead to the first unified and practical likelihood based analysis of
truly high dimensional models of stochastic volatility. We apply our methods in detail to two datasets. The first is the return vector on 20 exchange
rates against the US Dollar. The second is the return vector on 40 common stocks quoted on the New York Stock Exchange.
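In generic notation (chosen here for illustration; the paper's exact
parameterization may differ), the model described above takes the form

\[
  y_t = B f_t + u_t, \qquad
  f_{jt}\mid h_{jt} \sim N\!\bigl(0, e^{h_{jt}}\bigr), \qquad
  u_{it}\mid \lambda_{it} \sim N\!\bigl(0, e^{\lambda_{it}}\bigr),
\]

where B is the loading matrix, the common factors f_t drive the
time-varying correlations, and each log-volatility follows its own
three-parameter autoregression, e.g.

\[
  h_{j,t+1} = \mu_j + \phi_j (h_{jt} - \mu_j) + \eta_{jt}, \qquad
  \eta_{jt} \sim N(0, \sigma_{\eta j}^2).
\]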
Speaker: Christina Geyer
Title:
Detecting Manipulation in Data Sets Using Benford's Law
Benford's Law is a numerical phenomenon in which the first significant
digits of sets of data that are counting or measuring some fact follow a
certain multinomial distribution. We will begin by giving a history of
Benford's Law and defining which data sets are expected to follow
Benford's Law. Next, we will look at two statistical detection methods
currently employing Benford's Law, one a classical test of the mean of
the first significant digits, the other a Bayesian test of the
distribution of first significant digits. We will then discuss a Bayesian
test of the mean of the first significant digits and a classical test of
the distribution of first significant digits. Finally, we will compare
these methods using simulated data.
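For reference, Benford's Law assigns first significant digit d the
probability log10(1 + 1/d), d = 1,...,9. The short Python sketch below
(illustrative only; the simulated data set and the plain chi-square check
are my assumptions, not the specific tests compared in the talk)
tabulates first digits and computes a classical goodness-of-fit statistic
against these proportions.

    import numpy as np

    digits = np.arange(1, 10)
    benford = np.log10(1 + 1 / digits)      # P(first digit = d) = log10(1 + 1/d)

    def first_digits(values):
        """Leading significant digit of each positive value."""
        v = np.abs(np.asarray(values, dtype=float))
        v = v[v > 0]
        return (v / 10 ** np.floor(np.log10(v))).astype(int)

    # hypothetical data set spread over several orders of magnitude
    rng = np.random.default_rng(3)
    data = np.exp(rng.uniform(0, np.log(10 ** 6), size=5000))
    obs = np.bincount(first_digits(data), minlength=10)[1:]

    # Pearson chi-square statistic against the Benford proportions
    expected = benford * obs.sum()
    chi2 = np.sum((obs - expected) ** 2 / expected)
    print(dict(zip(digits, np.round(obs / obs.sum(), 3))), "chi2 =", round(chi2, 2))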
Speaker: Shanti S. Gupta (Purdue University)
Title:
On Empirical Bayes Selection Procedures
The talk will deal with selection of good exponential populations compared
with a control. After deriving the Bayes rules, an empirical Bayes rule is
constructed. Its asymptotic behavior is studied. It is shown that the
empirical Bayes rule has a convergence rate of order n^{-r/2} for some
r between 0 and 2. Time permitting, some recent work on the
empirical Bayes selection
of Poisson populations will be presented.
Speaker: Ashish Sanil (National Institute of Statistical Sciences)
Title:
Nonparametric Regression Based on L-Splines
Cubic spline smoothing is a popular nonparametric regression
technique. The method involves fitting a function to data while
penalizing the size of its second derivative. This ensures
that the fitted function will have few sharp "wiggles", and therefore
appear smooth. L-Spline methods are extensions of the cubic smoothing
spline which allow one to penalize the size of any linear differential
operator L applied to the fitted function, and thereby incorporate prior
notions of the
behavior of the underlying function into the fitting procedure. This
talk is primarily an introduction to L-Spline methodology. I will also
talk about the close connection between L-Splines and some other
techniques such as Kriging and Gaussian process smoothing. Finally, I
will discuss some research issues concerning model-selection,
computational algorithms, etc.
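In the usual penalized least-squares notation (mine, for illustration),
the L-spline estimate solves

\[
  \min_f \; \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  \;+\; \lambda \int \bigl[(Lf)(t)\bigr]^2 \, dt,
\]

where taking L = D^2 (the second derivative) recovers the cubic
smoothing spline, while other operators, such as L = D^2 + \omega^2 for
roughly periodic behavior, penalize departures from the null space of L
and thereby encode prior notions about the underlying function.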
Speaker: Viridiana Lourdes
Title:
Bayesian Analysis of Longitudinal Data in a Case Study in the VA Hospital
System
As part of a long-term concern with measures of "quality-of-care" in
the VA hospital system, the VA Management Sciences Group is involved
in large-scale data collection and analysis of patient-specific data
in many care areas. Among the many variables of interest are the observed
times of "return to follow-up care" of individuals discharged
following initial treatment. Follow-up protocols supposedly encourage
regular and timely returns, and observed patterns of variability in
return time distributions are of interest in connection with questions
about influences on return times that are specific to individual
patients, care areas, hospitals, and system-wide policy changes.
Our study has explored such issues in the area of psychiatric and
substance abuse patients across the nationwide system of VA
hospitals.
The study is ongoing, and in this talk I will touch on selected
aspects of the modelling and data analysis investigations we've
developed to date. In particular, I will discuss our studies of
discrete duration models that are designed to help us understand and
estimate the effects on return times that are specific to individual
hospitals -- the primary question for the VA -- in the context of a
large collection of potential additional covariates. We adopt
logistic regression models to describe discretised representations of
the underlying continuous return time distributions. These models
take into account categorical covariates related to the several
socio-demographic characteristics and aspects of medical history of
individual patients, and treat the primary covariate of interest --
the VA hospital factor -- using a random effects/hierarchical
prior. Our models are analysed in parallel across a range of chosen
"return time cut-offs", providing a nice analogue method of exploring
and understanding how posterior distributions for covariate effects
and hyperparameters vary with chosen cut-off. This perspective allows
us to identify important aspects of the non-proportional odds
structure exhibited by this very large and rich data set, by
isolating important and interesting interactions between cut-offs and
specific covariates. Summarisation of the sets of high-dimensional
posterior distributions arising in such an analysis is challenging,
and is most effectively done through sets of linked graphs of
posterior intervals for covariate effects and other derived
parameters.
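One way to read this description (my own schematic notation; the actual
model may differ in detail): for each chosen cut-off c, the binary
indicator of return within c days is modeled as

\[
  \mathrm{logit}\,\Pr(T_i \le c \mid x_i)
  \;=\; x_i^{\mathsf T}\beta_c + \gamma_{h(i),c}, \qquad
  \gamma_{h,c} \sim N(0, \tau_c^2),
\]

where x_i collects the socio-demographic and medical-history covariates,
h(i) indexes the hospital treating patient i, and the hierarchical
hospital effects \gamma_{h,c} are the quantities of primary interest;
comparing posteriors across cut-offs c is what reveals the
non-proportional odds structure.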
We explore and exemplify this work with a full year's data, from
1997, and then discuss extended models for multiple years that include
time series structure to address dependencies across years. Issues of
covariate selection via Bayesian model selection methods, and other
practical questions, arise, and will be mentioned as time permits.
This is joint work with Mike West and Jim Burges.
Speaker: Marco Ferreira
Title:
Model Selection and the Minimum Description Length Principle
The Minimum Description Length (MDL) principle
for model selection will be described. In the MDL framework, the length
of the shortest description of the observed data achievable under each
candidate model is used to decide between models. The MDL principle
will be illustrated by its application to
selection among linear regression models.
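As a concrete, if crude, illustration, one standard two-part MDL
criterion for Gaussian linear regression charges about (n/2) log(RSS/n)
for the data plus (k/2) log n for the k fitted coefficients (equivalent
up to constants to BIC); the talk may well use a more refined code. The
Python sketch below, with invented data, selects the subset of
regressors minimizing this description length.

    import itertools
    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 100, 4
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)   # only columns 0 and 2 matter

    def description_length(cols):
        """Two-part code length for the regression of y on the given columns."""
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ beta) ** 2)
        k = Z.shape[1]
        return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

    models = [cols for r in range(p + 1) for cols in itertools.combinations(range(p), r)]
    best = min(models, key=description_length)
    print("selected columns:", best)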
Speaker: Jim Berger
Title:
The Controversy Over P-values
Probably the most commonly used statistical tool is the P-value or
observed significance level. Statisticians have long struggled against
the misuse and misinterpretation of P-values, but a genie can be hard
to put back in the bottle. The debate about P-values has now even
reached the popular press, with articles appearing that seriously
question the value of research findings that are based primarily on
P-values.
There is nothing inherently wrong with a P-value, in that it is a
statistical measure of something and virtually all statisticians feel
that it has at least some valid uses. The problems arise when it is
used in the wrong way or interpreted to mean something that it is not.
The wrong ways in which P-values are used will be briefly reviewed,
but this is a common theme in statistics and science and so will be
discussed only briefly.
The less well understood problem with P-values is that their common
misinterpretation as some type of error probability (or, even worse,
as the probability of an hypothesis) can lead to very erroneous
conclusions. As an example, in testing a precise hypothesis, such as
the null hypothesis that an experimental drug is negligibly different
in effect from a placebo, a P-value of 0.05 is commonly thought to
indicate significant evidence against the hypothesis whereas, in
reality, it implies that the evidence is essentially balanced between
the hypotheses. This will be demonstrated in a variety of ways and the
effect illustrated in several examples. Finally, a simple calibration
of P-values will be proposed that can, at least, prevent the worst
misinterpretations.
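One calibration discussed in related work by Sellke, Bayarri and Berger
is the lower bound -e p log(p) on the Bayes factor in favour of a
precise null hypothesis; whether this is exactly the calibration
proposed in the talk is my assumption. The short Python sketch below
evaluates it and the corresponding conditional error probability,
assuming equal prior odds.

    import math

    def bayes_factor_bound(p):
        """Lower bound on the Bayes factor in favour of a precise null, for p < 1/e."""
        return -math.e * p * math.log(p)

    def conditional_error_bound(p):
        """Corresponding lower bound on the error probability when rejecting H0."""
        b = bayes_factor_bound(p)
        return b / (1 + b)            # assumes prior odds of 1 between hypotheses

    for p in (0.05, 0.01, 0.001):
        print(p, round(bayes_factor_bound(p), 3), round(conditional_error_bound(p), 3))
    # e.g. p = 0.05 gives a Bayes factor bound of about 0.41, i.e. an error
    # probability of roughly 0.29 rather than 0.05.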
Speaker: Peter Müller
Title:
An Adaptive Bayesian Dose-Finding Design
We propose a fully decision theoretic design for a dose-finding
clinical trial.
The proposed solution is based on a probability model
for the unknown dose/response curve and a utility function which
formalizes the relative preferences over alternative outcomes.
The chosen probability model for the dose/response curve is a normal
dynamic linear model (NDLM) which allows computationally efficient
analytic posterior inference. The proposed utility function formalizes
learning about the dose/response curve as minimizing posterior
variance in some key parameters of the dose response curve, for
example, the unknown response at the unknown ED95 dose.
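In schematic form (notation mine; the parameterization in the talk may
differ), an NDLM indexed by the ordered doses d = 1, ..., D models the
response of a patient assigned dose d as

\[
  y \mid d, \theta \sim N(\theta_d, \sigma^2), \qquad
  \theta_d = \theta_{d-1} + \delta_d, \qquad \delta_d \sim N(0, \tau^2),
\]

so the mean dose/response curve (\theta_1, ..., \theta_D) is a smoothed
random walk across doses whose posterior is available through the
standard DLM filtering and smoothing recursions, which is what keeps the
repeated expected-utility calculations computationally feasible.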
To find the optimal dose to be assigned to the respective next patient
we maximize the utility function in expectation, marginalizing over all
random variables which are unknown at the time of decision making,
including unknown model parameters and still un-observed future
responses.
The proposed approach is myopic in the sense that each day
we compute the optimal doses as if the patients arriving on that day
were the last ones recruited into the trial.
Speaker: Rodney Sparapani
Title:
Cardiac Enzymes as a Clinical Outcome
Creatine kinase MB (CK-MB) measurements are commonly used in clinical
practice to assess myocardial infarction (MI). An MI diagnosis is usually
based solely on CK-MB measurements, but may make use of other cardiac
enzymes such as total creatine kinase, Troponin I, Troponin T or
electro-cardiograms. Furthermore, CK-MB measurements are proportional to
the severity of an MI and the probability of mortality. The efficacy
endpoint in many cardiovascular trials is the composite of death and MI in
either an event rate comparison or a survival analysis. A problem with
composite endpoints is that they ignore the severity of the MI and may
subsequently require larger sample sizes due to a loss of information. For
example, what if a compound doesn't prevent MI, but instead decreases its
severity? In that case, even a larger sample size might miss the treatment
effect entirely. A logical candidate for a quantitative, and potentially
more powerful, endpoint would be CK-MB measurements. We'll look at how
CK-MB measurements might be used as a clinical outcome of a clinical trial.
It is assumed that this approach could be extended to other cardiac enzymes
as well.
Speaker: Arnaud Doucet (Cambridge University)
Title:
Sequential Monte Carlo Methods and Applications to Signal Processing
Sequential Bayesian estimation is important in many applications involving
real-time signal processing, where data arrival is inherently sequential.
Classical suboptimal methods for sequential Bayesian estimation include
the Extended Kalman filter and the Gaussian sum approximation. Although
these methods are easy to implement, they also perform poorly for many
nonlinear and/or non-Gaussian models. Sequential Monte Carlo (SMC) methods
(also known as particle filters) are a set of simulation-based methods
which make it possible to address these complex problems. I will present a
generic algorithm and discuss several improvement strategies. Convergence
results will also be briefly reviewed. Finally, I will describe a few
signal processing applications of SMC: digital communications, target
tracking in clutter noise and speech enhancement.
Speaker: John Kern
Title:
Bayesian Spatial Covariance Models
The impact of covariance function choice on Gaussian process models for
spatially continuous data will be examined through a comparison of
traditional selection methods with non-traditional, non-parametric
selection methods. These non-traditional methods represent the leading
edge of spatial covariance analyses (at least for one person), and make
use of process convolution models, which, along with Gaussian process
models, will be presented in enough detail for even the most novice
spatial data analyst to feel comfortable.
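For readers new to process convolutions, a common discrete form of the
construction (notation mine) is

\[
  z(s) \;=\; \sum_{j=1}^{m} k(s - u_j)\, x_j, \qquad
  x_j \stackrel{iid}{\sim} N(0,1),
\]

so that z is a Gaussian process with covariance
$\mathrm{Cov}\{z(s), z(s')\} = \sum_j k(s-u_j)\,k(s'-u_j)$ over a grid of
sites u_j; the covariance model is specified indirectly through the
smoothing kernel k, which can itself be allowed to vary over space,
giving the non-parametric flexibility alluded to above.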
Speaker: Hedibert Lopes
Title:
Factor Models, Stochastic Volatility, and Meta-analysis
This talk will be based on part of the research I have been
doing for the past two years here at Duke. The material
presented will eventually be part of my PhD thesis, so below I
try to summarize the overall idea of each of the three main
projects I am currently involved in. Since part 1 (Model uncertainty in
factor analysis) has already been extensively presented, I will spend
most of the talk on parts 2 and 3. However, part 1 will be briefly
touched on when talking about model comparison in situations where prior
information for the factor model parameters is lacking, in which case
we will refer to recent developments by Perez and Berger.
[Part 1] MODEL UNCERTAINTY IN FACTOR ANALYSIS (Duke Statistics DP #38-98)
(with Mike West)
Bayesian inference in factor analytic models has received renewed
attention in recent years, partly due to computational advances but
also partly to applied problems generating interest in factor structures, as
exemplified by recent work in financial time series modeling.
The focus of our current work is on exploring questions of uncertainty
about the number of latent factors in a multivariate factor model, combined
with methodological and computational issues of model specification and
model fitting. We explore reversible jump MCMC methods that build on
sets of parallel Gibbs sampling-based analyses to generate suitable
empirical proposal distributions and that address the challenging
problem of finding efficient proposals in high-dimensional models.
Alternative MCMC methods based on bridge sampling are discussed, and these
fully Bayesian MCMC approaches are compared with a collection of popular
model selection methods in empirical studies. Various additional
computational issues are discussed, and the methods are explored in
studies of some simulated data sets and an econometric time series
example. More recently, we have also explored Perez and Berger's
expected posterior priors in a couple of factor analysis examples.
[Part 2] FACTOR MODEL AND MULTIVARIATE STOCHASTIC VOLATILITY:
TIME-EVOLVING LOADINGS AND SIMULATED-BASED SEQUENTIAL
ANALYSIS
(with Mike West and Omar Aguilar)
In this project we extend previous work from Aguilar and West (1999)
(Duke Statistics DP #98-03) on factor models with multivariate stochastic
volatilities. Our main contribution consists of allowing the factor
loadings to evolve with time in a way that virtually retains the
interpretation of the factor scores throughout time. In other words,
the weights that some factors might have on a particular subset of
time series are allowed to change with time, mimicking real
financial/economic scenarios. An example occurs when a country (or
countries) enters/leaves a particular market, where such a market might
have been well represented up to that point in time by a common factor.
From a more technical viewpoint, we also improve on previous work by
allowing the model parameters and states to be estimated based on the
information available up to that time, that is, sequentially. We
implemented auxiliary particle filter ideas to deal with the states in
our model, while the fixed parameters were sampled sequentially
according to recent developments by Liu and West (Duke Statistics DP
#99-14). We will present some preliminary results
based on Aguilar and West's application on international exchange rate
returns.
[Part 3] META-ANALYSIS FOR LONGITUDINAL DATA MODELS USING
MULTIVARIATE MIXTURE PRIORS
(with Peter Mueller and Gary Rosner)
We propose a class of longitudinal data models with random effects
which generalize currently used models in two important aspects.
First, the random effects model is a flexible mixture of multivariate
normals, accommodating population heterogeneity, outliers and
non-linearity in regression on subject-specific covariates. Second,
the model includes a hierarchical extension to allow for meta-analysis
over related studies.
The random effects distributions are split into one part
which is common across all related studies (common measure), and one
part which is specific to each study and captures the variability
intrinsic within patients from the same study.
Both the common measure and the study-specific measures are
parametrized as mixtures of normals. To allow a
random number of terms in the mixtures we introduce a reversible jump
algorithm for posterior simulation. The proposed methodology
is broad enough to embrace current hierarchical models and it allows
for "borrowing-strength" between studies in a new and simple way.
The motivating application is the analysis of two studies carried out
by the Cancer and Leukemia Group B (CALGB).
In both studies, we record each patient's white blood cell count
(WBC) over time to characterize the toxic effects of treatment. The WBC
counts are modeled through a nonlinear hierarchical model that
gathers the information from both studies.
Jim Berger
August, 1999