Speaker: Jim Berger
Title: Interacting with Automotive Engineers in Predicting Fuel Efficiencies II
One of the interesting features of case studies in statistics is that they involve extensive contact with non-statisticians at all stages of the statistical analysis. Afterwards, however, it is common to report only a stylized version of the process, with many of the most interesting `interactions' suppressed.
Such a case study was ``Bayesian Estimation of Fuel Economy Potential Due to Technology Improvements,'' by Andrews, Berger and Smith, available from my web page (and in the first CMU Case Studies volume). In these two lectures I will try to walk through the case study, but from the perspective of emphasizing the interactions with nonstatisticians along the way. Issues of problem formulation, modeling, and formal prior elicitation will be discussed.
Speaker: Miklos Csuros (Yale University)
Title: Recovering Evolutionary Trees through Harmonic Greedy Triplets
A main area of computational biology is the development and analysis of evolutionary tree reconstruction algorithms. These algorithms have not only enabled the exploration of evolutionary relationships among species but also led to the discovery of new proteins and improved inference of transmission of viral diseases such as AIDS.
Evolutionary tree reconstruction algorithms use a set of aligned sequences (examples include DNA sequences for corresponding genes in different species, sequences of certain regions of HIV in different patients etc.), to build a binary tree with one leaf for each sequence, which models the sequence evolution leading to the observed sequences through a series of mutations on a common ancestral sequence associated with the tree root. A widely used model of evolution assumes that the mutations occur independently in each sequence position, according to the same mutation mechanism. This mechanism consists of broadcasting a symbol from the root towards the leaves, with possible errors occurring on the edges. The errors are determined stochastically according to mutation matrices assigned to the edges.
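As a concrete illustration of this broadcasting model, here is a minimal
Python sketch (not code from the talk; the toy tree, the DNA alphabet, and
the per-edge error probabilities are invented for illustration) that
simulates leaf sequences by mutating a root symbol independently at each
position according to mutation matrices assigned to the edges.

    import numpy as np

    ALPHABET = "ACGT"

    def mutation_matrix(p):
        """Jukes-Cantor-style matrix: stay with prob 1-p, else switch uniformly."""
        k = len(ALPHABET)
        return (1 - p) * np.eye(k) + (p / (k - 1)) * (np.ones((k, k)) - np.eye(k))

    def evolve(node, state, rng, leaves):
        """Recursively broadcast `state` (an index into ALPHABET) down the tree."""
        name, edges = node
        if not edges:                      # leaf: record the observed symbol
            leaves[name] = ALPHABET[state]
            return
        for child, p in edges:             # each edge has its own error probability p
            child_state = rng.choice(len(ALPHABET), p=mutation_matrix(p)[state])
            evolve(child, child_state, rng, leaves)

    # toy 4-leaf binary tree: (name, [(child, edge_error_prob), ...])
    tree = ("root", [(("x", [(("A", []), 0.10), (("B", []), 0.20)]), 0.05),
                     (("y", [(("C", []), 0.15), (("D", []), 0.30)]), 0.05)])

    rng = np.random.default_rng(0)
    L = 10                                 # sequence length; positions evolve i.i.d.
    sequences = {}
    for _ in range(L):
        leaves = {}
        root_state = rng.integers(len(ALPHABET))
        evolve(tree, root_state, rng, leaves)
        for name, symbol in leaves.items():
            sequences[name] = sequences.get(name, "") + symbol
    print(sequences)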
For N sequences of length L, it is a natural desire of
the computer scientist to have an algorithm that builds a tree in
time polynomial in N and L. In addition, since L is not very large in
practical cases [500-2000 chars], it is also desirable that, for any
0 < delta < 1, the algorithm recovers the correct tree with probability
at least 1 - delta from sequences that are as short as possible.
A recently developed version of the basic HGT algorithm runs in
O(N^2) time with the same bounds on sample length requirements as HGT,
and has achieved high success rates in simulated experiments.
The experiments were conducted on biologically motivated trees of
135, 500, and 1895 leaves, with high mutation rates [7%-47%].
Speaker: Sir David Cox
Title:
Statistical Science; Some Reflections on Past, Present, and Future
Keynote address at the Gertrude Cox Statistics Conference, RTI. Information
at http://www.rti.org/gmcoxconf
Speaker: Mark Vangel (National Institute of Standards and Technology and Duke University)
Title:
The Analysis of Interlaboratory Study Data
Arriving at a consensus by combining information is
a common problem in applied statistics. We consider aspects
of this problem relevant to collaborative (or
`interlaboratory') studies, which are particularly common in
analytical chemistry and engineering. Two situations will
be discussed. For the simpler case, measurements are made on
a single material by multiple laboratories. The
laboratories may differ in the precision with which they
make the measurements, as well as in the number of
measurements made. A one-way random normal ANOVA model is
assumed: unbalanced and with unequal within-group variances.
For the second situation, each laboratory makes measurements
on m materials, so a two-way normal mixed model is used,
again with lack of balance and unequal within-laboratory
variances. For both the one- and two-way cases, we are
interested in consensus estimates of material means,
estimates of between-laboratory variability, and
uncertainties in these estimates. A review of some
approaches in common use will be followed by new results
for maximum-likelihood and Bayesian analyses, illustrated
with examples.
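In the notation of one common formulation (my own notation, not
necessarily the speaker's), the one-way model described above is

\[
  x_{ij} = \mu + b_i + e_{ij}, \qquad
  b_i \sim N(0,\sigma^2), \qquad
  e_{ij} \sim N(0,\sigma_i^2), \qquad
  i=1,\dots,p,\ j=1,\dots,n_i,
\]

with consensus mean \mu, between-laboratory variance \sigma^2, and
within-laboratory variances \sigma_i^2 and replicate counts n_i that may
differ across laboratories. Given the variances, the natural consensus
estimate is the weighted mean $\hat\mu = \sum_i w_i \bar x_i / \sum_i w_i$
with $w_i = (\sigma^2 + \sigma_i^2/n_i)^{-1}$; the maximum-likelihood and
Bayesian analyses mentioned above must of course also handle the unknown
variances.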
Speaker: Cancelled, due to hurricane.
Speaker: Rainer Spang (Deutsches Krebsforschungszentrum
Theoretische Bioinformatik)
Place: Bryan Research Building, Room 103
Time: 4:00 - 5:00 p.m.
Title:
Random Sequence Similarity
One of the most frequently used techniques in computer-based molecular
sequence analysis is the search for sequence similarity. Striking
resemblance between amino acid sequences is assumed to have an
evolutionary background and provides clues for determining the function
of genes and proteins. But due to the large number of comparisons
performed, random similarities are frequently observed and are hard to
distinguish from those arising from distant relationships. This talk
gives an overview of the statistical properties of random similarities
between pairs of amino acid sequences. Existing theory for pairwise
sequence comparison is extended to the context of database searches, and
simulation experiments are applied to overcome inherent shortcomings of
purely analytical approaches.
Speaker: Michael Lavine
Title:
Modelling Seedling Mortality
Recruitment of new seedlings to tree populations is probably the
principal limitation on population growth rates, and it controls the
diversity of forests (Watt **, Grubb **, Tilman and Pacala**, Clark et
al. 1998, Hubbell et al. 1999). Rates of seedling establishment are
highly variable in space and time (Beatty **). Once first year
seedlings become established, individuals may reside in the seedling
stage for decades (**).
Rates of seedling input and turnover are unknown, because field
methods are inefficient, and data collection is laborious. With
adequate coverage, the standard practice of marking individual
seedlings in sample plots could provide direct estimates of time- and,
sometimes, age-specific survivorship. But this level of detail goes
beyond the information needed for ecological objectives.
Understanding how recruitment limits population growth or community
dynamics typically does not demand age-specific data. Beyond the
first year, when mortality rates are especially high, vital rates
depend mostly on size, and age can often be ignored. Rates of
seedling input and turnover in the understory rarely require detailed
year-by-year demographic data.
Labor-intensive data collection means that census information is
limited. The vast majority of studies last a year or less, and
sampling tends to be confined to a single stand. Because this
duration is too short and data lack adequate spatial coverage, we now
have a body of studies providing unnecessary detail but failing to
provide confident estimates (Clark et al.\ 1999). There is need for
rapid census methods that can allow for extensive data collection and
estimation of turnover (survivorship).
Here we present a modeling approach that yields survival estimates
based on a rapid census method. Our method distinguishes only between
first year (New) vs.\ older (Old) seedlings, because survival rates are
expected to be similar within these two classes, and because Old
seedlings can usually be identified by presence of bud scale scars.
We apply the method to a five year data set from the southern
Appalachians (Clark et al. 1998). We conducted annual censuses of New
and Old seedlings in 1m$^2$ quadrats arranged along transects.
Individuals were not individually tagged; we simply counted individuals
of each species in the two classes. We use counts of red maple (Acer
rubrum) to demonstrate our analysis.
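One simple way to see how such counts can inform survival (my own
sketch, under the strong assumptions of a closed quadrat and a common
survival probability, not necessarily the model in the paper): if s
denotes annual survival and N_t, O_t are the New and Old counts in year
t, then

\[
  O_{t+1} \mid N_t, O_t, s \;\sim\; \mathrm{Binomial}(N_t + O_t,\, s),
\]

so repeated annual counts of the two classes, without tagging, carry
direct binomial information about s; a full analysis must of course also
accommodate quadrat-to-quadrat variation and new germination.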
Speaker: Susie Bayarri (University of Valencia and Duke University)
Title:
Examples Relating to Conditional Versus Unconditional Inference
One of the main differences between the Bayesian and the common
frequentist approaches to statistics is the
issue of conditioning. Bayesian inferences are conditional on the
observed data alone; unobserved data usually have no role in
the analysis. In contrast,
usual frequentist inference also requires careful consideration of
`what might have happened, but didn't.' The differences are reviewed
through some standard examples, including situations of considerable
practical importance such as sequential testing (for, e.g., clinical
trials). Conditioning issues surrounding P-values are also reviewed,
with it being illustrated that P-values, although ostensibly
conditional, do not accurately reflect actual conditional error rates.
Speaker: Susie Bayarri (University of Valencia and Duke University)
Title:
P-values for Bayesian Model Checking
A method for checking the compatibility of a model with the observed
data that does not require specification of an alternative model is a
most useful tool in the exploratory stages of a statistical analysis.
From a Bayesian perspective, p-values computed either with the prior or
posterior predictive distributions are frequently used for this purpose.
From a frequentist
perspective, plug-in p-values are the usual choice. In this talk, two
new proposals for p-values are introduced, and argued to be superior to
either the common Bayesian or frequentist choices. The proposals allow
use of noninformative prior distributions, avoid incoherencies in
previous proposals, are typically computable by common MCMC methods,
and have optimal asymptotic performance (from a frequentist perspective).
Examples that will be discussed include that of Fisher's exact test.
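For reference, the posterior predictive p-value mentioned above (one of
the quantities the talk argues can be improved upon, not one of the new
proposals) can be computed by simple Monte Carlo. The model, prior,
departure statistic, and data in this Python sketch are illustrative
assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    x_obs = rng.normal(loc=0.0, scale=1.0, size=20)   # pretend observed data
    n = x_obs.size

    # Conjugate normal model: x_i ~ N(theta, 1), theta ~ N(0, 10^2)
    prior_var = 100.0
    post_var = 1.0 / (n + 1.0 / prior_var)
    post_mean = post_var * x_obs.sum()

    def T(x):
        """Departure statistic: largest absolute observation."""
        return np.abs(x).max()

    # Monte Carlo estimate of P(T(x_rep) >= T(x_obs) | x_obs)
    reps = 10000
    theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=reps)
    x_rep = rng.normal(theta_draws[:, None], 1.0, size=(reps, n))
    p_post = np.mean([T(xr) >= T(x_obs) for xr in x_rep])
    print("posterior predictive p-value:", p_post)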
Speaker: Herbie Lee
Title:
Introduction to Neural Networks
Neural networks can be viewed statistically as a method of
nonparametric regression. I will briefly discuss some of the history
of neural networks, and then show how they fit into modern statistics.
Many people claim the parameters are interpretable, but I beg to
differ, and I will show a simple example of noninterpretability. No
introduction is complete without discussing backpropagation, so I will
mention it. Last is a discussion of some proposed priors, including a
noninformative one.
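For concreteness, the single-hidden-layer feed-forward network usually
meant in this setting can be written as the nonparametric regression
function

\[
  f(x) \;=\; \beta_0 + \sum_{j=1}^{k} \beta_j\,
  \psi\!\bigl(\gamma_{j0} + \gamma_j^{\mathsf T} x\bigr),
  \qquad \psi(z) = \tanh(z) \ \text{or} \ (1+e^{-z})^{-1},
\]

with the notation chosen here for illustration. One well-known source of
non-identifiability, for example, is that permuting the hidden units
(and, for tanh, jointly flipping the signs of a unit's weights and its
\beta_j) leaves f unchanged, which already complicates any direct
interpretation of individual parameters.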
Speaker: Herbie Lee
Title:
Issues in Bayesian Neural Network Modeling
The motivation for much of my research on neural networks has been the
issue of model selection (and model averaging). How can we choose the
best size of network and the best subset of explanatory variables?
This question is often ignored in the neural network literature. I
will present an algorithm for model selection that can be used for
neural networks as well as many other problems.
An issue that arises in Bayesian model selection is the estimation of
the marginal probability of the data, which is the normalizing
constant for the posterior. This is an open area of research, and
current methods tend to fail miserably when applied to neural
networks.
Finally, I will present some asymptotic consistency results.
Frequentist neural networks have been shown to be asymptotically
consistent for all nice true regression functions. I extend these
results to the Bayesian context, showing asymptotic consistency of the
posterior.
Speaker: Simon Godsill (Cambridge University)
Title:
Sequential Monte Carlo Inference for Large and Evolving Datasets: the
Particle Filter
When performing Bayesian inference for complex statistical models, it is
standard practice to apply Monte Carlo methods such as importance
sampling or Markov chain Monte Carlo (MCMC). However, in cases where
very large datasets are involved, especially when the data arrive point
by point (sequentially), it will be impractical to apply such
batch-based methods, both computationally and in terms of the computer
memory requirements. Instead it will be necessary to devise methods in
which a representation of the posterior distribution is allowed to
evolve in time as successive data points arrive. I will describe the
formulation of state space dynamical models for evolving datasets and
review the standard methods for dealing with the problem, such as the
classic Kalman filter. I will then go on to describe state of the art
methods based on particle filters. In these, a `cloud' of random
particles is constructed to approximate a posterior distribution. When a
new data point arrives, the cloud is updated using importance sampling
methods. I will show how these methods can be refined by adding in
elements of Markov chain Monte Carlo, and will outline some
applications to time-varying autoregressions and speech data.
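To make the particle cloud idea concrete, here is a minimal bootstrap
particle filter in Python for a toy AR(1)-plus-noise state space model
(illustrative only; the model, settings, and resampling scheme are my
assumptions, and the talk's refinements such as MCMC moves are not
included).

    import numpy as np

    rng = np.random.default_rng(2)

    # Toy model: x_t = 0.9 x_{t-1} + v_t, v_t ~ N(0,1);  y_t = x_t + w_t, w_t ~ N(0,1)
    phi, q, r, T = 0.9, 1.0, 1.0, 100
    x = np.zeros(T); y = np.zeros(T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + rng.normal(scale=np.sqrt(q))
        y[t] = x[t] + rng.normal(scale=np.sqrt(r))

    N = 1000                                    # number of particles in the cloud
    particles = rng.normal(scale=2.0, size=N)   # initial cloud
    filtered_means = []
    for t in range(T):
        # propagate the cloud through the state equation
        particles = phi * particles + rng.normal(scale=np.sqrt(q), size=N)
        # importance weights from the likelihood of the new observation
        w = np.exp(-0.5 * (y[t] - particles) ** 2 / r)
        w /= w.sum()
        filtered_means.append(np.sum(w * particles))
        # resample to avoid weight degeneracy
        particles = particles[rng.choice(N, size=N, p=w)]
    print(filtered_means[-5:])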
Speaker: Neil Shephard (University of Oxford)
Place: Social Sciences Building, Room 111
Title:
Analysis of High Dimensional Multivariate Stochastic Volatility Models
This paper is concerned with the fitting and comparison of high dimensional
multivariate time series models with time varying correlations. The
models considered here combine features of the classical factor model with those of the univariate stochastic volatility model. Specifically, a set of
unobserved time-dependent factors, along with an associated loading matrix, are used to model the contemporaneous correlation while,
conditioned on the factors, the noise in each factor and each series is assumed to follow independent three-parameter univariate stochastic
volatility processes. A complete analysis of these models, and their special
cases, is developed that encompasses estimation, filtering and model
choice. The centerpieces of our estimation algorithm (which relies on MCMC
methods) are (1) a reduced blocking scheme for sampling the free
elements of the loading matrix and the factors and (2) a special method for
sampling the parameters of the univariate SV process. The sampling of
the loading matrix (containing typically many hundreds of parameters) is done
via a highly tuned Metropolis-Hastings step. The resulting algorithm
is completely scalable in terms of series and factors and very
simulation-efficient. We also provide methods for estimating the log-likelihood
function and the filtered values of the time-varying volatilities and
correlations. We pay special attention to the problem of comparing one version
of the model with another and for determining the number of factors. For this
purpose we use MCMC methods to find the marginal likelihood
and associated Bayes factors of each fitted model. In sum, these procedures lead to the first unified and practical likelihood based analysis of
truly high dimensional models of stochastic volatility. We apply our methods in detail to two datasets. The first is the return vector on 20 exchange
rates against the US Dollar. The second is the return vector on 40 common stocks quoted on the New York Stock Exchange.
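In generic notation (chosen here for illustration; the paper's exact
parameterization may differ), the model described above takes the form

\[
  y_t = B f_t + u_t, \qquad
  f_{jt}\mid h_{jt} \sim N\!\bigl(0, e^{h_{jt}}\bigr), \qquad
  u_{it}\mid \lambda_{it} \sim N\!\bigl(0, e^{\lambda_{it}}\bigr),
\]

where B is the loading matrix, the common factors f_t drive the
time-varying correlations, and each log-volatility follows its own
three-parameter autoregression, e.g.

\[
  h_{j,t+1} = \mu_j + \phi_j (h_{jt} - \mu_j) + \eta_{jt}, \qquad
  \eta_{jt} \sim N(0, \sigma_{\eta j}^2).
\]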
Speaker: Christina Geyer
Title:
Detecting Manipulation in Data Sets Using Benford's Law
Benford's Law is a numerical phenomenon in which the first significant
digits of sets of data that are counting or measuring some fact follow a
certain multinomial distribution. We will begin by giving a history of
Benford's Law and defining which data sets are expected to follow
Benford's Law. Next, we will look at two statistical detection methods
currently employing Benford's Law, one a classical test of the mean of
the first significant digits, the other a Bayesian test of the
distribution of first significant digits. We will then discuss a Bayesian
test of the mean of the first significant digits and a classical test of
the distribution of first significant digits. Finally, we will compare
these methods using simulated data.
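For reference, Benford's Law assigns first significant digit d the
probability log10(1 + 1/d), d = 1,...,9. The short Python sketch below
(illustrative only; the simulated data set and the plain chi-square check
are my assumptions, not the specific tests compared in the talk)
tabulates first digits and computes a classical goodness-of-fit statistic
against these proportions.

    import numpy as np

    digits = np.arange(1, 10)
    benford = np.log10(1 + 1 / digits)      # P(first digit = d) = log10(1 + 1/d)

    def first_digits(values):
        """Leading significant digit of each positive value."""
        v = np.abs(np.asarray(values, dtype=float))
        v = v[v > 0]
        return (v / 10 ** np.floor(np.log10(v))).astype(int)

    # hypothetical data set spread over several orders of magnitude
    rng = np.random.default_rng(3)
    data = np.exp(rng.uniform(0, np.log(10 ** 6), size=5000))
    obs = np.bincount(first_digits(data), minlength=10)[1:]

    # Pearson chi-square statistic against the Benford proportions
    expected = benford * obs.sum()
    chi2 = np.sum((obs - expected) ** 2 / expected)
    print(dict(zip(digits, np.round(obs / obs.sum(), 3))), "chi2 =", round(chi2, 2))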
Speaker: Shanti S. Gupta (Purdue University)
Title:
On Empirical Bayes Selection Procedures
The talk will deal with selection of good exponential populations compared
with a control. After deriving the Bayes rules, an empirical Bayes rule is
constructed. Its asymptotic behavior is studied. It is shown that the
empirical Bayes rule has a convergence rate of order n^{-r/2} for some
r between 0 and 2. Time permitting, some recent work on the
empirical Bayes selection
of Poisson populations will be presented.
Speaker: Ashish Sanil (National Institute of Statistical Sciences)
Title:
Nonparametric Regression Based on L-Splines
Cubic spline smoothing is a popular nonparametric regression
technique. The method involves fitting a function to data while
penalizing the size of its second derivative. This ensures
that the fitted function will have few sharp "wiggles", and therefore
appear smooth. L-Spline methods are extensions of the cubic smoothing
spline which allow one to penalize the size of any linear differential
operator L applied to the fitted function, and thereby incorporate prior
notions of the
behavior of the underlying function into the fitting procedure. This
talk is primarily an introduction to L-Spline methodology. I will also
talk about the close connection between L-Splines and some other
techniques such as Kriging and Gaussian process smoothing. Finally, I
will discuss some research issues concerning model-selection,
computational algorithms, etc.
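In the usual penalized least-squares notation (mine, for illustration),
the L-spline estimate solves

\[
  \min_f \; \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  \;+\; \lambda \int \bigl[(Lf)(t)\bigr]^2 \, dt,
\]

where taking L = D^2 (the second derivative) recovers the cubic
smoothing spline, while other operators, such as L = D^2 + \omega^2 for
roughly periodic behavior, penalize departures from the null space of L
and thereby encode prior notions about the underlying function.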
Speaker: Viridiana Lourdes
Title:
Bayesian Analysis of Longitudinal Data in a Case Study in the VA Hospital
System
As part of a long-term concern with measures of "quality-of-care" in
the VA hospital system, the VA Management Sciences Group is involved
in large-scale data collection and analysis of patient-specific data
in many care areas. Among the many variables of interest are the observed
times of "return to follow-up care" of individuals discharged
following initial treatment. Follow-up protocols supposedly encourage
regular and timely returns, and observed patterns of variability in
return time distributions are of interest in connection with questions
about influences on return times that are specific to individual
patients, care areas, hospitals, and system-wide policy changes.
Our study has explored such issues in the area of psychiatric and
substance abuse patients across the nationwide system of VA
hospitals.
The study is ongoing, and in this talk I will touch on selected
aspects of the modelling and data analysis investigations we've
developed to date. In particular, I will discuss our studies of
discrete duration models that are designed to help us understand and
estimate the effects on return times that are specific to individual
hospitals -- the primary question for the VA -- in the context of a
large collection of potential additional covariates. We adopt
logistic regression models to describe discretised representations of
the underlying continuous return time distributions. These models
take into account categorical covariates related to the several
socio-demographic characteristics and aspects of medical history of
individual patients, and treat the primary covariate of interest --
the VA hospital factor -- using a random effects/hierarchical
prior. Our models are analysed in parallel across a range of chosen
"return time cut-offs", providing a nice analogue method of exploring
and understanding how posterior distributions for covariate effects
and hyperparameters vary with chosen cut-off. This perspective allows
us to identify important aspects of the non-proportional odds
structure exhibited by this very large and rich data set, by
isolating important and interesting interactions between cut-offs and
specific covariates. Summarisation of the sets of high-dimensional
posterior distributions arising in such an analysis is challenging,
and is most effectively done through sets of linked graphs of
posterior intervals for covariate effects and other derived
parameters.
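One way to read this description (my own schematic notation; the actual
model may differ in detail): for each chosen cut-off c, the binary
indicator of return within c days is modeled as

\[
  \mathrm{logit}\,\Pr(T_i \le c \mid x_i)
  \;=\; x_i^{\mathsf T}\beta_c + \gamma_{h(i),c}, \qquad
  \gamma_{h,c} \sim N(0, \tau_c^2),
\]

where x_i collects the socio-demographic and medical-history covariates,
h(i) indexes the hospital treating patient i, and the hierarchical
hospital effects \gamma_{h,c} are the quantities of primary interest;
comparing posteriors across cut-offs c is what reveals the
non-proportional odds structure.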
We explore and exemplify this work with a full year's data, from
1997, and then discuss extended models for multiple years that include
time series structure to address dependencies across years. Issues of
covariate selection via Bayesian model selection methods, and other
practical questions, arise, and will be mentioned as time permits.
This is joint work with Mike West and Jim Burges.
Speaker: Marco Ferreira
Title:
Model Selection and the Minimum Description Length Principle
The Minimum Description Length (MDL) principle
for model selection will be described. In the MDL framework, the length
of the shortest description of the observed data achievable under each
candidate model is used to decide between models. The MDL principle
will be illustrated by its application to
selection among linear regression models.
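As a concrete, if crude, illustration, one standard two-part MDL
criterion for Gaussian linear regression charges about (n/2) log(RSS/n)
for the data plus (k/2) log n for the k fitted coefficients (equivalent
up to constants to BIC); the talk may well use a more refined code. The
Python sketch below, with invented data, selects the subset of
regressors minimizing this description length.

    import itertools
    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 100, 4
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)   # only columns 0 and 2 matter

    def description_length(cols):
        """Two-part code length for the regression of y on the given columns."""
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ beta) ** 2)
        k = Z.shape[1]
        return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

    models = [cols for r in range(p + 1) for cols in itertools.combinations(range(p), r)]
    best = min(models, key=description_length)
    print("selected columns:", best)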
Speaker: Jim Berger
Title:
The Controversy Over P-values
Probably the most commonly used statistical tool is the P-value or
observed significance level. Statisticians have long struggled against
the misuse and misinterpretation of P-values, but a genie can be hard
to put back in the bottle. The debate about P-values has now even
reached the popular press, with articles appearing that seriously
question the value of research findings that are based primarily on
P-values.
There is nothing inherently wrong with a P-value, in that it is a
statistical measure of something and virtually all statisticians feel
that it has at least some valid uses. The problems arise when it is
used in the wrong way or interpreted to mean something that it is not.
The wrong ways in which P-values are used will be briefly reviewed,
but this is a common theme in statistics and science and so will be
discussed only briefly.
The less well understood problem with P-values is that their common
misinterpretation as some type of error probability (or, even worse,
as the probability of an hypothesis) can lead to very erroneous
conclusions. As an example, in testing a precise hypothesis, such as
the null hypothesis that an experimental drug is negligibly different
in effect from a placebo, a P-value of 0.05 is commonly thought to
indicate significant evidence against the hypothesis whereas, in
reality, it implies that the evidence is essentially balanced between
the hypotheses. This will be demonstrated in a variety of ways and the
effect illustrated in several examples. Finally, a simple calibration
of P-values will be proposed that can, at least, prevent the worst
misinterpretations.
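One calibration discussed in related work by Sellke, Bayarri and Berger
is the lower bound -e p log(p) on the Bayes factor in favour of a
precise null hypothesis; whether this is exactly the calibration
proposed in the talk is my assumption. The short Python sketch below
evaluates it and the corresponding conditional error probability,
assuming equal prior odds.

    import math

    def bayes_factor_bound(p):
        """Lower bound on the Bayes factor in favour of a precise null, for p < 1/e."""
        return -math.e * p * math.log(p)

    def conditional_error_bound(p):
        """Corresponding lower bound on the error probability when rejecting H0."""
        b = bayes_factor_bound(p)
        return b / (1 + b)            # assumes prior odds of 1 between hypotheses

    for p in (0.05, 0.01, 0.001):
        print(p, round(bayes_factor_bound(p), 3), round(conditional_error_bound(p), 3))
    # e.g. p = 0.05 gives a Bayes factor bound of about 0.41, i.e. an error
    # probability of roughly 0.29 rather than 0.05.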
Speaker: Peter Müller
Title:
An Adaptive Bayesian Dose-Finding Design
We propose a fully decision theoretic design for a dose-finding
clinical trial.
The proposed solution is based on a probability model
for the unknown dose/response curve and a utility function which
formalizes the relative preferences over alternative outcomes.
The chosen probability model for the dose/response curve is a normal
dynamic linear model (NDLM) which allows computationally efficient
analytic posterior inference. The proposed utility function formalizes
learning about the dose/response curve as minimizing posterior
variance in some key parameters of the dose response curve, for
example, the unknown response at the unknown ED95 dose.
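In schematic form (notation mine; the parameterization in the talk may
differ), an NDLM indexed by the ordered doses d = 1, ..., D models the
response of a patient assigned dose d as

\[
  y \mid d, \theta \sim N(\theta_d, \sigma^2), \qquad
  \theta_d = \theta_{d-1} + \delta_d, \qquad \delta_d \sim N(0, \tau^2),
\]

so the mean dose/response curve (\theta_1, ..., \theta_D) is a smoothed
random walk across doses whose posterior is available through the
standard DLM filtering and smoothing recursions, which is what keeps the
repeated expected-utility calculations computationally feasible.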
To find the optimal dose to be assigned to the respective next patient
we maximize the utility function in expectation, marginalizing over all
random variables which are unknown at the time of decision making,
including unknown model parameters and still un-observed future
responses.
The proposed approach is myopic in the sense that each day
we compute the optimal doses as if the patients arriving on that day
were the last ones recruited into the trial.
Speaker: Rodney Sparapani
Title:
Cardiac Enzymes as a Clinical Outcome
Creatine kinase MB (CK-MB) measurements are commonly used in clinical
practice to assess myocardial infarction (MI). An MI diagnosis is usually
based solely on CK-MB measurements, but may make use of other cardiac
enzymes such as total creatine kinase, Troponin I, Troponin T or
electro-cardiograms. Furthermore, CK-MB measurements are proportional to
the severity of an MI and the probability of mortality. The efficacy
endpoint in many cardiovascular trials is the composite of death and MI in
either an event rate comparison or a survival analysis. A problem with
composite endpoints is that they ignore the severity of the MI and may
subsequently require larger sample sizes due to a loss of information. For
example, what if a compound doesn't prevent MI, but instead decreases its
severity? In that case, even a larger sample size might miss the treatment
effect entirely. A logical candidate for a quantitative, and potentially
more powerful, endpoint would be CK-MB measurements. We'll look at how
CK-MB measurements might be used as a clinical outcome of a clinical trial.
It is assumed that this approach could be extended to other cardiac enzymes
as well.
Speaker: Arnaud Doucet (Cambridge University)
Title:
Sequential Monte Carlo Methods and Applications to Signal Processing
Sequential Bayesian estimation is important in many applications involving
real-time signal processing, where data arrival is inherently sequential.
Classical suboptimal methods for sequential Bayesian estimation include
the Extended Kalman filter and the Gaussian sum approximation. Although
these methods are easy to implement, they also perform poorly for many
nonlinear and/or non-Gaussian models. Sequential Monte Carlo (SMC) methods
(also known as particle filters) are a set of simulation-based methods
which make it possible to address these complex problems. I will present a
generic algorithm and discuss several improvement strategies. Convergence
results will also be briefly reviewed. Finally, I will describe a few
signal processing applications of SMC: digital communications, target
tracking in clutter noise and speech enhancement.
Speaker: John Kern
Title:
Bayesian Spatial Covariance Models
The impact of covariance function choice on Gaussian process models for
spatially continuous data will be examined through a comparison of
traditional selection methods with non-traditional, non-parametric
selection methods. These non-traditional methods represent the leading
edge of spatial covariance analyses (at least for one person), and make
use of process convolution models, which, along with Gaussian process
models, will be presented in enough detail for even the most novice
spatial data analyst to feel comfortable.
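For readers new to process convolutions, a common discrete form of the
construction (notation mine) is

\[
  z(s) \;=\; \sum_{j=1}^{m} k(s - u_j)\, x_j, \qquad
  x_j \stackrel{iid}{\sim} N(0,1),
\]

so that z is a Gaussian process with covariance
$\mathrm{Cov}\{z(s), z(s')\} = \sum_j k(s-u_j)\,k(s'-u_j)$ over a grid of
sites u_j; the covariance model is specified indirectly through the
smoothing kernel k, which can itself be allowed to vary over space,
giving the non-parametric flexibility alluded to above.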
Speaker: Hedibert Lopes
Title:
Factor Models, Stochastic Volatility, and Meta-analysis
This talk will be based on part of the research I have been
doing for the past two years here at Duke. The material
presented will eventually be part of my PhD thesis, so below I
try to summarize the overall idea of each of the three main
projects I am currently involved in. Since part 1 (Model uncertainty in
factor analysis) has already been extensively presented, I will spend
most of the talk on parts 2 and 3. However, part 1 will be briefly
touched on when talking about model comparison in situations where prior
information for the factor model parameters is lacking, in which case
we will refer to recent developments by Perez and Berger.
[Part 1] MODEL UNCERTAINTY IN FACTOR ANALYSIS (Duke Statistics DP #38-98)
(with Mike West)
Bayesian inference in factor analytic models has received renewed
attention in recent years, partly due to computational advances but
also partly to applied problems generating interest in factor structures, as
exemplified by recent work in financial time series modeling.
The focus of our current work is on exploring questions of uncertainty
about the number of latent factors in a multivariate factor model, combined
with methodological and computational issues of model specification and
model fitting. We explore reversible jump MCMC methods that build on
sets of parallel Gibbs sampling-based analyses to generate suitable
empirical proposal distributions and that address the challenging
problem of finding efficient proposals in high-dimensional models.
Alternative MCMC methods based on bridge sampling are discussed, and these
fully Bayesian MCMC approaches are compared with a collection of popular
model selection methods in empirical studies. Various additional
computational issues are discussed, and the methods are explored in
studies of some simulated data sets and an econometric time series
example. More recently, we have also explored Perez and Berger's
expected posterior priors in a couple of factor analysis examples.
[Part 2] FACTOR MODEL AND MULTIVARIATE STOCHASTIC VOLATILITY:
TIME-EVOLVING LOADINGS AND SIMULATED-BASED SEQUENTIAL
ANALYSIS
(with Mike West and Omar Aguilar)
In this project we extend previous work from Aguilar and West (1999)
(Duke Statistics DP #98-03) on factor models with multivariate stochastic
volatilities. Our main contribution consists of allowing the factor
loadings to evolve with time in a way that virtually retains the
interpretation of the factor scores throughout time. In other words,
the weights that some factors might have on a particular subset of
time series are allowed to change with time, mimicking real
financial/economic scenarios. An example occurs when a country (or
countries) enters/leaves a particular market, where such a market might
have been well represented up to that point in time by a common factor.
From a more technical viewpoint, we also improve on previous work by
allowing the model parameters and states to be estimated based on the
information available up to that time, that is, sequentially. We
implemented auxiliary particle filter ideas to deal with the states in
our model, while the fixed parameters were sampled sequentially
according to recent developments by Liu and West (Duke Statistics DP
#99-14). We will present some preliminary results
based on Aguilar and West's application on international exchange rate
returns.
[Part 3] META-ANALYSIS FOR LONGITUDINAL DATA MODELS USING
MULTIVARIATE MIXTURE PRIORS
(with Peter Mueller and Gary Rosner)
We propose a class of longitudinal data models with random effects
which generalize currently used models in two important aspects.
First, the random effects model is a flexible mixture of multivariate
normals, accommodating population heterogeneity, outliers and
non-linearity in regression on subject-specific covariates. Second,
the model includes a hierarchical extension to allow for meta-analysis
over related studies.
The random effects distributions are split into one part
which is common across all related studies (common measure), and one
part which is specific to each study and captures the variability
intrinsic within patients from the same study.
Both the common measure and the study-specific measures are
parametrized as mixtures of normals. To allow a
random number of terms in the mixtures we introduce a reversible jump
algorithm for posterior simulation. The proposed methodology
is broad enough to embrace current hierarchical models and it allows
for "borrowing-strength" between studies in a new and simple way.
The motivating application is the analysis of two studies carried out
by the Cancer and Leukemia Group B (CALGB).
In both studies, we record each patient's white blood cell count
(WBC) over time to characterize the toxic effects of treatment. The WBC
counts are modeled through a nonlinear hierarchical model that
gathers the information from both studies.
Jim Berger
August, 1999