Speaker: Stanley Pound, Texas A&M University
Title: Cluster Analysis of Amplified Fragment Length Polymorphism
Amplified Fragment Length Polymorphism (AFLP) is a technique for obtaining a DNA fingerprint. Currently, there is considerable interest among biologists in using AFLP data to group organisms into families. A special beta-binomial model is developed for AFLP and utilized to conduct cluster analysis. An analysis of 11 varieties of rice is presented.
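For reference, a generic beta-binomial setup reads as follows; the notation is ours, and how the model is specialized to AFLP band data and turned into a clustering criterion is the subject of the talk. If organism $i$ exhibits $x_i$ of the $n$ scored AFLP bands, then

$$ x_i \mid p_i \sim \mathrm{Binomial}(n, p_i), \qquad p_i \sim \mathrm{Beta}(\alpha, \beta), $$

so that, marginally,

$$ P(x_i = x) = \binom{n}{x} \frac{B(x + \alpha,\; n - x + \beta)}{B(\alpha, \beta)}, \qquad x = 0, 1, \dots, n, $$

where $B(\cdot,\cdot)$ is the beta function. The extra-binomial variation induced by the Beta mixing is what makes this family a natural starting point for presence/absence marker data.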
Speaker:
Title: Optimal predictive model selection
Often the goal of model selection is to choose a model for future prediction, and it is natural to measure the accuracy of a future prediction by squared error loss. Under the Bayesian approach, it is commonly perceived that the optimal predictive model is the model with highest posterior probability, but this is not necessarily the case. In this talk we show that, for selection among normal linear models, the optimal predictive model is often the median probability model, which is defined as the model consisting of those variables that have overall posterior probability of at least 1/2 of being in a model. The median probability model often differs from the highest probability model. Examples include nonparametric regression and ANOVA.
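A toy illustration of the definition (the model probabilities below are made up for illustration, not taken from the talk): the posterior inclusion probability of each variable is the total posterior probability of the models containing it, and the median probability model keeps exactly the variables whose inclusion probability is at least 1/2.

    # Toy posterior model probabilities over subsets of {x1, x2, x3} (illustration only).
    post = {
        frozenset({"x1"}): 0.35,          # the highest probability model
        frozenset({"x1", "x2"}): 0.25,
        frozenset({"x2", "x3"}): 0.20,
        frozenset({"x2"}): 0.20,
    }
    variables = {"x1", "x2", "x3"}

    # Posterior inclusion probability of each variable.
    inclusion = {v: sum(p for m, p in post.items() if v in m) for v in variables}

    median_model = sorted(v for v, p in inclusion.items() if p >= 0.5)
    highest_model = sorted(max(post, key=post.get))

    print(inclusion)       # x1: 0.60, x2: 0.65, x3: 0.20
    print(median_model)    # ['x1', 'x2']
    print(highest_model)   # ['x1']

In this toy example the median probability model is {x1, x2} while the highest probability model is {x1}, illustrating how the two selections can differ.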
Speaker: Trent Buskirk, University of Nebraska, Lincoln
Title: The Second and Third Dimensions of Drug Use
You've designed a survey, selected your sample, overcome nonresponse, collected data, and now you would like to conduct several hypothesis tests to refute or confirm your research hypotheses. Some of these tests are nonparametric and assume that the underlying distributions of the variables you are interested in are symmetric. Other tests assume that the bivariate density function is symmetric. How will you perform diagnostics to confirm or refute such assumptions? Can you obtain a picture of the density functions for the variables of interest using data collected under a specified survey design?
In this presentation, we will discuss some of the current methodologies available for creating density estimates from survey data. In particular, we will introduce a bivariate kernel estimator that can be used to estimate the joint distributions of pairs of variables that are of interest to surveyors. As an illustration, we will use data from the 1998 National Household Survey on Drug Abuse to explore bivariate relationships between smoking and alcohol use, between marijuana and cocaine use, and many others.
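A minimal sketch of a design-weighted bivariate kernel estimator of the kind discussed, assuming a Gaussian product kernel and a crude rule-of-thumb bandwidth; the function and its defaults are our own illustration rather than the estimator developed in the talk.

    import numpy as np

    def weighted_bivariate_kde(x, y, w, gridsize=64, bw=None):
        """Illustrative design-weighted bivariate KDE with a Gaussian product kernel.

        x, y : the two survey variables; w : the survey weights.
        Returns grid points gx, gy and the density surface dens (gridsize x gridsize).
        """
        x, y, w = map(np.asarray, (x, y, w))
        w = w / w.sum()                              # normalize weights to sum to 1
        if bw is None:
            # crude rule-of-thumb bandwidths (illustration only); a real analysis
            # would choose these with more care and account for the design effect
            n_eff = 1.0 / np.sum(w ** 2)             # Kish effective sample size
            bw = [1.06 * np.std(v) * n_eff ** (-1 / 6) for v in (x, y)]
        gx = np.linspace(x.min(), x.max(), gridsize)
        gy = np.linspace(y.min(), y.max(), gridsize)
        kx = np.exp(-0.5 * ((gx[:, None] - x) / bw[0]) ** 2) / (bw[0] * np.sqrt(2 * np.pi))
        ky = np.exp(-0.5 * ((gy[:, None] - y) / bw[1]) ** 2) / (bw[1] * np.sqrt(2 * np.pi))
        # dens[i, j] = sum_k  w_k * K((gx_i - x_k)/h_x) * K((gy_j - y_k)/h_y)
        dens = kx @ (w[:, None] * ky.T)
        return gx, gy, dens

With survey data, w would typically be the final analysis weights (for example, inverse inclusion probabilities adjusted for nonresponse), so that the estimate targets the population density rather than the sample density.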
Speaker: David Banks, Bureau of Transportation Statistics
Title: Mining Superlarge Datasets
Data mining has sprung up at the nexus of computer science, statistics, and management information systems. It attempts to find structure in large, high-dimensional datasets. To that end, the different disciplines have each developed their own repertoire of tools, which are now beginning to cross-pollinate and produce an improved understanding of structure discovery. This talk looks at three problems in this area: (1) preanalysis, which subsumes data cleaning and data quality; (2) indexing, which is related to cluster analysis; and (3) analysis, which includes a comparison of such new-wave techniques as MARS, neural nets, projection pursuit regression, and other methods for structure discovery.
Speaker: David Schwartz, Department of Neurobiology
Title: In search of a probabilistic theory of tonal perception
Over 2000 years ago the Greek philosopher Pythagoras first recorded the observation that the pleasantness of the sound made by plucking two strings varies as a function of the ratio between the strings' physical dimensions. Length ratios that could be expressed using small integers (e.g., 2:1, 3:2, 4:3) evoked more pleasant sounds than did ratios that could only be expressed in terms of larger integers (e.g., 7:5, 9:8, 42:35). Pythagoras and subsequent Medieval and Renaissance scholars so marveled at this coincidence of mathematics and esthetics that they built an entire cosmology around it. The best they could offer by way of explanation, however, was that small numbers were more perfect than large numbers, because they better reflected the mind of God.
In the 19th century the German physicist Helmholtz observed that a plucked string vibrates simultaneously at many different frequencies: at the fundamental frequency, corresponding to the pitch we hear, and at higher frequencies whose values are integer multiples of the fundamental. These higher frequencies are termed harmonics. On the basis of this observation Helmholtz proposed a psychophysical theory to explain the relationship between frequency ratios and tonal perception. He proposed that the harmonics of tones related by small integer frequency ratios either coincide or are widely spaced, and thus do not interfere with each other at the auditory receptor surface; they therefore evoke pleasant sounds. The harmonics of tones related by large integer frequency ratios, however, neither coincide nor are spaced widely enough to avoid interfering with each other. The dissonance we hear, Helmholtz suggested, is caused by this interference, or beating, among the higher harmonics of the two tones.
Von Bekesy's observation in the 20th century that the basilar membrane of the inner ear could physically instantiate the process Helmholtz described lent further credibility to the theory. However, in the 150 years since Helmholtz first published his ideas, no one has produced any direct evidence that perceptual interference of the kind Helmholtz described actually occurs, while evidence has been presented suggesting that such interference, even if it does occur, cannot explain the phenomenon of tonal dissonance. Nonetheless, despite the lack of any empirical support for Helmholtz's theory, textbooks continue to espouse it, perhaps because no one as yet has put forth any viable alternative account.
In the Purves lab we have been exploring an alternative way of conceiving of tonal dissonance in particular and tonal perception in general. Beginning with the observations that 1) any frequency ratio can be formed by pairing two harmonics from a single harmonic series, and 2) a set of harmonically related tones is typically perceived as a single pitch corresponding to the fundamental frequency of the series, we propose that the phenomena of tonal consonance and dissonance reflect the empirical likelihood of a particular amplitude distribution among the frequencies of a harmonic series. We have supported this conception by demonstrating that any dissonant tonal combination can be made consonant by adding amplitude at appropriate lower frequencies, and any consonant tonal combination can be made dissonant by removing amplitude from the lower frequencies of the harmonic series composing the tone. We are currently attempting to express this conception in the form of a probabilistic model, in which perceived consonance can be predicted as a function of the probabilistic relationship between two different amplitude distributions of a given harmonic series.
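A small numerical illustration of the first observation (the fundamental chosen below is arbitrary): familiar musical ratios appear as ratios of two harmonics drawn from one and the same harmonic series.

    # Harmonics of a single series with an arbitrary fundamental f0 = 110 Hz.
    f0 = 110.0
    harmonics = [n * f0 for n in range(1, 13)]   # f0, 2*f0, ..., 12*f0

    fifth = harmonics[2] / harmonics[1]          # 3rd vs 2nd harmonic: 3/2 = 1.5
    fourth = harmonics[3] / harmonics[2]         # 4th vs 3rd harmonic: 4/3
    major_second = harmonics[8] / harmonics[7]   # 9th vs 8th harmonic: 9/8

    print(fifth, fourth, major_second)           # 1.5  1.333...  1.125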
Speaker: Stephen Ponisciak
Title: Bayesian Analysis of Teacher Effectiveness
The quality of schools and teachers has become an important issue over the past several years. Several authors have discussed the propriety of using students' scores on state or national tests to evaluate the performance of individual students, teachers, and schools. More recently, other authors (most notably Sanders and Horn, 1998, "Research Findings from the Tennessee Value-Added Assessment System (TVAAS) Database: Implications for Educational Evaluation and Research," Journal of Personnel Evaluation in Education, 12:3, 247-256) have advocated a "value-added" approach, in which a teacher's performance is evaluated on the basis of his or her students' gains on the norm-referenced part of a statewide exam. Standardized tests bring with them a long history of controversy. Seeking to avoid these arguments (and to create new ones), we have developed a system to evaluate teacher performance based on students' grades in subsequent courses rather than test scores. Using five years of data from high schools in one school district, we evaluate the quality of teachers based on students' course grades by treating the teacher and student as random effects in an ordinal probit model with latent variables (as in Johnson, 1997, "An Alternative to Traditional GPA for Evaluating Student Performance," Statistical Science, 12:4, 251-278). Our analysis will include only courses of study that are clearly consecutive, such as mathematics and foreign languages. We will evaluate whether student-level variables, such as the student's average grade in other courses and demographic variables, should be included in the model, and we will assess the fit of the model by examining sorted latent residuals and determining how many of the pairwise comparisons are correctly evaluated.
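A schematic version of the kind of model described (our notation; the details of the actual specification are in the talk): the grade $g_{ij}$ of student $i$ in a course taught by teacher $j$ is generated by a latent variable falling between ordered cutpoints,

$$ Z_{ij} = \mu + \tau_j + s_i + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0, 1), $$
$$ g_{ij} = k \iff \gamma_{k-1} < Z_{ij} \le \gamma_k, $$

with teacher effects $\tau_j$ and student effects $s_i$ treated as random effects, e.g. $\tau_j \sim N(0, \sigma_\tau^2)$ and $s_i \sim N(0, \sigma_s^2)$, and ordered cutpoints $\gamma_0 < \gamma_1 < \cdots < \gamma_K$. Student-level covariates, if included, would enter the mean of $Z_{ij}$.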
Speaker: Kate Calder
Title: Assessing Sources of Uncertainty in a Dynamic Forest Model
The dynamics of forest ecosystems are often studied using complex computer models. These models help ecologists predict how forest characteristics will change over time if the environment is modified. The Clark lab at Duke University has studied the complete demographic life history of the tree species in two forests in North Carolina and has collected large amounts of data on reproduction, dispersal, growth, and survival as well as data on the local environment such as understory light, soil moisture, and soil nutrients. Their goal is to estimate parameters of a forest stand simulator using the data. We address a statistical issue related to this forest model: how can the data be used to assess the amount of accuracy gained by adding complexity to the model? We restate this problem as a Bayesian inverse problem and use MCMC techniques to perform model selection for forest models of different levels of complexity.
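In generic terms (our notation), the comparison of forest models $M_1, \dots, M_K$ of increasing complexity rests on their posterior probabilities,

$$ P(M_k \mid y) \;\propto\; P(M_k) \int p(y \mid \theta_k, M_k)\, \pi(\theta_k \mid M_k)\, d\theta_k, $$

where $y$ denotes the demographic and environmental data and $\theta_k$ the parameters of model $M_k$; the integrals are not available in closed form for a simulator-based model, which is where the MCMC machinery enters.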
Speaker: German Molina
Title: A case study in model selection for probit models with dichotomous data
Our motivating example requires probit analysis of binary data. We have 7 binary variables of interest. These define 128 potential covariates, of which 72 are actually possible. This leaves us with 2^72 possible models to explore (somewhat fewer, since we restrict ourselves to nested models). We define the prior on the full model and induce the corresponding priors on the parameters of the submodels by marginalizing and conditioning. The Laplace approximation is used to compute the marginal posterior of each such submodel. We also present a scheme for searching through the model space that makes use of the previously computed marginals. Model averaging is used for inference on the parameters of interest. A comparison with BIC-based results is provided. This is joint work with Jim Berger.
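For reference, a generic form of the Laplace approximation applied to each submodel (notation ours): writing $\theta_M$ for the parameters of submodel $M$, with dimension $d_M$,

$$ m(y \mid M) = \int p(y \mid \theta_M, M)\, \pi(\theta_M \mid M)\, d\theta_M \;\approx\; (2\pi)^{d_M/2}\, |\hat{\Sigma}_M|^{1/2}\, p(y \mid \hat{\theta}_M, M)\, \pi(\hat{\theta}_M \mid M), $$

where $\hat{\theta}_M$ maximizes the integrand and $\hat{\Sigma}_M$ is the inverse of the negative Hessian of its logarithm at $\hat{\theta}_M$. Posterior model probabilities, and hence the model-averaged inferences, follow by normalizing $m(y \mid M) P(M)$ over the models visited by the search.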
Speaker: William H. Jefferys, University of Texas
Title: Correlation With Errors-in-Variables and an Application to Galaxies
The usual theory of correlation coefficients fails when one or both of the observed quantities are measured with error. I describe a Bayesian approach that correctly handles this errors-in-variables problem and that also gives posterior odds for the model with correlation against the model without correlation, allowing a decision as to whether the correlation should be considered real.
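A minimal version of the setup (our notation, with known measurement-error variances for simplicity): the observed pairs are noisy versions of latent true values,

$$ x_i = \xi_i + \epsilon_i, \qquad y_i = \eta_i + \delta_i, \qquad \epsilon_i \sim N(0, \sigma_{x,i}^2), \quad \delta_i \sim N(0, \sigma_{y,i}^2), $$

with the latent $(\xi_i, \eta_i)$ bivariate normal with correlation $\rho$. The comparison is then between $M_0: \rho = 0$ and $M_1: \rho \neq 0$, and the posterior odds $P(M_1 \mid \text{data}) / P(M_0 \mid \text{data})$ quantify whether the observed correlation should be taken as real.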
Speaker: Rui Paulo
Title: Complex Computer Models and Default Spatial Priors
The analysis of complex computer models is a very active area of research in statistics. These computer codes usually implement a complex mathematical model that aims to describe a real-life process. In general, the code is built because the real-life process is costly to observe directly, yet the code itself is so computationally intensive that only a limited number of runs is feasible. Nonetheless, running the code is still cheaper than making direct observations of the process.
In this setting, there are numerous questions amenable to statistical treatment, e.g., the problem of how closely the computer code represents reality. We present a general Bayesian strategy for modeling both computer output and field data, along with an application dealing with acceleration curves collected when prototype vehicles crash into a barrier.
The statistical modeling of these data from a Bayesian perspective highlights the need for default specification of prior distributions for the parameters of Gaussian processes. In the last part of this talk, we also present some results in this area, addressing in particular strategies for overcoming numerical problems that arise when computing with these priors.
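A schematic version of the kind of model meant here (our notation, following the general computer-model validation literature rather than the specific formulation of the talk): field measurements $y^F$ at input $x$ are related to the code output $y^M$ by

$$ y^F(x) = y^M(x, u) + b(x) + \epsilon(x), $$

where $u$ are unknown calibration inputs, $b(\cdot)$ is a discrepancy (bias) function capturing how the code differs from reality, and $\epsilon$ is measurement error. Gaussian-process priors are placed on $y^M(\cdot,\cdot)$, to interpolate the limited number of runs, and on $b(\cdot)$; it is the correlation and variance parameters of these processes that call for the default prior specifications discussed in the last part of the talk.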
Speaker: Ana Grohovac Rappold
Title: Polya Trees: What They Are and Their Properties
Bayesian non-parametric and semi-parametric models are attractive alternatives for modeling statistical problems when we are unable or unwilling to make parametric assumptions. In Bayesian non-parametric models we treat the distribution function itself as a random quantity, so we need priors on the space of distribution functions. We also need such priors to be easily updated and conjugate. Polya trees are one way of accomplishing these goals, and they can be constructed to assign probability 1 to the set of continuous distributions. In this lecture, we will present the basics of how Polya trees constitute a prior on the space of probability distributions, how they are constructed, and how they are used in statistical inference.
Much of the material covered in this lecture can be found in papers by Ferguson (1974); Mauldin, Sudderth, and Williams (1992); Lavine (1992); and Lavine (1994).
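As a concrete illustration of the construction, here is a minimal sketch that draws one random distribution from a Polya tree prior on [0, 1), using the canonical dyadic partition and the common choice of Beta parameters growing like the square of the level, one way to place probability 1 on continuous distributions; the parameter choices are ours, for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_polya_tree(levels=8, c=1.0):
        """Draw one random distribution from a Polya tree prior on [0, 1).

        Dyadic partition; at level m the branch probabilities are
        Beta(c*m**2, c*m**2) (illustrative choice). Returns the probabilities
        assigned to the 2**levels finest dyadic subintervals, left to right.
        """
        probs = np.array([1.0])
        for m in range(1, levels + 1):
            a = c * m ** 2
            left = rng.beta(a, a, size=probs.size)   # probability of each set's left child
            probs = np.column_stack([probs * left, probs * (1.0 - left)]).ravel()
        return probs

    p = sample_polya_tree()
    print(p.size, p.sum())    # 256 intervals, total mass 1.0

Each draw returns the random probabilities assigned to the finest partition sets; updating with data is conjugate, amounting to adding observation counts to the Beta parameters along the branches into which the data fall.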