Speaker: Yuguo Chen
Title: Sequential Importance Sampling for Permutation Tests on Truncated Data
There are many statistical applications that are related to permutations with restricted positions. For example, in order to find the level of nonparametric tests for correlation in truncated data, we need to sample uniformly from the set of restricted permutations, which is quite difficult. We provide a sequential importance sampling approach for permutation tests on truncated data. It can also give an estimate of the total number of allowable permutations, i.e., the permanent. We apply it to a quasar luminosity evolution problem.
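As a rough illustration of the idea (and not necessarily the proposal distribution used in the talk), the Python sketch below samples permutations restricted by a 0-1 matrix one position at a time, choosing uniformly among the columns still allowed; the product of the number of choices at each step is an unbiased weight for the permanent, and the weighted samples can be reweighted toward the uniform distribution on allowable permutations.

```python
import numpy as np

def sis_permanent(A, n_samples=10000, rng=None):
    """Sequential importance sampling for permutations with restricted positions.

    A[i, j] = 1 means position i may receive element j.  Each trial fills the
    positions row by row, choosing uniformly among the still-allowed columns;
    the product of the number of choices at each step is an unbiased weight
    for the permanent (dead ends contribute weight 0).
    """
    rng = np.random.default_rng(rng)
    A = np.asarray(A, dtype=bool)
    n = A.shape[0]
    weights = np.empty(n_samples)
    for s in range(n_samples):
        used = np.zeros(n, dtype=bool)
        w = 1.0
        for i in range(n):
            allowed = np.flatnonzero(A[i] & ~used)
            if allowed.size == 0:       # dead end: no allowable completion
                w = 0.0
                break
            w *= allowed.size           # inverse of the proposal probability
            used[rng.choice(allowed)] = True
        weights[s] = w
    return weights.mean(), weights      # permanent estimate and raw weights

# Example: 4x4 permutations avoiding the diagonal (derangements of 4, i.e. 9)
A = 1 - np.eye(4, dtype=int)
est, w = sis_permanent(A, n_samples=20000, rng=0)
print(est)   # should be close to 9
```

For the permutation test itself, the weights would be used in self-normalized importance sampling averages of the test statistic rather than in the simple mean above.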
Speaker: Yuguo Chen
Title: Sequential Monte Carlo Methods for Filtering and Smoothing in Hidden Markov Models
This talk provides a general theory of sequential Monte Carlo methods to tackle the long-standing problem of optimal filtering and smoothing in hidden Markov models for general state spaces. The theory addresses two basic issues concerning sequential Monte Carlo filters, namely, how the proposal distribution for importance sampling should be chosen, and why and when resampling should be carried out. We also show how sequential Monte Carlo methods can be applied to smoothing problems by combining forward and backward filters. The methodology developed is applied in particular to estimation of parameters that may undergo occasional changes at unknown times, for which it is shown that the methodology indeed yields relatively fast and accurate simulation-based procedures to compute the Bayes estimates of the time-varying parameters. The methodology is illustrated with real data in an application to DNA sequence analysis.
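As a concrete, deliberately simple illustration of the two basic issues, here is a bootstrap particle filter for a toy linear Gaussian state-space model: it uses the state-transition prior as the proposal and resamples only when the effective sample size drops, which is exactly the kind of design choice the theory addresses. The model and thresholds are assumptions for the sketch, not the talk's examples.

```python
import numpy as np

def bootstrap_filter(y, n_particles=1000, phi=0.95, sigma_x=1.0, sigma_y=1.0,
                     ess_threshold=0.5, rng=None):
    """Bootstrap particle filter for the toy state-space model
        x_t = phi * x_{t-1} + N(0, sigma_x^2),   y_t = x_t + N(0, sigma_y^2).

    The proposal is the state-transition prior; resampling is triggered only
    when the effective sample size falls below ess_threshold * n_particles.
    """
    rng = np.random.default_rng(rng)
    T = len(y)
    x = rng.normal(0.0, sigma_x / np.sqrt(1 - phi**2), n_particles)  # stationary init
    logw = np.zeros(n_particles)
    means = np.empty(T)
    for t in range(T):
        x = phi * x + rng.normal(0.0, sigma_x, n_particles)          # propagate
        logw += -0.5 * ((y[t] - x) / sigma_y) ** 2                   # reweight
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[t] = np.sum(w * x)                                     # filtering mean
        ess = 1.0 / np.sum(w**2)
        if ess < ess_threshold * n_particles:                        # resample
            idx = rng.choice(n_particles, size=n_particles, p=w)
            x, logw = x[idx], np.zeros(n_particles)
    return means

# Simulate data from the model and run the filter
rng = np.random.default_rng(1)
x_true = np.zeros(100)
for t in range(1, 100):
    x_true[t] = 0.95 * x_true[t - 1] + rng.normal()
y = x_true + rng.normal(size=100)
print(bootstrap_filter(y, rng=2)[:5])
```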
Speaker: Beatrix Jones, Scott Nichols
Title: Using Abundance Measurements to Improve Bycatch Estimates
In our problem, "bycatch" is the unintentional catching of fish by shrimping boats in the Gulf of Mexico. The National Marine Fisheries Service studies bycatch via an observer program, where observers ride along on shrimp boats and examine the catch to see how many fish of various species are included. While there are about 4000 observations from this program, the data are quite noisy and are spread over a large area (most of the Gulf Coast) and a long time period (1972-1998). NMFS also has separate studies that assess the abundance of different fish species in the same region over this time period, consisting of about 10000 measurements. The process of taking the abundance measurements is similar to the shrimping process--a net is sent down and what comes up is counted. Our goal is to explore the relationship between the abundance and bycatch measurements, and use any similarity to improve our idea of the total annual bycatch. This problem presents interesting statistical challenges because the measurements don't follow a nice distribution and have spatial and temporal structure.
Speaker: Xiaodong Lin
Title: Constrained Mixture of Factor Analyzers for Simultaneous Dimension Reduction and Clustering
High dimensional data are regularly generated from various sources. Traditional data analysis performs dimension reduction and clustering separately. In this talk we address the problem of simultaneous clustering and dimension reduction with the constrained mixture of factor analyzers. A constrained EM algorithm is used for parameter estimation. Under our model the clusters are allowed to adapt to different factor sizes, and exhaustive model search becomes infeasible. A two-step model selection procedure is then proposed to select the optimal dimensionality for each cluster. Several high dimensional data sets are studied using our procedure, and comparisons will be made. This is joint work with Michael Zhu.
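For orientation, here is a minimal sketch of the mixture-of-factor-analyzers likelihood that a constrained EM algorithm would maximize: each cluster k has its own loading matrix with q_k columns, so clusters can use different factor dimensions. The constraints and the two-step selection procedure themselves are not reproduced here; the sketch only fixes the model.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mfa_loglik(X, weights, means, loadings, psis):
    """Log-likelihood of a mixture of factor analyzers.

    Component k models the data as N(means[k], loadings[k] @ loadings[k].T +
    np.diag(psis[k])), so each cluster may use a different number of factors
    (loadings[k] has shape (p, q_k)).
    """
    dens = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=L @ L.T + np.diag(psi))
        for w, m, L, psi in zip(weights, means, loadings, psis)
    ])
    return np.log(dens.sum(axis=1)).sum()

# Toy example: two clusters in 5 dimensions with 2 and 1 factors respectively
rng = np.random.default_rng(0)
p = 5
loadings = [rng.normal(size=(p, 2)), rng.normal(size=(p, 1))]
psis = [np.full(p, 0.5), np.full(p, 0.5)]
means = [np.zeros(p), np.full(p, 3.0)]
X = np.vstack([
    rng.multivariate_normal(means[k], loadings[k] @ loadings[k].T + np.diag(psis[k]), 100)
    for k in range(2)
])
print(mfa_loglik(X, [0.5, 0.5], means, loadings, psis))
```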
Speaker: Mark Huber
Title: Simulating the Birds and the Bees: Direct Sampling for Hardy-Weinberg Proportions
Under conditions of random mating, no selection, and no mutation, a species will reach Hardy-Weinberg equilibrium. Biologists can only take a finite sample from the entire population; the goal is then to determine whether the population has reached this equilibrium. Various statistical tests exist for this purpose, and the best ones require the ability to simulate a random genetic makeup for the sample. Previous methods required exponential time to simulate a sample; here I'll present a new method that needs only linear time.
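One standard way to simulate a random genetic makeup under the Hardy-Weinberg null, conditional on the sample's observed allele counts, is simply to re-pair the pooled alleles at random. The Python sketch below illustrates this construction (it is not necessarily the new algorithm of the talk).

```python
import numpy as np
from collections import Counter

def shuffle_genotypes(genotype_counts, rng=None):
    """Draw genotype counts from the Hardy-Weinberg null distribution,
    conditional on the observed allele counts, by re-pairing shuffled alleles.

    genotype_counts maps a pair of allele labels, e.g. ('A', 'a'), to the
    number of individuals with that genotype.
    """
    rng = np.random.default_rng(rng)
    alleles = []
    for (a1, a2), n in genotype_counts.items():
        alleles += [a1, a2] * n              # pool the 2n allele copies
    alleles = np.array(alleles)
    rng.shuffle(alleles)                     # random mating = random re-pairing
    pairs = zip(alleles[0::2], alleles[1::2])
    return Counter(tuple(sorted(p)) for p in pairs)

# Example: 30 AA, 20 Aa, 50 aa individuals
observed = {('A', 'A'): 30, ('A', 'a'): 20, ('a', 'a'): 50}
print(shuffle_genotypes(observed, rng=0))
```

Repeating the shuffle and comparing a test statistic on the resampled genotype counts to its observed value gives a Monte Carlo test of Hardy-Weinberg proportions.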
Speaker: Fabio Rigat
Title: Bayesian Treed Exponential Survival Models
Bayesian inference for right censored time-to-event data is typically carried out by fitting a parametric or nonparametric model for the survival distribution through MCMC methods given a fixed set of covariates. Treed survival distributions are finite mixtures based on binary recursive partitions of the covariate space. In the talk I will illustrate two model search strategies for finding high posterior probability tree models when the leaf distribution is exponential. The strength of this approach is that the variable selection problem and the data fitting/prediction problem are solved jointly.
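To see why exponential leaves make posterior tree search convenient, note that each leaf contributes a closed-form marginal likelihood under a conjugate prior. A sketch, assuming a Gamma(a, b) prior on the leaf hazard rate (the abstract does not specify the prior):

\[
p(\text{leaf data}) \;=\; \int_0^\infty \lambda^{d} e^{-\lambda T}\,\frac{b^{a}}{\Gamma(a)}\,\lambda^{a-1}e^{-b\lambda}\,d\lambda
\;=\; \frac{b^{a}}{\Gamma(a)}\cdot\frac{\Gamma(a+d)}{(b+T)^{a+d}},
\]

where d is the number of uncensored events in the leaf and T is the total follow-up time (event and censored times combined). Up to the tree prior, the product of these leaf marginals is the quantity a stochastic model search would compare across candidate trees.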
Speaker: Christopher Hans
Title: (Recent Advances in) Gaussian Graphical Model Determination
The use of undirected Gaussian graphical models as a tool for describing conditional independence structure in high dimensional datasets has become more feasible over the last several years. In this talk I first give an introduction to Gaussian graphical models, describing how graphs can be used to localize computation for large models. Secondly I will review recent advances in stochastic computation for high dimensional graphical models which were developed during the SAMSI Stochastic Computation program. Examples will be given to demonstrate the effectiveness of these methods and to highlight their use in gene expression experiments.
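To make the conditional-independence reading of a graph concrete, here is a tiny numerical sketch (a toy three-variable chain, not one of the talk's examples): a missing edge corresponds to a zero entry in the precision matrix, equivalently a zero partial correlation.

```python
import numpy as np

# In a Gaussian graphical model, variables i and j share an edge exactly when
# the (i, j) entry of the precision (inverse covariance) matrix is nonzero,
# i.e. when they are conditionally dependent given all remaining variables.
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])   # covariance of a chain: X1 indep. X3 given X2
Omega = np.linalg.inv(Sigma)           # precision matrix

# Partial correlations: rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)
d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 3))       # the (1, 3) entry is (numerically) zero
```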
Speaker: Feng Liang
Title: Regularization and Shrinkage
The problem of approximating a function from finite observations is ill-posed, and a classical way to solve it is through regularization. Regularization theory formulates the estimation problem as a variational problem of finding the function (or curve) that minimizes a certain functional. A regularization (or smoothing) parameter is introduced to achieve the trade-off between fitting error and penalty, and its value is often chosen by cross-validation. In many applications, such a problem can be formulated as a mean vector estimation problem for the Gaussian sequence model in a bounded space. Historical background on parameter estimation in the Gaussian sequence model (the shrinkage phenomenon) will be reviewed, and possible future research directions for both parameter and density estimation will be discussed.
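For reference, a sketch of the two objects mentioned above, assuming a generic penalty functional J(f) (e.g., a squared smoothness or RKHS norm), which the abstract leaves unspecified. The variational problem is

\[
\hat f \;=\; \arg\min_{f} \; \sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2 \;+\; \lambda\,J(f),
\]

and in the Gaussian sequence model \(y_i = \theta_i + \varepsilon_i\) with \(\varepsilon_i \sim N(0,\sigma^2)\), the classical shrinkage phenomenon is exemplified by the James-Stein estimator

\[
\hat\theta^{\mathrm{JS}} \;=\; \Bigl(1 - \frac{(n-2)\,\sigma^2}{\|y\|^2}\Bigr)\, y,
\]

which has smaller total squared-error risk than the raw observations whenever \(n \ge 3\).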
Speaker: Daohai Yu
Title: Analysis of Interval-Censored Time-to-Event Data with a Marked Observation Process
The random censorship assumption is often violated in interval-censored data, such as when the data arise from serial screening and the timing of screening could potentially depend on the patient's health status. We propose a new likelihood-based approach coupled with a class of innovative conditional models for dependent interval-censored time-to-event data with a marker for discretionary visits. It generalizes current analysis methods that assume independent censoring, such as Turnbull's estimator for interval-censored data. The proposed model for interval-censored data is conditional on the marker history. It accounts for both regularly scheduled visits and visits whose timing is motivated by patient status, and hence allows for dependent censoring. Right-censored data can be imputed from this conditional model. Marginal inferences can then be derived through marginal modeling using standard right-censored survival data methods and multiple imputation based on the conditional model results.
In this talk, we present the approach with non-parametric marginal models. The relationship between the dependent censoring considered here and coarsening at random (CAR) is examined, and some key results are established. Simulation results reveal that our method has high power for detecting a departure from independent censoring. It also gives less biased estimates than approaches that ignore the dependence of the censoring, without sacrificing efficiency. Finally, the approach is illustrated through an application to a data set from a study of nursing home residents who were followed for two years. The results provide evidence that the interval censoring depends on the outcome. It was also clear from the results that current analysis methods assuming independent censoring would yield bumpy and biased estimates of the survival function. Thus our approach is superior to the alternatives in this case.
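As a schematic of the impute-then-analyze step described above, the Python sketch below does no imputation itself (the conditional model is the subject of the talk); it only shows how right-censored datasets imputed from such a model could be analyzed and combined with Rubin's rules. The exponential within-imputation fit is a stand-in for whatever standard right-censored method is actually used, and the toy data are purely hypothetical.

```python
import numpy as np

def combine_exponential_mi(imputed_datasets):
    """Rubin-style combination of a simple parametric survival analysis across
    multiply imputed right-censored datasets.

    Each imputed dataset is a pair (times, events) of arrays, where events[i]
    is 1 for an observed failure and 0 for a right-censored time.  The
    within-imputation analysis is an exponential fit (hazard = events/exposure).
    """
    rates, variances = [], []
    for times, events in imputed_datasets:
        d, T = events.sum(), times.sum()
        rate = d / T
        rates.append(rate)
        variances.append(rate**2 / d)         # asymptotic variance of the MLE
    m = len(rates)
    q_bar = np.mean(rates)                    # combined point estimate
    u_bar = np.mean(variances)                # within-imputation variance
    b = np.var(rates, ddof=1)                 # between-imputation variance
    total_var = u_bar + (1 + 1/m) * b         # Rubin's total variance
    return q_bar, total_var

# Toy usage with hypothetical imputed right-censored datasets
rng = np.random.default_rng(0)
imputations = []
for _ in range(5):
    t = rng.exponential(2.0, 200)
    c = rng.exponential(4.0, 200)
    imputations.append((np.minimum(t, c), (t <= c).astype(int)))
print(combine_exponential_mi(imputations))
```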
Speaker: Christine Kohnen
Title: A Primer on Multiple Imputation
One of the ways of handling nonresponse in surveys is the use of multiple imputation, introduced by Rubin in the late 1970s. In this talk I will first give an introduction and motivation for the use of multiple imputation. Then I will go over the derivations for a finite number of imputations in the normal case and briefly mention the checks available for showing that multiple imputation inferences are valid from a frequentist perspective. I will conclude with some of the current issues concerning multiple imputation.
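For reference, Rubin's combining rules, which the normal-case derivations lead to: with m imputations, complete-data estimates \(\hat Q_i\) of a scalar estimand Q, and complete-data variance estimates \(\hat U_i\),

\[
\bar Q_m = \frac{1}{m}\sum_{i=1}^{m}\hat Q_i,\qquad
\bar U_m = \frac{1}{m}\sum_{i=1}^{m}\hat U_i,\qquad
B_m = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat Q_i-\bar Q_m\bigr)^2,
\]
\[
T_m = \bar U_m + \Bigl(1+\frac{1}{m}\Bigr)B_m,\qquad
\frac{\bar Q_m - Q}{\sqrt{T_m}} \;\approx\; t_{\nu},\qquad
\nu = (m-1)\Bigl(1+\frac{\bar U_m}{(1+1/m)\,B_m}\Bigr)^{2}.
\]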
Speaker: David Banks
Title: Statistical Issues in Counterterrorism
This talk will describe recent efforts by the statistical community to contribute to the federal initiatives against terrorism. Apart from an overview, the substance of the talk will focus upon two areas in which there seem to be substantive research questions: the application of game theory and portfolio analysis to threat preparedness, and the use of high-dimensional nonparametric regression to estimate distance functions pertinent to biometric identification.
Speaker: Jerry Reiter
Title: Research on Releasing Simulated, Multiply-Imputed Data to Protect Confidentiality
When releasing data to the public, statistical agencies seek to provide detailed data while limiting disclosures of respondents' information. Typical strategies for disclosure limitation include recoding variables, swapping data, or adding random noise to data values. However, these methods can distort relationships among variables in the dataset and complicate users' analyses. An alternative approach is to release multiply-imputed, synthetic microdata, as suggested by Rubin (1993) and Little (1993). This approach is attracting much interest from agencies, and the statistical research on this topic is in its initial stages. In this talk, I present my current research on this topic. The talk will have some theoretical results, but the results can be easily comprehended by those with knowledge of introductory Bayesian statistics.
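To give a feel for the mechanism, the sketch below generates m fully synthetic copies of a single continuous variable by drawing from the posterior predictive distribution of a normal model under the standard noninformative prior. An agency would of course use far richer imputation models; the variable and all names here are purely illustrative, not the methods of the talk.

```python
import numpy as np

def synthetic_copies(y, m=5, n_syn=None, rng=None):
    """Generate m fully synthetic copies of a continuous variable by drawing
    from the posterior predictive of a normal model with prior
    p(mu, sigma^2) proportional to 1/sigma^2.

    A toy stand-in for a real synthesizer: it illustrates the release
    mechanism, not a production imputation model.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    n = y.size
    n_syn = n if n_syn is None else n_syn
    ybar, s2 = y.mean(), y.var(ddof=1)
    copies = []
    for _ in range(m):
        sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)            # draw sigma^2 | y
        mu = rng.normal(ybar, np.sqrt(sigma2 / n))              # draw mu | sigma^2, y
        copies.append(rng.normal(mu, np.sqrt(sigma2), n_syn))   # synthetic values
    return copies

# Toy usage: release 5 synthetic versions of a confidential income-like variable
confidential = np.random.default_rng(3).lognormal(10, 0.5, 500)
released = synthetic_copies(np.log(confidential), m=5)
print([round(c.mean(), 3) for c in released])
```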