STA 395 Readings in Statistical Science




Syllabus


Fall 2004 Meeting Times:

Speakers

August 30 "Neighbor-Joining with Subtree Weights."
Ruriko Yoshida, Duke University
Abstract: The Neighbor-Joining algorithm is a recursive procedure for reconstructing phylogenetic trees that is based on a transformation of pairwise distances between leaves for identifying cherries in the tree (two nodes are a cherry if there is exactly one intermediate vertex on the path between them). We show that estimates of the weights of m-leaf subtrees are more accurate than pairwise distances, and derive a generalization of the cherry-picking criterion which uses such weights. This leads to an improved neighbor-joining algorithm whose total running time is still polynomial in the number of taxa. In the first talk, I will review the Neighbor-Joining algorithm with pairwise distances, introduced by Saitou and Nei in 1987; in the second talk, I will describe the Neighbor-Joining algorithm with the weights of m-leaf subtrees and a generalization of the cherry-picking criterion. This is joint work with Dan Levy and Lior Pachter. Readings: Saitou/Nei, Studier/Keppler, Case Frog Data
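The classical cherry-picking step can be stated compactly. As a minimal illustration (classical Saitou-Nei selection in its Studier-Keppler form, not the authors' generalized criterion), the pair (i, j) minimizing Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) is joined; a Python sketch, assuming a symmetric pairwise distance matrix D:

    # Cherry picking in classical Neighbor-Joining (Studier-Keppler form).
    # Illustrative sketch only; the talk's generalization replaces pairwise
    # distances with weights of m-leaf subtrees.
    import numpy as np

    def pick_cherry(D):
        """Return the pair (i, j) minimizing the Q-criterion."""
        D = np.asarray(D, dtype=float)
        n = D.shape[0]
        r = D.sum(axis=1)                          # row sums of distances
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)                # exclude i == j
        return np.unravel_index(np.argmin(Q), Q.shape)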
September 6 (continued)
Ruriko Yoshida, Duke University
September 13
SAMSI workshop (optional)
September 20 "Can Male Baboons Recognize Their Own Children?"
Floyd Bullard, Duke University
Abstract: Baboons live together in communities and both males and females are promiscuous, making it uncertain whether baboon males can recognize their own offspring. When a young baboon cries for an adult's help, its father is more likely to come to its aid than you'd expect if males helped entirely at random, but a number of confounding factors still make it possible that he is ignorant of his own paternity. We examine a multinomial logit model that predicts the probability that a male will assist a young baboon in need, and conclude that this assisting behavior adds substantially to the evidence that baboons are aware of paternal as well as maternal relationships.
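As background, a generic multinomial logit for "which male assists" can be sketched as a softmax over candidate males; the covariates and coefficients below are hypothetical, not the study's:

    # Hypothetical multinomial-logit sketch: probability that each candidate
    # male assists, given covariates such as paternity and proximity.
    import numpy as np

    def assist_probabilities(X, beta):
        eta = X @ beta                 # one linear predictor per male
        eta -= eta.max()               # subtract max for numerical stability
        p = np.exp(eta)
        return p / p.sum()

    X = np.array([[1.0, 0.2],          # [is_father, proximity], hypothetical
                  [0.0, 0.9],
                  [0.0, 0.1]])
    print(assist_probabilities(X, np.array([1.5, 0.8])))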
September 27 "What Am I Going to Do When I Grow Up: Careers After Duke Statistics."
Jerry Reiter, Duke University
Abstract: In this talk, we discuss some of the options available to PhD students from Duke Statistics. We describe the general process of applications to academic jobs, including writing CVs, statements of research and teaching, and interviewing. The talk will be informal, allowing ample time for questions.
October 4 "Posterior Inference without a Likelihood."
Ana Grohovac, Duke University
Abstract: Oceanographers are interested in studying decadal variability in the ocean's heat content through the depth of the mixed layer M. This is the top layer of the ocean, where waters are mixed by turbulence and convection, creating a layer of nearly uniform temperature. It evolves with the annual cycle of surface temperatures and is spatially correlated. The statistician's goal is to obtain a model P(M, data). When asked, the expert is able to evaluate her posterior belief for any single profile, but the sampling distribution is not known. We propose to find a functional form g(M | data) that, under a uniform prior, yields posterior uncertainty approximating that of the expert. Later, g is in turn used as a likelihood in conjunction with other priors to model spatio-temporal dependence, since temperatures are measured at equal distances. The function g is not a product of known density functions and is used as an approximate likelihood. In fact, g is the distribution of two pivotal variables, delta1 and delta2, which are functions of both the data and M. In this talk I would like to give an alternative explanation of how the joint model P(M, data) in R^{n+1} may arise from the joint model P(delta1, delta2, h(data)) on a transformed space in R^{n+1}.
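The final step is, generically, a change of variables. A minimal sketch of the identity involved, assuming (only for illustration) that the map T(M, data) = (delta1, delta2, h(data)) is a smooth bijection on R^{n+1}:

    P(M, data) = P(delta1, delta2, h(data)) |det J_T(M, data)|,

where J_T is the Jacobian of T; a joint model on the transformed space thus induces a joint model on the original one.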
October 11
Holiday
October 18 "Exploring two "omics" data sets with classification trees."
Susan Simmons, UNC Wilmington
Abstract: The post-genomic era is witnessing a tremendous increase in the number of studies involving the "omics" sciences. Researchers are interested in obtaining information from areas such as genomics, proteomics, transcriptomics, toxicogenomics, and metabolomics. This presentation will illustrate the use of classification trees via two analyses of "omics" data sets in which a tree structure is used for classification. First, a metabolomics data set with 105 metabolites will be explored to learn which metabolites are important for classifying individuals as diseased versus nondiseased. The second analysis involves a genomic data set in which biomarkers are identified for predicting the location of a quantitative trait locus (QTL).
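A generic classification-tree sketch on simulated data of this shape (not the speaker's analysis; features and labels are synthetic), using scikit-learn:

    # Tree-based classification of disease status from 105 metabolite
    # features; data are simulated for illustration only.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 105))                   # 200 subjects, 105 metabolites
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # status driven by 2 features

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    top = np.argsort(tree.feature_importances_)[::-1][:5]
    print("most informative metabolite indices:", top)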
October 25 "Statistical Learning and Applications in Computational Biology."
Sayan Mukherjee, Duke University
Abstract: A brief introduction to statistical learning theory will be followed by a description of some problems of interest and open questions related to these problems.

The introduction will consist of the formulation of the problem and a description of classical notions such as generalization, universal consistency, Vapnik-Chervonenkis dimension, and uniform Glivenko-Cantelli classes. Recently developed relations between the stability of learning algorithms (for example, error minimization or Tikhonov regularization) and generalization will be described. Concentration inequalities such as the Efron-Stein inequality or McDiarmid's inequality are important in the stability results; however, for this application an inequality needs to be generalized to handle cases in which the martingale difference is not uniformly small but is small with high probability. Isoperimetry can be used to get a tight bound on the size of the bad set, the set where the martingale difference is not small.

Some computational/statistical problems in the analysis of gene expression data will be discussed:
1) classifying binary and multiclass expression data using Support Vector Machines (a minimal sketch follows this list)
2) feature/gene selection algorithms
3) finding pathways/gene sets enriched in expression data.
Biological contexts for the above problems, as well as limitations of some current techniques, will be described.
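As a minimal illustration of item 1 (simulated data, not the speaker's), a linear-kernel SVM on expression-like features:

    # SVM classification of simulated expression profiles; illustrative only.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 500))                   # 100 samples, 500 genes
    y = (X[:, :10].sum(axis=1) > 0).astype(int)       # labels driven by 10 genes

    clf = SVC(kernel="linear").fit(X[:80], y[:80])    # train on first 80 samples
    print("held-out accuracy:", clf.score(X[80:], y[80:]))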

November 1 "Assessing Predictor Importance in GLMMs."
Dawn Banard, Duke University
Abstract: In generalized linear mixed models the normal distribution does not necessarily provide an increasingly good approximation to the likelihood as the size of the data set increases, since the number of random effects may also increase. This means that even in large data sets the chi-squared approximation to the deviance may be quite poor. Therefore the p-value obtained using this approximation may be a misleading way to assess whether a predictor can be removed from the model. I will consider alternative criteria for assessing the importance of an arbitrary predictor. The Bayes factor and posterior odds do not rely on an asymptotic normal approximation to the likelihood, so I will focus on these criteria. I will compare the likely computational burden of the various methods for estimating these two quantities.
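For reference, the two criteria are linked by the standard identity (textbook background, not specific to the talk): for models M1 and M2 with data y,

    posterior odds = [P(y | M1) / P(y | M2)] x [P(M1) / P(M2)],

where the bracketed ratio of marginal likelihoods is the Bayes factor. In a GLMM each P(y | Mk) requires integrating the likelihood over both the parameters' prior and the random effects, which is the source of the computational burden being compared.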
November 8 "Generalized Spatial Dirichlet Process Model."
Jason Duan, Duke University
and Michele Guindani and Alan Gelfand
Abstract: We propose a novel spatial process mixture model that is marginally Dirichlet at any location and spatially dependent. In this presentation, we first develop a generalized spatial Dirichlet process model for spatial data and discuss its properties. Owing to the discreteness of Dirichlet process models, we mix this process with a pure error process. Bayesian posterior inference is implemented using Gibbs sampling. Spatial prediction under this model is discussed.
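For background, the ordinary (non-spatial) Dirichlet process admits a stick-breaking representation, sketched below by truncation; this is standard DP machinery, not the generalized spatial construction of the talk:

    # Truncated stick-breaking draw from DP(alpha, G0) with G0 = N(0, 1).
    import numpy as np

    def dp_stick_breaking(alpha, n_atoms, rng):
        v = rng.beta(1.0, alpha, size=n_atoms)                    # stick proportions
        w = v * np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))   # mixture weights
        theta = rng.normal(size=n_atoms)                          # atoms from G0
        return w, theta

    w, theta = dp_stick_breaking(1.0, 50, np.random.default_rng(2))
    print(w[:5].round(3), theta[:5].round(3))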
November 15 " Preserving Data Confidentiality While Encouraging Data Sharing."
Christine Kohnen, Duke University
Abstract: Data disseminators, in particular statistical agencies, must balance providing users of public-use data with enough information for inference against protecting the confidentiality of their respondents. Because of the potential consequences of data disclosures, agencies operate conservatively with regard to data protection and disclosure risk. Concerns about disclosure limitation not only restrict the information agencies can release for public use, but also what they can share with other agencies. Even in situations where sharing data would be mutually beneficial to the participating agencies, and in turn to the potential users of public-use data, data sharing is at best limited. One approach agencies can use to share their data safely, and create public-use data in the process, is to exchange synthetic data rather than real data. I will discuss the use of multiple imputation methods as a means of sharing confidential data among statistical agencies. In addition, inferential methods for combining multiple data sets will be presented, supported by simulation studies, along with some preliminary results using real data.
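As background, Rubin's standard multiple-imputation combining rules for m completed data sets are as follows (synthetic-data approaches modify the variance formula): with point estimates q(l) and variance estimates u(l) from data set l,

    qbar = (1/m) sum_l q(l),   ubar = (1/m) sum_l u(l),
    b = (1/(m-1)) sum_l (q(l) - qbar)^2,
    T = ubar + (1 + 1/m) b,

where qbar is the combined point estimate and T its estimated variance.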
November 22 "U-Control Charts revisited."
Susie Bayarri, Duke/SAMSI/University of Valencia
Abstract: U-control charts are simple graphical tools widely used for monitoring whether a production process is in or out of control. In spite of their indiscriminate use, they present serious limitations which are not usually taken into account. One severe limitation is the (often inadequate) assumption of Poisson counts; more importantly, they require a (questionable) 'base period', used for estimation, in which the process is assumed to be under control. During this period (usually quite long) no action can be taken. In this talk, we present interesting (Bayesian) alternatives to the usual u-control charts. We use empirical Bayes and full Bayes methods and compare them with the traditional frequentist implementation. Empirical Bayes methods are somewhat easier to implement and deal nicely with extra-Poisson variability (while at the same time informally checking the adequacy of the Poisson assumption). However, they still need the questionable base period. The sequential, full Bayes approach, on the other hand, also avoids this drawback of traditional u-charts. The implementation requires numerical simulation, as well as a prior distribution. Several possibilities for both objective and informative priors are explored. We will argue that the sequential, full Bayesian u-control chart is a powerful and versatile tool for process monitoring.
This is joint work with G. García-Donato, University of Castilla-La Mancha.
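For comparison, the traditional frequentist u-chart limits that the talk revisits take the following standard form; the sketch assumes counts c_i from samples of size n_i and the Poisson model in question:

    # Standard u-chart: plot u_i = c_i / n_i against 3-sigma limits
    # derived under the Poisson assumption the talk calls into question.
    import numpy as np

    def u_chart_limits(counts, sizes):
        counts = np.asarray(counts, dtype=float)
        sizes = np.asarray(sizes, dtype=float)
        ubar = counts.sum() / sizes.sum()            # pooled rate from base period
        ucl = ubar + 3 * np.sqrt(ubar / sizes)       # upper control limit
        lcl = np.maximum(ubar - 3 * np.sqrt(ubar / sizes), 0.0)
        return ubar, lcl, ucl

    print(u_chart_limits([12, 15, 8, 20], [100, 120, 90, 140]))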

November 29 "An Introduction to Parallel Computing @ Duke Statistics"
Chris Hans, Duke University
Abstract: It is anticipated that parallel computing resources will be increasingly available to Duke Statistics students and affiliates during the next few years. In order to facilitate use of these resources, I will present an introductory tutorial on parallel computing, drawing on my experiences from the last two years. I plan to cover: (i) basic structure and the "big picture" of distributed computing, (ii) example code for simple parallel computing problems and (iii) a description of how to run parallel code on Duke's CSEM Cluster (the cluster for which access is expected to increase over the next few years). This tutorial is primarily intended for people with little or no knowledge of parallel computing and will not address any advanced distributed computing concepts or theory.
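As a flavor of item (ii), a minimal message-passing example sketched with the mpi4py binding (an assumption made here for illustration; the tutorial's own examples and the CSEM cluster setup may differ):

    # Scatter one task to each process and gather the results at rank 0.
    # Run with, e.g.: mpiexec -n 4 python hello_mpi.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()                 # this process's id
    size = comm.Get_size()                 # total number of processes

    data = [i * i for i in range(size)] if rank == 0 else None
    local = comm.scatter(data, root=0)     # each rank receives one element
    result = comm.gather(local + rank, root=0)
    if rank == 0:
        print("gathered:", result)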
Discussion Papers:

Questions: Woncheol Jang, Ian H. Dinwoodie