STA 395 Readings in Statistical Science
Syllabus
Fall 2004
Meeting Times:
- M: 4:25-5:40 Old Chem 101
- F: 3:30-4:30 Old Chem 116
Speakers
- August 30 "Neighbor-Joining with Subtree Weights."
- Ruriko Yoshida, Duke University
- Abstract:
The Neighbor-Joining algorithm is a recursive procedure for
reconstructing phylogenetic trees that is based on a transformation of
pairwise distances between leaves for
identifying cherries in the tree (two nodes are a cherry if there is
exactly one intermediate vertex on the path between them). We show that
estimates of the weights of m-leaf subtrees are more accurate than
pairwise distances, and derive a generalization of the cherry picking
criterion which uses such weights. This leads to an improved
neighbor-joining algorithm whose total running time is still polynomial
in the number of taxa. In the first talk, I will review the
Neighbor-Joining algorithm with pairwise distances, introduced by
Saitou and Nei in 1987, and in the second talk, I will describe the
Neighbor-Joining algorithm with the weights of m-leaf subtrees and a
generalization of the cherry-picking criterion.
This is joint work with Dan Levy and Lior Pachter.
References: Saitou/Nei, Studier/Keppler. Case study: frog data.
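The cherry-picking step of Neighbor-Joining can be sketched concretely. The snippet below implements the Saitou–Nei criterion in its Studier–Keppler form (minimize Q(i, j) = (n - 2)·D[i][j] - R[i] - R[j], where R[i] is a row sum of the distance matrix); the 4-taxon distance matrix is a made-up additive example, not data from the talk.

```python
def cherry_pick(D):
    """Return the pair (i, j) minimizing the Neighbor-Joining criterion
    Q(i, j) = (n - 2) * D[i][j] - R[i] - R[j],
    where R[i] is the i-th row sum of the distance matrix D."""
    n = len(D)
    R = [sum(row) for row in D]
    best, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - R[i] - R[j]
            if q < best_q:
                best, best_q = (i, j), q
    return best, best_q

# Distances from the additive tree ((A,B),(C,D)) with unit branch lengths;
# (A, B) and (C, D) are the cherries.
D = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]

pair, q = cherry_pick(D)
```

On this matrix both cherries tie at Q = -12, and the scan returns the first, (A, B); a full NJ implementation would then join that pair and recurse on the reduced matrix.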
- September 6 (continued)
- Ruriko Yoshida, Duke University
- September 13
- SAMSI workshop (optional)
- September 20 "Can Male Baboons Recognize Their Own
Children?"
- Floyd Bullard, Duke University
- Abstract: Baboons live together in communities, and both males and
females are promiscuous, making it uncertain whether baboon males can
recognize their own offspring. When a young baboon cries for an
adult's help, its father is more likely to come to its aid than you'd
expect if males helped entirely at random, but a number of confounding
factors still make it possible that he is ignorant of his own
paternity. We examine a multinomial logit model that predicts the
probability that a male will assist a young baboon in need, and
conclude that this assisting behavior adds substantially to the
evidence that baboons are aware of paternal as well as maternal
relationships.
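The abstract does not give the model's covariates or coefficients, so the following is only a generic sketch of how multinomial-logit choice probabilities are computed; the feature names and beta values are hypothetical.

```python
import math

def mlogit_probs(x, betas):
    """Multinomial logit: P(choice j | x) = exp(x . beta_j) / sum_k exp(x . beta_k)."""
    scores = [sum(xi * bi for xi, bi in zip(x, b)) for b in betas]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical features: (intercept, is_father, distance_to_juvenile)
x = (1.0, 1.0, 0.5)
betas = [(0.0, 0.0, 0.0),    # baseline category: do not assist
         (-1.0, 2.0, -0.8)]  # assist
p = mlogit_probs(x, betas)
```

With two categories this reduces to ordinary logistic regression; the multinomial form matters when several adult males (or outcomes) compete for the same event.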
- September 27 "What Am I Going to Do When I Grow Up: Careers
After Duke Statistics."
- Jerry Reiter, Duke University
- Abstract: In this talk, we discuss some of the options available
to
PhD students from Duke Statistics. We describe the general process of applications
to academic jobs, including writing CVs, statements of research and
teaching, and interviewing. The talk will be informal, allowing ample
time for questions.
- October 4 "Posterior Inference without a Likelihood."
- Ana Grohovac, Duke University
- Abstract: Oceanographers are interested in studying decadal
variability in the ocean's heat content through the depth of the
mixed layer M. This is the top layer of the ocean, where the waters
are mixed by turbulence and convection, creating an ocean layer of
nearly uniform temperature. It evolves with the annual cycle of
surface temperatures and is spatially correlated. The statistician's
goal is to acquire a model P(M, data). When asked, the expert is able
to evaluate her posterior belief for any single profile, but the
sampling distribution is not known. We propose to find a functional
form g(M | data) that yields posterior uncertainty approximating that
of the expert under the uniform prior. Later, g is in turn used as a
likelihood in conjunction with other priors to model spatio-temporal
dependence, since temperatures are measured at equal distances.
The function g is not a product of known density functions and is
used as an approximate likelihood. In fact, g is a distribution of
two pivotal variables, delta1 and delta2, which are functions of both
the data and M. In this talk I would like to give an alternative
explanation of how the joint model P(M, data) in R^(n+1) may arise
from the joint model P(delta1, delta2, h(data)) on a transformed
space in R^(n+1).
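The talk's pivots delta1 and delta2 are not specified, but the idea of using the distribution of a pivotal quantity as an approximate likelihood has a familiar analogue: the density of the Student-t pivot t = (ybar - mu) / (s / sqrt(n)), viewed as a function of mu. The sketch below (with made-up summary statistics) shows that this pivot-based likelihood peaks at mu = ybar, as one would want.

```python
import math

def t_density(t, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def pivot_likelihood(mu, ybar, s, n):
    """Approximate likelihood in mu: density of the t pivot at mu."""
    return t_density((ybar - mu) / (s / math.sqrt(n)), n - 1)

# Hypothetical summary statistics for one profile.
ybar, s, n = 3.2, 1.1, 25
grid = [i / 100 for i in range(0, 801)]   # mu in [0, 8]
mu_hat = max(grid, key=lambda m: pivot_likelihood(m, ybar, s, n))
```

This is only an illustration of the pivotal-likelihood idea, not the construction used in the talk, where the pivots are tailored to the expert's elicited posteriors.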
- October 11
- Holiday
- October 18 "Exploring two 'omics' data sets with
classification trees."
- Susan Simmons, UNC Wilmington
- Abstract: The post-genomic era is witnessing a tremendous
increase in the number of studies involving the "omics" sciences.
Researchers are interested in obtaining information from areas such as
genomics, proteomics, transcriptomics, toxicogenomics, and
metabolomics. This presentation will illustrate classification-tree methods via two
analyses of "omics" data sets in which a tree structure is used for
classification purposes. First, a metabolomics data set with 105
metabolites will be explored to learn which metabolites are important
for classifying individuals into a diseased status versus a nondiseased
status. The second analysis involves a genomic data set in which
biomarkers are identified for predicting the location of a quantitative
trait (this is referred to as Quantitative Trait Loci, or QTL).
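Classification trees are built by recursively choosing the split that best purifies the class labels. The sketch below implements just one such step, a single-split "stump" chosen by Gini impurity, on a tiny made-up metabolite matrix (rows = subjects, columns = metabolites, y = 1 for diseased); the data are illustrative, not from the talk.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 2p(1-p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_stump(X, y):
    """Return the (feature, threshold) whose split x[feature] <= threshold
    minimizes the weighted Gini impurity of the two child nodes."""
    n, d = len(X), len(X[0])
    best = (None, None, float("inf"))
    for j in range(d):
        for t in sorted(set(row[j] for row in X)):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best[0], best[1]

# Metabolite 1 separates diseased from nondiseased; metabolite 0 is noise.
X = [[0.3, 1.0], [0.9, 1.2], [0.4, 3.1], [0.8, 3.3]]
y = [0, 0, 1, 1]
feature, threshold = best_stump(X, y)
```

A full tree repeats this search within each child node until the leaves are pure or too small; with 105 metabolites the same search simply runs over more columns.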
- October 25 "Statistical Learning and Applications in
Computational Biology."
- Sayan Mukherjee, Duke University
- Abstract: A brief introduction to statistical learning theory
will
be followed by a description of some problems of interest
and open questions related to these problems.
The introduction will consist of the formulation of the problem
and a description of classical notions such as generalization,
universal consistency, Vapnik-Chervonenkis dimension, and
uniform Glivenko-Cantelli classes. Recently developed relations
between the stability of learning algorithms (for example error
minimization of Tikhonov regularization) and generalization will
be described. Concentration inequalities such as the Efron-Stein
or McDiarmid's inequality are important in the stability results;
however, for this application the inequality needs to be generalized
to handle cases where the martingale difference is not uniformly
small but is small with high probability. Isoperimetry can be used
to get a tight bound on the size of the bad set, the set where the
martingale difference is not small.
Some computational/statistical problems in the analysis of
gene expression data will be discussed:
1) classifying binary and multiclass expression data
using Support Vector Machines
2) feature/gene selection algorithms
3) finding pathways/gene sets enriched in expression data.
Biological contexts for the above problems, as well as limitations of
some current techniques, will be described.
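For item (3), a standard building block (not necessarily the one used in the talk) is the one-sided hypergeometric test for overrepresentation of a gene set among the differentially expressed genes; the counts below are made up for illustration.

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P(overlap >= k) when n genes are drawn without replacement from N,
    of which K belong to the gene set: a hypergeometric tail sum."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# 100 genes total, 10 in the pathway, 10 called differentially
# expressed, 5 of which fall in the pathway.
p = enrichment_pvalue(N=100, K=10, n=10, k=5)
```

An overlap this large is very unlikely under random draws, so the pathway would be flagged as enriched; genome-scale analyses then correct such p-values for the many gene sets tested.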
- November 1 "Assessing Predictor Importance in GLMMs."
- Dawn Banard, Duke University
- Abstract: In generalized linear mixed models the normal distribution does not
necessarily provide an increasingly good approximation to the
likelihood as the size of the data set increases, since the number of
random effects may also increase. This means that even in large data
sets the chi-squared approximation to the deviance may be quite bad.
Therefore the p-value obtained using this approximation may be a
misleading way to assess whether a predictor can be removed from the
model.
I will consider alternative criteria for assessing the importance of an
arbitrary predictor. The Bayes factor and posterior odds do not rely
on an asymptotic normal approximation to the likelihood, so I will
focus on these criteria. I will compare the likely computational
burden of the various methods for estimating these two quantities.
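The GLMM setting is complex, but the contrast the abstract draws (a Bayes factor instead of a chi-squared p-value) can be shown in a toy conjugate example, not the talk's model: y_i ~ N(theta, sigma^2) with sigma known, comparing M0: theta = 0 against M1: theta ~ N(0, tau^2). Since ybar is sufficient, the Bayes factor reduces to a ratio of two normal densities evaluated at ybar.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def bf01(ybar, n, sigma=1.0, tau=1.0):
    """Bayes factor for M0 (theta = 0) versus M1 (theta ~ N(0, tau^2)),
    for y_i ~ N(theta, sigma^2) with sigma known and ybar sufficient."""
    v = sigma ** 2 / n
    return normal_pdf(ybar, v) / normal_pdf(ybar, v + tau ** 2)

far = bf01(ybar=1.5, n=20)    # data far from 0: evidence against M0
near = bf01(ybar=0.05, n=20)  # data near 0: evidence for M0
```

Note that BF01 > 1 when the data sit near zero: unlike a p-value, the Bayes factor can accumulate evidence *for* dropping the predictor, not merely fail to reject.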
- November 8
"Generalized Spatial Dirichlet Process Model."
- Jason Duan, Duke University
- and Michele Guindani and Alan Gelfand
-
Abstract:
We propose a novel spatial process mixture model that is marginally
Dirichlet at any location and spatially dependent. In this presentation, we
first develop a generalized spatial Dirichlet process model for spatial
data and discuss its properties. Owing to the discreteness of
Dirichlet process models, we mix this process with a pure error process.
Bayesian posterior inference is implemented using Gibbs sampling.
Spatial prediction under this model is discussed.
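The discreteness the abstract refers to comes from the Dirichlet process's stick-breaking representation: weights w_k = v_k * prod_{j<k}(1 - v_j) with v_k ~ Beta(1, alpha), attached to atoms drawn from the base measure. A minimal truncated sketch (not the spatial extension of the talk):

```python
import random

def stick_breaking(alpha, K, seed=0):
    """Draw the first K stick-breaking weights of a DP(alpha, G0):
    w_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, alpha)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(K):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

w = stick_breaking(alpha=1.0, K=200)
```

Because realizations are almost surely discrete, two locations would share atoms exactly; mixing against a pure error process, as in the talk, smooths the resulting marginals.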
- November 15 "Preserving Data Confidentiality While Encouraging Data Sharing."
- Christine Kohnen, Duke University
-
Abstract:
Data disseminators, in particular statistical agencies, must balance
the task of providing users of public-use data with enough
information for inference against protecting the confidentiality of
their respondents. Due to the potential consequences of data
disclosures, agencies operate conservatively with regard to data
protection and disclosure risk. Their concern for disclosure
limitation not only limits the information agencies can release for
public use, but also what they can share with other agencies. Even in
situations where the sharing of data between agencies would be
mutually beneficial to the participating agencies, and in turn to
potential users of public-use data, data sharing is at best limited.
One approach agencies can use to safely share their data, and to
create public-use data in the process, is to exchange synthetic
rather than real data. I will discuss the use of multiple imputation
methods as a means of sharing confidential data among statistical
agencies. In addition, inferential methods for combining multiple
data sets will be presented, supported by simulation studies and some
preliminary results using real data.
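The abstract does not state which combining rules it develops; as a point of reference, the standard multiple-imputation combining rules for partially synthetic data take this form (the estimates and variances below are made up):

```python
def combine_partially_synthetic(q, u):
    """Combine m point estimates q[i] and their estimated variances u[i]
    from partially synthetic data sets: combined estimate q_bar with
    total variance T = u_bar + b / m."""
    m = len(q)
    q_bar = sum(q) / m                                  # combined estimate
    b = sum((qi - q_bar) ** 2 for qi in q) / (m - 1)    # between-set variance
    u_bar = sum(u) / m                                  # average within variance
    T = u_bar + b / m                                   # total variance
    return q_bar, T

q = [10.2, 9.8, 10.5, 10.1, 9.9]   # estimates from m = 5 synthetic sets
u = [0.40, 0.38, 0.41, 0.39, 0.42]
q_bar, T = combine_partially_synthetic(q, u)
```

The between-set term b/m is the price of synthesis: it inflates the variance to reflect the extra uncertainty introduced by imputing values rather than releasing real ones.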
- November 22 "U-Control Charts revisited."
- Susie Bayarri, Duke/SAMSI/University of Valencia
- Abstract: U-control charts are simple graphical tools widely
used for monitoring whether a production process is in or out of
control. In spite of their indiscriminate use, they present serious
limitations which are not usually taken into account. One severe
limitation is the assumption (often inadequate) of Poisson counts; more
importantly, they require a (questionable) 'base period' for
estimation purposes, in which the process is assumed to be under
control. During this period (usually quite long) no action can be
taken. In this talk, we present interesting (Bayesian) alternatives to
the usual u-control charts. We use empirical-Bayes and Bayes methods and
compare them with the traditional frequentist implementation. Empirical
Bayes methods are somewhat easier to implement and deal nicely with
extra-Poisson variability (and, at the same time, informally check
the adequacy of the Poisson assumption). However, they still need the
questionable base period. The sequential, full Bayes approach, on the
other hand, avoids this drawback of traditional u-charts.
The implementation requires numerical simulation, as well as the use
of a prior distribution. Several possibilities for both objective and
informative priors are explored. We will argue that the sequential,
full Bayesian u-control chart is a powerful and versatile tool for
process monitoring.
This is joint work with G. García-Donato, University of
Castilla-La-Mancha
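The traditional chart the talk takes as its baseline is simple to state: with defect counts c_i over n_i inspection units and ubar estimated from the base period, point i signals when u_i = c_i / n_i falls outside ubar ± 3·sqrt(ubar / n_i). A minimal sketch, with made-up counts:

```python
import math

def u_chart_signals(counts, units, ubar):
    """Indices of points outside the 3-sigma u-chart limits
    ubar +/- 3 * sqrt(ubar / n_i)."""
    signals = []
    for i, (c, n) in enumerate(zip(counts, units)):
        u = c / n
        half = 3 * math.sqrt(ubar / n)
        if u > ubar + half or u < ubar - half:
            signals.append(i)
    return signals

ubar = 2.0                     # defects per unit, from the base period
counts = [8, 11, 9, 30, 10]    # made-up monitoring counts
units = [5, 5, 5, 5, 5]        # inspection units per sample
out = u_chart_signals(counts, units, ubar)
```

The two weaknesses the talk attacks are visible here: ubar is a plug-in from an assumed in-control base period, and the 3-sigma limits lean on the Poisson variance ubar/n_i.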
- November 29 "An Introduction to Parallel Computing @ Duke Statistics"
- Chris Hans, Duke University
- Abstract: It is anticipated that parallel computing resources will be increasingly
available to Duke Statistics students and affiliates during the next few years. In
order to facilitate use of these resources, I will present an introductory
tutorial to parallel computing, drawing on my experiences from the last
two years. I plan to cover: (i) basic structure and the "big picture" of
distributed computing, (ii) example code for simple parallel computing
problems and (iii) a description of how to run parallel code on Duke's
CSEM Cluster (the cluster for which access is expected to increase over
the next few years). This tutorial is primarily intended for people with
little or no knowledge of parallel computing and will not address any
advanced distributed computing concepts or theory.
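The "big picture" of part (ii) is usually a map over workers: independent tasks farmed out to a pool and results gathered. A self-contained sketch of that pattern (a thread pool is used here so the example runs anywhere; for CPU-bound simulations one would use multiprocessing.Pool or MPI on a cluster, but the map-style structure is the same):

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """Stand-in for one independent simulation run (a toy linear
    congruential generator doing some deterministic work)."""
    x = seed
    for _ in range(1000):
        x = (1103515245 * x + 12345) % (2 ** 31)
    return x % 100

def run_parallel(seeds, workers=4):
    """Farm the independent runs out to a worker pool and collect results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, seeds))

results = run_parallel(range(8))
```

Because the runs share no state, the results are identical to a serial loop over `simulate`; that independence is exactly what makes such problems "embarrassingly parallel" and a good first target on a cluster.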
Discussion Papers: