.. bootstrap confidence intervals file, created by ARichards ============================== Bootstrap Confidence Intervals ============================== We present a problem and show a model based approach to estimating confidence intervals then we follow up with a bootstrap based approach. This example implements the bias-corrected and accelerated :math:`BC_{a}` method to calculate confidence intervals. The notation us borrowed from Efron and Tibshirani's *An Introduction to the Bootstrap* [1]. The Problem ___________ Ten university students were recruited to play the seminal video game pac-man. After equal amounts of practice the students were then asked to play the first stage and the time to completion was recorded. The following data in seconds were recorded. >>> import numpy as np >>> x = np.array([63.1,64.6,63.4,65.6,63.4,65.7,64.6,63.1,65.9,63.3]) Method 1 ________ Because we do not know the population variance the sample standard deviation :math:`s^{2}` can be used as a subsitute. .. math:: :nowrap: \begin{equation*} s_{d} = \sqrt{\frac{\sum^{n}_{i=1} (x_{i} - \bar{x} )^{2}}{n-1}} \end{equation*} >>> sd = np.sqrt(np.sum(np.power(x - x.mean(),2)) / (x.size-1)) >>> sd 1.1489608831945115 Then we can calculate the CI as .. math:: :nowrap: \begin{equation*} CI: \left(\bar{x} \pm t_{n-1},1-\frac{\alpha}{2} \times \frac{s_{d}}{\sqrt{n}} \right) \end{equation*} >>> import scipy.stats as stats >>> alpha = 0.05 >>> interval = stats.t.ppf(1.0 - (alpha / 2.0),x.size-1) * (sd / np.sqrt(x.size)) >>> ci = (x.mean() - interval, x.mean() + interval) >>> ci (63.448082897537432, 65.09191710246256) We are 95% confident that the true population mean lies between 63.45 and 65.09. Method 2 ________ let :math:`\mathbf{x}^{*}` be a nonparametric bootstrap sample from the empirical distribution :math:`\hat{F}` .. math:: :nowrap: \begin{equation*} \hat{F} \rightarrow \mathbf{x}^{*} = \{ x_{1}^{*}, \ldots, x_{n}^{*} \} \end{equation*} We note that the difference among the means hints that the treatment group may have longer survival times in general. Example from immunology _______________________ In these data we have reason to believe that the confidence intervals will not always be symmetric about :math:`\hat{\theta}`. We use two methods to illustrate the differences in CI estimates. Let the data be the cytokine responses (\%) for a sample measured by 16 laboratories. For the purposes of this example we will assume independence among the samples. >>> import numpy as np >>> x = np.array([0.4043, 0.5958, 0.5876, 0.5939, 0.6053, 0.4649, 0.5722,\ ... 0.6234, 0.5208, 0.6117, 0.6137, 0.7397, 0.6981, 0.7757, 0.7596, 0.532]) **Method 2** - The :math:`BC_{a}` method described above Further Notes _____________ An alternative way to analyze these data is to use normal theory and assume that both underlying distributions are Gaussian. We can then calculate the *achieved significance level* ASL as .. math:: ASL = 1 - \Phi \left( \frac{\hat{\theta}}{\sigma \sqrt{1/n+1/m}} \right) where :math:`\Phi` is the standard normal cdf. Because :math:`\sigma` is unknown we can use .. math:: \bar{\sigma} = \left\{ \frac{ \sum^{n}_{i=1} (z_{i}-\bar{z})^{2} + \sum^{m}_{j=1} (y_{j}-\bar{y})^{2} } {n+m-2} \right\}^{\frac{1}{2}} >>> sigma_bar = np.sqrt(( np.sum((z-z.mean())**2) + np.sum((y-y.mean())**2) ) / (z.size + y.size - 2.0)) >>> sigma_bar 54.235210425392665 such that >>> import scipy.stats as stats >>> 1.0 - stats.norm.cdf(theta_hat / (sigma_bar * np.sqrt((1.0/z.size)+(1.0/y.size)))) 0.13117683784495782 Instead of estimating the variance with :math:`\bar{\sigma}`, a fixed constant, we can use Student's *t*-test. >>> 1.0 - stats.t.cdf(theta_hat / (sigma_bar * np.sqrt((1.0/z.size)+(1.0/y.size))),14) 0.1406062923976501 In both cases we fail to reject the null. __________________ * `Literate Programming `_ Bibliographic notes ___________________ 1. Efron, Bradley and Tibshirani, R. J., An Introduction to the Bootstrap,Chapman & Hall/CRC Monographs on Statistics & Applied Probability, 1994.