.. reproducible research file, created by ARichards

================
Permutation Test
================

Permutation tests are a group of nonparametric statistics.  Here we use a permutation test to test the null hypothesis that two different groups come from the same distribution.  The notation and examples shown here are borrowed from Efron and Tibshirani's *An Introduction to the Bootstrap* [1].  Our specific problem is that we observe two groups of data

.. math::
    :nowrap:

    \begin{eqnarray*}
        F \rightarrow \mathbf{z} &= \{ z_{1}, \ldots, z_{n} \}\\
        G \rightarrow \mathbf{y} &= \{ y_{1}, \ldots, y_{m} \}
    \end{eqnarray*}

and we are interested in testing the null hypothesis that

.. math::

    H_{0}: F = G.
 
Data
____

In the case of this example the following data are survival times for treatment :math:`z` and control :math:`y` groups.

>>> import numpy as np 
>>> z = np.array([94,197,16,38,99,141,23])
>>> y = np.array([52,104,146,10,51,30,40,27,46])

We note that the difference among the means hints that the treatment group may have longer survival times in general.

>>> theta_hat = z.mean() - y.mean()
>>> print theta_hat
30.6349206349

Here we run the permutation test

>>> def run_permutation_test(pooled,sizeZ,sizeY,delta):
...     np.random.shuffle(pooled)
...     starZ = pooled[:sizeZ]
...     starY = pooled[-sizeY:]
...     return starZ.mean() - starY.mean()
...

>>> pooled = np.hstack([z,y])
>>> delta = z.mean() - y.mean()
>>> numSamples = 10000
>>> estimates = np.array(map(lambda x: run_permutation_test(pooled,z.size,y.size,delta),range(numSamples)))
>>> diffCount = len(np.where(estimates <=delta)[0])
>>> hat_asl_perm = 1.0 - (float(diffCount)/float(numSamples))
>>> hat_asl_perm
0.14070000000000005

and we fail to reject the null hypothesis.

Further Notes
_____________

An alternative way to analyze these data is to use normal theory and assume that both underlying distributions are Gaussian.  We can then calculate the *achieved significance level* ASL as

.. math::

   ASL = 1 - \Phi \left( \frac{\hat{\theta}}{\sigma \sqrt{1/n+1/m}} \right)

where :math:`\Phi` is the standard normal cdf.  Because :math:`\sigma` is unknown we can use

.. math::

  \bar{\sigma} = \left\{ \frac{ \sum^{n}_{i=1} (z_{i}-\bar{z})^{2} + \sum^{m}_{j=1} (y_{j}-\bar{y})^{2} } {n+m-2} \right\}^{\frac{1}{2}}  

>>> sigma_bar = np.sqrt((  np.sum((z-z.mean())**2) + np.sum((y-y.mean())**2) ) / (z.size + y.size - 2.0))
>>> sigma_bar
54.235210425392665

such that

>>> import scipy.stats as stats
>>> 1.0 - stats.norm.cdf(theta_hat / (sigma_bar * np.sqrt((1.0/z.size)+(1.0/y.size))))
0.13117683784495782

Instead of estimating the variance with :math:`\bar{\sigma}`, a fixed constant, we can use Student's *t*-test.

>>> 1.0 - stats.t.cdf(theta_hat / (sigma_bar * np.sqrt((1.0/z.size)+(1.0/y.size))),14)
0.1406062923976501  

In both cases we fail to reject the null.

__________________


* `Literate Programming <http://www.literateprogramming.com/tools.html>`_

Bibliographic notes
___________________

1. Efron, Bradley and Tibshirani, R. J., An Introduction to the Bootstrap,Chapman & Hall/CRC Monographs on Statistics & Applied Probability, 1994.