Sta112FS Lecture 8 - Statistical inference, Pt. 2

September 18, 2014

Fom last time

Organ donors

People providing an organ for donation sometimes seek the help of a special "medical consultant". These consultants assist the patient in all aspects of the surgery, with the goal of reducing the possibility of complications during the medical procedure and recovery. Patients might choose a consultant based in part on the historical complication rate of the consultant's clients.

One consultant tried to attract patients by noting that the average complication rate for liver donor surgeries in the US is about 10%, but her clients have only had 3 complications in the 62 liver donor surgeries she has facilitated. She claims this is strong evidence that her work meaningfully contributes to reducing complications (and therefore she should be hired!).

\(\frac{3}{62}\) = 0.048

\(H_0: p = 0.10\)

\(H_A: p < 0.10\)

Simulation based inference

Simulate many random samples of \(n = 62\) assuming that \(H_0: p = 0.10\) is true.
For each simulation, record \(\hat{p}\), the proportion of successes (complications during surgeries)
Calculate the p-value = P(observed or more extreme outcome | \(H_0\) true)
If p-value < significance level, reject \(H_0\) in favor of \(H_A\). If not, fail to reject \(H_0\).

Application exercise 6:

Automate the process of constucting the simulated null distribution using 100 simulations. Plot the distribution using a stacked dot plot, and calculate the p-value two ways. First, counting the dots on the plot, and then using R and subsetting.

Challenge: Your code should have as few hard coded arguments as possible. The ultimate goal is to be able to re-use the code with little modification for another dataset/hypothesis test.

Types of alternative hypotheses

One sided (one tailed) alternatives: The parameter is hypothesized to be less than or greater than the null value (< or >)
Two sided (two tailed) alternatives: The parameter is hypothesized to be not equal to the null value (\(\ne\))
- Calculated as two times the tail area beyond the observed sample statistic
- More objective, and hence more widely preferred

Functions

Function Basics

In R functions are first order objects, this means we can work with them like any other object in R.

f = function(x) x*x
list(f)

## [[1]]
## function (x) 
## x * x

typeof(f)

## [1] "closure"

Function Parts

The two parts of a function are the arguments (formals) and the code (body).

gcd = function(loc1, loc2)
{
  deg2rad = function(deg) return(deg*pi/180)

  lat1 = deg2rad( loc1[1] )
  lat2 = deg2rad( loc2[1] )
  long1 = deg2rad( loc1[2] )
  long2 = deg2rad( loc2[2] )

  R = 6371 # Earth mean radius in km
  d = acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
  
  return(d) # distance in km
}

formals(gcd)

## $loc1
## 
## 
## $loc2

body(gcd)

## {
##     deg2rad = function(deg) return(deg * pi/180)
##     lat1 = deg2rad(loc1[1])
##     lat2 = deg2rad(loc2[1])
##     long1 = deg2rad(loc1[2])
##     long2 = deg2rad(loc2[2])
##     R = 6371
##     d = acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * 
##         cos(long2 - long1)) * R
##     return(d)
## }

los_angeles = c(34.052235, -118.243683)
durham = c(36.002453, -78.905869)

gcd(los_angeles, durham)

## [1] 3564

Return values

In the preceding slides we have seen two approaches for returning values: explicit and implicit return values. The former should be preferred of the later except in the case of very simple functions.

Explicit - includes one or more returns

f = function(x) 
      return(x*x)

Implicit - value from last statement is returned.

f = function(x) 
      x*x

Returning multiple and named values

If we want a function to return more than one value we can group things using either vectors or lists.

f = function(x) list(value = x, squared = x^2, cubed = x^3)

f(2)

## $value
## [1] 2
## 
## $squared
## [1] 4
## 
## $cubed
## [1] 8

f(2:3)

## $value
## [1] 2 3
## 
## $squared
## [1] 4 9
## 
## $cubed
## [1]  8 27

Argument defaults

In R it is possible to give function arguments default values,

f = function(x=1,y=1,z=1) paste0("x=",x," y=",y," z=",z)

f()

## [1] "x=1 y=1 z=1"

f(x = 2, y = 4, z = 9)

## [1] "x=2 y=4 z=9"

f(z=3)

## [1] "x=1 y=1 z=3"

Scoping

R has generous scoping rules, if it can't find a variable in the functions body's scope, it will look for it in the next higher scope, and so on.

y = 1
f = function(x) 
{
  x+y
}
f(3)

## [1] 4

g = function(x)
{
  y=2
  x+y
}
g(3)

## [1] 5

Additionally, variables defined within a scope only persist for the duration of that scope, and do not overwrite variables at the higher scopes.

x = 1
y = 1
z = 1
f = function()
{
    y = 2
    g = function()
    {
      z = 3
      return(x + y + z)
    }
    return(g())
}
f()

## [1] 6

c(x,y,z)

## [1] 1 1 1

Getting Help

Prefixing any function name with a ? will open the related help file for that function.

?sum
?`+`
?`[`

For functions not in the base package, you can also see their implementation by entering the function name without parentheses (or using body function).

lm

## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     mf <- match.call(expand.dots = FALSE)
##     m <- match(c("formula", "data", "subset", "weights", "na.action", 
##         "offset"), names(mf), 0L)
##     mf <- mf[c(1L, m)]
##     mf$drop.unused.levels <- TRUE
##     mf[[1L]] <- quote(stats::model.frame)
##     mf <- eval(mf, parent.frame())
##     if (method == "model.frame") 
##         return(mf)
##     else if (method != "qr") 
##         warning(gettextf("method = '%s' is not supported. Using 'qr'", 
##             method), domain = NA)
##     mt <- attr(mf, "terms")
##     y <- model.response(mf, "numeric")
##     w <- as.vector(model.weights(mf))
##     if (!is.null(w) && !is.numeric(w)) 
##         stop("'weights' must be a numeric vector")
##     offset <- as.vector(model.offset(mf))
##     if (!is.null(offset)) {
##         if (length(offset) != NROW(y)) 
##             stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
##                 length(offset), NROW(y)), domain = NA)
##     }
##     if (is.empty.model(mt)) {
##         x <- NULL
##         z <- list(coefficients = if (is.matrix(y)) matrix(, 0, 
##             3) else numeric(), residuals = y, fitted.values = 0 * 
##             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
##             0) else if (is.matrix(y)) nrow(y) else length(y))
##         if (!is.null(offset)) {
##             z$fitted.values <- offset
##             z$residuals <- y - offset
##         }
##     }
##     else {
##         x <- model.matrix(mt, mf, contrasts)
##         z <- if (is.null(w)) 
##             lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
##                 ...)
##         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
##             ...)
##     }
##     class(z) <- c(if (is.matrix(y)) "mlm", "lm")
##     z$na.action <- attr(mf, "na.action")
##     z$offset <- offset
##     z$contrasts <- attr(x, "contrasts")
##     z$xlevels <- .getXlevels(mt, mf)
##     z$call <- cl
##     z$terms <- mt
##     if (model) 
##         z$model <- mf
##     if (ret.x) 
##         z$x <- x
##     if (ret.y) 
##         z$y <- y
##     if (!qr) 
##         z$qr <- NULL
##     z
## }
## <bytecode: 0x104474e28>
## <environment: namespace:stats>

When to use functions

The goal of a function should be to encapsulate a small reusable piece of code.

Name should make it clear what the function does (think in terms of simple verbs).
Functionality should be simple enough to be quickly understood.
The smaller and more modular the code the easier it will be to reuse elsewhere.
Better to change code in one location than code everywhere.

Application exercise 7:

Write a function that takes in the birth month of a person, and outputs the phrase "You are a season baby!", where season is determined by birth month.

Hint: The paste function might be useful.

Application exercise 8:

Write a function with the inputs data (a vector of the observed data), null (null value), alternative (the alternative hypothesis, options: less than, greater than, not equal to), success (level of the categorical variable to be considered a success), seed (the seed to be set). As an output, the function should report the null hypothesis, the alternative hypothesis, a summary of the dataset, the observed sample statistic, the p-value, and the simulated null distribution.
Confirm that your function is working properly using the earlier dataset/hypothesis test on livery donor surgery complications.
One of the earliest examples of behavioral asymmetry is a preference in humans for turning the head to the right, rather than to the left, during the final weeks of gestation and for the first 6 months after birth.This is thought to influence subsequent development of perceptual and motor preferences. A study of 124 couples found that 64.5% turned their heads to the right when kissing. Do these data provide convincing evidence that majority of couples turn their heads to the right when kissing? Use the function you created earlier to conduct the appropriate hypothesis test. You can download the dataset at https://stat.duke.edu/courses/Fall14/sta112.01/data/kissing.html.

Hint: If you need a reminder on how to load a csv file into R, see HW2: https://stat.duke.edu/courses/Fall14/sta112.01/hw/hw2.html.