Homework 4

Due Tues 5 Sept in class

Assignment:

This assignment contains homework problems out of the text as well as a lab portion.

As usual, go thru all the conceptual exercises in Chapter 2.
Chapter 2 Exercises: 13, 15, 16 (compare to the permutation results you obtained in hw3), 20 and 21.
Make the power curve plot which is assigned at the end of the lab below. *Note that the lab sections under the orange headings are not directly relevant to the assignment - they give additional info relating Splus to the topic we are studying. Hence they can be skipped without any harm.

Doing t calculations in Splus
- Let's go thru the example of Bumpus's data, which looks at survival and length of humerus for adult male house sparrows -- see Case study 2.1.1. Assume I've read in the data CASE0201.ASC into a dataframe called bumpus.
- The following commands give summary statistics for the humerus lengths by group. Of course, you would also like to look at the data visually, but I'm just talking about Splus computations here.
```
ok1_bumpus$code==1
ok2_bumpus$code==2
ybar1_mean(bumpus$humerus[ok1])
ybar2_mean(bumpus$humerus[ok2])
sd1_sqrt(var(bumpus$humerus[ok1]))
sd2_sqrt(var(bumpus$humerus[ok2]))
n1_length(bumpus$humerus[ok1])
n2_length(bumpus$humerus[ok2])
```
- This should give the solutions:
  
  | Group | n | Average (in/1000) | SD (in/1000) |
  |---|---|---|---|
  | 1 Died | 24 | 727.92 | 23.54 |
  | 2 Survived | 35 | 738.00 | 19.84 |
  
  *note this is in thousandths of an inch; Display 2.8 in the book uses inches. You can get the answer in inches by multiplying the means and sds by 0.001.
- The following gives the t-statistic to test whether or not the means are equal.
```
# calculate the pooled sd estimate
sdpool_sqrt(((n1-1)*sd1^2 + (n2-1)*sd2^2)/(n1+n2-2))
# calculate se for ybar2-ybar1
sediff_sdpool*sqrt(1/n1 + 1/n2)
# compute the t-statistic 
tval_(ybar2 - ybar1)/sediff
# now get the area to the left under the t curve
P_pt(tval,df=n1+n2-2)
# now get the one-sided p-value
1-P
# the two sided p-value is twice that above
(1-P)*2
```
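  For reference, the commands above correspond to the usual pooled two-sample formulas. The symbols are my own shorthand for the Splus names: s_p for sdpool, s_1 and s_2 for sd1 and sd2, and SE for sediff.

```latex
\begin{aligned}
s_p &= \sqrt{\frac{(n_1-1)\,s_1^2 + (n_2-1)\,s_2^2}{n_1+n_2-2}}, \\
\mathrm{SE}(\bar{y}_2-\bar{y}_1) &= s_p\,\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}, \\
t &= \frac{\bar{y}_2-\bar{y}_1}{\mathrm{SE}(\bar{y}_2-\bar{y}_1)},
\qquad \mathrm{df} = n_1+n_2-2 .
\end{aligned}
```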
- To get a 95% confidence interval:
```
# get the required t_df(1-alpha/2)..
tstar_qt(1-.05/2,df=n1+n2-2)
# lower end of the ci
lower_ybar2 - ybar1 - tstar*sediff
# upper end of the ci
upper_ybar2 - ybar1 + tstar*sediff
# print out the ci:
print(c(lower,upper))
```
- Just to check, here are the values for various quantities:
```
> round(c(sdpool,sediff,tval,P,1-P,(1-P)*2,tstar,lower,upper),3)
[1] 21.411  5.674  1.777  0.960  0.040  0.081  2.002  -1.279  21.446
```
Using the t.test() function from the command line
- A very useful Splus function is t.test(). Type ?t.test to find more info on it. It takes one or two data vectors as its argument - one for a one-sample t-test; two for a two-sample t-test.
- As an example, I'll use the Bumpus data above:
```
ok1_bumpus$code==1
ok2_bumpus$code==2
y1_bumpus$humerus[ok1]
y2_bumpus$humerus[ok2]
```
  Now y1 and y2 hold the data from the two groups. To compute a two sample t-test to compare mu2 to mu1:
```
> t.test(y2,y1)

         Standard Two-Sample t-Test 

data:  y2 and y1 
t = 1.777, df = 57, p-value = 0.0809 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
  -1.279386  21.446053 
sample estimates:
 mean of x mean of y 
       738  727.9167
```
  Compare these results to those above.
- You can also assign the result of t.test() to an object and then use various pieces of it.
```
> a_t.test(y2,y1)
```
  a is a list of values. To see what is in a, type:
```
> names(a)
[1] "statistic"   "parameters"  "p.value"     "conf.int"    "estimate"   
[6] "null.value"  "alternative" "method"      "data.name"  
```
  So you can grab the p-value, confidence interval, or t-statistic for example.
```
> a$statistic
        t 
 1.776998
> a$p.value
[1] 0.0809045
> a$conf.int
[1]  -1.279386  21.446053
attr(, "conf.level"):
[1] 0.95
```
  This feature can be very convenient if you're doing many t-tests or running a simulation.
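  As a rough illustration of the simulation use, here is a minimal sketch written in R syntax. The replicate() call, the var.equal=TRUE argument (R's t.test() does not pool the variances by default), the seed, and the group sizes and means are all my assumptions for the example, not part of the lab.

```r
# Sketch: run many pooled two-sample t-tests on simulated data
# and keep only the p-value from each one.
set.seed(1)      # arbitrary seed, only so the run is reproducible
nsim <- 200      # hypothetical number of simulated studies

pvals <- replicate(nsim, {
  x <- rnorm(10, mean = 1, sd = 2)   # simulated group 1
  y <- rnorm(15, mean = 0, sd = 2)   # simulated group 2
  t.test(x, y, var.equal = TRUE)$p.value
})

# fraction of simulated studies that are significant at the .05 level
print(mean(pvals < 0.05))
```

  The fraction printed at the end is a simulation-based estimate of the power of the two-sided test for this particular setup.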
Writing functions in Splus
- A very useful feature of Splus is writing functions. Here's an example of a function:
```
> sd_function(x) sqrt(var(x))
```
  This creates a function called sd(). It takes a vector of numbers x as its argument and computes the standard deviation. To try it out type:
```
> sd(c(1,2,3))
[1] 1
> sd(rnorm(5,mean=69,sd=3))
[1] 5.682016
> sd(rnorm(500,mean=69,sd=3))
[1] 2.952136
```
- Here's a slightly more elaborate function. It takes in two numeric vectors and returns a vector where the first element is the lower end of the 95% confidence interval, and the second element is the upper end.
```
findci_function(x,y){
 a_t.test(x,y)
 ci_as.vector(a$conf.int)
 return(ci)
}
```
  Now try it out on some data:
```
> findci(rnorm(10,mean=0,sd=2),rnorm(15,mean=1,sd=2))
[1] -2.124404  1.049983
> findci(y2,y1)
[1]  -1.279386  21.446053
```
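  Function arguments can also have defaults. The sketch below, in R syntax, is a hypothetical variant of findci() (I call it findci2) that takes the confidence level as an optional argument. It assumes t.test() accepts conf.level, and it sets var.equal=TRUE so that R uses the pooled test; the simulated data and seed are arbitrary.

```r
# Hypothetical variant of findci() with an adjustable confidence level.
findci2 <- function(x, y, level = 0.95) {
  a <- t.test(x, y, conf.level = level, var.equal = TRUE)
  as.vector(a$conf.int)   # drop the conf.level attribute
}

set.seed(2)                        # arbitrary seed for reproducibility
x <- rnorm(10, mean = 0, sd = 2)
y <- rnorm(15, mean = 1, sd = 2)

ci95 <- findci2(x, y)              # default 95% interval
ci99 <- findci2(x, y, level = 0.99)
print(rbind(ci95, ci99))
```

  The 99% interval should come out wider than the 95% interval on the same data.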
Computing the power of a 2-sample t-test.
- Power is the chance of obtaining a significant result for a given true difference between group means, standard deviation, and sample size. Here's a function that gives the power of a one-sided t-test for comparing two groups:
```
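power_function(mudiff,n,sd,level=.05){
 # standardized difference: mudiff divided by the se of the difference in means
 tstat_mudiff/sqrt(sd^2/n + sd^2/n)
 # df used for the t curve.  Note that n is the size of EACH group,
 # so the pooled two-sample t-test itself would have 2n-2 df.
 df_n-2
 # critical value for a one-sided test at the given level
 tcrit_qt(1-level,df=df)
 # chance that the t-statistic lands above the critical value
 1-pt(tcrit-tstat,df)
}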
```
  The function requires the arguments:
  - mudiff the difference between group means. It's assumed to be positive.
  - n the sample size. It's assumed to be the same for each group.
  - sd the standard deviation for the observations. It's assumed the SD is the same for both groups.
  - Optional argument level . If you don't supply this argument it will be set to .05 (which is standard). If you want stronger evidence from your study, you might change the level to something like .01.
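  The function above shifts the ordinary t curve rather than using the noncentral t distribution, so it is an approximation. As a rough check, the sketch below (R syntax) compares it with R's built-in power.t.test(). The name power.approx is mine, chosen only to avoid clashing with other objects, and the inputs are just illustrative.

```r
# The lab's power function, rewritten with <- so it runs in R.
power.approx <- function(mudiff, n, sd, level = .05) {
  tstat <- mudiff / sqrt(sd^2/n + sd^2/n)
  df <- n - 2
  tcrit <- qt(1 - level, df = df)
  1 - pt(tcrit - tstat, df)
}

p.approx <- power.approx(1, 20, 3)

# power.t.test() is in R's stats package; n is the size of each group.
p.exact <- power.t.test(n = 20, delta = 1, sd = 3, sig.level = .05,
                        type = "two.sample",
                        alternative = "one.sided")$power

print(c(approx = p.approx, exact = p.exact))
```

  The two numbers should be close but not identical, since power.t.test() uses the noncentral t distribution with 2n-2 degrees of freedom.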
- Paste this function into Splus. The chance of detecting a difference between means (at the .05 level) when the actual mean difference is 1, the SD is 3, and the number of subjects in each group is 20 is
```
> power(1,20,3)
[1] 0.2525874
```
  For a sample size of 50 each it's
```
> power(1,50,3)
[1] 0.4958101
```
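  Researchers often ask the reverse question: how many subjects per group are needed to reach a given power? Here is a minimal sketch in R syntax that searches over sample sizes with the same approximation. The 80% target, the search range 3 to 500, and the name power.approx are my assumptions for illustration.

```r
# The lab's power function, rewritten with <- so it runs in R.
power.approx <- function(mudiff, n, sd, level = .05) {
  tstat <- mudiff / sqrt(sd^2/n + sd^2/n)
  df <- n - 2
  tcrit <- qt(1 - level, df = df)
  1 - pt(tcrit - tstat, df)
}

target <- 0.80        # hypothetical target power
n.grid <- 3:500       # candidate per-group sample sizes (n >= 3 so df >= 1)

# power at each candidate sample size, for mean difference 1 and SD 3
pw <- sapply(n.grid, function(n) power.approx(1, n, 3))

# smallest per-group sample size that reaches the target
n.needed <- min(n.grid[pw >= target])
print(n.needed)
```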
An example
- Here's an example from consulting with researchers working on the FACE study here in Duke Forest. Here some trees are exposed to higher ambient CO₂ as compared to the control trees. Initial measurements of Vcmax were taken to get an idea of how many trees would be needed to detect a difference in Vcmax between the treatment and control trees. Here are the data:
```
trt_c(36.23,30.86,38.25,39.61,39.16,40.49,35.08,39.2)
cont_c(30.9,32.62,28.96,28.71,29.68,33.26)
```
  From this data we can get an idea of what the mean difference might be and what SD could be expected from a larger study.
```
> mean(trt)
[1] 37.36
> mean(cont)
[1] 30.68833
> mean(trt) - mean(cont)
[1] 6.671667
> ntrt_length(trt)
[1] 8
> ncont_length(cont)
[1] 6
> sdpool_sqrt((var(trt)*(ntrt-1) + var(cont)*(ncont-1))/(ntrt+ncont-2)) 
[1] 2.72809
```
  So if there were 10 trees in each group, a good guess at the chance of detecting a difference in mean Vcmax is:
```
> power(6.67,10,2.73)
[1] 0.9965274
```
  So a difference would very likely be detected.
- A single calculation isn't always super useful to the researchers. Sometimes it's helpful to give them a graph of power as a function of the difference in means. Here's one graph I supplied the researchers. I take advantage of the fact that power works even if I put in a vector of values for mudiff.
```
> power(c(1.1,2.2,3.3),6,sdpool)
[1] 0.1125099 0.2515317 0.4862440
```
  Here are the Splus commands to make a pretty graph:
```
# consider 5 different sample sizes: n=3,4,5,6 and 10
n_c(3,4,5,6,10)
# create a vector mdiff which ranges from 0 to 10, and
# whose length is 40.
mdiff_seq(0,10,length=40)
# now make the plotting region, but use type="n" so that nothing
# is yet plotted.  I'll make sure all the labels and stuff
# are ok.
plot(mdiff,power(mdiff,10,sdpool),type="n",
     ylab="power - chance of rejecting Ho",
     xlab="mu(treatment) - mu(control)",
     main="Power curve for Vcmax, using t-test with pooled se")
# Now go ahead and put in lines that correspond to the
# power as a function of mdiff for the various sample sizes -
# mdiff on the x-axis and the power on the y-axis.
# The argument  lwd=3  tells Splus to make the
# line width 3 times larger than normal.
lines(mdiff,power(mdiff,3,sdpool),lwd=3)
lines(mdiff,power(mdiff,4,sdpool),lwd=3)
lines(mdiff,power(mdiff,5,sdpool),lwd=3)
lines(mdiff,power(mdiff,6,sdpool),lwd=3)
lines(mdiff,power(mdiff,10,sdpool),lwd=3)
# finally, put on labels so we can tell which line goes with
# which sample size.  The line that increases the slowest is
# the one with the smallest sample size.  After you enter
# the text command below, you need to click on the plot at 5 different
# locations so that Splus knows where to put the labels. 
# At the first click, "n=3" will be added to the plot;
# at the 5th click,   "n=10" will be added to the plot.
# Remember to hit the escape key if you get stuck.
text(locator(5),c("n=3","n=4","n=5","n=6","n=10"))
```
  Here is the resulting picture I got.
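  The locator() step needs mouse clicks, so it can't be run from a script. As an alternative, here is a sketch in R syntax that tells the curves apart by line type and adds a legend instead. The use of matplot() and legend(), the name power.approx, and typing in the pooled SD by hand are my choices for the sketch, not the commands used for the picture above.

```r
# The lab's power function, rewritten with <- so it runs in R.
power.approx <- function(mudiff, n, sd, level = .05) {
  tstat <- mudiff / sqrt(sd^2/n + sd^2/n)
  df <- n - 2
  tcrit <- qt(1 - level, df = df)
  1 - pt(tcrit - tstat, df)
}

sdpool <- 2.72809                     # pooled SD from the Vcmax data
n <- c(3, 4, 5, 6, 10)                # sample sizes to compare
mdiff <- seq(0, 10, length.out = 40)  # mean differences on the x-axis

# one column of power values per sample size
pw <- sapply(n, function(k) power.approx(mdiff, k, sdpool))

# draw all five curves at once, one line type per sample size
matplot(mdiff, pw, type = "l", lty = 1:5, col = 1, lwd = 3,
        ylab = "power - chance of rejecting Ho",
        xlab = "mu(treatment) - mu(control)",
        main = "Power curve for Vcmax, using t-test with pooled se")

# the legend replaces the hand-placed labels
legend("bottomright", legend = paste("n=", n, sep = ""),
       lty = 1:5, lwd = 3)
```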
Lab Assignment