1. Unbalanced ANOVA: Consider the model \(y_{i,j} = \theta_j + \epsilon_{i,j}\) for \(i=1,\ldots, r_j\) and \(j=1,\ldots,p\), where the \(\epsilon_{i,j}\)’s are i.i.d. normal with mean zero and variance \(\sigma^2\). Note that the sample size for each group may be different.

    1. Write out a (full rank) model matrix and obtain a formula for \(\hat y\) and \(SSE\).
    2. Obtain formulas for \(\hat y_H\) and \(SST=RSS_H\), that is, the fitted values and residual sum of squares under the reduced model \(y_{i,j} = \mu + \epsilon_{i,j}\).
    3. Write out the vectors \(y-\hat y_H\), \(y-\hat y\) and \(\hat y- \hat y_H\). Show that \(SST\) may be orthogonally decomposed into \(SSB=|| \hat y- \hat y_H||^2\) and \(SSE\).
    4. What is an appropriate \(F\)-test for evaluating \(H:\theta_1=\cdots = \theta_p\)? Specifically, what are the degrees of freedom for the null distribution?
    5. Compute \(E[MSB]\) and compare it to the formula for the equal sample size case. Suppose we had some a priori opinions about the means of the groups. Can we use this information to allocate replications, in order to increase \(E[MSB]\) (thereby increasing the probability of rejecting the null hypothesis)?
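
As a numerical sanity check for part 3, the orthogonal decomposition can be verified on a small unbalanced data set. The sketch below uses made-up values for three groups of sizes 3, 5 and 2; the data and all variable names are illustrative assumptions, not part of the exercise.

```python
# Numerical check of SST = SSB + SSE for an unbalanced one-way layout.
# The group values are made-up illustrative data, not taken from the text.
groups = [
    [4.1, 5.0, 3.8],            # r_1 = 3
    [6.2, 5.9, 6.8, 7.1, 6.0],  # r_2 = 5
    [2.9, 3.3],                 # r_3 = 2
]

n = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n       # fitted value under H
group_means = [sum(g) / len(g) for g in groups]    # fitted values, full model

# The three vectors from part 3, flattened to length n.
res_full = [y - m for g, m in zip(groups, group_means) for y in g]           # y - yhat
res_null = [y - grand_mean for g in groups for y in g]                       # y - yhat_H
fit_diff = [m - grand_mean for g, m in zip(groups, group_means) for _ in g]  # yhat - yhat_H

sse = sum(e * e for e in res_full)
sst = sum(e * e for e in res_null)
ssb = sum(d * d for d in fit_diff)
cross = sum(e * d for e, d in zip(res_full, fit_diff))  # inner product of the two pieces

print(f"SST = {sst:.6f}, SSB + SSE = {ssb + sse:.6f}, cross term = {cross:.2e}")
```

The cross term vanishing (up to rounding) is what makes the decomposition orthogonal; the check does not replace the algebraic argument the exercise asks for.
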
  2. Analysis of covariance: The file “schoolData.rds” contains information on 751 high schools that took part in the 2002 National Educational Longitudinal Study (NELS), and includes the following variables:

    • BYG10EP: an ordered categorical factor relating to school size (enrollment);
    • BY10FLP: an ordered categorical factor relating to the number of students on a free-lunch program;
    • BYTXMSTD: a school-level average score on a standardized math test.

    Of interest is evaluating the effect of school size on math scores.

    1. Make boxplots of BYTXMSTD as a function of BYG10EP, and as a function of BY10FLP. Describe evidence of differences in math score means across the different levels of the factors.
    2. Construct the ANOVA table and \(p\)-value for the \(F\)-test that evaluates the evidence for differences in mean math scores by enrollment category.
    3. Repeat part 2 after projecting out potential effects of BY10FLP. Describe differences between the two ANOVAs.

    (Comment: The number of students from a given school varies across schools, and here we are ignoring this potential source of heteroscedasticity. One remedy would be to use weighted least-squares.)
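
The idea of "projecting out" one factor before testing another can be sketched with nested least-squares fits. The example below does not read schoolData.rds; it uses synthetic data in which a hypothetical `lunch` factor drives the score and a hypothetical `size` factor is merely correlated with `lunch`. The helper names (`residual_ss`, `indicators`) and all simulation settings are assumptions made for illustration.

```python
import random

def residual_ss(columns, y):
    """Return (RSS, rank) after projecting y onto span(columns) by Gram-Schmidt."""
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = sum(a * b for a, b in zip(v, q))
            v = [a - c * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip columns already in the span of earlier ones
            basis.append([a / norm for a in v])
    resid = list(y)
    for q in basis:
        c = sum(a * b for a, b in zip(resid, q))
        resid = [a - c * b for a, b in zip(resid, q)]
    return sum(e * e for e in resid), len(basis)

def indicators(factor, levels):
    """One 0/1 column per level of a categorical factor."""
    return [[1.0 if f == lev else 0.0 for f in factor] for lev in levels]

# Synthetic stand-in data (NOT schoolData.rds): lunch affects the score,
# size is correlated with lunch but has no effect of its own.
rng = random.Random(1)
n, levels = 300, [0, 1, 2]
lunch = [rng.choice(levels) for _ in range(n)]
size = [lv if rng.random() < 0.7 else rng.choice(levels) for lv in lunch]
score = [55.0 - 4.0 * lv + rng.gauss(0.0, 3.0) for lv in lunch]

ones = [[1.0] * n]
rss_0, rank_0 = residual_ss(ones, score)
rss_size, rank_size = residual_ss(ones + indicators(size, levels), score)
rss_lunch, rank_lunch = residual_ss(ones + indicators(lunch, levels), score)
rss_both, rank_both = residual_ss(
    ones + indicators(lunch, levels) + indicators(size, levels), score)

# F for size alone: the reduced model is the intercept-only fit.
f_marginal = ((rss_0 - rss_size) / (rank_size - rank_0)) / (rss_size / (n - rank_size))
# F for size after projecting out lunch: the reduced model already contains lunch.
f_adjusted = ((rss_lunch - rss_both) / (rank_both - rank_lunch)) / (rss_both / (n - rank_both))

print(f"F (size alone) = {f_marginal:.2f}, F (size | lunch) = {f_adjusted:.2f}")
```

In this synthetic setup the marginal statistic is inflated by the association between the two factors, which is the kind of difference part 3 asks you to describe for the real data.
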

  3. Expected \(F\) statistic: Suppose \(y_{i,j} = \theta_j + \epsilon_{i,j}\) for \(i=1,\ldots,r\), \(j=1,\ldots,p\) and the \(\epsilon_{i,j}\)’s are i.i.d. \(N(0,\sigma^2)\). Let \(F(y) = MSA/MSE\) be the \(F\)-statistic for testing \(\theta_1= \cdots = \theta_p\). Compute the expectation of \(F\) as a function of \(r\), \(p\), \(\sigma^2\) and the \(\tau_j\)’s, where \(\tau_j = \theta_j - \bar\theta\) and \(\bar\theta = \sum_j \theta_j/p\). Do this “by hand”, that is, using only properties of standard (central) \(\chi^2\) distributions, and not by looking up properties of the non-central \(\chi^2\) or \(F\) distributions.
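
A hand-derived formula for \(E[F]\) can be checked against a Monte Carlo average. The sketch below simulates the balanced model and averages the \(F\)-statistic; the function names, the example group means, and the simulation size are illustrative assumptions. The average is only meaningful when the expectation is finite, which requires \(p(r-1) > 2\).

```python
import random

def f_statistic(groups):
    """One-way ANOVA F-statistic for a balanced layout (equal group sizes)."""
    p = len(groups)
    r = len(groups[0])
    means = [sum(g) / r for g in groups]
    grand = sum(means) / p
    ssa = r * sum((m - grand) ** 2 for m in means)
    sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ssa / (p - 1)) / (sse / (p * (r - 1)))

def simulate_mean_f(theta, r, sigma=1.0, n_sims=20000, seed=0):
    """Monte Carlo estimate of E[F] for group means theta and r replicates per group."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        groups = [[rng.gauss(t, sigma) for _ in range(r)] for t in theta]
        total += f_statistic(groups)
    return total / n_sims

# Hypothetical group means; substitute any values you want to check.
theta = [0.0, 0.5, 1.0, 1.5]
print("estimated E[F]:", simulate_mean_f(theta, r=5))
```

Plugging the same \(r\), \(p\), \(\sigma^2\) and \(\tau_j\)’s into your formula should reproduce the simulated value up to Monte Carlo error.
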

  4. Non-central \(F\): Note that if \(X\sim N_k(\gamma,I)\) then \(||X||^2\) has a non-central \(\chi^2_k\) distribution with non-centrality parameter \(||\gamma||^2\).

    1. Let \(SSA\) be the across-group sum of squares from a one-factor ANOVA for the model \(y_{i,j}\sim N(\mu+\tau_j,\sigma^2)\). Derive the distribution of \(SSA\), and the distribution of \(MSA/MSE\).
    2. Suppose you are designing an experiment to assess evidence of differences between \(p\) treatment levels, and want to know how large the number of replications \(r\) per level should be. Under the assumption of normal, homoscedastic errors, determine the smallest value of \(r\) so that the probability of rejecting the null hypothesis of no treatment effect is at least 80% when the variance of the treatment effects \(\sum \tau_j^2/p\) is equal to the error variance \(\sigma^2\), for \(p=4,8\) and \(16\).
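
One way to carry out the search in part 2 numerically is sketched below, using only the Python standard library. It rests on several assumptions that the exercise does not state: a test level of \(\alpha = 0.05\), the constraint \(\sum_j \tau_j = 0\), and a non-centrality parameter of \(r\sum_j\tau_j^2/\sigma^2\) (in the convention of the note above), which you should confirm from part 1. All function names are hypothetical.

```python
from math import exp, lgamma, log, log1p

def _betacf(a, b, x, max_iter=500, eps=1e-13):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        for aa in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                   -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_cdf(f, d1, d2, ncp=0.0):
    """CDF of the (non-central) F as a Poisson(ncp/2) mixture of incomplete betas."""
    x = d1 * f / (d1 * f + d2)
    weight, total, cum, k = exp(-ncp / 2.0), 0.0, 0.0, 0
    while True:
        total += weight * betainc(d1 / 2.0 + k, d2 / 2.0, x)
        cum += weight
        k += 1
        if cum > 1.0 - 1e-12 or k > 2000:
            return total
        weight *= (ncp / 2.0) / k

def f_critical(d1, d2, alpha=0.05):
    """Upper-alpha quantile of the central F distribution, by bisection."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, d1, d2) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def anova_power(p, r, effect_ratio=1.0, alpha=0.05):
    """Rejection probability when (sum tau_j^2 / p) / sigma^2 = effect_ratio."""
    d1, d2 = p - 1, p * (r - 1)
    ncp = r * p * effect_ratio  # assumed non-centrality: r * sum(tau_j^2) / sigma^2
    return 1.0 - f_cdf(f_critical(d1, d2, alpha), d1, d2, ncp)

def smallest_r(p, target=0.8, effect_ratio=1.0, alpha=0.05):
    """Smallest r >= 2 whose power reaches the target (r = 1 leaves no error df)."""
    r = 2
    while anova_power(p, r, effect_ratio, alpha) < target:
        r += 1
    return r

for p in (4, 8, 16):
    r = smallest_r(p)
    print(f"p = {p}: r = {r}, power = {anova_power(p, r):.3f}")
```

The mixture representation avoids any dependence on a statistics library; where one is available, its non-central \(F\) routines would serve as an independent cross-check of these values.
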