STA114/MTH136: Statistics Lab 10, Apr 5 2001


In lecture we discussed using an F-Test to compare a large model with a smaller one:
Big:
  Yi = ß0 + ß1 Xi1 + ... + ßr Xir + ßr+1 Xi,r+1 + ... + ßp Xip + epsi
Small:
  Yi = ß0 + ß1 Xi1 + ... + ßr Xir + epsi
Note these models are "nested": the Small model is just a submodel of the Big one with ßr+1 = ... = ßp = 0. Because the Small model minimizes the residual sum of squares Q subject to the constraint that the last (p-r) ßj's are zero, its Q is necessarily at least as large as the Big model's.

The F statistic is just the ratio of two estimates of sigma2: the numerator is based on the difference of the Q's, the denominator on the squared residuals from the Big model. If H0: ßr+1 = ... = ßp = 0 is true, then both the numerator and the denominator ought to be pretty good estimates of the same quantity sigma2, so their ratio F ought to be about one; if H0 is false, then the Small model ought to have a much bigger residual sum of squares, so the ratio F ought to be quite big.

This lab explores that idea for the Mercedes data, using the S-Plus functions "lm" to fit the linear models and "anova" to compute the F statistic and the associated P-value. It assumes you've copied the file mercedes into a file where S-Plus can find it; do that first, using your browser's "save link as" function (right-click).
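The F-ratio construction above can also be sketched outside S-Plus. The following Python version (a minimal sketch on simulated data, not the mercedes data; it assumes numpy and scipy are available) fits nested models by least squares and computes the F statistic and p-value by hand:

```python
# Sketch of the nested-model F-test on simulated data (hypothetical, not mercedes).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 54
x1 = rng.normal(size=n)                # the predictor kept in both models
x2 = rng.normal(size=(n, 4))           # the (p-r) = 4 extra predictors in the Big model
y = 1.0 + 0.5 * x1 + x2 @ np.array([0.8, -0.6, 0.4, 0.3]) + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])       # intercept + x1
X_big   = np.column_stack([np.ones(n), x1, x2])   # adds the 4 extra columns

def rss(X, y):
    """Residual sum of squares after a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

Q1 = rss(X_small, y)                   # Small model: larger RSS
Q2 = rss(X_big, y)                     # Big   model: smaller RSS
df_num = X_big.shape[1] - X_small.shape[1]   # p - r = 4
df_den = n - X_big.shape[1]                  # residual df of the Big model = 48

F = ((Q1 - Q2) / df_num) / (Q2 / df_den)     # ratio of two sigma2 estimates
p_value = stats.f.sf(F, df_num, df_den)      # same as 1 - pf(F, 4, 48) in S-Plus
print(F, p_value)
```

Because the Small model is a constrained version of the Big one, Q1 >= Q2 always holds here, exactly as in the lab.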


  motif();
  help.start();
  cars  <- read.table("mercedes", head=T, skip=23);
  y     <- log(cars$Price);  # Log price
  mile  <- cars$Mile;        # Mileage
  age   <- 2*cars$Age;       # Age, in years
  modl  <- factor(cars$Mod); # Model (500, 450, 380, 280, 200) as FACTOR

  fit1  <- lm(y ~ mile);        # Small model
  fit2  <- lm(y ~ mile + modl); # Big   model

  Q2 <- sum(fit2$resid^2);      # Residual sum of squares, big model
  Q2;
  summary(fit2);
  sqrt(Q2/fit2$df.resid);       # Estimate of sigma... not sigma^2!!!
                                # Do you recognize this from summary(fit2)???

  Q1 <- sum(fit1$resid^2);      # Residual sum of squares, small model (BIGGER)
  Q1;
  summary(fit1);
  sqrt(Q1/52);                  # Note fit1$df.resid is 52 (4 more than fit2's 48)

  Fstat <- ((Q1-Q2)/4) / (Q2/48);  # F-ratio; avoid the name "F", which
                                   # S-Plus reserves for FALSE
  1-pf(Fstat, 4, 48);              # p-value

  anova(fit1, fit2);            # Recognize the p-value and F-statistic?

  fit3 <- lm(y ~ mile + age);  # Model adding just age to mileage... 
  anova(fit1, fit3);            # Does age improve fit?

  fit4 <- lm(y ~ mile + modl + mile*modl);  # Even bigger model, with
                                            # sep. slopes for diff. models
  anova(fit2, fit4);            # Do separate slopes seem justified?  Why?
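The "separate slopes" idea behind fit4 amounts to adding mileage-by-dummy interaction columns to the design matrix. A minimal Python sketch (hypothetical simulated data; 3 groups rather than the 5 Mercedes models) shows how the two design matrices differ:

```python
# Sketch: separate slopes per group via interaction columns (simulated data).
import numpy as np

rng = np.random.default_rng(1)
n = 60
mile = rng.uniform(0, 10, size=n)
group = rng.integers(0, 3, size=n)     # 3 hypothetical car models, coded 0, 1, 2

# Dummy-code the factor, dropping the first level (as lm does by default)
dummies = np.column_stack([(group == g).astype(float) for g in (1, 2)])

# Common-slope design: intercept, mile, group dummies (like fit2)
X_common = np.column_stack([np.ones(n), mile, dummies])
# Separate-slope design: add mile-by-dummy interaction columns (like fit4)
X_sep = np.column_stack([X_common, mile[:, None] * dummies])

print(X_common.shape, X_sep.shape)     # the interactions add one column per dummy
```

Comparing the two fits with an F-test, exactly as anova(fit2, fit4) does, then asks whether those extra slope columns reduce the residual sum of squares by more than chance would explain.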

  q();


Play with the mercedes models or with the pollution data, comparing models with different explanatory variables. Which models seem best?