STA114/MTH136: Statistics Lab 10, Apr 5 2001


In lecture we discussed using an F-Test to compare a large model with a smaller one:
Big:
  Yi = ß0 + ß1 Xi1 + ... + ßr Xir + ßr+1 Xi,r+1 + ... + ßp Xip + epsi
Small:
  Yi = ß0 + ß1 Xi1 + ... + ßr Xir + epsi
Note these models are "nested": the Small model is just a submodel of the Big one with ßr+1 = ... = ßp = 0. Because the Small model minimizes the residual sum of squares Q subject to the constraint that the last (p-r) ßj's are zero, its Q is necessarily at least as large as the Big model's.

The F statistic is just the ratio of two estimates of sigma2: the numerator is based on the difference of the Q's, the denominator on the squared residuals from the Big model. If H0: ßr+1 = ... = ßp = 0 is true, then both the numerator and the denominator ought to be pretty good estimates of the same quantity sigma2, so their ratio F ought to be about one; if H0 is false, then the Small model ought to have a much bigger residual sum of squares, so the ratio F ought to be quite big.

This lab explores that idea for the Mercedes data, using the S-Plus functions "lm" to fit the linear models and "anova" to compute the F statistic and the associated P-value. It assumes you've copied the file mercedes into a file where S-Plus can find it; do that first, using your browser's "save link as" function (right-click).
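The F-ratio construction above can also be sketched outside S-Plus. The following Python version (a minimal sketch on simulated data, not the mercedes data; it assumes numpy and scipy are available) fits nested models by least squares and computes the F statistic and p-value by hand:

```python
# Sketch of the nested-model F-test on simulated data (hypothetical, not mercedes).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 54
x1 = rng.normal(size=n)                # the predictor kept in both models
x2 = rng.normal(size=(n, 4))           # the (p-r) = 4 extra predictors in the Big model
y = 1.0 + 0.5 * x1 + x2 @ np.array([0.8, -0.6, 0.4, 0.3]) + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])       # intercept + x1
X_big   = np.column_stack([np.ones(n), x1, x2])   # adds the 4 extra columns

def rss(X, y):
    """Residual sum of squares after a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

Q1 = rss(X_small, y)                   # Small model: larger RSS
Q2 = rss(X_big, y)                     # Big   model: smaller RSS
df_num = X_big.shape[1] - X_small.shape[1]   # p - r = 4
df_den = n - X_big.shape[1]                  # residual df of the Big model = 48

F = ((Q1 - Q2) / df_num) / (Q2 / df_den)     # ratio of two sigma2 estimates
p_value = stats.f.sf(F, df_num, df_den)      # same as 1 - pf(F, 4, 48) in S-Plus
print(F, p_value)
```

Because the Small model is a constrained version of the Big one, Q1 >= Q2 always holds here, exactly as in the lab.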


  motif();
  help.start();
  cars  <- read.table("mercedes", head=T, skip=23);
  y     <- log(cars$Price);  # Log price
  mile  <- cars$Mile;        # Mileage
  age   <- 2*cars$Age;       # Age, in years
  modl  <- factor(cars$Mod); # Model (500, 450, 380, 280, 200) as FACTOR

  fit1  <- lm(y ~ mile);        # Small model
  fit2  <- lm(y ~ mile + modl); # Big   model

  Q2 <- sum(fit2$resid^2);      # Residual sum of squares, big model
  Q2;
  summary(fit2);
  sqrt(Q2/fit2$df.resid);       # Estimate of sigma... not sigma^2!!!
                                # Do you recognize this from summary(fit2)???

  Q1 <- sum(fit1$resid^2);      # Residual sum of squares, small model (BIGGER)
  Q1;
  summary(fit1);
  sqrt(Q1/52);                  # Note fit1$df.resid is 52 (4 more than fit2's 48)

  Fstat <- ((Q1-Q2)/4) / (Q2/48);  # F-ratio; avoid the name "F", which
                                   # S-Plus reserves for FALSE
  1-pf(Fstat, 4, 48);              # p-value

  anova(fit1, fit2);            # Recognize the p-value and F-statistic?

  fit3 <- lm(y ~ mile + age);  # Model adding just age to mileage... 
  anova(fit1, fit3);            # Does age improve fit?

  fit4 <- lm(y ~ mile + modl + mile*modl);  # Even bigger model, with
                                            # sep. slopes for diff. models
  anova(fit2, fit4);            # Do separate slopes seem justified?  Why?
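The "separate slopes" idea behind fit4 amounts to adding mileage-by-dummy interaction columns to the design matrix. A minimal Python sketch (hypothetical simulated data; 3 groups rather than the 5 Mercedes models) shows how the two design matrices differ:

```python
# Sketch: separate slopes per group via interaction columns (simulated data).
import numpy as np

rng = np.random.default_rng(1)
n = 60
mile = rng.uniform(0, 10, size=n)
group = rng.integers(0, 3, size=n)     # 3 hypothetical car models, coded 0, 1, 2

# Dummy-code the factor, dropping the first level (as lm does by default)
dummies = np.column_stack([(group == g).astype(float) for g in (1, 2)])

# Common-slope design: intercept, mile, group dummies (like fit2)
X_common = np.column_stack([np.ones(n), mile, dummies])
# Separate-slope design: add mile-by-dummy interaction columns (like fit4)
X_sep = np.column_stack([X_common, mile[:, None] * dummies])

print(X_common.shape, X_sep.shape)     # the interactions add one column per dummy
```

Comparing the two fits with an F-test, exactly as anova(fit2, fit4) does, then asks whether those extra slope columns reduce the residual sum of squares by more than chance would explain.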

  q();


Play with the mercedes models or with the pollution data, comparing models with different explanatory variables. Which models seem best?