STA114/MTH136: Statistics Lab 10, Apr 5 2001
In lecture we discussed using an F-Test to compare a large model
with a smaller one:
- Big:   Yi = ß0 + ß1 Xi1 + ... + ßr Xir + ß(r+1) Xi(r+1) + ... + ßp Xip + epsi
- Small: Yi = ß0 + ß1 Xi1 + ... + ßr Xir + epsi
Note these models are "nested": the Small model is just a submodel of the Big one
with ß(r+1) = ... = ßp = 0, and so the residual sum of squares Q will be
larger for the Small model (its Q is minimized subject to the constraint
that the last (p-r) ßj's are zero). The F statistic is just the ratio of
an estimate of sigma2 based on the difference of the Q's to
an estimate of sigma2 based on the squared residuals from
the Big model:

    F = [ (Q_Small - Q_Big) / (p-r) ]  /  [ Q_Big / (n-p-1) ]

If H0: ß(r+1) = ... = ßp = 0 is true, then both the numerator and the
denominator ought to be pretty good estimates of the same quantity
sigma2, so their ratio F ought to be about one; if
H0 is false, then the Small model ought to have a much
bigger residual sum of squares, so the ratio F ought to be quite big.
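To make the ratio concrete, here is a quick arithmetic sketch in Python (not S-Plus), using made-up residual sums of squares Q1 and Q2 — the numbers are hypothetical, not from the Mercedes data:

```python
# Hypothetical numbers for illustration only -- not from the Mercedes file.
Q1 = 12.0    # residual sum of squares, Small model
Q2 = 4.0     # residual sum of squares, Big model
extra = 4    # number of extra coefficients being tested (p - r)
df2 = 48     # residual degrees of freedom of the Big model (n - p - 1)

# F = (estimate of sigma^2 from the drop in Q) / (estimate from the Big model)
F = ((Q1 - Q2) / extra) / (Q2 / df2)
print(F)     # (8/4) / (4/48) = 2 / (1/12) = 24.0
```

An F value this far above one would be strong evidence against H0; with these df you would compare it to an F(4, 48) distribution.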
This lab explores that for the Mercedes data, using the S-Plus functions
"lm" to fit the linear models and "anova" to compute the F statistic and
the associated P-value. It assumes you've copied the file
mercedes
into a directory where S-Plus can find it; do that first, using your
browser's "Save Link As" function (right-click).
motif();
help.start();
cars <- read.table("mercedes", head=T, skip=23);
y <- log(cars$Price); # Log price
mile <- cars$Mile; # Mileage
age <- 2*cars$Age; # Age, in years
modl <- factor(cars$Mod); # Model (500, 450, 380, 280, 200) as FACTOR
fit1 <- lm(y ~ mile); # Small model
fit2 <- lm(y ~ mile + modl); # Big model
Q2 <- sum(fit2$resid^2); # Residual sum of squares, big model
Q2;
summary(fit2);
sqrt(Q2/fit2$df.resid); # Estimate of sigma... not sigma^2!!!
# Do you recognize this from summary(fit2)???
Q1 <- sum(fit1$resid^2); # Residual sum of squares, small model (BIGGER)
Q1;
summary(fit1);
sqrt(Q1/52); # Note fit1$df.resid is 52 (4 more than fit2's 48)
F <- ((Q1-Q2)/4) / (Q2/48); # F-Ratio: 4 extra coefficients, 48 residual df
1-pf(F, 4, 48); # p-value
anova(fit1, fit2); # Recognize the p-value and F-statistic?
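If you want to see the arithmetic that "anova" is doing, here is a self-contained Python sketch (toy hypothetical data, not the Mercedes file) that fits the simplest possible nested pair — an intercept-only model inside a simple linear regression — by closed-form least squares and forms the F ratio by hand:

```python
# Toy data for illustration only -- not the Mercedes data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Small model: y_i = b0 + eps_i; least squares gives b0 = ybar
Q1 = sum((yi - ybar)**2 for yi in y)

# Big model: y_i = b0 + b1*x_i + eps_i; standard slope/intercept formulas
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar)**2 for xi in x))
b0 = ybar - b1 * xbar
Q2 = sum((yi - (b0 + b1 * xi))**2 for xi, yi in zip(x, y))

# One extra coefficient (p - r = 1), n - 2 residual df in the big model
F = ((Q1 - Q2) / 1) / (Q2 / (n - 2))
print(Q1 > Q2, F > 1)   # Q always grows in the smaller model; here F is huge
```

With data this strongly linear, the drop from Q1 to Q2 is large relative to Q2, so F comes out far above one — the same pattern you should see in anova(fit1, fit2) if the extra terms matter.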
fit3 <- lm(y ~ mile + age); # Model adding just age to mileage...
anova(fit1, fit3); # Does age improve fit?
fit4 <- lm(y ~ mile + modl + mile*modl); # Even bigger model, with
# sep. slopes for diff. models
anova(fit2, fit4); # Do separate slopes seem justified? Why?
q();
Play with the mercedes models or with the
pollution data,
comparing models with different explanatory variables. Which models
seem best?