In this lab, we will use MATLAB to do multiple linear regression and residual analysis.
Items that we will investigate using MATLAB include fitting the least squares regression equation, testing hypotheses about the individual coefficients, the overall F-test, the multiple coefficient of determination (R-squared), and residual analysis.
Use the data from EXAMPLE 12.13 on page 677 in the text. The dependent variable is the Carbon Monoxide (CM) content of different cigarette brands. We want to investigate the relationship between CM and three predictor (independent) variables: Tar (T), Nicotine (N), and Weight (W). (See page 677 for more details.)
Read in the data and call the first column T, the second N, third W, and the fourth column CM.
smoke = load('lab9.dat');
T = smoke(:,1);
N = smoke(:,2);
W = smoke(:,3);
X = [ones(length(smoke),1) T N W];   % X needs to be a matrix with a leading column of ones.
CM = smoke(:,4);
Y = CM;
Let's investigate the items above:
% The predictor variables are in the matrix X along with a column of
% intercepts (1's), and the response variable (Y) is CM.
% Now find the least squares regression line.
[b,bint,r,rint,stats] = regress(Y,X,0.05);
b   % = estimates for the regression coefficients
The hypothesis Ho: bi = 0 vs. Ha: bi != 0 can be tested in the same way that we tested b1 = 0 last week (see Lab 8). What are we testing here?
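If you have forgotten the details from Lab 8, here is one way to compute the t-statistics by hand. This is a sketch added for reference (it is not the Lab 8 code), and it assumes the X, Y, b, and r created above are still in your workspace.
n   = length(Y);                         % number of observations
k   = size(X,2) - 1;                     % number of predictors (3 here)
MSE = sum(r.^2)/(n - k - 1);             % mean squared error from the residuals
se  = sqrt(MSE * diag(inv(X'*X)));       % standard errors of b0, b1, b2, b3
t   = b ./ se;                           % t-statistics for Ho: bi = 0
p   = 2*(1 - tcdf(abs(t), n - k - 1));   % two-sided p-values
[b se t p]                               % estimate, std. error, t, and p for each coefficient
Note that bint from the regress call already gives 95% confidence intervals for the coefficients; an interval that contains zero corresponds to failing to reject Ho: bi = 0 at the 5% level.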
Now that there is more than one predictor variable, we can test the hypothesis that at least one of the coefficients b1, b2, b3 is not zero. That is, we test Ho: b1 = b2 = b3 = 0 vs. Ha: at least one is not zero. To perform this hypothesis test we need to look at the accompanying F-statistic and p-value recorded in stats:
stats % = R-squared, F-statistic, p-value
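If you are curious where the F-statistic comes from, the sketch below (an addition, not part of the handout) computes it directly from the sums of squares; it should agree with the value reported in stats.
n   = length(Y);
k   = size(X,2) - 1;                 % number of predictors
SSE = sum(r.^2);                     % error (residual) sum of squares
SST = sum((Y - mean(Y)).^2);         % total sum of squares
SSR = SST - SSE;                     % regression sum of squares
F   = (SSR/k) / (SSE/(n - k - 1))    % F-statistic for Ho: b1 = b2 = b3 = 0
pF  = 1 - fcdf(F, k, n - k - 1)      % its p-value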
The multiple coefficient of determination is the R-squared in stats. What does it signify?
stats % = R-squared, F-statistic, p-value
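As a check on your answer, the sketch below (added here, not in the original handout) computes R-squared directly as the proportion of the total variation in CM explained by the regression:
SSE = sum(r.^2);                 % variation left unexplained by the model
SST = sum((Y - mean(Y)).^2);     % total variation in CM about its mean
Rsq = 1 - SSE/SST                % should match the first entry of stats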
Examination of the residuals can tell us whether the assumptions we make to do regression are reasonably met. Recall the assumptions: the errors are independent, have mean zero and constant variance, and are normally distributed, and the mean of the response is a linear function of the predictors.
We can use the residuals to check whether these assumptions are met. First, plot the residuals against the fitted values. What types of patterns should you expect to see if the assumptions are met? What can you conclude about the appropriateness of using multiple regression on this dataset? We can also look at a histogram of the residuals to check whether they are approximately normal (see Lab 2 if you don't remember how).
plot(X*b,r,'.')   % add an appropriate title, axis labels, etc.
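For example (a sketch, not part of the handout), you could label the plot and then look at a histogram of the residuals:
xlabel('Fitted values'), ylabel('Residuals'), title('Residuals vs. Fitted Values')
figure      % open a new window so the histogram does not overwrite the plot
hist(r)     % roughly bell-shaped and centered at zero if the normality assumption holds
title('Histogram of Residuals')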
You can get a pretty good idea of whether a variable should be included in the model by plotting the residuals against each predictor variable.
plot(T,r,'.')   % or plot(N,r,'.'), or plot(W,r,'.')
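If you would rather see all three at once, one possibility (a sketch using subplot, not required by the handout) is:
subplot(3,1,1), plot(T,r,'.'), xlabel('Tar'),      ylabel('Residuals')
subplot(3,1,2), plot(N,r,'.'), xlabel('Nicotine'), ylabel('Residuals')
subplot(3,1,3), plot(W,r,'.'), xlabel('Weight'),   ylabel('Residuals')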
Do any of these independent variables look like they should not be included as linear terms? If you don't know what to look for, consult Chapter 12 in the book.