STA102 Lab13

Lab 13 Objectives

In this lab, we will use S-Plus to perform a multiple regression analysis using a data set containing several variables on low birth weight infants. We've seen this data set in chapter 19 (notes and POB) where we examined the linear regression of infant head circumference on a few explanatory variables. Here, we do the same kind of analysis using, instead, the length of the infant as the response variable. We'll mimic the analysis done in section 19.3, Further Applications, starting on page 460 of POB. This analysis will cover similar issues that you will find in your Data Project, including the use of an indicator explanatory variable and an "interaction" term. Also, in your Data Project, you are asked to transform your response variable. We'll see that transforming variables in S-Plus is easy to do.

Multiple Linear Regression

First, import the low.birth.weight.infants data set. We'll work with infant length as the response, and try to model mean length as a (linear) function of a few explanatory variables (as in section 19.3, POB).

You might first construct a scatter plot matrix by following the instructions in the previous lab. I suggest you put the length variable first (in the upper left part of the matrix) as a way to reference it as the response variable (not necessary).

If you chose to construct the scatter plot matrix, you can see that length appears to be correlated with gestational age gestage, but the correlation with mother's age momage looks weak, at best (or look at your notes--the scatter plot is in chapter 19). We'll skip the simple linear regression of length on gestage and perform a multiple regression with gestage and momage as explanatory variables. We'll test whether momage explains any significant (linear) variability in length after accounting for the effect of gestage. That is, we'll test if the slope coefficient associated with momage is significantly different from zero (H0 on page 461). By the way, that's what most regression packages report for hypothesis tests: whether each coefficient is different from zero _after_ accounting for the variability explained by the remaining explanatory variables.

To the menu! Statistics>Regression>Linear.... In the Linear Regression dialog box, specify length as the response (dependent) variable and both gestage and momage as explanatory (independent) variables. Notice how S-Plus automatically writes the formula in the Formula: field; you could just type in the formula directly if you wanted to. This will come in handy if you have to specify a transformation or an interaction term (more later...). To plot residuals, select the Plot tab and check the Residual vs. Fit box (i.e., the e-i's vs. the y-hat-i's). To get the ANOVA part of the output as in Table 19.1, check the ANOVA Table box on the Results tab. (We won't discuss this.) Click OK.

You'll see results similar to Table 19.1, with some slight but (hopefully) obvious differences in notation. Also, you should see a residual plot. S-Plus automatically indicates the 3 most "outlying" residuals (whether they are really outliers or not). It appears that we do have a few potential outliers. Read the discussion in POB that goes along with this output. Next, let's see if we can "model out" the outliers by including a different explanatory variable. We'll exclude momage since the p-value on the printout indicates that its coefficient is not significantly different from zero.
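If you are curious what a least-squares fit like the one above computes, here is a minimal sketch in Python (not S-Plus), using only the standard library. It solves the normal equations for a response on two explanatory variables and returns the fitted values (the y-hat-i's) and residuals (the e-i's). The data values and the function names solve_linear_system and fit_linear_model are made up for illustration; they are not the POB data and not part of S-Plus.

```python
# Least-squares fit of a response on two explanatory variables.
# All numbers below are hypothetical, NOT the low birth weight data from POB.


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations act on both sides at once.
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Use the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [x - factor * y for x, y in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear_model(y, *explanatory):
    """Return (coefficients, fitted values, residuals) for y on the given columns."""
    # Design matrix: a leading column of ones gives the intercept.
    rows = [[1.0] + [float(col[i]) for col in explanatory] for i in range(len(y))]
    p = len(rows[0])
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    coef = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(coef, r)) for r in rows]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return coef, fitted, residuals


# Hypothetical data for ten infants.
gestage = [23, 25, 26, 27, 28, 29, 30, 31, 32, 33]
momage = [19, 31, 24, 28, 22, 35, 27, 20, 30, 26]
length = [31.0, 33.5, 34.0, 36.5, 36.0, 38.5, 39.0, 40.5, 41.0, 43.5]

coef, fitted, residuals = fit_linear_model(length, gestage, momage)
print("intercept, gestage slope, momage slope:", [round(b, 3) for b in coef])
# Fitted value and residual for each infant: the ingredients of the residual plot.
for f, e in zip(fitted, residuals):
    print(round(f, 2), round(e, 2))
```

Plotting the residuals against the fitted values from such a fit gives the same kind of picture as the Residual vs. Fit plot. The sketch stops at the coefficients; it does not compute the standard errors or p-values that S-Plus reports.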

Before doing another regression, make sure the tox column in the data sheet is numeric (0's and 1's), not a factor variable. This will ensure that the output is the same as Table 19.2 and is consistent with our (and POB's) discussion of indicator variables. (By default, S-Plus will not code it as 0 and 1 if it's a factor.)
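To make the 0/1 coding concrete, here is a small sketch in Python (not S-Plus) of turning a two-level factor into a numeric indicator. The labels "Yes" and "No" and the function name to_indicator are assumptions for illustration; the level names in your data sheet may differ.

```python
# Hypothetical factor labels; the actual level names in the data sheet may differ.
tox_factor = ["No", "Yes", "No", "No", "Yes"]


def to_indicator(levels, level_of_interest):
    """Return 1 where the label equals level_of_interest and 0 elsewhere."""
    return [1 if level == level_of_interest else 0 for level in levels]


# 1 = mother had toxemia, 0 = she did not.
tox = to_indicator(tox_factor, "Yes")
print(tox)  # [0, 1, 0, 0, 1]
```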

Back to the menu! Statistics>Regression>Linear.... You should be getting the hang of this by now, so I'll conserve on verbiage: regress length on gestage and tox (an indicator for the mother having toxemia (1) or not (0)).

Experiment with the Results, Plot, and Predict tabs. Notice you can save the residuals (the e-i's) and the fitted values (the y-hat-i's) from the Results tab. You can specify the name of a new or an existing data sheet to save these into. Sometimes it's convenient to save the results into the same data sheet along with the other (response and explanatory) variables. The Predict tab allows you to save predictions, confidence intervals (upper bounds and lower bounds), and the standard error of the fitted values (estimated mean values). If you do not specify new data (values of the explanatory variables at which to predict), S-Plus just predicts using the explanatory variable values in the existing data set. So, in this case, the predictions are the same as the fitted values that you can get from the Results tab. You can use the fitted values and confidence interval results to create plots (like in the previous lab). You have to be a bit creative to get "prediction" intervals, but it's not too hard. (Left to you.)

If you haven't already pressed OK at this point, do it now!

If you asked for a Residuals vs. Fit plot, you'll see that including tox took care of some outliers, although there still seems to be at least one bad point. Compare your S-Plus results with those in Table 19.2 and read the discussion for this regression. Next, we fit an interaction term.

Since both gestage and tox seem to be significant (at the 5% level; what about multiple tests?!), we keep both. The tox indicator variable just shifts the line up and down; that is, it allows for different intercepts depending on whether the mother had toxemia or not (see Figure 19.6, page 464, POB). The line may also have a different slope, depending on toxemia. We'll include the interaction of gestage:tox to check this. The discussion in POB explains this fairly well.

Now, regress length on gestage, tox and gestage:tox; you specify an interaction in S-Plus using the colon, ":", symbol. Rather than trying to figure out how to get this with the menus, just type your model into the Formula: field on the Model tab of the Linear Regression dialog box. Again, experiment with the other tabs. (You can press OK at will.) Compare your output to Table 19.3, Figure 19.7, and the accompanying discussion in POB. Still, we seem to have that pesky outlier! You can try doing a regression without it; this may change things a bit, may not. (I didn't check.)
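To see what the interaction term does, here is a short Python sketch (not S-Plus). Since tox is coded 0/1, the gestage:tox term is just the product of the two columns, and the fitted model amounts to two lines: one for tox = 0 and one for tox = 1, with possibly different intercepts and slopes. The coefficient values below are invented for illustration and are not those of Table 19.3; the function names are also hypothetical.

```python
def interaction_column(gestage, tox):
    """Element-wise product of gestage and the 0/1 tox indicator."""
    return [g * t for g, t in zip(gestage, tox)]


def group_lines(b0, b_gestage, b_tox, b_interaction):
    """Return {tox value: (intercept, slope)} for length as a function of gestage."""
    return {
        0: (b0, b_gestage),                          # mothers without toxemia
        1: (b0 + b_tox, b_gestage + b_interaction),  # mothers with toxemia
    }


# The interaction column is zero wherever tox is 0.
print(interaction_column([25, 28, 31], [0, 1, 1]))  # [0, 28, 31]

# Hypothetical coefficients, NOT the estimates in Table 19.3.
lines = group_lines(b0=6.0, b_gestage=1.0, b_tox=-2.0, b_interaction=0.5)
print(lines[0])  # (6.0, 1.0)
print(lines[1])  # (4.0, 1.5)
```

If the interaction coefficient is zero, the two lines are parallel and differ only in their intercepts, which is the model without the interaction term.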

You might think the residuals indicate some non-constant variance, but this may just be due to the fact that we do not have too many premature infants that are "too premature" (and too short). Similarly, we do not have too many long babies either. So, I don't think the residuals really indicate a departure from the constant variance assumption.

If we thought the residuals did show some non-constant variance, we might transform our response variable in some way. Also, if the residuals indicate some trend, we may need to transform our explanatory variables. In your Data Project, you are asked to perform a transformation on the response variable. This is easy to do in S-Plus. You merely type the function of the response variable into the Formula: field on the Model tab of the Linear Regression dialog box. You could also create a new column in your data sheet using the Data>Transform... menu; it's fairly self-explanatory. You would then use the transformed variable as you would any other (response or explanatory) variable.
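As an illustration of what a transformed column looks like, here is a tiny Python sketch (not S-Plus) that applies a natural log to a response and then undoes it. The log is only one possible choice, assumed here for the example (your Data Project may call for a different transformation), and the numbers are made up. In the Formula: field, a log transformation of the response would presumably be written along the lines of log(length) ~ gestage + tox.

```python
import math

# Hypothetical response values, NOT the POB data.
length = [31.0, 33.5, 36.0, 38.5, 41.0]

# New "column": the natural log of each response value.
log_length = [math.log(y) for y in length]

# Predictions made on the log scale are returned to the original scale with exp.
back_transformed = [math.exp(v) for v in log_length]

for y, ly in zip(length, log_length):
    print(y, round(ly, 4))
```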

We'll continue to discuss multiple regression and S-Plus in the next lecture.