Lab: Introduction to linear regression

This problem was taken from Regression Analysis: Theory, Methods, and Applications (1990), by A. Sen and M. Srivastava.

Description of the data

This exercise examines the relationship between population density (pd) and vehicle thefts per thousand residents (vtt) in 18 Chicago districts (D). District 1 represents downtown Chicago. The data can be found here. Use the Netscape menus (File, then Save As) to save the data to a file on your desktop (say, "vthefts.txt").

Starting S-Plus and getting the data

To start S-Plus, click Start, then Programs, then Statistics & Mathematics, then S-PLUS 2000. To read the data into S-Plus, choose from the S-Plus menus File, then Import Data, then From file. Choose the file that you have just saved. You should have a spreadsheet open with data in two columns. Note: By default, S-Plus will name the dataset according to the name of the file from which the data were imported.
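For readers who want to inspect the saved file outside S-Plus, the import step can be sketched in Python. This is only an illustration, not part of the S-Plus procedure: it assumes the file is whitespace-delimited with a first row of column names (such as pd and vtt), which the lab does not state, and the function name read_columns is hypothetical.

```python
from pathlib import Path


def read_columns(path):
    """Read a whitespace-delimited text file into a dict of columns.

    Assumption (not confirmed by the lab): the first non-blank row holds
    the column names and every later row holds numeric values.
    """
    rows = [line.split() for line in Path(path).read_text().splitlines()
            if line.strip()]
    header, body = rows[0], rows[1:]
    # Build one list of floats per named column.
    return {name: [float(row[i]) for row in body]
            for i, name in enumerate(header)}


# Hypothetical usage, once the file has been saved as described above:
# data = read_columns("vthefts.txt")
# print(sorted(data))
```

If the actual file uses a different layout (for example, commas or no header row), the splitting and header handling would need to change accordingly.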

Using S-Plus to fit regression models

1. Make a plot of vtt (vehicle thefts per thousand residents) versus pd (population density), with pd as the independent variable. To do this, choose Graph, 2D Plot, then Scatter plot. Although the points are scattered, how would you describe the relationship between the variables (largely linear, a little linear, not linear at all, etc.)?

2. Re-make the plot, this time including a least squares line. To do this, choose Graph, 2D Plot, then Fit - Linear Least Squares. The line will be drawn using the least squares estimates, which minimize the sum of the squared vertical differences between the observations and the fitted line.

3. Looking at your plot, do you notice any outliers? How are they affecting your perception of the trend? Remove the outlier, and re-make the plot as above. How does this change the least squares line?
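The least squares fit in steps 2 and 3 can also be sketched numerically. The Python snippet below is a minimal illustration, not the S-Plus procedure: it uses the standard closed-form estimates for a line with one predictor, and the x and y values are made-up toy numbers, not the Chicago data. The function name least_squares is hypothetical.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the line minimizing the sum of
    squared vertical differences between the points and the line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up toy values (NOT the vehicle theft data); the last point is
# placed far from the trend of the others to act as an outlier.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [2.0, 4.0, 6.0, 8.0, 0.0]

full_fit = least_squares(x, y)                 # all points
trimmed_fit = least_squares(x[:-1], y[:-1])    # outlier removed

print("all points:      intercept, slope =", full_fit)
print("outlier removed: intercept, slope =", trimmed_fit)
```

In this toy example a single distant point is enough to turn the fitted slope from positive to negative, which is the kind of change to look for when comparing the two plots in step 3.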