STA103: Linear regression

Lab: Introduction to linear regression

This problem was taken from Regression Analysis: Theory, Methods, and Applications (1990), by A. Sen and M. Srivastava.

Description of the data

This exercise deals with the relationship between population density (pd) and vehicle thefts per thousand residents (vtt) in 18 Chicago districts (D). District 1 represents downtown Chicago. The data can be found here. Use the Netscape menus (File, then Save As) to save the data to a file on your desktop (say, "vthefts.txt").

Starting S-Plus and getting the data

To start S-Plus, click Start, then Programs, then Statistics & Mathematics, then S-PLUS 2000. To read the data into S-Plus, choose from the S-Plus menus File, then Import Data, then From file. Choose the file that you have just saved. You should have a spreadsheet open with data in two columns. Note: By default, S-Plus will name the dataset according to the name of the file from which the data were imported.

Using S-Plus to fit regression models

1. Make a plot of vtt (vehicle thefts per thousand residents) versus pd (population density), with pd as the independent variable. To do this, choose Graph, 2D Plot, then Scatter plot. Although the points are scattered, how would you describe the relationship between the variables (largely linear, a little linear, not linear at all, etc.)?

2. Re-make the plot, this time including a least squares line. To do this, choose Graph, 2D Plot, then Fit - Linear Least Squares. The line will be drawn using the least squares estimates of the intercept and slope, which minimize the sum of the squared vertical differences between the observations and the fitted line.

Does there appear to be any relationship between population density and rate of vehicle thefts (think about the slope of the line)? How does the amount of scatter around the line affect your estimation of the strength of the relationship?
What rate of vehicle thefts would you predict (based on your least squares line) for population densities around 19500? How reliable is your estimate?
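
For readers who want to see the arithmetic behind the fitted line, the following is a minimal sketch in Python rather than S-Plus. It uses the standard closed-form least squares estimates (slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)). The numbers in pd_values and vtt_values are made up for illustration and are NOT the Chicago data; the function names are likewise hypothetical.

```python
# Illustrative values only: these are NOT the Chicago data from the lab.
pd_values = [5000.0, 10000.0, 15000.0, 20000.0, 25000.0]   # population density
vtt_values = [12.0, 11.0, 9.0, 8.0, 5.0]                   # thefts per thousand


def least_squares_line(x, y):
    """Return (intercept, slope) of the simple least squares line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sxy: sum of cross-products of deviations; Sxx: sum of squared x deviations
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical differences between the points and a line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def predict(intercept, slope, x_new):
    """Value of the line at x_new."""
    return intercept + slope * x_new


intercept, slope = least_squares_line(pd_values, vtt_values)
prediction = predict(intercept, slope, 19500.0)
best_ssr = sum_squared_residuals(pd_values, vtt_values, intercept, slope)

print("intercept:", intercept)
print("slope:", slope)
print("predicted vtt at pd = 19500:", prediction)
print("sum of squared residuals:", best_ssr)
```

Any other line, for example one with a slightly different slope or intercept, gives a larger sum of squared residuals on the same points, which is what "least squares" means.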

3. Looking at your plot, do you notice any outliers? How are they affecting your perception of the trend? Remove the outlier, and re-make the plot as above. How does this change the least squares line?

Does the relationship appear to be more linear without the outlier?
How has your prediction (from question 2) changed? What does this say about the influence of outliers on least squares procedures?
What would you say about the fit of the least squares line for values of the independent variable that are well above or below the bulk of the data? (Consider the outlier. Also consider what the least squares line would predict if we "plug in" a population density equal to 0.)
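
The effect of a single outlier on the fitted line, and on predictions far from the bulk of the data, can be sketched in Python as follows. This is only an illustration under stated assumptions: the values below are invented (they are NOT the Chicago data), the first point is deliberately placed far from the rest, and the helper name fit_line is hypothetical.

```python
# Invented values for illustration only: NOT the Chicago data.
# The first point is placed far from the others to act as an outlier.
x_all = [3000.0, 18000.0, 20000.0, 22000.0, 24000.0, 26000.0]
y_all = [60.0, 14.0, 12.0, 13.0, 10.0, 11.0]


def fit_line(x, y):
    """Return (intercept, slope) of the simple least squares line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Fit once with every point, once with the outlying first point removed.
intercept_all, slope_all = fit_line(x_all, y_all)
intercept_trim, slope_trim = fit_line(x_all[1:], y_all[1:])

for label, b0, b1 in [("with outlier", intercept_all, slope_all),
                      ("without outlier", intercept_trim, slope_trim)]:
    print(label)
    print("  slope:", b1)
    print("  prediction at pd = 19500:", b0 + b1 * 19500.0)
    # Plugging in 0 just returns the intercept; 0 lies far outside the data.
    print("  prediction at pd = 0:", b0 + b1 * 0.0)
```

In this invented example the two fits roughly agree near the middle of the data but differ sharply at pd = 0, where neither fit has nearby observations to support it.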