Lab: Introduction to linear regression
This problem was taken from Regression Analysis: Theory, Methods, and Applications (1990), by A. Sen and M. Srivastava.
Description of the data
This exercise deals with the relationship between population density (pd) and vehicle thefts (vtt) per thousand residents in 18 Chicago districts (D). District 1 represents downtown Chicago. The data can be found here. Use the Netscape menus (File, then Save As) to save the data to a file on your desktop (say "vthefts.txt").
Starting S-Plus and getting the data
To start S-Plus, click Start, then Programs, then
Statistics & Mathematics, then S-PLUS 2000. To read
the data into S-Plus, choose from the S-Plus menus File, then
Import Data, then From file. Choose the file that
you have just saved. You should have a spreadsheet open with data in
two columns. Note: By default, S-Plus will name the dataset
according to the name of the file from which the data were imported.
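For readers who prefer a script to the menus, the import step can be sketched in Python. This is only an illustration under stated assumptions: the handout does not describe the layout of "vthefts.txt", so the sketch assumes whitespace-delimited numeric columns with pd and vtt as the last two, and one header line. The function name read_two_columns and the demo rows are hypothetical; the demo rows are made-up placeholders, not the Chicago data.

```python
import io

import numpy as np


def read_two_columns(source, skip_header=0):
    """Read whitespace-delimited numeric columns and return (pd, vtt).

    Assumption: the last two columns are pd and vtt, in that order.
    The real layout of vthefts.txt may differ.
    """
    table = np.loadtxt(source, skiprows=skip_header, ndmin=2)
    return table[:, -2], table[:, -1]


# Made-up placeholder rows, NOT the Chicago data.
demo = io.StringIO("pd vtt\n1000 5.0\n2000 7.5\n3000 9.0\n")
pd_demo, vtt_demo = read_two_columns(demo, skip_header=1)
print(pd_demo, vtt_demo)
```

With the saved file, the call would look like read_two_columns("vthefts.txt", skip_header=1), adjusting skip_header to match the actual file.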
Using S-Plus to fit regression models
1. Make a plot of vtt (vehicle thefts per thousand residents) versus pd (population density), with pd as the independent variable. To do this, choose Graph, 2D Plot, then Scatter plot. Although the points are scattered, how would you describe the relationship between the variables (largely linear, a little linear, not linear at all, etc.)?
2. Re-make the plot, this time including a least squares line. To do this, choose Graph, 2D Plot, then Fit - Linear Least Squares. The line will be drawn using least squares estimates, which minimize the sum of the squared vertical differences between the observations and the fitted line.
- Does there appear to be any relationship between population
density and rate of vehicle thefts (think about the slope of the
line)? How does the amount of scatter around the line affect your
estimation of the strength of the relationship?
- What rate of vehicle thefts would you predict (based on your least
squares line) for population densities around 19500? How reliable is
your estimate?
3. Looking at your plot, do you notice any outliers? How are they affecting your perception of the trend? Remove the outlier, and re-make the plot as above. How does this change the least squares line?
- Does the relationship appear to be more linear without the outlier?
- How has your prediction for population densities around 19500 (see question 2) changed? What does this say about the influence of outliers on least squares procedures?
- What would you say about the fit of the least squares line for values of the independent variable that are much higher or lower than the bulk of the data? (Consider the outlier. Also consider what the least squares line would predict if we "plug in" population density equal to 0.)
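The steps in questions 2 and 3 (fit a least squares line, predict at a given density, drop an outlier and refit) can also be sketched in code. This is a minimal illustration, not the S-Plus procedure: the function names least_squares_line and predict are hypothetical, the cutoff used to drop the far point is an arbitrary choice for the demo, and the numbers are made-up placeholders, not the Chicago data. Four of the demo points lie exactly on a line and the fifth sits far to the right, to mimic a single influential observation.

```python
import numpy as np


def least_squares_line(x, y):
    """Return (intercept, slope) minimizing the sum of squared vertical differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


def predict(intercept, slope, x_new):
    """Evaluate the fitted line at x_new."""
    return intercept + slope * x_new


# Made-up placeholder values, NOT the Chicago data.
x = np.array([1000.0, 2000.0, 3000.0, 4000.0, 20000.0])
y = np.array([3.0, 4.0, 5.0, 6.0, 60.0])

# Fit using every point, including the far one.
b0_all, b1_all = least_squares_line(x, y)

# Refit after dropping the far point (arbitrary demo cutoff at 10000).
keep = x < 10000
b0_trim, b1_trim = least_squares_line(x[keep], y[keep])

for label, b0, b1 in [("all points", b0_all, b1_all), ("outlier removed", b0_trim, b1_trim)]:
    print(label, "slope:", b1, "prediction at 19500:", predict(b0, b1, 19500.0),
          "prediction at 0:", predict(b0, b1, 0.0))
```

In this toy data the single far point roughly triples the slope and pushes the intercept below zero, which is the kind of effect the questions above ask you to look for: one observation can dominate the fit, and predictions far from the bulk of the data (including at density 0) should be read with caution.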