Statistics 110B Project -- Spring 1999

Data and Problem

We will analyze the fuel economy of 1970 to 1982 model year cars. The purpose of our analysis will be to explore the relationship between fuel economy (as measured in miles per gallon, MPG) and factors that may influence it. We seek to answer, at least in part, the following questions: 'what variables are related to fuel economy?', 'what is a good statistical model for fuel economy?', and 'what are the limitations of our analysis?' A brief description of the variables that constitute the data set follows.
Variable   Type         Definition
MPG        continuous   fuel economy in miles per gallon
cylindrs   ordinal      number of engine cylinders
displace   continuous   total volume of the engine cylinders (its displacement) in cubic inches
horsepwr   continuous   engine horsepower
weight     continuous   the car's weight in pounds
accel      continuous   time required to accelerate from 0 to 60 mph (in seconds)
modlyear   ordinal      year of manufacture
make       nominal      manufacturer of car
origin     nominal      region of origin (US, Europe or Japan)

This data set was used by the Committee on Statistical Graphics of the American Statistical Association (ASA) in its Second (1983) Exposition of Statistical Graphics Technology, a forum for vendors of statistical graphics software to demonstrate their packages. The data are distributed by StatLib, a service of the Department of Statistics at Carnegie Mellon University.

Write-Up

You may work alone, but I suggest you work in a group of two or three (max) class members. The remaining three lab sessions will be given over to the project, so you might want to work with others from your section. You will probably need to continue your analysis after hours, and can do so at any campus Unix cluster. Group projects are expected to reflect in quality the effort of two or three individuals.

Your report should be about 5 or 6 typed pages in length (excluding figures and appendix material) and include:

  1. a title page listing project group members,
  2. a 2 (max 3) page stand-alone, non-technical "executive" summary of findings,
  3. a 1 page (max) summary of your exploratory analysis (see questions 1 and 2 below),
  4. a 1 page (max) section describing steps you took to reach your final regression model (see question 3 below),
  5. a 1 paragraph (max) discussion section in which you outline the limitations and relevance of your analysis,
  6. and an appendix of plots (at most 4) and relevant computer output (at most 1 regression/other analysis).

You will be graded on the quality (not quantity!) of your write-up (clear, concise, spell-checked writing; creativity and care in presentation) as well as on your technical understanding and correct application of statistical methods. The grade breakdown is as follows: "executive summary" (8 points), exploratory analysis (4 points), regression analysis (4 points), presentation (2 points), conclusions/criticisms (2 points)--total = 20 points.

Presume your reader is comfortable with basic statistical methods but is not an expert in them, and assume that your reader is not familiar with the data. When you use a statistical method like regression or ANOVA, explain why you chose it and carefully interpret your results. Present only important summaries, plots and their interpretations; don't burden your reader with unnecessary facts and analysis.

Your project report is due in class on April 27th. Late reports will not be accepted.

Download the SAS/Insight program

Click here and a program with the project data set will appear in your browser window. Click on "File>Save As..." in Netscape and choose "Format for Saved Document: Text" then click "OK". The program is now saved in your account (in your home directory, by default). The file's name is "project.sas". Return to this page by choosing "GO>Back" from the Netscape menu bar.

To print plots and figures created in SAS you first need to set your printer environment variable so that printed output will be directed to the printer of your choice, most likely the printer that serves the cluster in which you are working. To print in the Teer lab, type "setenv PRINTER teerlp1" or "setenv PRINTER teerlp2" before typing "sas project &". If you are working in another cluster, replace teerlp1 with the appropriate printer name; a list of printer names and info on printing files can be found here. More information on printing, especially output containing text, can be found here.

To get started type "sas project &" in one of the terminals open on your screen.

"Questions"/Directions for Analysis

1) Familiarize yourself with the data. Identify aspects of the data that will be important in your subsequent analysis. You might, for example, answer the questions: "what differences are there between U.S., European and Japanese autos?", "how does fuel economy change over time?" and "are these changes the same for the 3 regions of origin?", etc.

2) Explore relationships between MPG and other variables in the data set. Look at scatter plots of MPG by continuous covariates, and box plots or histograms of fuel economy grouped by categorical covariates. What do you see? Do any of the covariates show promise as predictors of MPG?
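
For those who like to see the arithmetic behind the plots, here is a minimal sketch in Python (not SAS/Insight) of the two summaries this question points at: a correlation between MPG and a continuous covariate, and a per-group median for a categorical covariate. The six rows are made up for illustration and are NOT taken from the project data; the helper names pearson and median_by_group are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median

# Made-up illustrative rows (NOT the project data): (mpg, weight_lb, origin)
rows = [
    (18.0, 3500, "US"),
    (15.0, 3700, "US"),
    (26.0, 2200, "Europe"),
    (24.0, 2400, "Europe"),
    (31.0, 2000, "Japan"),
    (27.0, 2100, "Japan"),
]


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def median_by_group(records):
    """Median MPG for each region of origin."""
    groups = defaultdict(list)
    for mpg, _weight, origin in records:
        groups[origin].append(mpg)
    return {origin: median(values) for origin, values in groups.items()}


# Numeric counterpart of a scatter plot of MPG against weight.
r = pearson([weight for _, weight, _ in rows], [mpg for mpg, _, _ in rows])

# Numeric counterpart of box plots of MPG grouped by origin.
medians = median_by_group(rows)

print(f"correlation(MPG, weight) = {r:.2f}")
for origin, value in medians.items():
    print(f"median MPG, {origin}: {value}")
```

A correlation only measures straight-line association, so it supplements the scatter plot rather than replacing it; a curved relationship can be strong and still give a modest correlation.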

3) Use linear regression to explore the relationship between fuel economy and factors that may influence it. While data for most of the technological and physical variables that determine fuel economy are not available to us in this data set, we can use the variables we have to correlate engine size (cylindrs, displace, horsepwr), size of car (weight), engine performance (accel) and year and place of manufacture (modlyear, origin) with fuel economy.

Start by fitting a linear regression model for MPG including (a) promising covariates identified in "Question 2" and (b) those that you have strong prior beliefs should be included in the model as predictor variables. Plot residuals against fitted values for this regression; verify that this plot looks like the right panel of Figure 14-10 on page 463 of the book. Create a new response variable logMPG = log(MPG). Repeat the previous regression using logMPG as the response variable. Look at plots of residuals against all predictor variables to verify that each residual relationship is flat (for example, the relationship between weight and fuel economy may be different for light, moderately heavy and heavy autos; look for patterns in the residual plots like those pictured in Figure 14-7(b) on page 462 of the book). Enter appropriate quadratic terms into the model, if necessary. Search for a parsimonious model by removing, one at a time, variables that (a) have little explanatory ability and (b) you had only weak prior reason to include in the model.
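
The reason for refitting with the log of MPG can be seen in a small sketch, written in Python rather than SAS/Insight. It fits a one-predictor least-squares line to MPG and then to log(MPG) and compares the residuals. The data are synthetic: they are generated from an assumed exponential curve (the constants 60.0 and -0.0004 are arbitrary choices, not estimates from the project data), so the log fit is exact by construction. The function name fit_line is hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x.

    Returns (a, b, fitted, residuals).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    return a, b, fitted, residuals


# Synthetic data from an ASSUMED curve, not the project data.
weights = [1800, 2200, 2600, 3000, 3400, 3800, 4200, 4600]
mpg = [60.0 * math.exp(-0.0004 * w) for w in weights]

# Fit 1: MPG on weight. The residuals bend: positive at both ends,
# negative in the middle, which signals a curved relationship.
_, _, fitted_raw, resid_raw = fit_line(weights, mpg)

# Fit 2: log(MPG) on weight. The residuals are flat (here, essentially zero).
log_mpg = [math.log(m) for m in mpg]
a_log, b_log, fitted_log, resid_log = fit_line(weights, log_mpg)

print("weight  resid(MPG)  resid(logMPG)")
for w, r_raw, r_log in zip(weights, resid_raw, resid_log):
    print(f"{w:6d}  {r_raw:10.3f}  {r_log:13.6f}")
```

With real data the log fit will not be exact, and several predictors will be in the model at once; the point of the sketch is only the pattern to look for in the residual columns.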

Summarize the regression results: What fraction of variability in logMPG does your final model explain? Is the residual plot OK? Interpret the model: on average, controlling for other factors, how much higher or lower is the fuel economy you predict for a car weighing 3000 pounds as opposed to 2500 pounds? For a car built in 1980 as opposed to 1975? And so on. Remember that your response variable is the log of fuel economy.
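
Because the response is the log of fuel economy, a coefficient translates into a multiplicative (percent) change in MPG rather than an additive one. The Python sketch below shows the conversion, assuming the log is the natural log. The coefficient value is hypothetical, chosen only to make the arithmetic concrete; it is not a fitted result, and the function name percent_change_in_mpg is likewise made up.

```python
import math


def percent_change_in_mpg(coef, delta):
    """Predicted percent change in MPG when a predictor increases by `delta`,
    holding other predictors fixed, given the predictor's coefficient `coef`
    in a regression whose response is the natural log of MPG."""
    return (math.exp(coef * delta) - 1.0) * 100.0


# HYPOTHETICAL coefficient for weight (per pound); not a fitted value.
b_weight = -0.0003

# A 3000-pound car compared with a 2500-pound car: delta = 500 pounds.
change = percent_change_in_mpg(b_weight, 3000 - 2500)
print(f"Predicted change in MPG: {change:.1f}%")
```

The same function applies to model year: pass the model-year coefficient and a delta of 5 to compare a 1980 car with a 1975 car. If your model includes a quadratic term in a predictor, the change depends on the starting value as well as the difference, so compute the two predicted values of logMPG and exponentiate their difference instead.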



iversen@stat.duke.edu
last updated April 2 1999