STA 113 -- Data Sets

Data for HWs

All data sets from exercises in Mendenhall et al. are on-line, in directory /afs/acpub/project/sta215/. The files are labeled exXX_YY.dat, where XX and YY are chapter and problem number, respectively. For example, to copy the data set for Problem 2.10 into your directory, type
	cp /afs/acpub/project/sta215/ex02_10.dat .
To view the data set, type
	less ex02_10.dat
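
The naming convention can also be written out programmatically. The following is a minimal Python sketch, not part of the course materials: the helper name exercise_path is hypothetical, and it assumes chapter and problem numbers are zero-padded to two digits, as in the ex02_10.dat example.

```python
from pathlib import PurePosixPath

# Directory given in the text above.
DATA_DIR = PurePosixPath("/afs/acpub/project/sta215")


def exercise_path(chapter: int, problem: int) -> PurePosixPath:
    """Return the path of the data file for a given exercise.

    Assumes two-digit zero padding (exXX_YY.dat), as in ex02_10.dat.
    """
    return DATA_DIR / f"ex{chapter:02d}_{problem:02d}.dat"


print(exercise_path(2, 10))  # /afs/acpub/project/sta215/ex02_10.dat
```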

Alternatively, you can download the data files using your web browser from here.

Use the MINITAB commands set or read to import the data set into MINITAB. See the page Intro to Minitab.

Want to try some data analysis using MINITAB? Below are a few data sets which you could analyze (or collect your own).

Homeless Data

In the data set ~peter/ The data set records the number of homeless people and potentially related variables for 50 US cities (Source: "Where Do the Homeless Come From" by William Tucker, National Review, September 25, 1987).
The variables are:
(1) homeless: homeless/1000 pop  obtained from HUD 1984 report on
	      homeless. Tucker adjusted Cleveland, Balto and NY 
(2) numhless: number of homeless people
(3) poverty:  percent below poverty line based on 1979 census 
(4) unemploy: 1984 unemployment rates
(5) publich:  public housing/1000
(6) pop:      pop in 1000
(7) meantemp: mean temperature
(8) vacancy:  vacancy rate (1984)
(9) rentctl:  =1 for rent control (note that Worcester MA has a
	      growth-limiting "movement" according to Tucker but has
	      no formal rent control ordinance)
Hint: To read in the data set use read '' c1-c9. Minitab will then automatically ignore the non-numeric 10th column with the city name.
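
For readers who want to inspect such a file outside Minitab, the same idea (keep the leading numeric columns, drop the trailing name column) can be sketched in Python. This is an illustration only: the helper numeric_columns and the sample line are made up, and the sketch assumes the columns are separated by whitespace.

```python
def numeric_columns(line: str, n: int) -> list[float]:
    """Return the first n whitespace-separated fields of a line as floats.

    Anything after the n-th field (for example a city name, which may
    itself contain spaces) is ignored.
    """
    fields = line.split()
    if len(fields) < n:
        raise ValueError(f"expected at least {n} fields, got {len(fields)}")
    return [float(x) for x in fields[:n]]


# Made-up line in the assumed layout: 9 numbers followed by a name.
sample = "1.5 300 12.0 7.1 4.2 200 55 6.0 0 Some City"
print(numeric_columns(sample, 9))
```

The same helper would apply to the other files with a trailing name column by changing n.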


In ~peter/ The data is a matrix with 8 columns giving statistics for the states. Columns are (1) Population estimate as of July 1, 1975; (2) Per capita Income (1974); (3) Illiteracy (1970, percent of population); (4) Life Expectancy in years (1969-71); (5) Murder and non-negligent manslaughter rate per 100,000 population (1976); (6) Percent High-school Graduates (1970); (7) Mean Number of days with min temperature < 32 degrees (1931-1960) in capital or large city; and (8) Land Area in square miles. A 9-th column gives the state name.

Hint: To read the data into Minitab use read '' c1-c8. Minitab will automatically ignore the non-numeric 9-th column.


In the data set ~peter/peru.DAT. A study was conducted by some anthropologists to determine the long-term effects of a change in environment on blood pressure. In this study they measured the blood pressure of a number of Indians who had migrated from a very primitive environment, high in the Andes mountains of Peru, into the mainstream of Peruvian society, at a much lower altitude.

A previous study in Africa had suggested that migration from a primitive society to a modern one might increase blood pressure at first, but that the blood pressure would tend to decrease back to normal over time.

The anthropologists also measured the height, weight, and a number of other characteristics of the subjects. A portion of their data is given below. All these data are for males over 21 who were born at a high altitude and whose parents were born at a high altitude. The skin-fold measurements were taken as a general measure of obesity. Systolic and diastolic blood pressure usually are studied separately. Systolic is often a more sensitive indicator.

The variables are:
(1)  age:          age in years
(2)  years:        years since migration
(3)  weight:       weight in kilograms
(4)  height:       height in millimeters
(5)  chin:         chin skin fold in millimeters
(6)  forearm:      forearm skin fold in millimeters
(7)  calf:         calf skin fold in millimeters
(8)  pulse:        pulse rate in beats per minute
(9)  systol:       systolic blood pressure
(10) diastol:      diastolic blood pressure

Automobile Data

In ~peter/ The file gives data on 74 models by 12 statistics. The variables are price in dollars, mileage in miles per gallon, repair records for 1977 and 1978 (coded on a 5-point scale, 5 is best, 1 is worst), headroom in inches, rear seat clearance (distance from front seat back to rear seat back) in inches, trunk space in cubic feet, weight in pounds, length in inches, turning diameter (clearance required to make a U turn) in feet, displacement in cubic inches, and gear ratio for high gear.

The data give statistics for automobiles of the 1979 model year as sold in the United States.

Hint: To read in the data set use read '' c1-c12. Minitab will then automatically ignore the non-numeric 13th column with the car model name.


In the file ~peter/furnace.DAT. Wisconsin Power and Light studied the effectiveness of two devices for improving the efficiency of gas home-heating systems. The electric vent damper (EVD) reduces heat loss through the chimney when the furnace is in its off cycle by closing off the vent. It is controlled electrically. The thermally activated vent damper (TVD) is the same as the EVD except it is controlled by the thermal properties of a set of bimetal fins set in the vent. Ninety test houses were used, 40 with TVD's and 50 with EVD's. For each house, energy consumption was measured for a period of several weeks with the vent damper active and for a period with the damper not active. This should help show how effective the vent damper is in each house.

Both overall weather conditions and the size of a house can greatly affect energy consumption. A simple formula was used to try to adjust for this. Average energy consumed by the house during one period was recorded as (consumption)/[(weather)(house area)], where consumption is total energy consumption for the period, measured in BTU's, weather is measured in number of degree days, and house area is measured in square feet. In addition, various characteristics of the house, chimney, and furnace were recorded for each house. A few observations were missing and recorded as *, Minitab's missing data code.
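
As a worked illustration of the adjustment formula, here is a short Python sketch. The function name and the numbers are hypothetical, chosen only to show the arithmetic; they are not taken from the furnace data.

```python
def adjusted_consumption(consumption_btu: float,
                         degree_days: float,
                         house_area_sqft: float) -> float:
    """Energy consumption adjusted for weather and house size.

    Computes consumption / (weather * house area), that is,
    BTU per degree day per square foot.
    """
    if degree_days <= 0 or house_area_sqft <= 0:
        raise ValueError("degree days and house area must be positive")
    return consumption_btu / (degree_days * house_area_sqft)


# Hypothetical house: 12,000,000 BTU over 800 degree days, 1,500 sq ft.
print(adjusted_consumption(12_000_000, 800, 1_500))  # 10.0
```

Dividing by both degree days and area puts houses of different sizes, measured in different weather, on a comparable scale.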

The variables are:
(1)  type        type of furnace:  1 = forced air, 
			2 = gravity, 3 = forced water 
(2)  ch.area     chimney area
(3)  ch.shape    chimney shape:  1 = round, 2 = square, 3 = rectangular
(4)       chimney height (in feet)
(5)  ch.line     type of chimney liner: 0 = unlined, 1 = tile, 2 = metal
(6)  house       type of house:  1 = ranch, 2 = two-story, 3 = tri-level,
                                 4 = bi-level, 5 = one and a half stories
(7)  age         house age in years (99 means 99 or more years)
(8)      average energy consumption with vent damper in
(9)  btu.out     average energy consumption with vent damper out
(10) damper      type of damper:  1 = EVD, 2 = TVD


In the file ~peter/pulse.DAT. Students in an introductory statistics course participated in a simple experiment. The students took their own pulse rate. They then were asked to flip a coin. If their coin came up heads, they were to run in place for one minute. Then everyone took their own pulse again. The pulse rates and some other data are given below.
The variables are:
(1)  pulse1     --   first pulse rate
(2)  pulse2     --   second pulse rate
(3)  ran        --   1 = ran in place, 2 = did not run in place
(4)  smokes     --   1 = smokes regularly, 2 = does not smoke regularly
(5)  sex        --   1 = male, 2 = female
(6)  height     --   height in inches
(7)  weight     --   weight in pounds
(8)  activity   --   usual level of physical activity:  
			1 = slight, 2 = moderate, 3 = a lot


In the file ~peter/restrnt.DAT. The 1980 Wisconsin Restaurant Survey collected data on restaurants in Wisconsin. The survey was done primarily to allow educators, researchers, and public policy makers to evaluate the status of Wisconsin's restaurant sector and to identify particular problems that it was encountering. A second purpose was to develop data that would be useful to small business counselors in advising managers as to how to effectively plan and operate their small restaurants.

Nineteen of Wisconsin's counties were selected for the study. Lists of restaurants were drawn up from telephone directories and these were sampled in proportion to the population of the county. A sample of 1000 restaurants yielded 279 usable responses. The data set thus consists of 279 cases, one for each restaurant in the usable data.

The variables are:
(1)  id        -- identification number
(2)  outlook   -- values 1, 2, 3, 4, 5, 6, 7, denoting from very unfavorable
                  to very favorable
(3)  sales     -- gross 1979 sales in $1000's
(4)  newcap    -- new capital invested in 1979, in $1000's
(5)  value     -- estimated market value of the business, in $1000's
(6)  costgood  -- cost of goods sold as a percentage of sales
(7)  wages     -- wages as a percentage of sales
(8)  ads       -- advertising as a percentage of sales
(9)  typefood  -- 1 = fast food, 2 = supper club, 3 = other
(10) seats     -- number of seats in dining area
(11) owner     -- 1 = sole proprietorship, 2 = partnership, 3 = corporation
(12) ft.empl   -- number of full-time employees
(13) pt.empl   -- number of part-time employees
(14) size      -- size of restaurant:  
		1 = 1 to 9.5 full-time equivalent employees,
                2 = 10 to 20 full-time equivalent employees.  
		(part-time employees are each counted as 1/2 of a 
		full-time employee).   
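
The size coding can be reproduced from the two employee counts. A minimal Python sketch follows; the function names are hypothetical, and returning None for restaurants outside both stated ranges is an assumption, since the description does not say how such cases are coded.

```python
from typing import Optional


def full_time_equivalents(ft_empl: int, pt_empl: int) -> float:
    """Each part-time employee counts as half a full-time employee."""
    return ft_empl + 0.5 * pt_empl


def size_code(ft_empl: int, pt_empl: int) -> Optional[int]:
    """Return 1 for 1 to 9.5 FTE, 2 for 10 to 20 FTE, else None (assumption)."""
    fte = full_time_equivalents(ft_empl, pt_empl)
    if 1 <= fte <= 9.5:
        return 1
    if 10 <= fte <= 20:
        return 2
    return None


print(size_code(3, 4))  # 3 + 2 = 5 FTE, so size 1
print(size_code(8, 6))  # 8 + 3 = 11 FTE, so size 2
```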

STA 110 Students

The file is in ~peter/ The data is described in the Minitab handout "Review of Descriptive Statistics". If you use this dataset, please consider questions different from those discussed in the handout.

There are 9 columns corresponding to the variables:

AGE (in years); HEIGHT (in inches); WEIGHT (in pounds); RELIGION (1=Catholic, 2=Protestant, 3=Jewish, 4=Other, 5=None); MAJOR (1=psych, 3=bio, 4=pps, 5=soc, 11=other); attitude about abortion legislation: ABVIEW (1="unrestricted pro choice" to 4="unrestricted pro life"); political leaning: POL (1=very conservative to 5=very liberal); attitude towards STA 110 before the first class: BEFORE (0=very negative to 8=very positive); attitude after first class: AFTER (0 to 8).