The first is to analyze two "independent" samples: one of 1980 sales, the
other of 1990 sales (some homes have sold in both years and might be in both
samples, so the samples aren't really independent, but are probably nearly
so). The data set that is displayed when you start SAS is called "UNPAIRED".
The second column in this data set, threetwo, indicates three bedroom,
two bath homes with a "1" and other types of homes with a "0". The sale
price, in dollars, is given in column 3, price. The year of sale
(1980 or 1990) is given in column 4, year.
1) Make box plots for sale price by year (use the group subcommand on the distribution or box plot menu, group by year). Compare the two: notice that the distributions are very skewed. Recall that confidence intervals that use the normal and t-distributions in determining the allowance for error assume that the sample is from a normal population. This is most certainly not the case here. One way to cope with this problem is to make the data more nearly normal by applying a transformation. The logarithm is a useful transformation for positive data with a long right tail. Create a new variable containing the logarithm of price (use the Edit > Variables menu). Make box plots for the transformed data. Is one distribution noticeably more variable than the other? Are their medians different? Do the distributions appear more nearly normal?
2) Use the distribution menu to calculate the group means, variances and sample sizes needed to form a 95% confidence interval for the difference in mean log sale price between 1980 and 1990 (1990 mean - 1980 mean) using formula 8-20 in the book (there is a direct way of doing this calculation in SAS, but it requires a topic we haven't covered yet, analysis of variance). Does the confidence interval contain zero? If not, do homes seem to be appreciating or depreciating in value (on the log scale)?
3) In the same way (no need to transform!), calculate a 95% confidence interval for the difference between the proportions of three bedroom, two bath homes among those sold in 1990 and those sold in 1980 (1990 proportion - 1980 proportion) using formula 8-29. Is there evidence of a difference in the fraction of "three-two" homes sold in 1990 and 1980? Going back to problem 2, one reason homes might seem more valuable in 1990 than in 1980 is that different types of homes might have been more likely to sell in one year than the other. How does the confidence interval we calculate here help us address this question?
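If you would like to check your hand calculations for problems 2 and 3, the arithmetic can be sketched in a few lines of Python. This is only a sketch: it assumes formulas 8-20 and 8-29 are the usual large-sample (z-based, unpooled) intervals, which may not match the book exactly (the book's version could pool variances or use a t critical value). The function names and all of the numbers below are made up for illustration and are not the lab data.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_diff_interval(x, y, level=0.95):
    """Large-sample interval for mean(y) - mean(x), unpooled variances.

    Assumption: this is the form of formula 8-20; check the book.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level is 0.95
    diff = mean(y) - mean(x)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return diff - z * se, diff + z * se


def proportion_diff_interval(k1, n1, k2, n2, level=0.95):
    """Large-sample interval for p2 - p1 from counts k and sample sizes n.

    Assumption: this is the form of formula 8-29; check the book.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p1, p2 = k1 / n1, k2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p2 - p1) - z * se, (p2 - p1) + z * se


# Made-up log prices, for illustration only. Samples this small would
# really call for a t critical value; they just exercise the arithmetic.
log_1980 = [10.8, 11.1, 11.3, 10.9, 11.5, 11.0]
log_1990 = [11.4, 11.9, 11.6, 12.1, 11.7, 11.5]
print(mean_diff_interval(log_1980, log_1990))

# Made-up counts: 40 of 100 "three-two" homes in 1980, 50 of 100 in 1990.
print(proportion_diff_interval(40, 100, 50, 100))
```

In both functions the first sample plays the role of 1980 and the second the role of 1990, so the interval is for (1990 - 1980), matching the order asked for above.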
Our second approach to measuring appreciation will be to look at a paired sample of homes: those sold in 1980 and then again in 1990. Open the dataset "PAIRED." The second column, threetwo, indicates three bedroom, two bath homes with a "1" and other types of homes with a "0". The 1980 sale price, in dollars, is given in column 3, price80, and the 1990 sale price, in dollars, is given in column 4, price90.
4) Create a new variable defined to be the logarithm of price
in 1990 minus the logarithm of price in 1980 (use the Edit >
Variables menu: first use "log(Y)" to create columns with the
logarithm of the respective sale prices, then choose the "Y-X"
transformation from Edit > Variables > Other...). Note
that a difference on the (natural) log scale is approximately the
proportional change on the data's original scale of measurement when
that change is small; hence a difference in logs of 0.2 corresponds
to roughly a 20% increase (more precisely, e^0.2 - 1, or about 22%).
Make a histogram of this variable and describe its shape.
5) Calculate a 95% one-sample t-interval for the mean difference in logarithm of price (use the Distribution menu, click on the "Output" option and choose "95% C.I. for the Mean"). Does the interval include zero? What does this mean?
6) Compare the paired and unpaired intervals: which interval is
wider? Are the intervals centered near the same value? Which interval
do you believe gives a more accurate picture of the appreciation in market
value of homes in Dade County from 1980 to 1990? Why?
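The paired calculation in problems 4 and 5 can be sketched the same way, as a check on what the Distribution menu reports. This is a minimal sketch, not the SAS procedure: the function names are hypothetical, the prices are made up, and the t critical value is passed in by hand (look it up in the t table for n - 1 degrees of freedom) rather than computed.

```python
from math import exp, log, sqrt
from statistics import mean, stdev


def paired_t_interval(before, after, t_crit):
    """One-sample t-interval for the mean of log(after) - log(before).

    t_crit is the t table value for len(before) - 1 degrees of freedom.
    """
    diffs = [log(a) - log(b) for b, a in zip(before, after)]
    center = mean(diffs)
    half_width = t_crit * stdev(diffs) / sqrt(len(diffs))
    return center - half_width, center + half_width


def log_diff_to_percent(d):
    """Exact percent change implied by a difference d in natural logs."""
    return 100 * (exp(d) - 1)


# Made-up sale prices for five homes, for illustration only.
price80 = [50000, 62000, 80000, 45000, 70000]
price90 = [61000, 75000, 99000, 52000, 88000]

T_CRIT_DF4 = 2.776  # 97.5th percentile of t with 4 degrees of freedom
lo, hi = paired_t_interval(price80, price90, T_CRIT_DF4)
print(lo, hi)

# Convert the endpoints back to percent changes on the dollar scale.
print(log_diff_to_percent(lo), log_diff_to_percent(hi))
print(log_diff_to_percent(0.2))  # about 22.1, not exactly 20
```

Because each home serves as its own comparison, the standard error here depends only on the spread of the within-home differences, which is worth keeping in mind when you compare interval widths in problem 6.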