Lab Objective
To get more practice using JMP commands, and to illustrate
the benefits of random sampling in surveys and causal
studies.
Lab Procedures
In a survey, the sampled data should be representative of
the target population. The simplest way to guarantee
representative data is to collect data from randomly
selected units in the population. We'll illustrate this
using real data.
Open the file agpop from the course directory. This
file is taken from the 1992 U.S. Census of
Agriculture. It contains data on agricultural
characteristics of all 3,078 counties in the United
States. Variables include acres92 (number of acres
devoted to farming in 1992), farms92 (number of farms
in 1992), largef92 (the number of farms with more than
1,000 acres), smallf92 (the number of farms with fewer
than 9 acres), and similar variables for the 1987 and
1982 censuses. Also included are county and state
names, and a variable indicating the county's region
of the country (West, Northeast, North Central,
South). For more information on the Census of
Agriculture, including data from the 1997 census, you
can visit the web site of the National Agricultural Statistics
Service.
Data analysis tip: When looking at a data set for the first
time, it is always a good idea to play around with it to get
a feel for what it contains. For example, I saw some -99
values in the data. Obviously, it is not possible to have
negative numbers of farms or negative acres of land, so I
conclude that the -99s identify missing data. Missing data
require special care, and you should seek out a professional
statistician when you have lots of missing data. For this
lab, we'll be stupid and treat -99s as genuine values.
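To see why the -99 codes matter, here is a minimal Python
sketch (not part of the JMP procedure). The six acreage values
are made up for illustration, and dropping the coded values is
only one simple way of handling them, not a recommendation.

```python
from statistics import mean

# Hypothetical acreage values for six counties; -99 marks a missing value.
acres = [120000, 85000, -99, 240000, -99, 55000]

# Treating -99 as a genuine value (what this lab does).
mean_with_codes = mean(acres)

# Dropping the -99 codes before averaging (one simple alternative).
observed = [a for a in acres if a != -99]
mean_observed = mean(observed)

print("mean with -99 kept:   ", mean_with_codes)
print("mean with -99 dropped:", mean_observed)
```

Keeping the -99s pulls the average down, because each missing
county is counted as if it had -99 acres.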
Questions: Exploring the data
Hints for questions 1 - 3: With smart sorts and summaries,
you can answer these questions very quickly.
1) What is the trend in total acres devoted to farmland
in Durham County, NC, from 1982 to 1992? That is, did
Durham become more or less agriculturally based over those
10 years? Report numbers to back up your claims.
2) What is the trend in total acres devoted to farmland in
the state of North Carolina over the same period? Hint: In
Tables - Summary, you can get summaries of more than one
variable simultaneously by holding the "Shift" key while
highlighting the variables.
3) Which state had the smallest number of farms in
1992?
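The "smart sorts and summaries" idea behind questions 1 - 3 is
to total a variable within each group and then sort the totals.
Here is a rough Python sketch of that idea. The county and
state labels and the farm counts are invented placeholders, not
values from agpop.

```python
from collections import defaultdict

# Hypothetical rows: (county, state, farms92). All values are made up.
rows = [
    ("County A", "State 1", 300),
    ("County B", "State 1", 450),
    ("County C", "State 2", 120),
    ("County D", "State 3", 900),
    ("County E", "State 3", 1100),
]

# Summary step: add up farms92 within each state.
farms_by_state = defaultdict(int)
for county, state, farms92 in rows:
    farms_by_state[state] += farms92

# Sort step: order the states by their totals, smallest first.
ranked = sorted(farms_by_state.items(), key=lambda item: item[1])

for state, total in ranked:
    print(state, total)
```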
The Census of Agriculture is a census, so the data set can
be used to obtain quantities for the entire population. For
example, we can calculate the total number of acres devoted
to farming in the whole United States, the total number of
farms in the whole United States, etc. Let's use JMP to get
some of these quantities.
Select the Analyze menu option in JMP, then click on
Distribution. You'll see a box with the names of the
variables. Highlight farms92, then hit Y-columns. Do this
with all the 1992 variables. Now click OK to get
summaries of the variables, such as the means, medians, and
many other statistics we'll use later in the semester.
You'll also see histograms and possibly other graphical
displays. We'll use those later in the semester as well.
Write down the population means on scrap paper for use in a
later question.
Since we have the actual population means, there's no need
to take random samples. There's no point in estimating
numbers when you can know them exactly! However, our
objective for lab is to see if random sampling works in a
real data set. So, here's what we'll do. We'll use JMP to
take a random sample of 500 counties. If random selection
truly gives a representative sample, the averages of the
variables in the sample should be close to the averages of
the variables in the whole population of 3,078 counties.
At first glance, it may seem preposterous to claim that 500
counties can represent 3,078 counties. Look at the ranges
of some of the variables: acres92 has a smallest value
of -99 and a largest value of 7,229,985 acres, and the
number of large farms in a county stretches from 0 to 579.
How are we possibly going to get a sample that reflects the
characteristics of all these wide-ranging variables with
only 500 out of 3,078 counties? Let's see what happens....
Question:
4) Take a random sample. Based on comparisons between
the sample means and population means, does it seem that
picking counties at random provides a representative
sample? Talk to the TA or instructor about your
conclusions, and any questions that you may have. After you
talk to the TA or instructor, they will give you credit for
answering this question.
It's easy to take a simple random sample in JMP from a data
file. First, make sure that no columns are highlighted.
Then, select Tables from the menu options, then select
Subset. Choose the option for Random Sample. Enter 500 as
the sample size, i.e. the number of counties to be sampled.
Hit OK and you get a new data table with 500 randomly
sampled counties. If you want to take another random sample
to check if the results from the first sample were just
dumb luck, close this new data table and repeat the
previous instructions.
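For readers who want to see the same comparison outside JMP,
here is a minimal Python sketch. It does not read agpop; the
population is a synthetic, right-skewed stand-in for a variable
such as farms92, and the lognormal shape and the seed are
assumptions chosen only for illustration.

```python
import random
from statistics import mean

random.seed(1992)  # arbitrary seed so the run is repeatable

N_COUNTIES = 3078   # population size used in this lab
SAMPLE_SIZE = 500   # sample size used in this lab

# Synthetic, right-skewed population (an assumption, not the agpop data).
population = [random.lognormvariate(6, 1) for _ in range(N_COUNTIES)]

# Simple random sample: every county equally likely, no repeats.
sample = random.sample(population, SAMPLE_SIZE)

population_mean = mean(population)
sample_mean = mean(sample)
relative_error = abs(sample_mean - population_mean) / population_mean

print("population mean:", round(population_mean, 1))
print("sample mean:    ", round(sample_mean, 1))
print("relative error: ", round(relative_error, 3))
```

Changing the seed plays the same role as closing the subset
table in JMP and drawing a fresh sample.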
The sample size 500 was chosen arbitrarily. Later in the
semester, we'll learn a principled method of choosing
sample sizes.
To me, what's amazing about this is that you usually get
pretty close by just throwing darts. In fact, you would be
hard pressed to get closer on all variables by any
non-random method of selecting data. I dare you to try at
home.
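One way to check the word "usually" is to draw many random
samples and count how often the sample mean lands near the
population mean. The Python sketch below does this with a
synthetic population. The lognormal shape, the 200 repeats,
and the 10% cutoff for "close" are all assumptions picked for
illustration.

```python
import random
from statistics import mean

random.seed(7)  # arbitrary seed so the run is repeatable

# Synthetic, right-skewed population standing in for one variable.
population = [random.lognormvariate(6, 1) for _ in range(3078)]
population_mean = mean(population)

N_REPEATS = 200
close = 0
for _ in range(N_REPEATS):
    sample_mean = mean(random.sample(population, 500))
    # Count the sample as "close" if within 10% of the population mean.
    if abs(sample_mean - population_mean) <= 0.10 * population_mean:
        close += 1

fraction_close = close / N_REPEATS
print("fraction of samples within 10%:", fraction_close)
```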
Data analysis tip: Here's a generic method for taking a
random sample from a population. First, give each unit on
the sampling frame a
distinct number in the range 1 to N, where N is the total
number of units on your sampling frame. Second, open a new
data file in JMP and create a single column with numbers
from 1 to N. Third, pick a random sample of these numbers
from this file using the Subset - Random Sample method.
Finally, collect data for those units whose numbers were
picked in the sample.
To generate a column of numbers in JMP that goes from 1 to
N, go to Cols - Column Info. Select New Property - Formula.
Then, select Edit Formula. Next, select Numeric - Count.
Enter 1 in the from box, and enter the number N in both the
to box and the steps box. Click OK until you get back to the
data sheet.
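The generic method in the tip above (number the units 1 to N,
then sample the numbers) can also be sketched in a few lines of
Python. The values of N and n here simply reuse the lab's
figures as an example; substitute your own frame size and
sample size.

```python
import random

N = 3078  # number of units on the sampling frame (example value)
n = 500   # number of units to sample (example value)

# Steps 1 and 2: the unit numbers 1, 2, ..., N.
unit_numbers = range(1, N + 1)

# Step 3: simple random sample of n distinct unit numbers.
sampled_numbers = sorted(random.sample(unit_numbers, n))

# Step 4 happens off the computer: collect data for these units.
print(sampled_numbers[:10])
```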