Statistics 101: Lab 1

Statistics 101
Data Analysis and Statistical Inference

Instructions for lab 1

Lab Objective

To become familiar with the software package JMP.

Lab Procedures

JMP gives us an enormous advantage over people who learned about and performed statistical analyses back in the pre-computer days. It allows us to avoid the drudgery of long, arithmetical calculations in favor of understanding concepts and analyzing data. You may find JMP a little annoying at times (all computer software is), but I suspect that you will be thankful for its existence once we start analyzing data.

There should be a shortcut to JMP on the Desktop that you can click to begin JMP. If not, click on the Windows Start button in the lower left corner of the screen, then select Program-Math and Statistics-JMPIN-JMPIN. The software should open with a title page showing. Click on the title page and it will disappear, leaving you ready to work.

Like most Windows-based programs, JMP has menu choices, including File, Edit, Tables, Rows, Cols, Analyze, Graph, Tools, Window, and Help. As the semester progresses, we will learn about the options under these menu choices. For now, we focus on creating and downloading JMP data sets.

Unit 1: Creating new data sets

Sometimes you need to create your own data sets from scratch (e.g., when analyzing data you collected for your final project). This tutorial familiarizes you with creating JMP data sets.

Click on the New Data Table button on the JMP work window. This brings up a blank spreadsheet that will become the data set. You also can create a new data file by navigating through the menus to File - New Data Table. The first step is to tell JMP the number of rows in the data set. Click on the red arrow next to Rows, or select Rows from the menu choices, and select Add Rows. Enter the number of rows you want. To keep things simple, we'll use 9 rows for this part of the lab.

Here are the data for nine people:

Sex Number
m 1
m 2
f 1
m 3
f 2
f 2
m 3
f 1
f 2

We're going to input the nine numbers in the first column. "Column 1" is a meaningless name for a variable, so let's rename it. Double click on the box containing "Column 1" and change the variable name to "Number".

Data analysis tip: When you create data sets for your own research, give your variables descriptive labels. It is easier to interpret analyses when the output has descriptive labels than when the output has labels like "Column 1", "Column 2", "Column 3", etc. Descriptive labels also make your data set comprehensible to others who may need or want to use it. Finally, if you use the data set in future analyses, you won't have to spend lots of time trying to decipher uninformative variable names.

Let's add another column to record the sexes. Click on Cols in the menu, and select New Column. Change the name of the column to "Sex" by writing over the "Column 2". You see a button for Data Type, which allows you to specify whether the column contains numbers (numeric), labels or names (character), or row states (we won't use this). Choose character. Next is Modeling Type, which helps JMP decide what graphs to show you. The two modeling types we use are continuous and nominal. We'll learn about these in more detail later, but the basic idea is to select continuous for numbers and nominal for variables that are labels. JMP displays variables that have numbers as data with a blue "C" and variables that have labels (or names) as data with a red "N".

After you input all the data, answer the following questions. Write your answers on a blank piece of paper to be turned in at the end of lab as your lab report. You're permitted and encouraged to talk about questions with your classmates, but write up your lab report in your own words. Feel free to ask for help from the TAs or classmates if you get stuck.

Questions:

1) How many people picked each number?

With nine people it's straightforward to look at the data and get an accurate count. But in the entire class of 120 people, counting the occurrences of each number "by hand" would be cumbersome. In such settings, you can make life easier by sorting the numbers in increasing order and then counting the occurrences. Let's do this in JMP just to get familiar with this handy command.

Select the Tables menu option and click on Sort. Select the variable "Number" and place it in the By box. Hit Sort. You get a sorted data set in a new table. Sorting is useful for many data analyses. In fact, you may want to use it again later in the lab.
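
For readers curious how the sort-then-count idea looks outside JMP, here is a minimal Python sketch. It is not part of the lab and says nothing about how JMP works internally. It uses the nine-person data from Unit 1; the names people, sorted_by_number, and counts are made up for illustration.

```python
from collections import Counter

# The nine (sex, number) records from the table in Unit 1.
people = [
    ("m", 1), ("m", 2), ("f", 1),
    ("m", 3), ("f", 2), ("f", 2),
    ("m", 3), ("f", 1), ("f", 2),
]

# Like Tables > Sort, sorted() returns a new list and leaves the original alone.
sorted_by_number = sorted(people, key=lambda person: person[1])

# Once sorted, equal numbers sit next to each other, which makes counting easy.
# Counter does the tally directly, with or without sorting first.
counts = Counter(number for _sex, number in people)

for number in sorted(counts):
    print(number, counts[number])
```

With 120 rows instead of nine, only the contents of people would change; the sorting and counting lines stay the same.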

2) If you want to sort the data first by sex and then by number (i.e., have all the females first with numbers in increasing order and all the males second with numbers in increasing order), which sequence of commands would you use? Select (a) or (b), and write just the letter of your answer on your report.

(a) Select the Tables menu option and click on Sort. Select the variable "Number" and place it in the By box. Then select the variable "Sex" and place it in the By box. Hit Sort.
(b) Select the Tables menu option and click on Sort. Select the variable "Sex" and place it in the By box. Then select the variable "Number" and place it in the By box. Hit Sort.

Okay, that's enough of the basics of creating your own spreadsheet. Now for some real data that someone else has collected.

Unit 2: Downloading data sets

Load in the data set forbes94, which contains the 1994 compensation information for Chief Executive Officers (CEOs) of several large companies. To open this data set, click here. Take a look at those total compensation figures.... Yikes!! Why did I decide to go into academia?

When you get a data set, the first thing to do is figure out how many variables and how many units of observation you have to play with. This is pretty easy in JMP. Each column represents a variable, and each row represents a unit of observation. Hence, there are 800 CEOs in this data set. There is also a mix of numeric (blue "C") and character (red "N") variables in the data set.

Let's get into some analyses. Ask for help if you get stuck.

Questions:

3) JMP displays missing values with dots. True or false: There are more than five CEOs whose values of total compensation are missing in the data file.

Data analysis tip: It is common for some data to be missing in a file. Unfortunately, there is no universally accepted way of representing missing values. Some software packages, like JMP, use a dot or period. Other packages use an "NA" for not available. Some data producers, like federal agencies, use extreme values of a variable (e.g., -99) to indicate missing values. Using extreme values is bad practice: how does the user know if the value is an actual value or if it is a dummy for missingness? When you get a data set from someone, learn how they code missing data before doing any further analyses.
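
To make the tip concrete, here is a minimal Python sketch of reading a column whose missing values may be coded in several ways. The raw values and the set of codes are invented for illustration; they are assumptions, not codes used by any particular data producer. Note that treating -99 as missing is exactly the risky judgment call described above: the code cannot tell a real -99 from a dummy.

```python
# Hypothetical raw values for one column, as they might arrive in a text file.
raw_values = ["1200", ".", "NA", "-99", "850", ""]

# Assumed missing-value codes; confirm these with whoever produced the data.
MISSING_CODES = {".", "NA", "-99", ""}

def parse_value(text):
    """Return a float, or None when the text is one of the missing-value codes."""
    cleaned = text.strip()
    if cleaned in MISSING_CODES:
        return None
    return float(cleaned)

parsed = [parse_value(value) for value in raw_values]
n_missing = sum(value is None for value in parsed)

print(parsed)
print("missing:", n_missing)
```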

4) What is the salary (not total compensation) of the CEO of Blockbuster Entertainment?

We need to search through the database for Blockbuster, then read off the salary of its CEO. One approach is to look at the company names row by row. For those who find joy only in tedium, this is the preferred approach. All others should go to the Edit menu option, and select Search and then Find. Type in "Blockbuster," selecting nothing else. I was mildly surprised that Blockbuster is considered a retail, not entertainment, industry. I also didn't expect Disney to be a "travel" industry. Who knew....
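
As a rough illustration of what a text search over a table does, here is a minimal Python sketch. The records, company names, and salaries are invented and are not taken from forbes94; find_rows is a hypothetical helper, not a JMP feature.

```python
# Hypothetical records; the names and salaries are invented, not from forbes94.
records = [
    {"company": "Acme Retail Corp", "salary": 500000},
    {"company": "Example Entertainment Inc", "salary": 750000},
    {"company": "Sample Foods", "salary": 620000},
]

def find_rows(rows, text):
    """Return the rows whose company name contains text, ignoring case."""
    needle = text.lower()
    return [row for row in rows if needle in row["company"].lower()]

matches = find_rows(records, "entertainment")
for row in matches:
    print(row["company"], row["salary"])
```

Matching on part of the name, and ignoring case, means you do not need to know exactly how the company name was typed into the file.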

5) Which CEO has the highest total compensation? Who has the lowest total compensation?

6) Which industry type has the highest average CEO total compensation? Be careful not to read the decimals incorrectly when you answer the question.

There are way too many CEOs to figure this out by hand. Let JMP do all the work. Select the Tables menu option and click on Summary. Put the variable "Wide Industry" in the Group box, then highlight "Total Compensation". Next, click on Statistics to pull down a menu of summary statistics. Select the Mean (and just for kicks, one other summary that interests you). Hit OK. You should see a table of the statistics you selected for the industries, ordered alphabetically by industry. You may need to scroll down to see all industries.

Each row in the table reports the statistics for one industry, aggregated over the CEOs in that industry. For example, there are 62 CEOs in "Food" industries, and their average total compensation equals $2,740,661.31. That's a lot of Twinkies.

In general, the summary command is useful for comparing means and other statistics for several groups.
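
The grouping logic behind a summary table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the industries and dollar amounts below are invented, not values from forbes94, and the names rows, by_industry, and summary are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (industry, total compensation) pairs; not values from forbes94.
rows = [
    ("Food", 2_000_000.0),
    ("Retail", 1_500_000.0),
    ("Food", 3_000_000.0),
    ("Energy", 1_000_000.0),
    ("Retail", 2_500_000.0),
]

# Collect the compensation values for each industry.
by_industry = defaultdict(list)
for industry, compensation in rows:
    by_industry[industry].append(compensation)

# One summary row per group, in alphabetical order by industry.
summary = [
    (industry, len(values), mean(values))
    for industry, values in sorted(by_industry.items())
]

for industry, n, average in summary:
    print(f"{industry:<8} N={n}  mean={average:,.2f}")
```

Each tuple in summary plays the role of one row of the summary table: the group label, the number of rows in the group, and the group mean.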

7) How many of these CEOs got their undergraduate degree from Duke?

8) Let's assume all the CEOs from UNC schools graduated from UNC Chapel Hill. Assuming this, there are more CEOs with undergraduate degrees from Carolina than there are from Duke. Your friends at Carolina use this to argue that their graduates are more likely to be CEOs than Duke graduates. Defend our school! Use the CEO counts to make a statistical argument that Duke does not lag behind Carolina in producing CEOs. Write two or fewer sentences to justify your answer. (You need some information that is not in the data but is easily found on the web.)

9) Highest attained educational degree is in the variable "Grad degree". If your only interest were making good cash (total comp.), which path should you pursue: MBA (business), JD (law), MD (physician), PhD, or no graduate degree? Use highest average total compensation as your criterion, and choose only from these categories. Justify your answer in one sentence.

10) Explore the data to answer at least one question that interests you. Report your findings to one of the TAs or the instructor; you don't have to write anything on your lab sheet for this question. Ask your TAs for help with JMP if needed.

You may want to begin your list of JMP commands by adding instructions for the methods you used in Lab 1. We'll use sorting and summarizing by groups for Lab 2 (and for later labs), so it will be helpful to have commands for those data analysis tools handy. (Obviously, don't turn in this list; it's yours!)

This ends the lab.

DON'T FORGET TO LOG OFF FROM YOUR MACHINE.