Statistics 101
Data Analysis and Statistical Inference
Instructions for Lab 3
Lab
Objective: To explore data
with histograms and scatterplots.
Lab Procedures
Unit 1: Movie Box Office Data
What are the characteristics of
U.S. movies that make the most money?
The JMP data set movies.jmp contains
data on the 277 top-grossing movies of all time as of 2003. To open the data, click on File - Open, and
open the folder Sample Data. Select Movies.jmp, then
click Open.
Tip: The unit of measurement for
the two monetary variables is not stated. That's bad practice. Always include a description of the units
somewhere in the file. Based on knowledge of movie revenues, it is clear
that the unit of measurement is $1,000,000.
1) Describe the distribution of U.S. (domestic) grosses. Include
in your response where most values are, any outliers, whether the
distribution is tightly packed around its mean or is more dispersed, and
the mean and standard deviation.
The simplest way to get a visual
(and numerical) description of a distribution in JMP is to go to the Analyze menu option and select Distribution. Put the variables of interest into the Y, Columns box and select OK.
This will produce a histogram and a box plot, and will also provide
quantiles and important statistics (e.g. mean and standard deviation).
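If you want to double-check these summaries outside JMP, the following is a minimal Python sketch of the same calculations. The values in the list are made up for illustration and are not taken from movies.jmp. The 1.5 x IQR outlier rule is one common box plot convention, and JMP may compute quantiles with a slightly different rule than Python's statistics.quantiles, so small differences are to be expected.

```python
import statistics

# Hypothetical grosses in $ millions (illustration only, NOT from movies.jmp).
grosses = [120.0, 135.5, 150.0, 162.3, 180.0, 210.7, 250.0, 600.0]

mean = statistics.mean(grosses)
std_dev = statistics.stdev(grosses)  # sample standard deviation (divides by n - 1)
q1, median, q3 = statistics.quantiles(grosses, n=4)  # quartiles

# One common box plot rule: flag points more than 1.5 * IQR beyond the quartiles.
iqr = q3 - q1
outliers = [g for g in grosses if g < q1 - 1.5 * iqr or g > q3 + 1.5 * iqr]

print(f"mean={mean:.2f} sd={std_dev:.2f} median={median:.2f}")
print("flagged as outliers:", outliers)
```

A mean well above the median, as in this made-up list, is one sign of a distribution with a long right tail.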
2) Would you say that a normal curve is a reasonable description of domestic grosses?
How about worldwide grosses? How do you decide whether or not a distribution follows a normal curve?
(There are two possible answers to this question.)
You can
examine histograms for both variables simultaneously by adding both into the
"Y, Columns" box in Analyze - Distribution. To make a normal quantile plot, after
you’ve run Analyze - Distribution, click on the red arrow next to
the variable name and select Normal Quantile Plot.
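To see what a normal quantile plot is built from, here is a rough Python sketch that pairs each sorted data value with the standard normal quantile for its rank. The function name normal_quantile_points and the plotting position (i - 0.5) / n are assumptions made for illustration; JMP may use a different plotting position. If the data follow a normal curve, the resulting points fall close to a straight line.

```python
from statistics import NormalDist


def normal_quantile_points(values):
    """Return (theoretical z, observed value) pairs for a normal quantile plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i - 0.5) / n is one common plotting position; other choices exist.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]


# Hypothetical values (illustration only, NOT from movies.jmp).
points = normal_quantile_points([3.0, 1.0, 2.0])
for z, x in points:
    print(f"z={z:+.3f}  value={x}")
```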
3) Which
movie is the clear outlier on both monetary variables?
If you
hold the cursor over a single point on the plots, JMP will tell you the
corresponding row number. For example,
“There’s Something About Mary” is row number 254 on the
original data set (if you don’t sort).
4) Worldwide Grosses:
a) What movie rating makes the most money on average worldwide?
b) What two types of movie have the most similar distribution of worldwide grosses? Justify your answer in no more than two sentences, noting what the greatest similarity and greatest difference are.
c) Is a Family movie or an Action movie more likely to gross over $300 million worldwide? What criteria did you use?
To answer
these questions, we can use a couple of different tools. We can examine the relationship of worldwide
gross and movie type using a box plot. To get a box plot, click on
the Analyze menu option and
select Fit Y by X. Put the continuous variable in the Y,
Response box and the categorical variable in the X, Factor box.
Click OK. On the subsequent screen, select the red arrow next
to "Oneway Analysis of ...." and select Quantiles.
There are other display options available under that menu, which you
should explore a little (for example, in the last lab we also used Means and
Std Dev).
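The group-by-group summaries behind a side-by-side box plot can be sketched in Python as follows. The rows are hypothetical and not taken from movies.jmp; the point is only to show how values are split by the categorical variable before each group is summarized.

```python
from collections import defaultdict
import statistics

# Hypothetical (movie type, worldwide gross in $ millions) rows, NOT from movies.jmp.
rows = [
    ("Action", 310.0), ("Action", 250.0), ("Action", 420.0),
    ("Family", 280.0), ("Family", 350.0), ("Family", 190.0),
    ("Drama", 220.0), ("Drama", 260.0),
]

# Split the continuous variable by the categorical variable.
by_type = defaultdict(list)
for movie_type, gross in rows:
    by_type[movie_type].append(gross)

# Summarize each group separately, as a box plot does visually.
summary = {
    movie_type: {
        "n": len(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
    }
    for movie_type, values in by_type.items()
}

for movie_type, stats in summary.items():
    print(movie_type, stats)
```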
Sometimes you want to analyze only rows that meet some criteria on a
certain variable. For example, to answer
question 4c above you could get statistics on just the movies that grossed $300
million or more worldwide. To select certain rows, go to the Rows menu option – Row Selection – Select Where… Highlight “Worldwide $”. On the pulldown menu under the variable
options, change the setting from “equals” to “is greater than
or equal to”. In the space to the
right, specify what amount you want to select cases for (in this case, type “300”). Click Add
Condition. Click OK.
Now when you get variable summaries, you will only get results for the
selected rows. Careful –
if you click on a random square you will de-select the rows.
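The same "select where" idea can be sketched in Python. This is only an illustration with made-up rows (not from movies.jmp), and the helper share_at_or_above is a hypothetical name. Comparing the share of each movie type at or above the threshold is one possible criterion for "more likely"; it is not the only one.

```python
# Hypothetical rows (gross in $ millions), NOT from movies.jmp.
rows = [
    {"type": "Action", "worldwide": 310.0},
    {"type": "Action", "worldwide": 250.0},
    {"type": "Action", "worldwide": 420.0},
    {"type": "Action", "worldwide": 180.0},
    {"type": "Family", "worldwide": 280.0},
    {"type": "Family", "worldwide": 350.0},
    {"type": "Family", "worldwide": 190.0},
]

THRESHOLD = 300.0  # "is greater than or equal to" 300

# Equivalent of Rows - Row Selection - Select Where...
selected = [row for row in rows if row["worldwide"] >= THRESHOLD]


def share_at_or_above(movie_type):
    """Fraction of movies of one type that meet the threshold."""
    of_type = [row for row in rows if row["type"] == movie_type]
    hits = [row for row in of_type if row["worldwide"] >= THRESHOLD]
    return len(hits) / len(of_type)


action_share = share_at_or_above("Action")
family_share = share_at_or_above("Family")
print(len(selected), action_share, family_share)
```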
5) Describe the relationship between domestic
gross and worldwide gross.
a) What is the general trend of the
relationship? (e.g. positive and linear,
negative and linear, some other pattern, no clear pattern)
b) Are there any outliers or points that do not
fit the pattern?
c) What is the correlation between domestic and
worldwide grosses?
d) Would you describe the relationship as strong
or weak?
There are two
different methods of answering these questions, so you should do both in order
to make sure you understand the value of each.
First
Method: To make a scatterplot, go to the Analyze menu option and select Fit Y by X. Enter the continuous variable for the
vertical axis in the Y, Response box
and the continuous variable for the horizontal axis in the X, Factor box. (Note: for
this lab, it does not really matter which variable you put on the Y axis and which
one you put on the X axis.) Click OK.
On the subsequent screen, select the red arrow next to “Bivariate
Fit of…” and select Fit Line. This plots a line that best fits the
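distribution of the points, and also provides some useful information. Of particular interest here, you can find the
correlation (“R”) by taking the square root of “RSquare” and giving it the sign of the fitted line's slope (positive if the line rises, negative if it falls).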
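A square root is never negative, so the sign of the correlation has to come from the direction of the fitted line. The Python sketch below illustrates this with a small made-up data set that has a negative trend; the function names are hypothetical and the numbers are not from movies.jmp.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data with a negative trend (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 8.0, 7.0, 4.0, 1.0]

r = pearson_r(xs, ys)
r_square = r ** 2
slope = least_squares_slope(xs, ys)

# Recover R from RSquare: square root, then attach the sign of the slope.
recovered_r = math.copysign(math.sqrt(r_square), slope)
print(f"slope={slope:.2f} RSquare={r_square:.4f} R={recovered_r:.4f}")
```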
Second Method: Click on the Analyze menu option and select Multivariate Methods - Multivariate. Enter both variables into the “Variables”
box and click OK. The subsequent screen reports the pairwise
correlations and displays a “Scatterplot Matrix”. Make sure you understand what the matrix is
showing you. There are other options
available if you click on the red arrow next to “Multivariate”
which you should explore.
6) Outliers
can have a strong effect on correlations.
Using the movie you identified in Question 3 above, let’s check to
see if excluding that single case changes the correlations substantially. Did the correlations get stronger or
weaker?
To exclude a
single case, highlight the row number corresponding to the movie you want to
exclude. Then under the Rows menu option select Exclude/Include. Now
re-calculate the correlations in Question 5.
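The with-and-without comparison can be sketched in Python as follows. The data are made up so that one point lies far from the rest; they are not from movies.jmp, and whether excluding a point raises or lowers the correlation depends on where that point sits relative to the overall pattern.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data; the last point is far from the others (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0, 25.0]
excluded_index = 5  # the row to exclude, like Rows - Exclude/Include

r_all = pearson_r(xs, ys)

kept_xs = [x for i, x in enumerate(xs) if i != excluded_index]
kept_ys = [y for i, y in enumerate(ys) if i != excluded_index]
r_without = pearson_r(kept_xs, kept_ys)

print(f"all rows: {r_all:.3f}   without the excluded row: {r_without:.3f}")
```

In this made-up example the single distant point pulls the correlation up, so the correlation drops once it is excluded.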
Tip: It is not acceptable
to exclude outliers from analyses unless you have a scientific reason to do so
(e.g., a data entry error, or maybe the outlying unit is not part of your
target population). Hiding outliers is fudging data to get results you
want. That is dishonest and unethical. When you see outliers, do
analyses with and without them. When the results do not change much, report
the results based on the full data set, and tell your audience that the
results were not sensitive to the outliers. When the results do change
substantially, report both sets of analyses: one with and one without the
outliers. This honestly informs people that your conclusions are not
on very solid ground, because particular data points affect the results
greatly.
Unit 2: Using JMP to Determine Areas under Normal
Curves
This part of the lab shows you
how to use JMP to determine areas under normal curves. First, open a new
data sheet and add one row and two columns. Right click on “Column 2” and select Column Info…
to bring up a dialogue box with column options. Select New Property -
Formula. Next, click Edit Formula. In the subsequent
dialogue box, select Probability - Normal Distribution. In the box
where it says Normal Distribution(x), double click inside the (x) until
you get a window. Now, click on Column 1 in the "Table Columns"
box. Click OK twice to get back to the data table.
Now, you can enter any z-score in Column 1, and JMP will determine the area to
the left of that z-score in Column 2. For example, if you enter 0 in
Column 1, JMP returns a 0.5 for the area in Column 2. If you want areas
to the right of a z-score, just calculate 1 minus the area returned
by JMP. This may be helpful for later analyses.
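The same areas can be checked outside JMP. Below is a minimal Python sketch using the standard library's NormalDist class; the helper names area_left, area_right and area_between are made up for illustration. The area between two z-scores is the area to the left of the larger one minus the area to the left of the smaller one.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return STANDARD_NORMAL.cdf(z)


def area_right(z):
    """Area to the right of z: 1 minus the area to the left."""
    return 1.0 - area_left(z)


def area_between(low, high):
    """Area between two z-scores, with low <= high."""
    return area_left(high) - area_left(low)


print(area_left(0.0))             # 0.5, matching the example in the text
print(area_right(1.96))           # roughly 0.025
print(area_between(-1.96, 1.96))  # roughly 0.95
```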
7) Using JMP:
What is the area to the left of -1.645?
What is the area to the right of 3.14?
What is the area between -1.645 and 3.14?
Unit 3: The Correlation Challenge
8) Click Here. This takes you to a webpage of various
statistical applications. Go to the web
game "Guess the correlations" and play it at least three times. Try competing with a classmate or one of the
TAs. (You don’t need to write
anything down for this part of the lab.)