Statistics 101
Data Analysis and Statistical Inference
 

Instructions for Lab 3


Lab Objective: To explore data with histograms and scatterplots.


Lab Procedures

Unit 1:  Movie Box Office Data

What are the characteristics of U.S. movies that make the most money? The JMP data set Movies.jmp contains data on the 277 top-grossing movies of all time as of 2003. To open the data, click on File - Open, and open the folder Sample Data.  Select Movies.jmp, then click Open.

Tip: The unit of measurement for the two monetary variables is not stated.  That's bad practice. Always include a description of the units somewhere in the file.  Based on knowledge of movie revenues, it is clear that the unit of measurement is $1,000,000.

1) Describe the distribution of U.S. (domestic) grosses.  In your response, say where most values are, note any outliers, state whether the distribution is tightly packed around its mean or more dispersed, and report the mean and standard deviation.

The simplest way to get a visual (and numerical) description of a distribution in JMP is to go to the Analyze menu option and select Distribution. Put the variables of interest into the Y, Columns box and select OK. This will produce a histogram and a box plot, and will also provide quantiles and important statistics (e.g. mean and standard deviation).
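For readers who want to see what those summary numbers are, here is a minimal Python sketch of the same calculations. The grosses below are made-up values for illustration, not data from Movies.jmp, and the quartile method used by Python's statistics module may differ slightly from the one JMP uses.

```python
import statistics

# Hypothetical grosses in $ millions (made up for illustration, NOT from Movies.jmp).
grosses = [120.0, 135.5, 150.0, 162.3, 180.0, 210.4, 250.0, 601.0]

mean = statistics.mean(grosses)
std_dev = statistics.stdev(grosses)      # sample standard deviation (divides by n - 1)
median = statistics.median(grosses)
# Quartiles; the interpolation rule here may not match JMP's exactly.
q1, q2, q3 = statistics.quantiles(grosses, n=4)

print(f"mean = {mean:.2f}, std dev = {std_dev:.2f}")
print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}")
```

Notice that one very large value pulls the mean above the median, which is one reason to look at the histogram and quantiles as well as the mean.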

2) Would you say that a normal curve is a reasonable description of domestic grosses?   How about worldwide grosses? How do you decide whether or not a distribution follows a normal curve? (There are two possible answers to this question.)

You can examine histograms for both variables simultaneously by adding both into the Y, Columns box in Analyze - Distribution.  To make a normal quantile plot, after you’ve run Analyze - Distribution, click on the red arrow next to the variable name and select Normal Quantile Plot.
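The idea behind a normal quantile plot can be sketched in a few lines of Python: sort the data and pair each value with the z-score a normal distribution would put at the same position. The data are made up, the helper name normal_quantile_pairs is hypothetical, and the plotting positions (i - 0.5)/n are one common convention that is assumed here; JMP may use a different one.

```python
from statistics import NormalDist

def normal_quantile_pairs(values):
    """Pair each sorted value with the standard normal quantile at its position."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i - 0.5) / n is an assumed plotting-position convention.
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

# Hypothetical grosses in $ millions (made up for illustration, NOT from Movies.jmp).
grosses = [120.0, 135.5, 150.0, 162.3, 180.0, 210.4, 250.0, 601.0]

pairs = normal_quantile_pairs(grosses)
for z, x in pairs:
    print(f"{z:6.2f}  {x:7.1f}")
```

If the data follow a normal curve, plotting the second column against the first gives points that fall close to a straight line.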

3) Which movie is the clear outlier on both monetary variables?

If you hold the cursor over a single point on the plots, JMP will tell you the corresponding row number. For example, “There’s Something About Mary” is row number 254 on the original data set (if you don’t sort).

4) Worldwide Grosses:

a) What movie rating makes the most money on average worldwide?

b) What two types of movie have the most similar distribution of worldwide grosses? Justify your answer in no more than two sentences, noting what the greatest similarity and greatest difference are.

c) Is a Family movie or an Action movie more likely to gross over $300 million worldwide? What criteria did you use?

To answer these questions, we can use a couple of different tools. We can examine the relationship between worldwide gross and movie type using a box plot.  To get a box plot, click on the Analyze menu option and select Fit Y by X.  Put the continuous variable in the Y, Response box and the categorical variable in the X, Factor box.  Click OK. On the subsequent screen, select the red arrow next to "Oneway Analysis of ...."  and select Quantiles. There are other display options available under that menu, which you should explore a little (for example, in the last lab we also used Means and Std Dev).
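As a rough illustration of what the Quantiles option summarizes, the Python sketch below groups values by category and computes the numbers a box plot is drawn from. The movie types and grosses are made up, and Python's quartile rule may differ slightly from JMP's.

```python
import statistics
from collections import defaultdict

# Hypothetical (type, worldwide gross in $ millions) rows, NOT from Movies.jmp.
rows = [
    ("Action", 310.0), ("Action", 450.0), ("Action", 280.0), ("Action", 520.0),
    ("Family", 290.0), ("Family", 330.0), ("Family", 610.0), ("Family", 305.0),
]

# Collect the grosses for each movie type.
by_type = defaultdict(list)
for movie_type, gross in rows:
    by_type[movie_type].append(gross)

# Per-type summary: the pieces of a box plot, plus the mean.
summary = {}
for movie_type, values in by_type.items():
    q1, median, q3 = statistics.quantiles(values, n=4)
    summary[movie_type] = {
        "min": min(values), "q1": q1, "median": median,
        "q3": q3, "max": max(values), "mean": statistics.mean(values),
    }

for movie_type, stats in summary.items():
    print(movie_type, stats)
```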

Sometimes you want to analyze only rows that meet some criteria on a certain variable. For example, to answer question 4c above you could get statistics on just the movies that grossed $300 million or more worldwide. To select certain rows, go to the Rows menu option – Row Selection – Select Where… Highlight “Worldwide $”. On the pulldown menu under the variable options, change the setting from “equals” to “is greater than or equal to”. In the space to the right, specify the amount you want to select cases for (in this case, type “300”). Click Add Condition. Click OK. Now when you get variable summaries, you will only get results for the selected rows. Be careful: if you click on a random cell, you will de-select the rows.
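The same selection logic can be sketched in Python. The titles, types, and grosses below are made up for illustration, and share_selected is a hypothetical helper showing one possible criterion for question 4c: the fraction of each type that meets the threshold.

```python
# Hypothetical rows (titles, types, and grosses are made up, NOT from Movies.jmp).
rows = [
    {"title": "Movie A", "type": "Action", "worldwide": 450.0},
    {"title": "Movie B", "type": "Action", "worldwide": 280.0},
    {"title": "Movie C", "type": "Action", "worldwide": 310.0},
    {"title": "Movie D", "type": "Family", "worldwide": 295.0},
    {"title": "Movie E", "type": "Family", "worldwide": 610.0},
    {"title": "Movie F", "type": "Family", "worldwide": 150.0},
]

THRESHOLD = 300.0  # $ millions, the value typed into Select Where

# Same condition as "is greater than or equal to 300" in the dialog.
selected = [row for row in rows if row["worldwide"] >= THRESHOLD]

def share_selected(movie_type):
    """Fraction of movies of this type that meet the threshold."""
    total = sum(1 for row in rows if row["type"] == movie_type)
    hits = sum(1 for row in selected if row["type"] == movie_type)
    return hits / total

for movie_type in ("Action", "Family"):
    print(movie_type, round(share_selected(movie_type), 3))
```

Comparing fractions rather than raw counts matters when the two types have different numbers of movies in the data set.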

 

5) Describe the relationship between domestic gross and worldwide gross.

a) What is the general trend of the relationship? (e.g. positive and linear, negative and linear, some other pattern, no clear pattern)

 

b) Are there any outliers or points that do not fit the pattern?

 

c) What is the correlation between domestic and worldwide grosses?

 

d) Would you describe the relationship as strong or weak?

There are two different methods of answering these questions, so you should do both in order to make sure you understand the value of each.

 

First Method: To make a scatterplot, go to the Analyze menu option and select Fit Y by X. Enter the continuous variable for the vertical axis in the Y, Response box and the continuous variable for the horizontal axis in the X, Factor box. (Note: for this lab, it does not really matter which variable you put on the Y axis and which one you put on the X axis.) Click OK. On the subsequent screen, select the red arrow next to “Bivariate Fit of…” and select Fit Line. This plots the line that best fits the points, and also provides some useful information. Of particular interest here, you can find the correlation (“R”) by taking the square root of “RSquare” and giving it the same sign as the slope of the fitted line.
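To see why the square root of RSquare recovers the correlation, here is a small Python sketch of a least-squares line fit. The data are made up for illustration, and fit_line is a hypothetical helper, not a JMP function.

```python
import math

def fit_line(x, y):
    """Least-squares line for y on x; returns (slope, intercept, r_square)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_square

# Hypothetical grosses in $ millions (made up for illustration, NOT from Movies.jmp).
domestic = [100.0, 120.0, 140.0, 160.0, 180.0]
worldwide = [200.0, 260.0, 250.0, 330.0, 340.0]

slope, intercept, r_square = fit_line(domestic, worldwide)
# The square root alone is never negative, so take the sign from the slope.
r = math.copysign(math.sqrt(r_square), slope)

print(f"slope = {slope:.3f}, RSquare = {r_square:.4f}, r = {r:.4f}")
```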

 

Second Method: Click on the Analyze menu option and select Multivariate Methods - Multivariate. Enter both variables into the “Variables” box and click OK. The subsequent screen reports the pairwise correlations and displays a “Scatterplot Matrix”. Make sure you understand what the matrix is showing you. There are other options available if you click on the red arrow next to “Multivariate” which you should explore.

 

 

6) Outliers can have a strong effect on correlations. Using the movie you identified in Question 3 above, let’s check to see if excluding that single case changes the correlations substantially. Did the correlations get stronger or weaker?

 

To exclude a single case, highlight the row number corresponding to the movie you want to exclude. Then under the Rows menu option select Exclude/Include. Now re-calculate the correlations in Question 5.
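The with-and-without comparison can be sketched in Python as follows. The data are made up, with one deliberately extreme point in the last position, and pearson_r is a hypothetical helper. Whether the correlation rises or falls when a point is dropped depends on the data, so do not read the lab answer off this example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical grosses in $ millions (NOT from Movies.jmp); the last row is an outlier.
domestic = [100.0, 120.0, 140.0, 160.0, 180.0, 600.0]
worldwide = [200.0, 260.0, 250.0, 330.0, 340.0, 1800.0]

outlier_index = 5  # position of the row to exclude

r_all = pearson_r(domestic, worldwide)
# Drop the same row from both variables, like Exclude/Include on one row in JMP.
r_without = pearson_r(
    [v for i, v in enumerate(domestic) if i != outlier_index],
    [v for i, v in enumerate(worldwide) if i != outlier_index],
)

print(f"with outlier:    r = {r_all:.4f}")
print(f"without outlier: r = {r_without:.4f}")
```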

Tip:   It is not acceptable to exclude outliers from analyses unless you have a scientific reason to do so (e.g., a data entry error, or maybe the outlying unit is not part of your target population).  Hiding outliers is fudging data to get results you want.  That is dishonest and unethical.  When you see outliers, do analyses with and without them.  When the results do not change much, report the results based on the full data  set, and tell your audience that the results were not sensitive to the outliers.  When the results do change substantially, report both sets of analyses: one with and one without the outliers.  This honestly informs people that your conclusions are not on very solid ground, because particular data points affect the results greatly.

 

 


Unit 2:  Using JMP to Determine Areas under Normal Curves

This part of the lab shows you how to use JMP to determine areas under normal curves.  First, open a new data sheet and add one row and two columns.  Right click on “Column 2” and select Column Info… to bring up a dialogue box with column options.  Select New Property - Formula.  Next, click Edit Formula.  In the subsequent dialogue box, select Probability--Normal distribution.  In the box where it says Normal Distribution(x), double click inside the (x) until you get a window.  Now, click on Column 1 in the "Table Columns" box.  Click OK twice to get back to the data table.

Now, you can enter any z-score in Column 1, and JMP will determine the area to the left of that z-score in Column 2.  For example, if you enter 0 in Column 1, JMP returns 0.5 for the area in Column 2.  If you want the area to the right of a z-score, just calculate 1 minus the area returned by JMP.  This may be helpful for later analyses.
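If you want to check your JMP setup against another tool, the same areas can be computed with Python's standard library. This is a minimal sketch; the helper names are hypothetical, and the z-scores are arbitrary examples.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def area_left(z):
    """Area under the standard normal curve to the left of z."""
    return std_normal.cdf(z)

def area_right(z):
    """Area to the right of z: 1 minus the area to the left."""
    return 1 - std_normal.cdf(z)

def area_between(a, b):
    """Area between a and b (a < b): left area at b minus left area at a."""
    return std_normal.cdf(b) - std_normal.cdf(a)

print(area_left(0))            # 0.5, matching the JMP example above
print(area_right(1.0))
print(area_between(-1.0, 1.0))
```

The area between two z-scores is a difference of two left-hand areas, which is why a single "area to the left" column is enough for all three kinds of question.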

7) Using JMP:

What is the area to the left of -1.645?
What is the area to the right of 3.14?
What is the area between -1.645 and 3.14?



Unit 3:  The Correlation Challenge

8) Click Here. This takes you to a webpage of various statistical applications. Go to the web game "Guess the correlations" and play it at least three times. Try competing with a classmate or one of the TAs. (You don’t need to write anything down for this part of the lab.)