Getting Started

A brief outline of getting started is shown below. See the Lab 01 Instructions for more details about the steps.

  • Go to the sta210-sp20 organization on GitHub (https://github.com/sta210-sp20). Click on the repo with the prefix hw-04. It contains the starter documents you need to complete the assignment.
  • Clone the repo in RStudio Cloud.

Tips

Here are some tips as you complete HW 04:

  • Periodically knit your document and commit your changes. The more often the better; for example, commit once per new task.
  • Push all your changes back to your GitHub repo. The Git pane in RStudio should be empty after you push. You can also check your assignment repo on github.com to make sure it has updated.
  • Once you have completed your work, you will submit your assignment in Gradescope. You are welcome to resubmit multiple times until the assignment deadline. We will grade the most recent version of the .pdf file in Gradescope.

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr) 
library(pROC)
library(plotROC)

# include other packages as needed

Questions

Part 1: Conceptual

The Conceptual section of homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences. (The questions in this section are from Broadening Your Statistical Horizons.)

  1. In an article by Roskes et al. (2011), the authors report on the success rate of penalty kicks that were on-target, so that either the keeper saved the shot or the shot scored, for FIFA World Cup shootouts between 1982 and 2010. They found that 18 out of 20 shots were scored when the goalkeeper’s team was behind, 71 out of 90 shots were scored when the game was tied, and 55 out of 75 shots were scored with the goalkeeper’s team ahead.

    • Calculate the odds of a successful penalty kick for games in which the goalkeeper’s team was (i) behind, (ii) tied, or (iii) ahead.

    • Calculate the odds ratios for successful penalty kicks for (i) behind versus tied, and (ii) tied versus ahead. Note: The odds ratio between events A and B can be calculated as \(\phi = \frac{\omega_A}{\omega_B}\), where \(\omega_i\) is the odds of event \(i\).
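
As a hedged illustration of the odds and odds-ratio definitions in the note above, the R sketch below uses two hypothetical helper functions, `odds()` and `odds_ratio()`, with made-up counts (not the penalty-kick data). It is one possible way to organize the arithmetic, not a required approach.

```r
# Hypothetical helper: odds of success from counts of successes and trials.
# odds = p / (1 - p), which simplifies to successes / failures.
odds <- function(successes, trials) {
  successes / (trials - successes)
}

# Hypothetical helper: odds ratio of event A versus event B.
odds_ratio <- function(odds_a, odds_b) {
  odds_a / odds_b
}

# Made-up counts for illustration only: 30 of 40 and 20 of 40 successes.
odds_a <- odds(30, 40)          # 30 successes / 10 failures
odds_b <- odds(20, 40)          # 20 successes / 20 failures
or_ab  <- odds_ratio(odds_a, odds_b)
or_ab
```

An odds ratio above 1 means the odds of success are higher for event A than for event B; a value below 1 means they are lower.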


Use the following scenario and model for questions 2 - 7.

An article in the Journal of Animal Ecology by Bishop (1972) investigated whether moths provide evidence of “survival of the fittest” with their camouflage traits. Researchers glued equal numbers of light and dark morph moths in lifelike positions on tree trunks at 7 locations from 0 to 51.2 km from Liverpool. They then recorded the numbers of moths removed after 24 hours, presumably by predators. The hypothesis was that, since tree trunks near Liverpool were blackened by pollution, light morph moths would be more likely to be removed near Liverpool. The following variables are used in this analysis:

  • morph = light or dark
  • distance = kilometers from Liverpool
  • placed = number of moths of a specific morph glued to trees at that location
  • removed = number of moths of a specific morph removed after 24 hours
  • log_odds_removed = log odds of being removed

The model with log_odds_removed as the response and distance, morph, and the interaction distance*morph as the predictor variables is shown below.

term                 estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)            -1.123      0.240     -4.687    0.001    -1.657     -0.589
distance                0.018      0.007      2.437    0.035     0.002      0.035
morphlight              0.374      0.339      1.103    0.296    -0.381      1.129
distance:morphlight    -0.028      0.011     -2.612    0.026    -0.051     -0.004

  2. Use the model to interpret the following in terms of the log odds of a moth being removed:

    • Coefficient of distance for a dark moth
    • Coefficient of morphlight given the tree is in Liverpool (distance = 0)
  3. Use the model to interpret the following in terms of the odds of a moth being removed:

    • Coefficient of distance for a dark moth
    • Coefficient of morphlight given the tree is in Liverpool (distance = 0)
  4. Use the model to interpret the following in terms of the odds of a moth being removed:

    • 95% confidence interval for the coefficient of distance for a dark moth
    • 95% confidence interval for the coefficient of morphlight given the tree is in Liverpool (distance = 0)
  5. Use the model to interpret the intercept in terms of the odds of a moth being removed.

  6. Use the model to calculate the predicted odds of being removed for a light moth that is glued to the trunk of a tree that is 6.2 km from Liverpool.

  7. Calculate the predicted probability that the moth described in the previous question is removed.
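
The questions above move between the log-odds, odds, and probability scales. The R sketch below illustrates those conversions under stated assumptions: the coefficients `b0` through `b3`, the helper `predict_log_odds()`, and the interval values are all hypothetical and are not the moth model estimates.

```r
# Hypothetical coefficients for a model of the form
#   log-odds = b0 + b1 * x + b2 * group + b3 * x * group   (group is 0 or 1).
# These values are made up and are NOT the estimates from the moth model.
b0 <- -1.0
b1 <- 0.5
b2 <- 0.2
b3 <- -0.1

# Hypothetical helper: predicted log-odds for a given x and group indicator.
predict_log_odds <- function(x, group) {
  b0 + b1 * x + b2 * group + b3 * x * group
}

log_odds  <- predict_log_odds(x = 2, group = 1)
pred_odds <- exp(log_odds)                # odds = exp(log-odds)
pred_prob <- pred_odds / (1 + pred_odds)  # probability = odds / (1 + odds)

# A coefficient and its confidence limits move to the odds scale by
# exponentiating each value (hypothetical interval shown).
odds_scale <- exp(c(estimate = 0.5, conf.low = 0.1, conf.high = 0.9))
odds_scale
```

Because of the interaction term, the slope for `x` in this sketch is `b1` when `group` is 0 and `b1 + b3` when `group` is 1, which is why an interpretation has to state which group it refers to.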

Part 2: Data Analysis

The Data Analysis section of homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. This means that in addition to addressing the question there should also be exploratory data analysis and an analysis of the model assumptions. In short, these questions should be treated as “mini-projects”.

For this portion of the homework, we will look at the email.csv located in the data folder. Click here to read more about the data set and the variable definitions. The goal of this analysis is to create a simple spam filter that uses characteristics of an email to determine if an email is considered spam. We will use the following variables in the analysis:

  • spam: Indicator for whether the email was spam.
  • to_multiple: Indicator for whether the email was addressed to more than one recipient.
  • num_char: The number of characters in the email, in thousands.
  • number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

Be sure each variable is the correct type in R. If needed, recode the variables so they are the correct type. Then, include the following in your analysis:

  • Univariate and bivariate exploratory data analysis.
    • Briefly describe the relationship between the response variable spam and each of the predictor variables.
  • Fit a model with to_multiple and num_char as the predictor variables and display the model output.
  • Use a drop-in-deviance test to determine if number should be included in the model. Display the output from the test and write your conclusion in the context of the problem.
  • Check the assumptions for the model you chose based on the drop-in-deviance test.
  • Make the ROC curve and calculate the AUC for the model. Be sure to display at least 5 thresholds (n.cuts = 5) on the curve.
  • Let’s suppose you’re a data scientist developing a spam filter. Based on your model, what threshold will you use to determine it would be reasonable to put an email in the spam folder (which the user is unlikely to check)? What trade-offs did you consider to make your decision?
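
The drop-in-deviance test and the threshold trade-off in the list above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the email data: the variables `y`, `x1`, `x2`, and `grp`, the data frame `sim`, and the helper `classify_at()` are all hypothetical. It does not draw the ROC curve itself.

```r
# Simulated data for illustration only; the variable names are hypothetical
# stand-ins and this is NOT the email data.
set.seed(210)
n   <- 500
x1  <- rbinom(n, 1, 0.3)                                    # binary predictor
x2  <- rnorm(n)                                             # numeric predictor
grp <- factor(sample(c("a", "b", "c"), n, replace = TRUE))  # 3-level factor
eta <- -1 + 0.8 * x1 - 0.5 * x2 + 0.7 * (grp == "c")
y   <- rbinom(n, 1, plogis(eta))                            # binary response
sim <- data.frame(y, x1, x2, grp)

# Nested logistic regression models: reduced vs. full (adds grp).
reduced <- glm(y ~ x1 + x2, data = sim, family = binomial)
full    <- glm(y ~ x1 + x2 + grp, data = sim, family = binomial)

# Drop-in-deviance (likelihood ratio) test comparing the nested models.
dev_test <- anova(reduced, full, test = "Chisq")
dev_test

# Sensitivity and specificity of the full model at a given threshold.
pred_prob <- predict(full, type = "response")
classify_at <- function(threshold) {
  pred <- as.integer(pred_prob >= threshold)
  c(threshold   = threshold,
    sensitivity = mean(pred[sim$y == 1] == 1),
    specificity = mean(pred[sim$y == 0] == 0))
}

# Compare a few candidate thresholds side by side.
thresholds <- t(sapply(c(0.25, 0.5, 0.75), classify_at))
thresholds
```

Raising the threshold can only lower or hold sensitivity and raise or hold specificity, so the choice depends on which error is costlier. For a spam folder the user rarely checks, sending a legitimate email to spam is usually the more serious mistake.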

Submitting Your Assignment

Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered on time.

See Submitting the Assignment for more details on how to submit the assignment on Gradescope.

Grading

Total                                         50
Part 1: Conceptual                            20
Part 2: Data Analysis                         25
Document neatly organized with clear headers   3
At least 3 informative commit messages         2

Acknowledgement

The questions in Part 1 are modified from exercises in Chapter 6 of Broadening Your Statistical Horizons.

The questions and data in Part 2 are modified from exercises in Chapter 9 of OpenIntro Statistics, 4th ed.

References

Bishop, J. A. 1972. “An Experimental Study of the Cline of Industrial Melanism in Biston Betularia (L.) (Lepidoptera) Between Urban Liverpool and Rural North Wales.” Journal of Animal Ecology 41 (1): 209–43.

Roskes, Marieke, Daniel Sligte, Shaul Shalvi, and Carsten K. W. De Dreu. 2011. “The Right Side? Under Time Pressure, Approach Motivation Leads to Right-Oriented Bias.” Psychological Science 22 (11): 1403–7.