Getting Started

Go to the STA210-Sp19 organization on GitHub (https://github.com/STA210-Sp19). Click on the repo with the prefix hw-05. It contains the starter documents you need to complete the assignment.
Clone the repo in RStudio Cloud.
Configure git using the use_git_config() function. You can also cache your password.
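
A minimal sketch of the configuration step, assuming the usethis package is installed in your RStudio Cloud project. The name, email address, and one-hour cache timeout are placeholders, not values prescribed by the course.

```r
# Assumes the usethis package is available; replace the placeholders
# with the name and email address tied to your GitHub account.
library(usethis)
use_git_config(user.name = "Your Name", user.email = "your.email@example.com")

# Optional: cache your GitHub password for one hour (3600 seconds).
# The timeout is an arbitrary example value.
system("git config --global credential.helper 'cache --timeout 3600'")
```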

Packages

Fill in the packages you need to complete the assignment in your R Markdown document.

Questions

Part 1: Computations & Concepts

The Computations & Concepts section of the homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences.

1. In an article by Roskes et al. (2011), the authors report on the success rate of penalty kicks that were on-target, so that either the keeper saved the shot or the shot scored, for FIFA World Cup shootouts between 1982 and 2010. They found that 18 out of 20 shots were scored when the goalkeeper’s team was behind, 71 out of 90 shots were scored when the game was tied, and 55 out of 75 shots were scored with the goalkeeper’s team ahead.
- Calculate the odds of a successful penalty kick for games in which the goalkeeper’s team was (i) behind, (ii) tied, or (iii) ahead.
- Calculate the odds ratios for successful penalty kicks for (i) behind versus tied, and (ii) tied versus ahead.
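
As a reminder of the arithmetic, the odds of an event are the number of successes divided by the number of failures, and an odds ratio is the quotient of two such odds. The sketch below uses made-up counts and hypothetical group names (group_a, group_b), not the penalty-kick data.

```r
# Hypothetical counts for illustration only (not the penalty-kick data).
successes <- c(group_a = 30, group_b = 45)
attempts  <- c(group_a = 40, group_b = 75)

# Odds = successes / failures, which is the same as p / (1 - p).
failures <- attempts - successes
odds <- successes / failures

# Odds ratio comparing group_a to group_b.
odds_ratio <- unname(odds["group_a"] / odds["group_b"])

odds        # group_a = 3, group_b = 1.5
odds_ratio  # 2
```

An odds ratio above 1 means the first group has higher odds of success than the second.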
2. Below is an abstract for the paper Day Care Centers and Respiratory Health (Nafstad et al. 1999). Interpret the two adjusted odds ratios reported in the Results (1.89 and 1.55) in the context of the research study.

Objective. To estimate the effects of the type of day care on respiratory health in preschool children.

Methods. A population-based cross-sectional study of Oslo children born in 1992 was conducted at the end of 1996. A self-administered questionnaire inquired about day care arrangements, environmental conditions, and family characteristics (n = 3853; response rate, 79%).

Results. In a logistic regression controlling for confounding, children in day care centers had more often nightly cough (adjusted odds ratio, 1.89; 95% confidence interval 1.34-2.67), and blocked or runny nose without common cold (1.55; 1.07-1.61) during the past 12 months compared with children in home care…
3. An article in the Journal of Animal Ecology by Bishop (1972) investigated whether moths provide evidence of “survival of the fittest” with their camouflage traits. Researchers glued equal numbers of light and dark morph moths in lifelike positions on tree trunks at 7 locations from 0 to 51.2 km from Liverpool. They then recorded the numbers of moths removed after 24 hours, presumably by predators. The hypothesis was that, since tree trunks near Liverpool were blackened by pollution, light morph moths would be more likely to be removed near Liverpool. The following variables are used in this analysis:
- morph = light or dark
- distance = kilometers from Liverpool
- placed = number of moths of a specific morph glued to trees at that location
- removed = number of moths of a specific morph removed after 24 hours
- log_odds_removed = log odds of being removed

The model with log_odds_removed as the response and distance, morph, and their interaction as the explanatory variables is shown below.

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-1.123	0.240	-4.687	0.001	-1.657	-0.589
distance	0.018	0.007	2.437	0.035	0.002	0.035
morphlight	0.374	0.339	1.103	0.296	-0.381	1.129
distance:morphlight	-0.028	0.011	-2.612	0.026	-0.051	-0.004

Interpret all the coefficients in the model (including the intercept) in terms of the log odds of being removed.

4. Using the model in Question 3, interpret all the coefficients in the model (including the intercept) in terms of the odds of being removed.
5. Use the model in Question 3 to calculate the predicted odds of being removed for a dark moth that is glued to the trunk of a tree that is 7.2 km from Liverpool.
6. Use the model in Question 3 to calculate the predicted probability of being removed for a light moth that is glued to the trunk of a tree that is 41.5 km from Liverpool.
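
The conversions between log odds, odds, and probability in a model with an interaction can be sketched as follows. The coefficients and inputs here are hypothetical values chosen for illustration, not the estimates in the table, and predict_log_odds is a helper name introduced for this example.

```r
# Hypothetical coefficients for illustration only; substitute the estimates
# from the model output when answering the questions.
b <- c(intercept = -1.0, distance = 0.02, morphlight = 0.4, interaction = -0.03)

# Predicted log odds; light = 1 for a light morph, 0 for a dark morph (baseline).
predict_log_odds <- function(distance, light) {
  unname(b["intercept"] + b["distance"] * distance +
           b["morphlight"] * light + b["interaction"] * distance * light)
}

# With an interaction, each morph has its own distance slope on the log-odds scale.
slope_dark  <- unname(b["distance"])
slope_light <- unname(b["distance"] + b["interaction"])

# Exponentiating a slope gives the multiplicative change in odds per extra km.
odds_multiplier_dark  <- exp(slope_dark)
odds_multiplier_light <- exp(slope_light)

# Example: a light moth 10 km away (hypothetical inputs).
log_odds <- predict_log_odds(distance = 10, light = 1)
odds <- exp(log_odds)        # odds = exp(log odds)
prob <- odds / (1 + odds)    # probability = odds / (1 + odds)
```

Base R's plogis(log_odds) returns the same probability in one step.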

Part 2: Data Analysis

The Data Analysis section of the homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. This means that in addition to addressing the question there should also be exploratory data analysis and an analysis of the model assumptions. In short, these questions should be treated as “mini-projects”.

7. Data in NBA1718team.csv (“NBA Enhanced Box Scores and Standings” 2018) looks at factors that are associated with a team’s winning percentage. You will use the data to build a model that could be used to predict a team’s win percentage. You will use the following variables in the dataset:
- win_pct = Win percentage,
- FT_pct = Average Free Throw Percentage per game,
- TOV = Average Turnovers per game,
- FGA = Average Field Goal Attempts per game,
- FG = Average Field Goals Made per game,
- attempts_3P = Average 3 Point Attempts per game,
- avg_3P_pct = Average 3 Point Percentage per game,
- PTS = Average Points per game,
- OREB = Average Offensive Rebounds per game,
- DREB = Average Defensive Rebounds per game,
- REB = Average Rebounds per game,
- AST = Average Assists per game,
- STL = Average Steals per game,
- BLK = Average Blocks per game,
- PF = Average Fouls per game,
- attempts_2P = Average 2 Point Attempts per game

Download NBA1718team.csv and upload it into the data folder in your RStudio Cloud project.

Given the structure of the data, we will use the percentage of wins (rather than Win: Yes/No) as the response variable. There are a few ways to build a model for this type of response variable; however, we will approach it by using a logit transformation on the response, i.e. the log-odds.
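
A minimal sketch of the logit transformation, using made-up proportions rather than the NBA data. It assumes the values are proportions strictly between 0 and 1. If win_pct is stored on a 0 to 100 scale, divide by 100 first.

```r
# Hypothetical win proportions for illustration (strictly between 0 and 1).
win_prop <- c(0.25, 0.50, 0.75)

# Log odds (logit) of winning: log(p / (1 - p)).
log_odds_win <- log(win_prop / (1 - win_prop))

# qlogis() is base R's built-in logit and gives the same values.
same_as_qlogis <- isTRUE(all.equal(log_odds_win, qlogis(win_prop)))

# plogis() inverts the transformation and recovers the proportions.
back_transformed <- plogis(log_odds_win)
```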

Use win_pct to create a new variable that is the log odds of winning. Fit a linear model with this new variable as the response.
Use the step function to conduct backward selection to help choose a final model that can be used to predict win percentage. You want to define the model in such a way that the intercept has a reasonable and meaningful interpretation. Be sure to output the coefficients of the final model along with its AIC value. Note: The model selected by backward selection may not be the final model. Consider all aspects of model selection and make any adjustments as necessary.
Based on this model, discuss which factors are important for predicting an NBA team’s win percentage.
What are the predicted odds of winning (and corresponding interval) for a team with average box statistics (e.g. avg free throw percentage, etc.)?
What is the predicted probability of winning for a team with average box statistics? You don’t need to calculate a confidence interval.
Based on your model, describe what statistics a team would need to have a high win percentage.
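
One possible workflow for the steps above is sketched here on simulated data. The predictors x1, x2, and x3 are hypothetical stand-ins, not the NBA variables. Mean-centering is one way (an assumption of this sketch, not necessarily the intended approach) to make the intercept the expected log odds for a team with average statistics.

```r
set.seed(210)

# Simulated stand-in data; x1..x3 are hypothetical, not the NBA variables.
n <- 30
sim <- data.frame(x1 = rnorm(n, 50, 5), x2 = rnorm(n, 14, 2), x3 = rnorm(n, 10, 1))
sim$log_odds <- 0.08 * (sim$x1 - 50) - 0.10 * (sim$x2 - 14) + rnorm(n, sd = 0.2)

# Mean-center each predictor so the intercept is the expected log odds
# when every predictor is at its average.
predictors <- c("x1", "x2", "x3")
sim[predictors] <- lapply(sim[predictors], function(x) x - mean(x))

# Backward selection starting from the full model.
full_model <- lm(log_odds ~ x1 + x2 + x3, data = sim)
final_model <- step(full_model, direction = "backward", trace = 0)

coef(final_model)
AIC(final_model)  # differs from the AIC printed by step() by an additive constant

# Prediction at the average (all centered predictors equal 0) with a
# 95% confidence interval on the log-odds scale.
avg <- data.frame(x1 = 0, x2 = 0, x3 = 0)
pred_log_odds <- predict(final_model, newdata = avg, interval = "confidence")

pred_odds <- exp(pred_log_odds)                      # odds with interval
pred_prob <- unname(plogis(pred_log_odds[, "fit"]))  # probability
```

Exponentiating the interval endpoints is valid because exp() is monotone, so the interval on the log-odds scale maps directly to an interval for the odds.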

As usual, be sure to include exploratory data analysis before the model selection and appropriate analysis of the model fit and assumptions.

Grading

Total	80
Questions 1 - 6	40
Question 7	30
Documents complete and neatly organized (Markdown and knitted documents)	5
Answers written in complete sentences	3
Regular and informative commit messages	2

Acknowledgement

The questions from this assignment are modified from exercises in Chapter 6 of Broadening Your Statistical Horizons.

References

Roskes, Marieke, Daniel Sligte, Shaul Shalvi, and Carsten K. W. De Dreu. 2011. “The Right Side? Under Time Pressure, Approach Motivation Leads to Right-Oriented Bias.” Psychological Science, 22 (11): 1403–7. doi:10.1177/0956797611418677.
Nafstad, Per, Jorgen A. Hagen, Leif Oie, Per Magnus, and Jouni J. K. Jaakkola. 1999. “Day Care Centers and Respiratory Health.” Pediatrics, 103 (4): 753–58. http://pediatrics.aappublications.org/content/103/4/753.
“NBA Enhanced Box Scores and Standings.” 2018. Accessed August 1. https://www.kaggle.com/pablote/nba-enhanced-stats.

HW 05: Logistic Regression

due Mon, Mar 25 at 11:59p