Getting Started

A brief outline of getting started is shown below. See the Lab 01 Instructions for more details about the steps.

  • Go to the sta210-sp20 organization on GitHub (https://github.com/sta210-sp20). Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the lab.
  • Clone the repo in RStudio Cloud

Tips

Here are some tips as you complete HW 01:

  • Periodically knit your document and commit changes (the more often the better, for example, once per each new task)
  • Push all your changes back to your GitHub repo. The Git pane in RStudio should be empty after you push. You can also check your assignment repo on github.com to make sure it has updated.
  • Once you have completed your work, you will submit your assignment in Gradescope. You are welcome to resubmit multiple times until the assignment deadline. We will grade the most recent version of the .pdf file in Gradescope.

Packages

We will use the following packages in this assignment:

library(tidyverse)
library(broom)
library(knitr)
library(openintro)
library(MASS)

If you need to install any of the packages, type install.packages("package_name") in the console, where package_name is the package you need to install.

Questions

Part 1: Conceptual

The Conceptual section of homework contains short answer questions about the concepts discussed in class. Some of these questions may also require short chunks of code to produce the output needed to answer the question. Answers should be written in complete sentences.

For Questions 1 -2, we will be using the smoking dataset in the openintro R package. The data was originally provided by https://www.stem.org.uk/.

This dataset contains smoking habits and other demographic information for 1,691 randomly selected survey respondents in the United Kingdom (UK). Click here for a full list of the variables in the data. We will use the following variables:

  • smoke: Yes: respondent smokes regularly, No: respondent does not smoke regularly
  • age: Age in years
  1. According to a 2018 article by the BBC, the average age of a resident in the UK is 40. Is the average age of smokers in the UK significantly different from the average age of all UK residents?

    Hint: Before you begin, create a subset that only contains smokers.

    1. State the null and alternative hypotheses both in words and using statistical notation. You can use the syntax $mu$ to produce \(\mu\) when typing the hypotheses.*
    2. Are the conditions met to conduct a t-test? Include discussion about each condition, including your reasoning, and any supporting graphs, code, or output, as needed.
    3. Use the appropriate equation to calculate the test statistic. You can use R as a “calculator” by typing the equation in a R code chunk. Write the definition of the test statistic in the context of this problem.
    4. What are the degrees of freedom associated with this test statistic?
    5. Calculate the p-value and use it to state your conclusion in the context of this problem.
    6. If you were to construct a 95% confidence interval that corresponded to this hypothesis test, would you expect 40 to be in the interval? Briefly explain why or why not.
  2. Is there a significant difference in the average age of people in the UK who smoke versus those who don’t? To answer this question, will use a t-test to compare the mean age of those who smoke versus those who don’t.

    1. State the hypotheses both in words and using statistical notation.
    2. Are the conditions met to conduct a t-test for the difference in means? Include discussion about each condition, including your reasoning, and any supporting graphs, code, or output, as needed.
    3. Use the t.test function to test the hypotheses from part(a). Write the definition of the test statistic in the context of this problem.
    4. Write the definition of the p-value in the context of this problem.
    5. Use the p-value to make a conclusion in the context of this problem.
    6. Interpet the 95% confidence interval in the context of this problem.

Part 2: Data Analysis

The Data Analysis section of homework contains open-ended data analysis questions. Your response should be neatly organized and read as a complete narrative. This means that in addition to addressing the question(s) stated below, you should include exploratory data analysis and check the appropriate model assumptions. In short, these questions should be treated as “mini-projects”.

When veterinarians prescribe heart medicine for cats, the dosage often needs to be calibrated to the weight of the cat’s heart. It is very difficult to measure the heart’s weight, so veterinarians need a way to estimate it. One way to estimate it is using a cat’s body weight which is more feasible to obtain (though still difficult depending on the cat!).

We would like to fit a regression model that can be used by veterinarians to describe the relationship between a cat’s body weight and heart weight. You will use the cats dataset in the MASS package to complete the analysis. The cats dataset includes the following variables:

  • Sex: F: Female, M: Male
  • Bwt: Body weight in kilograms (kg)
  • Hwt: Heart weight in grams (g)

Be sure to include the following in your analysis:

  • Univariate and bivariate exploratory data analysis
  • Output of the regression model
  • Output and interpretation of \(R^2\)
  • Check of model assumptions

Tips

  • Be sure to include all relevant code and resulting output.

  • All plots should have proper labels for the axes and an informative title.

Submitting Your Assignment

Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered “on time”.

To submit your assignment:

  • Go to http://www.gradescope.com and click Log in in the top right corner.

  • Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.

  • Click on the STA 210 Regression Analysis course.

  • Click on the assignment, and you’ll be prompted to submit it.

    • If asked, login to your GitHub account.
    • If asked, click to request access to the “sta210-sp20” GitHub organization.
  • Select your assignment repo and choose “master” for the branch.

  • Click Upload. You should receive an email to confirm that the assignment has been submitted.

Grading

Total 50
Part 1: Conceptual Problems 25
Part 2: Data Analysis 20
Document has clear question headers and narrative written in complete sentences 3
At least 3 informative commit messages 2

Note there is a 10% penalty if the .pdf file is incomplete and the .Rmd file has to be knitted for grading.

Notes

*See Getting Started with LaTex and LaTex Symbols Guide for more information about typing statistical notation using LaTex.