Lab 03: Inference for Simple Linear Regression

due Tue, Feb 4 at 11:59p

The primary goal of today’s lab is to practice statistical inference and prediction for simple linear regression. Additionally, you will continue developing your data visualization and data wrangling skills in R and getting used to the team workflow.

Getting Started

Each of your assignments will begin with the following steps.

Clone Assignment Repo

Clone the repo & start new RStudio project

Configure git

If you are unable to push to GitHub, it may be because you need to configure git. Follow the steps below to configure git.

To do so, you will use the use_git_config() function from the usethis package.

Type the following lines of code in the console in RStudio filling in your name and email address.

The email address is the one tied to your GitHub account.

library(usethis)
use_git_config(user.name = "GitHub username", user.email="your email")

For example, mine would be

library(usethis)
use_git_config(user.name="matackett", user.email="maria.tackett@duke.edu")

If you get the error message

Error in library(usethis) : there is no package called ‘usethis’

then you need to install the usethis package. Run the following code in the console to install the package. Then, rerun the use_git_config function with your GitHub username and the email address associated with your GitHub account.

install.packages("usethis")

Once you run the configuration code, your values for user.name and user.email will display in the console. If your user.name and user.email are correct, you’re good to go! Otherwise, run the code again with the necessary changes.

Packages

We will use the following packages in today’s lab.

library(tidyverse)
library(knitr)
library(broom)
library(openintro) 

#fill in any other packages you use in the lab

Data

Though Starbucks is most famous for its ever-growing selection of coffee drinks, it has quite the selection of food as well. In today’s lab, we will analyze the nutritional data of 77 food items sold at Starbucks. The data was originally obtained from the Starbucks menu in 2011; however, many of the items are still available today.

The data is available in the starbucks dataset from the openintro package. It contains the following variables:

Variable    Description
item        Name of food item
calories    Total number of calories
fat         Total fat (in grams)
carb        Total carbohydrates (in grams)
fiber       Total fiber (in grams)
protein     Total protein (in grams)
type        Food category (e.g. bakery, sandwich, salad, etc.)
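
As a quick check before starting the exercises, the variables listed above can be previewed in the console. This is only a sketch; it assumes the tidyverse and openintro packages are installed and that the dataset is named starbucks, as described above.

```r
library(tidyverse)
library(openintro)

# Preview each variable, its type, and its first few values
glimpse(starbucks)

# Number of food items (rows) in the data
n_items <- nrow(starbucks)
n_items
```

The output of glimpse() should list the seven variables from the table, one per line.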

Exercises

When you walk to the counter at Starbucks, you’ll notice the large display of its most popular food items. Often, the number of calories is shown in the display but no other nutritional information is visible. Therefore, we’d like to use the total calories to estimate other nutritional values for a food item. Today we will focus on using calories to estimate the total carbohydrates (carb).

Exploratory Data Analysis

Team Member 1: Your turn to type! Team Member 1 should be a different person than last lab.

  1. What is the predictor variable? What is the response variable?

  2. Let’s begin by examining the distribution of the predictor variable. Make a histogram to display the distribution of the predictor variable. Describe the shape of the distribution.

Note: Include an informative title and informative labels for the x and y axes. This applies to all plots in the lab.

  3. Use the summarise function to calculate measures of center and spread for the predictor variable. Only include the measures of center and spread that are appropriate for describing the distribution of the variable.

See the dplyr reference page for more information about the summarise function.

Below is example code for finding the minimum value of the response.
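
The example code does not appear in this copy of the lab, so the following is a minimal sketch of what it might look like. It assumes the starbucks data from the openintro package and treats carb as the response; the names min_carb and min_carb_tbl are chosen here for illustration.

```r
library(tidyverse)
library(openintro)

# Minimum value of the response variable (carb);
# min_carb is a column name made up for this example
min_carb_tbl <- starbucks %>%
  summarise(min_carb = min(carb))

min_carb_tbl
```

The same pattern extends to other summary statistics by adding more name = function(variable) pairs inside summarise().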

  4. Next, we want to examine the distribution of the response variable. We’ve primarily used histograms to visualize quantitative data, but let’s try something new! Plot the distribution of the response variable using an appropriate plot that is not a histogram. You can use the ggplot reference page to help generate ideas and see example code.

Team Member 1: Knit, commit, and push your work.

All team members: Pull to see the updated .Rmd and .pdf files.

Team Member 2: Your turn to type!

  5. Briefly describe the type of plot you chose and what features of the distribution are visible using that plot. For example, the description of a histogram may be,

On a histogram, the range of values is divided into bins of equal width, and the number of observations in each bin is shown. From a histogram, one can see the shape of the data. One can also get an idea of the approximate center and spread of the data.

  6. Compare the features of the distribution that are visible on the plot you chose versus a histogram. Which plot do you think is more effective for visualizing the distribution of a quantitative variable? Briefly explain your choice.

  7. Make a plot displaying the relationship between the response variable and predictor variable. Describe the relationship between the two variables.

  8. From the plot in the previous question, what assumption for regression might be violated? Briefly explain your reasoning. Note: We still need to examine the residuals before making a final determination about the model assumptions; however, we can start to get intuition during the exploratory data analysis.

Regression

Team Member 2: Knit, commit, and push your work.

All team members: Pull to see the updated .Rmd and .pdf files.

Team Member 3: Your turn to type!

  9. Fit the regression model and display the output including the 95% confidence interval for the slope. Write the model equation. Use words/variable names when you write the equation (not “x” and “y”).

  10. Below are plots of the residuals needed to check the model assumptions. Recall the assumption you mentioned in Exercise 8. Which plot will you use to assess that assumption? What is your conclusion about whether this model assumption is satisfied? Briefly explain your reasoning.

  11. Comment on the remaining assumptions for simple linear regression. State whether each assumption is satisfied and explain your reasoning.
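
The general workflow for the exercises in this section (fit a model, display the coefficients with a confidence interval, then look at the residuals) can be sketched as follows. To avoid giving away the lab answers, this illustration uses R's built-in mtcars data rather than the Starbucks data; the object names example_model, example_tidy, and example_aug are made up for this sketch.

```r
library(tidyverse)
library(broom)
library(knitr)

# Illustration only: mtcars is a built-in dataset, not the lab data.
# Fit a simple linear regression of mpg on wt.
example_model <- lm(mpg ~ wt, data = mtcars)

# Coefficient table with a 95% confidence interval for each term
example_tidy <- tidy(example_model, conf.int = TRUE, conf.level = 0.95)
kable(example_tidy, digits = 3)

# Add fitted values (.fitted) and residuals (.resid) to the data
example_aug <- augment(example_model)

# Residuals vs. fitted values, with a reference line at zero
ggplot(example_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Residuals vs. fitted values (mtcars illustration)",
    x = "Fitted values",
    y = "Residuals"
  )
```

For the lab itself, the formula and the data argument in lm() would be replaced with the response, predictor, and dataset from the exercises.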

Statistical Inference

  12. What is the 95% confidence interval for the slope? Interpret this interval in the context of the data.

Team Member 3: Knit, commit, and push your work.

All team members: Pull to see the updated .Rmd and .pdf files.

Team Member 4: Your turn to type!

  13. Suppose we want to test the following hypotheses:

\[\begin{aligned}&H_0: \beta_1 = 0 \\ &H_a: \beta_1 \neq 0 \end{aligned}\]

State the null and alternative hypotheses using words in the context of the data.

  14. What is the p-value of this hypothesis test? Use the p-value to state your conclusion in the context of the data.
  15. Consider the confidence interval from Exercise 12 and the hypotheses in Exercise 13. Is the confidence interval consistent with the null or alternative hypothesis? Briefly explain.

Prediction

  16. You’d like to purchase one piece of pumpkin bread from Starbucks! According to the Starbucks menu, pumpkin bread has 410 calories. Predict the average carbohydrates for all pumpkin bread sold by Starbucks. Include the estimate and appropriate interval.
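
As a sketch of how predictions with intervals can be computed in R, the example below again uses the built-in mtcars data rather than the lab data; the names example_model, new_car, mean_interval, and single_interval are made up for this illustration. The predict() function can return either an interval for the mean response or an interval for a single new observation; deciding which one fits the question is part of the exercise.

```r
# Illustration only: mtcars is a built-in dataset, not the lab data
example_model <- lm(mpg ~ wt, data = mtcars)

# New observation to predict for: a hypothetical car with wt = 3
new_car <- data.frame(wt = 3)

# Interval for the mean response at wt = 3
mean_interval <- predict(example_model, newdata = new_car,
                         interval = "confidence", level = 0.95)

# Interval for the response of one individual observation at wt = 3
single_interval <- predict(example_model, newdata = new_car,
                           interval = "prediction", level = 0.95)

mean_interval
single_interval
```

Both calls return the same point estimate in the fit column; the interval for an individual observation is wider because it also accounts for the variability of single observations around the mean.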

You’re done and ready to submit your work! Knit, commit, and push all remaining changes. You can use the commit message “Done with Lab 3!”, and make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. Then submit the assignment on Gradescope following the instructions below.

Submitting the Assignment

Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered “on time”.

To submit your assignment:

Notes:

Grading

Exploratory Data Analysis            16
Regression Model & Assumptions       10
Statistical Inference                13
Prediction                            4
Lab attendance & participation        3
Narrative in full sentences           2
Commit messages from every member     2
Total                                50