The primary goal of today’s lab is to practice statistical inference and prediction for simple linear regression. Additionally, you will continue developing your data visualization and data wrangling skills in R and getting used to the team workflow.
Each of your assignments will begin with the following steps.
Go to the sta210-sp20 organization on GitHub (http://www.github.com/sta210-sp20). Click on the repo with the prefix lab-01-review-r-. It contains the starter documents you need to complete the lab.
Click on the green Clone or download button, select Use HTTPS (this might already be selected by default, and if it is, you’ll see the text Clone with HTTPS as in the image below). Click on the clipboard icon to copy the repo URL.
Go to https://vm-manage.oit.duke.edu/containers and log in with your Duke NetID and password.
Click to log into the Docker container STA 210 - Regression Analysis. You should now see the RStudio environment.
Go to File ➡️ New Project ➡️ Version Control ➡️ Git.
Copy and paste the URL of your assignment repo into the dialog box Repository URL. You can leave Project Directory Name empty. It will default to the name of the GitHub repo.
Click Create Project, and the files from your GitHub repo will be displayed in the Files pane in RStudio.
If you are unable to push to GitHub, it may be because you need to configure git. Follow the steps below to configure git.
To do so, you will use the use_git_config() function from the usethis package.
Type the following lines of code in the console in RStudio filling in your name and email address.
The email address is the one tied to your GitHub account.
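A minimal sketch of the configuration call described above. The name and email shown here are placeholders (assumptions for illustration), so replace them with your own values.

```r
library(usethis)

# Placeholder values (hypothetical): replace with your own name and the
# email address tied to your GitHub account
use_git_config(
  user.name = "Your Name",
  user.email = "your.email@example.com"
)
```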
For example, mine would be
If you get an error message indicating that usethis is not available, then you need to install the usethis package. Run the following code in the console to install the package. Then, rerun the use_git_config function with your GitHub username and the email address associated with your GitHub account.
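The installation step can be sketched as follows. The check with requireNamespace() is an optional addition (not required by the lab) so that the package is only installed when it is missing.

```r
# Install usethis only if it is not already available
if (!requireNamespace("usethis", quietly = TRUE)) {
  install.packages("usethis")
}
```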
Once you run the configuration code, your values for user.name and user.email will display in the console. If your user.name and user.email are correct, you’re good to go! Otherwise, run the code again with the necessary changes.
Though Starbucks is most famous for its ever-growing selection of coffee drinks, it has quite the selection of food as well. In today’s lab, we will analyze the nutritional data of 77 food items sold at Starbucks. The data was originally obtained from the Starbucks menu in 2011; however, many of the items are still available today.
The data is available in the starbucks dataset from the openintro package. It contains the following variables:
Variable | Description |
---|---|
item | Name of food item |
calories | Total number of calories |
fat | Total fat (in grams) |
carb | Total carbohydrates (in grams) |
fiber | Total fiber (in grams) |
protein | Total protein (in grams) |
type | Food category (e.g. bakery, sandwich, salad, etc.) |
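To check that the data matches the description above, you could load it and take a quick look at the variables. This is a sketch that assumes the openintro and dplyr packages are installed in your environment.

```r
library(dplyr)
library(openintro)

# Overview of the starbucks data: one row per food item,
# one column per variable listed in the table above
glimpse(starbucks)
```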
When you walk to the counter at Starbucks, you’ll notice the large display of its most popular food items. Often, the number of calories is shown in the display but no other nutritional information is visible. Therefore, we’d like to use the total calories to estimate other nutritional values for a food item. Today we will focus on using calories to estimate the total carbohydrates (carb).
Team Member 1: Your turn to type! Team Member 1 should be a different person than last lab.
What is the predictor variable? What is the response variable?
Let’s begin by examining the distribution of the predictor variable. Make a histogram to display the distribution of the predictor variable. Describe the shape of the distribution.
Note: Include an informative title and informative labels for the x and y axes. This applies to all plots in the lab.
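One possible way to set up such a histogram with ggplot2 is sketched below. The bin width of 50 and the title and label text are assumptions for illustration, so adjust them as you see fit.

```r
library(ggplot2)
library(openintro)

# Histogram of calories; binwidth = 50 is an arbitrary choice for this sketch
p_hist <- ggplot(data = starbucks, aes(x = calories)) +
  geom_histogram(binwidth = 50) +
  labs(
    title = "Distribution of calories",
    subtitle = "Starbucks food items",
    x = "Calories",
    y = "Number of food items"
  )

p_hist
```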
Use the summarise function to calculate measures of center and spread for the predictor variable. Only include the measures of center and spread that are appropriate for describing the distribution of the variable. See the dplyr reference page for more information about the summarise function.
Below is example code for finding the minimum value of the response.
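A minimal sketch of that example, assuming dplyr and openintro are loaded. The object name carb_min and the column name min_carb are chosen here for illustration.

```r
library(dplyr)
library(openintro)

# Minimum value of the response variable (carb)
carb_min <- starbucks %>%
  summarise(min_carb = min(carb))

carb_min
```

Other summary functions such as mean(), median(), sd(), and IQR() can be used inside summarise in the same way.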
Team Member 1: Knit, commit, and push your work.
All team members: Pull to see the updated .Rmd and .pdf files.
Team Member 2: Your turn to type!
On a histogram, the range of values is divided into bins of equal width, and the number of observations in each bin is shown. From a histogram, one can see the shape of the data. One can also get an idea of the approximate center and spread of the data.
Compare the features of the distribution that are visible on the plot you chose versus a histogram. Which plot do you think is more effective for visualizing the distribution of a quantitative variable? Briefly explain your choice.
Make a plot displaying the relationship between the response variable and predictor variable. Describe the relationship between the two variables.
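A scatterplot is one common choice for displaying the relationship between two quantitative variables. The sketch below assumes ggplot2 and openintro are installed; the title and axis labels are illustrative.

```r
library(ggplot2)
library(openintro)

# Response (carb) on the y-axis, predictor (calories) on the x-axis
p_scatter <- ggplot(data = starbucks, aes(x = calories, y = carb)) +
  geom_point() +
  labs(
    title = "Carbohydrates vs. calories",
    subtitle = "Starbucks food items",
    x = "Calories",
    y = "Carbohydrates (grams)"
  )

p_scatter
```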
From the plot in the previous question, what assumption for regression might be violated? Briefly explain your reasoning. Note: We still need to examine the residuals before making a final determination about the model assumptions; however, we can start to get intuition during the exploratory data analysis.
Team Member 2: Knit, commit, and push your work.
All team members: Pull to see the updated .Rmd and .pdf files.
Team Member 3: Your turn to type!
Fit the regression model and display the output including the 95% confidence interval for the slope. Write the model equation. Use words/variable names when you write the equation (not “x” and “y”).
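One way to fit the model and display the confidence interval is sketched below, assuming the broom package is installed. The object names carb_model and model_summary are hypothetical.

```r
library(broom)
library(openintro)

# Simple linear regression of carb on calories
carb_model <- lm(carb ~ calories, data = starbucks)

# Coefficient table with 95% confidence intervals
# (columns conf.low and conf.high)
model_summary <- tidy(carb_model, conf.int = TRUE, conf.level = 0.95)

model_summary
```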
Below are plots of the residuals needed to check the model assumptions. Recall the assumption you mentioned in Exercise 8. Which plot will you use to assess that assumption? What is your conclusion about whether this model assumption is satisfied? Briefly explain your reasoning.
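If you would like to recreate residual plots yourself, here is a sketch of one approach using augment() from broom. The particular set of plots (residuals vs. fitted values, histogram, and normal QQ-plot), the number of bins, and the object names are assumptions, not necessarily the exact plots shown in the lab.

```r
library(broom)
library(ggplot2)
library(openintro)

carb_model <- lm(carb ~ calories, data = starbucks)

# augment() adds fitted values (.fitted) and residuals (.resid) to the data
carb_aug <- augment(carb_model)

# Residuals vs. fitted values
p_resid <- ggplot(carb_aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. fitted values",
       x = "Fitted carbohydrates (grams)", y = "Residual")

# Histogram of residuals (bins = 15 is an arbitrary choice)
p_resid_hist <- ggplot(carb_aug, aes(x = .resid)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of residuals",
       x = "Residual", y = "Count")

# Normal QQ-plot of residuals
p_resid_qq <- ggplot(carb_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Normal QQ-plot of residuals",
       x = "Theoretical quantiles", y = "Sample quantiles")

p_resid
p_resid_hist
p_resid_qq
```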
Team Member 3: Knit, commit, and push your work.
All team members: Pull to see the updated .Rmd and .pdf files.
Team Member 4: Your turn to type!
\[\begin{aligned}&H_0: \beta_1 = 0 \\ &H_a: \beta_1 \neq 0 \end{aligned}\]
State the null and alternative hypotheses using words in the context of the data.
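To see where the test statistic and p-value for these hypotheses come from, the sketch below recomputes them from the slope estimate and its standard error. It assumes broom and dplyr are installed, and the object names are hypothetical.

```r
library(broom)
library(dplyr)
library(openintro)

carb_model <- lm(carb ~ calories, data = starbucks)

# Row of the coefficient table for the slope
slope <- tidy(carb_model) %>%
  filter(term == "calories")

# Test statistic: (estimate - 0) / standard error
t_stat <- slope$estimate / slope$std.error

# Two-sided p-value from a t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(carb_model), lower.tail = FALSE)

t_stat
p_value
```

These values should agree with the statistic and p.value columns reported by tidy().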
You’re done and ready to submit your work! Knit, commit, and push all remaining changes. You can use the commit message “Done with Lab 3!”, and make sure you have pushed all the files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub. Then submit the assignment on Gradescope following the instructions below.
Once your work is finalized in your GitHub repo, you will submit it to Gradescope. Your assignment must be submitted on Gradescope by the deadline to be considered “on time”.
To submit your assignment:
Go to http://www.gradescope.com and click Log in in the top right corner.
Click School Credentials ➡️ Duke NetID and log in using your NetID credentials.
Click on the STA 210 Regression Analysis course.
Click on the assignment, and you’ll be prompted to submit it.
Select your assignment repo and choose “master” for the branch.
Make sure to include the names of all group members who participated in the assignment. Click here for help on adding group members to an assignment.
Click Upload. You should receive an email to confirm that the assignment has been submitted.
Notes:
Component | Points |
---|---|
Exploratory Data Analysis | 16 |
Regression Model & Assumptions | 10 |
Statistical Inference | 13 |
Prediction | 4 |
Lab attendance & participation | 3 |
Narrative in full sentences | 2 |
Commit messages from every member | 2 |
Total | 50 |