Inference for Regression



Data

2010 GSS:

gss = read.csv("https://stat.duke.edu/~mc301/data/gss2010.csv")


Test your vocabulary

The GSS gives the following 10 question vocabulary test:

  1. SPACE (school, noon, captain, room, board, don’t know)
  2. BROADEN (efface, make level, elapse, embroider, widen, don’t know)
  3. EMANATE (populate, free, prominent, rival, come, don’t know)
  4. EDIBLE (auspicious, eligible, fit to eat, sagacious, able to speak, don’t know)
  5. ANIMOSITY (hatred, animation, disobedience, diversity, friendship, don’t know)
  6. PACT (puissance, remonstrance, agreement, skillet, pressure, don’t know)
  7. CLOISTERED (miniature, bunched, arched, malady, secluded, don’t know)
  8. CAPRICE (value, a star, grimace, whim, inducement, don’t know)
  9. ACCUSTOM (disappoint, customary, encounter, get used to, business, don’t know)
  10. ALLUSION (reference, dream, eulogy, illusion, aria, don’t know)

Distribution of scores on vocabulary test

The variable of interest is wordsum:

library(ggplot2)

ggplot(data = gss, aes(x = wordsum)) +
  geom_histogram(binwidth = 1)

Inference for regression with single predictor

Number of words correct in a vocabulary test (0 - 10) vs. age:

vocab_age = lm(wordsum ~ age, data = gss)
summary(vocab_age)
## 
## Call:
## lm(formula = wordsum ~ age, data = gss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1186 -1.1859  0.0562  1.2713  4.3251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.392481   0.159257  33.860  < 2e-16 ***
## age         0.013447   0.003132   4.294 1.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.053 on 1383 degrees of freedom
##   (659 observations deleted due to missingness)
## Multiple R-squared:  0.01315,    Adjusted R-squared:  0.01244 
## F-statistic: 18.44 on 1 and 1383 DF,  p-value: 1.88e-05

# Raw scatterplot with fitted line; overplotting hides density since both
# variables take only integer values
plot(gss$age, gss$wordsum, pch=16, cex=0.5)
abline(lm(wordsum ~ age, data=gss), lwd=2, col='red')

# Jittering both variables makes the overlapping points visible
plot(jitter(gss$age), jitter(gss$wordsum), pch=16, cex=0.5)
abline(lm(wordsum ~ age, data=gss), lwd=2, col='red')

Hypothesis test for a slope

\(H_0: \beta_{age} = 0\), Slope of age is 0 \(\Rightarrow\) age is not related to vocab score

\(H_A: \beta_{age} \ne 0\), Slope of age is different from 0 \(\Rightarrow\) age is related to vocab score


p-value = 1.88e-05 \(\rightarrow\) Reject \(H_0\) at the 5% significance level.

The data provide convincing evidence that the slope of age is different from 0, and hence that age is useful for predicting score on the vocabulary test.
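As a sanity check, we can reproduce this p-value by hand from the summary output: the t statistic is the estimate divided by its standard error, compared to a t distribution with 1383 degrees of freedom.

t_stat = 0.013447 / 0.003132                        # estimate / std. error
2 * pt(abs(t_stat), df = 1383, lower.tail = FALSE)  # two-sided p-value, ~1.88e-05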

Confidence interval for a slope

confint(vocab_age, level = 0.95)
##                   2.5 %    97.5 %
## (Intercept) 5.080070478 5.7048924
## age         0.007303302 0.0195906

We are 95% confident that for each additional year of age, vocabulary score is expected to be higher, on average, by between 0.0073 and 0.0196 points.
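This interval can also be computed by hand as estimate \(\pm\) \(t^\star \times SE\), using the values from the summary output:

# 95% CI: point estimate +/- critical t value * standard error
0.013447 + c(-1, 1) * qt(0.975, df = 1383) * 0.003132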

Bootstrap Coefficients

library(dplyr)  # provides sample_n()

nsim=1000

# Draw a bootstrap sample: resample rows with replacement
resample = function(d) sample_n(d, nrow(d), replace=TRUE)

df = data.frame(b0=rep(NA,nsim), b1=rep(NA,nsim))

# Refit the model on each bootstrap sample and store the coefficients
for(i in 1:nsim)
{
    df[i,] = coef(lm(wordsum ~ age, data = resample(gss)))
}

quantile(df$b0, probs=c(0.025,0.975))
##     2.5%    97.5% 
## 5.080381 5.695514
quantile(df$b1, probs=c(0.025,0.975))
##        2.5%       97.5% 
## 0.007351654 0.019979551

Bootstrap regression lines

plot(jitter(gss$age), jitter(gss$wordsum), pch=16, cex=0.5)
abline(lm(wordsum ~ age, data=gss), lwd=2, col='red')
for(i in 1:nsim)
{
    abline(a=df$b0[i], b=df$b1[i], col=adjustcolor("blue",alpha=0.01))
}

Bootstrap regression lines (zoom)
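A sketch of the zoomed view (the axis limits below are illustrative choices, not taken from the original figure):

plot(jitter(gss$age), jitter(gss$wordsum), pch=16, cex=0.5,
     xlim=c(20, 80), ylim=c(4, 8))
abline(lm(wordsum ~ age, data=gss), lwd=2, col='red')
for(i in 1:nsim)
{
    abline(a=df$b0[i], b=df$b1[i], col=adjustcolor("blue",alpha=0.01))
}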

CLT confidence interval
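The code for this figure isn't included above; a minimal sketch using predict() with interval = "confidence" (age_grid is a name introduced here, and its range is an assumption about the data):

age_grid = data.frame(age = seq(18, 89, by = 1))
ci = predict(vocab_age, newdata = age_grid, interval = "confidence", level = 0.95)

plot(jitter(gss$age), jitter(gss$wordsum), pch=16, cex=0.5)
lines(age_grid$age, ci[,"fit"], lwd=2, col='red')
lines(age_grid$age, ci[,"lwr"], lty=2, col='red')  # 95% confidence band for the mean
lines(age_grid$age, ci[,"upr"], lty=2, col='red')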

CLT prediction interval
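Similarly, a sketch of the prediction band via interval = "prediction"; it is wider than the confidence band because it also includes the residual variability of individual scores:

pred = predict(vocab_age, newdata = age_grid, interval = "prediction", level = 0.95)

plot(jitter(gss$age), jitter(gss$wordsum), pch=16, cex=0.5)
lines(age_grid$age, pred[,"fit"], lwd=2, col='red')
lines(age_grid$age, pred[,"lwr"], lty=3, col='blue')  # 95% prediction band for individuals
lines(age_grid$age, pred[,"upr"], lty=3, col='blue')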

Inference for regression with multiple predictors

Number of words correct in a vocabulary test (0 - 10) vs. age + whether the respondent was unemployed over the last 10 years:

vocab_age_unemp = lm(wordsum ~ age + unemp, data = gss)
summary(vocab_age_unemp)
## 
## Call:
## lm(formula = wordsum ~ age + unemp, data = gss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1105 -1.2748  0.0904  1.3782  4.4321 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.124384   0.266168  19.252  < 2e-16 ***
## age          0.018261   0.004657   3.921 9.64e-05 ***
## unempYES    -0.049498   0.174337  -0.284    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.148 on 734 degrees of freedom
##   (1307 observations deleted due to missingness)
## Multiple R-squared:  0.0248, Adjusted R-squared:  0.02215 
## F-statistic: 9.335 on 2 and 734 DF,  p-value: 9.926e-05

Hypothesis test for multiple slopes

Age:

\(H_0: \beta_{age} = 0\), Slope of age is 0, given that unemployment is in the model

\(H_A: \beta_{age} \ne 0\), Slope of age is not 0, given that unemployment is in the model

p-value = 9.64e-05 \(\rightarrow\) Reject \(H_0\) at the 5% significance level. The data provide convincing evidence that the slope of age is different from 0 for predicting score on the vocabulary test, given that unemployment is in the model.

Unemployment:

\(H_0: \beta_{unempYES} = 0\), Slope of unemployment is 0, given that age is in the model

\(H_A: \beta_{unempYES} \ne 0\), Slope of unemployment is not 0, given that age is in the model

p-value = 0.777 \(\rightarrow\) Fail to reject \(H_0\) at the 5% significance level. The data do not provide convincing evidence that the slope of unemployment is different from 0 for predicting score on the vocabulary test, given that age is in the model.

Confidence intervals for multiple slopes

confint(vocab_age_unemp, level = 0.95)
##                    2.5 %     97.5 %
## (Intercept)  4.601842116 5.64692636
## age          0.009117995 0.02740331
## unempYES    -0.391757193 0.29276161

All else held constant (i.e. given no change in unemployment status), we are 95% confident that for each additional year of age, vocabulary score is expected to be higher, on average, by between 0.0091 and 0.0274 points.

Interpret the confidence interval for the slope of unemployment.

Bootstrap coefficient and confidence intervals

We won't bother constructing these in full here, since we've already seen how it's done - whether there are 1, 2, or 50 coefficients changes nothing significant about the procedure. A brief sketch is included below.
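A minimal sketch of how the earlier bootstrap generalizes, reusing resample() from above (df2 is a name introduced here):

df2 = data.frame(b0=rep(NA,nsim), b1=rep(NA,nsim), b2=rep(NA,nsim))

for(i in 1:nsim)
{
    df2[i,] = coef(lm(wordsum ~ age + unemp, data = resample(gss)))
}

# Percentile bootstrap intervals for all three coefficients
apply(df2, 2, quantile, probs=c(0.025,0.975))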

What about the other p-values?

summary(vocab_age_unemp)
## 
## Call:
## lm(formula = wordsum ~ age + unemp, data = gss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1105 -1.2748  0.0904  1.3782  4.4321 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.124384   0.266168  19.252  < 2e-16 ***
## age          0.018261   0.004657   3.921 9.64e-05 ***
## unempYES    -0.049498   0.174337  -0.284    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.148 on 734 degrees of freedom
##   (1307 observations deleted due to missingness)
## Multiple R-squared:  0.0248, Adjusted R-squared:  0.02215 
## F-statistic: 9.335 on 2 and 734 DF,  p-value: 9.926e-05

The F-test at the bottom of the output is equivalent to testing whether all slope coefficients are equal to 0, versus at least one being different from zero.

\(H_0: \beta_{1} = \beta_2 = \ldots = \beta_k = 0\)

\(H_A: \beta_1 \ne 0 \text{ or } \beta_2 \ne 0 \text{ or } \ldots \text{ or } \beta_k \ne 0\)
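The same F-test can be reproduced by comparing the full model to a mean-only (intercept-only) model with anova(). A sketch, restricted to complete cases so both models are fit to the same rows (gss_cc is a name introduced here):

gss_cc = na.omit(gss[, c("wordsum", "age", "unemp")])

null_model = lm(wordsum ~ 1, data = gss_cc)            # mean-only model
full_model = lm(wordsum ~ age + unemp, data = gss_cc)

anova(null_model, full_model)  # F statistic and p-value match summary() above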

SLR

summary(vocab_age)
## 
## Call:
## lm(formula = wordsum ~ age, data = gss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1186 -1.1859  0.0562  1.2713  4.3251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.392481   0.159257  33.860  < 2e-16 ***
## age         0.013447   0.003132   4.294 1.88e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.053 on 1383 degrees of freedom
##   (659 observations deleted due to missingness)
## Multiple R-squared:  0.01315,    Adjusted R-squared:  0.01244 
## F-statistic: 18.44 on 1 and 1383 DF,  p-value: 1.88e-05

What else can we do with these p-values?

Model selection based on p-values:

  • Backwards elimination: Remove the variable with the highest p-value, re-fit, repeat until all variables are significant at the chosen significance level.
  • Forward selection: Start with the variable with the lowest p-value, re-fit, repeat until no significant variables are left to add at the chosen significance level.

This is sometimes seen in the literature, but is not recommended!

  • Relies on arbitrary significance level cutoff
  • Multiple testing!

Instead, use adjusted \(R^2\), AIC, or other criterion-based model selection.
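For example, base R's step() does stepwise selection by AIC rather than p-values; a sketch with the two-predictor model (again restricting to complete cases so all candidate models see the same rows):

gss_cc = na.omit(gss[, c("wordsum", "age", "unemp")])

full = lm(wordsum ~ age + unemp, data = gss_cc)
step(full, direction = "backward")  # drops a term only if doing so lowers AIC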

My thoughts

Both sets of p-values are largely useless outside of a few very narrow circumstances.

  • Coefficient p-value
    • If you truly want to know if a coefficient is significantly different from zero (taking the other predictors into account) then great
    • If you want to know which predictors are important - use model selection
  • Overall model p-value
    • Strawman comparison: I don’t really care that my model is better than a mean-only model; the latter is almost always going to be terrible