Modeling Longitudinal Data

class: center, middle, inverse, title-slide

# Modeling Longitudinal Data
### Dr. Maria Tackett
### 04.24.19

---

### Announcements

- Regular office hours this week

- My office hours next week: 
    - Mon, 10a - 11:30a
    - Tues, 1p - 2:30p

- All grades (except project) by next Monday

- Exam 2 extra credit: If the response rate for the course evaluations is 90%, everyone gets +1 pt on their Exam 2 score (out of 40 pts)

---

### Announcements cont'd

- Project write up and presentation due May 1 at 2p

- Project presentations on May 1
    - Lab 01L: 2p - 3:30p
    - Lab 02L: 3:30p - 5p

---

## US college graduation rates

**What factors most effect graduation rates at US colleges?**

Response variable:

- **`rate`**: graduation rate, i.e. number of degrees awarded per 100 students enrolled

Predictor variables:

- **`year02`**: number of years since 2002
- **`faculty`**: mean number of full-time faculty in 2002 - 2009
- **`tuition`**: mean yearly tuition between 2002 and 2009

```r
college <- read_csv("data/colleges.csv") %>%
 filter(rate < 100) %>%
 mutate(year02 = year - 2002)
```

---

## `college` data

```r
college %>% 
  filter(instname %in% c("Duke University", "The University of Tennessee"))
```

```
## # A tibble: 14 x 5
## instname year02 faculty tuition rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Duke University 2 5.17 25.3 30.2
## 2 Duke University 3 5.17 25.3 27.1
## 3 Duke University 4 5.17 25.3 25.2
## 4 Duke University 5 5.17 25.3 28.7
## 5 Duke University 6 5.17 25.3 28.4
## 6 Duke University 7 5.17 25.3 29.3
## 7 The University of Tennessee 0 10.9 23.8 24.4
## 8 The University of Tennessee 1 10.9 23.8 24.3
## 9 The University of Tennessee 2 10.9 23.8 24.9
## 10 The University of Tennessee 3 10.9 23.8 23.2
## 11 The University of Tennessee 4 10.9 23.8 22.5
## 12 The University of Tennessee 5 10.9 23.8 22.2
## 13 The University of Tennessee 6 10.9 23.8 21.3
## 14 The University of Tennessee 7 10.9 23.8 22.8
```

---

### What makes this model different?

- **Goals**:  
    - Understand how the number of faculty and tuition of a college affects its graduation rate
    - How the graduation rate has changed over time

- There are multiple observations for each college (so multiple regression not appropriate)

- The are only a few time points in the dataset (so time series model not appropriate)

- We will use a multilevel model to model the relationship between `faculty`, `tuition` and `rate`.

---

## Multilevel Model

We will fit a two-level model that includes the following model components:

- **Level One**: include time (`year02`) and any other predictors that change within a college over the time period in the data

- **Level Two**: includes predictors that differ between colleges but that remain the same within a college over the time period in the data (`faculty` and `tuition`)

---

### Modeling Approach

**Approach:** Start with simple, preliminary models to establish a baseline that can be used to evaluate more complex models. Work toward the final model by adding predictors and checking model assumptions at each step. We can take the following steps:

1. Exploratory data analysis

2. Fit unconditional means model - model with no predictors

3. Fit unconditional growth model - add time

4. Fit "final" model with time and predictors

---

### 1. Exploratory Data Analysis

- Given the longitudinal structure of the data, we have observations at different time points for each college in the data set.

- When we do EDA, in addition to an univariate analysis of each variable, we want to look at the following: 
    - **within college**: changes over time within a school
    - **between college**: effects of school-specific predictors (e.g. `faculty`)

---

### 1. EDA: Univariate analysis

---

### 1. EDA: Graduation Rate by Year

- Let's look at the rate over time for 20 randomly selected colleges

---

### 1. EDA: Graduation Rate vs. Faculty in 2002

---

### 1. EDA: Graduation Rate vs. Tuition in 2002

---

## Unconditional Means Model

- In an unconditional means model, there are no predictors at any level

- The goal of this model is to compare variability with colleges and variability between colleges

.alert[
Let `$Y_{ij}$` be the graduation rate of college `$i$` in year `$j$`

`$$Y_{ij} = \alpha_0 + u_i + \epsilon_{ij}\\[10pt]
u_i \sim N(0, \sigma^2_u) \hspace{10mm} \epsilon_{ij} \sim N(0, \sigma^2)$$`
]

- `$\sigma^2_u$`: variability between colleges

- `$\sigma^2$`: variability within a college

---

### Unconditional Means Model

We can fit a multilevel model using the `lmer` function in the **lme4** package.

```r
library(lme4)

model_0 <- lmer(rate ~ 1 + (1|instname), data = college)
summary(model_0)
```

---

### Rates: Unconditional Means Model

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: rate ~ 1 + (1 | instname)
##    Data: college
## 
## REML criterion at convergence: 45218.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -8.2836 -0.3820 -0.0227  0.3498 17.4261 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  instname (Intercept) 53.485   7.313   
##  Residual              9.938   3.152   
## Number of obs: 7928, groups:  instname, 1337
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  25.4904     0.2043   124.8
```

---

## Understanding the model

- Coefficients
 - `$\hat{\alpha}_0 = 25.49$`: mean graduation rate across all colleges

- `$\hat{\sigma}^2_u = 53.49$`: variance in the between-college deviations between the college means and the overall mean across all colleges and all years

- `$\hat{\sigma}^2 = 9.94$`: variance in within-school deviations between individual rate and college mean across all years
 
--

- Intraclass correlation
`$$\hat{\rho} = \frac{\hat{\sigma}^2_u}{\hat{\sigma}_u^2 + \hat{\sigma}^2} = \frac{53.49}{53.49 + 9.94} = 0.843$$`

About 84.3% of the total variation in graduation rates can be attributed to the difference among schools rather than the change over time within schools.

---

## Unconditional growth model

- In an unconditional growth model, 
time is added to the Level One model but no predictors in the Level Two model

- The goal of this model is to determine how much of the within-school variability in graduation rate can be attributed to changes over time

- We can think of this as building individual models for the change in rate over time for each of the colleges. 
    - We assume the same form of the relationship between `rate` and `year` for every college
    
.alert[
Let `$Y_{ij}$` be the `rate` for college `$i$` in year `$j$`

`$$Y_{ij} = a_i + b_i \text{year02}_{ij} + \epsilon_{ij}\\
\epsilon_{ij} \sim N(0, \sigma^2)$$`
]

---

## Unconditional growth model

.alert[
Let `$Y_{ij}$` be the `rate` for college `$i$` in year `$j$`

`$$Y_{ij} = a_i + b_i \text{year02}_{ij} + \epsilon_{ij}\\
\epsilon_{ij} \sim N(0, \sigma^2)$$`
]

- `$a_i$`: expected graduation rate for college `$i$` at time 0

- `$b_i$`: slope for college `$i$`, i.e. the rate of change in graduation rate for college `$i$` over the time period

- `$\epsilon_{ij}$`: deviation in college `$i$`'s expected and actual graduation rate at time `$j$`
    - `$\sigma^2$` is the variability in the deviations
    
---

## UT: Rate over Time

```r
college %>%
  filter(instname == "The University of Tennessee") %>%
  ggplot(aes(x = year02, y = rate)) +
           geom_point() + geom_line() + 
           labs(x = "Years Since 2002")
```

---

## Unconditional growth model

- We will let `$a_i$` and `$b_i$` vary by college, so we can fit Level Two models that incorporate college-level variables to estimate these values

.alert[
Let `$Y_{ij}$` be the graduation rate for college `$i$` in year `$j$`

**Level One**

`$$Y_{ij} = a_i + b_i \text{year02}_{ij} + \epsilon_{ij}$$`

**Level Two**
`$$a_i = \alpha_0 + u_i \\
b_i = \beta_0 + v_i$$`
]

where `$\epsilon_{ij} \sim N(0, \sigma^2)$` and

`$$\bigg[\begin{matrix}u \\ v  \end{matrix}\bigg] \sim N\Bigg(\bigg[\begin{matrix}0 \\ 0 \end{matrix}\bigg], \bigg[\begin{matrix}\sigma^2_u &  \sigma_{vu} \\ \sigma_{uv} & \sigma^2_{v} \end{matrix} \bigg] \Bigg)$$`

---

## Unconditional growth model

- `$\sigma^2$`: within-school variability

- `$\sigma^2_u$`: variability in initial rate

- `$\sigma^2_v$`: variability in slopes over time

- `$\sigma^2_u$` and `$\sigma^2_v$` make up the between-school variability

---

### Rate: Unconditional growth model

```r
library(lme4)
model_1 <- lmer(rate ~ year02 + (year02|instname), 
 data = college)
summary(model_1)
```

---

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: rate ~ year02 + (year02 | instname)
##    Data: college
## 
## REML criterion at convergence: 44669.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.9247 -0.3424 -0.0177  0.3194 14.9125 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  instname (Intercept) 59.1021  7.6878        
##           year02       0.4807  0.6933   -0.31
##  Residual              7.7405  2.7822        
## Number of obs: 7928, groups:  instname, 1337
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 24.98437    0.22650 110.307
## year02       0.13751    0.02645   5.198
## 
## Correlation of Fixed Effects:
##        (Intr)
## year02 -0.433
```

---

## Understanding model

What do each of the following values tell you?

- `$\hat{\alpha}_0 = 24.98$`:

- `$\hat{\beta}_0 = 0.14$`:

- `$\hat{\sigma}^2 = 7.74$`:

- `$\hat{\sigma}^2_u = 59.10$`:

- `$\hat{\sigma}^2_v= 0.48$`:

- `$\rho_{uv} = -0.31$`:

---

## Add predictors

- Do `faculty` and `tuition` affect graduation rates?

- We will add these predictor variables to the Level Two model, since they differ by college but didn't change within a college (based on how they're defined)

.alert[
Let `$Y_{ij}$` be the `rate` for college `$i$` in year `$j$`

**Level One**

`$$Y_{ij} = a_i + b_i \text{year02}_{ij} + \epsilon_{ij}$$`

**Level Two**
`$$a_i = \alpha_0 + \alpha_1 \text{faculty}_i + \alpha_2 \text{tuition}_i + u_i \\
b_i = \beta_0 + \beta_1 \text{faculty}_i + \beta_2 \text{tuition}_i + v_i$$`
]

---

## Add predictors

```r
library(lme4)
model_2 <- lmer(rate ~ year02 + faculty + tuition +
 faculty:year02 + tuition:year02 +
 (year02|instname), data = college)
summary(model_2)
```

---

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## rate ~ year02 + faculty + tuition + faculty:year02 + tuition:year02 +  
##     (year02 | instname)
##    Data: college
## 
## REML criterion at convergence: 44689.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -7.9022 -0.3414 -0.0162  0.3172 14.9148 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  instname (Intercept) 58.9803  7.6799        
##           year02       0.4843  0.6959   -0.31
##  Residual              7.7326  2.7808        
## Number of obs: 7928, groups:  instname, 1337
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)     2.416e+01  4.397e-01  54.934
## year02          2.132e-01  6.323e-02   3.372
## faculty         1.663e-01  6.472e-02   2.569
## tuition        -1.061e-02  2.309e-02  -0.459
## year02:faculty -1.284e-02  9.308e-03  -1.379
## year02:tuition  1.169e-05  3.209e-03   0.004
## 
## Correlation of Fixed Effects:
##             (Intr) year02 faclty tuitin yr02:f
## year02      -0.521                            
## faculty     -0.516  0.302                     
## tuition     -0.370  0.215 -0.461              
## yer02:fclty  0.304 -0.591 -0.512  0.185       
## year02:tutn  0.220 -0.415  0.194 -0.506 -0.377
```

---

## Write out the model

---

## Predicted values

We can use the `augment` function to get predicted values and residuals

```r
model_2_aug <- augment(model_2)
model_2_aug %>% slice(1:5)
```

```
##        rate year02   faculty tuition
## 1 28.048732      5  5.771898 3.53450
## 2 13.949032      4 14.829452 4.36400
## 3 24.836729      5 14.829452 4.36400
## 4 28.798360      6 14.829452 4.36400
## 5  4.072385      6  5.334042 4.58075
##                                                  instname  .fitted
## 1 Arizona State University at the Downtown Phoenix Campus 27.76660
## 2                                    Ave Maria University 21.99504
## 3                                    Ave Maria University 22.77558
## 4                                    Ave Maria University 23.55612
## 5                         University of California-Merced  8.62289
##       .resid    .cooksd   .fixed      .mu .offset .sqrtXwt .sqrtrwt
## 1  0.2821315 0.09806445 25.77340 27.76660       0        1        1
## 2 -8.0460041 1.26990520 26.66643 21.99504       0        1        1
## 3  2.0611509 0.06275487 26.68927 22.77558       0        1        1
## 4  5.2422411 0.59302676 26.71211 23.55612       0        1        1
## 5 -4.5505051 0.75145767 25.86241  8.62289       0        1        1
##   .weights     .wtres
## 1        1  0.2821315
## 2        1 -8.0460041
## 3        1  2.0611509
## 4        1  5.2422411
## 5        1 -4.5505051
```

---

## Check Residuals

---

## Residuals

---

## Actual vs. Predicted

---

## References

These notes were based on the chapters ["Introduction to Multilevel Models"](https://bookdown.org/roback/bookdown-bysh/ch-multilevelintro.html) and ["Two Level Longitudinal Data"](https://bookdown.org/roback/bookdown-bysh/ch-lon.html) in *Broadening Your Statistical Horizons*.

---

class: middle, center

### Congrats on completing STA 210! 😄