class: center, middle, inverse, title-slide # Categorical Data (4) ### Yue Jiang ### Duke University --- ### Roadmap for today It's often the case that in the real world, you may be making predictions about some sort of binary/multinomial outcome, for instance modeled using binary/ multinomial regression or some other machine learning categorization approach. Today we'll be talking about some methods of evaluating "how well we do" in terms of prediction. --- ### Licorice and post-op sore throat <img src="img/licorice.jpg" width="80%" style="display: block; margin: auto;" /> --- ### Medical diagnostics Suppose we're interested in the performance of a diagnostic test. Let `\(D+\)` be the event that someone has a condition of interest, and let `\(T+\)` be the event that a test for that condition is positive - .copper[Prevalence]: `\(P(D+)\)` The proportion of individuals with the condition - .copper[Sensitivity]: `\(P(T+|D+)\)`, or the true positive rate - .copper[Specificity]: `\(P(T-|D-)\)`, or 1 minus the false positive rate - .copper[Positive Predictive Value]: `\(P(D+|T+)\)` - .copper[Negative Predictive Value]: `\(P(D-|T-)\)` Of course, we can generalize this to other situations with binary classifiers, but it's often most intuitive to think of a diagnostic test. --- ### Rapid self-administered HIV tests .question[ Suppose I told you I had an HIV test that was >99.98% accurate. Would you think this is a very good test? Why or why not? ] --- ### Rapid self-administered HIV tests .pull-left[ From the FDA package insert for the OraQuick ADVANCE Rapid HIV-1/2 Antibody Test: - Sensitivity, `\(P(Test+|HIV+)\)` 99.3% - Specificity, `\(P(Test-|HIV-)\)` 99.8% From CDC statistics for 2016, 14.3/100,000 Americans aged 13 or older are HIV+. ] .pull-right[ <img src="img/oraquickstick.jpg" width="20%" style="display: block; margin: auto;" /> ] .question[ Suppose a randomly selected American aged 13 or older has a positive test on this test. What do you think is the probability that they have HIV? ] --- ### Rapid self-administered HIV tests .center[ .eno[6.6%] ] -- - Is this calculation surprising? - What is the explanation? - Was this calculation reasonable to perform? .question[ What if a randomly selected adult in Botswana tested positive (HIV prevalence `\(\approx\)` 25%)? ] --- ### Licorice and post-op sore throat ```r dat$anypain = ifelse(dat$postOp4hour_throatPain == 0, 0, 1) table(dat$anypain) ``` ``` ## ## 0 1 ## 157 76 ``` --- ### Predicting throat pain ```r m1 <- glm(anypain ~ preOp_gender + preOp_asa + intraOp_surgerySize + treat, data = dat) round(summary(m1)$coef, 3) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.172 0.124 1.387 0.167 ## preOp_gender -0.027 0.061 -0.441 0.660 ## preOp_asamild_disease 0.155 0.082 1.878 0.062 ## preOp_asasystemic_disease 0.131 0.094 1.387 0.167 ## intraOp_surgerySize 0.086 0.056 1.552 0.122 ## treat -0.232 0.059 -3.918 0.000 ``` --- ### Predicting throat pain ```r round(predict(m1), 3) ``` ``` ## 1 2 3 4 5 6 7 8 9 10 11 12 13 ## 0.244 0.181 0.268 0.354 0.113 0.354 0.330 0.181 0.157 0.244 0.268 0.181 0.181 ## 14 15 16 17 18 19 20 21 22 23 24 25 26 ## 0.268 0.244 0.268 0.244 0.181 0.268 0.026 0.026 0.113 0.113 0.113 0.244 0.268 ## 27 28 29 30 31 32 33 34 35 36 37 38 39 ## 0.026 0.026 0.181 0.026 0.268 0.157 0.113 0.244 0.268 0.244 0.268 0.268 0.268 ## 40 41 42 43 44 45 46 47 48 49 50 51 52 ## 0.181 0.268 0.268 0.268 0.157 0.113 0.268 0.244 0.181 0.268 0.354 0.330 0.154 ## 53 54 55 56 57 58 59 60 61 62 63 64 65 ## 0.241 0.241 0.086 0.154 0.217 0.000 0.217 0.130 0.154 0.241 0.217 0.217 0.086 ## 66 67 68 69 70 71 72 73 74 75 76 77 78 ## 0.086 0.241 0.241 0.086 0.241 0.217 0.217 0.217 0.130 0.241 0.241 0.217 0.241 ## 79 80 81 82 83 84 85 86 87 88 89 90 91 ## 0.327 0.000 0.241 0.241 0.086 0.241 0.241 0.086 0.303 0.000 0.241 0.154 0.086 ## 92 93 94 95 96 97 98 99 100 101 102 103 104 ## 0.154 0.241 0.154 0.241 0.268 0.330 0.268 0.181 0.181 0.113 0.244 0.244 0.181 ## 105 106 107 108 109 110 111 112 113 114 115 116 117 ## 0.268 0.354 0.268 0.330 0.268 0.244 0.268 0.268 0.154 0.241 0.154 0.241 0.241 ## 118 119 120 121 122 123 124 125 126 127 128 129 130 ## 0.414 0.345 0.345 0.500 0.259 0.389 0.500 0.259 0.500 0.345 0.259 0.476 0.500 ## 131 132 133 134 135 136 137 138 139 140 141 142 143 ## 0.500 0.500 0.259 0.500 0.389 0.500 0.500 0.500 0.500 0.500 0.345 0.414 0.476 ## 144 145 146 147 148 149 150 151 152 153 154 155 156 ## 0.500 0.500 0.500 0.500 0.500 0.476 0.476 0.500 0.476 0.562 0.414 0.500 0.500 ## 157 158 159 160 161 162 163 164 165 166 167 168 169 ## 0.500 0.449 0.449 0.473 0.232 0.449 0.387 0.319 0.473 0.473 0.319 0.473 0.473 ## 170 171 172 173 174 175 176 177 178 179 180 181 182 ## 0.319 0.387 0.473 0.473 0.449 0.449 0.473 0.473 0.387 0.449 0.387 0.473 0.473 ## 183 184 185 186 187 188 189 190 191 192 193 194 195 ## 0.560 0.473 0.387 0.363 0.363 0.449 0.473 0.449 0.473 0.345 0.500 0.389 0.562 ## 196 197 198 199 200 201 202 203 204 205 206 207 208 ## 0.500 0.476 0.500 0.476 0.500 0.500 0.345 0.500 0.587 0.345 0.500 0.345 0.587 ## 209 210 211 212 213 214 215 216 217 218 219 220 221 ## 0.476 0.476 0.414 0.500 0.500 0.476 0.500 0.500 0.389 0.259 0.562 0.587 0.587 ## 222 223 224 225 226 227 228 229 230 231 232 233 ## 0.500 0.476 0.473 0.473 0.387 0.449 0.473 0.473 0.536 0.387 0.232 0.449 ``` --- ### Predicting throat pain ```r dat$pred50 <- ifelse(predict(m1) > 0.5, "Pain", "No pain") table(dat$pred50, dat$anypain == TRUE) ``` ``` ## ## FALSE TRUE ## No pain 136 55 ## Pain 21 21 ``` .question[ What is the accuracy of our procedure here? What are the sensitivity and specificity? The positive/negative predictive values? ] --- ### Predicting throat pain ```r dat$pred25 <- ifelse(predict(m1) > 0.25, "Pain", "No pain") table(dat$pred25, dat$anypain == TRUE) ``` ``` ## ## FALSE TRUE ## No pain 72 14 ## Pain 85 62 ``` .question[ What is the accuracy of our procedure here? What are the sensitivity and specificity? The positive/negative predictive values? ] --- ### Predicting throat pain ```r dat$pred10 <- ifelse(predict(m1) > 0.1, "Pain", "No pain") table(dat$pred10, dat$anypain == TRUE) ``` ``` ## ## FALSE TRUE ## No pain 14 1 ## Pain 143 75 ``` .question[ What is the accuracy of our procedure here? What are the sensitivity and specificity? The positive/negative predictive values? What are you noticing? ] --- ### Discrimination thresholds Oral HIV tests give positive or negative results depending on levels of HIV antibodies detected in saliva - If antibody levels are above a certain threshold, it is classified as a positive test - Varying the threshold for a positive vs. negative test will result in a test in different characteristics - But what is this threshold of "high" antibody level that classifies them as positive? - We are looking for a binary classifier that classifies a patient as being positive or negative based on a threshold value of a continuous variable. - At each threshold value, there is a trade-off between sensitivity and specificity --- ### ROC curves .copper[Receiver Operating Characteristic] curves show how specificity and specificity change as the discrimination threshold changes; that is, for each false positive rate (1 - specificity), what the true positive rate (sensitivity) is corresponding to that threshold value. <img src="img/radar1.jpg" width="75%" style="display: block; margin: auto;" /> --- ### ROC curves ```r library(PRROC) dat$preds <- predict(m1) roc1 <- roc.curve(dat$preds[dat$anypain == 1], dat$preds[dat$anypain == 0], curve = TRUE) plot(roc1) ``` --- ### ROC curves <img src="cat-4_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- ### PR curves .copper[Precision-Recall] curves are also used to evaluate binary classification models. Instead of modeling the true positive rate (sensitivity) as a function of the false positive rate (1 - specificity), precision-recall curves model the positive predictive value as a function of the true positive rate. .question[ Why might we care about a precision-recall curve vs. an ROC curve? ] --- ### PR curves Think back to the HIV example: a test that has >99.98% accuracy could simply be a sheet of paper with the word "no" on it given to random Americans aged 13 or older. It's often the case that there's an imbalance between the two statuses "in truth," and we are less interested in the more common status. --- ### PR curves ```r pr1 <- pr.curve(dat$preds[dat$anypain == 1], dat$preds[dat$anypain == 0], curve = TRUE) plot(pr1) ``` --- ### PR curves <img src="cat-4_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ### PR curves ```r pr.vals <- pr1$curve head(pr.vals) ``` ``` ## [,1] [,2] [,3] ## [1,] 1.0000000 0.3261803 -0.0003507909 ## [2,] 1.0000000 0.3261803 -0.0003507909 ## [3,] 1.0000000 0.3304348 0.0264218380 ## [4,] 1.0000000 0.3377778 0.0861142977 ## [5,] 0.9868421 0.3440367 0.1128869265 ## [6,] 0.9736842 0.3507109 0.1302833652 ``` --- ### PR curves ```r ggplot(data.frame(pr1$curve), aes(x = X1, y = X2, color = X3)) + geom_line() + labs(x = "Sensitivity", y = "Positive Predictive Value", color = "p-hat") + scale_color_gradient(low = "darkblue", high = "deeppink") + theme_bw() ``` --- ### PR curves <img src="cat-4_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- ### Comparing ROC Curves ```r m2 <- glm(anypain ~ preOp_gender + preOp_asa + intraOp_surgerySize, data = dat) round(summary(m2)$coef, 3) ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.033 0.123 0.267 0.790 ## preOp_gender -0.036 0.063 -0.570 0.569 ## preOp_asamild_disease 0.165 0.085 1.948 0.053 ## preOp_asasystemic_disease 0.143 0.097 1.475 0.141 ## intraOp_surgerySize 0.096 0.057 1.669 0.097 ``` --- ### Comparing ROC Curves ```r dat$preds2 <- predict(m2) library(pROC) roc1 <- roc(dat$anypain ~ dat$preds) roc2 <- roc(dat$anypain ~ dat$preds2) ``` --- ### Comparing ROC Curves ```r plot(roc1, main = "Using Treatment in Model") ``` <img src="cat-4_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- ### Comparing ROC Curves ```r plot(roc2, main = "No Treatment in Model") ``` <img src="cat-4_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- ### Comparing ROC Curves ```r roc.test(roc1, roc2, method = "bootstrap") ``` ``` ## ## Bootstrap test for two correlated ROC curves ## ## data: roc1 and roc2 ## D = 2.0736, boot.n = 2000, boot.stratified = 1, p-value = 0.03811 ## alternative hypothesis: true difference in AUC is not equal to 0 ## sample estimates: ## AUC of roc1 AUC of roc2 ## 0.6886943 0.6141468 ``` --- ### Brier score For binary outcomes, the Brier score is simply the MSE of the prediction; a 0 is perfectly predicted and a 1 corresponds to being "perfectly incorrect." `\begin{align*} B = \frac{1}{n}\sum_{i = 1}^n(\hat{p}_i - Y_i)^2 \end{align*}` ```r mean( (dat$preds - dat$anypain)^2 ) ``` ``` ## [1] 0.1985282 ```