--- title: "Residuals and Diagnostics" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(BAS) ``` Load data and libraries ```{r data} library(MASS) library(car) data(stackloss) # variables Air.Flow Water.Temp Acid.Conc. stack.loss ``` EDA/Pairs plot ```{r} pairs(stackloss) ``` Linear Model and Added Variable Plots (See Christensen Ch 13 p 368) ```{r} stack.lm <- lm(stack.loss ~ ., data=stackloss) avPlots(stack.lm) plot(stack.lm) ``` Can you identify which point(s) are "outliers" from the above plots? R's `identify` function can help with this. Create a basic plot. (I have `attach`ed the dataframe to avoid using the indexing by df$variablename) uncomment out the line with `identify` ```{r} attach(stackloss) plot(Water.Temp, stack.loss) #identify(Water.Temp, stack.loss) detach(stackloss) ``` Uncomment the line above with the `identify()` function. This is an interactive function, so click on points that you wish to identify. Press `Esc` or control click depending on your OS will exit from the function and return the indices of the points. ## Other Functions for Diagnostics ```{r} im = influence.measures(stack.lm) plot(rstudent(stack.lm) ~ hatvalues(stack.lm), ylab="Externally Studentized Residual", xlab="Leverage") #identify(rstudent(stack.lm) ~hatvalues(stack.lm) ) # Prob that observation with largest studentized residual is an outlier 2*(1- pt(max(abs(rstudent(stack.lm))), stack.lm$df - 1)) # Bonferonni .05/21 ``` Point would not be an outlier based on Bonferonni or other multiplcity adjustments. ## Bayesian Approaches ### Chaloner & Brant Chaloner & Brant (1988) ["A Bayesian approach to outlier detection and residual analysis" ](http://biomet.oxfordjournals.org/content/75/4/651.full.pdf+html) provides an approach to identify outliers or surprising variables by looking at the probabilty that the error for a case is more than $k$ standard deviations above or below zero. Cases with a high probability (absolute or relative to a multiplicity correction) are then investigated. ### Simultaneous Variable and Outlier Detection using MCMC References: * Bayesian Model Averaging for Linear Regression Models Adrian E. Raftery, David Madigan, and Jennifer A. Hoeting (1997). Journal of the American Statistical Association, 92, 179-191. * A Method for Simultaneous Variable and Transformation Selection in Linear Regression Jennifer Hoeting, Adrian E. Raftery and David Madigan (2002). Journal of Computational and Graphical Statistics 11 (485-507) * A Method for Simultaneous Variable Selection and Outlier Identification in Linear Regression Jennifer Hoeting, Adrian E. Raftery and David Madigan (1996). Computational Statistics and Data Analysis, 22, 251-270 This is implemented in the package `BMA` from CRAN. ```{r} library(BMA) #help(package=BMA) #help(MC3.REG) attach(stackloss) stack.MC3= MC3.REG(stack.loss, stackloss[,-4] ,num.its=10000, outliers=TRUE, M0.out=rep(FALSE, 21), outs.list=1:21, M0.var=rep(TRUE, 3)) summary(stack.MC3) detach(stackloss) ``` This extends variable selection to include additional variables for indicators for wheterh case $i$ is an outlier. The output allows one to identify the highest probability models in terms of variables and outliers. The highest probability model indicates 4 outliers and has a probability of $0.18470$ and includes both `Air.Flow` and `Water.Temp`. ### Using BAS We can coerce `BAS` to also handle outlier identification by adding additional columns to the dataframe that are indicators of case $i$ being an outlier. 
### Simultaneous Variable and Outlier Detection using MCMC

References:

* Raftery, A. E., Madigan, D., and Hoeting, J. A. (1997). Bayesian Model Averaging for Linear Regression Models. *Journal of the American Statistical Association*, 92, 179-191.
* Hoeting, J., Raftery, A. E., and Madigan, D. (2002). A Method for Simultaneous Variable and Transformation Selection in Linear Regression. *Journal of Computational and Graphical Statistics*, 11, 485-507.
* Hoeting, J., Raftery, A. E., and Madigan, D. (1996). A Method for Simultaneous Variable Selection and Outlier Identification in Linear Regression. *Computational Statistics and Data Analysis*, 22, 251-270.

This is implemented in the package `BMA` from CRAN.

```{r}
library(BMA)
# help(package = BMA)
# help(MC3.REG)
attach(stackloss)
stack.MC3 = MC3.REG(stack.loss, stackloss[, -4], num.its = 10000,
                    outliers = TRUE, M0.out = rep(FALSE, 21),
                    outs.list = 1:21, M0.var = rep(TRUE, 3))
summary(stack.MC3)
detach(stackloss)
```

This extends variable selection to include additional indicator variables for whether case $i$ is an outlier. The output allows one to identify the highest probability models in terms of variables and outliers. The highest probability model indicates 4 outliers, has a probability of $0.18470$, and includes both `Air.Flow` and `Water.Temp`.

### Using BAS

We can coerce `BAS` to handle outlier identification as well by adding additional columns to the dataframe that are indicators of case $i$ being an outlier. Here we use the `diag` function to create the indicators and bind these columns to the original dataframe, then check that the result is still a dataframe.

```{r BAS-prelim}
library(BAS)
n = nrow(stackloss)
stack.out = cbind(stackloss, diag(n))   # add outlier indicators
is.data.frame(stack.out)
```

We can run BAS using the shorthand `.` to specify that all variables (predictors and outlier indicators) are potentially included.

```{r BAS}
BAS.stack = bas.lm(stack.loss ~ ., data = stack.out,
                   prior = "hyper-g-n", alpha = 3,
                   modelprior = tr.beta.binomial(1, 1, 15),
                   # modelprior = tr.poisson(4, 15),
                   method = "MCMC", MCMC.iterations = 10000)
```

With all of the outlier indicators and variables, there are models that are not of full rank. This is not a problem with any mixture of $g$-priors, but we then have to rely on the prior on the space of models to reduce the contribution of such models to posterior probabilities, etc. To limit the number of outliers that may be included in a model, we use the truncated Beta-Binomial prior so that at most 15 predictors are included; i.e., all models with more than 15 variables have zero probability and are not sampled. (This currently works properly only with `method = "MCMC"`; with `method = "BAS"` such models are still sampled but do not contribute to the estimates, so sampling is less efficient.)

Check convergence diagnostics to make sure we have run long enough.

```{r}
diagnostics(BAS.stack, type = "pip")
diagnostics(BAS.stack, type = "model")
```

We could run longer.

Typing the name of the object returns the posterior inclusion probabilities, while `summary` gives more details.

```{r}
BAS.stack
t(summary(BAS.stack))   # top 5 models
```

The image plot can help to visualize the top 20 models.

```{r}
image(BAS.stack)
```

Do `BAS` and `MC3.REG` reach the same conclusions? What if you use the truncated Poisson prior distribution? (See the comment in the `bas.lm` code above.)

Which points do you think are outliers?

How do variable selection and outlier selection work together? I.e., draw scatter plots of the data with the outliers labeled that highlight what is happening!
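As a possible starting point for that last exercise (not part of the original lab), the sketch below pulls the posterior inclusion probabilities of the outlier indicators out of `BAS.stack`, assuming the indicators enter the design matrix in their original order (cases 1-21 as predictors 5-25 after the intercept and the three original predictors), and labels the flagged cases in one scatter plot; the 0.5 cutoff is an arbitrary choice.

```{r outlier-scatter}
# Flag cases whose outlier indicator has posterior inclusion probability > 0.5.
# Indexing assumes probne0 is ordered: Intercept, 3 predictors, 21 indicators.
out.pip <- BAS.stack$probne0[5:25]
out.idx <- which(out.pip > 0.5)   # candidate outliers (cases 1-21)
out.idx

# Scatter plot of the data with candidate outliers highlighted and labeled.
plot(stackloss$Water.Temp, stackloss$stack.loss,
     xlab = "Water.Temp", ylab = "stack.loss",
     pch = ifelse(seq_len(nrow(stackloss)) %in% out.idx, 19, 1),
     col = ifelse(seq_len(nrow(stackloss)) %in% out.idx, "red", "black"))
text(stackloss$Water.Temp[out.idx], stackloss$stack.loss[out.idx],
     labels = out.idx, pos = 4, col = "red")
```

Repeating the plot for the other predictors (or after dropping the flagged cases and refitting) is one way to see how variable selection and outlier selection interact.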