---
title: "MCMC in Bayesian Variable Selection/Model Averaging"
output: beamer_presentation
author: "Merlise Clyde (Duke University)"
date: 11/14/2017
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Example with Diabetes data

First let's load the data (from Hoff) and coerce them to data frames.

```{r}
set.seed(8675309)
source("yX.diabetes.train.txt")
diabetes.train = as.data.frame(diabetes.train)
dim(diabetes.train)

source("yX.diabetes.test.txt")
diabetes.test = as.data.frame(diabetes.test)
colnames(diabetes.test)[1] = "y"
```

With 64 candidate predictors there are too many models ($2^{64} = `r 2^64`$) to enumerate!

## BMA using BAS

```{r runbas}
library(BAS)
diabetes.bas = bas.lm(y ~ ., data = diabetes.train,
                      prior = "hyper-g-n",
                      method = "MCMC",
                      n.models = 10000,
                      MCMC.iterations = 500000,
                      thin = 10,
                      initprobs = "eplogp")
```

* Runs an MCMC sampler that combines a random-walk Metropolis algorithm with a random swap step.
* The algorithm stops when the number of iterations exceeds `MCMC.iterations` or when `n.models` unique models have been visited.
* `thin = 10` saves every 10th model.
* `initprobs = "eplogp"` uses the p-values from OLS to compute an approximate Bayes factor for each variable; these are used internally to sort the variables, which improves memory usage.

## Checking Convergence

The function `diagnostics` compares

* estimates of posterior probabilities based on Monte Carlo frequencies (asymptotically unbiased) to
* estimates based on renormalizing the marginal likelihoods times prior probabilities over just the sampled models (Fisher consistent; recovers the exact probabilities under enumeration).

The two sets of estimates should be equal if the chain has converged.

## Convergence of Posterior Inclusion Probabilities

```{r}
diagnostics(diabetes.bas, type = "pip")
```

Close, but we could run longer!

## Convergence of model probabilities

```{r}
diagnostics(diabetes.bas, type = "model")
```

Clearly, not long enough.

## Rerun longer

```{r runlongbas, cache=TRUE}
diabetes.bas = bas.lm(y ~ ., data = diabetes.train,
                      prior = "hyper-g-n",
                      method = "MCMC",
                      n.models = 150000,
                      MCMC.iterations = 10^7,
                      thin = 64,
                      initprobs = "eplogp")
```

## Diagnostics

```{r}
diagnostics(diabetes.bas, type = "model")
```

## Highest Probability Model

One problem with MCMC and model selection is that the "best" model may be visited very few times.

```{r}
summary(diabetes.bas$freq)
```

## Comments

* I routinely use much larger numbers of MCMC iterations than I see in papers when looking at BMA or model selection (i.e., millions rather than, say, 10,000).
* Explore different numbers of models: does it make a difference for estimating marginal inclusion probabilities or posterior model probabilities? What about predictions under BMA? (A sketch of such a check follows on the next slide.)
* Check memory! `bas` objects can be large in R.
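## Sensitivity check: number of models (sketch)

A minimal sketch of the check suggested above, assuming the `probne0` component of a `bas` object holds the estimated marginal inclusion probabilities (as documented in `BAS`); not evaluated here to keep the deck fast.

```{r sensitivity, eval=FALSE}
# Rerun with a smaller model cache but the same MCMC effort
diabetes.bas2 = bas.lm(y ~ ., data = diabetes.train,
                       prior = "hyper-g-n", method = "MCMC",
                       n.models = 5000, MCMC.iterations = 10^7,
                       thin = 64, initprobs = "eplogp")

# Largest change in the estimated marginal inclusion probabilities
max(abs(diabetes.bas$probne0 - diabetes.bas2$probne0))

# Memory used by each fit
format(object.size(diabetes.bas), units = "Mb")
format(object.size(diabetes.bas2), units = "Mb")
```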
## What variables are included?

```{r}
image(diabetes.bas)
```

## Prediction

Use BMA for prediction and compute the MSE between the predicted and actual responses for the test data:

```{r}
pred.bas = predict(diabetes.bas, newdata = diabetes.test,
                   estimator = "BMA", se.fit = TRUE)
cv.summary.bas(pred.bas$fit, diabetes.test$y)
```

## Scoring

```{r}
BAS::cv.summary.bas
```

## Fit the Lasso

```{r lasso}
suppressMessages(library(lars))
diabetes.lasso = lars(as.matrix(diabetes.train[, -1]),
                      diabetes.train[, 1], type = "lasso")
```

## Plot of Lasso

```{r}
plot(diabetes.lasso)
```

## Best Cp Model for Lasso

```{r}
best = which.min(diabetes.lasso$Cp)
best
out.lasso = predict(diabetes.lasso, s = best[1],
                    as.matrix(diabetes.test[, -1]))
cv.summary.bas(out.lasso$fit, diabetes.test$y)
```

Close!

## Choice of estimator

* BMA: optimal for prediction under squared error loss (the Bayes estimator).
* Highest Probability Model (HPM): not optimal for prediction, but optimal for selection.
* Median Probability Model (MPM): includes the variables with PIP > 0.5; optimal for prediction under certain conditions (e.g., nested models).
* Best Predictive Model (BPM): the model whose predictions are closest to BMA under squared error loss.

## HPM

```{r}
pred.HPM = predict(diabetes.bas, newdata = diabetes.test,
                   estimator = "HPM", se.fit = TRUE)
cv.summary.bas(pred.HPM$fit, diabetes.test$y)
```

## MPM

```{r}
pred.MPM = predict(diabetes.bas, newdata = diabetes.test,
                   estimator = "MPM")
cv.summary.bas(pred.MPM$fit, diabetes.test$y)
```

## BPM

```{r}
pred.BPM = predict(diabetes.bas, newdata = diabetes.test,
                   estimator = "BPM")
cv.summary.bas(pred.BPM$fit, diabetes.test$y)
```

## Coverage

```{r CI-BAS, echo=TRUE}
CI.bas = confint(pred.bas)
mean(CI.bas[, 1] < diabetes.test$y & diabetes.test$y <= CI.bas[, 2])
```

## Credible Intervals

```{r}
plot(CI.bas)
points(1:nrow(diabetes.test), diabetes.test$y, col = "red", pch = 20)
```

## Coverage with HPM

```{r CI-HPM, echo=TRUE}
CI.HPM = confint(pred.HPM)
mean(CI.HPM[, 1] < diabetes.test$y & diabetes.test$y <= CI.HPM[, 2])
```

## Credible Intervals HPM

```{r, echo=TRUE}
plot(CI.HPM)
points(1:nrow(diabetes.test), diabetes.test$y, col = "red", pch = 20)
```

## Calculation of CI

Under BMA the predictive distribution is a mixture of the model-specific predictive distributions, weighted by the posterior model probabilities, so the interval cannot be read off a single t distribution; a simulation sketch follows the summary.

## Summary

* Prediction intervals under BMA account for model uncertainty.
* BMA provides multiple shrinkage.
* The choice of estimator depends on the loss function.
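## Appendix: simulating a BMA prediction interval (sketch)

A hedged illustration of the mixture calculation described under "Calculation of CI", not the `BAS` implementation: sample a model with probability equal to its posterior probability, then draw a response from that model's predictive distribution (approximated here by a normal; the exact predictive forms are heavier-tailed). The inputs `postprob`, `fit.m`, and `se.m` are hypothetical names for quantities a sampler would provide.

```{r bma-ci-sketch, eval=FALSE}
# postprob: posterior probabilities of the sampled models (length M)
# fit.m:    n x M matrix of model-specific predicted means
# se.m:     n x M matrix of model-specific predictive standard errors
sim.bma.interval = function(postprob, fit.m, se.m,
                            nsim = 10000, level = 0.95) {
  n = nrow(fit.m)
  draws = matrix(NA_real_, nsim, n)
  for (s in 1:nsim) {
    m = sample(length(postprob), 1, prob = postprob)          # draw a model
    draws[s, ] = rnorm(n, mean = fit.m[, m], sd = se.m[, m])  # draw y* given model
  }
  alpha = 1 - level
  # pointwise equal-tail interval from the mixture draws
  t(apply(draws, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
}
```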