---
title: "Subsetting"
author: "Colin Rundel"
date: "2019-01-24"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width=80
)

htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```

---

## Subsetting in General

R has three primary subsetting operators (`[`, `[[`, and `$`).

The behavior of these operators will depend on the object (class) they are being used with.

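
As a minimal sketch of how the three operators differ on a named list (the object `ops_demo` is a hypothetical example introduced here, not part of the original material):

```{r}
# Hypothetical example object, used only for illustration
ops_demo = list(a = 1:3, b = "text")

str( ops_demo["a"] )   # `[` returns a list containing the selected element
ops_demo[["a"]]        # `[[` returns the element itself (an integer vector)
ops_demo$a             # `$` accesses the element by name, like `[[`
```
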

--

In general there are 6 different data types that can be used to subset:

* Positive integers

* Negative integers

* Logical values

* Empty / NULL

* Zero

* Character values (names)

---

## Positive Integer subsetting

Returns elements at the given location(s) (note: R uses a 1-based indexing scheme). Non-integer values are truncated toward zero.

```{r}
x = c(1,4,7)
y = list(1,4,7)
```

.pull-left[.small[
```{r}
x[c(1,3)]
x[c(1,1)]
x[c(1.9,2.1)]
```
] ]

.pull-right[ .small[
```{r}
str( y[c(1,3)] )
str( y[c(1,1)] )
str( y[c(1.9,2.1)] )
```
] ]

---

## Negative Integer subsetting

Excludes elements at the given location(s). Positive and negative indices cannot be mixed in a single subset.

.pull-left[
```{r, error=TRUE}
x = c(1,4,7)
x[-1]
x[-c(1,3)]
x[c(-1,-1)]
```
]

.pull-right[
```{r, error=TRUE}
y = list(1,4,7)
str( y[-1] )
str( y[-c(1,3)] )
```
]

```{r error=TRUE}
x[c(-1,2)]
```

---

## Logical Value Subsetting

Returns elements that correspond to `TRUE` in the logical vector. The logical vector is expected to have the same length as the vector being subsetted; a shorter logical vector is recycled.

.pull-left[
```{r}
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
x[c(TRUE,FALSE)]
x[x %% 2 == 0]
```
]

.pull-right[
```{r, error=TRUE}
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
str( y[c(TRUE,FALSE)] )
```
]

--

```{r error=TRUE}
str( y[y %% 2 == 0] )
```

---

## Empty Subsetting

Returns the original vector.

```{r}
x = c(1,4,7)
x[]

y = list(1,4,7)
str(y[])
```

---

## Zero subsetting

Returns an empty vector of the same type as the vector being subsetted.

.pull-left[
```{r}
x = c(1,4,7)
x[0]

y = list(1,4,7)
str(y[0])
```
]

.pull-right[
```{r}
x[c(0,1)]
y[c(0,1)]
```
]

---

## Character subsetting

If the vector has names, selects elements whose names match the values in the character vector.

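
Because a name may appear more than once in the index, character subsetting can also act as a simple lookup table. A small sketch, where `lookup` and `codes` are hypothetical names introduced here for illustration:

```{r}
# Hypothetical lookup vector: names are the keys, elements are the values
lookup = c(m = "Male", f = "Female")
codes  = c("m", "f", "f", "m")

# Each value in `codes` selects the element of `lookup` with that name
unname( lookup[codes] )
```
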

.pull-left[
```{r}
x = c(a=1,b=4,c=7)
x["a"]
x[c("a","a")]
x[c("b","c")]
```
]

.pull-right[
```{r}
y = list(a=1,b=4,c=7)
str(y["a"])
str(y[c("a","a")])
str(y[c("b","c")])
```
]

---

## Out of bound subsetting

.pull-left[
```{r}
x = c(1,4,7)
x[4]
x["a"]
x[c(1,4)]
```
]

.pull-right[
```{r}
y = list(1,4,7)
str(y[4])
str(y["a"])
str(y[c(1,4)])
```
]

---

## Missing and NULL subsetting

.pull-left[
```{r}
x = c(1,4,7)
x[NA]
x[NULL]
x[c(1,NA)]
```
]

.pull-right[
```{r}
y = list(1,4,7)
str(y[NA])
str(y[NULL])
str(y[c(1,NA)])
```
]

---

## Atomic vectors - [ vs. [[

`[[` subsets like `[` except it can only subset a single value.

```{r, error=TRUE}
x = c(a=1,b=4,c=7)
x[[1]]
x[["a"]]
x[[1:2]]
```

---

## Generic Vectors - [ vs. [[

`[[` subsets a single value, but returns the value itself, not a list containing that value.

```{r, error=TRUE}
y = list(a=1,b=4,c=7)
y[2]
y[[2]]
y[["b"]]
y[[1:2]]
```

---

## Hadley's Analogy

```{r echo=FALSE, fig.align="center", out.width="80%"}
knitr::include_graphics("imgs/pepper_subset.png")
```

---

## [[ vs. $

`$` is equivalent to `[[` but it only works for named *lists* and it has a terrible default where it uses partial matching (`exact=FALSE`) to access the underlying value.

```{r, error=TRUE}
x = c("abc"=1, "def"=5)
x$abc

y = list("abc"=1, "def"=5)
y[["abc"]]
y$abc
y$d
```

---

## A common gotcha

Why does the following code not work?

```{r error=TRUE}
x = list(abc = 1:10, def = 10:1)
y = "abc"
x$y
```

--

$$ x\$y \Leftrightarrow x[["y"]] \ne x[[y]] $$

```{r}
x[[y]]
```

---

## Exercise 1

Below are 100 values,

```{r}
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21,
      8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1, 3, 4, 8, 5, 2, 8, 6, 18, 40,
      10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0,
      3, 13, 2, 8, 1, 6, 13, 7, 1, 10, 5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```

Write down how you would create a subset to accomplish each of the following:

* Select every third value starting at position 2 in `x`.

* Remove all values with an odd index (e.g. 1, 3, etc.).

* Remove every 4th value, but only if it is odd.

---
class: middle
count: false

# Subsetting Matrices, Data Frames, and Arrays

---

## Subsetting Matrices

```{r}
(x = matrix(1:6, nrow=2, ncol=3))
```

.pull-left[
```{r}
x[1,3]
x[1:2, 1:2]
```
]

.pull-right[
```{r}
x[, 1:2]
x[-1,-3]
```
]

---

## Preserving vs Simplifying

Most of the time, R's `[` subset operator is a *preserving* operator, in that the returned object will have the same type/class as the parent. Confusingly, when used with a matrix or array `[` becomes a *simplifying* operator (it does not preserve type); this behavior is controlled by the `drop` argument.

.pull-left[
```{r}
x[1, ]
x[1, , drop=TRUE]
x[1, , drop=FALSE]
```
]

.pull-right[
```{r}
str(x[1, ])
str(x[1, , drop=TRUE])
str(x[1, , drop=FALSE])
```
]

---

## Factor Subsetting

```{r}
(x = factor(c("BS", "MS", "PhD", "MS")))
x[1:2]
x[1:2, drop=TRUE]
```

---

## Data Frame Subsetting

If provided with a single index, data frames assume you want to subset a column or columns; if provided with two indices (rows and columns), the data frame is treated like a matrix.

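
A rough sketch of that distinction, using a hypothetical data frame `demo_df` that is not part of the original example: a single index picks columns as it would for a list, while a row and column pair behaves like matrix subsetting.

```{r}
# Hypothetical data frame, used only for illustration
demo_df = data.frame(id = 1:3, score = c(90, 75, 82))

str( demo_df["score"] )               # one index: a data frame holding the chosen column
str( demo_df[2, "score"] )            # two indices: row 2 of `score`, simplified to a number
str( demo_df[demo_df$score > 80, ] )  # logical row index, all columns kept
```
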

```{r}
df = data.frame(a = 1:2, b = 3:4)
df[1]
df[[1]]
df[, "a"]
```

---

```{r}
df[, "a"]
df[, "a", drop = FALSE]
df[1,]
df[c("a","b","a")]
```

---

## Tibble Subsetting

As we mentioned last time when introducing tibbles, one of the design principles is that tibbles are lazy: they don't do anything unless explicitly asked to. In this case this means that `[` will not simplify unless you specify `drop=TRUE`.

.small[
```{r}
library(tibble)
tbl = tibble(a = 1:2, b = 3:4)
```

```{r}
tbl[1]
tbl[[1]]
tbl[, "a"]
```
]

---

.small[
```{r}
tbl[, "a"]
tbl[, "a", drop = TRUE]
tbl[1,]
tbl[c("a","b","a")]
```
]

---

## Preserving vs Simplifying Subsets

Type             |  Simplifying             |  Preserving
:----------------|:-------------------------|:-----------------------------------------------------
Atomic Vector    |                          |  `x[[1]]` <br/> `x[1]`
List             |  `x[[1]]`                |  `x[1]`
Matrix / Array   |  `x[[1]]` <br/> `x[1, ]` <br/> `x[, 1]` |  `x[1, , drop=FALSE]` <br/> `x[, 1, drop=FALSE]`
Factor           |  `x[1:4, drop=TRUE]`     |  `x[1:4]` <br/> `x[[1]]`
Data frame       |  `x[, 1]` <br/> `x[[1]]` |  `x[, 1, drop=FALSE]` <br/> `x[1]`
Tibble           |  `x[, 1, drop=TRUE]` <br/> `x[[1]]` |  `x[, 1]` <br/> `x[1]`

---
class: middle
count: false

# Subsetting and assignment

---

## Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

```{r}
x = c(1, 4, 7)
```

```{r}
x[2] = 2
x

x[x %% 2 != 0] = x[x %% 2 != 0] + 1
x

x[c(1,1)] = c(2,3)
x
```

---

.pull-left[
```{r}
x = 1:6
x[c(2,NA)] = 1
x
```

```{r}
x = 1:6
x[c(TRUE,NA)] = 1
x
```
]

.pull-right[
```{r}
x = 1:6
x[c(-1,-3)] = 3
x
```

```{r}
x = 1:6
x[] = 6:1
x
```
]

---

## Deleting list (df) elements

```{r}
df = data.frame(a = 1:2, b = TRUE, c = c("A", "B"))
```

```{r}
df[["b"]] = NULL
str(df)
```

```{r}
df[,"c"] = NULL
str(df)
```

---

## Subsets of Subsets

```{r}
df = data.frame(a = c(5,1,NA,3))
```

```{r}
df$a[df$a == 5] = 0
df
```

```{r}
df[1][df[1] == 3] = 0
df
```

---

## Exercise 2

Some data providers choose to encode missing values using values like `-999`. Below is a sample data frame with missing values encoded in this way.

```{r}
d = data.frame(
  patient_id = c(1, 2, 3, 4, 5),
  age = c(32, 27, 56, 19, 65),
  bp = c(110, 100, 125, -999, -999),
  o2 = c(97, 95, -999, -999, 99)
)
```

* *Task 1* - Using the subsetting tools we've discussed, come up with code that will replace the `-999` values in the `bp` and `o2` columns with actual `NA` values. Save this as `d_na`.

* *Task 2* - Once you have created `d_na`, come up with code that translates it back into the original data frame `d`, i.e. replaces the `NA`s with `-999`.

---

## Acknowledgments

Above materials are derived in part from the following sources:

* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)

* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)