class: center, middle, inverse, title-slide

# Subsetting
### Colin Rundel
### 2019-01-24

---
exclude: true

---

## Subsetting in General

R has three primary subsetting operators (`[`, `[[`, and `$`). The behavior of these operators depends on the class of the object they are used with.

<br/>

--

In general, six different kinds of values can be used to subset:

* Positive integers
* Negative integers
* Logical values
* Empty / NULL
* Zero
* Character values (names)

---

## Positive Integer subsetting

Returns elements at the given location(s) (Note - R uses a 1-based indexing scheme). Non-integer values are truncated toward zero.

```r
x = c(1,4,7)
y = list(1,4,7)
```

.pull-left[.small[
```r
x[c(1,3)]
```
```
## [1] 1 7
```
```r
x[c(1,1)]
```
```
## [1] 1 1
```
```r
x[c(1.9,2.1)]
```
```
## [1] 1 4
```
] ]

.pull-right[ .small[
```r
str( y[c(1,3)] )
```
```
## List of 2
##  $ : num 1
##  $ : num 7
```
```r
str( y[c(1,1)] )
```
```
## List of 2
##  $ : num 1
##  $ : num 1
```
```r
str( y[c(1.9,2.1)] )
```
```
## List of 2
##  $ : num 1
##  $ : num 4
```
] ]

---

## Negative Integer subsetting

Excludes elements at the given location(s). Negative and positive indices cannot be mixed.

.pull-left[
```r
x = c(1,4,7)
x[-1]
```
```
## [1] 4 7
```
```r
x[-c(1,3)]
```
```
## [1] 4
```
```r
x[c(-1,-1)]
```
```
## [1] 4 7
```
]

.pull-right[
```r
y = list(1,4,7)
str( y[-1] )
```
```
## List of 2
##  $ : num 4
##  $ : num 7
```
```r
str( y[-c(1,3)] )
```
```
## List of 1
##  $ : num 4
```
]

```r
x[c(-1,2)]
```
```
## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
```

---

## Logical Value Subsetting

Returns elements that correspond to `TRUE` in the logical vector. The logical vector is expected to be the same length as the vector being subset; a shorter logical vector is recycled.
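As a rough illustration of the length rule, the sketch below uses a hypothetical vector `v` (not one of the slide objects) to show what happens when the logical index is shorter or longer than the vector, and how `which()` gives the equivalent positive integer index.

```r
# Hypothetical example vector, for illustration only
v = c(1, 4, 7, 12)

# Shorter logical index: recycled to the length of v
v[c(TRUE, FALSE)]
## [1] 1 7

# Longer logical index: a TRUE past the end of v yields NA
v[c(TRUE, FALSE, FALSE, FALSE, TRUE)]
## [1]  1 NA

# which() converts a logical index into positive integers
which(v %% 2 == 0)
## [1] 2 4
```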
.pull-left[
```r
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
```
```
## [1]  1  4 12
```
```r
x[c(TRUE,FALSE)]
```
```
## [1] 1 7
```
```r
x[x %% 2 == 0]
```
```
## [1]  4 12
```
]

.pull-right[
```r
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
```
```
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 12
```
```r
str( y[c(TRUE,FALSE)] )
```
```
## List of 2
##  $ : num 1
##  $ : num 7
```
]

--

```r
str( y[y %% 2 == 0] )
```
```
## Error in y%%2: non-numeric argument to binary operator
```

---

## Empty Subsetting

Returns the original vector.

```r
x = c(1,4,7)
x[]
```
```
## [1] 1 4 7
```
```r
y = list(1,4,7)
str(y[])
```
```
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 7
```

---

## Zero subsetting

Returns an empty vector of the same type as the vector being subset. Zeros mixed with other indices are ignored.

.pull-left[
```r
x = c(1,4,7)
x[0]
```
```
## numeric(0)
```
```r
y = list(1,4,7)
str(y[0])
```
```
##  list()
```
]

.pull-right[
```r
x[c(0,1)]
```
```
## [1] 1
```
```r
y[c(0,1)]
```
```
## [[1]]
## [1] 1
```
]

---

## Character subsetting

If the vector has names, selects the elements whose names match the values in the character vector.
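Two details of name matching are easy to miss, so here is a small sketch using a hypothetical vector `v` with a duplicated name (not one of the slide objects): `[` matches names exactly, and only the first element with a matching name is returned.

```r
# Hypothetical named vector with a duplicated name, for illustration only
v = c(abc = 1, def = 4, abc = 7)

# Only the first element named "abc" is returned
unname(v["abc"])
## [1] 1

# `[` matches names exactly, so a prefix of a name finds nothing
unname(v["ab"])
## [1] NA

# To select every element with a given name, use a logical index instead
unname(v[names(v) == "abc"])
## [1] 1 7
```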
.pull-left[
```r
x = c(a=1,b=4,c=7)
x["a"]
```
```
## a
## 1
```
```r
x[c("a","a")]
```
```
## a a
## 1 1
```
```r
x[c("b","c")]
```
```
## b c
## 4 7
```
]

.pull-right[
```r
y = list(a=1,b=4,c=7)
str(y["a"])
```
```
## List of 1
##  $ a: num 1
```
```r
str(y[c("a","a")])
```
```
## List of 2
##  $ a: num 1
##  $ a: num 1
```
```r
str(y[c("b","c")])
```
```
## List of 2
##  $ b: num 4
##  $ c: num 7
```
]

---

## Out of bounds subsetting

.pull-left[
```r
x = c(1,4,7)
x[4]
```
```
## [1] NA
```
```r
x["a"]
```
```
## [1] NA
```
```r
x[c(1,4)]
```
```
## [1]  1 NA
```
]

.pull-right[
```r
y = list(1,4,7)
str(y[4])
```
```
## List of 1
##  $ : NULL
```
```r
str(y["a"])
```
```
## List of 1
##  $ : NULL
```
```r
str(y[c(1,4)])
```
```
## List of 2
##  $ : num 1
##  $ : NULL
```
]

---

## Missing and NULL subsetting

.pull-left[
```r
x = c(1,4,7)
x[NA]
```
```
## [1] NA NA NA
```
```r
x[NULL]
```
```
## numeric(0)
```
```r
x[c(1,NA)]
```
```
## [1]  1 NA
```
]

.pull-right[
```r
y = list(1,4,7)
str(y[NA])
```
```
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : NULL
```
```r
str(y[NULL])
```
```
##  list()
```
```r
str(y[c(1,NA)])
```
```
## List of 2
##  $ : num 1
##  $ : NULL
```
]

---

## Atomic vectors - [ vs. [[

`[[` subsets like `[` except it can only subset a single value.

```r
x = c(a=1,b=4,c=7)
x[[1]]
```
```
## [1] 1
```
```r
x[["a"]]
```
```
## [1] 1
```
```r
x[[1:2]]
```
```
## Error in x[[1:2]]: attempt to select more than one element in vectorIndex
```

---

## Generic Vectors - [ vs. [[

`[[` subsets a single value, but returns the value itself - not a list containing that value.

```r
y = list(a=1,b=4,c=7)
y[2]
```
```
## $b
## [1] 4
```
```r
y[[2]]
```
```
## [1] 4
```
```r
y[["b"]]
```
```
## [1] 4
```
```r
y[[1:2]]
```
```
## Error in y[[1:2]]: subscript out of bounds
```

---

## Hadley's Analogy

<img src="imgs/pepper_subset.png" width="2617" style="display: block; margin: auto;" />

---

## [[ vs. $

`$` is equivalent to `[[` but it only works for named *lists*, and it has a terrible default: it uses partial matching (`exact=FALSE`) to access the underlying value.

```r
x = c("abc"=1, "def"=5)
x$abc
```
```
## Error in x$abc: $ operator is invalid for atomic vectors
```
```r
y = list("abc"=1, "def"=5)
y[["abc"]]
```
```
## [1] 1
```
```r
y$abc
```
```
## [1] 1
```
```r
y$d
```
```
## [1] 5
```

---

## A common gotcha

Why does the following code not work?

```r
x = list(abc = 1:10, def = 10:1)
y = "abc"
x$y
```
```
## NULL
```

--

$$ x\$y \Leftrightarrow x[["y"]] \ne x[[y]] $$

```r
x[[y]]
```
```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

---

## Exercise 1

Below are 100 values,

```r
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21,
      8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1, 3, 4, 8, 5, 2, 8, 6, 18, 40,
      10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0,
      3, 13, 2, 8, 1, 6, 13, 7, 1, 10, 5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```

write down how you would create a subset to accomplish each of the following:

* Select every third value starting at position 2 in `x`.
* Remove all values with an odd index (e.g. 1, 3, etc.)
* Remove every 4th value, but only if it is odd.

---
class: middle
count: false

# Subsetting Matrices, Data Frames, and Arrays

---

## Subsetting Matrices

```r
(x = matrix(1:6, nrow=2, ncol=3))
```
```
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
```

.pull-left[
```r
x[1,3]
```
```
## [1] 5
```
```r
x[1:2, 1:2]
```
```
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
```
]

.pull-right[
```r
x[, 1:2]
```
```
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
```
```r
x[-1,-3]
```
```
## [1] 2 4
```
]

---

## Preserving vs Simplifying

Most of the time, R's `[` subset operator is a *preserving* operator, in that the returned object will have the same type/class as the parent.
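As a quick check of this claim, the sketch below uses small hypothetical objects (`v`, `l`, and `d` are illustrative names, not slide objects) and compares the class returned by `[` with that returned by `[[`.

```r
# Hypothetical example objects, for illustration only
v = c(a = 1, b = 4)
l = list(a = 1, b = "x")
d = data.frame(a = 1:2, b = 3:4)

# `[` preserves the container type
class(v[1])    # "numeric"    - still an atomic vector
class(l[1])    # "list"       - still a list
class(d[1])    # "data.frame" - still a data frame

# `[[` simplifies: it returns the element itself
class(l[[1]])  # "numeric"
class(d[[1]])  # "integer"
```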
Confusingly, when used with a matrix or array, `[` becomes a *simplifying* operator (it does not preserve type) - this behavior is controlled by the `drop` argument, which defaults to `TRUE`.

.pull-left[
```r
x[1, ]
```
```
## [1] 1 3 5
```
```r
x[1, , drop=TRUE]
```
```
## [1] 1 3 5
```
```r
x[1, , drop=FALSE]
```
```
##      [,1] [,2] [,3]
## [1,]    1    3    5
```
]

.pull-right[
```r
str(x[1, ])
```
```
##  int [1:3] 1 3 5
```
```r
str(x[1, , drop=TRUE])
```
```
##  int [1:3] 1 3 5
```
```r
str(x[1, , drop=FALSE])
```
```
##  int [1, 1:3] 1 3 5
```
]

---

## Factor Subsetting

```r
(x = factor(c("BS", "MS", "PhD", "MS")))
```
```
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
```
```r
x[1:2]
```
```
## [1] BS MS
## Levels: BS MS PhD
```
```r
x[1:2, drop=TRUE]
```
```
## [1] BS MS
## Levels: BS MS
```

---

## Data Frame Subsetting

If provided with a single index, data frames assume you want to subset a column or columns; if provided with two indices (rows and columns), the data frame is treated like a matrix.

```r
df = data.frame(a = 1:2, b = 3:4)
df[1]
```
```
##   a
## 1 1
## 2 2
```
```r
df[[1]]
```
```
## [1] 1 2
```
```r
df[, "a"]
```
```
## [1] 1 2
```

---

```r
df[, "a"]
```
```
## [1] 1 2
```
```r
df[, "a", drop = FALSE]
```
```
##   a
## 1 1
## 2 2
```
```r
df[1,]
```
```
##   a b
## 1 1 3
```
```r
df[c("a","b","a")]
```
```
##   a b a.1
## 1 1 3   1
## 2 2 4   2
```

---

## Tibble Subsetting

As we mentioned last time when introducing tibbles, one of the design principles is that tibbles are lazy - they don't do anything unless explicitly asked. In this case this means that they will not simplify unless you specify `drop=TRUE`.
.small[
```r
library(tibble)
tbl = tibble(a = 1:2, b = 3:4)
```
```r
tbl[1]
```
```
## # A tibble: 2 x 1
##       a
##   <int>
## 1     1
## 2     2
```
```r
tbl[[1]]
```
```
## [1] 1 2
```
```r
tbl[, "a"]
```
```
## # A tibble: 2 x 1
##       a
##   <int>
## 1     1
## 2     2
```
]

---

.small[
```r
tbl[, "a"]
```
```
## # A tibble: 2 x 1
##       a
##   <int>
## 1     1
## 2     2
```
```r
tbl[, "a", drop = TRUE]
```
```
## [1] 1 2
```
```r
tbl[1,]
```
```
## # A tibble: 1 x 2
##       a     b
##   <int> <int>
## 1     1     3
```
```r
tbl[c("a","b","a")]
```
```
## # A tibble: 2 x 3
##       a     b     a
##   <int> <int> <int>
## 1     1     3     1
## 2     2     4     2
```
]

---

## Preserving vs Simplifying Subsets

Type             | Simplifying                             | Preserving
:----------------|:----------------------------------------|:------------------------------------------------
Atomic Vector    |                                         | `x[[1]]` <br/> `x[1]`
List             | `x[[1]]`                                | `x[1]`
Matrix / Array   | `x[[1]]` <br/> `x[1, ]` <br/> `x[, 1]`  | `x[1, , drop=FALSE]` <br/> `x[, 1, drop=FALSE]`
Factor           | `x[1:4, drop=TRUE]`                     | `x[1:4]` <br/> `x[[1]]`
Data frame       | `x[, 1]` <br/> `x[[1]]`                 | `x[, 1, drop=FALSE]` <br/> `x[1]`
Tibble           | `x[, 1, drop=TRUE]` <br/> `x[[1]]`      | `x[, 1]` <br/> `x[1]`

---
class: middle
count: false

# Subsetting and assignment

---

## Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

```r
x = c(1, 4, 7)
```
```r
x[2] = 2
x
```
```
## [1] 1 2 7
```
```r
x[x %% 2 != 0] = x[x %% 2 != 0] + 1
x
```
```
## [1] 2 2 8
```
```r
x[c(1,1)] = c(2,3)
x
```
```
## [1] 3 2 8
```

---

.pull-left[
```r
x = 1:6
x[c(2,NA)] = 1
x
```
```
## [1] 1 1 3 4 5 6
```
```r
x = 1:6
x[c(TRUE,NA)] = 1
x
```
```
## [1] 1 2 1 4 1 6
```
]

.pull-right[
```r
x = 1:6
x[c(-1,-3)] = 3
x
```
```
## [1] 1 3 3 3 3 3
```
```r
x = 1:6
x[] = 6:1
x
```
```
## [1] 6 5 4 3 2 1
```
]

---

## Deleting list (df) elements

```r
df = data.frame(a = 1:2, b = TRUE, c = c("A", "B"))
```
```r
df[["b"]] = NULL
str(df)
```
```
## 'data.frame': 2 obs. of 2 variables:
##  $ a: int 1 2
##  $ c: Factor w/ 2 levels "A","B": 1 2
```
```r
df[,"c"] = NULL
str(df)
```
```
## 'data.frame': 2 obs. of 1 variable:
##  $ a: int 1 2
```

---

## Subsets of Subsets

```r
df = data.frame(a = c(5,1,NA,3))
```
```r
df$a[df$a == 5] = 0
df
```
```
##    a
## 1  0
## 2  1
## 3 NA
## 4  3
```
```r
df[1][df[1] == 3] = 0
df
```
```
##    a
## 1  0
## 2  1
## 3 NA
## 4  0
```

---

## Exercise 2

Some data providers choose to encode missing values using values like `-999`. Below is a sample data frame with missing values encoded in this way.

```r
d = data.frame(
  patient_id = c(1, 2, 3, 4, 5),
  age = c(32, 27, 56, 19, 65),
  bp = c(110, 100, 125, -999, -999),
  o2 = c(97, 95, -999, -999, 99)
)
```

* *Task 1* - using the subsetting tools we've discussed, come up with code that will replace the `-999` values in the `bp` and `o2` columns with actual `NA` values. Save this as `d_na`.

* *Task 2* - once you have created `d_na`, come up with code that translates it back into the original data frame `d`, i.e. replaces the `NA`s with `-999`.

---

## Acknowledgments

Above materials are derived in part from the following sources:

* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)
* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)