---
title: "Subsetting"
author: "Colin Rundel"
date: "2019-01-24"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true
```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
htmltools.dir.version = FALSE, # for blogdown
width=80
)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```
---
## Subsetting in General
R has three primary subsetting operators (`[`, `[[`, and `$`).
The behavior of these operators will depend on the object (class) they are being used with.
--
In general, six kinds of index values can be used to subset:
* Positive integers
* Negative integers
* Logical values
* Empty / NULL
* Zero
* Character values (names)
---
## Positive Integer subsetting
Returns the elements at the given location(s). Note that R uses 1-based indexing, and that non-integer positions are truncated toward zero.
```{r}
x = c(1,4,7)
y = list(1,4,7)
```
.pull-left[.small[
```{r}
x[c(1,3)]
x[c(1,1)]
x[c(1.9,2.1)]
```
] ]
.pull-right[ .small[
```{r}
str( y[c(1,3)] )
str( y[c(1,1)] )
str( y[c(1.9,2.1)] )
```
] ]
---
## Negative Integer subsetting
Excludes the elements at the given location(s). Positive and negative indices cannot be mixed in a single subset.
.pull-left[
```{r, error=TRUE}
x = c(1,4,7)
x[-1]
x[-c(1,3)]
x[c(-1,-1)]
```
]
.pull-right[
```{r, error=TRUE}
y = list(1,4,7)
str( y[-1] )
str( y[-c(1,3)] )
```
]
```{r error=TRUE}
x[c(-1,2)]
```
---
## Logical Value Subsetting
Returns the elements that correspond to `TRUE` in the logical vector. The logical vector is expected to have the same length as the vector being subset; a shorter one is recycled.
.pull-left[
```{r}
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
x[c(TRUE,FALSE)]
x[x %% 2 == 0]
```
]
.pull-right[
```{r, error=TRUE}
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
str( y[c(TRUE,FALSE)] )
```
]
--
```{r error=TRUE}
str( y[y %% 2 == 0] )
```
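---

The error on the previous slide arises because arithmetic operators such as `%%` are not defined for lists. Below is a minimal sketch of one possible workaround, assuming every element of the list is a length-one numeric value: build the logical vector element by element with `vapply()`, then subset with `[` as usual. The names `lst` and `is_even` are illustrative only.

```{r}
lst = list(1, 4, 7, 12)
# vapply() applies the test to each element and returns a plain logical vector
is_even = vapply(lst, function(v) v %% 2 == 0, logical(1))
is_even
str( lst[is_even] )
```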
---
## Empty Subsetting
Returns the original vector.
```{r}
x = c(1,4,7)
x[]
y = list(1,4,7)
str(y[])
```
---
## Zero subsetting
Returns an empty vector of the same type as the vector being subset. When mixed with other indices, zeros are ignored.
.pull-left[
```{r}
x = c(1,4,7)
x[0]
y = list(1,4,7)
str(y[0])
```
]
.pull-right[
```{r}
x[c(0,1)]
y[c(0,1)]
```
]
---
## Character subsetting
If the vector has names, select elements whose names correspond to the character vector.
.pull-left[
```{r}
x = c(a=1,b=4,c=7)
x["a"]
x[c("a","a")]
x[c("b","c")]
```
]
.pull-right[
```{r}
y = list(a=1,b=4,c=7)
str(y["a"])
str(y[c("a","a")])
str(y[c("b","c")])
```
]
---
## Out of bound subsetting
.pull-left[
```{r}
x = c(1,4,7)
x[4]
x["a"]
x[c(1,4)]
```
]
.pull-right[
```{r}
y = list(1,4,7)
str(y[4])
str(y["a"])
str(y[c(1,4)])
```
]
---
## Missing and NULL subsetting
.pull-left[
```{r}
x = c(1,4,7)
x[NA]
x[NULL]
x[c(1,NA)]
```
]
.pull-right[
```{r}
y = list(1,4,7)
str(y[NA])
str(y[NULL])
str(y[c(1,NA)])
```
]
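---

The difference between `x[NA]` and `x[c(1,NA)]` on the previous slide comes down to the type of the index: a bare `NA` is logical, so it is recycled to the length of the vector, while `c(1,NA)` is numeric and is treated as a set of positions. The sketch below makes this visible using the typed constant `NA_integer_`; the name `v` is illustrative only.

```{r}
v = c(1, 4, 7)
typeof(NA)              # "logical", so the index is recycled
length(v[NA])           # one NA per element of v
typeof(NA_integer_)     # "integer", so the index is a single position
length(v[NA_integer_])  # a single NA
```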
---
## Atomic vectors - [ vs. [[
`[[` subsets like `[` except it can only subset a single value.
```{r, error=TRUE}
x = c(a=1,b=4,c=7)
x[[1]]
x[["a"]]
x[[1:2]]
```
---
## Generic Vectors - [ vs. [[
`[[` subsets a single element and returns the element itself, not a list containing that element.
```{r, error=TRUE}
y = list(a=1,b=4,c=7)
y[2]
y[[2]]
y[["b"]]
y[[1:2]]
```
---
## Hadley's Analogy
```{r echo=FALSE, fig.align="center", out.width="80%"}
knitr::include_graphics("imgs/pepper_subset.png")
```
---
## [[ vs. $
`$` is equivalent to `[[`, but it only works for named *lists* and it has a terrible default: it uses partial matching (`exact=FALSE`) to access the underlying value.
```{r, error=TRUE}
x = c("abc"=1, "def"=5)
x$abc
y = list("abc"=1, "def"=5)
y[["abc"]]
y$abc
y$d
```
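---

For comparison, `[[` matches names exactly unless told otherwise through its `exact` argument. A short sketch of the three behaviors side by side, using an illustrative list named `lst`:

```{r}
lst = list(abc = 1, def = 5)
lst[["d"]]                 # exact matching is the default for [[, so NULL
lst[["d", exact = FALSE]]  # opt in to partial matching
lst$d                      # $ always matches partially
```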
---
## A common gotcha
Why does the following code not work?
```{r error=TRUE}
x = list(abc = 1:10, def = 10:1)
y = "abc"
x$y
```
--
$$ x$y \Leftrightarrow x[["y"]] \ne x[[y]] $$
```{r}
x[[y]]
```
---
## Exercise 1
Below are 100 values,
```{r}
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```
Write down how you would create a subset to accomplish each of the following:
* Select every third value starting at position 2 in `x`.
* Remove all values with an odd index (e.g. 1, 3, etc.).
* Remove every 4th value, but only if it is odd.
---
class: middle
count: false
# Subsetting Matrices, Data Frames, and Arrays
---
## Subsetting Matrices
```{r}
(x = matrix(1:6, nrow=2, ncol=3))
```
.pull-left[
```{r}
x[1,3]
x[1:2, 1:2]
```
]
.pull-right[
```{r}
x[, 1:2]
x[-1,-3]
```
]
---
## Preserving vs Simplifying
Most of the time, R's `[` subset operator is a *preserving* operator, in that the returned object will have the same type/class as the parent. Confusingly, when used with a matrix or array `[` becomes a *simplifying* operator (does not preserve type) - this behavior is controlled by the `drop` argument.
.pull-left[
```{r}
x[1, ]
x[1, , drop=TRUE]
x[1, , drop=FALSE]
```
]
.pull-right[
```{r}
str(x[1, ])
str(x[1, , drop=TRUE])
str(x[1, , drop=FALSE])
```
]
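---

The same simplification happens when a single column is selected. As a quick sketch, checking `dim()` shows whether the result is still a matrix; the name `m` is illustrative only.

```{r}
m = matrix(1:6, nrow = 2, ncol = 3)
dim(m[, 2])                # NULL: the result is a plain integer vector
dim(m[, 2, drop = FALSE])  # 2 1: still a matrix, with one column
```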
---
## Factor Subsetting
```{r}
(x = factor(c("BS", "MS", "PhD", "MS")))
x[1:2]
x[1:2, drop=TRUE]
```
---
## Data Frame Subsetting
If provided with a single index, a data frame assumes you want to subset a column or columns (list-style subsetting). If provided with two indices, the data frame is treated like a matrix.
```{r}
df = data.frame(a = 1:2, b = 3:4)
df[1]
df[[1]]
df[, "a"]
```
---
```{r}
df[, "a"]
df[, "a", drop = FALSE]
df[1,]
df[c("a","b","a")]
```
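---

One way to summarize the data frame examples is to check which forms still return a data frame. This is only a sketch, using an illustrative data frame named `dat`:

```{r}
dat = data.frame(a = 1:2, b = 3:4)
is.data.frame(dat["a"])                  # TRUE: single index, list-style
is.data.frame(dat[, "a"])                # FALSE: matrix-style simplifies to a vector
is.data.frame(dat[, "a", drop = FALSE])  # TRUE: simplification turned off
is.data.frame(dat[1, ])                  # TRUE: a single row stays a data frame
```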
---
## Tibble Subsetting
As we mentioned last time when introducing tibbles, one of their design principles is that tibbles are lazy: they don't do anything unless explicitly asked to. Here, that means they will not simplify unless you specify `drop=TRUE`.
.small[
```{r}
library(tibble)
tbl = tibble(a = 1:2, b = 3:4)
```
```{r}
tbl[1]
tbl[[1]]
tbl[, "a"]
```
]
---
.small[
```{r}
tbl[, "a"]
tbl[, "a", drop = TRUE]
tbl[1,]
tbl[c("a","b","a")]
```
]
---
## Preserving vs Simplifying Subsets
Type             |  Simplifying                                  |  Preserving
:----------------|:----------------------------------------------|:-----------------------------------------------------
Atomic Vector    |                                               |  `x[[1]]` <br/> `x[1]`
List             |  `x[[1]]`                                     |  `x[1]`
Matrix / Array   |  `x[[1]]` <br/> `x[1, ]` <br/> `x[, 1]`       |  `x[1, , drop=FALSE]` <br/> `x[, 1, drop=FALSE]`
Factor           |  `x[1:4, drop=TRUE]`                          |  `x[1:4]` <br/> `x[[1]]`
Data frame       |  `x[, 1]` <br/> `x[[1]]`                      |  `x[, 1, drop=FALSE]` <br/> `x[1]`
Tibble           |  `x[, 1, drop=TRUE]` <br/> `x[[1]]`           |  `x[, 1]` <br/> `x[1]`
---
class: middle
count: false
# Subsetting and assignment
---
## Subsetting and assignment
Subsets can also be used with assignment to update specific values within an object.
```{r}
x = c(1, 4, 7)
```
```{r}
x[2] = 2
x
x[x %% 2 != 0] = x[x %% 2 != 0] + 1
x
x[c(1,1)] = c(2,3)
x
```
---
.pull-left[
```{r}
x = 1:6
x[c(2,NA)] = 1
x
```
```{r}
x = 1:6
x[c(TRUE,NA)] = 1
x
```
]
.pull-right[
```{r}
x = 1:6
x[c(-1,-3)] = 3
x
```
```{r}
x = 1:6
x[] = 6:1
x
```
]
---
## Deleting list (df) elements
```{r}
df = data.frame(a = 1:2, b = TRUE, c = c("A", "B"))
```
```{r}
df[["b"]] = NULL
str(df)
```
```{r}
df[,"c"] = NULL
str(df)
```
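---

Because a data frame is a list of columns, the same idiom applies to plain lists: assigning `NULL` removes an element. The sketch below also shows one way to keep an element whose value is `NULL`, by assigning `list(NULL)` through `[`. The name `lst` is illustrative only.

```{r}
lst = list(a = 1, b = 2, c = 3)
lst[["b"]] = NULL      # removes element b
names(lst)
lst["c"] = list(NULL)  # keeps element c, with NULL as its value
str(lst)
```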
---
## Subsets of Subsets
```{r}
df = data.frame(a = c(5,1,NA,3))
```
```{r}
df$a[df$a == 5] = 0
df
```
```{r}
df[1][df[1] == 3] = 0
df
```
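---

Note that the comparisons on the previous slide involve an `NA`, and comparing with `NA` yields `NA` rather than `FALSE`. The assignments still work because the replacement is a single value. As a sketch of a more defensive variant, `which()` can be used to keep only the positions that are `TRUE`; the name `df2` is illustrative only.

```{r}
df2 = data.frame(a = c(5, 1, NA, 3))
df2$a == 5                    # comparison with NA yields NA, not FALSE
which(df2$a == 5)             # which() keeps only the TRUE positions
df2$a[which(df2$a == 5)] = 0
df2
```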
---
## Exercise 2
Some data providers choose to encode missing values using values like `-999`. Below is a sample data frame with missing values encoded in this way.
```{r}
d = data.frame(
patient_id = c(1, 2, 3, 4, 5),
age = c(32, 27, 56, 19, 65),
bp = c(110, 100, 125, -999, -999),
o2 = c(97, 95, -999, -999, 99)
)
```
* *Task 1* - using the subsetting tools we've discussed, come up with code that will replace the `-999` values in the `bp` and `o2` columns with actual `NA` values. Save this as `d_na`.
* *Task 2* - once you have created `d_na`, come up with code that translates it back into the original data frame `d`, i.e. replaces the `NA`s with `-999`.
---
## Acknowledgments
Above materials are derived in part from the following sources:
* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)
* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)