---
title: "Subsetting"
author: "Colin Rundel"
date: "2019-01-24"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width=80
)

htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```

---

## Subsetting in General

R has three primary subsetting operators (`[`, `[[`, and `$`).

The behavior of these operators will depend on the object (class) they are being used with.

<br/>

--

In general there are six kinds of values that can be used to subset:

* Positive integers

* Negative integers

* Logical values

* Empty / NULL

* Zero

* Character values (names)

---

## Positive Integer subsetting

Returns elements at the given location(s). Note that R uses 1-based indexing, and non-integer indices are truncated toward zero.

```{r}
x = c(1,4,7)
y = list(1,4,7)
```

.pull-left[.small[
```{r}
x[c(1,3)]
x[c(1,1)]
x[c(1.9,2.1)]
```
] ]

.pull-right[ .small[
```{r}
str( y[c(1,3)] )
str( y[c(1,1)] )
str( y[c(1.9,2.1)] )
```
] ]

---

## Negative Integer subsetting

Excludes elements at the given location(s). Positive and negative indices cannot be mixed in a single subset.

.pull-left[
```{r, error=TRUE}
x = c(1,4,7)
x[-1]
x[-c(1,3)]
x[c(-1,-1)]
```
]

.pull-right[
```{r, error=TRUE}
y = list(1,4,7)
str( y[-1] )
str( y[-c(1,3)] )
```
]

```{r error=TRUE}
x[c(-1,2)]
```
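
---

Negative indices are often combined with `which()`, but the combination has a well-known edge case: if nothing matches, `-which(...)` is `integer(0)` and the result is an empty vector rather than the full one. A minimal sketch (`vals` and `keep` are illustrative names, not part of the examples above):

```{r}
vals = c(1, 4, 7)

# Drop the first and last elements by position
vals[-c(1, length(vals))]

# Intended: remove values greater than 100 (there are none)
vals[-which(vals > 100)]   # -integer(0) is integer(0), which selects nothing

# A negated logical index does not have this problem
keep = !(vals > 100)
vals[keep]
```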

---

## Logical Value Subsetting

Returns elements that correspond to `TRUE` in the logical vector. The logical vector is expected to be the same length as the vector being subset; a shorter logical vector is recycled.

.pull-left[
```{r}
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
x[c(TRUE,FALSE)]
x[x %% 2 == 0]
```
]

.pull-right[
```{r, error=TRUE}
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
str( y[c(TRUE,FALSE)] )
```
]

--

```{r error=TRUE}
str( y[y %% 2 == 0] )
```
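
---

The last call fails because arithmetic operators such as `%%` are not defined for lists. One possible workaround, sketched here with illustrative names (`vals_list`, `is_even`, `vals_na`), is to build the logical vector element by element with `vapply()`. The second half shows how `NA` values in a logical index behave, and how `which()` avoids them.

```{r}
vals_list = list(1, 4, 7, 12)

# Build the logical index element-wise, then subset with `[`
is_even = vapply(vals_list, function(v) v %% 2 == 0, logical(1))
is_even
str( vals_list[is_even] )

# An NA in a logical index produces an NA in the result;
# which() keeps only the positions that are TRUE
vals_na = c(1, NA, 7, 12)
vals_na[vals_na > 5]
vals_na[which(vals_na > 5)]
```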

---

## Empty Subsetting

Returns the original vector.

```{r}
x = c(1,4,7)
x[]

y = list(1,4,7)
str(y[])
```

---

## Zero subsetting

Returns an empty vector of the same type as the vector being subset.

.pull-left[
```{r}
x = c(1,4,7)
x[0]

y = list(1,4,7)
str(y[0])
```
]

.pull-right[
```{r}
x[c(0,1)]

y[c(0,1)]
```
]

---

## Character subsetting

If the vector has names, select elements whose names correspond to the character vector.

.pull-left[
```{r}
x = c(a=1,b=4,c=7)
x["a"]
x[c("a","a")]
x[c("b","c")]
```
]

.pull-right[
```{r}
y = list(a=1,b=4,c=7)
str(y["a"])
str(y[c("a","a")])
str(y[c("b","c")])
```
]

---

## Out of bounds subsetting

.pull-left[
```{r}
x = c(1,4,7)
x[4]
x["a"]
x[c(1,4)]
```
]

.pull-right[
```{r}
y = list(1,4,7)
str(y[4])
str(y["a"])
str(y[c(1,4)])
```
]

---

## Missing and NULL subsetting

.pull-left[
```{r}
x = c(1,4,7)
x[NA]
x[NULL]
x[c(1,NA)]
```
]

.pull-right[
```{r}
y = list(1,4,7)
str(y[NA])
str(y[NULL])
str(y[c(1,NA)])
```
]

---

## Atomic vectors - [ vs. [[

`[[` subsets like `[` except it can only subset a single value. 

```{r, error=TRUE}
x = c(a=1,b=4,c=7)
x[[1]]
x[["a"]]
x[[1:2]]
```

---

## Generic Vectors - [ vs. [[

`[[` subsets a single value, but returns the value itself, not a list containing that value.

```{r, error=TRUE}
y = list(a=1,b=4,c=7)
y[2]
y[[2]]
y[["b"]]
y[[1:2]]
```
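
---

When `[[` is given a vector of length greater than one on a list, R indexes recursively: `y[[1:2]]` is read as `y[[1]][[2]]`, which is why it fails for the flat list above. A minimal sketch using a hypothetical nested list `nested`:

```{r}
nested = list(a = list(b = 1:3, c = "text"), d = 10)

# Recursive indexing: equivalent to nested[["a"]][["b"]]
nested[[c("a", "b")]]
nested[["a"]][["b"]]

# Positional version: first element, then its second element
nested[[c(1, 2)]]
```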

---

## Hadley's Analogy

```{r echo=FALSE, fig.align="center", out.width="80%"}
knitr::include_graphics("imgs/pepper_subset.png")
```

---

## [[ vs. $

`$` is equivalent to `[[`, but it only works for named *lists* (not atomic vectors), and it has a terrible default: it uses partial matching (`exact=FALSE`) to access the underlying value.

```{r, error=TRUE}
x = c("abc"=1, "def"=5)
x$abc
y = list("abc"=1, "def"=5)
y[["abc"]]
y$abc
y$d
```
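
---

Partial matching is easiest to see when two names share a prefix. The sketch below uses a hypothetical list `cfg`; it also shows that `[[` only matches partially when asked to. Setting `options(warnPartialMatchDollar = TRUE)` makes R warn whenever `$` falls back to a partial match.

```{r}
cfg = list(value = 1, verbose = TRUE)

cfg$val                       # unique partial match for "value"
cfg$v                         # ambiguous ("value" or "verbose"), gives NULL
cfg[["val"]]                  # `[[` matches exactly by default, gives NULL
cfg[["val", exact = FALSE]]   # opt in to partial matching
```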
---

## A common gotcha

Why does the following code not work?

```{r error=TRUE}
x = list(abc = 1:10, def = 10:1)
y = "abc"

x$y
```

--

$$ x\$y \Leftrightarrow x[["y"]] \ne x[[y]] $$

```{r}
x[[y]]
```
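
---

Because `$` never evaluates its right-hand side, `[[` is the operator to reach for whenever the name is stored in a variable, for example when looping over names. A minimal sketch (`scores`, `nm`, and `totals` are illustrative names):

```{r}
scores = list(abc = 1:10, def = c(2, 4, 6))

totals = numeric(0)
for (nm in names(scores)) {
  # scores$nm would look for an element literally named "nm" and return NULL
  totals[nm] = sum(scores[[nm]])
}
totals
```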

---

## Exercise 1

Below are 100 values:

```{r}
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```

Write down how you would create a subset to accomplish each of the following:

* Select every third value starting at position 2 in `x`.

* Remove all values with an odd index (e.g. 1, 3, etc.)

* Remove every 4th value, but only if it is odd.

---
class: middle
count: false

# Subsetting Matrices, Data Frames, and Arrays

---

## Subsetting Matrices

```{r}
(x = matrix(1:6, nrow=2, ncol=3))
```

.pull-left[
```{r}
x[1,3]
x[1:2, 1:2]
```
]

.pull-right[
```{r}
x[, 1:2]
x[-1,-3]
```
]

---

## Preserving vs Simplifying

Most of the time, R's `[` subset operator is a *preserving* operator, in that the returned object will have the same type/class as the parent. Confusingly, when used with a matrix or array, `[` becomes a *simplifying* operator: by default it drops any dimension of length one, so the result may no longer be a matrix. This behavior is controlled by the `drop` argument.

.pull-left[
```{r}
x[1, ]
x[1, , drop=TRUE]
x[1, , drop=FALSE]
```
]

.pull-right[
```{r}
str(x[1, ])
str(x[1, , drop=TRUE])
str(x[1, , drop=FALSE])
```
]
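
---

The default `drop=TRUE` matters most when the number of selected rows is only known at run time: code that expects a matrix silently receives a vector when exactly one row is selected. A minimal sketch (`m` and `rows` are illustrative names):

```{r}
m = matrix(1:6, nrow=2, ncol=3)

rows = which(m[, 1] > 1)    # happens to select a single row
dim(m[rows, ])              # NULL, the result was simplified to a vector
dim(m[rows, , drop=FALSE])  # still a matrix with one row

# A single cell keeps both dimensions only with drop=FALSE
dim(m[1, 2, drop=FALSE])
```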

---

## Factor Subsetting

```{r}
(x = factor(c("BS", "MS", "PhD", "MS")))
x[1:2]
x[1:2, drop=TRUE]
```

---

## Data Frame Subsetting

If provided with a single index, data frames assume you want to subset a column or columns (as with a list). If provided with two indices, the data frame is treated like a matrix.

```{r}
df = data.frame(a = 1:2, b = 3:4)
df[1]
df[[1]]
df[, "a"]
```

---

```{r}
df[, "a"]
df[, "a", drop = FALSE]
df[1,]
df[c("a","b","a")]
```

---

## Tibble Subsetting

As we mentioned last time when introducing tibbles, one of the design principles is that tibbles are lazy: they don't do anything unless explicitly asked. Here that means they will not simplify unless you specify `drop=TRUE`.

.small[
```{r}
library(tibble)
tbl = tibble(a = 1:2, b = 3:4)
```

```{r}
tbl[1]
tbl[[1]]
tbl[, "a"]
```
]

---

.small[
```{r}
tbl[, "a"]
tbl[, "a", drop = TRUE]
tbl[1,]
tbl[c("a","b","a")]
```
]

---

## Preserving vs Simplifying Subsets

Type             |  Simplifying             |  Preserving
:----------------|:-------------------------|:-----------------------------------------------------
Atomic Vector    |                          |  `x[[1]]` <br/> `x[1]`
List             |  `x[[1]]`                |  `x[1]`
Matrix / Array   |  `x[[1]]` <br/> `x[1, ]` <br/> `x[, 1]` |  `x[1, , drop=FALSE]` <br/> `x[, 1, drop=FALSE]`
Factor           |  `x[1:4, drop=TRUE]`     |  `x[1:4]` <br/> `x[[1]]`
Data frame       |  `x[, 1]` <br/> `x[[1]]` |  `x[, 1, drop=FALSE]` <br/> `x[1]`
Tibble           |  `x[, 1, drop=TRUE]` <br/> `x[[1]]` | `x[, 1]` <br/> `x[1]`
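
---

The base R rows of the table can be checked directly by asking whether the result still has the structure of the original object. A sketch with illustrative objects `lst`, `mat`, `fct`, and `dat` (tibbles are left out so that only base R is needed):

```{r}
lst = list(1, "a")
mat = matrix(1:4, nrow=2)
fct = factor(c("BS", "MS", "PhD", "MS"))
dat = data.frame(a = 1:2, b = 3:4)

# Simplifying: the result is no longer a list / matrix / data frame
is.list(lst[[1]])
is.matrix(mat[1, ])
is.data.frame(dat[, 1])

# Preserving: the result keeps the original structure
is.list(lst[1])
is.matrix(mat[1, , drop=FALSE])
is.data.frame(dat[, 1, drop=FALSE])

# Factors: drop=TRUE simplifies by discarding unused levels
nlevels(fct[1:2])
nlevels(fct[1:2, drop=TRUE])
```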

---
class: middle
count: false

# Subsetting and assignment

---

## Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

```{r}
x = c(1, 4, 7)
```

```{r}
x[2] = 2
x
x[x %% 2 != 0] = x[x %% 2 != 0] + 1
x
x[c(1,1)] = c(2,3)
x
```
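
---

Two further behaviors of subset assignment are worth knowing: assigning past the end of a vector extends it (padding the gap with `NA`), and assigning a value of a different type coerces the whole vector. A minimal sketch, using the illustrative names `v` and `w`:

```{r}
v = c(1, 4, 7)

# Assigning beyond the current length grows the vector, padding with NA
v[5] = 10
v

# Assigning a character value coerces every element to character
w = c(1, 4, 7)
w[2] = "four"
w
```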

---

.pull-left[
```{r}
x = 1:6
x[c(2,NA)] = 1
x
```

```{r}
x = 1:6
x[c(TRUE,NA)] = 1
x
```
]

.pull-right[
```{r}
x = 1:6
x[c(-1,-3)] = 3
x
```

```{r}
x = 1:6
x[] = 6:1
x
```
]

---

## Deleting list (df) elements

```{r}
df = data.frame(a = 1:2, b = TRUE, c = c("A", "B"))
```

```{r}
df[["b"]] = NULL
str(df)
```

```{r}
df[,"c"] = NULL
str(df)
```
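
---

The same idiom applies to plain lists, with one subtlety: since assigning `NULL` removes the element, storing an actual `NULL` requires wrapping it in `list()`. A minimal sketch with a hypothetical list `cfg_list`:

```{r}
cfg_list = list(a = 1, b = TRUE, c = "A")

# Assigning NULL with [[ (or $) removes the element
cfg_list[["b"]] = NULL
names(cfg_list)

# To keep the element but set its value to NULL, assign list(NULL) with [
cfg_list["c"] = list(NULL)
str(cfg_list)
```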

---

## Subsets of Subsets

```{r}
df = data.frame(a = c(5,1,NA,3))
```

```{r}
df$a[df$a == 5] = 0
df
```


```{r}
df[1][df[1] == 3] = 0
df
```

---

## Exercise 2

Some data providers choose to encode missing values using values like `-999`. Below is a sample data frame with missing values encoded in this way. 

```{r}
d = data.frame(
  patient_id = c(1, 2, 3, 4, 5),
  age = c(32, 27, 56, 19, 65),
  bp = c(110, 100, 125, -999, -999),
  o2 = c(97, 95, -999, -999, 99)
)
```

* *Task 1* - Using the subsetting tools we've discussed, come up with code that will replace the `-999` values in the `bp` and `o2` columns with actual `NA` values. Save this as `d_na`.

* *Task 2* - Once you have created `d_na`, come up with code that translates it back into the original data frame `d`, i.e. replaces the `NA`s with `-999`.

---

## Acknowledgments

Above materials are derived in part from the following sources:

* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)
* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)