---
title: "Data structures and subsetting"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "05-18-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---

```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = TRUE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
```

## Supplementary materials

Companion videos

- [Attributes](https://warpwire.duke.edu/w/YcUDAA/)
- [Data frames](https://warpwire.duke.edu/w/Y8UDAA/)
- [Subsetting atomic vectors and lists](https://warpwire.duke.edu/w/ZcUDAA/)

Additional resources

- [Sections 3.3 - 3.4](https://adv-r.hadley.nz/vectors-chap.html#attributes) Advanced R
- [Chapter 4](https://adv-r.hadley.nz/subsetting.html) Advanced R

---
class: inverse, center, middle

# Recall

---

## Atomic vector creation

We can use functions such as `c()`, `vector()`, and `:` to create atomic vectors.

```{r}
c(5, 10, pi, 0, -sqrt(3))
vector(mode = "character", length = 4)
vector(mode = "integer", length = 3)
-10:-3
```

---

## Generic vector creation

Function `list()` allows us to create a generic vector.

```{r}
x <- list(
  a = -100:100,
  b = list(lower = letters, upper = LETTERS),
  cars_data = cars
)
str(x)
```

---
class: inverse, center, middle

# Attributes

---

## Data structures

You may have heard of factors, matrices, arrays, and date-times. These are just atomic vectors with special attributes.

- Attributes attach metadata to an object.

--

- Function `attr()` can retrieve and modify a single attribute.

```{r eval=FALSE}
attr(x, which)          # get attribute
attr(x, which) <- value # set / modify attribute
```

--

- Function `attributes()` can retrieve and set attributes en masse.

```{r eval=FALSE}
attributes(x)          # get attributes
attributes(x) <- value # set / modify attributes
```

---

## Attribute: `names`

Get or set the names of an object.

**One option:**

```{r}
x <- 1:4
attributes(x)
attr(x = x, which = "names") <- c("a", "b", "c", "d")
attributes(x)
x
```

---

**Another option:**

```{r}
a <- 1:4
names(a) <- c("a", "b", "c", "d")
attributes(a)
a
```

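
As a quick check, the sketch below confirms that the two approaches produce the same object. The object names `by_attr` and `by_names` are illustrative only, and `unname()` is a base R helper that does not appear elsewhere in these slides.

```r
# Set names through attr()
by_attr <- 1:4
attr(by_attr, "names") <- c("a", "b", "c", "d")

# Set names through names()
by_names <- 1:4
names(by_names) <- c("a", "b", "c", "d")

# Both objects hold the same values and the same names attribute
identical(by_attr, by_names)
#> [1] TRUE

# names() reads back what attr() stored
names(by_attr)
#> [1] "a" "b" "c" "d"

# unname() drops the names attribute, leaving no attributes at all
attributes(unname(by_names))
#> NULL
```
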
Either method is okay to use.

---

## Attribute: `dim`

Get or set the dimension of an object.

```{r}
z <- 1:9
z
attr(x = z, which = "dim") <- c(3, 3)
attributes(z)
z
```

--

We have a 3 x 3 matrix.

---

```{r}
y <- matrix(z, nrow = 3, ncol = 3)
attributes(y)
y
```

---

## Exercise

Create a 3 x 3 x 2 array using the `dim` attribute with the vector below.

```{r}
x <- c(5, 1, 5, 5, 1, 1, 5, 3, 2, 3, 2, 6, 4, 4, 1, 2, 1, 3)
```

Try to create the same array using function `array()`. What do you notice about how the array object is populated?

???

## Solution

.tiny[
```{r}
x <- c(5, 1, 5, 5, 1, 1, 5, 3, 2, 3, 2, 6, 4, 4, 1, 2, 1, 3)

attr(x = x, which = "dim") <- c(3, 3, 2)
x
attributes(x)
```

```{r}
array(x, dim = c(3, 3, 2))
```
]

---

## Factors

Factors are built on top of integer vectors with two attributes: `class` and `levels`. Factors are how R stores and represents categorical data.

A quick way to create a categorical variable as a factor is with function `factor()`.

```{r}
x <- factor(c("walk", "single", "double", "triple", "home run"))
x
```

--

```{r}
typeof(x)
attributes(x)
```

---

## Ordered factors

To induce an ordering we can use function `ordered()` as opposed to `factor()`.

```{r}
y <- ordered(c("walk", "single", "double", "triple", "home run"),
             levels = c("walk", "single", "double", "triple", "home run"))
y
```

--

```{r}
attributes(y)
str(y)
```

---

## Exercise

Create a factor vector based on the vector of airport codes below. Try to do it without using function `factor()`.

```{r}
airports <- c("RDU", "ABE", "DTW", "GRR", "RDU", "GRR",
              "GNV", "JFK", "JFK", "SFO", "DTW")
```

Assume all the possible levels are

```{r eval=FALSE}
c("RDU", "ABE", "DTW", "GRR", "GNV", "JFK", "SFO")
```

*Hint*: Think about what type of object factors are built on.

What if the possible levels are

```{r eval=FALSE}
c("RDU", "ABE", "DTW", "GRR", "GNV", "JFK", "SFO", "GSO", "ORD", "PHL")
```

???

## Solution

.tiny[
```{r}
z <- as.integer(c(1,2,3,4,1,4,5,6,6,7,3))

attr(x = z, which = "levels") <- c("RDU", "ABE", "DTW", "GRR",
                                   "GNV", "JFK", "SFO")
attr(x = z, which = "class") <- "factor"

z
attributes(z)
```
]

---

## Matrices and arrays

- Homogeneous in their type.

- Matrices are populated in column-major order (use the `byrow` argument to change this).

- Arrays can have one, two, or more dimensions.

---

## Data frames

Data frames are built on top of lists with attributes: `names`, `row.names`, and `class`. Here the class is `data.frame`.

```{r}
typeof(longley)
attributes(longley)
```

--

Here `names` refers to variable names.

---

## Data frame characteristics

- Data frames can be heterogeneous across columns.

- Data frames are rectangular in structure (not always tidy).

- They have column names and row names.

- Data frames can be subset by name or position.

---

## Data frame creation by setting attributes

Start with a list

```{r}
x <- list(c("48501", "48507", "48505"), c(3, 4, 21), c(2, 1, 2))
str(x)
```

--

Add attributes

```{r}
attributes(x) <- list(class = "data.frame",
                      names = c("zip", "lead_value", "time"),
                      row.names = 1:3)
```

---

Then we have a data frame

```{r}
x
str(x)
```

Of course, we could have used function `data.frame()` to create our data frame object. There is also function `tibble::tibble()`, which creates a tibble object. A tibble is similar to a data frame but has two additional class components.

---

## Character vectors and data frames

```{r}
y <- data.frame(zip = c("48501", "48507", "48505"),
                lead_value = c(3, 4, 21),
                time = c(2, 1, 2))
str(y)
```

Why are my strings (characters) factors?

--

```{r}
y <- data.frame(zip = c("48501", "48507", "48505"),
                lead_value = c(3, 4, 21),
                time = c(2, 1, 2),
                stringsAsFactors = FALSE)
str(y)
```

In R versions before 4.0.0, `data.frame()` defaults to `stringsAsFactors = TRUE`; since R 4.0.0 the default is `FALSE`.

---

## Length coercion

Length coercion (recycling) is stricter for data frames than for atomic vectors.

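
For contrast with data frames, here is a small sketch of how recycling behaves in arithmetic on atomic vectors, where a length mismatch produces a warning rather than an error. The vectors and the names `warned` and `res` are made up for illustration.

```r
# Lengths 6 and 3: the shorter vector is recycled twice, silently
1:6 + c(10, 20, 30)
#> [1] 11 22 33 14 25 36

# Lengths 6 and 4: 6 is not a multiple of 4, so R raises a warning
warned <- tryCatch({
  1:6 + c(10, 20, 30, 40)
  FALSE
}, warning = function(w) TRUE)
warned
#> [1] TRUE

# The result is still computed; the warning is suppressed here only
# to keep the output tidy
res <- suppressWarnings(1:6 + c(10, 20, 30, 40))
res
#> [1] 11 22 33 44 15 26
```
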
.pull-left[
```{r}
data.frame(x = 1:3, y = c("a"))
```
]

.pull-right[
```{r eval=FALSE}
data.frame(x = 1:3, y = c("a","b"))
```

```
#> Error in
#>   data.frame(x = 1:3,
#>   y = c("a", "b")) :
#>   arguments imply differing number of
#>   rows: 3, 2
```
]

If the length of the longest vector is not a whole multiple of a shorter vector's length, an error occurs.

--

What do you think will happen here?

```{r eval=FALSE}
data.frame(num = 1:6,
           treatment = c(0, 10, 20),
           type = c("a", "b"))
```

---

## Summary

.small-text[

| Data Structure | Built On                                            | Attribute(s)                  | Quick creation                 |
|----------------|-----------------------------------------------------|-------------------------------|--------------------------------|
| Matrix, Array  | Atomic vector                                       | `dim`                         | `matrix()`, `array()`          |
| Factor         | Atomic integer vector                               | `class`, `levels`             | `factor()`, `ordered()`        |
| Date           | Atomic double vector                                | `class`                       | `as.Date()`                    |
| Date-times     | Atomic double vector (`POSIXct`); list (`POSIXlt`)  | `class`                       | `as.POSIXct()`, `as.POSIXlt()` |
| Data frame     | List                                                | `class`, `names`, `row.names` | `data.frame()`                 |

]

---
class: inverse, center, middle

# Subsetting

---

## Subsetting techniques

R has three operators (functions) for subsetting:

1. `[`
2. `[[`
3. `$`

Which one you use will depend on the object you are working with, its attributes, and what you want as a result.

We can subset with

- integers
- logicals
- `NULL`, `NA`
- character values

---

## Numeric (positive) subsetting

**Indexing begins at 1, not 0.**

.tiny[
```{r}
x <- c("NC", "SC", "VA", "TN")
y <- list(states = x, rank = 1:4, message = "")
```
]

--

.tiny.pull-left[

**Atomic vector**

```{r}
x[1]
x[c(1, 3)]
x[c(1:5)]
x[c(2.2, 3.9)]
```
]

.tiny.pull-right[

**List**

```{r}
str(y[1])
str(y[c(1, 3)])
str(y[c(1:4)])
```
]

---

## Numeric (negative) subsetting

.tiny[
```{r}
x <- c("NC", "SC", "VA", "TN")
y <- list(states = x, rank = 1:4, message = "")
```
]

.tiny.pull-left[

**Atomic vector**

```{r error=TRUE}
x[-1]
x[-c(1, 3)]
x[c(-1, 3)]
x[-c(2.2, 3.9)] #<<
```
]

.tiny.pull-right[

**List**

```{r error=TRUE}
str(y[-1])
str(y[-c(1, 3)])
str(y[c(-1, 3)])
str(y[-c(2.2, 3.9)]) #<<
```
]

---

## Logical subsetting

Logical subsetting returns the elements that correspond to `TRUE` in the logical vector. The logical vector is expected to have the same length as the vector being subset.

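
Two details are worth keeping in mind, sketched here with made-up vectors `w` and `v`: a logical index that is shorter than the vector is recycled, and an `NA` in the index yields an `NA` in the result. `which()` is a base R function that converts a logical vector into the positions of its `TRUE` values.

```r
w <- c(10, 20, 30, 40)

# A shorter logical index is recycled to TRUE, FALSE, TRUE, FALSE
w[c(TRUE, FALSE)]
#> [1] 10 30

v <- c(3, 8, NA, 12)

# A comparison with NA gives NA
v > 5
#> [1] FALSE  TRUE    NA  TRUE

# An NA in the logical index returns NA in that position
v[v > 5]
#> [1]  8 NA 12

# which() keeps only the positions that are TRUE, so the NA is dropped
v[which(v > 5)]
#> [1]  8 12
```
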
.tiny.pull-left[

**Atomic vector**

```{r}
x <- c(1, 4, 7, 12)
x[c(TRUE, TRUE, FALSE, TRUE)]
x[c(TRUE, FALSE)]
x[x %% 2 == 0]
```
]

.tiny.pull-right[

**List**

```{r error=TRUE}
y <- list(1, 4, 7, 12)
str(y[c(TRUE, TRUE, FALSE, TRUE)])
str(y[c(TRUE, FALSE)])
```

```{r eval=FALSE}
str(y[y %% 2 == 0])
```

```
#> Error in y%%2: non-numeric
#> argument to binary operator
```
]

---

## Empty subsetting

Empty subsetting returns the original vector.

```{r}
x <- c(1,4,7)
x[]

y <- list(1,4,7)
str(y[])
```

---

## Zero subsetting

Zero subsetting returns an empty vector of the same type as the vector being subset.

```{r}
x <- c(1,4,7)
y <- list(1,4,7)
```

.pull-left[
```{r}
x[0]
str(y[0])
```
]

.pull-right[
```{r}
x[c(0,1)]
y[c(0,1)]
```
]

---

## Character subsetting

If a vector has names, you can select elements whose names correspond to the character vector.

.pull-left[

**Atomic vector**

```{r}
x <- c(a = 1, b = 4, c = 7)
x["a"]
x[c("a", "a")]
x[c("c", "b")]
```
]

.pull-right[

**List**

```{r}
y <- list(a = 1, b = 4, c = 7)
str(y["a"])
str(y[c("a", "a")])
str(y[c("c", "b")])
```
]

---

## Missing and NULL subsetting

.pull-left[

**Atomic vector**

```{r}
x <- c(1, 4, 7)
x[NA]
x[NULL]
x[c(1, NA)]
```
]

.pull-right[

**List**

```{r}
y <- list(1, 4, 7)
str(y[NA])
str(y[NULL])
str(y[c(1, NA)])
```
]

---

## Exercise

Consider the vectors `x` and `y` below.

```{r}
x <- letters[1:5]
y <- list(i = 1:5, j = -3:3, k = rep(0, 4))
```

What is the difference between subsetting with `[` and `[[` using integers? Try various indices.

---

## Understanding `[` vs. `[[` with lists

.center[ ]

--

How do you get a shopping cart with only the cheese and bananas?

--

How do you get the bananas out of the cart?

---

## Using `$` for subsetting lists

The `$` operator works only with named lists and behaves like `[[`, except that it allows partial matching of names.

.tiny.pull-left[
```{r}
x <- list(a = 1:3, ab = 4:6, abc = 7:9)
x
x$a
x$ab
```
]

.tiny.pull-right[
```{r}
y <- list(a = 1:3, abc = 4:6, abde = 7:9)
y
y$a
y$abd #<<
```
]

---

## References

- Wickham, H. (2019). Advanced R.
https://adv-r.hadley.nz/