class: center, middle, inverse, title-slide

# Data structures and subsetting
## Programming for Statistical Science
### Shawn Santo

---

## Supplementary materials

Full video lecture available in Zoom Cloud Recordings

Companion videos

- [Git from the command line](https://warpwire.duke.edu/w/V04EAA/)

Additional resources

- [Sections 3.3 - 3.4](https://adv-r.hadley.nz/vectors-chap.html#attributes) Advanced R
- [Chapter 4](https://adv-r.hadley.nz/subsetting.html) Advanced R

---

class: inverse, center, middle

# Recall

---

## Atomic vector creation

We can use functions such as `c()`, `vector()`, and `:` to create atomic vectors.

```r
c(5, 10, pi, 0, -sqrt(3))
```

```
#> [1]  5.000000 10.000000  3.141593  0.000000 -1.732051
```

```r
vector(mode = "character", length = 4)
```

```
#> [1] "" "" "" ""
```

```r
vector(mode = "integer", length = 3)
```

```
#> [1] 0 0 0
```

```r
-10:-3
```

```
#> [1] -10  -9  -8  -7  -6  -5  -4  -3
```

---

## Generic vector creation

Function `list()` allows us to create a generic vector.

```r
x <- list(
  a = -100:100,
  b = list(lower = letters, upper = LETTERS),
  cars_data = cars
)
str(x)
```

```
#> List of 3
#>  $ a        : int [1:201] -100 -99 -98 -97 -96 -95 -94 -93 -92 -91 ...
#>  $ b        :List of 2
#>   ..$ lower: chr [1:26] "a" "b" "c" "d" ...
#>   ..$ upper: chr [1:26] "A" "B" "C" "D" ...
#>  $ cars_data:'data.frame': 50 obs. of 2 variables:
#>   ..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
#>   ..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
```

---

class: inverse, center, middle

# Attributes

---

## Data structures

You may have heard of factors, matrices, arrays, and date-times. These are just atomic vectors with special attributes.

- Attributes attach metadata to an object.

--

- Function `attr()` can retrieve and modify a single attribute.

```r
attr(x, which)          # get attribute
attr(x, which) <- value # set / modify attribute
```

--

- Function `attributes()` can retrieve and set attributes en masse.

```r
attributes(x)          # get attributes
attributes(x) <- value # set / modify attributes
```

---

## Attribute: `names`

Get or set the names of an object.

**One option:**

```r
x <- 1:4
attributes(x)
```

```
#> NULL
```

```r
attr(x = x, which = "names") <- c("a", "b", "c", "d")
attributes(x)
```

```
#> $names
#> [1] "a" "b" "c" "d"
```

```r
x
```

```
#> a b c d
#> 1 2 3 4
```

---

**Another option:**

```r
a <- 1:4
names(a) <- c("a", "b", "c", "d")
attributes(a)
```

```
#> $names
#> [1] "a" "b" "c" "d"
```

```r
a
```

```
#> a b c d
#> 1 2 3 4
```

<br/>

Either method works, but prefer the replacement function form, `names(a) <- value`.

---

## Attribute: `dim`

Get or set the dimension of an object.

```r
z <- 1:9
z
```

```
#> [1] 1 2 3 4 5 6 7 8 9
```

```r
attr(x = z, which = "dim") <- c(3, 3)
attributes(z)
```

```
#> $dim
#> [1] 3 3
```

```r
z
```

```
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
```

--

We have a 3 x 3 matrix.

---

```r
y <- matrix(z, nrow = 3, ncol = 3)
attributes(y)
```

```
#> $dim
#> [1] 3 3
```

```r
y
```

```
#>      [,1] [,2] [,3]
#> [1,]    1    4    7
#> [2,]    2    5    8
#> [3,]    3    6    9
```

---

## Exercise

Create a 3 x 3 x 2 array using the `dim` attribute with the vector below.

```r
x <- c(5, 1, 5, 5, 1, 1, 5, 3, 2, 3, 2, 6, 4, 4, 1, 2, 1, 3)
```

<br/>

Try to create the same array using function `array()`. What do you notice about how the array object is populated?

???

## Solution

.tiny[

```r
x <- c(5, 1, 5, 5, 1, 1, 5, 3, 2, 3, 2, 6, 4, 4, 1, 2, 1, 3)
attr(x = x, which = "dim") <- c(3, 3, 2)
x
```

```
#> , , 1
#>
#>      [,1] [,2] [,3]
#> [1,]    5    5    5
#> [2,]    1    1    3
#> [3,]    5    1    2
#>
#> , , 2
#>
#>      [,1] [,2] [,3]
#> [1,]    3    4    2
#> [2,]    2    4    1
#> [3,]    6    1    3
```

```r
attributes(x)
```

```
#> $dim
#> [1] 3 3 2
```

```r
array(x, dim = c(3, 3, 2))
```

```
#> , , 1
#>
#>      [,1] [,2] [,3]
#> [1,]    5    5    5
#> [2,]    1    1    3
#> [3,]    5    1    2
#>
#> , , 2
#>
#>      [,1] [,2] [,3]
#> [1,]    3    4    2
#> [2,]    2    4    1
#> [3,]    6    1    3
```

]

---

## Factors

Factors are built on top of integer vectors with two attributes: `class` and `levels`. Factors are how R stores and represents categorical data.

A quick way to create a categorical variable as a factor is with function `factor()`.

```r
x <- factor(c("walk", "single", "double", "triple", "home run"))
x
```

```
#> [1] walk     single   double   triple   home run
#> Levels: double home run single triple walk
```

--

```r
typeof(x)
```

```
#> [1] "integer"
```

```r
attributes(x)
```

```
#> $levels
#> [1] "double"   "home run" "single"   "triple"   "walk"
#>
#> $class
#> [1] "factor"
```

---

## Ordered factors

To induce an ordering we can use function `ordered()` as opposed to `factor()`.

```r
y <- ordered(c("walk", "single", "double", "triple", "home run"),
             levels = c("walk", "single", "double", "triple", "home run"))
y
```

```
#> [1] walk     single   double   triple   home run
#> Levels: walk < single < double < triple < home run
```

--

```r
attributes(y)
```

```
#> $levels
#> [1] "walk"     "single"   "double"   "triple"   "home run"
#>
#> $class
#> [1] "ordered" "factor"
```

```r
str(y)
```

```
#>  Ord.factor w/ 5 levels "walk"<"single"<..: 1 2 3 4 5
```

---

## Exercise

Create a factor vector based on the vector of airport codes below. Try to do it without using function `factor()`.

```r
airports <- c("RDU", "ABE", "DTW", "GRR", "RDU", "GRR",
              "GNV", "JFK", "JFK", "SFO", "DTW")
```

Assume all the possible levels are

```r
c("RDU", "ABE", "DTW", "GRR", "GNV", "JFK", "SFO")
```

*Hint*: Think about what type of object factors are built on.

<br/>

What if the possible levels are

```r
c("RDU", "ABE", "DTW", "GRR", "GNV", "JFK", "SFO", "GSO", "ORD", "PHL")
```

???

## Solution

.tiny[

```r
z <- as.integer(c(1,2,3,4,1,4,5,6,6,7,3))
attr(x = z, which = "levels") <- c("RDU", "ABE", "DTW", "GRR", "GNV", "JFK", "SFO")
attr(x = z, which = "class") <- "factor"
z
```

```
#> [1] RDU ABE DTW GRR RDU GRR GNV JFK JFK SFO DTW
#> Levels: RDU ABE DTW GRR GNV JFK SFO
```

```r
attributes(z)
```

```
#> $levels
#> [1] "RDU" "ABE" "DTW" "GRR" "GNV" "JFK" "SFO"
#>
#> $class
#> [1] "factor"
```

]

---

## Matrices and arrays

- Homogeneous in their type.

- Matrices are populated based on column major ordering (use `byrow` argument to change this).

- Arrays can have one, two or more dimensions.

---

## Data frames

Data frames are built on top of lists with attributes: `names`, `row.names`, and `class`. Here the class is `data.frame`.

```r
typeof(longley)
```

```
#> [1] "list"
```

```r
attributes(longley)
```

```
#> $names
#> [1] "GNP.deflator" "GNP"          "Unemployed"   "Armed.Forces" "Population"
#> [6] "Year"         "Employed"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#>  [1] 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961
#> [16] 1962
```

--

Here `names` refers to variable names.

---

## Data frame characteristics

- Data frames can be heterogeneous across columns.

- Data frames are rectangular in structure (not always tidy).

- They have column names and row names.

- Data frames can be subset by name or position.

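
---

As a minimal sketch of the characteristics above, the chunk below builds a small data frame and checks each point in turn. The object `df`, its column names, and its values are hypothetical and chosen only for illustration; `stringsAsFactors = FALSE` is set explicitly so the character column stays a character vector regardless of R version.

```r
# Hypothetical example data; names and values are illustrative only
df <- data.frame(
  zip        = c("48501", "48507", "48505"),
  lead_value = c(3, 4, 21),
  flagged    = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Heterogeneous across columns: each column keeps its own type
col_types <- sapply(df, typeof)

# Rectangular: every column has the same length
col_lengths <- lengths(df)

# Column names and row names are stored as attributes
names(df)
row.names(df)

# Subset a column by name or by position
by_name     <- df[["lead_value"]]
by_position <- df[[2]]
identical(by_name, by_position)
```

Here both `[[` calls extract the same double vector, so the name and the position are interchangeable ways to reach a column.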
---

## Data frame creation by setting attributes

Start with a list

```r
x <- list(c("48501", "48507", "48505"), c(3, 4, 21), c(2, 1, 2))
str(x)
```

```
#> List of 3
#>  $ : chr [1:3] "48501" "48507" "48505"
#>  $ : num [1:3] 3 4 21
#>  $ : num [1:3] 2 1 2
```

--

Add attributes

```r
attributes(x) <- list(class = "data.frame",
                      names = c("zip", "lead_value", "time"),
                      row.names = 1:3)
```

---

Then we have a data frame

```r
x
```

```
#>     zip lead_value time
#> 1 48501          3    2
#> 2 48507          4    1
#> 3 48505         21    2
```

```r
str(x)
```

```
#> 'data.frame': 3 obs. of 3 variables:
#>  $ zip       : chr "48501" "48507" "48505"
#>  $ lead_value: num 3 4 21
#>  $ time      : num 2 1 2
```

Of course, we could have used function `data.frame()` to create our data frame object. There is also function `tibble::tibble()`, which creates a tibble object. A tibble is similar to a data frame but has two additional class components (`tbl_df` and `tbl`).

---

## Length coercion

Coercion is slightly different for data frames.

.pull-left[

```r
data.frame(x = 1:3, y = c("a"))
```

```
#>   x y
#> 1 1 a
#> 2 2 a
#> 3 3 a
```

]

.pull-right[

```r
data.frame(x = 1:3, y = c("a","b"))
```

```
#> Error in
#> data.frame(x = 1:3,
#> y = c("a", "b")) :
#> arguments imply differing number of
#> rows: 3, 2
```

]

If the length of the longest vector is not a multiple of the length of a shorter vector, an error will occur.

--

<br/>

What do you think will happen here?

```r
data.frame(num = 1:6, treatment = c(0, 10, 20), type = c("a", "b"))
```

---

## Summary

.small-text[

| Data Structure | Built On              | Attribute(s)                  | Quick creation                 |
|----------------|-----------------------|-------------------------------|--------------------------------|
| Matrix, Array  | Atomic vector         | `dim`                         | `matrix()`, `array()`          |
| Factor         | Atomic integer vector | `class`, `levels`             | `factor()`, `ordered()`        |
| Date           | Atomic double vector  | `class`                       | `as.Date()`                    |
| Date-times     | Atomic double vector  | `class`                       | `as.POSIXct()`, `as.POSIXlt()` |
| Data frame     | List                  | `class`, `names`, `row.names` | `data.frame()`                 |

]

---

class: inverse, center, middle

# Subsetting

---

## Subsetting techniques

R has three operators (functions) for subsetting:

1. `[`
2. `[[`
3. `$`

Which one you use will depend on the object you are working with, its attributes, and what you want as a result.

We can subset with

- integers
- logicals
- `NULL`, `NA`
- character values

---

## Numeric (positive) subsetting

**Indexing begins at 1, not 0.**

.tiny[

```r
x <- c("NC", "SC", "VA", "TN")
y <- list(states = x, rank = 1:4, message = "")
```

]

--

.tiny.pull-left[

**Atomic vector**

```r
x[1]
```

```
#> [1] "NC"
```

```r
x[c(1, 3)]
```

```
#> [1] "NC" "VA"
```

```r
x[c(1:5)]
```

```
#> [1] "NC" "SC" "VA" "TN" NA
```

```r
x[c(2.2, 3.9)]
```

```
#> [1] "SC" "VA"
```

]

.tiny.pull-right[

**List**

```r
str(y[1])
```

```
#> List of 1
#>  $ states: chr [1:4] "NC" "SC" "VA" "TN"
```

```r
str(y[c(1, 3)])
```

```
#> List of 2
#>  $ states : chr [1:4] "NC" "SC" "VA" "TN"
#>  $ message: chr ""
```

```r
str(y[c(1:4)])
```

```
#> List of 4
#>  $ states : chr [1:4] "NC" "SC" "VA" "TN"
#>  $ rank   : int [1:4] 1 2 3 4
#>  $ message: chr ""
#>  $ NA     : NULL
```

]

---

## Numeric (negative) subsetting

.tiny[

```r
x <- c("NC", "SC", "VA", "TN")
y <- list(states = x, rank = 1:4, message = "")
```

]

.tiny.pull-left[

**Atomic vector**

```r
x[-1]
```

```
#> [1] "SC" "VA" "TN"
```

```r
x[-c(1, 3)]
```

```
#> [1] "SC" "TN"
```

```r
x[c(-1, 3)]
```

```
#> Error in x[c(-1, 3)]: only 0's may be mixed with negative subscripts
```

```r
*x[-c(2.2, 3.9)]
```

```
#> [1] "NC" "TN"
```

]

.tiny.pull-right[

**List**

```r
str(y[-1])
```

```
#> List of 2
#>  $ rank   : int [1:4] 1 2 3 4
#>  $ message: chr ""
```

```r
str(y[-c(1, 3)])
```

```
#> List of 1
#>  $ rank: int [1:4] 1 2 3 4
```

```r
str(y[c(-1, 3)])
```

```
#> Error in y[c(-1, 3)]: only 0's may be mixed with negative subscripts
```

```r
*str(y[-c(2.2, 3.9)])
```

```
#> List of 2
#>  $ states : chr [1:4] "NC" "SC" "VA" "TN"
#>  $ message: chr ""
```

]

---

## Logical subsetting

Logical subsetting returns the elements that correspond to `TRUE` in the logical vector. The logical vector is expected to have the same length as the vector being subset; a shorter logical vector is recycled.

.tiny.pull-left[

**Atomic vector**

```r
x <- c(1, 4, 7, 12)
x[c(TRUE, TRUE, FALSE, TRUE)]
```

```
#> [1] 1 4 12
```

```r
x[c(TRUE, FALSE)]
```

```
#> [1] 1 7
```

```r
x[x %% 2 == 0]
```

```
#> [1] 4 12
```

]

.tiny.pull-right[

**List**

```r
y <- list(1, 4, 7, 12)
str(y[c(TRUE, TRUE, FALSE, TRUE)])
```

```
#> List of 3
#>  $ : num 1
#>  $ : num 4
#>  $ : num 12
```

```r
str(y[c(TRUE, FALSE)])
```

```
#> List of 2
#>  $ : num 1
#>  $ : num 7
```

```r
str(y[y %% 2 == 0])
```

```
#> Error in y%%2: non-numeric
#> argument to binary operator
```

]

---

## Empty subsetting

Empty subsetting returns the original vector.

```r
x <- c(1,4,7)
x[]
```

```
#> [1] 1 4 7
```

```r
y <- list(1,4,7)
str(y[])
```

```
#> List of 3
#>  $ : num 1
#>  $ : num 4
#>  $ : num 7
```

---

## Zero subsetting

Zero subsetting returns an empty vector of the same type as the vector being subset.

```r
x <- c(1,4,7)
y <- list(1,4,7)
```

.pull-left[

```r
x[0]
```

```
#> numeric(0)
```

```r
str(y[0])
```

```
#> list()
```

]

.pull-right[

```r
x[c(0,1)]
```

```
#> [1] 1
```

```r
y[c(0,1)]
```

```
#> [[1]]
#> [1] 1
```

]

---

## Character subsetting

If a vector has names, you can select elements whose names correspond to the character vector.

.pull-left[

**Atomic vector**

```r
x <- c(a = 1, b = 4, c = 7)
x["a"]
```

```
#> a
#> 1
```

```r
x[c("a", "a")]
```

```
#> a a
#> 1 1
```

```r
x[c("c", "b")]
```

```
#> c b
#> 7 4
```

]

.pull-right[

**List**

```r
y <- list(a = 1, b = 4, c = 7)
str(y["a"])
```

```
#> List of 1
#>  $ a: num 1
```

```r
str(y[c("a", "a")])
```

```
#> List of 2
#>  $ a: num 1
#>  $ a: num 1
```

```r
str(y[c("c", "b")])
```

```
#> List of 2
#>  $ c: num 7
#>  $ b: num 4
```

]

---

## Missing and NULL subsetting

.pull-left[

**Atomic vector**

```r
x <- c(1, 4, 7)
x[NA]
```

```
#> [1] NA NA NA
```

```r
x[NULL]
```

```
#> numeric(0)
```

```r
x[c(1, NA)]
```

```
#> [1] 1 NA
```

]

.pull-right[

**List**

```r
y <- list(1, 4, 7)
str(y[NA])
```

```
#> List of 3
#>  $ : NULL
#>  $ : NULL
#>  $ : NULL
```

```r
str(y[NULL])
```

```
#> list()
```

```r
str(y[c(1, NA)])
```

```
#> List of 2
#>  $ : num 1
#>  $ : NULL
```

]

---

## Exercise

Consider the vectors `x` and `y` below.

```r
x <- letters[1:5]
y <- list(i = 1:5, j = -3:3, k = rep(0, 4))
```

What is the difference between subsetting with `[` and `[[` using integers? Try various indices.

---

## Understanding `[` vs. `[[` with lists

.center[
<img src="images/shopping_cart.png" width="400" height="400">
]

--

How do you get a shopping cart with only the cheese and bananas?

--

How do you get the bananas out of the cart?

---

## Using `$` for subsetting lists

The `$` operator only works with named lists and behaves similarly to `[[`. Unlike `[[`, it partially matches names.

.tiny.pull-left[

```r
x <- list(a = 1:3, ab = 4:6, abc = 7:9)
x
```

```
#> $a
#> [1] 1 2 3
#>
#> $ab
#> [1] 4 5 6
#>
#> $abc
#> [1] 7 8 9
```

```r
x$a
```

```
#> [1] 1 2 3
```

```r
x$ab
```

```
#> [1] 4 5 6
```

]

.tiny.pull-right[

```r
y <- list(a = 1:3, abc = 4:6, abde = 7:9)
y
```

```
#> $a
#> [1] 1 2 3
#>
#> $abc
#> [1] 4 5 6
#>
#> $abde
#> [1] 7 8 9
```

```r
y$a
```

```
#> [1] 1 2 3
```

```r
*y$abd
```

```
#> [1] 7 8 9
```

]

---

## References

1. Wickham, H. (2020). Advanced R. https://adv-r.hadley.nz/