class: center, middle, inverse, title-slide # Data structures & Subsetting ### Colin Rundel ### 2018-09-05 --- exclude: true --- class: middle count: false # Attributes --- ## Attributes Attributes are metadata that can be attached to objects in R. Some are special (e.g. `class`, `comment`, `dim`, `dimnames`, `names`, etc.) and change the way in which an object is treated by R. Attributes are a named list that is attached to an R object, they can be accessed (get and set) individually via the `attr` and collectively via `attributes`. .midi[ ```r (x = c(L=1,M=2,N=3)) ``` ``` ## L M N ## 1 2 3 ``` ```r attr(x,"names") = c("A","B","C") x ``` ``` ## A B C ## 1 2 3 ``` ```r names(x) ``` ``` ## [1] "A" "B" "C" ``` ] --- ## ```r str(x) ``` ``` ## Named num [1:3] 1 2 3 ## - attr(*, "names")= chr [1:3] "A" "B" "C" ``` ```r attributes(x) ``` ``` ## $names ## [1] "A" "B" "C" ``` ```r str(attributes(x)) ``` ``` ## List of 1 ## $ names: chr [1:3] "A" "B" "C" ``` --- ## Factors Factor objects are how R represents categorical data (e.g. a variable where there are a fixed #s of possible outcomes). ```r (x = factor(c("BS", "MS", "PhD", "MS"))) ``` ``` ## [1] BS MS PhD MS ## Levels: BS MS PhD ``` ```r str(x) ``` ``` ## Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2 ``` ```r typeof(x) ``` ``` ## [1] "integer" ``` --- ## A factor is just an integer vector with two attributes: `class = "factor"` and `levels = ` a character vector. ```r attributes(x) ``` ``` ## $levels ## [1] "BS" "MS" "PhD" ## ## $class ## [1] "factor" ``` --- ## Exercise 1 Construct a factor variable (without using `factor`, `as.factor`, or related functions) that contains the weather forecast for Los Angeles over the next 7 days. <br/> <img src="imgs/darksky_forecast.png" width="60%" style="display: block; margin: auto;" /> * There should be 5 levels - `sun`, `partial clouds`, `clouds`, `rain`, `snow`. * Start with an *integer* vector and add the appropriate attributes. --- class: middle count: false # Data Frames --- ## Data Frames A data frame is how R handles heterogeneous tabular data (i.e. rows and columns) and is one of the most commonly used data structure in R. At their core R represents data frames as a list of equal length vectors (usually atomic, but you can use lists as well). Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch. ```r df = data.frame(x = 1:3, y = c("a", "b", "c")) str(df) ``` ``` ## 'data.frame': 3 obs. of 2 variables: ## $ x: int 1 2 3 ## $ y: Factor w/ 3 levels "a","b","c": 1 2 3 ``` --- ```r typeof(df) ``` ``` ## [1] "list" ``` ```r attributes(df) ``` ``` ## $names ## [1] "x" "y" ## ## $class ## [1] "data.frame" ## ## $row.names ## [1] 1 2 3 ``` --- ## Roll your own data.frame ```r df2 = list(x = 1:3, y = factor(c("a", "b", "c"))) ``` -- .pull-left[ ```r attr(df2,"class") = "data.frame" df2 ``` ``` ## [1] x y ## <0 rows> (or 0-length row.names) ``` ] -- .pull-right[ ```r attr(df2,"row.names") = 1:3 df2 ``` ``` ## x y ## 1 1 a ## 2 2 b ## 3 3 c ``` ] ```r str(df2) ``` ``` ## 'data.frame': 3 obs. of 2 variables: ## $ x: int 1 2 3 ## $ y: Factor w/ 3 levels "a","b","c": 1 2 3 ``` --- ## Strings (Characters) vs Factors By default R will convert character vectors into factors when they are included in a data frame. Sometimes this is useful, usually it isn't -- either way it is important to know what type/class you are working with. This behavior can be changed using the `stringsAsFactors` argument. ```r df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE) str(df) ``` ``` ## 'data.frame': 3 obs. of 2 variables: ## $ x: int 1 2 3 ## $ y: chr "a" "b" "c" ``` --- ## Some general advice ... <br/> <br/> <img src="imgs/stringsasfactors.jpg" align="center" width="650px"/> --- ## Length Coercion As we have seen before, if a vector is shorter than expected, R will increase the length by repeating elements of the short vector. If the longer length is a multiple of the shorter then this coercion will occur without any warnings / errors. For data frames if the lengths are not evenly divisible then there will be an error (previous examples this only produced a warning). ```r data.frame(x = 1:3, y = c("a")) ``` ``` ## x y ## 1 1 a ## 2 2 a ## 3 3 a ``` ```r data.frame(x = 1:3, y = c("a","b")) ``` ``` ## Error in data.frame(x = 1:3, y = c("a", "b")): arguments imply differing number of rows: 3, 2 ``` --- ## Growing data frames We can add rows or columns to a data frame using `rbind` and `cbind` respectively. ```r df = data.frame(x = 1:3, y = c("a","b","c")) rbind(df, c(TRUE,FALSE)) ``` ``` ## Warning in `[<-.factor`(`*tmp*`, ri, value = FALSE): invalid factor level, NA ## generated ``` ``` ## x y ## 1 1 a ## 2 2 b ## 3 3 c ## 4 1 <NA> ``` ```r cbind(df, z=TRUE) ``` ``` ## x y z ## 1 1 a TRUE ## 2 2 b TRUE ## 3 3 c TRUE ``` --- ```r df1 = data.frame(x = 1:3, y = c("a","b","c")) df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE)) cbind(df1,df2) ``` ``` ## x y m n ## 1 1 a 3 TRUE ## 2 2 b 2 TRUE ## 3 3 c 1 FALSE ``` --- ## Matrices A matrix is a 2 dimensional equivalent of an atomic vector (i.e. all entries must share the same type). ```r (m = matrix(c(1,2,3,4), ncol=2, nrow=2)) ``` ``` ## [,1] [,2] ## [1,] 1 3 ## [2,] 2 4 ``` ```r attributes(m) ``` ``` ## $dim ## [1] 2 2 ``` --- ## Column major ordering A matrix is therefore just an atomic vector with a `dim` attribute where the data is stored in column major order (fill the first column starting at row one, then the next column and so on). Data in a matrix is always stored in this format but we can fill by rows instead by using the `byrow` argument .pull-left[ ```r cm = matrix(c(1,2,3,4), ncol=2, nrow=2) cm ``` ``` ## [,1] [,2] ## [1,] 1 3 ## [2,] 2 4 ``` ```r c(cm) ``` ``` ## [1] 1 2 3 4 ``` ] .pull-right[ ```r rm = matrix(c(1,2,3,4), ncol=2, nrow=2, byrow=TRUE) rm ``` ``` ## [,1] [,2] ## [1,] 1 2 ## [2,] 3 4 ``` ```r c(rm) ``` ``` ## [1] 1 3 2 4 ``` ] --- class: middle count: false # Subsetting --- ## Subsetting in General R has several different subsetting operators (`[`, `[[`, and `$`). The behavior of these operators will depend on the object they are being used with. <br/> -- In general there are 6 different data types that can be used to subset: * Positive integers * Negative integers * Logical values * Empty / NULL * Zero * Character values (names) --- ## Positive Integer subsetting Returns elements at the given location(s) (Note - R uses a 1-based indexing scheme). ```r x = c(1,4,7) y = list(1,4,7) ``` .pull-left[.small[ ```r x[c(1,3)] ``` ``` ## [1] 1 7 ``` ```r x[c(1,1)] ``` ``` ## [1] 1 1 ``` ```r x[c(1.9,2.1)] ``` ``` ## [1] 1 4 ``` ] ] .pull-right[ .small[ ```r str( y[c(1,3)] ) ``` ``` ## List of 2 ## $ : num 1 ## $ : num 7 ``` ```r str( y[c(1,1)] ) ``` ``` ## List of 2 ## $ : num 1 ## $ : num 1 ``` ```r str( y[c(1.9,2.1)] ) ``` ``` ## List of 2 ## $ : num 1 ## $ : num 4 ``` ] ] --- ## Negative Integer subsetting Excludes elements at the given location .pull-left[ ```r x = c(1,4,7) x[-1] ``` ``` ## [1] 4 7 ``` ```r x[-c(1,3)] ``` ``` ## [1] 4 ``` ```r x[c(-1,-1)] ``` ``` ## [1] 4 7 ``` ] .pull-right[ ```r y = list(1,4,7) str( y[-1] ) ``` ``` ## List of 2 ## $ : num 4 ## $ : num 7 ``` ```r str( y[-c(1,3)] ) ``` ``` ## List of 1 ## $ : num 4 ``` ] ```r x[c(-1,2)] ``` ``` ## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts ``` --- ## Logical Value Subsetting Returns elements that correspond to `TRUE` in the logical vector. Length of the logical vector is expected to be the same of the vector being subsetted. .pull-left[ ```r x = c(1,4,7,12) x[c(TRUE,TRUE,FALSE,TRUE)] ``` ``` ## [1] 1 4 12 ``` ```r x[c(TRUE,FALSE)] ``` ``` ## [1] 1 7 ``` ```r x[x %% 2 == 0] ``` ``` ## [1] 4 12 ``` ] .pull-right[ ```r y = list(1,4,7,12) str( y[c(TRUE,TRUE,FALSE,TRUE)] ) ``` ``` ## List of 3 ## $ : num 1 ## $ : num 4 ## $ : num 12 ``` ```r str( y[c(TRUE,FALSE)] ) ``` ``` ## List of 2 ## $ : num 1 ## $ : num 7 ``` ] -- ```r str( y[y %% 2 == 0] ) ``` ``` ## Error in y%%2: non-numeric argument to binary operator ``` --- ## Empty Subsetting Returns the original vector. ```r x = c(1,4,7) x[] ``` ``` ## [1] 1 4 7 ``` ```r y = list(1,4,7) str(y[]) ``` ``` ## List of 3 ## $ : num 1 ## $ : num 4 ## $ : num 7 ``` --- ## Zero subsetting Returns an empty vector of the same type as the vector being subseted. .pull-left[ ```r x = c(1,4,7) x[0] ``` ``` ## numeric(0) ``` ```r y = list(1,4,7) str(y[0]) ``` ``` ## list() ``` ] .pull-right[ ```r x[c(0,1)] ``` ``` ## [1] 1 ``` ```r y[c(0,1)] ``` ``` ## [[1]] ## [1] 1 ``` ] --- ## Character subsetting If the vector has names, select elements whose names correspond to the character vector. .pull-left[ ```r x = c(a=1,b=4,c=7) x["a"] ``` ``` ## a ## 1 ``` ```r x[c("a","a")] ``` ``` ## a a ## 1 1 ``` ```r x[c("b","c")] ``` ``` ## b c ## 4 7 ``` ] .pull-right[ ```r y = list(a=1,b=4,c=7) str(y["a"]) ``` ``` ## List of 1 ## $ a: num 1 ``` ```r str(y[c("a","a")]) ``` ``` ## List of 2 ## $ a: num 1 ## $ a: num 1 ``` ```r str(y[c("b","c")]) ``` ``` ## List of 2 ## $ b: num 4 ## $ c: num 7 ``` ] --- ## Out of bound subsetting .pull-left[ ```r x = c(1,4,7) x[4] ``` ``` ## [1] NA ``` ```r x["a"] ``` ``` ## [1] NA ``` ```r x[c(1,4)] ``` ``` ## [1] 1 NA ``` ] .pull-right[ ```r y = list(1,4,7) str(y[4]) ``` ``` ## List of 1 ## $ : NULL ``` ```r str(y["a"]) ``` ``` ## List of 1 ## $ : NULL ``` ```r str(y[c(1,4)]) ``` ``` ## List of 2 ## $ : num 1 ## $ : NULL ``` ] --- ## Missing and NULL subsetting .pull-left[ ```r x = c(1,4,7) x[NA] ``` ``` ## [1] NA NA NA ``` ```r x[NULL] ``` ``` ## numeric(0) ``` ```r x[c(1,NA)] ``` ``` ## [1] 1 NA ``` ] .pull-right[ ```r y = list(1,4,7) str(y[NA]) ``` ``` ## List of 3 ## $ : NULL ## $ : NULL ## $ : NULL ``` ```r str(y[NULL]) ``` ``` ## list() ``` ```r str(y[c(1,NA)]) ``` ``` ## List of 2 ## $ : num 1 ## $ : NULL ``` ] --- ## Atomic vectors - [ vs. [[ `[[` subsets like `[` except it can only subset a single value. ```r x = c(a=1,b=4,c=7) x[[1]] ``` ``` ## [1] 1 ``` ```r x[["a"]] ``` ``` ## [1] 1 ``` ```r x[[1:2]] ``` ``` ## Error in x[[1:2]]: attempt to select more than one element in vectorIndex ``` --- ## Generic Vectors - [ vs. [[ Subsets a single value, but returns the value - not a list containing that value. ```r y = list(a=1,b=4,c=7) y[2] ``` ``` ## $b ## [1] 4 ``` ```r y[[2]] ``` ``` ## [1] 4 ``` ```r y[["b"]] ``` ``` ## [1] 4 ``` ```r y[[1:2]] ``` ``` ## Error in y[[1:2]]: subscript out of bounds ``` --- ## Hadley's Analogy <img src="imgs/pepper_subset.png" width="2617" style="display: block; margin: auto;" /> --- ## [[ vs. $ `$` is equivalent to `[[` but it only works for named *lists* and it has a terrible default where it uses partial matching (`exact=FALSE`) to access the underlying value. ```r x = c("abc"=1, "def"=5) x$abc ``` ``` ## Error in x$abc: $ operator is invalid for atomic vectors ``` ```r y = list("abc"=1, "def"=5) y[["abc"]] ``` ``` ## [1] 1 ``` ```r y$abc ``` ``` ## [1] 1 ``` ```r y$d ``` ``` ## [1] 5 ``` --- ## A common gotcha Why does the following code not work? ```r x = list(abc = 1:10, def = 10:1) y = "abc" x$y ``` ``` ## NULL ``` -- $$ x$y \Leftrightarrow x[["y"]] \ne x[[y]] $$ ```r x[[y]] ``` ``` ## [1] 1 2 3 4 5 6 7 8 9 10 ``` --- ## Exercise 2 Below are 100 values, ```r x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1, 3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82, 21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10, 5, 2, 4, 4, 14, 15, 4, 17, 1, 9) ``` write down how you would create a subset to accomplish each of the following: * Select every third value starting at position 2 in `x`. * Remove all values with an odd index (e.g. 1, 3, etc.) * Remove every 4th value, but only if it is odd. --- ## Acknowledgments Above materials are derived in part from the following sources: * Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/) * [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)