Data structures & Subsetting

class: center, middle, inverse, title-slide

# Data structures & Subsetting
### Colin Rundel
### 2018-09-05

---

exclude: true

---
class: middle
count: false

# Attributes

---

## Attributes

Attributes are metadata that can be attached to objects in R. Some are special (e.g. `class`, `comment`, `dim`, `dimnames`, `names`, etc.) and change the way in which an object is treated by R.

Attributes are stored as a named list attached to an R object. They can be accessed (get and set) individually via `attr` and collectively via `attributes`.

.midi[

```r
(x = c(L=1,M=2,N=3))
```

```
## L M N 
## 1 2 3
```

```r
attr(x,"names") = c("A","B","C")
x
```

```
## A B C 
## 1 2 3
```

```r
names(x)
```

```
## [1] "A" "B" "C"
```
]

---

```r
str(x)
```

```
##  Named num [1:3] 1 2 3
##  - attr(*, "names")= chr [1:3] "A" "B" "C"
```

```r
attributes(x)
```

```
## $names
## [1] "A" "B" "C"
```

```r
str(attributes(x))
```

```
## List of 1
##  $ names: chr [1:3] "A" "B" "C"
```
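
Attributes are not limited to the special ones. As a minimal sketch (the attribute name `units` is an arbitrary choice for illustration), any name can be attached with `attr`, and assigning `NULL` removes it again:

```r
x = c(A = 1, B = 2, C = 3)

# Attach an arbitrary (non-special) attribute; "units" is just an example name
attr(x, "units") = "cm"
attr_names = names(attributes(x))   # "names" "units"

# Assigning NULL removes the attribute again
attr(x, "units") = NULL
remaining = names(attributes(x))    # "names"
```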

---

## Factors

Factor objects are how R represents categorical data (e.g. a variable with a fixed number of possible outcomes).

```r
(x = factor(c("BS", "MS", "PhD", "MS")))
```

```
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
```

```r
str(x)
```

```
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
```

```r
typeof(x)
```

```
## [1] "integer"
```

---

A factor is just an integer vector with two attributes: `class = "factor"` and `levels = ` a character vector.

```r
attributes(x)
```

```
## $levels
## [1] "BS"  "MS"  "PhD"
## 
## $class
## [1] "factor"
```
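
One way to see this structure directly is to look at the integer codes and the levels separately: indexing the levels by the codes recovers the original labels. A small sketch:

```r
x = factor(c("BS", "MS", "PhD", "MS"))

codes  = as.integer(x)      # underlying integer codes: 1 2 3 2
labels = levels(x)[codes]   # map the codes back to labels: "BS" "MS" "PhD" "MS"

# unclass() drops the class attribute, leaving an integer vector with levels
str(unclass(x))
```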

---

## Exercise 1

Construct a factor variable (without using `factor`, `as.factor`, or related functions) that contains the weather forecast for Los Angeles over the next 7 days.

<br/>

* There should be 5 levels - `sun`, `partial clouds`, `clouds`, `rain`, `snow`.

* Start with an *integer* vector and add the appropriate attributes.

---
class: middle
count: false

# Data Frames

---

## Data Frames

A data frame is how R handles heterogeneous tabular data (i.e. rows and columns) and is one of the most commonly used data structures in R.

At its core, R represents a data frame as a list of equal-length vectors (usually atomic, but you can use lists as well).

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
```

```
## 'data.frame':	3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
```
---

```r
typeof(df)
```

```
## [1] "list"
```

```r
attributes(df)
```

```
## $names
## [1] "x" "y"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3
```
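
Because a data frame is a list underneath, list tools apply to it column by column. A brief sketch (using `stringsAsFactors = FALSE` so the result does not depend on the R version):

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

n_cols    = length(df)           # length of the list = number of columns: 2
col_types = sapply(df, typeof)   # one entry per column: x "integer", y "character"
is_list   = is.list(df)          # TRUE
```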

---

## Roll your own data.frame

```r
df2 = list(x = 1:3, y = factor(c("a", "b", "c")))
```

.pull-left[

```r
attr(df2,"class") = "data.frame"
df2
```

```
## [1] x y
## <0 rows> (or 0-length row.names)
```
]

.pull-right[

```r
attr(df2,"row.names") = 1:3
df2
```

```
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
```
]

```r
str(df2)
```

```
## 'data.frame':	3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
```

---

## Strings (Characters) vs Factors

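In versions of R before 4.0.0 (including the one used for these slides), character vectors are converted into factors by default when they are included in a data frame. Since R 4.0.0 the default is to leave them as character vectors.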

Sometimes this is useful, usually it isn't -- either way it is important to know what type/class you are working with. This behavior can be changed using the `stringsAsFactors` argument.

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
```

```
## 'data.frame':	3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
```
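
One reason the distinction matters: converting a factor straight to a number returns the internal integer codes, not the labels. A small sketch with made-up values:

```r
f = factor(c("10", "20", "10"))

codes  = as.integer(f)                 # integer codes: 1 2 1
values = as.integer(as.character(f))   # the labels as numbers: 10 20 10
```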
---

## Some general advice ...

---

## Length Coercion

As we have seen before, if a vector is shorter than expected, R will increase the length by repeating elements of the short vector. If the longer length is a multiple of the shorter then this coercion will occur without any warnings / errors.

For data frames, if the lengths are not evenly divisible then there will be an error (in previous examples with plain vectors this only produced a warning).

```r
data.frame(x = 1:3, y = c("a"))
```

```
##   x y
## 1 1 a
## 2 2 a
## 3 3 a
```

```r
data.frame(x = 1:3, y = c("a","b"))
```

```
## Error in data.frame(x = 1:3, y = c("a", "b")): arguments imply differing number of rows: 3, 2
```
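
The difference between the two cases can be checked side by side. The sketch below uses `tryCatch` only to record which condition is raised; the strings `"warning"` and `"error"` are labels chosen for this example:

```r
# Plain vectors: mismatched lengths are recycled, with a warning
vec_result = tryCatch(
  1:3 + 1:2,
  warning = function(w) "warning"
)

# data.frame(): mismatched lengths are an error
df_result = tryCatch(
  data.frame(x = 1:3, y = c("a", "b")),
  error = function(e) "error"
)
```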
---

## Growing data frames

We can add rows or columns to a data frame using `rbind` and `cbind` respectively.

```r
df = data.frame(x = 1:3, y = c("a","b","c"))
rbind(df, c(TRUE,FALSE))
```

```
## Warning in `[<-.factor`(`*tmp*`, ri, value = FALSE): invalid factor level, NA
## generated
```

```
##   x    y
## 1 1    a
## 2 2    b
## 3 3    c
## 4 1 <NA>
```

```r
cbind(df, z=TRUE)
```

```
##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE
```

---

```r
df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
cbind(df1,df2)
```

```
##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE
```
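
A new row is less likely to be mangled if it is supplied as a one-row data frame with matching column names and types, rather than as a bare vector. A sketch (with `stringsAsFactors = FALSE` so that `y` stays a character column regardless of the R version):

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Build the new row as a data frame so each column keeps its own type
new_row = data.frame(x = 4L, y = "d", stringsAsFactors = FALSE)
df_more = rbind(df, new_row)

str(df_more)
```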

---

## Matrices

A matrix is a two-dimensional equivalent of an atomic vector (i.e. all entries must share the same type).

```r
(m = matrix(c(1,2,3,4), ncol=2, nrow=2))
```

```
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
```

```r
attributes(m)
```

```
## $dim
## [1] 2 2
```

---

## Column major ordering

A matrix is therefore just an atomic vector with a `dim` attribute where the data is stored in column major order (fill the first column starting at row one, then the next column and so on).

Data in a matrix is always stored in this format, but we can fill by rows instead by using the `byrow` argument.

.pull-left[

```r
cm = matrix(c(1,2,3,4), 
            ncol=2, nrow=2)

cm
```

```
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
```

```r
c(cm)
```

```
## [1] 1 2 3 4
```
]

.pull-right[

```r
rm = matrix(c(1,2,3,4), 
            ncol=2, nrow=2, 
            byrow=TRUE)
rm
```

```
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
```

```r
c(rm)
```

```
## [1] 1 3 2 4
```
]
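
Since the `dim` attribute is all that separates a matrix from a vector, it can be added and removed by hand. A minimal sketch:

```r
v = 1:6

# Adding a dim attribute turns the vector into a 2 x 3 matrix (filled by column)
m = v
dim(m) = c(2, 3)

is_mat    = is.matrix(m)   # TRUE
first_row = m[1, ]         # 1 3 5

# Removing the attribute gives back the plain vector
dim(m) = NULL
```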

---
class: middle
count: false

# Subsetting

---

## Subsetting in General

R has several different subsetting operators (`[`, `[[`, and `$`).

The behavior of these operators will depend on the object they are being used with.

<br/>

In general there are 6 different data types that can be used to subset:

* Positive integers

* Negative integers

* Logical values

* Empty / NULL

* Zero

* Character values (names)

---

## Positive Integer subsetting

Returns elements at the given location(s) (Note - R uses a 1-based indexing scheme).

```r
x = c(1,4,7)
y = list(1,4,7)
```

.pull-left[.small[

```r
x[c(1,3)]
```

```
## [1] 1 7
```

```r
x[c(1,1)]
```

```
## [1] 1 1
```

```r
x[c(1.9,2.1)]
```

```
## [1] 1 4
```
] ]

.pull-right[ .small[

```r
str( y[c(1,3)] )
```

```
## List of 2
##  $ : num 1
##  $ : num 7
```

```r
str( y[c(1,1)] )
```

```
## List of 2
##  $ : num 1
##  $ : num 1
```

```r
str( y[c(1.9,2.1)] )
```

```
## List of 2
##  $ : num 1
##  $ : num 4
```
] ]

---

## Negative Integer subsetting

Excludes elements at the given location(s).

.pull-left[

```r
x = c(1,4,7)
x[-1]
```

```
## [1] 4 7
```

```r
x[-c(1,3)]
```

```
## [1] 4
```

```r
x[c(-1,-1)]
```

```
## [1] 4 7
```
]

.pull-right[

```r
y = list(1,4,7)
str( y[-1] )
```

```
## List of 2
##  $ : num 4
##  $ : num 7
```

```r
str( y[-c(1,3)] )
```

```
## List of 1
##  $ : num 4
```
]

```r
x[c(-1,2)]
```

```
## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts
```

---

## Logical Value Subsetting

Returns the elements that correspond to `TRUE` in the logical vector. The logical vector is expected to be the same length as the vector being subset; a shorter one is recycled.

.pull-left[

```r
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
```

```
## [1]  1  4 12
```

```r
x[c(TRUE,FALSE)]
```

```
## [1] 1 7
```

```r
x[x %% 2 == 0]
```

```
## [1]  4 12
```
]

.pull-right[

```r
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
```

```
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 12
```

```r
str( y[c(TRUE,FALSE)] )
```

```
## List of 2
##  $ : num 1
##  $ : num 7
```
]

```r
str( y[y %% 2 == 0] )
```

```
## Error in y%%2: non-numeric argument to binary operator
```
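
The arithmetic fails because `%%` is not defined for lists. One possible workaround, sketched here, is to build the logical index one element at a time; `vapply` is used so that each result is checked to be a single logical value:

```r
y = list(1, 4, 7, 12)

# Test each element separately and collect the results in a logical vector
is_even = vapply(y, function(v) v %% 2 == 0, logical(1))

str( y[is_even] )   # a list containing 4 and 12
```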

---

## Empty Subsetting

Returns the original vector.

```r
x = c(1,4,7)
x[]
```

```
## [1] 1 4 7
```

```r
y = list(1,4,7)
str(y[])
```

```
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 7
```

---

## Zero subsetting

Returns an empty vector of the same type as the vector being subset.

.pull-left[

```r
x = c(1,4,7)
x[0]
```

```
## numeric(0)
```

```r
y = list(1,4,7)
str(y[0])
```

```
##  list()
```
]

.pull-right[

```r
x[c(0,1)]
```

```
## [1] 1
```

```r
y[c(0,1)]
```

```
## [[1]]
## [1] 1
```
]

---

## Character subsetting

If the vector has names, select elements whose names correspond to the character vector.

.pull-left[

```r
x = c(a=1,b=4,c=7)
x["a"]
```

```
## a 
## 1
```

```r
x[c("a","a")]
```

```
## a a 
## 1 1
```

```r
x[c("b","c")]
```

```
## b c 
## 4 7
```
]

.pull-right[

```r
y = list(a=1,b=4,c=7)
str(y["a"])
```

```
## List of 1
##  $ a: num 1
```

```r
str(y[c("a","a")])
```

```
## List of 2
##  $ a: num 1
##  $ a: num 1
```

```r
str(y[c("b","c")])
```

```
## List of 2
##  $ b: num 4
##  $ c: num 7
```
]

---

## Out of bound subsetting

.pull-left[

```r
x = c(1,4,7)
x[4]
```

```
## [1] NA
```

```r
x["a"]
```

```
## [1] NA
```

```r
x[c(1,4)]
```

```
## [1]  1 NA
```
]

.pull-right[

```r
y = list(1,4,7)
str(y[4])
```

```
## List of 1
##  $ : NULL
```

```r
str(y["a"])
```

```
## List of 1
##  $ : NULL
```

```r
str(y[c(1,4)])
```

```
## List of 2
##  $ : num 1
##  $ : NULL
```
]

---

## Missing and NULL subsetting

.pull-left[

```r
x = c(1,4,7)
x[NA]
```

```
## [1] NA NA NA
```

```r
x[NULL]
```

```
## numeric(0)
```

```r
x[c(1,NA)]
```

```
## [1]  1 NA
```
]

.pull-right[

```r
y = list(1,4,7)
str(y[NA])
```

```
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : NULL
```

```r
str(y[NULL])
```

```
##  list()
```

```r
str(y[c(1,NA)])
```

```
## List of 2
##  $ : num 1
##  $ : NULL
```
]
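
The three `NA`s from `x[NA]` arise because a bare `NA` is a logical value, so it is recycled to the length of `x` like any other logical index. An integer `NA` is instead treated as a single unknown position. A short sketch:

```r
x = c(1, 4, 7)

na_type   = typeof(NA)       # "logical"
by_lgl_na = x[NA]            # logical index, recycled: NA NA NA
by_int_na = x[NA_integer_]   # integer index, one position: NA
```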

---

## Atomic vectors - [ vs. [[

`[[` subsets like `[` except it can only subset a single value.

```r
x = c(a=1,b=4,c=7)
x[[1]]
```

```
## [1] 1
```

```r
x[["a"]]
```

```
## [1] 1
```

```r
x[[1:2]]
```

```
## Error in x[[1:2]]: attempt to select more than one element in vectorIndex
```

---

## Generic Vectors - [ vs. [[

`[[` subsets a single value, but returns the value itself rather than a list containing that value.

```r
y = list(a=1,b=4,c=7)
y[2]
```

```
## $b
## [1] 4
```

```r
y[[2]]
```

```
## [1] 4
```

```r
y[["b"]]
```

```
## [1] 4
```

```r
y[[1:2]]
```

```
## Error in y[[1:2]]: subscript out of bounds
```
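
The error for `y[[1:2]]` differs from the atomic case because, for lists, a vector index given to `[[` is applied recursively: `y[[1:2]]` means `y[[1]][[2]]`, and here `y[[1]]` has only one element. A sketch with a nested list (the names `z`, `a`, `b`, `c`, and `d` are made up for illustration):

```r
z = list(a = list(b = 5, c = 6), d = 7)

# A length-2 index to [[ walks down the nesting, like z[[1]][[2]]
by_pos  = z[[c(1, 2)]]       # 6
by_name = z[[c("a", "b")]]   # 5
```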

---

## Hadley's Analogy

---

## [[ vs. $

`$` is equivalent to `[[` but it only works for named *lists* and it has a terrible default where it uses partial matching (`exact=FALSE`) to access the underlying value.

```r
x = c("abc"=1, "def"=5)
x$abc
```

```
## Error in x$abc: $ operator is invalid for atomic vectors
```

```r
y = list("abc"=1, "def"=5)
y[["abc"]]
```

```
## [1] 1
```

```r
y$abc
```

```
## [1] 1
```

```r
y$d
```

```
## [1] 5
```
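
For comparison, `[[` matches names exactly unless partial matching is requested through its `exact` argument. A small sketch:

```r
y = list("abc" = 1, "def" = 5)

partial  = y$d                       # partial match on "def": 5
exact    = y[["d"]]                  # exact matching by default: NULL
opted_in = y[["d", exact = FALSE]]   # partial matching on request: 5
```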
---

## A common gotcha

Why does the following code not work?

```r
x = list(abc = 1:10, def = 10:1)
y = "abc"

x$y
```

```
## NULL
```

$$ x\$y \Leftrightarrow x[["y"]] \ne x[[y]] $$

```r
x[[y]]
```

```
##  [1]  1  2  3  4  5  6  7  8  9 10
```

---

## Exercise 2

Below are 100 values,

```r
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```

Write down how you would create a subset to accomplish each of the following:

* Select every third value starting at position 2 in `x`.

* Remove all values with an odd index (e.g. 1, 3, etc.)

* Remove every 4th value, but only if it is odd.

---

## Acknowledgments

Above materials are derived in part from the following sources:

* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)
* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)