---
title: "Data structures & Subsetting"
author: "Colin Rundel"
date: "2018-09-05"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true
```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
htmltools.dir.version = FALSE, # for blogdown
width=80
)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```
---
class: middle
count: false
# Attributes
---
## Attributes
Attributes are metadata that can be attached to objects in R. Some are special (e.g. `class`, `comment`, `dim`, `dimnames`, `names`, etc.) and change the way in which an object is treated by R.
Attributes are stored as a named list attached to an R object. They can be accessed (get and set) individually via `attr` and collectively via `attributes`.
.midi[
```{r}
(x = c(L=1,M=2,N=3))
attr(x,"names") = c("A","B","C")
x
names(x)
```
]
---
##
```{r}
str(x)
attributes(x)
str(attributes(x))
```
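
Custom attributes work the same way. A minimal sketch, where the `units` attribute name and the variables `ht` and `ht_attrs` are made up for illustration:

```{r}
ht = c(150, 162, 171)
attr(ht, "units") = "cm"             # set a single (hypothetical) attribute
attr(ht, "units")                    # get it back
ht_attrs = names(attributes(ht))     # names of all attached attributes
ht_attrs
attributes(ht) = NULL                # remove every attribute at once
ht
```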
---
## Factors
Factor objects are how R represents categorical data (e.g. a variable with a fixed number of possible outcomes).
```{r}
(x = factor(c("BS", "MS", "PhD", "MS")))
str(x)
```
```{r}
typeof(x)
```
---
##
A factor is just an integer vector with two attributes: `class = "factor"` and `levels = ` a character vector.
```{r}
attributes(x)
```
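
To see the two pieces separately, the integer codes and the levels can be pulled apart. A small sketch (the variable `sizes` is an illustrative name); note that levels are sorted by default:

```{r}
sizes = factor(c("lo", "hi", "lo"))
levels(sizes)        # "hi" "lo" (sorted by default)
as.integer(sizes)    # 2 1 2 (positions in the levels vector)
unclass(sizes)       # the integer codes with only the levels attribute left
```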
---
## Exercise 1
Construct a factor variable (without using `factor`, `as.factor`, or related functions) that contains the weather forecast for Los Angeles over the next 7 days.
```{r out.width="60%", fig.align="center", echo=FALSE}
knitr::include_graphics("imgs/darksky_forecast.png")
```
* There should be 5 levels - `sun`, `partial clouds`, `clouds`, `rain`, `snow`.
* Start with an *integer* vector and add the appropriate attributes.
---
class: middle
count: false
# Data Frames
---
## Data Frames
A data frame is how R handles heterogeneous tabular data (i.e. rows and columns) and is one of the most commonly used data structures in R.
At their core, data frames are represented by R as a list of equal-length vectors (usually atomic, but you can use lists as well).
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
```
---
```{r}
typeof(df)
attributes(df)
```
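
Because a data frame is a list underneath, list tools apply column by column. A brief sketch using a made-up data frame `d_list`:

```{r}
d_list = data.frame(a = 1:3, b = c(TRUE, FALSE, TRUE))
is.list(d_list)          # TRUE
length(d_list)           # 2, the number of columns (list elements)
sapply(d_list, typeof)   # type of each column
d_list[["a"]]            # extract one column as a plain vector
```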
---
## Roll your own data.frame
```{r}
df2 = list(x = 1:3, y = factor(c("a", "b", "c")))
```
--
.pull-left[
```{r}
attr(df2,"class") = "data.frame"
df2
```
]
--
.pull-right[
```{r}
attr(df2,"row.names") = 1:3
df2
```
]
```{r}
str(df2)
```
---
## Strings (Characters) vs Factors
Prior to R 4.0.0, R would by default convert character vectors into factors when they were included in a data frame (since R 4.0.0 the default is `stringsAsFactors = FALSE`).
Sometimes this is useful, usually it isn't -- either way it is important to know what type/class you are working with. This behavior can be changed using the `stringsAsFactors` argument.
```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
```
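
If you are unsure which behavior you are getting, checking the column class is cheap. A sketch with an illustrative data frame `s_df`, where `stringsAsFactors = TRUE` is set explicitly so the result does not depend on the R version:

```{r}
s_df = data.frame(id = 1:2, grp = c("u", "v"), stringsAsFactors = TRUE)
class(s_df$grp)                      # "factor"
s_df$grp = as.character(s_df$grp)    # convert the column back to character
class(s_df$grp)                      # "character"
```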
---
## Some general advice ...
---
## Length Coercion
As we have seen before, if a vector is shorter than expected, R will increase the length by repeating elements of the short vector. If the longer length is a multiple of the shorter then this coercion will occur without any warnings / errors.
For data frames, if the lengths are not evenly divisible there will be an error (in the previous vector examples this only produced a warning).
```{r error=TRUE}
data.frame(x = 1:3, y = c("a"))
data.frame(x = 1:3, y = c("a","b"))
```
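
The contrast can be sketched side by side. The names `rec_sum`, `r4`, and `bad` are illustrative, and `tryCatch` is used only so the failing call does not stop the chunk:

```{r}
rec_sum = 1:6 + 1:2      # 6 is a multiple of 2: silent recycling
rec_sum
r4 = data.frame(x = 1:4, y = c("a", "b"))  # 4 is a multiple of 2: y is recycled
nrow(r4)
# 3 is not a multiple of 2: data.frame() signals an error, caught here
bad = tryCatch(data.frame(x = 1:3, y = c("a", "b")),
               error = function(e) "error")
bad
```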
---
## Growing data frames
We can add rows or columns to a data frame using `rbind` and `cbind` respectively.
```{r}
df = data.frame(x = 1:3, y = c("a","b","c"))
rbind(df, c(TRUE,FALSE))
```
```{r}
cbind(df, z=TRUE)
```
---
```{r}
df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
cbind(df1,df2)
```
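
One way to keep each column's type explicit when adding a row is to bind a one-row data frame rather than a plain vector. A sketch with illustrative names (`grow`, `new_row`):

```{r}
grow = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
new_row = data.frame(x = 4L, y = "d", stringsAsFactors = FALSE)
grow = rbind(grow, new_row)   # column names must match
str(grow)
```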
---
## Matrices
A matrix is a 2-dimensional equivalent of an atomic vector (i.e. all entries must share the same type).
```{r}
(m = matrix(c(1,2,3,4), ncol=2, nrow=2))
attributes(m)
```
---
## Column major ordering
A matrix is therefore just an atomic vector with a `dim` attribute where the data is stored in column major order (fill the first column starting at row one, then the next column and so on).
Data in a matrix is always stored in this format, but we can fill by rows instead by using the `byrow` argument.
.pull-left[
```{r}
cm = matrix(c(1,2,3,4),
ncol=2, nrow=2)
cm
c(cm)
```
]
.pull-right[
```{r}
rm = matrix(c(1,2,3,4),
ncol=2, nrow=2,
byrow=TRUE)
rm
c(rm)
```
]
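
The same idea can be shown directly: attaching a `dim` attribute to a plain vector yields a matrix, and element `[i, j]` sits at position `i + (j - 1) * nrow` of the underlying vector. A sketch with an illustrative variable `vm`:

```{r}
vm = 1:6
dim(vm) = c(2, 3)        # now a 2 x 3 matrix, filled column by column
is.matrix(vm)            # TRUE
vm[1, 2]                 # 3
vm[1 + (2 - 1) * 2]      # 3, the same element via its position in the vector
```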
---
class: middle
count: false
# Subsetting
---
## Subsetting in General
R has several different subsetting operators (`[`, `[[`, and `$`).
The behavior of these operators will depend on the object they are being used with.
--
In general there are 6 different data types that can be used to subset:
* Positive integers
* Negative integers
* Logical values
* Empty / NULL
* Zero
* Character values (names)
---
## Positive Integer subsetting
Returns elements at the given location(s). Note that R uses a 1-based indexing scheme and that non-integer indices are truncated toward zero.
```{r}
x = c(1,4,7)
y = list(1,4,7)
```
.pull-left[.small[
```{r}
x[c(1,3)]
x[c(1,1)]
x[c(1.9,2.1)]
```
] ]
.pull-right[ .small[
```{r}
str( y[c(1,3)] )
str( y[c(1,1)] )
str( y[c(1.9,2.1)] )
```
] ]
---
## Negative Integer subsetting
Excludes elements at the given location(s). Positive and negative indices cannot be mixed.
.pull-left[
```{r, error=TRUE}
x = c(1,4,7)
x[-1]
x[-c(1,3)]
x[c(-1,-1)]
```
]
.pull-right[
```{r, error=TRUE}
y = list(1,4,7)
str( y[-1] )
str( y[-c(1,3)] )
```
]
```{r error=TRUE}
x[c(-1,2)]
```
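
A related pitfall: `-1:2` is parsed as `(-1):2`, which mixes negative and positive indices. A short sketch (the variable `nv` is illustrative):

```{r}
nv = c(10, 20, 30, 40)
nv[-length(nv)]   # drop the last element
nv[-(1:2)]        # drop the first two elements; the parentheses matter
-1:2              # -1 0 1 2, so nv[-1:2] would mix signs and fail
```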
---
## Logical Value Subsetting
Returns elements that correspond to `TRUE` in the logical vector. The logical vector is expected to have the same length as the vector being subset; a shorter one is recycled.
.pull-left[
```{r}
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
x[c(TRUE,FALSE)]
x[x %% 2 == 0]
```
]
.pull-right[
```{r, error=TRUE}
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
str( y[c(TRUE,FALSE)] )
```
]
--
```{r error=TRUE}
str( y[y %% 2 == 0] )
```
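
The last call fails because arithmetic is not defined on a list. One possible workaround, sketched with illustrative names (`lst`, `is_even`), is to build the logical vector element by element with `vapply`:

```{r}
lst = list(1, 4, 7, 12)
is_even = vapply(lst, function(e) e %% 2 == 0, logical(1))
is_even              # FALSE TRUE FALSE TRUE
str( lst[is_even] )  # list holding 4 and 12
```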
---
## Empty Subsetting
Returns the original vector.
```{r}
x = c(1,4,7)
x[]
y = list(1,4,7)
str(y[])
```
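
Empty subsetting is most useful on the left-hand side of an assignment, where it replaces every element while keeping attributes such as `dim`. A sketch with illustrative matrices `zm` and `zm2`:

```{r}
zm = matrix(1:4, nrow = 2)
zm[] = 0      # overwrite all elements, keep the 2 x 2 shape
zm
zm2 = matrix(1:4, nrow = 2)
zm2 = 0       # without [], the matrix is replaced by a single number
zm2
```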
---
## Zero subsetting
Returns an empty vector of the same type as the vector being subset.
.pull-left[
```{r}
x = c(1,4,7)
x[0]
y = list(1,4,7)
str(y[0])
```
]
.pull-right[
```{r}
x[c(0,1)]
y[c(0,1)]
```
]
---
## Character subsetting
If the vector has names, select elements whose names correspond to the character vector.
.pull-left[
```{r}
x = c(a=1,b=4,c=7)
x["a"]
x[c("a","a")]
x[c("b","c")]
```
]
.pull-right[
```{r}
y = list(a=1,b=4,c=7)
str(y["a"])
str(y[c("a","a")])
str(y[c("b","c")])
```
]
---
## Out of bound subsetting
.pull-left[
```{r}
x = c(1,4,7)
x[4]
x["a"]
x[c(1,4)]
```
]
.pull-right[
```{r}
y = list(1,4,7)
str(y[4])
str(y["a"])
str(y[c(1,4)])
```
]
---
## Missing and NULL subsetting
.pull-left[
```{r}
x = c(1,4,7)
x[NA]
x[NULL]
x[c(1,NA)]
```
]
.pull-right[
```{r}
y = list(1,4,7)
str(y[NA])
str(y[NULL])
str(y[c(1,NA)])
```
]
---
## Atomic vectors - [ vs. [[
`[[` subsets like `[` except it can only subset a single value.
```{r, error=TRUE}
x = c(a=1,b=4,c=7)
x[[1]]
x[["a"]]
x[[1:2]]
```
---
## Generic Vectors - [ vs. [[
`[[` subsets a single value, but returns the value itself, not a list containing that value.
```{r, error=TRUE}
y = list(a=1,b=4,c=7)
y[2]
y[[2]]
y[["b"]]
y[[1:2]]
```
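
The last call fails because `[[` with a vector of length greater than one indexes a list recursively (here it looks for element 2 inside `y[[1]]`). A sketch of recursive indexing on an illustrative nested list:

```{r}
nested = list(a = list(b = 5, c = "z"))
nested[[c(1, 2)]]        # same as nested[[1]][[2]], i.e. "z"
nested[[c("a", "b")]]    # same as nested[["a"]][["b"]], i.e. 5
```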
---
## Hadley's Analogy
```{r echo=FALSE, fig.align="center", out.width="80%"}
knitr::include_graphics("imgs/pepper_subset.png")
```
---
## [[ vs. $
`$` is equivalent to `[[`, but it only works for named *lists*, and it has a terrible default: it uses partial matching (`exact=FALSE`) to access the underlying value.
```{r, error=TRUE}
x = c("abc"=1, "def"=5)
x$abc
y = list("abc"=1, "def"=5)
y[["abc"]]
y$abc
y$d
```
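
`[[` matches names exactly by default, but accepts `exact = FALSE` to opt in to the same partial matching. A sketch with an illustrative list `py`:

```{r}
py = list(abc = 1, def = 5)
py[["d"]]                  # NULL: no element is named exactly "d"
py[["d", exact = FALSE]]   # 5: partial match on "def"
```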
---
## A common gotcha
Why does the following code not work?
```{r error=TRUE}
x = list(abc = 1:10, def = 10:1)
y = "abc"
x$y
```
--
$$ x$y \Leftrightarrow x[["y"]] \ne x[[y]] $$
```{r}
x[[y]]
```
---
## Exercise 2
Below are 100 values,
```{r}
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```
write down how you would create a subset to accomplish each of the following:
* Select every third value starting at position 2 in `x`.
* Remove all values with an odd index (e.g. 1, 3, etc.)
* Remove every 4th value, but only if it is odd.
---
## Acknowledgments
Above materials are derived in part from the following sources:
* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)
* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)