---
title: "Data structures & Subsetting"
author: "Colin Rundel"
date: "2018-01-25"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true
```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width = 80
)
library(emo)
htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```
---
class: middle
count: false
# Attributes
---
## Attributes
Attributes are metadata that can be attached to objects in R. Some are special (e.g. `class`, `comment`, `dim`, `dimnames`, `names`) and change the way an object is treated by R.
Attributes are stored as a named list attached to an R object. They can be read or set individually via `attr` and collectively via `attributes`.
```{r}
(x = c(L=1,M=2,N=3))
attr(x,"names") = c("A","B","C")
x
names(x)
```
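As a small additional sketch, `attr` can also attach arbitrary, non-special metadata. The object `z` and the attribute name `units` are made up for illustration.

```{r}
# Hypothetical example: a custom (non-special) attribute
z = c(10.2, 11.5, 9.8)
attr(z, "units") = "cm"   # set a single attribute
attr(z, "units")          # read it back
attributes(z)             # all attributes, as a named list
attributes(z) = NULL      # remove every attribute
z
```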
---
##
```{r}
str(x)
attributes(x)
str(attributes(x))
```
---
## Factors
Factor objects are how R stores data for categorical variables (variables that take a fixed number of discrete values).
```{r}
(x = factor(c("BS", "MS", "PhD", "MS")))
str(x)
typeof(x)
```
---
##
A factor is just an integer vector with two attributes: `class` and `levels`.
```{r}
attributes(x)
```
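One way to see the integer vector underneath is to strip the class. The sketch below rebuilds the same factor under a new name, `f`, chosen only for illustration.

```{r}
f = factor(c("BS", "MS", "PhD", "MS"))
unclass(f)                  # integer codes plus the levels attribute
as.integer(f)               # just the integer codes
levels(f)                   # the labels the codes point to
levels(f)[as.integer(f)]    # codes mapped back to their labels
```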
---
## Exercise 1
Construct a factor variable (without using `factor`, `as.factor`, or related functions) that contains the weather forecast for the next 7 days.
```{r out.width="60%", fig.align="center", echo=FALSE}
knitr::include_graphics("imgs/darksky_forecast.png")
```
* There should be 5 levels - `sun`, `partial clouds`, `clouds`, `rain`, `snow`.
* Start with an *integer* vector and add the appropriate attributes.
---
class: middle
count: false
# Data Frames
---
## Data Frames
Data frames are one of the most commonly used data structures in R. A data frame is just a list of equal-length vectors (usually atomic, but generic vectors, i.e. lists, work as well). Each vector is treated as a column and the elements of the vectors as rows.
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
```
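Because a data frame is a list, the usual list tools apply to it. A minimal sketch, using a separate object `d3` (a name picked for this example) with `stringsAsFactors = FALSE` set explicitly so the result does not depend on the R version:

```{r}
d3 = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(d3)
length(d3)           # number of columns, i.e. list elements
d3[["x"]]            # a column is just a list element
sapply(d3, typeof)   # each column keeps its own type
```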
---
```{r}
typeof(df)
attributes(df)
```
---
## Roll your own data.frame
```{r}
df2 = list(x = 1:3, y = factor(c("a", "b", "c")))
attr(df2,"class") = "data.frame"
attr(df2,"row.names") = 1:3
str(df2)
```
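A quick check that an object built this way behaves like a regular data frame. The construction is repeated under a new name, `d`, so the sketch stands on its own.

```{r}
d = list(x = 1:3, y = factor(c("a", "b", "c")))
attr(d, "class") = "data.frame"
attr(d, "row.names") = 1:3
is.data.frame(d)
dim(d)
d
```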
---
## Strings (Characters) vs Factors
In versions of R before 4.0.0, character vectors are converted into factors by default when they are included in a data frame; since R 4.0.0 the default is to leave them as characters.
Sometimes the conversion is useful, but usually it isn't. Either way it is important to know what type/class you are working with. This behavior can be controlled with the `stringsAsFactors` argument.
```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
```
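Setting the argument explicitly in both directions makes the difference visible regardless of the R version in use. The names `d_chr` and `d_fct` are illustrative only.

```{r}
d_chr = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
d_fct = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
class(d_chr$y)
class(d_fct$y)
```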
---
## Some general advice ...
---
## Length Coercion
As we have seen before, if a vector is shorter than expected, R will increase its length by repeating elements of the short vector. If the longer length is an exact multiple of the shorter one, this happens without any output or warning.
For data frames, if the lengths are not evenly divisible, there will be an error.
```{r error=TRUE}
data.frame(x = 1:3, y = c("a"))
data.frame(x = 1:3, y = c("a","b"))
```
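For comparison, a sketch of how plain atomic vectors treat the same situation: the uneven case is still recycled (with a warning, suppressed here), while `data.frame` refuses it. The error is captured with `tryCatch` so the chunk keeps running; all object names are made up for this example.

```{r}
v_ok = 1:6 + 1:2                       # lengths divide evenly: silent recycling
v_warn = suppressWarnings(1:6 + 1:4)   # uneven: recycled, normally with a warning
v_ok
v_warn
# data.frame() stops instead; capture the error rather than halting
df_err = tryCatch(data.frame(x = 1:6, y = 1:4), error = function(e) "error")
df_err
```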
---
## Growing data frames
We can add rows or columns to a data frame using `rbind` and `cbind` respectively.
```{r}
df = data.frame(x = 1:3, y = c("a","b","c"))
rbind(df, c(TRUE,FALSE))
```
```{r}
cbind(df, z=TRUE)
```
---
```{r}
df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
cbind(df1,df2)
```
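A bare vector has a single type, so binding one as a row forces its values to be coerced to fit each column. One alternative, sketched here with illustrative names `g` and `g2`, is to bind a one-row data frame so each column keeps its type.

```{r}
g = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
# Adding the row as a data frame preserves the column types
g2 = rbind(g, data.frame(x = 4L, y = "d", stringsAsFactors = FALSE))
str(g2)
# Assigning a new column is an alternative to cbind()
g2$z = g2$x > 2
g2
```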
---
## Exercise 2
Construct a data frame that contains the following data (in as efficient a manner as possible). Hint - the `rep` function should prove useful.
```
Patient   Gender   Treatment 1   Treatment 2   Treatment 3
-------   ------   -----------   -----------   -----------
1         Male     Yes           Yes           Yes
2         Male     Yes           Yes           No
3         Male     Yes           No            Yes
4         Male     Yes           No            No
5         Male     No            Yes           Yes
6         Male     No            Yes           No
7         Male     No            No            Yes
8         Male     No            No            No
9         Female   Yes           Yes           Yes
10        Female   Yes           Yes           No
11        Female   Yes           No            Yes
12        Female   Yes           No            No
13        Female   No            Yes           Yes
14        Female   No            Yes           No
15        Female   No            No            Yes
16        Female   No            No            No
```
---
class: middle
count: false
# Matrices
---
## Matrices
A matrix is a 2-dimensional equivalent of an atomic vector (i.e. all entries must share the same type).
```{r}
(m = matrix(c(1,2,3,4), ncol=2, nrow=2))
attributes(m)
```
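The `dim` attribute is what makes the difference: as a minimal sketch, assigning one to an ordinary vector (here called `vec`, a name chosen for this example) is enough to get a matrix.

```{r}
vec = 1:6
dim(vec) = c(2, 3)   # adding a dim attribute makes the vector a 2 x 3 matrix
vec
is.matrix(vec)
attributes(vec)
```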
---
class: split-50
## Column major ordering
A matrix is therefore just an atomic vector with a `dim` attribute where the data is stored in column major order (fill the first column starting at row one, then the next column and so on).
Data in a matrix is always stored in this format, but we can fill by rows using the `byrow` argument.
.column[
```{r}
(cm = matrix(c(1,2,3,4),
            ncol=2, nrow=2))
c(cm)
```
]
.column[
```{r}
(rm = matrix(c(1,2,3,4),
            ncol=2, nrow=2,
            byrow=TRUE))
c(rm)
```
]
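Column major storage also shows up when a matrix is indexed with a single number. A small sketch, using an illustrative matrix `m6`:

```{r}
m6 = matrix(1:6, nrow = 2)   # filled column by column
m6
m6[3]                        # 3rd element in storage order
m6[1, 2]                     # the same element: row 1, column 2
c(t(m6))                     # transposing first gives row-by-row order
```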
---
class: middle
count: false
# Subsetting
---
## Subsetting in General
R has several different subsetting operators (`[`, `[[`, and `$`).
The behavior of these operators will depend on the object they are being used with.
--
In general there are 6 different kinds of index that can be used to subset:
* Positive integers
* Negative integers
* Logical values
* Empty / NULL
* Zero
* Character values (names)
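As a quick preview, each kind of index can be tried on a small named vector. The vector `s` below is made up for this sketch.

```{r}
s = c(a = 1, b = 4, c = 7)
s[2]                      # positive integer
s[-2]                     # negative integer
s[c(TRUE, FALSE, TRUE)]   # logical
s[]                       # empty: everything
s[0]                      # zero: a length-0 vector
s[c("a", "c")]            # character (names)
```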
---
class: split-50
## Positive Integer subsetting
Returns elements at the given location(s) (*note R uses a 1-based not a 0-based indexing scheme*).
```{r}
x = c(1,4,7)
y = list(1,4,7)
```
.column[
```{r}
x[c(1,3)]
x[c(1,1)]
x[c(1.9,2.1)]
```
]
.column[
```{r}
str( y[c(1,3)] )
str( y[c(1,1)] )
str( y[c(1.9,2.1)] )
```
]
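A few further cases, sketched on a separate vector `p` (an illustrative name): fractional indices are truncated toward zero, results follow the order of the indices, and an out-of-range position in an atomic vector gives `NA`.

```{r}
p = c(1, 4, 7)
p[c(1.9, 2.1)]   # truncated toward zero: same as p[c(1, 2)]
p[c(3, 2, 1)]    # result follows the order of the indices
p[5]             # out of range: NA
```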
---
class: split-50
## Negative Integer subsetting
Excludes elements at the given location(s).
.column[
```{r, error=TRUE}
x = c(1,4,7)
x[-1]
x[-c(1,3)]
x[c(-1,-1)]
```
]
.column[
```{r, error=TRUE}
y = list(1,4,7)
str( y[-1] )
str( y[-c(1,3)] )
```
]
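Two edge cases, sketched on a separate vector `w` (an illustrative name): positive and negative indices cannot be mixed in one call, and excluding a position that does not exist leaves the vector unchanged. The error is captured with `tryCatch` so the chunk keeps running.

```{r}
w = c(1, 4, 7)
# Mixing signs is an error; capture it rather than halting
mixed = tryCatch(w[c(-1, 2)], error = function(e) "error")
mixed
w[-5]   # nothing at position 5 to exclude, so everything is returned
```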