---
title: "Data structures & Subsetting"
author: "Colin Rundel"
date: "2018-09-05"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
exclude: true

```{r, message=FALSE, warning=FALSE, include=FALSE}
options(
  htmltools.dir.version = FALSE, # for blogdown
  width=80
)

htmltools::tagList(rmarkdown::html_dependency_font_awesome())
```

---
class: middle
count: false

# Attributes

---

## Attributes

Attributes are metadata that can be attached to objects in R. Some are special (e.g. `class`, `comment`, `dim`, `dimnames`, `names`) and change the way in which an object is treated by R.

Attributes are stored as a named list attached to an R object. They can be accessed (get and set) individually via `attr` and collectively via `attributes`.

.midi[

```{r}
(x = c(L=1,M=2,N=3))
attr(x,"names") = c("A","B","C")
x
names(x)
```
]

---

## 

```{r}
str(x)
attributes(x)
str(attributes(x))
```
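
---

## Custom attributes

As a minimal sketch of `attr` and `attributes` beyond the special cases, the chunk below attaches a made-up `"units"` attribute to a vector. The names `z` and `"units"` are illustrative only; R does not treat them specially.

```{r}
z = 1:3
attr(z, "units") = "cm"   # set a single, non-special attribute
attr(z, "units")          # get it back
attributes(z) = NULL      # drop all attributes at once
z
```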

---

## Factors

Factor objects are how R represents categorical data (e.g. a variable with a fixed number of possible outcomes).

```{r}
(x = factor(c("BS", "MS", "PhD", "MS")))
str(x)
```

```{r}
typeof(x)
```

---

## 

A factor is just an integer vector with two attributes: `class = "factor"` and `levels = ` a character vector.

```{r}
attributes(x)
```
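
---

## Looking inside a factor

One way to see the integer vector behind a factor is to strip its class. This is only a sketch; `f` is an illustrative name and simply repeats the example values used above.

```{r}
f = factor(c("BS", "MS", "PhD", "MS"))
unclass(f)                   # integer codes, levels attribute still attached
as.integer(f)                # just the codes
levels(f)[as.integer(f)]     # codes index into the levels to recover labels
```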

---

## Exercise 1

Construct a factor variable (without using `factor`, `as.factor`, or related functions) that contains the weather forecast for Los Angeles over the next 7 days.

<br/>

```{r out.width="60%", fig.align="center", echo=FALSE}
knitr::include_graphics("imgs/darksky_forecast.png")
```


* There should be 5 levels - `sun`, `partial clouds`, `clouds`, `rain`, `snow`.

* Start with an *integer* vector and add the appropriate attributes.


---
class: middle
count: false

# Data Frames

---

## Data Frames

A data frame is how R handles heterogeneous tabular data (i.e. rows and columns) and is one of the most commonly used data structures in R.

At their core, R represents data frames as a list of equal-length vectors (usually atomic, but you can use lists as well).

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
```
---

```{r}
typeof(df)
attributes(df)
```
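
---

## Data frames are lists

Because a data frame is a list of equal-length vectors, the usual list tools should apply to it. The sketch below uses an illustrative data frame `d` to check this.

```{r}
d = data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(d)          # a data frame is a list underneath
length(d)           # so its length is the number of columns
sapply(d, length)   # and every column has the same length
d[["x"]]            # list-style extraction returns the column vector
```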

---

## Roll your own data.frame

```{r}
df2 = list(x = 1:3, y = factor(c("a", "b", "c")))
```

--

.pull-left[
```{r}
attr(df2,"class") = "data.frame"
df2
```
]

--

.pull-right[
```{r}
attr(df2,"row.names") = 1:3
df2
```
]

```{r}
str(df2)
```
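
---

## Roll your own with structure()

The same construction can probably be written in one step with `structure()`, which sets several attributes at once. This is a sketch rather than a recommended way to build data frames; `d3` is an illustrative name.

```{r}
d3 = structure(
  list(x = 1:3, y = c("a", "b", "c")),  # the underlying list of columns
  class = "data.frame",                 # makes R treat it as a data frame
  row.names = 1:3                       # one row name per row
)
is.data.frame(d3)
dim(d3)
```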

---

## Strings (Characters) vs Factors

Before R 4.0.0, `data.frame()` converted character vectors into factors by default; since R 4.0.0 the default is `stringsAsFactors = FALSE` and they stay characters.

Sometimes the conversion is useful, usually it isn't. Either way it is important to know what type/class you are working with. The behavior is controlled by the `stringsAsFactors` argument.

```{r}
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
```
---

## Some general advice ...

<br/>
<br/>

<img src="imgs/stringsasfactors.jpg" align="center" width="650px"/>

---

## Length Coercion

As we have seen before, if a vector is shorter than expected, R will increase its length by repeating elements of the shorter vector. If the longer length is a multiple of the shorter, this coercion occurs without any warnings or errors.

For data frames, if the lengths are not evenly divisible there will be an error (in previous examples this only produced a warning).

```{r error=TRUE}
data.frame(x = 1:3, y = c("a"))
data.frame(x = 1:3, y = c("a","b"))
```
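
---

## Recycling in plain vectors

For contrast with the data frame behavior, this sketch shows the same two situations with plain vector arithmetic, where a length mismatch is only a warning.

```{r}
1:6 + 1:2   # 6 is a multiple of 2: recycled silently
1:3 + 1:2   # 3 is not a multiple of 2: still recycled, but with a warning
```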
---

## Growing data frames 

We can add rows or columns to a data frame using `rbind` and `cbind` respectively.

```{r}
df = data.frame(x = 1:3, y = c("a","b","c"))
rbind(df, c(TRUE,FALSE))
```

```{r}
cbind(df, z=TRUE)
```

---

```{r}
df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
cbind(df1,df2)
```
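
---

## Adding a typed row

A bare vector passed to `rbind` has to be coerced to fit each column. One alternative, sketched here, is to bind a one-row data frame whose columns already have the right types. The names `d` and `d_new` are illustrative.

```{r}
d = data.frame(x = 1:3, y = c("a","b","c"), stringsAsFactors = FALSE)
new_row = data.frame(x = 4L, y = "d", stringsAsFactors = FALSE)
d_new = rbind(d, new_row)   # columns are matched by name
str(d_new)
```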

---

## Matrices

A matrix is a two-dimensional equivalent of an atomic vector (i.e. all entries must share the same type).

```{r}
(m = matrix(c(1,2,3,4), ncol=2, nrow=2))

attributes(m)
```

---

## Column major ordering

A matrix is therefore just an atomic vector with a `dim` attribute where the data is stored in column major order (fill the first column starting at row one, then the next column and so on).

Data in a matrix is always stored in this format, but we can fill by rows instead by using the `byrow` argument.

.pull-left[
```{r}
cm = matrix(c(1,2,3,4), 
            ncol=2, nrow=2)

cm
c(cm)
```
]

.pull-right[
```{r}
rm = matrix(c(1,2,3,4), 
            ncol=2, nrow=2, 
            byrow=TRUE)
rm
c(rm)
```
]
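
---

## A vector with a dim attribute

To make the "atomic vector plus `dim`" description concrete, the sketch below turns an ordinary vector into a matrix just by setting `dim`. The name `v` is illustrative.

```{r}
v = 1:6
dim(v) = c(2, 3)   # adding dim turns the vector into a 2 x 3 matrix
v
is.matrix(v)
v[5]               # linear index follows column major order
v[1, 3]            # the same element: row 1, column 3
```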


---
class: middle
count: false

# Subsetting

---

## Subsetting in General

R has several different subsetting operators (`[`, `[[`, and `$`).

The behavior of these operators will depend on the object they are being used with.

<br/>

--

In general there are 6 different data types that can be used to subset:

* Positive integers

* Negative integers

* Logical values

* Empty / NULL

* Zero

* Character values (names)

---

## Positive Integer subsetting

Returns elements at the given location(s). Note that R uses a 1-based indexing scheme, and non-integer values are truncated toward zero.

```{r}
x = c(1,4,7)
y = list(1,4,7)
```

.pull-left[.small[
```{r}
x[c(1,3)]
x[c(1,1)]
x[c(1.9,2.1)]
```
] ]

.pull-right[ .small[
```{r}
str( y[c(1,3)] )
str( y[c(1,1)] )
str( y[c(1.9,2.1)] )
```
] ]

---

## Negative Integer subsetting

Excludes elements at the given location(s). Positive and negative values cannot be mixed.

.pull-left[
```{r, error=TRUE}
x = c(1,4,7)
x[-1]
x[-c(1,3)]
x[c(-1,-1)]
```
]

.pull-right[
```{r, error=TRUE}
y = list(1,4,7)
str( y[-1] )
str( y[-c(1,3)] )
```
]

```{r error=TRUE}
x[c(-1,2)]
```

---

## Logical Value Subsetting

Returns elements that correspond to `TRUE` in the logical vector. The logical vector is expected to be the same length as the vector being subset; a shorter one is recycled.

.pull-left[
```{r}
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
x[c(TRUE,FALSE)]
x[x %% 2 == 0]
```
]

.pull-right[
```{r, error=TRUE}
y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
str( y[c(TRUE,FALSE)] )
```
]

--

```{r error=TRUE}
str( y[y %% 2 == 0] )
```
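
---

## Logical subsetting of a list

`%%` is not defined for lists, so the logical vector has to be computed element by element. One possible approach, sketched here with `sapply`, is shown below; `yl` and `is_even` are illustrative names.

```{r}
yl = list(1, 4, 7, 12)
is_even = sapply(yl, function(v) v %% 2 == 0)  # one TRUE/FALSE per element
is_even
str( yl[is_even] )
```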

---

## Empty Subsetting

Returns the original vector.

```{r}
x = c(1,4,7)
x[]

y = list(1,4,7)
str(y[])
```
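
---

## Empty subsetting and assignment

Empty subsetting looks pointless on its own, but it can matter on the left-hand side of an assignment. The sketch below contrasts the two forms using illustrative vectors `v1` and `v2`.

```{r}
v1 = c(a = 1, b = 4, c = 7)
v2 = v1
v1[] = 0   # replaces every element, keeping length and names
v1
v2 = 0     # replaces the whole object with a single value
v2
```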

---

## Zero subsetting

Returns an empty vector of the same type as the vector being subset. When mixed with other indices, zeros are ignored.

.pull-left[
```{r}
x = c(1,4,7)
x[0]

y = list(1,4,7)
str(y[0])
```
]

.pull-right[
```{r}
x[c(0,1)]

y[c(0,1)]
```
]

---

## Character subsetting

If the vector has names, select elements whose names correspond to the character vector.

.pull-left[
```{r}
x = c(a=1,b=4,c=7)
x["a"]
x[c("a","a")]
x[c("b","c")]
```
]

.pull-right[
```{r}
y = list(a=1,b=4,c=7)
str(y["a"])
str(y[c("a","a")])
str(y[c("b","c")])
```
]

---

## Out of bound subsetting

.pull-left[
```{r}
x = c(1,4,7)
x[4]
x["a"]
x[c(1,4)]
```
]

.pull-right[
```{r}
y = list(1,4,7)
str(y[4])
str(y["a"])
str(y[c(1,4)])
```
]

---

## Missing and NULL subsetting

.pull-left[
```{r}
x = c(1,4,7)
x[NA]
x[NULL]
x[c(1,NA)]
```
]

.pull-right[
```{r}
y = list(1,4,7)
str(y[NA])
str(y[NULL])
str(y[c(1,NA)])
```
]
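
---

## Why NA returns three values

A plain `NA` is a logical value, so subsetting with it behaves like logical subsetting and is recycled to the full length. The sketch below compares it with the integer missing value `NA_integer_`; `w` is an illustrative name.

```{r}
w = c(1, 4, 7)
typeof(NA)       # "logical"
w[NA]            # recycled like any short logical vector
w[NA_integer_]   # integer subsetting at one unknown position
```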

---

## Atomic vectors - [ vs. [[

`[[` subsets like `[` except it can only subset a single value. 

```{r, error=TRUE}
x = c(a=1,b=4,c=7)
x[[1]]
x[["a"]]
x[[1:2]]
```

---

## Generic Vectors - [ vs. [[

`[[` subsets a single value, but returns the value itself, not a list containing that value.

```{r, error=TRUE}
y = list(a=1,b=4,c=7)
y[2]
y[[2]]
y[["b"]]
y[[1:2]]
```

---

## Hadley's Analogy

```{r echo=FALSE, fig.align="center", out.width="80%"}
knitr::include_graphics("imgs/pepper_subset.png")
```

---

## [[ vs. $

`$` is equivalent to `[[` but it only works for named *lists* and it has a terrible default where it uses partial matching (`exact=FALSE`) to access the underlying value.

```{r, error=TRUE}
x = c("abc"=1, "def"=5)
x$abc
y = list("abc"=1, "def"=5)
y[["abc"]]
y$abc
y$d
```
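
---

## Partial matching with [[

The `exact` argument mentioned above belongs to `[[`, where it defaults to `TRUE`. As a sketch, the chunk below shows the three behaviors side by side on an illustrative list `l`.

```{r}
l = list(abc = 1, def = 5)
l$d                       # partial match on "def"
l[["d"]]                  # exact matching by default: no such element
l[["d", exact = FALSE]]   # explicitly opting in to partial matching
```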
---

## A common gotcha

Why does the following code not work?

```{r error=TRUE}
x = list(abc = 1:10, def = 10:1)
y = "abc"

x$y
```

--

$$ x$y \Leftrightarrow x[["y"]] \ne x[[y]] $$

```{r}
x[[y]]
```

---

## Exercise 2

Below are 100 values,

```{r}
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
```

write down how you would create a subset to accomplish each of the following:

* Select every third value starting at position 2 in `x`.

* Remove all values with an odd index (e.g. 1, 3, etc.).

* Remove every 4th value, but only if it is odd.

---

## Acknowledgments

The above materials are derived in part from the following sources:

* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)
* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)