---
title: "Data types and functions"
subtitle: "Statistical Computing & Programming"
author: "Shawn Santo"
institute: ""
date: "05-14-20"
output:
  xaringan::moon_reader:
    css: "slides.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---
```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
comment = "#>", highlight = TRUE,
fig.align = "center")
```
## Supplementary materials
Companion videos
- [More on atomic vectors](https://warpwire.duke.edu/w/N8EDAA/)
- [Generic vectors](https://warpwire.duke.edu/w/OcEDAA/)
- [Introduction to functions](https://warpwire.duke.edu/w/O8EDAA/)
- [More on functions](https://warpwire.duke.edu/w/QcEDAA/)
Additional resources
- [Section 3.5](https://adv-r.hadley.nz/vectors-chap.html#lists) Advanced R
- [Section 3.7](https://adv-r.hadley.nz/vectors-chap.html#null) Advanced R
- [Chapter 6](https://adv-r.hadley.nz/functions.html) Advanced R
---
class: inverse, center, middle
# Recall
---
## Vectors
The fundamental building blocks of data in R are vectors (collections of related
values, objects, other data structures, etc.).
R has two types of vectors:
* **atomic** vectors
    - homogeneous collections of the *same* type (e.g. all logical values, all
      numbers, or all character strings).

* **generic** vectors (lists)
    - heterogeneous collections of *any* type of R object, even other lists
      (meaning they can have a hierarchical/tree-like structure).
I will use the term component or element when referring to a value
inside a vector.
---
## Atomic vectors
R has six atomic vector types:
.center[
`logical`, `double`, `integer`, `character`, `complex`, `raw`
]
In this course we will mostly work with the first four. You will rarely work
with the last two types - complex and raw.
---
## Conditional control flow
Conditional (choice) control flow is governed by `if` and `switch()`.
.pull-left[
```{r eval=FALSE}
if (condition) {
# code to run
# when condition is
# TRUE
}
```
]
.pull-right[
```{r eval=FALSE}
if (TRUE) {
print("The condition must have been true!")
}
```
]
---
## `if` is not vectorized
`if` expects a condition of length 1. Depending on the R version, a longer
logical vector triggers a warning or an error. To avoid this problem, you can

1. try to collapse the logical vector to a vector of length 1
    - `any()`
    - `all()`

2. use a vectorized conditional function such as `ifelse()` or
   `dplyr::case_when()`.
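
As a minimal sketch of both remedies (the vector `temps` and its values are made up for illustration):

```{r}
temps <- c(-2, 0, 5)

# Remedy 1: collapse the logical vector to length 1 before using `if`
if (any(temps < 0)) {
  print("At least one value is below zero")
}

# Remedy 2: ifelse() evaluates the condition element by element
labels <- ifelse(temps < 0, "below zero", "zero or above")
labels
```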
---
## Loop types
R supports three types of loops: `for`, `while`, and `repeat`.
```{r eval=FALSE}
for (item in vector) {
##
## Iterate this code
##
}
```
```{r eval=FALSE}
while (we_have_a_true_condition) {
##
## Iterate this code
##
}
```
```{r eval=FALSE}
repeat {
##
## Iterate this code
##
}
```
In the `repeat` loop we will need a `break` statement to end iteration.
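---
## Ending a `repeat` loop

A small sketch of `break` inside `repeat`. The counter `i` and the stopping value 3 are arbitrary choices for illustration.

```{r}
i <- 0
repeat {
  i <- i + 1
  # without this condition the loop would never end
  if (i >= 3) {
    break
  }
}
i
```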
---
## Concatenation
Atomic vectors can be constructed using the concatenate, `c()`, function.
```{r}
c(1,2,3)
```
```{r}
c("Hello", "World!")
```
```{r}
c(1,c(2, c(3)))
```
Atomic vectors are always flat.
---
class: inverse, center, middle
# More on atomic vectors
---
## Atomic vectors
`typeof()` | `mode()` | `storage.mode()`
:-----------|:------------|:----------------
logical | logical | logical
double | numeric | double
integer | numeric | integer
character | character | character
complex | complex | complex
raw | raw | raw
- Function `typeof()` can handle any object
- Functions `mode()` and `storage.mode()` allow for assignment
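
For instance, assignment through `storage.mode()` and `mode()` might look like the sketch below (the object `v` is only an illustration):

```{r}
v <- 1:3
typeof(v)

# change how the values are stored: integer to double
storage.mode(v) <- "double"
typeof(v)

# change the mode: numeric to character
mode(v) <- "character"
v
```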
---
## Examples of type and mode
.pull-left[
```{r}
typeof(c(T, F, T))
typeof(7)
typeof(7L)
typeof("S")
typeof("Shark")
```
]
.pull-right[
```{r}
mode(c(T, F, T))
mode(7)
mode(7L)
mode("S")
mode("Shark")
```
]
---
## Atomic vector type observations
- Numeric means an object of type integer or double.
- Integer literals must be followed by an `L`. The `:` operator is an exception: it returns integers when its starting value is a whole number.
```{r results='hold'}
x <- 1:100
y <- as.numeric(1:100)
c(typeof(x), typeof(y))
```
```{r results='hold'}
object.size(x)
object.size(y)
```
- There is no "string" type or mode, only "character".
---
## Logical predicates
The `is.*(x)` family of functions performs a logical test as to whether
`x` is of type `*`. For example,
.pull-left[
```{r}
is.integer(T)
is.double(pi)
is.character("abc")
is.numeric(1L)
```
]
.pull-right[
```{r}
is.integer(pi)
is.double(pi)
is.integer(1:10)
is.numeric(1)
```
]
Function `is.numeric(x)` returns `TRUE` when `x` is integer or double.
---
## Coercion
Previously, we looked at R's coercion hierarchy:
.center[
`logical` $\rightarrow$ `integer` $\rightarrow$ `double` $\rightarrow$ `character`
]
Coercion can happen implicitly through functions and operations; it can
occur explicitly via the `as.*()` family of functions.
---
## Implicit coercion
.pull-left[
```{r}
x <- c(T, T, F, F, F)
mean(x)
c(1L, 1.0, "one")
0 >= "0"
(0 == "0") != "TRUE"
```
]
.pull-right[
```{r}
1 & TRUE & 5.0 & pi
0 == FALSE
(0 | 1) & 0
```
]
---
## Explicit coercion
.pull-left[
```{r}
as.logical(sqrt(2))
as.character(5L)
as.integer("4")
as.integer("four")
```
]
.pull-right[
```{r}
as.numeric(FALSE)
as.double(10L)
as.complex(5.4)
as.logical(as.character(3))
```
]
---
## Reserved words: `NA`, `NaN`, `Inf`, `-Inf`
- `NA` is a logical constant of length 1 which serves as a missing value indicator.
- `NaN` stands for not a number.
- `Inf`, `-Inf` are positive and negative infinity, respectively.
---
## Missing values
- `NA` can be coerced to any other vector type except raw.
.pull-left[
```{r}
typeof(NA)
typeof(NA+1)
typeof(NA+1L)
```
]
.pull-right[
```{r}
typeof(NA_character_)
typeof(NA_real_)
typeof(NA_integer_)
```
]
---
## `NA` in, `NA` out (most of the time)
```{r}
x <- c(-4, 0, NA, 33, 1 / 9)
mean(x)
NA ^ 4
log(NA)
```
--
Some of the base R functions have an argument `na.rm` to remove `NA` values in
the calculation.
```{r}
mean(x, na.rm = TRUE)
```
---
## Special non-infectious `NA` cases
```{r}
NA ^ 0
NA | TRUE
NA & FALSE
```
--
Why does `NA / Inf` result in `NA`?
---
## Testing for `NA`
Use function `is.na()` (vectorized) to test for `NA` values.
.pull-left[
```{r}
is.na(NA)
is.na(1)
is.na(c(1,2,3,NA))
```
]
.pull-right[
```{r}
any(is.na(c(1,2,3,NA)))
all(is.na(c(1,2,3,NA)))
```
]
---
## `NaN`, `Inf`, and `-Inf`
.pull-left[
```{r}
-5 / 0
0 / 0
1/0 + 1/0
```
]
.pull-right[
```{r}
1/0 - 1/0
NaN / NA
NaN * NA
```
]
- Functions `is.infinite()` and `is.nan()` test for `Inf`/`-Inf` and `NaN`,
respectively. Function `is.finite()` is `TRUE` only for values that are not
`Inf`, `-Inf`, `NaN`, or `NA`.
- Coercion is possible with the `as.*()` family of functions. Be careful with
these; they may not always work as you expect.
.small-text[
```{r}
as.integer(Inf)
```
]
???
- Note that current implementations of R use 32-bit integers for integer vectors,
so the range of representable integers is restricted to about +/-2*10^9:
doubles can hold much larger integers exactly.
- Computations involving `NaN` will return `NaN` or perhaps `NA`: which of those
two is not guaranteed and may depend on the R platform
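---
## Testing for `Inf`, `-Inf`, and `NaN`

A short sketch comparing the predicate functions on a single vector. The vector `v` is a made-up example.

```{r}
v <- c(1, Inf, -Inf, NaN, NA)
is.finite(v)
is.infinite(v)
is.nan(v)
is.na(v)
```

Note that `is.na()` is `TRUE` for `NaN` as well as `NA`, while `is.nan()` is `TRUE` only for `NaN`.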
---
## Atomic vector properties
- Homogeneous
- Elements can have names
- Elements can be indexed by name or position
- Matrices, arrays, factors, and date-times are built on top of atomic
vectors by adding attributes.
.pull-left[
```{r}
x <- c(-3:2)
attributes(x)
x
```
]
.pull-right[
```{r}
attr(x, which = "dim") <- c(2, 3)
attributes(x)
x
```
]
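---
## Names and indexing

A minimal sketch of naming elements and indexing by name or position. The vector `scores` and its names are hypothetical.

```{r}
scores <- c(quiz = 8, exam = 92, lab = 15)
names(scores)

scores["exam"]    # index by name
scores[2]         # index by position
attributes(scores)
```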
---
## Exercises
1. What is the type of each vector below? Check your answer in R.
```{r eval=FALSE}
c(4L, 16, 0)
c(NaN, NA, -Inf)
c(NA, TRUE, FALSE, "TRUE")
c(pi, NaN, NA)
```
2. Write a conditional statement that prints "Can't proceed NA or NaN present!"
if a vector contains `NA` or `NaN`. Test your code with vectors `x` and `y`
below.
```{r}
x <- NA
y <- c(1:5, NaN, NA, sqrt(3))
```
???
## Solutions
1.
.solution[
```{r}
typeof(c(4L, 16, 0))
typeof(c(NaN, NA, -Inf))
typeof(c(NA, TRUE, FALSE, "TRUE"))
typeof(c(pi, NaN, NA))
```
]
2.
.solution[
```{r}
x <- NA
y <- c(1:5, NaN, NA, sqrt(3))
# is.na() is TRUE for both NA and NaN
if (any(is.na(x))) {print("Can't proceed NA or NaN present!")}
if (any(is.na(y))) {print("Can't proceed NA or NaN present!")}
```
]
---
class: inverse, center, middle
# Generic vectors
---
## Lists
Lists are generic vectors: they are one-dimensional (i.e. have a length)
and can contain any type of R object. They are heterogeneous structures.
```{r}
list("A", c(TRUE,FALSE), (1:4)/2, function(x) x^2)
```
---
## Structure
For complex objects, function `str()` will display the structure in a compact
form.
```{r}
str(list("A", c(TRUE,FALSE), (1:4)/2, function(x) x^2))
```
---
## Coercion and testing
Lists can be complex structures and even include other lists.
```{r eval=FALSE}
x <- list("a", list("b", c("c", "d"), list(1:5)))
```
```{r eval=FALSE}
> str(x)
List of 2 #<<
$ : chr "a" #<<
$ :List of 3 #<<
..$ : chr "b"
..$ : chr [1:2] "c" "d"
..$ :List of 1
.. ..$ : int [1:5] 1 2 3 4 5
```
---
## Coercion and testing
Lists can be complex structures and even include other lists.
```{r eval=FALSE}
x <- list("a", list("b", c("c", "d"), list(1:5)))
```
```{r eval=FALSE}
> str(x)
List of 2
$ : chr "a"
$ :List of 3 #<<
..$ : chr "b" #<<
..$ : chr [1:2] "c" "d" #<<
..$ :List of 1 #<<
.. ..$ : int [1:5] 1 2 3 4 5
```
---
## Coercion and testing
Lists can be complex structures and even include other lists.
```{r}
x <- list("a", list("b", c("c", "d"), list(1:5)))
```
```{r eval=FALSE}
> str(x)
List of 2
$ : chr "a"
$ :List of 3
..$ : chr "b"
..$ : chr [1:2] "c" "d"
..$ :List of 1 #<<
.. ..$ : int [1:5] 1 2 3 4 5 #<<
```
--
```{r}
typeof(x)
```
You can test for a list and coerce an object to a list with `is.list()` and
`as.list()`, respectively.
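---
## Testing and coercing lists

A brief sketch of `is.list()` and `as.list()`. The objects `v` and `l` are for illustration only.

```{r}
v <- 1:3
is.list(v)

l <- as.list(v)   # each element becomes its own list component
is.list(l)
str(l)
```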
---
## Flattening
Function `unlist()` will turn a list into an atomic vector. Keep R's coercion
hierarchy in mind if you use this function.
```{r}
y <- list(1:5, pi, c(T, F, T, T))
unlist(y)
```
--
```{r}
x <- list("a", list("b", c("c", "d"), list(1:5)))
unlist(x)
```
---
## List properties
- Lists are heterogeneous.
- List elements can have names.
```{r}
list(stocks = c("AAPL", "BA", "PFE", "C"),
eps = c(1.1, .9, 2.3, .54),
index = c("DJIA", "NASDAQ", "SP500"))
```
- Lists can be indexed by name or position.
- Lists let you extract sublists or a specific object.
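---
## Indexing lists

A sketch of the difference between extracting a sublist and extracting a specific object: `[` keeps the result a list, while `[[` and `$` return the object stored in the list. The list `portfolio` is hypothetical.

```{r}
portfolio <- list(stocks = c("AAPL", "BA"), eps = c(1.1, 0.9))

sub <- portfolio["eps"]       # a sublist: still a list, of length 1
obj <- portfolio[["eps"]]     # the object itself: a double vector
typeof(sub)
typeof(obj)

portfolio[[1]]                # extract by position
portfolio$stocks[2]           # extract by name, then index the vector
```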
---
## Exercise
Create a list based on the JSON product order data below.
```
[
{
"id": {
"oid": "5968dd23fc13ae04d9000001"
},
"product_name": "sildenafil citrate",
"supplier": "Wisozk Inc",
"quantity": 261,
"unit_cost": "$10.47"
},
{
"id": {
"oid": "5968dd23fc13ae04d9000002"
},
"product_name": "Mountain Juniperus ashei",
"supplier": "Keebler-Hilpert",
"quantity": 292,
"unit_cost": "$8.74"
}
]
```
???
## Solution
.solution[
```{r eval=FALSE}
list(
list(
id = list(oid = "5968dd23fc13ae04d9000001"),
product_name = "sildenafil citrate",
supplier = "Wisozk Inc",
quantity = 261,
unit_cost = "$10.47"
),
list(
id = list(oid = "5968dd23fc13ae04d9000002"),
product_name = "Mountain Juniperus ashei",
supplier = "Keebler-Hilpert",
quantity = 292,
unit_cost = "$8.74"
)
)
```
]
---
class: inverse, center, middle
# Functions
---
## Fundamentals
A function has three parts: arguments (formals), a body, and an environment. The
first two will be our main focus as we use and develop functions.
```{r, include=FALSE}
f <- function(x, y, z) {
  # combine words; paste() inserts a single space between arguments by default
  paste(x, y, z)
}
f(x = "just", y = "three", z = "words")
```
.pull-left[
```{r}
f <- function(x, y, z) {
  # combine words; paste() inserts a single space between arguments by default
  paste(x, y, z)
}
f(x = "just",
y = "three",
z = "words")
```
]
.pull-right[
```{r}
formals(f)
body(f)
environment(f)
```
]
---
## Exiting
Most functions end by returning a value (implicitly or explicitly) or in error.
**Implicit return**
```{r}
centers <- function(x) {
c(mean(x), median(x))
}
```
**Explicit return**
```{r}
standardize <- function(x) {
stopifnot(length(x) > 1)
x_stand <- (x - mean(x)) / sd(x)
return(x_stand)
}
```
Using return makes your function easier to read and interpret. R functions
can return any object.
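---
## Exiting: value or error

A sketch of both ways out of `standardize()`, which is repeated here so the chunk runs on its own. `try()` is used only so that the error does not stop the session.

```{r}
standardize <- function(x) {
  stopifnot(length(x) > 1)
  x_stand <- (x - mean(x)) / sd(x)
  return(x_stand)
}

z_ok <- standardize(c(1, 2, 3))               # exits by returning a value
z_ok

z_err <- try(standardize(5), silent = TRUE)   # exits in error
inherits(z_err, "try-error")
```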
---
## Calls
Function calls involve the function's name and, at a minimum, values for
its required arguments. Arguments can be given values by
1. position
```{r}
z <- 1:30
mean(z, .3, FALSE)
```
--
2. name
```{r}
mean(x = z, trim = .3, na.rm = FALSE)
```
--
3. partial name matching
```{r}
mean(x = z, na = FALSE, t = .3)
```
---
## Call style
The best choice is
```{r}
mean(z, trim = .3)
```
Leave the argument's name out for the commonly used (required) arguments, and
always specify the argument names for the optional arguments.
---
## Scope
R uses lexical scoping. This provides a lot of flexibility, but it can also
be problematic if a user is not careful. Let's see if we can get an idea of
the scoping rules.
```{r, eval=FALSE}
y <- 1
f <- function(x){
y <- x ^ 2
return(y)
}
f(x = 3) #<<
y #<<
```
What is the result of `f(x = 3)` and `y`?
???
.solution[
```{r}
y <- 1
f <- function(x){
y <- x ^ 2
return(y)
}
f(x = 3)
y
```
]
---
```{r eval=FALSE}
y <- 1
z <- 2
f <- function(x) {
  y <- x ^ 2
  g <- function() {
    c(x, y, z)
  } # closes body of g()
  g()
} # closes body of f()
f(x = 3) #<<
c(y, z) #<<
```
What is the result of `f(x = 3)` and `c(y, z)`?
--
R first searches for a value associated with a name in the current environment.
If the name is not found, the search continues in the enclosing environment
(where the function was defined), and so on outward.
???
.solution[
```{r}
y <- 1
z <- 2
f <- function(x) {
  y <- x ^ 2
  g <- function() {
    c(x, y, z)
  } # closes body of g()
  g()
} # closes body of f()
f(x = 3)
c(y, z)
```
]
---
## Lazy evaluation
.pull-left[
Arguments to R functions are not evaluated until needed.
```{r, error=TRUE}
f <- function(a, b, x) {
print(a)
print(b ^ 2)
0 * x
}
f(5, 6)
```
]
.middle.pull-right[
![](images/sloth.png)
]
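---
## Lazy evaluation: unused arguments

As a complementary sketch, an argument that is never used is never evaluated, so the call below returns normally even though `b` is a call to `stop()`. The function `g()` is hypothetical.

```{r}
g <- function(a, b) {
  a * 2   # b is never touched
}
result <- g(4, stop("b was evaluated"))
result
```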
---
## Four function forms
| Form | Description | Example(s) |
|:-----------:|:----------------------------:|:-------------------------:|
| Prefix | name comes before arguments | `log(x, base = exp(1))` |
| Infix | name between arguments | `+`, `%>%`, `%/%` |
| Replacement | replace values by assignment | `names(x) <- c("a", "b")` |
| Special | all others not defined above | `[[`, `for`, `break`, `(` |
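---
## Function forms in action

A rough sketch showing that infix and replacement forms are ordinary functions that can also be called in prefix form with backticks. The operator `%+%` is hypothetical and defined here only for illustration.

```{r}
# an infix function called in prefix form
three <- `+`(1, 2)
three

# a user-defined infix function
`%+%` <- function(a, b) paste(a, b)
joined <- "data" %+% "types"
joined

# a replacement function called in prefix form
named <- `names<-`(c(1, 2), c("a", "b"))
named
```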
---
## Help
To get help on any function, type `?fcn_name` in your console, where `fcn_name`
is the function's name. For infix, replacement, and special functions you
will need to surround the function with backticks.
```{r}
?sd
```
```{r}
?`for`
```
```{r}
?`names<-`
```
```{r}
?`%/%`
```
Using function `help()` is an alternative to `?`.
---
## Best practices
- Write a function when you have copied code more than twice.
- Try to use a verb for your function's name.
- Keep argument names short but descriptive.
- Add code comments to explain the "why" of your code.
- Link a family of functions with a common prefix: `pnorm()`, `pbinom()`,
`ppois()`.
- Put data arguments first, then other required arguments, then arguments
with default values. The `...` argument can be placed last.
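---
## Best practices: a sketch

One possible illustration of the naming and argument-order advice. The function `compute_trimmed_mean()` is hypothetical; it simply wraps `mean()`.

```{r}
# verb-based name; data first, then a required argument,
# then a default argument, with ... last
compute_trimmed_mean <- function(x, trim, na.rm = FALSE, ...) {
  # ... is forwarded so callers can pass further arguments to mean()
  mean(x, trim = trim, na.rm = na.rm, ...)
}

trimmed <- compute_trimmed_mean(c(1, 2, 3, 100), trim = 0.25)
trimmed
```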
---
.middle[