Data types in R

class: center, middle, inverse, title-slide

# Data types in R
### Colin Rundel
### 2018-08-30

---

exclude: true

---
class: middle
count: false

# Atomic Vectors

---

## Atomic Vectors

R has six atomic vector types:

<br/>

`typeof`  |  `mode`     |  `storage.mode`
:-----------|:------------|:----------------
logical     |  logical    |  logical
double      |  numeric    |  double
integer     |  numeric    |  integer
character   |  character  |  character
complex     |  complex    |  complex
raw         |  raw        |  raw
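
The table can be checked directly. The following is a minimal sketch; the list `values` and its six sample elements are illustrative choices, not part of the original material.

```r
# One sample value of each atomic type (values chosen only for illustration)
values = list(TRUE, 1.5, 1L, "a", 1+2i, as.raw(1))

# Query each value with the three type-reporting functions
types   = sapply(values, typeof)
modes   = sapply(values, mode)
storage = sapply(values, storage.mode)

# Combine into a character matrix with one row per sample value
cbind(typeof = types, mode = modes, storage.mode = storage)
```

Only the `double` and `integer` rows differ between columns: both report `numeric` as their mode.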

---

## Vector types

`logical` - boolean values `TRUE` and `FALSE`

.pull-left[

```r
typeof(TRUE)
```

```
## [1] "logical"
```
]

.pull-right[

```r
mode(TRUE)
```

```
## [1] "logical"
```
]

<br/>

`character` - text strings

<div>

.pull-left[

```r
typeof("hello")
```

```
## [1] "character"
```

```r
typeof('world')
```

```
## [1] "character"
```
]

.pull-right[

```r
mode("hello")
```

```
## [1] "character"
```

```r
mode('world')
```

```
## [1] "character"
```
]

</div>

---

`double` - floating point numerical values (default numerical type)

.pull-left[

```r
typeof(1.33)
```

```
## [1] "double"
```

```r
typeof(7)
```

```
## [1] "double"
```
]

.pull-right[

```r
mode(1.33)
```

```
## [1] "numeric"
```

```r
mode(7)
```

```
## [1] "numeric"
```
]

<br/>

`integer` - integer numerical values (indicated with an `L`)

<div>

.pull-left[

```r
typeof( 7L )
```

```
## [1] "integer"
```

```r
typeof( 1:3 )
```

```
## [1] "integer"
```
]

.pull-right[

```r
mode( 7L )
```

```
## [1] "numeric"
```

```r
mode( 1:3 )
```

```
## [1] "numeric"
```
]

</div>

---

## Concatenation

Atomic vectors can be constructed using the concatenate, `c()`, function.

```r
c(1,2,3)
```

```
## [1] 1 2 3
```

```r
c("Hello", "World!")
```

```
## [1] "Hello"  "World!"
```

```r
c(1,c(2, c(3)))
```

```
## [1] 1 2 3
```

**Note** - atomic vectors are *always* flat.
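
A quick way to confirm the flattening, as a small sketch (the names `nested` and `flat` are only for illustration):

```r
# Nested calls to c() collapse into a single atomic vector
nested = c(1, c(2, c(3)))
flat   = c(1, 2, 3)

length(nested)           # 3, no nesting is preserved
identical(nested, flat)  # TRUE
```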

---
class: split-thirds

## Testing types

* `typeof(x)` - returns a character vector (length 1) of the *type* of object `x`.

* `mode(x)` - returns a character vector (length 1) of the *mode* of object `x`.

* `storage.mode(x)` - returns a character vector (length 1) of the *storage mode* of object `x`.

.col1[

```r
typeof(1)
```

```
## [1] "double"
```

```r
typeof(1L)
```

```
## [1] "integer"
```

```r
typeof("A")
```

```
## [1] "character"
```

```r
typeof(TRUE)
```

```
## [1] "logical"
```
]

.col2[

```r
mode(1)
```

```
## [1] "numeric"
```

```r
mode(1L)
```

```
## [1] "numeric"
```

```r
mode("A")
```

```
## [1] "character"
```

```r
mode(TRUE)
```

```
## [1] "logical"
```
]

.col3[

```r
storage.mode(1)
```

```
## [1] "double"
```

```r
storage.mode(1L)
```

```
## [1] "integer"
```

```r
storage.mode("A")
```

```
## [1] "character"
```

```r
storage.mode(TRUE)
```

```
## [1] "logical"
```
]

---

## Logical Predicates

* `is.logical(x)` - returns `TRUE` if `x` has *type* logical.

* `is.character(x)` - returns `TRUE` if `x` has *type* character.

* `is.double(x)` - returns `TRUE` if `x` has *type* double.

* `is.integer(x)` - returns `TRUE` if `x` has *type* integer.

* `is.numeric(x)` - returns `TRUE` if `x` has *mode* numeric.

.col1[

```r
is.integer(1)
```

```
## [1] FALSE
```

```r
is.integer(1L)
```

```
## [1] TRUE
```

```r
is.integer(3:7)
```

```
## [1] TRUE
```
]

.col2[

```r
is.double(1)
```

```
## [1] TRUE
```

```r
is.double(1L)
```

```
## [1] FALSE
```

```r
is.double(3:8)
```

```
## [1] FALSE
```
]

.col3[

```r
is.numeric(1)
```

```
## [1] TRUE
```

```r
is.numeric(1L)
```

```
## [1] TRUE
```

```r
is.numeric(3:7)
```

```
## [1] TRUE
```
]

---

## Other useful predicates

* `is.atomic(x)` - returns `TRUE` if `x` is an *atomic vector*.

* `is.vector(x)` - returns `TRUE` if `x` is either type of vector (i.e. either *atomic vector* or *list*).

```r
is.atomic(c(1,2,3))
```

```
## [1] TRUE
```

```r
is.vector(c(1,2,3))
```

```
## [1] TRUE
```

```r
is.atomic(list(1,2,3))
```

```
## [1] FALSE
```

```r
is.vector(list(1,2,3))
```

```
## [1] TRUE
```
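
One caveat that may be worth keeping in mind: `is.vector` also returns `FALSE` when an object carries attributes other than names. A small sketch, where the variable `x` and the `units` attribute are made up for illustration:

```r
x = c(a = 1, b = 2)          # names are allowed
named_ok = is.vector(x)      # TRUE

attr(x, "units") = "km"      # any other attribute changes the answer
with_attr = is.vector(x)     # FALSE
still_atomic = is.atomic(x)  # TRUE, the underlying type is unchanged
```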

---

## Type Coercion

R is a dynamically typed language -- it will automatically convert between most types without raising warnings or errors.

```r
c(1,"Hello")
```

```
## [1] "1"     "Hello"
```

```r
c(FALSE, 3L)
```

```
## [1] 0 3
```

```r
c(1.2, 3L)
```

```
## [1] 1.2 3.0
```
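
The printed output hints at the resulting types, and wrapping the same calls in `typeof` makes them explicit. A short sketch (the `*_result` names are just for illustration):

```r
# Same three vectors as above, now asking for the type of the result
chr_result = typeof(c(1, "Hello"))  # "character"
int_result = typeof(c(FALSE, 3L))   # "integer"
dbl_result = typeof(c(1.2, 3L))     # "double"

c(chr_result, int_result, dbl_result)
```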

---

## Operator coercion

Functions and operators will attempt to coerce objects to an appropriate type.

```r
3.1+1L
```

```
## [1] 4.1
```

```r
log(TRUE)
```

```
## [1] 0
```

```r
TRUE & 7
```

```
## [1] TRUE
```

```r
FALSE | !5
```

```
## [1] FALSE
```
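
A common practical use of this behavior is counting with logical vectors, since `TRUE` is treated as 1 and `FALSE` as 0 by numeric functions. A minimal sketch with made-up data in `x`:

```r
x = c(3, 8, 1, 9, 4)      # illustrative data
above = x > 3             # FALSE TRUE FALSE TRUE TRUE

n_above    = sum(above)   # 3, the number of TRUE values
prop_above = mean(above)  # 0.6, the proportion of TRUE values
```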

---

## Explicit Coercion

Most of the `is` functions we just saw have an `as` variant which can be used for *explicit* coercion.

.pull-left[

```r
as.logical(5.2)
```

```
## [1] TRUE
```

```r
as.character(TRUE)
```

```
## [1] "TRUE"
```

```r
as.integer(pi)
```

```
## [1] 3
```
]

.pull-right[

```r
as.numeric(FALSE)
```

```
## [1] 0
```

```r
as.double("7.2")
```

```
## [1] 7.2
```

```r
as.double("one")
```

```
## Warning: NAs introduced by coercion
```

```
## [1] NA
```
]
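
Two details of explicit coercion that can be surprising are sketched below: `as.integer` truncates toward zero rather than rounding, and `as.logical` only recognizes a small set of spellings for character input. The variable names are illustrative.

```r
trunc_pos = as.integer(2.7)    # 2
trunc_neg = as.integer(-2.7)   # -2, truncated toward zero, not rounded

lgl_ok = as.logical("T")       # TRUE
lgl_na = as.logical("yes")     # NA, "yes" is not a recognized spelling
```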

---

## Missing Values

R uses `NA` to represent missing values in its data structures. What may not be obvious is that there are different `NA`s for the different types.

.pull-left[

```r
typeof(NA)
```

```
## [1] "logical"
```

```r
typeof(NA+1)
```

```
## [1] "double"
```

```r
typeof(NA+1L)
```

```
## [1] "integer"
```
]

.pull-right[

```r
typeof(NA_character_)
```

```
## [1] "character"
```

```r
typeof(NA_real_)
```

```
## [1] "double"
```

```r
typeof(NA_integer_)
```

```
## [1] "integer"
```
]
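
Because a plain `NA` is logical, it is coerced like any other logical value when combined with other types, so the typed variants rarely need to be written out by hand. A brief sketch (the `na_*` names are only for illustration):

```r
na_int = c(NA, 7L)    # NA becomes NA_integer_
na_dbl = c(NA, 2.5)   # NA becomes NA_real_
na_chr = c(NA, "a")   # NA becomes NA_character_

c(typeof(na_int), typeof(na_dbl), typeof(na_chr))
```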

---

## Stickiness of Missing Values

Because `NA`s represent missing values it makes sense that any calculation using them should also be missing.

.pull-left[

```r
1 + NA
```

```
## [1] NA
```

```r
1 / NA
```

```
## [1] NA
```

```r
NA * 5
```

```
## [1] NA
```
]

.pull-right[

```r
mean(c(1,2,3,NA))
```

```
## [1] NA
```

```r
sqrt(NA)
```

```
## [1] NA
```

```r
3^NA
```

```
## [1] NA
```
]
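
When missing values should simply be skipped, many summary functions accept `na.rm = TRUE`. There are also a few operations whose result would be the same whatever the missing value had been, and these do not return `NA`. A small sketch with illustrative variable names:

```r
x = c(1, 2, 3, NA)

mean_all = mean(x)                # NA
mean_obs = mean(x, na.rm = TRUE)  # 2, computed from the observed values only

# Results that do not depend on the missing value
pow_zero  = NA^0        # 1
or_true   = NA | TRUE   # TRUE
and_false = NA & FALSE  # FALSE
```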

---

## Conditionals and missing values

`NA`s can be problematic in some cases (particularly for control flow).

```r
1 == NA
```

```
## [1] NA
```

```r
if (2 != NA)
  "Here"
```

```
## Error in if (2 != NA) "Here": missing value where TRUE/FALSE needed
```

```r
if (all(c(1,2,NA,4) >= 1))
  "There"
```

```
## Error in if (all(c(1, 2, NA, 4) >= 1)) "There": missing value where TRUE/FALSE needed
```

```r
if (any(c(1,2,NA,4) >= 1))
  "There"
```

```
## [1] "There"
```
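
Two possible ways to keep a condition from failing on `NA` are sketched here: wrapping it in `isTRUE`, or dropping missing values with `na.rm = TRUE`. Which one is appropriate depends on how missing data should be treated; the names `cond_na`, `cond_rm`, and `msg` are illustrative.

```r
x = c(1, 2, NA, 4)

# isTRUE() returns FALSE for NA, so the result is always a valid condition
cond_na = isTRUE(all(x >= 1))        # FALSE

# na.rm = TRUE drops missing values before all() is evaluated
cond_rm = all(x >= 1, na.rm = TRUE)  # TRUE

msg = if (cond_rm) "There" else "Not there"
msg
```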

---

## Testing for `NA`

To explicitly test if a value is missing it is necessary to use `is.na` (often along with `any` or `all`).

.pull-left[

```r
is.na(NA)
```

```
## [1] TRUE
```

```r
is.na(1)
```

```
## [1] FALSE
```

```r
is.na(c(1,2,3,NA))
```

```
## [1] FALSE FALSE FALSE  TRUE
```
]

.pull-right[

```r
any(is.na(c(1,2,3,NA)))
```

```
## [1] TRUE
```

```r
all(is.na(c(1,2,3,NA)))
```

```
## [1] FALSE
```
]
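
The same logical vector can be used to count or drop missing values. A minimal sketch with illustrative names, which also shows why `==` cannot be used for the test:

```r
x = c(1, 2, 3, NA)

n_missing = sum(is.na(x))  # 1, TRUE counts as 1
observed  = x[!is.na(x)]   # 1 2 3, keeps only the non-missing values

wrong = x == NA            # NA NA NA NA, comparison with NA is always missing
```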

---

## Other Special (double) values

* `NaN` - Not a number

* `Inf` - Positive infinity

* `-Inf` - Negative infinity

.pull-left[

```r
pi / 0
```

```
## [1] Inf
```

```r
0 / 0
```

```
## [1] NaN
```

```r
1/0 + 1/0
```

```
## [1] Inf
```
]

.pull-right[

```r
1/0 - 1/0
```

```
## [1] NaN
```

```r
NaN / NA
```

```
## [1] NaN
```

```r
NaN * NA
```

```
## [1] NaN
```
]

---

## Testing for `Inf` and `NaN`

`NaN` and `Inf` don't have the same testing issues that `NA` has, but there are still convenience functions for testing for these values.

.pull-left[

```r
NA
```

```
## [1] NA
```

```r
1/0+1/0
```

```
## [1] Inf
```

```r
1/0-1/0
```

```
## [1] NaN
```

```r
1/0-1/0
```

```
## [1] NaN
```
]

.pull-right[

```r
is.finite(NA)
```

```
## [1] FALSE
```

```r
is.finite(1/0+1/0)
```

```
## [1] FALSE
```

```r
is.finite(1/0-1/0)
```

```
## [1] FALSE
```

```r
is.nan(1/0-1/0)
```

```
## [1] TRUE
```
]
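
The predicates can be compared side by side on a single vector. This is a small sketch; the vector `vals` is made up for illustration. Note that `is.na` is `TRUE` for `NaN`, while `is.nan` is `FALSE` for `NA`.

```r
vals = c(NA, Inf, -Inf, NaN, 1)

finite   = is.finite(vals)    # FALSE FALSE FALSE FALSE  TRUE
infinite = is.infinite(vals)  # FALSE  TRUE  TRUE FALSE FALSE
nan      = is.nan(vals)       # FALSE FALSE FALSE  TRUE FALSE
missing  = is.na(vals)        #  TRUE FALSE FALSE  TRUE FALSE
```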

---

## Coercion for infinity and NaN

First remember that `Inf`, `-Inf`, and `NaN` have type double. However, their coercion behavior is not the same as for other double values.

```r
as.integer(Inf)
```

```
## Warning: NAs introduced by coercion to integer range
```

```
## [1] NA
```

```r
as.integer(NaN)
```

```
## [1] NA
```

.pull-left[

```r
as.logical(Inf)
```

```
## [1] TRUE
```

```r
as.logical(NaN)
```

```
## [1] NA
```
]

.pull-right[

```r
as.character(Inf)
```

```
## [1] "Inf"
```

```r
as.character(NaN)
```

```
## [1] "NaN"
```
]

---

## Exercise 1

**Part 1**

What is the type of the following vectors? Explain why they have that type.

* `c(1, NA+1L, "C")`
* `c(1L / 0, NA)`
* `c(1:3, 5)`
* `c(3L, NaN+1L)`
* `c(NA, TRUE)`

**Part 2**

Considering only the four (common) data types, what is R's implicit type conversion hierarchy (from highest priority to lowest priority)?

*Hint* - think about the pairwise interactions between types.

---
class: middle
count: false

# Generic Vectors

---

## Lists

Lists are _generic vectors_, in that they are 1 dimensional (i.e. have a length) and can contain any type of R object.

```r
list("A", c(TRUE,FALSE), (1:4)/2, function(x) x^2)
```

```
## [[1]]
## [1] "A"
## 
## [[2]]
## [1]  TRUE FALSE
## 
## [[3]]
## [1] 0.5 1.0 1.5 2.0
## 
## [[4]]
## function (x) 
## x^2
```
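
Like any vector, a list has a length, and each element keeps its own type. A short sketch using the same list, stored here in an illustrative variable `x`:

```r
x = list("A", c(TRUE, FALSE), (1:4)/2, function(x) x^2)

len = length(x)                 # 4
elem_types = sapply(x, typeof)  # "character" "logical" "double" "closure"
```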

---

## Structure

Often we want a more compact representation of a complex object. The `str` function is useful for this particular task.

```r
str( list("A", c(TRUE,FALSE), (1:4)/2, function(x) x^2) )
```

```
## List of 4
##  $ : chr "A"
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
##  $ :function (x)  
##   ..- attr(*, "srcref")= 'srcref' int [1:8] 1 40 1 54 40 54 1 1
##   .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x7fd61d7afe70>
```

---

## Recursive lists

Lists can contain other lists, meaning they don't have to be flat.

```r
str( list(1, list(2, list(3, 4), 5)) )
```

```
## List of 2
##  $ : num 1
##  $ :List of 3
##   ..$ : num 2
##   ..$ :List of 2
##   .. ..$ : num 3
##   .. ..$ : num 4
##   ..$ : num 5
```

---

## List Coercion

By default a vector will be coerced to a list (as a list is more generic) if needed.

```r
str( c(1, list(4, list(6, 7))) )
```

```
## List of 3
##  $ : num 1
##  $ : num 4
##  $ :List of 2
##   ..$ : num 6
##   ..$ : num 7
```

We can coerce a list into an atomic vector using `unlist` - the usual type coercion rules then apply to determine its type.

```r
unlist(list(1:3, list(4:5, 6)))
```

```
## [1] 1 2 3 4 5 6
```

```r
unlist( list(1, list(2, list(3, "Hello"))) )
```

```
## [1] "1"     "2"     "3"     "Hello"
```
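
Going the other direction explicitly, `as.list` and `list` behave differently on an atomic vector. A minimal sketch (variable names are illustrative):

```r
wrapped = list(1:3)     # a list of length 1 that holds the whole vector
split   = as.list(1:3)  # a list of length 3, one element per value

c(length(wrapped), length(split))

round_trip = unlist(split)  # back to the integer vector 1 2 3
```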

---

## Named lists

Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

```r
str(list(A = 1, B = list(C = 2, D = 3)))
```

```
## List of 2
##  $ A: num 1
##  $ B:List of 2
##   ..$ C: num 2
##   ..$ D: num 3
```

```r
list("knock knock" = "who's there?")
```

```
## $`knock knock`
## [1] "who's there?"
```

```r
names(list(ABC=1, DEF=list(H=2, I=3)))
```

```
## [1] "ABC" "DEF"
```
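
A brief sketch of both points, naming the elements of an atomic vector and using names to reach into a nested list. The objects `v` and `l` are illustrative, and `$` and `[[` are shown only as one way to access named elements.

```r
v = c(a = 1, b = 2)       # names work on atomic vectors too
v_names = names(v)        # "a" "b"

l = list(A = 1, B = list(C = 2, D = 3))
inner_d = l$B$D           # 3
inner_c = l[["B"]][["C"]] # 2
inner_names = names(l$B)  # "C" "D"
```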

---

## Exercise 2

Represent the following JSON data as a list in R.

```json
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": 
  {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": 10021
  },
  "phoneNumber": 
  [
    {
      "type": "home",
      "number": "212 555-1239"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ]
}
```

---
class: middle
count: false

# Functions

---

## When to use functions

The goal of a function should be to encapsulate a *small* *reusable* piece of code.

* Name should make it clear what the function does (think in terms of simple verbs).

* Functionality should be simple enough to be quickly understood.

* The smaller and more modular the code the easier it will be to reuse elsewhere.

* Better to change code in one location than code everywhere.

---

## Function Parts

The two parts of a function are the arguments (`formals`) and the code (`body`).

```r
gcd = function(long1, lat1, long2, lat2) {
  # great-circle distance; all coordinates are expected in radians
  R = 6371 # Earth mean radius in km
  # distance in km
  acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
}
```

.pull-left[

```r
formals(gcd)
```

```
## $long1
## 
## 
## $lat1
## 
## 
## $long2
## 
## 
## $lat2
```
]

.pull-right[

```r
body(gcd)
```

```
## {
##     R = 6371
##     acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(long2 - 
##         long1)) * R
## }
```
]
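
A usage sketch for `gcd`: since R's trigonometric functions work in radians, coordinates given in degrees need converting first. The helper `deg2rad` is hypothetical and not part of the original material; the two points are chosen so the answer is easy to check by hand.

```r
gcd = function(long1, lat1, long2, lat2) {
  R = 6371 # Earth mean radius in km
  acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
}

# Hypothetical helper (an assumption for this example): degrees to radians
deg2rad = function(deg) deg * pi / 180

# Two points on the equator, 90 degrees of longitude apart
d = gcd(deg2rad(0), deg2rad(0), deg2rad(90), deg2rad(0))
d  # a quarter of the circumference, 6371 * pi / 2 km

arg_names = names(formals(gcd))  # "long1" "lat1" "long2" "lat2"
```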

---

## Return values

There are two ways of returning values in R: explicit or implicit return values.

<br/>

*Explicit* - includes one or more `return` statements

```r
f = function(x) {
  return(x*x)
}
```

<br/>

*Implicit* - value of the last statement is returned.

```r
f = function(x) {
  x*x
}
```
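
The two styles can be mixed. One plausible pattern, sketched here with a hypothetical function `clamp01`, uses explicit `return` calls for early exits and an implicit return for the final value:

```r
# Hypothetical example: restrict a single number to the interval [0, 1]
clamp01 = function(x) {
  if (x < 0) return(0)  # explicit early return
  if (x > 1) return(1)  # explicit early return
  x                     # implicit return of the last value
}

c(clamp01(-2), clamp01(0.3), clamp01(5))
```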

---

## Returning multiple values

If we want a function to return more than one value we can group things using either a vector or a list.

```r
f = function(x) {
  c(x, x^2, x^3)
}

f(2)
```

```
## [1] 2 4 8
```

```r
f(2:3)
```

```
## [1]  2  3  4  9  8 27
```
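
The vector version mixes the results together when `x` has more than one element. Returning a list keeps each piece separate and named. A minimal sketch, where the function name `powers` and the element names are illustrative:

```r
powers = function(x) {
  list(value = x, square = x^2, cube = x^3)
}

res = powers(2:3)
res$square  # 4 9
res$cube    # 8 27
```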

---

## Argument names

When defining a function we are also implicitly defining names for the arguments. When calling the function we can use these names to pass arguments in a different order.

```r
f = function(x,y,z) {
  paste0("x=",x," y=",y," z=",z)
}
```

.pull-left[

```r
f(1,2,3)
```

```
## [1] "x=1 y=2 z=3"
```

```r
f(z=1,x=2,y=3)
```

```
## [1] "x=2 y=3 z=1"
```
]

.pull-right[

```r
f(y=2,1,3)
```

```
## [1] "x=1 y=2 z=3"
```

```r
f(y=2,1,x=3)
```

```
## [1] "x=3 y=2 z=1"
```
]

```r
f(1,2,3,m=1)
```

```
## Error in f(1, 2, 3, m = 1): unused argument (m = 1)
```

---

## Argument defaults

It is also possible to give function arguments default values so that they don't need to be provided every time the function is called.

```r
f = function(x,y=1,z=1) {
  paste0("x=",x," y=",y," z=",z)
}
```

```r
f()
```

```
## Error in paste0("x=", x, " y=", y, " z=", z): argument "x" is missing, with no default
```

```r
f(x=3)
```

```
## [1] "x=3 y=1 z=1"
```

```r
f(y=2,2)
```

```
## [1] "x=2 y=2 z=1"
```

---

## Scope

R has generous scoping rules: if it can't find a variable in the function's body, it will look for it in the next higher scope, and so on.

```r
y = 1
f = function(x) {
  x+y
}
f(3)
```

```
## [1] 4
```

```r
g = function(x) {
  y=2
  x+y
}
g(3)
```

```
## [1] 5
```

---

Additionally, variables defined within a scope only persist for the duration of that scope, and do not overwrite variables at higher scopes (unless you use the global assignment operator `<<-`, *which you shouldn't*).

```r
x = 1
y = 1
z = 1
f = function() {
    y = 2
    g = function() {
      z = 3
      return(x + y + z)
    }
    return(g())
}
f()
```

```
## [1] 6
```

```r
c(x,y,z)
```

```
## [1] 1 1 1
```
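
One consequence of these lookup rules is that a variable from a higher scope is looked up when the function is called, not when it is defined. A small sketch (the names `first` and `second` are illustrative):

```r
y = 1
f = function(x) {
  x + y   # y is not defined here, so it is found in the enclosing scope
}

first = f(3)   # 4
y = 10         # change the outer variable
second = f(3)  # 13, the new value of y is used
```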

---

## Lazy evaluation

Arguments to R functions are lazily evaluated - meaning they are not evaluated until they are used.

```r
f = function(x)
{
  cat("Hello world!\n")
  x
}

f(stop())
```

```
## Hello world!
```

```
## Error in f(stop()):
```
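
The flip side is that an argument which is never used is never evaluated at all. A minimal sketch with a hypothetical two-argument function:

```r
f = function(x, y) {
  x * 2   # y is never touched
}

# stop() would raise an error if evaluated, but y is never used
res = f(1, stop("never evaluated"))
res  # 2
```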

---

## Everything is a function

```r
`+`
```

```
## function (e1, e2)  .Primitive("+")
```

```r
typeof(`+`)
```

```
## [1] "builtin"
```

```r
x = 4:1
`+`(x,2)
```

```
## [1] 6 5 4 3
```
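
Since operators are ordinary functions, they can be called by name with backticks or passed to other functions. A short sketch (variable names are illustrative):

```r
prod34  = `*`(3, 4)               # 12, same as 3 * 4
second  = `[`(c(10, 20, 30), 2)   # 20, same as c(10, 20, 30)[2]
shifted = sapply(1:3, `+`, 10L)   # 11 12 13, adds 10 to each element
total   = Reduce(`+`, 1:4)        # 10, same as 1 + 2 + 3 + 4
```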

---

## Getting Help

Prefixing any function name with a `?` will open the related help file for that function.

```r
?`+`
?sum
```

For functions not in the base package, you can generally see their implementation by entering the function name without parentheses (or using the `body` function).

```r
lm
```

```
## function (formula, data, subset, weights, na.action, method = "qr", 
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, 
##     contrasts = NULL, offset, ...) 
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     mf <- match.call(expand.dots = FALSE)
##     m <- match(c("formula", "data", "subset", "weights", "na.action", 
##         "offset"), names(mf), 0L)
##     mf <- mf[c(1L, m)]
##     mf$drop.unused.levels <- TRUE
##     mf[[1L]] <- quote(stats::model.frame)
##     mf <- eval(mf, parent.frame())
##     if (method == "model.frame") 
##         return(mf)
##     else if (method != "qr") 
##         warning(gettextf("method = '%s' is not supported. Using 'qr'", 
##             method), domain = NA)
##     mt <- attr(mf, "terms")
##     y <- model.response(mf, "numeric")
##     w <- as.vector(model.weights(mf))
##     if (!is.null(w) && !is.numeric(w)) 
##         stop("'weights' must be a numeric vector")
##     offset <- as.vector(model.offset(mf))
##     if (!is.null(offset)) {
##         if (length(offset) != NROW(y)) 
##             stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
##                 length(offset), NROW(y)), domain = NA)
##     }
##     if (is.empty.model(mt)) {
##         x <- NULL
##         z <- list(coefficients = if (is.matrix(y)) matrix(, 0, 
##             3) else numeric(), residuals = y, fitted.values = 0 * 
##             y, weights = w, rank = 0L, df.residual = if (!is.null(w)) sum(w != 
##             0) else if (is.matrix(y)) nrow(y) else length(y))
##         if (!is.null(offset)) {
##             z$fitted.values <- offset
##             z$residuals <- y - offset
##         }
##     }
##     else {
##         x <- model.matrix(mt, mf, contrasts)
##         z <- if (is.null(w)) 
##             lm.fit(x, y, offset = offset, singular.ok = singular.ok, 
##                 ...)
##         else lm.wfit(x, y, w, offset = offset, singular.ok = singular.ok, 
##             ...)
##     }
##     class(z) <- c(if (is.matrix(y)) "mlm", "lm")
##     z$na.action <- attr(mf, "na.action")
##     z$offset <- offset
##     z$contrasts <- attr(x, "contrasts")
##     z$xlevels <- .getXlevels(mt, mf)
##     z$call <- cl
##     z$terms <- mt
##     if (model) 
##         z$model <- mf
##     if (ret.x) 
##         z$x <- x
##     if (ret.y) 
##         z$y <- y
##     if (!qr) 
##         z$qr <- NULL
##     z
## }
## <bytecode: 0x7fd61c21a778>
## <environment: namespace:stats>
```

---

## Less Helpful Examples

```r
list
```

```
## function (...)  .Primitive("list")
```

```r
`[`
```

```
## .Primitive("[")
```

```r
sum
```

```
## function (..., na.rm = FALSE)  .Primitive("sum")
```

```r
`+`
```

```
## function (e1, e2)  .Primitive("+")
```

---

# Acknowledgments

Above materials are derived in part from the following sources:

* Hadley Wickham - [Advanced R](http://adv-r.had.co.nz/)
* [R Language Definition](http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html)