Data structures and dimensionality

Dimensions   Homogeneous              Heterogeneous
1d           Vector (atomic vector)   List (generic vector)
2d           Matrix                   Data Frame
nd           Array

Vectors

Atomic Vectors

R has six basic atomic vector types:

typeof      mode        storage.mode
logical     logical     logical
double      numeric     double
integer     numeric     integer
character   character   character
complex     complex     complex
raw         raw         raw


For now we'll only worry about the first four.
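
As a quick check of the table above, the following sketch applies typeof, mode, and storage.mode to one example value of each of the first four types. The names `examples` and `type_table` are chosen here purely for illustration.

```r
# One example value per type (names are illustrative only)
examples = list(logical = TRUE, double = 1.5, integer = 2L, character = "a")

# Build a character matrix with one row per example value
type_table = t(sapply(examples, function(v) {
  c(typeof = typeof(v), mode = mode(v), storage.mode = storage.mode(v))
}))
type_table
```

Note that double and integer differ in typeof and storage.mode but share the mode numeric.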

Vector types

logical - boolean values TRUE and FALSE

typeof(TRUE)
## [1] "logical"

character - character strings

typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"

double - floating point numerical values (default numerical type)

typeof(1.33)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)

typeof( 7L )
## [1] "integer"
typeof( 1:3 )
## [1] "integer"

Concatenation

Vectors can be constructed using the c() function. Note that vectors will always be flat, even when calls to c() are nested.

c(1,2,3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello"  "World!"
c(1,c(2, c(3)))
## [1] 1 2 3

Testing types

typeof(x) - returns a character vector of the type of object x.


is.logical(x) - returns TRUE if x has type logical.

is.character(x) - returns TRUE if x has type character.

is.double(x) - returns TRUE if x has type double.

is.integer(x) - returns TRUE if x has type integer.


is.numeric(x) - returns TRUE if x has mode numeric.

is.atomic(x) - returns TRUE if x is an atomic vector.

is.vector(x) - returns TRUE if x is any type of vector (e.g. atomic vector or list) and has no attributes other than names.

is.atomic(c(1,2,3))
## [1] TRUE
is.vector(c(1,2,3))
## [1] TRUE
is.atomic(list(1,2,3))
## [1] FALSE
is.vector(list(1,2,3))
## [1] TRUE
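
One detail the examples above do not show: is.vector() returns FALSE once an object carries attributes other than names, even though the underlying data is still an atomic vector. A minimal sketch (the attribute name "note" is made up for illustration):

```r
x = c(a = 1, b = 2)
is.vector(x)              # names alone do not change the answer
## [1] TRUE
attr(x, "note") = "hi"    # hypothetical extra attribute
is.vector(x)
## [1] FALSE
is.atomic(x)              # the data is still an atomic vector
## [1] TRUE
```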

Coercion

R is a dynamically typed language – it will happily convert between the various types without complaint.

c(1,"Hello")
## [1] "1"     "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
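
The results above are consistent with a fixed ordering in which c() coerces to the most general type present: logical, then integer, then double, then character. A short sketch that checks this for a few combinations:

```r
typeof(c(TRUE, 1L))     # logical + integer
## [1] "integer"
typeof(c(1L, 1.5))      # integer + double
## [1] "double"
typeof(c(1.5, "a"))     # double + character
## [1] "character"
typeof(c(TRUE, "a"))    # logical + character
## [1] "character"
```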

Operation coercion

Functions and operators will attempt to coerce objects to an appropriate type.

3.1+1L
## [1] 4.1
log(TRUE)
## [1] 0
TRUE & 7
## [1] TRUE
FALSE | !5
## [1] FALSE
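
A common practical use of this behavior is counting with logical vectors, since TRUE and FALSE are coerced to 1 and 0 in arithmetic. A sketch using a made-up vector `x`:

```r
x = c(3, 8, 1, 9, 4)    # illustrative data
big = x > 3             # logical vector
sum(big)                # number of TRUE values
## [1] 3
mean(big)               # proportion of TRUE values
## [1] 0.6
```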

Explicit Coercion

Most of the is.* functions we just saw have an as.* variant which can be used for coercion.

as.logical(5.2)
## [1] TRUE
as.character(TRUE)
## [1] "TRUE"
as.integer(pi)
## [1] 3

as.numeric(FALSE)
## [1] 0
as.double("7.2")
## [1] 7.2
as.double("one")
## Warning: NAs introduced by coercion
## [1] NA
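
Because a failed conversion produces NA with only a warning, it can be useful to check for it explicitly. The following sketch assumes a made-up character vector `input` and uses is.na to find the entries that could not be converted:

```r
input = c("7.2", "one", "3")                 # illustrative data
vals = suppressWarnings(as.double(input))    # silence the coercion warning
is.na(vals)
## [1] FALSE  TRUE FALSE
input[is.na(vals)]                           # entries that failed to convert
## [1] "one"
```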

Missing Values

R uses NA to represent missing values in its data structures. What may not be obvious is that there are different NAs for the different vector types.

typeof(NA)
## [1] "logical"
typeof(NA+1)
## [1] "double"
typeof(NA+1L)
## [1] "integer"
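
R also provides typed NA constants, which make this explicit. A short sketch:

```r
typeof(NA_integer_)
## [1] "integer"
typeof(NA_real_)
## [1] "double"
typeof(NA_character_)
## [1] "character"
typeof(c(NA, "a"))      # the logical NA is coerced like any other value
## [1] "character"
```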

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity


pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NA
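
These values can be tested for with is.na, is.nan, is.finite, and is.infinite. Note that is.na is TRUE for both NA and NaN, while is.nan is TRUE only for NaN. A sketch:

```r
x = c(1, NA, NaN, Inf, -Inf)
is.na(x)
## [1] FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)
## [1] FALSE FALSE  TRUE FALSE FALSE
is.finite(x)
## [1]  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)
## [1] FALSE FALSE FALSE  TRUE  TRUE
```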

Exercise 1

What is the type of the following vectors? Explain why they have that type.

  • c(1, NA+1L, "C")
  • c(1L / 0, NA)
  • c(1:3, 5)
  • c(3L, NaN+1L)
  • c(NA, TRUE)

Lists

Lists are generic vectors, in that they are 1d and can contain any combination of R objects.

list("A", 1:4, c(TRUE,FALSE), (1:4)/2)
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0

str( list("A", 1:4, c(TRUE,FALSE), (1:4)/2) )
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2

Recursive lists

Lists can even contain other lists, meaning they don't have to be flat.

str( list(1, list(2, list(3))) )
## List of 2
##  $ : num 1
##  $ :List of 2
##   ..$ : num 2
##   ..$ :List of 1
##   .. ..$ : num 3

List Coercion

By default, when a vector is combined with a list, the vector will be coerced to a list (as a list is more generic).

str( c(1:3, list(4,5,list(6,7))) )
## List of 6
##  $ : int 1
##  $ : int 2
##  $ : int 3
##  $ : num 4
##  $ : num 5
##  $ :List of 2
##   ..$ : num 6
##   ..$ : num 7

We can force a list back to a vector as well:

unlist( list(1, list(2, list(3, "Hello"))) )
## [1] "1"     "2"     "3"     "Hello"
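
A related point is the difference between wrapping a vector in a list and coercing it to a list. The sketch below compares list() and as.list() on the same input:

```r
length(list(1:3))       # one element: the whole vector
## [1] 1
length(as.list(1:3))    # three elements: one per value
## [1] 3
```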

Named lists

Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

str(list(A = 1, B = list(C = 2, D = 3)))
## List of 2
##  $ A: num 1
##  $ B:List of 2
##   ..$ C: num 2
##   ..$ D: num 3
list("knock knock" = "who's there?")
## $`knock knock`
## [1] "who's there?"
names(list(ABC=1, DEF=list(H=2, I=3)))
## [1] "ABC" "DEF"
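
Names allow elements to be accessed with $ or [[ ]]. A sketch using a list like the first one above (the variable name `l` is chosen for illustration):

```r
l = list(A = 1, B = list(C = 2, D = 3))
l$A
## [1] 1
l[["B"]][["D"]]         # [[ ]] can be chained through nested lists
## [1] 3
l$B$C                   # so can $
## [1] 2
```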

Exercise 2

Represent the following JSON data as a list in R.

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": 
  {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": 10021
  },
  "phoneNumber": 
  [
    {
      "type": "home",
      "number": "212 555-1239"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ]
}

Attributes

Attributes are arbitrary metadata that can be attached to objects in R. Some are special (e.g. class, comment, dim, dimnames, names, etc.) and change the way in which an object is treated by R.

Attributes are stored as a named list attached to the object. They can be accessed (get and set) individually via attr and collectively via attributes.

(x = c(L=1,M=2,N=3))
## L M N 
## 1 2 3
attr(x,"names") = c("A","B","C")
attr(x,"message") = "Hello!"
x
## A B C 
## 1 2 3 
## attr(,"message")
## [1] "Hello!"

str(x)
##  atomic [1:3] 1 2 3
##  - attr(*, "message")= chr "Hello!"
attributes(x)
## $names
## [1] "A" "B" "C"
## 
## $message
## [1] "Hello!"
str(attributes(x))
## List of 2
##  $ names  : chr [1:3] "A" "B" "C"
##  $ message: chr "Hello!"
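
Attributes can also be attached when an object is created, using structure(). The sketch below builds an object like x above in one call (the name `y` is illustrative) and shows that subsetting keeps names but drops the custom attribute:

```r
y = structure(c(A = 1, B = 2, C = 3), message = "Hello!")
attr(y, "message")
## [1] "Hello!"
names(attributes(y[1:2]))    # subsetting keeps names, drops "message"
## [1] "names"
```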

Factors

Factor objects are how R stores data for categorical variables (a fixed number of discrete values).

(x = factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
str(x)
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
typeof(x)
## [1] "integer"

A factor is just an integer vector with two attributes: class and levels.

attributes(x)
## $levels
## [1] "BS"  "MS"  "PhD"
## 
## $class
## [1] "factor"
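
The integer codes and the levels can be inspected separately, and indexing the levels by the codes recovers the original labels. A sketch using the same factor as above:

```r
x = factor(c("BS", "MS", "PhD", "MS"))
as.integer(x)                               # underlying integer codes
## [1] 1 2 3 2
levels(x)                                   # labels the codes refer to
identical(levels(x)[x], as.character(x))    # codes index into the levels
## [1] TRUE
```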

Exercise 3

Construct a factor variable (without using factor, as.factor, or related functions) that contains the weather forecast for the next 7 days.

  • There should be 4 levels - sunny, cloudy, rain, snow.

  • Find the weekly forecast from Weather Underground.

  • Start with an integer vector and add the appropriate attributes.

  • What would you need to do if I decided that I'd prefer to have only three levels: sunny/cloudy, rain, snow?

Data Frames

The data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors can be used as well). Each vector is treated as a column, and the elements of the vectors as rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

attributes(df)
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"
df2 = list(x = 1:3, y = factor(c("a", "b", "c")))
attr(df2,"class") = "data.frame"
attr(df2,"row.names") = 1:3
str(df2)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
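
Because a data frame is a list, list operations apply to it directly. A short sketch:

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(df)
## [1] TRUE
typeof(df)
## [1] "list"
length(df)              # number of columns, not rows
## [1] 2
df[["x"]]               # a column is extracted like a list element
## [1] 1 2 3
```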

Strings (Characters) vs Factors

In versions of R before 4.0.0, data.frame() converts character vectors into factors by default, which is the behavior shown in the output above; since R 4.0.0 the default is to leave them as character vectors. Sometimes the conversion is useful, sometimes it isn't. Either way it is important to know what type/class you are working with. This behavior can be controlled using the stringsAsFactors argument.

df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
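
Passing stringsAsFactors explicitly makes the result independent of the default in the installed version of R. A sketch showing both settings (the names `df_chr` and `df_fac` are illustrative):

```r
df_chr = data.frame(y = c("a", "b"), stringsAsFactors = FALSE)
df_fac = data.frame(y = c("a", "b"), stringsAsFactors = TRUE)
class(df_chr$y)
## [1] "character"
class(df_fac$y)
## [1] "factor"
```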

Length Coercion

If a vector is shorter than expected, R will increase its length by repeating (recycling) its elements. If the longer length is a multiple of the shorter one, this happens without any feedback; if not, there will be either an error or a warning.

df = data.frame(x = 1:3, y = c("a"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 1 level "a": 1 1 1
c(1:3) + c(TRUE,FALSE)
## Warning: longer object length is not a multiple of shorter object length
## [1] 2 2 4
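
The silent case and the error case can be sketched as follows. The tryCatch wrapper and the name `res` are only there to capture the error for illustration:

```r
1:6 + 1:2               # 6 is a multiple of 2: silent recycling
## [1] 2 4 4 6 6 8

# data.frame() is stricter: lengths 3 and 2 cannot be recycled
res = tryCatch(data.frame(x = 1:3, y = 1:2),
               error = function(e) "error")
res
## [1] "error"
```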

Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df = data.frame(x = 1:3, y = c("a","b","c"))
str(cbind(df, z=TRUE))
## 'data.frame':    3 obs. of  3 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ z: logi  TRUE TRUE TRUE
str(rbind(df, c(TRUE,FALSE)))
## Warning: invalid factor level, NA generated
## 'data.frame':    4 obs. of  2 variables:
##  $ x: int  1 2 3 1
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 NA

df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
str(cbind(df1,df2))
## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE
str(rbind(cbind(df1,df2),c(1,"a",1,1)))
## 'data.frame':    4 obs. of  4 variables:
##  $ x: chr  "1" "2" "3" "1"
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 1
##  $ m: chr  "3" "2" "1" "1"
##  $ n: chr  "TRUE" "TRUE" "FALSE" "1"
str(rbind(cbind(df1,df2),list(1,"a",1,1)))
## 'data.frame':    4 obs. of  4 variables:
##  $ x: num  1 2 3 1
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 1
##  $ m: num  3 2 1 1
##  $ n: num  1 1 0 1
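
One way to avoid the type changes seen above is to rbind a one-row data frame whose columns already have the intended types. A sketch, with stringsAsFactors = FALSE set explicitly so the result does not depend on the R version (the name `new_row` is illustrative):

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
new_row = data.frame(x = 4L, y = "d", stringsAsFactors = FALSE)
str(rbind(df, new_row))
## 'data.frame':    4 obs. of  2 variables:
##  $ x: int  1 2 3 4
##  $ y: chr  "a" "b" "c" "d"
```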

Acknowledgments
