Data Frames

A data frame is one of the most commonly used data structures in R. It is just a list of equal-length vectors (usually atomic, but generic vectors can be used as well). Each vector is treated as a column, and the elements of the vectors form the rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

typeof(df)
## [1] "list"
attributes(df)
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"
df2 = list(x = 1:3, y = factor(c("a", "b", "c")))
attr(df2,"class") = "data.frame"
attr(df2,"row.names") = 1:3
str(df2)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
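
The remark that columns are usually atomic but can be generic vectors may be easier to see with a list column. This is only a sketch: the names df_list and vals are made up for illustration, and I() is used here so that data.frame() keeps the list as a single column rather than expanding it.

```r
# Hypothetical example: a data frame whose second column is a list.
# I() marks the list "as is" so data.frame() keeps it as one column.
df_list = data.frame(id = 1:3, vals = I(list(1:2, "a", c(TRUE, FALSE, NA))))

typeof(df_list$vals)          # "list"
sapply(df_list$vals, length)  # 2 1 3
dim(df_list)                  # 3 2
```

Each element of vals can have its own type and length, but the list itself still has one entry per row.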

Strings (Characters) vs Factors

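In versions of R before 4.0.0, data.frame() converts character vectors into factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay as characters. The output shown in these notes reflects the older default.

Sometimes this conversion is useful, but usually it isn't. Either way, it is important to know what type/class you are working with. This behavior can be controlled with the stringsAsFactors argument.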

df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
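
One way to follow the "know what type/class you are working with" advice is to check the class of every column and convert where needed. The sketch below sets stringsAsFactors explicitly in both calls so that it behaves the same regardless of R version; the names df_f and df_c are just illustrative.

```r
# Set stringsAsFactors explicitly so the result does not depend on the R version.
df_f = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
df_c = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Check the class of every column before working with it.
sapply(df_f, class)  # x: "integer", y: "factor"
sapply(df_c, class)  # x: "integer", y: "character"

# Convert an existing factor column back to character.
df_f$y = as.character(df_f$y)
identical(df_f$y, df_c$y)  # TRUE
```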

Some general advice …



Length Coercion

As we have seen before, if a vector is shorter than expected, R will increase the length by repeating elements of the short vector. If the lengths are evenly divisible this will occur without any output / warning.

For data frames if the lengths are not evenly divisible then there will be an error.

data.frame(x = 1:3, y = c("a"))
##   x y
## 1 1 a
## 2 2 a
## 3 3 a
data.frame(x = 1:3, y = c("a","b"))
## Error in data.frame(x = 1:3, y = c("a", "b")): arguments imply differing number of rows: 3, 2
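
The two cases can be compared side by side. In this sketch the second call is wrapped in try() purely so the script keeps running after the error; df_ok and res are illustrative names.

```r
# Length 2 divides 4 evenly, so y is recycled silently.
df_ok = data.frame(x = 1:4, y = c("a", "b"), stringsAsFactors = FALSE)
df_ok$y  # "a" "b" "a" "b"

# Length 2 does not divide 3, so data.frame() signals an error.
# try() captures the error so the script keeps running.
res = try(data.frame(x = 1:3, y = c("a", "b")), silent = TRUE)
inherits(res, "try-error")  # TRUE
```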

Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df = data.frame(x = 1:3, y = c("a","b","c"))
str(rbind(df, c(TRUE,FALSE)))
## Warning in `[<-.factor`(`*tmp*`, ri, value = FALSE): invalid factor level,
## NA generated
## 'data.frame':    4 obs. of  2 variables:
##  $ x: int  1 2 3 1
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 NA
str(cbind(df, z=TRUE))
## 'data.frame':    3 obs. of  3 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ z: logi  TRUE TRUE TRUE

df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
str(cbind(df1,df2))
## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE
# Bad
str(rbind(cbind(df1,df2),c(1,"a",1,1)))
## 'data.frame':    4 obs. of  4 variables:
##  $ x: chr  "1" "2" "3" "1"
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 1
##  $ m: chr  "3" "2" "1" "1"
##  $ n: chr  "TRUE" "TRUE" "FALSE" "1"
# Good
str(rbind(cbind(df1,df2),list(1,"a",1,1)))
## 'data.frame':    4 obs. of  4 variables:
##  $ x: num  1 2 3 1
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 1
##  $ m: num  3 2 1 1
##  $ n: num  1 1 0 1
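
Another option, sketched here as an alternative to the list approach, is to build the new row as a one-row data frame so that each value already has the intended type. The names new_row and res are illustrative, and stringsAsFactors is set explicitly to avoid version differences.

```r
df1 = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# A one-row data frame carries a type for each column, so nothing is coerced.
new_row = data.frame(x = 4L, y = "d", stringsAsFactors = FALSE)
res = rbind(df1, new_row)

sapply(res, class)  # x: "integer", y: "character"
nrow(res)           # 4
```

Because the column names must match, this also guards against values landing in the wrong column.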

Exercise 1

Construct a data frame that contains the following data (in as efficient a manner as possible). Hint - the rep function should prove useful.

  Patient    Gender          Treatment 1     Treatment 2     Treatment 3
---------- --------------- --------------- --------------- ---------------
  1          Male            Yes             Yes             Yes
  2          Male            Yes             Yes             No 
  3          Male            Yes             No              Yes
  4          Male            Yes             No              No
  5          Male            No              Yes             Yes
  6          Male            No              Yes             No
  7          Male            No              No              Yes
  8          Male            No              No              No
  9          Female          Yes             Yes             Yes 
  10         Female          Yes             Yes             No
  11         Female          Yes             No              Yes
  12         Female          Yes             No              No
  13         Female          No              Yes             Yes
  14         Female          No              Yes             No
  15         Female          No              No              Yes
  16         Female          No              No              No
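
As a reminder of how rep behaves (this is not a solution to the exercise, just a sketch of the two arguments that matter most here), times repeats the whole vector while each repeats every element in place:

```r
rep(c("a", "b"), times = 2)            # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)             # "a" "a" "b" "b"
rep(c("a", "b"), each = 2, times = 2)  # "a" "a" "b" "b" "a" "a" "b" "b"
```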

Matrices

A matrix is a two-dimensional equivalent of an atomic vector, in that all entries must be of the same type.

(m = matrix(c(1,2,3,4), ncol=2, nrow=2))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
str(m)
##  num [1:2, 1:2] 1 2 3 4
attributes(m)
## $dim
## [1] 2 2

Column major ordering

A matrix is therefore just an atomic vector with a dim attribute where the data is stored in column major order (fill the first column starting at row one, then the next column and so on).

Data in a matrix is always stored in this format, but we can fill by rows using the byrow argument.

(cm = matrix(c(1,2,3,4), 
             ncol=2, nrow=2))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
c(cm)
## [1] 1 2 3 4
(rm = matrix(c(1,2,3,4), 
             ncol=2, nrow=2, byrow=TRUE))
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
c(rm)
## [1] 1 3 2 4
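
Column major storage also means a matrix can be indexed with a single position into the underlying vector. The sketch below works out that position by hand for one element; the variable names i, j and k are illustrative.

```r
m = matrix(1:6, nrow = 2, ncol = 3)

# Element (i, j) sits at position (j - 1) * nrow + i of the underlying vector.
i = 2
j = 3
k = (j - 1) * nrow(m) + i  # 6

m[k] == m[i, j]  # TRUE

# Transposing before flattening reads the values row by row.
c(t(m))  # 1 3 5 2 4 6
```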

Naming dimensions

x = array(1:8,c(2,2,2))
rownames(x) = LETTERS[1:2]
colnames(x) = LETTERS[3:4]
dimnames(x)[[3]] = LETTERS[5:6]
x
## , , E
## 
##   C D
## A 1 3
## B 2 4
## 
## , , F
## 
##   C D
## A 5 7
## B 6 8

str(x)
##  int [1:2, 1:2, 1:2] 1 2 3 4 5 6 7 8
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "A" "B"
##   ..$ : chr [1:2] "C" "D"
##   ..$ : chr [1:2] "E" "F"
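
The same names can be supplied in one step through the dimnames argument of array, and then used for subsetting. A minimal sketch:

```r
x = array(1:8, c(2, 2, 2),
          dimnames = list(c("A", "B"), c("C", "D"), c("E", "F")))

x["A", "D", "F"]  # 7
x[, , "E"]        # the 2 x 2 slice labelled E
```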

Subsetting

Subsetting in General

R has several different subsetting operators ([, [[, and $).

The behavior of these operators will depend on the object they are being used with.

In general there are 6 different data types that can be used to subset:

  • Positive integers

  • Negative integers

  • Logical values

  • Empty / NULL

  • Zero

  • Character values (names)

Subsetting Vectors

Positive Integer subsetting

Returns elements at the given location(s) (note R uses a 1-based not a 0-based indexing scheme).

x = c(1,4,7)
x[c(1,3)]
## [1] 1 7
x[c(1,1)]
## [1] 1 1
x[c(1.9,2.1)]
## [1] 1 4




y = list(1,4,7)
str( y[c(1,3)] )
## List of 2
##  $ : num 1
##  $ : num 7
str( y[c(1,1)] )
## List of 2
##  $ : num 1
##  $ : num 1
str( y[c(1.9,2.1)] )
## List of 2
##  $ : num 1
##  $ : num 4

Negative Integer subsetting

Excludes elements at the given location(s).

x = c(1,4,7)
x[-1]
## [1] 4 7
x[-c(1,3)]
## [1] 4
x[c(-1,-1)]
## [1] 4 7
y = list(1,4,7)
str( y[-1] )
## List of 2
##  $ : num 4
##  $ : num 7
str( y[-c(1,3)] )
## List of 1
##  $ : num 4


x[c(-1,2)]
## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts

Logical Value Subsetting

Returns the elements that correspond to TRUE in the logical vector. The logical vector is expected to have the same length as the vector being subset; a shorter logical vector is recycled, as in x[c(TRUE,FALSE)] below.

x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
## [1]  1  4 12
x[c(TRUE,FALSE)]
## [1] 1 7
x[x %% 2 == 0]
## [1]  4 12




y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 12
str( y[c(TRUE,FALSE)] )
## List of 2
##  $ : num 1
##  $ : num 7
str( y[y %% 2 == 0] )
## Error in y%%2: non-numeric argument to binary operator
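
The error arises because %% is not defined for lists. One possible workaround, sketched here with sapply and an illustrative helper name is_even, is to apply the test to each element so that the result is an ordinary logical vector:

```r
y = list(1, 4, 7, 12)

# Apply the test to each element; sapply() returns a logical vector.
is_even = sapply(y, function(v) v %% 2 == 0)
is_even          # FALSE TRUE FALSE TRUE

str(y[is_even])  # a list holding 4 and 12
```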

Empty Subsetting

Returns the original vector.

x = c(1,4,7)
x[]
## [1] 1 4 7
y = list(1,4,7)
str(y[])
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 7

Zero subsetting

Returns an empty vector of the same type as the vector being subset.

x = c(1,4,7)
x[0]
## numeric(0)
y = list(1,4,7)
str(y[0])
##  list()
x[c(0,1)]
## [1] 1
y[c(0,1)]
## [[1]]
## [1] 1

Character subsetting

If the vector has names, select elements whose names correspond to the character vector.

x = c(a=1,b=4,c=7)
x["a"]
## a 
## 1
x[c("a","a")]
## a a 
## 1 1
x[c("b","c")]
## b c 
## 4 7


y = list(a=1,b=4,c=7)
str(y["a"])
## List of 1
##  $ a: num 1
str(y[c("a","a")])
## List of 2
##  $ a: num 1
##  $ a: num 1
str(y[c("b","c")])
## List of 2
##  $ b: num 4
##  $ c: num 7

Out of bound subsetting

x = c(1,4,7)
x[4]
## [1] NA
x["a"]
## [1] NA
x[c(1,4)]
## [1]  1 NA



y = list(1,4,7)
str(y[4])
## List of 1
##  $ : NULL
str(y["a"])
## List of 1
##  $ : NULL
str(y[c(1,4)])
## List of 2
##  $ : num 1
##  $ : NULL

Missing and NULL subsetting

x = c(1,4,7)
x[NA]
## [1] NA NA NA
x[NULL]
## numeric(0)
x[c(1,NA)]
## [1]  1 NA




y = list(1,4,7)
str(y[NA])
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : NULL
str(y[NULL])
##  list()
str(y[c(1,NA)])
## List of 2
##  $ : num 1
##  $ : NULL
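
The three NAs returned by x[NA] may look surprising. A plain NA is a logical value, so it is treated as a logical index and recycled to the length of x. The sketch below contrasts it with an integer NA, which asks for a single unknown position:

```r
x = c(1, 4, 7)

typeof(NA)      # "logical"
x[NA]           # NA NA NA  (logical index, recycled to length 3)
x[NA_integer_]  # NA        (integer index, a single unknown position)
```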

Atomic vectors - [ vs. [[

[[ subsets like [ except it can only subset a single value.

x = c(a=1,b=4,c=7)
x[[1]]
## [1] 1
x[["a"]]
## [1] 1
x[[1:2]]
## Error in x[[1:2]]: attempt to select more than one element in vectorIndex

Generic Vectors - [ vs. [[

[[ subsets a single value and returns that value itself, not a list containing that value.

y = list(a=1,b=4,c=7)
y[2]
## $b
## [1] 4
y[[2]]
## [1] 4
y[["b"]]
## [1] 4
y[[1:2]]
## Error in y[[1:2]]: subscript out of bounds
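
The error for y[[1:2]] differs from the atomic case because, for lists, a vector inside [[ is applied recursively: y[[1:2]] means y[[1]][[2]], and y[[1]] has only one element. A sketch with a nested list (z is a made-up example) shows the recursive form working:

```r
z = list(a = list(b = 1, c = 2), d = 3)

z[[c(1, 2)]]      # 2, the same as z[[1]][[2]]
z[[c("a", "c")]]  # 2, the same as z[["a"]][["c"]]
```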

Hadley's Analogy

Vectors - [[ vs. $

$ is equivalent to [[ but only works for named lists. By default it uses partial matching (exact=FALSE).

x = c("abc"=1, "def"=5)
x$abc
## Error in x$abc: $ operator is invalid for atomic vectors
y = list("abc"=1, "def"=5)
y[["abc"]]
## [1] 1
y$abc
## [1] 1
y$d
## [1] 5
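
The difference in matching can be seen by calling [[ with and without exact = FALSE. The second list, w, is a made-up example showing that a prefix matching more than one name returns NULL rather than picking one:

```r
y = list(abc = 1, def = 5)

y[["d"]]                 # NULL, [[ matches names exactly by default
y[["d", exact = FALSE]]  # 5, partial matching as with y$d

# An ambiguous prefix matches nothing.
w = list(abc = 1, abd = 2)
w$ab                     # NULL
```

Since partial matching can silently return NULL or the wrong element, spelling out full names is usually the safer habit.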

A common gotcha

Why does the following code not work?

x = list(abc = 1:10, def = 10:1)
y = "abc"

x$y
## NULL

\[ \texttt{x\$y} \Leftrightarrow \texttt{x[["y"]]} \ne \texttt{x[[y]]} \]

x[[y]]
##  [1]  1  2  3  4  5  6  7  8  9 10

Exercise 2

Below are 100 values,

x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)

write down how you would create a subset to accomplish each of the following:

  • Select every third value starting at position 2 in x.

  • Remove all values with an odd index (e.g. 1, 3, etc.)

  • Select only the values that are primes. (You may assume all values are less than 100)

  • Remove every 4th value, but only if it is odd.

Subsetting Matrices, Data Frames, and Arrays

Subsetting Matrices

(x = matrix(1:6, nrow=2, ncol=3, dimnames=list(c("A","B"),c("M","N","O"))))
##   M N O
## A 1 3 5
## B 2 4 6
x[1,3]
## [1] 5
x[1:2, 1:2]
##   M N
## A 1 3
## B 2 4


x[, 1:2]
##   M N
## A 1 3
## B 2 4
x[-1,-3]
## M N 
## 2 4

x["A","M"]
## [1] 1
x["A", c("M","O")]
## M O 
## 1 5
x[, "C"]
## Error in x[, "C"]: subscript out of bounds
x[1,"M"]
## [1] 1
x["B",]
## M N O 
## 2 4 6
x["B"]
## [1] NA
x[-1]
## [1] 2 3 4 5 6


Preserving Subsetting

By default R's [ subset operator is a preserving subset operator, in that the returned object will have the same type as the parent. Confusingly, when used with a matrix or array [ becomes a simplifying operator (does not preserve type) - this behavior can be controlled by the drop argument.

x = matrix(1:6, nrow=2, ncol=3, dimnames=list(c("A","B"),c("M","N","O")))

x[1, ]
## M N O 
## 1 3 5
x[1, , drop=TRUE]
## M N O 
## 1 3 5
x[1, , drop=FALSE]
##   M N O
## A 1 3 5


str(x[1, ])
##  Named int [1:3] 1 3 5
##  - attr(*, "names")= chr [1:3] "M" "N" "O"
str(x[1, , drop=TRUE])
##  Named int [1:3] 1 3 5
##  - attr(*, "names")= chr [1:3] "M" "N" "O"
str(x[1, , drop=FALSE])
##  int [1, 1:3] 1 3 5
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr "A"
##   ..$ : chr [1:3] "M" "N" "O"
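
The simplifying default matters most when the number of selected rows or columns is not known in advance. In this sketch, cols stands in for a selection that happens to have length one:

```r
x = matrix(1:6, nrow = 2, ncol = 3,
           dimnames = list(c("A", "B"), c("M", "N", "O")))

cols = "M"  # hypothetical selection that happens to have length one

dim(x[, cols])                # NULL, the result is a plain vector
dim(x[, cols, drop = FALSE])  # 2 1, still a matrix
```

Code that later calls nrow or ncol on the result will only behave consistently with drop = FALSE.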

Preserving vs Simplifying Subsets

               Simplifying                Preserving
-------------- -------------------------- ----------------------------------------------
  Vector         x[[1]]                     x[1]
  List           x[[1]]                     x[1]
  Array          x[1, ] or x[, 1]           x[1, , drop = FALSE] or x[, 1, drop = FALSE]
  Factor         x[1:4, drop = TRUE]        x[1:4]
  Data frame     x[, 1] or x[[1]]           x[, 1, drop = FALSE] or x[1]

Back to Hadley's Analogy

Factor Subsetting

(x = factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
x[1:2]
## [1] BS MS
## Levels: BS MS PhD
x[1:2, drop=TRUE]
## [1] BS MS
## Levels: BS MS
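
Unused levels can also be removed after the fact with droplevels, which should give the same levels as subsetting with drop=TRUE. A short sketch (x_sub is an illustrative name):

```r
x = factor(c("BS", "MS", "PhD", "MS"))

x_sub = x[1:2]
levels(x_sub)              # "BS" "MS" "PhD"
levels(droplevels(x_sub))  # "BS" "MS"
```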

Data Frame Subsetting

If provided with a single index, a data frame assumes you want to subset a column or columns, as with a list. If provided with multiple indexes, the data frame is treated like a matrix.

df = data.frame(a = 1:2, b = 3:4)
df[1]
##   a
## 1 1
## 2 2
df[[1]]
## [1] 1 2
df[, "a"]
## [1] 1 2

df["a"]
##   a
## 1 1
## 2 2
df[, "a", drop = FALSE]
##   a
## 1 1
## 2 2
df[1,]
##   a b
## 1 1 3
df[c("a","b","a")]
##   a b a.1
## 1 1 3   1
## 2 2 4   2
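
Matrix-style subsetting is also how rows are usually filtered. The sketch below combines a logical condition on one column with the drop argument discussed earlier:

```r
df = data.frame(a = 1:2, b = 3:4)

df[df$a > 1, ]                   # one-row data frame: a = 2, b = 4
df[df$a > 1, "b"]                # 4, simplified to a vector
df[df$a > 1, "b", drop = FALSE]  # one-row, one-column data frame
```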

Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

x = c(1, 4, 7)
x[2] = 2
x
## [1] 1 2 7
x[x %% 2 != 0] = x[x %% 2 != 0] + 1
x
## [1] 2 2 8
x[c(1,1)] = c(2,3)
x
## [1] 3 2 8

x = 1:6
x[c(2,NA)] = 1
x
## [1] 1 1 3 4 5 6
x[c(TRUE,NA)] = 1
x
## [1] 1 1 1 4 1 6
x[c(-1,-3)] = 3
x
## [1] 1 3 1 3 3 3
x[] = 6:1
x
## [1] 6 5 4 3 2 1
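
Assigning through an empty subset, as in x[] = 6:1, replaces the contents of the object while keeping its structure, which is different from assigning to the name itself. A sketch using two copies of the same matrix (m1 and m2 are illustrative names):

```r
m1 = matrix(1:4, nrow = 2)
m2 = matrix(1:4, nrow = 2)

m1[] = 0  # replace every element, keep the dim attribute
m2 = 0    # rebind the name to a new length-one vector

dim(m1)   # 2 2
dim(m2)   # NULL
```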

Deleting list (df) elements

df = data.frame(a = 1:2, b = TRUE, c = c("A", "B"))
df[["b"]] = NULL
str(df)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: int  1 2
##  $ c: Factor w/ 2 levels "A","B": 1 2
df[,"c"] = NULL
str(df)
## 'data.frame':    2 obs. of  1 variable:
##  $ a: int  1 2

Subsets of Subsets

df = data.frame(a = c(5,1,NA,3))
df$a[df$a == 5] = 0
df[["a"]][df[["a"]] == 1] = 0
df[1][df[1] == 3] = 0
df
##    a
## 1  0
## 2  0
## 3 NA
## 4  0
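
The NA in row 3 survives because NA == 5 is itself NA, so that element is never selected by a comparison. If missing values should be replaced as well, they can be selected with is.na, as in this sketch:

```r
df = data.frame(a = c(5, 1, NA, 3))

df$a == 5              # TRUE FALSE NA FALSE
df$a[is.na(df$a)] = 0  # select the missing values explicitly
df$a                   # 5 1 0 3
```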

Exercise 3

grades = data.frame(
            student = c("Alice","Bob","Carol","Dan","Eve","Frank",
                        "Mallory","Oscar","Peggy","Sam","Wendy"),
            grade   = c(82, 78, 62, 98, 64, 53, 86, 73, 54, 57, 61),
            year    = c(3L, 2L, 2L, 1L, 3L, 3L, 4L, 3L, 2L, 2L, 1L),
            stringsAsFactors = FALSE
         )

For the above data frame use subsetting and subsetting assignment to add two new features (columns) to the data set:

  • the student's letter grade (factor vector with labels A - F)
    • A (90-100), B (80-89), C (70-79), D (60-69), F (0-59)
  • the student's passing status in the class (logical vector)
    • TRUE for a grade of A, B, or C
    • FALSE for a grade of D or F

These changes should not be hard coded - if I gave you a new data frame your code should still produce the correct answer.

Acknowledgments
