Subsetting in General

R has several different subsetting operators ([, [[, and $).

The behavior of these operators depends on the type of object they are applied to.

In general there are six kinds of values that can be used as a subsetting index:

  • Positive integers

  • Negative integers

  • Logical values

  • Empty

  • Zero

  • Character values (names)

Subsetting Vectors

Atomic Vectors - Positive Ints

Returns the elements at the given positions (note that R uses 1-based, not 0-based, indexing).

x = c(1,4,7)
x[c(1,3)]
## [1] 1 7
x[c(1,1)]
## [1] 1 1
x[c(1.9,2.1)]
## [1] 1 4
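
The last call above suggests that fractional indices are truncated toward zero rather than rounded. A minimal sketch of that behavior, together with which(), a base R function (not used elsewhere in these slides) that turns a logical condition into positive integer positions:

```r
x = c(1, 4, 7)

# Fractional indices are truncated toward zero, not rounded
idx = c(1.9, 2.1)
x[idx]
## [1] 1 4
x[trunc(idx)]
## [1] 1 4

# which() converts a logical condition into positive integer positions
which(x > 3)
## [1] 2 3
x[which(x > 3)]
## [1] 4 7
```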

Generic Vectors - Positive Ints

y = list(1,4,7)
str( y[c(1,3)] )
## List of 2
##  $ : num 1
##  $ : num 7
str( y[c(1,1)] )
## List of 2
##  $ : num 1
##  $ : num 1
str( y[c(1.9,2.1)] )
## List of 2
##  $ : num 1
##  $ : num 4

Atomic Vectors - Negative Ints

Excludes the elements at the given positions. Negative and positive indices cannot be mixed in the same subset.

x = c(1,4,7)
x[-1]
## [1] 4 7
x[-c(1,3)]
## [1] 4
x[c(-1,2)]
## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts

Generic Vectors - Negative Ints

y = list(1,4,7)
str( y[-1] )
## List of 2
##  $ : num 4
##  $ : num 7
str( y[-c(1,3)] )
## List of 1
##  $ : num 4
y[c(-1,2)]
## Error in y[c(-1, 2)]: only 0's may be mixed with negative subscripts

Vectors - Logical Values

Returns the elements that correspond to TRUE in the logical vector. The logical vector is expected to have the same length as the vector being subset; if it is shorter, it is recycled.

x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
## [1]  1  4 12
x[c(TRUE,FALSE)]
## [1] 1 7
x[x %% 2 == 0]
## [1]  4 12




y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 12
str( y[c(TRUE,FALSE)] )
## List of 2
##  $ : num 1
##  $ : num 7
str( y[y %% 2 == 0] )
## Error in y%%2: non-numeric argument to binary operator
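
The error above arises because arithmetic operators are not defined for lists. One possible workaround, sketched here with vapply() (the helper name is_even is only illustrative), is to evaluate the condition element by element and subset with the resulting logical vector:

```r
y = list(1, 4, 7, 12)

# Compute the condition for each element; vapply() returns a plain
# logical vector with one value per list element
is_even = vapply(y, function(v) v %% 2 == 0, logical(1))
is_even
## [1] FALSE  TRUE FALSE  TRUE

str( y[is_even] )
## List of 2
##  $ : num 4
##  $ : num 12
```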

Vectors - Empty

Returns the original vector.

x = c(1,4,7)
x[]
## [1] 1 4 7
y = list(1,4,7)
str(y[])
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 7

Vectors - Zero

Returns an empty vector of the same type as the vector being subset.

x = c(1,4,7)
x[0]
## numeric(0)
y = list(1,4,7)
str(y[0])
##  list()

Vectors - Character Values

If the vector has names, select elements whose names correspond to the character vector.

x = c(a=1,b=4,c=7)
x["a"]
## a 
## 1
x[c("b","c")]
## b c 
## 4 7


y = list(a=1,b=4,c=7)
str(y["a"])
## List of 1
##  $ a: num 1
str(y[c("b","c")])
## List of 2
##  $ b: num 4
##  $ c: num 7

Vectors - Out-of-bounds subsetting

x = c(1,4,7)
x[4]
## [1] NA
x["a"]
## [1] NA
x[c(1,4)]
## [1]  1 NA



y = list(1,4,7)
str(y[4])
## List of 1
##  $ : NULL
str(y["a"])
## List of 1
##  $ : NULL
str(y[c(1,4)])
## List of 2
##  $ : num 1
##  $ : NULL

Vectors - Missing and NULL

x = c(1,4,7)
x[NA]
## [1] NA NA NA
x[NULL]
## numeric(0)
x[c(1,NA)]
## [1]  1 NA



y = list(1,4,7)
str(y[NA])
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : NULL
str(y[NULL])
##  list()
str(y[c(1,NA)])
## List of 2
##  $ : num 1
##  $ : NULL
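
A likely explanation for x[NA] returning three values: a bare NA is a logical value, so it is recycled like any other logical index. The sketch below contrasts it with NA_integer_, which is treated as a single unknown position:

```r
x = c(1, 4, 7)

# A bare NA is logical, so it is recycled to the length of x
typeof(NA)
## [1] "logical"
x[NA]
## [1] NA NA NA

# An integer NA stands for one unknown position
x[NA_integer_]
## [1] NA
```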

Vectors - [ vs. [[

[[ subsets like [, except that it can only select a single element. Note that for lists the returned value is the element itself, which may not be a list (more on this later).

x = c(1,4,7)
x[[1]]
## [1] 1
y = list(1,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
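
A short sketch of the practical difference for lists: [ keeps the list wrapper while [[ extracts the element itself. The tryCatch() call is only there so the example runs to completion; the variable res is illustrative.

```r
y = list(1, 4, 7)

# [ keeps the list wrapper, [[ extracts the element itself
typeof(y[2])
## [1] "list"
typeof(y[[2]])
## [1] "double"

# Arithmetic therefore needs [[
y[[2]] + 1
## [1] 5

# [[ with an out-of-bounds index is an error rather than NULL or NA
res = tryCatch(y[[4]], error = function(e) "error")
res
## [1] "error"
```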

Hadley's Analogy

Vectors - [[ vs. $

$ is equivalent to [[ for character subsetting of lists, except that it uses partial matching by default (as if [[ were called with exact=FALSE).

x = c("abc"=1, "def"=5)
x$abc
## Error in x$abc: $ operator is invalid for atomic vectors
y = list("abc"=1, "def"=5)
y$abc
## [1] 1
y$d
## [1] 5
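
For comparison, a small sketch of how [[ behaves with the same list: names are matched exactly unless exact=FALSE is given, and the index is evaluated, whereas $ takes the name literally. The variable nm is illustrative.

```r
y = list("abc" = 1, "def" = 5)

# [[ matches names exactly by default, so a partial name finds nothing
y[["d"]]
## NULL

# Partial matching has to be requested explicitly
y[["d", exact = FALSE]]
## [1] 5

# [[ evaluates its argument; $ looks for an element literally named nm
nm = "def"
y[[nm]]
## [1] 5
y$nm
## NULL
```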

Logical operators and comparisons

Operator    Vectorized    Comparison    Vectorized
x | y       True          x < y         True
x & y       True          x > y         True
!x          True          x <= y        True
x || y      False         x >= y        True
x && y      False         x != y        True
xor(x,y)    True          x == y        True
                          x %in% y      True (for x)
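
A brief illustration of the vectorized and non-vectorized operators. The || and && forms are shown with length-one operands only, since their handling of longer vectors depends on the R version (older versions use just the first element, recent ones raise an error).

```r
x = c(TRUE, FALSE, TRUE)
y = c(TRUE, TRUE, FALSE)

# | , & and xor() work element by element
x | y
## [1] TRUE TRUE TRUE
x & y
## [1]  TRUE FALSE FALSE
xor(x, y)
## [1] FALSE  TRUE  TRUE

# || and && return a single value; use them with length-one operands
x[1] && y[3]
## [1] FALSE

# %in% returns one value per element of its left-hand side
c(1, 5, 9) %in% c(1, 2, 3, 9)
## [1]  TRUE FALSE  TRUE
```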

Exercise 1

Below are 100 values,

x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)

write down how you would create a subset to accomplish each of the following:

  • Select every third value starting at position 2 in x.

  • Remove all values with an odd index (e.g. 1, 3, etc.)

  • Select only the values that are primes. (You may assume all values are less than 100)

  • Remove every 4th value, but only if it is odd.

Matrices, Data Frames, and Arrays

Matrices and Arrays

Atomic vectors can be treated as multidimensional (2 or more) objects by adding a dim attribute.

x = 1:8
dim(x) = c(2,4)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
matrix(1:8, nrow=2, ncol=4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

x = 1:8
attr(x,"dim") = c(2,2,2)
x
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
x = array(1:8,c(2,2,2))
x
## , , 1
## 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## , , 2
## 
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

Naming dimensions

x = array(1:8,c(2,2,2))
rownames(x) = LETTERS[1:2]
colnames(x) = LETTERS[3:4]
dimnames(x)[[3]] = LETTERS[5:6]
x
## , , E
## 
##   C D
## A 1 3
## B 2 4
## 
## , , F
## 
##   C D
## A 5 7
## B 6 8

str(x)
##  int [1:2, 1:2, 1:2] 1 2 3 4 5 6 7 8
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "A" "B"
##   ..$ : chr [1:2] "C" "D"
##   ..$ : chr [1:2] "E" "F"

Subsetting Matrices

(x = matrix(1:6, nrow=2, ncol=3, dimnames=list(c("A","B"),c("M","N","O"))))
##   M N O
## A 1 3 5
## B 2 4 6
x[1,3]
## [1] 5
x[1:2, 1:2]
##   M N
## A 1 3
## B 2 4
x[, 1:2]
##   M N
## A 1 3
## B 2 4
x[-1,-3]
## M N 
## 2 4

x["A","M"]
## [1] 1
x["A", c("M","O")]
## M O 
## 1 5
x[, "C"]
## Error in x[, "C"]: subscript out of bounds
x[1,"M"]
## [1] 1
x["B",]
## M N O 
## 2 4 6
x["B"]
## [1] NA
x[-1]
## [1] 2 3 4 5 6
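
The last two results follow from how a single index is handled: the matrix is treated as its underlying vector, which is stored column by column and has no names of its own. A minimal sketch:

```r
x = matrix(1:6, nrow=2, ncol=3, dimnames=list(c("A","B"),c("M","N","O")))

# With one index the matrix is treated as its underlying vector,
# which is stored column by column
x[5]
## [1] 5
x[1, 3]
## [1] 5

# The vector itself has no names, so a row name finds nothing
names(x)
## NULL
x["B"]
## [1] NA
```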

Preserving Subsetting

By default R's [ subset operator is a preserving subset operator, in that the returned object has the same type as the parent. Confusingly, when used with a matrix or array, [ becomes a simplifying operator (it does not preserve type) - this behavior can be controlled with the drop argument.

x = matrix(1:6, nrow=2, ncol=3, dimnames=list(c("A","B"),c("M","N","O")))
x[1, ]
## M N O 
## 1 3 5
x[1, , drop=TRUE]
## M N O 
## 1 3 5
x[1, , drop=FALSE]
##   M N O
## A 1 3 5
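
One way to see what drop changes is to inspect the dimensions of the result. The sketch below also shows why drop=FALSE can matter when the selection is computed and might match a single row; the variable keep is illustrative.

```r
x = matrix(1:6, nrow=2, ncol=3)

# Simplified result: a plain vector without dimensions
dim(x[1, ])
## NULL

# Preserved result: still a matrix
dim(x[1, , drop=FALSE])
## [1] 1 3

# A computed selection that happens to match one row
keep = x[, 1] > 1
nrow(x[keep, , drop=FALSE])
## [1] 1
nrow(x[keep, ])
## NULL
```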

Preserving vs Simplifying Subsets

              Simplifying            Preserving
Vector        x[[1]]                 x[1]
List          x[[1]]                 x[1]
Array         x[1, ]                 x[1, , drop = FALSE]
              x[, 1]                 x[, 1, drop = FALSE]
Factor        x[1:4, drop = TRUE]    x[1:4]
Data frame    x[, 1]                 x[, 1, drop = FALSE]
              x[[1]]                 x[1]

Factor Subsetting

(x = factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
x[1:2]
## [1] BS MS
## Levels: BS MS PhD
x[1:2, drop=TRUE]
## [1] BS MS
## Levels: BS MS

Data Frame Subsetting

If provided with a single index, a data frame is treated like a list and the index selects a column or columns. If provided with two indices (rows and columns), the data frame is treated like a matrix.

df = data.frame(a = 1:2, b = 3:4)
df[1]
##   a
## 1 1
## 2 2
df[[1]]
## [1] 1 2
df[, "a"]
## [1] 1 2

df["a"]
##   a
## 1 1
## 2 2
df[, "a", drop = FALSE]
##   a
## 1 1
## 2 2
df[1,]
##   a b
## 1 1 3
df[c("a","b","a")]
##   a b a.1
## 1 1 3   1
## 2 2 4   2
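
The two-index (matrix-style) form is what is typically used to filter rows by a condition. A minimal sketch with the same data frame:

```r
df = data.frame(a = 1:2, b = 3:4)

# One index: list-style, selects columns and returns a data frame
df["b"]
##   b
## 1 3
## 2 4

# Two indices: matrix-style, rows first and columns second
df[df$a > 1, ]
##   a b
## 2 2 4
df[df$a > 1, "b"]
## [1] 4
```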

Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

x = c(1, 4, 7)
x[2] = 2
x
## [1] 1 2 7
x[x %% 2 != 0] = x[x %% 2 != 0] + 1
x
## [1] 2 2 8
x[c(1,1)] = c(2,3)
x
## [1] 3 2 8

x = 1:6
x[c(2,NA)] = 1
x
## [1] 1 1 3 4 5 6
x[c(TRUE,NA)] = 1
x
## [1] 1 1 1 4 1 6
x[c(-1,-3)] = 3
x
## [1] 1 3 1 3 3 3
x[] = 6:1
x
## [1] 6 5 4 3 2 1
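
Two details of the examples above, sketched under the assumption that the replacement value may be longer than one element: a shorter replacement is recycled, and NA indices are only tolerated when the replacement has length one. The tryCatch() wrapper and the variable res are there only so the example runs to completion.

```r
x = 1:6

# A shorter replacement is recycled across the selected positions
x[1:4] = c(0, 9)
x
## [1] 0 9 0 9 5 6

# With an NA index and a longer replacement R cannot tell which
# value goes where, so the assignment fails
res = tryCatch({ x[c(2, NA)] = c(1, 2); "ok" }, error = function(e) "error")
res
## [1] "error"
```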

Deleting list (df) elements

df = data.frame(a = 1:2, b = TRUE, c = c("A", "B"))
df[["b"]] = NULL
str(df)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: int  1 2
##  $ c: Factor w/ 2 levels "A","B": 1 2
df[,"c"] = NULL
str(df)
## 'data.frame':    2 obs. of  1 variable:
##  $ a: int  1 2
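
The same idea applies to plain lists. The sketch below also shows one way to keep an element while storing NULL in it, by assigning a list that contains NULL through [:

```r
y = list(a = 1, b = TRUE, c = "A")

# Assigning NULL with [[ (or $) removes the element
y[["b"]] = NULL
names(y)
## [1] "a" "c"

# To keep the element and store NULL in it, assign a list containing NULL
y["c"] = list(NULL)
str(y)
## List of 2
##  $ a: num 1
##  $ c: NULL
```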

Subsets of Subsets

df = data.frame(a = c(5,1,NA,3))
df$a[df$a == 5] = 0
df[["a"]][df[["a"]] == 1] = 0
df[1][df[1] == 3] = 0
df
##    a
## 1  0
## 2  0
## 3 NA
## 4  0
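
Note that the NA in row 3 is left alone in the assignments above. When extracting rather than assigning, an NA in the condition does show up in the result; a small sketch of that difference, using which() as one possible way to keep only the TRUE positions:

```r
df = data.frame(a = c(5, 1, NA, 3))

# The comparison is NA wherever the data are NA
df$a > 2
## [1]  TRUE FALSE    NA  TRUE

# Used for extraction, that NA produces an NA in the result
df$a[df$a > 2]
## [1]  5 NA  3

# which() keeps only the positions that are TRUE
df$a[which(df$a > 2)]
## [1] 5 3
```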

Exercise 2

grades = data.frame(
            student = c("Alice","Bob","Carol","Dan","Eve","Frank",
                        "Mallory","Oscar","Peggy","Sam","Wendy"),
            grade   = c(82, 78, 62, 98, 64, 53, 86, 73, 54, 57, 61),
            year    = c(3L, 2L, 2L, 1L, 3L, 3L, 4L, 3L, 2L, 2L, 1L),
            stringsAsFactors = FALSE
         )

For the above data frame use subsetting and subsetting assignment to add two new features (columns) to the data set:

  • the student's letter grade (factor vector with labels A - F)
    • A (90-100), B (80-89), C (70-79), D (60-69), F (0-59)
  • the student's passing status in the class (logical vector)
    • TRUE for a grade of A, B, or C
    • FALSE for a grade of D or F

These changes should not be hard coded - if I gave you a new data frame your code should still produce the correct answer. Hint - cbind or rbind may prove useful.

Acknowledgments