A data frame is one of the most commonly used data structures in R. It is just a list of equal-length vectors (usually atomic, but generic vectors can be used as well). Each vector is treated as a column, and the elements of the vectors as rows.
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
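As a minimal sketch of the file-based route, the example below reads a small CSV with `read.csv()`. To keep it self-contained the data come from an inline string passed through the `text` argument rather than from a real file; the name `csv_text` and the values in it are made up for illustration.

```r
# Hypothetical CSV contents; in practice these would live in a file on disk
csv_text = "x,y
1,a
2,b
3,c"

# read.csv() accepts a file path, or literal text via the `text` argument
df_csv = read.csv(text = csv_text, stringsAsFactors = FALSE)
str(df_csv)
```

With a real file the call would take the path as its first argument instead of `text`.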
df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

typeof(df)
## [1] "list"

attributes(df)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"

df2 = list(x = 1:3, y = factor(c("a", "b", "c")))
attr(df2,"class") = "data.frame"
attr(df2,"row.names") = 1:3
str(df2)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
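Because a data frame is a list underneath, list tools apply to it column by column. The following is a small sketch (the object name `df_demo` is made up for illustration) showing that `length()` counts columns and `sapply()` iterates over them:

```r
df_demo = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

is.list(df_demo)        # a data frame passes the list check
length(df_demo)         # number of columns, not rows
sapply(df_demo, class)  # applies class() to each column in turn
df_demo[["y"]]          # list-style extraction returns the column vector
```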
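In versions of R before 4.0.0, character vectors are converted into factors by default when they are included in a data frame; the output shown in these notes reflects that behavior. Since R 4.0.0 the default is to leave them as character vectors.
Sometimes the conversion is useful, sometimes it isn't; either way it is important to know what type/class you are working with. This behavior is controlled by the stringsAsFactors argument.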
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: chr "a" "b" "c"
If an R vector is shorter than expected, R will increase its length by repeating (recycling) the elements of the shorter vector. If the longer length is an exact multiple of the shorter one, this happens without any output or warning.
For data frames, if the lengths are not evenly divisible there will be an error.
data.frame(x = 1:3, y = c("a"))
##   x y
## 1 1 a
## 2 2 a
## 3 3 a

data.frame(x = 1:3, y = c("a","b"))
## Error in data.frame(x = 1:3, y = c("a", "b")): arguments imply differing number of rows: 3, 2
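The same recycling rule can be seen with plain vectors, where a length mismatch produces a warning rather than an error. This is a small sketch with made-up values; `tryCatch()` is used only to capture the warning message so the example runs quietly.

```r
x = 1:6

# Length 6 is a multiple of length 2, so c(10, 20) is recycled silently
x + c(10, 20)

# Length 6 is not a multiple of length 4: R still recycles, but warns
res = tryCatch(x + c(1, 2, 3, 4), warning = function(w) conditionMessage(w))
res
```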
We can add rows or columns to a data frame using rbind and cbind respectively.
df = data.frame(x = 1:3, y = c("a","b","c"))
str(rbind(df, c(TRUE,FALSE)))
## Warning in `[<-.factor`(`*tmp*`, ri, value = FALSE): invalid factor level,
## NA generated
## 'data.frame': 4 obs. of 2 variables:
##  $ x: int 1 2 3 1
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 NA

str(cbind(df, z=TRUE))
## 'data.frame': 3 obs. of 3 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ z: logi TRUE TRUE TRUE

df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
str(cbind(df1,df2))
## 'data.frame': 3 obs. of 4 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int 3 2 1
##  $ n: logi TRUE TRUE FALSE

# Bad
str(rbind(cbind(df1,df2),c(1,"a",1,1)))
## 'data.frame': 4 obs. of 4 variables:
##  $ x: chr "1" "2" "3" "1"
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 1
##  $ m: chr "3" "2" "1" "1"
##  $ n: chr "TRUE" "TRUE" "FALSE" "1"

# Good
str(rbind(cbind(df1,df2),list(1,"a",1,1)))
## 'data.frame': 4 obs. of 4 variables:
##  $ x: num 1 2 3 1
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3 1
##  $ m: num 3 2 1 1
##  $ n: num 1 1 0 1
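The difference between the "Bad" and "Good" versions comes down to what the new row is before rbind ever sees it. A short sketch (the names `row_vec` and `row_list` are made up for illustration):

```r
# c() builds an atomic vector, so mixed inputs are coerced to one common type
row_vec = c(1, "a", 1, 1)
typeof(row_vec)   # every element has become a character string

# list() keeps each element's own type
row_list = list(1, "a", 1, 1)
sapply(row_list, typeof)
```

Binding the character vector forces the numeric and logical columns to character, while the list lets each column receive a value of a compatible type.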
Construct a data frame that contains the following data (in as efficient a manner as possible). Hint - the rep function might prove useful.

Patient    Gender    Treatment 1    Treatment 2    Treatment 3
---------- --------- -------------- -------------- --------------
1          Male      Yes            Yes            Yes
2          Male      Yes            Yes            No
3          Male      Yes            No             Yes
4          Male      Yes            No             No
5          Male      No             Yes            Yes
6          Male      No             Yes            No
7          Male      No             No             Yes
8          Male      No             No             No
9          Female    Yes            Yes            Yes
10         Female    Yes            Yes            No
11         Female    Yes            No             Yes
12         Female    Yes            No             No
13         Female    No             Yes            Yes
14         Female    No             Yes            No
15         Female    No             No             Yes
16         Female    No             No             No
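As a reminder of how rep behaves, here is a small sketch on unrelated, made-up values (it is not a solution to the exercise). It shows the `times` and `each` arguments separately and combined:

```r
# `times` repeats the whole vector; `each` repeats every element in place
rep(c("lo", "hi"), times = 3)
rep(c("lo", "hi"), each = 3)

# The two can be combined: each element twice, then the whole pattern twice
rep(c("lo", "hi"), each = 2, times = 2)
```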
R has several different subsetting operators ([, [[, and $).
The behavior of these operators will depend on the object they are being used with.
In general there are 6 different data types that can be used to subset:
Positive integers
Negative integers
Logical values
Empty / NULL
Zero
Character values (names)
Positive integers return the elements at the given locations (note that R uses 1-based, not 0-based, indexing). Non-integer values are truncated toward zero.
x = c(1,4,7)
x[c(1,3)]
## [1] 1 7

x[c(1,1)]
## [1] 1 1

x[c(1.9,2.1)]
## [1] 1 4

y = list(1,4,7)
str( y[c(1,3)] )
## List of 2
##  $ : num 1
##  $ : num 7

str( y[c(1,1)] )
## List of 2
##  $ : num 1
##  $ : num 1

str( y[c(1.9,2.1)] )
## List of 2
##  $ : num 1
##  $ : num 4
Negative integers exclude the elements at the given locations. Positive and negative values cannot be mixed in the same subscript.
x = c(1,4,7)
x[-1]
## [1] 4 7

x[-c(1,3)]
## [1] 4

x[c(-1,2)]
## Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts

y = list(1,4,7)
str( y[-1] )
## List of 2
##  $ : num 4
##  $ : num 7

str( y[-c(1,3)] )
## List of 1
##  $ : num 4

y[c(-1,2)]
## Error in y[c(-1, 2)]: only 0's may be mixed with negative subscripts
Logical values return the elements that correspond to TRUE in the logical vector. The logical vector is expected to have the same length as the vector being subsetted; if it is shorter, it is recycled.
x = c(1,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
## [1] 1 4 12

x[c(TRUE,FALSE)]
## [1] 1 7

x[x %% 2 == 0]
## [1] 4 12

y = list(1,4,7,12)
str( y[c(TRUE,TRUE,FALSE,TRUE)] )
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 12

str( y[c(TRUE,FALSE)] )
## List of 2
##  $ : num 1
##  $ : num 7

str( y[y %% 2 == 0] )
## Error in y%%2: non-numeric argument to binary operator
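The last call fails because arithmetic operators are not defined for lists. One possible workaround, sketched here under the assumption that every element is a single number, is to build the logical vector element by element with `vapply()` (the name `is_even` is made up for illustration):

```r
y = list(1, 4, 7, 12)

# vapply() applies the test to each element and returns a plain logical vector
is_even = vapply(y, function(v) v %% 2 == 0, logical(1))
is_even

# That logical vector can then be used with `[` as usual
str(y[is_even])
```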
An empty subscript returns the original vector.
x = c(1,4,7)
x[]
## [1] 1 4 7

y = list(1,4,7)
str(y[])
## List of 3
##  $ : num 1
##  $ : num 4
##  $ : num 7
A zero subscript returns an empty vector of the same type as the vector being subsetted.
x = c(1,4,7)
x[0]
## numeric(0)

y = list(1,4,7)
str(y[0])
## list()
Character values select elements by name: if the vector has names, the elements whose names match the character vector are returned.
x = c(a=1,b=4,c=7)
x["a"]
## a
## 1

x[c("b","c")]
## b c
## 4 7

y = list(a=1,b=4,c=7)
str(y["a"])
## List of 1
##  $ a: num 1

str(y[c("b","c")])
## List of 2
##  $ b: num 4
##  $ c: num 7
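Subsetting out of bounds (a position or name that does not exist) returns NA for atomic vectors and NULL for lists.

x = c(1,4,7)
x[4]
## [1] NA

x["a"]
## [1] NA

x[c(1,4)]
## [1] 1 NA

y = list(1,4,7)
str(y[4])
## List of 1
##  $ : NULL

str(y["a"])
## List of 1
##  $ : NULL

str(y[c(1,4)])
## List of 2
##  $ : num 1
##  $ : NULL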
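Subsetting with NA returns NA for atomic vectors and NULL for lists; a bare NA is a logical value, so it is recycled to the length of the vector. Subsetting with NULL returns an empty vector, just like zero.

x = c(1,4,7)
x[NA]
## [1] NA NA NA

x[NULL]
## numeric(0)

x[c(1,NA)]
## [1] 1 NA

y = list(1,4,7)
str(y[NA])
## List of 3
##  $ : NULL
##  $ : NULL
##  $ : NULL

str(y[NULL])
## list()

str(y[c(1,NA)])
## List of 2
##  $ : num 1
##  $ : NULL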
[[ subsets like [ except it only subsets a single value.
x = c(a=1,b=4,c=7)
x[[1]]
## [1] 1

x[["a"]]
## [1] 1

x[[1:2]]
## Error in x[[1:2]]: attempt to select more than one element in vectorIndex
Note that for lists the returned value may not be a list (more on this later).
y = list(a=1,b=4,c=7)
y[2]
## $b
## [1] 4

y[[2]]
## [1] 4

y[["b"]]
## [1] 4

y[[1:2]]
## Error in y[[1:2]]: subscript out of bounds
$ is equivalent to [[ but it only works for lists; by default it uses partial matching (exact=FALSE).
x = c("abc"=1, "def"=5)
x$abc
## Error in x$abc: $ operator is invalid for atomic vectors

y = list("abc"=1, "def"=5)
y$abc
## [1] 1

y$d
## [1] 5
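To make the partial-matching difference concrete, the sketch below compares `$` with `[[` on the same list. `[[` matches names exactly unless `exact = FALSE` is passed.

```r
y = list("abc" = 1, "def" = 5)

y$d                       # partial match on "def"
y[["d"]]                  # exact matching by default: no element named "d"
y[["d", exact = FALSE]]   # opt in to partial matching with [[
```

Partial matching is convenient interactively but can hide typos, which is one reason to prefer `[[` in scripts.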
Below are 100 values,
x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1, 3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82, 21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10, 5, 2, 4, 4, 14, 15, 4, 17, 1, 9)
write down how you would create a subset to accomplish each of the following:
Select every third value starting at position 2 in x.
Remove all values with an odd index (e.g. 1, 3, etc.)
Select only the values that are primes. (You may assume all values are less than 100)
Remove every 4th value, but only if it is odd.
Atomic vectors can be treated as multidimensional (2 or more) objects by adding a dim attribute.
x = 1:8
dim(x) = c(2,4)
x
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

matrix(1:8, nrow=2, ncol=4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

x = 1:8
attr(x,"dim") = c(2,2,2)
x
## , , 1
##
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## , , 2
##
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

x = array(1:8,c(2,2,2))
x
## , , 1
##
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## , , 2
##
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8

x = array(1:8,c(2,2,2))
rownames(x) = LETTERS[1:2]
colnames(x) = LETTERS[3:4]
dimnames(x)[[3]] = LETTERS[5:6]
x
## , , E
##
##   C D
## A 1 3
## B 2 4
##
## , , F
##
##   C D
## A 5 7
## B 6 8

str(x)
##  int [1:2, 1:2, 1:2] 1 2 3 4 5 6 7 8
##  - attr(*, "dimnames")=List of 3
##   ..$ : chr [1:2] "A" "B"
##   ..$ : chr [1:2] "C" "D"
##   ..$ : chr [1:2] "E" "F"
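The outputs above all fill the first dimension fastest (column-major order). A brief sketch of what that means for a matrix, and of the `byrow` argument of `matrix()` that changes the fill order (the name `m` is made up for illustration):

```r
# Assigning dim fills the matrix column by column (column-major order)
x = 1:8
dim(x) = c(2, 4)
x[1, ]   # first row picks up every second value

# matrix() fills the same way by default; byrow = TRUE fills row by row instead
m = matrix(1:8, nrow = 2, ncol = 4, byrow = TRUE)
m[1, ]
```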
(x = matrix(1:6, nrow=2, ncol=3, dimnames=list(c("A","B"),c("M","N","O"))))
##   M N O
## A 1 3 5
## B 2 4 6

x[1,3]
## [1] 5

x[1:2, 1:2]
##   M N
## A 1 3
## B 2 4

x[, 1:2]
##   M N
## A 1 3
## B 2 4

x[-1,-3]
## M N
## 2 4

x["A","M"]
## [1] 1

x["A", c("M","O")]
## M O
## 1 5

x[, "C"]
## Error in x[, "C"]: subscript out of bounds

x[1,"M"]
## [1] 1

x["B",]
## M N O
## 2 4 6

x["B"]
## [1] NA

x[-1]
## [1] 2 3 4 5 6
By default R's [ subset operator is a preserving subset operator, in that the returned object will have the same type as the parent. Confusingly, when used with a matrix or array, [ becomes a simplifying operator (it does not preserve type); this behavior can be controlled with the drop argument.
x = matrix(1:6, nrow=2, ncol=3, dimnames=list(c("A","B"),c("M","N","O")))
x[1, ]
## M N O
## 1 3 5

x[1, , drop=TRUE]
## M N O
## 1 3 5

x[1, , drop=FALSE]
##   M N O
## A 1 3 5
| | Simplifying | Preserving |
|---|---|---|
| Vector | x[[1]] | x[1] |
| List | x[[1]] | x[1] |
| Array | x[1, ], x[, 1] | x[1, , drop = FALSE], x[, 1, drop = FALSE] |
| Factor | x[1:4, drop = TRUE] | x[1:4] |
| Data frame | x[, 1], x[[1]] | x[, 1, drop = FALSE], x[1] |
(x = factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS
## Levels: BS MS PhD

x[1:2]
## [1] BS MS
## Levels: BS MS PhD

x[1:2, drop=TRUE]
## [1] BS MS
## Levels: BS MS
If provided with a single index, a data frame assumes you want to subset a column or columns (list-style subsetting); if provided with two indices, the data frame is treated like a matrix.
df = data.frame(a = 1:2, b = 3:4)
df[1]
##   a
## 1 1
## 2 2

df[[1]]
## [1] 1 2

df[, "a"]
## [1] 1 2

df["a"]
##   a
## 1 1
## 2 2

df[, "a", drop = FALSE]
##   a
## 1 1
## 2 2

df[1,]
##   a b
## 1 1 3

df[c("a","b","a")]
##   a b a.1
## 1 1 3   1
## 2 2 4   2
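One practical consequence is that matrix-style column selection changes its return type depending on how many columns are selected. The sketch below illustrates this with a hypothetical `cols` variable standing in for a selection computed elsewhere:

```r
df = data.frame(a = 1:2, b = 3:4)

cols = "a"                        # hypothetical selection that holds one name
class(df[, cols])                 # matrix-style subsetting simplifies to a vector
class(df[, cols, drop = FALSE])   # drop = FALSE keeps the data frame
class(df[cols])                   # list-style subsetting always preserves

cols = c("a", "b")
class(df[, cols])                 # with two columns nothing is simplified
```

When the number of selected columns is not known in advance, `drop = FALSE` or list-style subsetting gives a predictable result.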
Subsets can also be used with assignment to update specific values within an object.
x = c(1, 4, 7)

x[2] = 2
x
## [1] 1 2 7

x[x %% 2 != 0] = x[x %% 2 != 0] + 1
x
## [1] 2 2 8

x[c(1,1)] = c(2,3)
x
## [1] 3 2 8

x = 1:6

x[c(2,NA)] = 1
x
## [1] 1 1 3 4 5 6

x[c(TRUE,NA)] = 1
x
## [1] 1 1 1 4 1 6

x[c(-1,-3)] = 3
x
## [1] 1 3 1 3 3 3

x[] = 6:1
x
## [1] 6 5 4 3 2 1
df = data.frame(a = 1:2, b = TRUE, c = c("A", "B"))

df[["b"]] = NULL
str(df)
## 'data.frame': 2 obs. of 2 variables:
##  $ a: int 1 2
##  $ c: Factor w/ 2 levels "A","B": 1 2

df[,"c"] = NULL
str(df)
## 'data.frame': 2 obs. of 1 variable:
##  $ a: int 1 2

df = data.frame(a = c(5,1,NA,3))

df$a[df$a == 5] = 0
df[["a"]][df[["a"]] == 1] = 0
df[1][df[1] == 3] = 0

df
##    a
## 1  0
## 2  0
## 3 NA
## 4  0
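Note that the NA in row 3 survives all three assignments above, because a comparison with NA yields NA rather than TRUE. A minimal sketch of two common ways to deal with this, `is.na()` to target missing values and `which()` to drop them from a logical test:

```r
df = data.frame(a = c(5, 1, NA, 3))

df$a == 5                    # the comparison itself yields NA for the missing row
df$a[is.na(df$a)] = 0        # is.na() is the way to target missing values
df$a[which(df$a == 5)] = 0   # which() drops NAs from a logical test before indexing
df
```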
grades = data.frame(
  student = c("Alice","Bob","Carol","Dan","Eve","Frank",
              "Mallory","Oscar","Peggy","Sam","Wendy"),
  grade = c(82, 78, 62, 98, 64, 53, 86, 73, 54, 57, 61),
  year = c(3L, 2L, 2L, 1L, 3L, 3L, 4L, 3L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)
For the above data frame use subsetting and subsetting assignment to add two new features (columns) to the data set:
These changes should not be hard coded - if I gave you a new data frame your code should still produce the correct answer.
Hadley Wickham has a package, tibble, that modifies data frames to be more modern, or as he calls them, surly and lazy.
library(tibble)
class(iris)
## [1] "data.frame"

tbl_iris = as_tibble(iris)
class(tbl_iris)
## [1] "tbl_df" "tbl" "data.frame"

tbl_iris
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
##  1          5.1         3.5          1.4         0.2  setosa
##  2          4.9         3.0          1.4         0.2  setosa
##  3          4.7         3.2          1.3         0.2  setosa
##  4          4.6         3.1          1.5         0.2  setosa
##  5          5.0         3.6          1.4         0.2  setosa
##  6          5.4         3.9          1.7         0.4  setosa
##  7          4.6         3.4          1.4         0.3  setosa
##  8          5.0         3.4          1.5         0.2  setosa
##  9          4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## # ... with 140 more rows

tbl_iris[1,]
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
## 1          5.1         3.5          1.4         0.2  setosa

tbl_iris[,"Species"]
## # A tibble: 150 x 1
##    Species
##     <fctr>
##  1  setosa
##  2  setosa
##  3  setosa
##  4  setosa
##  5  setosa
##  6  setosa
##  7  setosa
##  8  setosa
##  9  setosa
## 10  setosa
## # ... with 140 more rows
# tibble() replaces the deprecated data_frame() constructor
tibble(x = 1:3, y=c("A","B","C"))
## # A tibble: 3 x 2
##       x     y
##   <int> <chr>
## 1     1     A
## 2     2     B
## 3     3     C
tbl_iris[,"Name"]
## Error: Unknown columns 'Name'

tbl_iris$Name
## Warning: Unknown column 'Name'
## NULL

tbl_iris[160,]
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
## 1           NA          NA           NA          NA      NA
Above materials are derived in part from the following sources: