Data structures and dimensionality

Dimensions	Homogeneous	Heterogeneous
1d	Vector (atomic vector)	List (generic vector)
2d	Matrix	Data Frame
nd	Array	—

Vectors

Vector types

R has six basic atomic vector types, but for now we’ll only focus on the first four:

logical
double
integer
character
complex
raw

Vector types - examples

logical - boolean values TRUE and FALSE

typeof(TRUE)

## [1] "logical"

character - character strings

typeof("hello")

## [1] "character"

typeof('world')

## [1] "character"

Vector types - examples

double - floating point numerical values (default numerical type)

typeof(1.335)

## [1] "double"

typeof(7)

## [1] "double"

integer - integer numerical values (indicated with an L)

typeof( 7L )

## [1] "integer"

typeof( 1:3 )

## [1] "integer"

Concatenation

Vectors can be constructed using the c() function.

c(1,2,3)

## [1] 1 2 3

c("Hello", "World!")

## [1] "Hello"  "World!"

c(1,c(2, c(3)))

## [1] 1 2 3
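
As a small illustrative sketch (the names v1, v2 and v are not from the slides), c() can also combine existing vectors, and the result is always a flat vector:

```r
v1 = c(1, 2)
v2 = c(3, 4, 5)

# combine two vectors and a single value into one flat vector
v = c(v1, v2, 6)
v
## [1] 1 2 3 4 5 6
length(v)
## [1] 6
```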

Coercion

R is a dynamically typed language: when values of different types are combined, it silently coerces them to the most general type present (logical, then integer, then double, then character).

c(1,"Hello")

## [1] "1"     "Hello"

c(FALSE, 3L)

## [1] 0 3

c(1.2, 3L)

## [1] 1.2 3.0
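
The examples above suggest an ordering of types. A minimal sketch, using only base R functions, that checks the result type of each combination and shows explicit coercion with the as.*() functions:

```r
# implicit coercion: the result takes the most general type present
typeof(c(TRUE, 1L))
## [1] "integer"
typeof(c(1L, 1.5))
## [1] "double"
typeof(c(1.5, "a"))
## [1] "character"

# explicit coercion
as.integer("7")
## [1] 7
as.logical(c(0, 1, 2))
## [1] FALSE  TRUE  TRUE

# values that cannot be converted become NA (R also emits a warning,
# suppressed here so the example runs quietly)
suppressWarnings(as.numeric("abc"))
## [1] NA
```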

Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)

## [1] "logical"
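
A short sketch of how NA behaves in practice (the vector x here is only an illustration): it is detected with is.na(), it propagates through most computations, and it is coerced to the type of the vector it sits in.

```r
x = c(1, NA, 3)
is.na(x)
## [1] FALSE  TRUE FALSE

# NA propagates unless it is explicitly removed
sum(x)
## [1] NA
sum(x, na.rm = TRUE)
## [1] 4

# NA is logical on its own, but takes on the type of its vector
typeof(c(1L, NA))
## [1] "integer"

# typed missing values also exist
typeof(NA_character_)
## [1] "character"
```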

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity

pi / 0

## [1] Inf

0 / 0

## [1] NaN

1/0 + 1/0

## [1] Inf

1/0 - 1/0

## [1] NaN

NaN / NA

## [1] NaN

NaN * NA

## [1] NaN
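
These special values can be told apart with base R's predicate functions. A minimal sketch on an illustrative vector:

```r
x = c(1, NA, NaN, Inf, -Inf)

is.na(x)        # TRUE for both NA and NaN
## [1] FALSE  TRUE  TRUE FALSE FALSE
is.nan(x)       # TRUE only for NaN
## [1] FALSE FALSE  TRUE FALSE FALSE
is.finite(x)    # FALSE for NA, NaN, Inf and -Inf
## [1]  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)
## [1] FALSE FALSE FALSE  TRUE  TRUE
```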

Application Exercise 3

What is the type of the following vectors? Explain why they have that type.

c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)

Lists

Lists are generic vectors: they are one-dimensional and can contain any combination of R objects.

mylist = list("A", 1:4, c(TRUE,FALSE), (1:4)/2)
mylist

## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0

str(mylist)

## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2

Recursive lists

Lists can even contain other lists, meaning they don’t have to be flat.

str( list(1, list(2, list(3))) )

## List of 2
##  $ : num 1
##  $ :List of 2
##   ..$ : num 2
##   ..$ :List of 1
##   .. ..$ : num 3
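
As a sketch of how such a nested structure can be navigated (the name nested is illustrative), elements are reached by chaining [[, and unlist() flattens the whole list into an atomic vector:

```r
nested = list(1, list(2, list(3)))

# second element of the outer list, then its first element
nested[[2]][[1]]
## [1] 2

# one level deeper
nested[[2]][[2]][[1]]
## [1] 3

# flatten the recursive structure
unlist(nested)
## [1] 1 2 3
```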

Named lists

Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)

## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"

names(myotherlist)

## [1] "A"           "B"           "knock knock"

myotherlist$B

## [1] 1 2 3 4
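
A brief sketch of other ways to reach named elements, reusing the list defined above; the vector v is an added illustration of naming an atomic vector:

```r
myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# [[ with a name returns the same element as $
myotherlist[["B"]]
## [1] 1 2 3 4

# names that are not syntactically valid need backticks with $
myotherlist$`knock knock`
## [1] "who's there?"

# atomic vectors can carry names as well
v = c(a = 1, b = 2)
names(v)
## [1] "a" "b"
```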

Data Frames

A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors can be used as well). Each vector is treated as a column, and the elements of the vectors as rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

attributes(df)

## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"

class(df$x)

## [1] "integer"

class(df$y)

## [1] "factor"
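
Because a data frame is a list of columns, the usual list tools apply to it. A minimal sketch that checks this with base R functions:

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)     # the underlying storage is a list
## [1] "list"
length(df)     # number of columns, i.e. list elements
## [1] 2
dim(df)        # rows and columns
## [1] 3 2
df[["x"]]      # list-style access to a single column
## [1] 1 2 3
```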

Strings (Characters) vs Factors

In versions of R before 4.0.0, data.frame() converts character vectors into factors by default, as in the output above; since R 4.0.0 the default is to leave them as character vectors. Sometimes factors are useful, sometimes they aren’t; either way it is important to know what type/class you are working with. This behavior is controlled by the stringsAsFactors argument.

df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
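
One reason the distinction matters: a factor stores integer codes plus a set of labels, so converting it directly to a number gives the codes rather than the labels. A small sketch with an illustrative factor f:

```r
f = factor(c("10", "20", "10"))

levels(f)
## [1] "10" "20"

# the underlying integer codes, not the labels
as.integer(f)
## [1] 1 2 1

# go through character to recover the labels as numbers
as.character(f)
## [1] "10" "20" "10"
as.numeric(as.character(f))
## [1] 10 20 10
```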

Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df = data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z=TRUE)

##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE

rbind(df, data.frame(x = 4L, y = "b"))

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b
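
One thing to watch when adding rows: a row supplied as an atomic vector has a single type, so c(4, "b") is a character vector and the x column is coerced along with it, even though the printed result looks unchanged. A sketch comparing it with a one-row data frame:

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"))

# new row given as a character vector: x is coerced to character
class(rbind(df, c(4, "b"))$x)
## [1] "character"

# new row given as a one-row data frame: x keeps its type
class(rbind(df, data.frame(x = 4L, y = "b"))$x)
## [1] "integer"
```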

Combining data frames

df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
df3 = cbind(df1,df2)
df3

##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE

str(df3)

## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE

Subsetting

Subsetting in General

R has several different subsetting operators ([, [[, and $).

The behavior of these operators will depend on the object they are being used with.

There are four main kinds of index that can be used to subset:

Inclusion (positive integers)
Exclusion (negative integers)
Logical values
Character values (names)

Subsetting vectors - inclusion

Returns elements at the given location. Note that R uses a 1-based indexing scheme.

x = c(8,4,7)
x[c(1,3)]

## [1] 8 7

x[c(1,1)]

## [1] 8 8

Subsetting vectors - exclusion

Excludes elements at the given location.

x = c(8,4,7)
x[-1]

## [1] 4 7

x[-c(1,3)]

## [1] 4

Subsetting vectors - logical values

Returns elements that correspond to TRUE in the logical vector.

x = c(-10,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]

## [1] -10   4  12

x[x > 3]

## [1]  4  7 12

x[x < -2 | x > 4]

## [1] -10   7  12

x[x < -2 & x > 4]

## numeric(0)
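
Missing values need some care with logical subsetting. A minimal sketch, using an illustrative vector that contains an NA, showing that an NA in the index produces an NA in the result and that which() avoids this:

```r
x = c(-10, 4, NA, 12)

x > 3
## [1] FALSE  TRUE    NA  TRUE

# an NA in the logical index yields an NA in the result
x[x > 3]
## [1]  4 NA 12

# which() returns the positions of TRUE values and drops NA
which(x > 3)
## [1] 2 4
x[which(x > 3)]
## [1]  4 12
```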

Logical operators and comparisons

op	meaning	comp	meaning
x | y	or	x < y	less than
x & y	and	x > y	greater than
!x	not x	x <= y	less than or equal to
x %% y	x mod y	x >= y	greater than or equal to
		x != y	not equal to
		x == y	equal to
		x %in% y	x in y
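
A few of the operators from the table applied to an illustrative vector; all of them work element by element and return a logical vector that can be used for subsetting:

```r
x = c(1, 4, 7, 10)

x %% 2 == 0            # values divisible by 2
## [1] FALSE  TRUE FALSE  TRUE
x %in% c(4, 10, 11)    # values that appear in another vector
## [1] FALSE  TRUE FALSE  TRUE
!(x > 5)               # negation
## [1]  TRUE  TRUE FALSE FALSE
x > 2 & x < 9          # both conditions must hold
## [1] FALSE  TRUE  TRUE FALSE
```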

Subsetting vectors - character values

If the vector has names, select elements whose names correspond to the character vector.

x = c(a=1,b=4,c=7)
x["a"]

## a 
## 1

x[c("b","c")]

## b c 
## 4 7

Subsetting vectors - out of bound subsetting

x = c(1,4,7)
x[4]

## [1] NA

x["a"]

## [1] NA
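
The NA results above are specific to [ on atomic vectors. As a sketch of two related cases (tryCatch is used here only so the example runs to completion): [[ signals an error when out of bounds, and [ on a list returns a list holding NULL.

```r
x = c(1, 4, 7)
y = list(1, 4, 7)

# [[ out of bounds is an error rather than NA
res = tryCatch(x[[4]], error = function(e) "error")
res
## [1] "error"

# [ out of bounds on a list gives a list containing NULL
str(y[4])
## List of 1
##  $ : NULL
```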

Vectors vs. lists - [ vs. [[

x = c(8,4,7)
x[1]

## [1] 8

x[[1]]

## [1] 8

y = list(8,4,7)
y[2]

## [[1]]
## [1] 4

y[[2]]

## [1] 4
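
The printed output hints at the difference: on a list, [ returns a shorter list while [[ returns the element itself. A minimal sketch that checks the types:

```r
y = list(8, 4, 7)

typeof(y[2])      # still a list, of length 1
## [1] "list"
typeof(y[[2]])    # the element stored inside
## [1] "double"

# arithmetic works on the extracted element
y[[2]] + 1
## [1] 5
# y[2] + 1 would fail, since arithmetic is not defined for lists
```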

Application Exercise 4

Below are 100 values:

x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1, 
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82, 
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10, 
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)

Write down how you would create a subset to accomplish each of the following:

Select all observations with values greater than or equal to 40
Select all observations with values less than 30 or greater than 50
Select all observations with values between 35 and 75
Remove all observations with an odd index (e.g. 1, 3, etc.)

Factor Subsetting

(x = factor(c("BS", "MS", "PhD", "MS")))

## [1] BS  MS  PhD MS 
## Levels: BS MS PhD

x[1:2]

## [1] BS MS
## Levels: BS MS PhD

x[1:2, drop=TRUE]

## [1] BS MS
## Levels: BS MS
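
An alternative to drop = TRUE is base R's droplevels(), which removes unused levels after the fact. A short sketch on the same factor:

```r
x = factor(c("BS", "MS", "PhD", "MS"))

# subsetting keeps all original levels
nlevels(x[1:2])
## [1] 3

# droplevels() removes the levels that no longer occur
levels(droplevels(x[1:2]))
## [1] "BS" "MS"
```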

Data Frame Subsetting

df = data.frame(a = 1:2, b = 3:4, c = 5:6)
df[1,]

##   a b c
## 1 1 3 5

df[,-2]

##   a c
## 1 1 5
## 2 2 6

df[, c("a","b")]

##   a b
## 1 1 3
## 2 2 4
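
A few further cases, sketched on the same data frame: selecting a single column simplifies the result to a vector unless drop = FALSE is given, and rows can be selected with a logical condition.

```r
df = data.frame(a = 1:2, b = 3:4, c = 5:6)

df[, "a"]                  # a single column is simplified to a vector
## [1] 1 2

df[, "a", drop = FALSE]    # keep the data frame structure
##   a
## 1 1
## 2 2

df[df$a > 1, ]             # rows selected by a logical condition
##   a b c
## 2 2 4 6

df[["b"]]                  # list-style extraction of one column
## [1] 3 4
```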

Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

x = c(1, 4, 7)

x[2] = 2
x

## [1] 1 2 7

x[1] = x[1] + 1
x

## [1] 2 2 7
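
Any of the index types can appear on the left-hand side of an assignment. A minimal sketch with an illustrative vector, including what happens when assigning past the end:

```r
x = c(1, 4, 7, 12)

# replace every element that satisfies a condition
x[x > 5] = 0
x
## [1] 1 4 0 0

# assigning beyond the current length pads the gap with NA
x[6] = 9
x
## [1]  1  4  0  0 NA  9
```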

Assignment with factors

x = c(1,2,1,3,2,1,2,1,3)
x[x == 1] = "male"
x[x == 2] = "female"
x[x == 3] = "other"
str(x)

##  chr [1:9] "male" "female" "male" "other" "female" "male" ...

x = factor(x, levels = c("male","female","other")); str(x)

##  Factor w/ 3 levels "male","female",..: 1 2 1 3 2 1 2 1 3

y = x[x != "other"]; str(y)

##  Factor w/ 3 levels "male","female",..: 1 2 1 2 1 2 1

w = x[x != "other", drop = TRUE]; str(w)

##  Factor w/ 2 levels "male","female": 1 2 1 2 1 2 1
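
As an alternative sketch, the recoding above can be done in one step with the levels and labels arguments of factor(); the name codes is illustrative. The second part shows that assigning a value that is not an existing level produces NA (R also emits a warning, suppressed here).

```r
codes = c(1, 2, 1, 3, 2, 1, 2, 1, 3)

# map the numeric codes to labels while creating the factor
x = factor(codes, levels = 1:3, labels = c("male", "female", "other"))
str(x)
##  Factor w/ 3 levels "male","female",..: 1 2 1 3 2 1 2 1 3

# assigning a value outside the existing levels gives NA
y = x
suppressWarnings(y[1] <- "unknown")
is.na(y[1])
## [1] TRUE
```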

Misc.

HW

See HW2.

Acknowledgments

Above materials are derived in part from the following sources:

Hadley Wickham - Advanced R
R Language Definition

Sta112FS
Lecture 5 - Data types and subsetting

Dr. Çetinkaya-Rundel

September 9, 2014
