September 9, 2014

## Data structures and dimensionality

| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | Vector (atomic vector) | List (generic vector) |
| 2d | Matrix | Data Frame |
| nd | Array | |
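
As a rough illustration of the table above, the sketch below builds one object of each kind with base R constructors. The variable names (v, l, m, d, a) are placeholders chosen for this example only.

```r
v = c(1.5, 2, 3.25)                              # 1d, homogeneous: atomic vector
l = list(1L, "a", TRUE)                          # 1d, heterogeneous: list
m = matrix(1:6, nrow = 2)                        # 2d, homogeneous: matrix
d = data.frame(id = 1:2, ok = c(TRUE, FALSE))    # 2d, heterogeneous: data frame
a = array(1:24, dim = c(2, 3, 4))                # nd, homogeneous: array

dim(v)   # NULL: a plain vector carries no dim attribute
dim(m)   # 2 3
dim(d)   # 2 2
dim(a)   # 2 3 4
```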

## Vector types

R has six basic atomic vector types, but for now we'll only focus on the first four:

• logical

• double

• integer

• character

• complex

• raw

## Vector types - examples

logical - boolean values TRUE and FALSE

typeof(TRUE)
## [1] "logical"

character - character strings

typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"

## Vector types - examples

double - floating point numerical values (default numerical type)

typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)

typeof( 7L )
## [1] "integer"
typeof( 1:3 )
## [1] "integer"

## Concatenation

Vectors can be constructed using the c() function.

c(1,2,3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello"  "World!"
c(1,c(2, c(3)))
## [1] 1 2 3

## Coercion

R is a dynamically typed language – it will happily convert between the various types without complaint.

c(1,"Hello")
## [1] "1"     "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
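
The implicit conversions above follow a fixed order: logical, then integer, then double, then character, with the result taking the most general type present. The sketch below checks that order with typeof() and shows the explicit as.*() functions. The variable bad is a name introduced only for this example.

```r
# Implicit coercion: the most general type wins
typeof(c(TRUE, 1L))     # "integer"
typeof(c(1L, 2.5))      # "double"
typeof(c(2.5, "a"))     # "character"

# Explicit coercion
as.integer("7")         # 7
as.numeric(TRUE)        # 1
as.character(3.5)       # "3.5"
as.logical(0:2)         # FALSE TRUE TRUE
as.integer(3.9)         # 3 (the fractional part is dropped)

# A failed conversion yields NA plus a warning (suppressed here)
bad = suppressWarnings(as.integer("seven"))
bad                     # NA
```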

## Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)
## [1] "logical"
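
Although a bare NA is logical, it is coerced like any other value when combined with other types. The sketch below shows how to detect missing values and how they propagate through arithmetic. The vector x is made up for this example.

```r
x = c(1, NA, 3)
typeof(x)               # "double": the logical NA was coerced
is.na(x)                # FALSE TRUE FALSE
x == NA                 # NA NA NA (comparison with NA is never TRUE)
sum(x)                  # NA
sum(x, na.rm = TRUE)    # 4
mean(x, na.rm = TRUE)   # 2

# Typed missing values also exist
typeof(NA_integer_)     # "integer"
typeof(NA_character_)   # "character"
```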

## Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity

pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
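
These special values can be tested for with base R predicates. The sketch below uses a made-up vector x. Note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN.

```r
x = c(1, Inf, -Inf, NaN, NA)
is.finite(x)     # TRUE  FALSE FALSE FALSE FALSE
is.infinite(x)   # FALSE TRUE  TRUE  FALSE FALSE
is.nan(x)        # FALSE FALSE FALSE TRUE  FALSE
is.na(x)         # FALSE FALSE FALSE TRUE  TRUE
```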

## Application Exercise 3

What is the type of the following vectors? Explain why they have that type.

• c(1, NA+1L, "C")
• c(1L / 0, NA)
• c(1:3, 5)
• c(3L, NaN+1L)
• c(NA, TRUE)

## Lists

Lists are generic vectors: 1d and can contain any combination of R objects.

mylist = list("A", 1:4, c(TRUE,FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1]  TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2
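
To make the "generic vector" idea concrete, the sketch below recreates the same list and inspects it with a few base R functions. Each element keeps its own type and length.

```r
mylist = list("A", 1:4, c(TRUE, FALSE), (1:4) / 2)
typeof(mylist)           # "list"
length(mylist)           # 4
is.vector(mylist)        # TRUE: a list is still a vector
is.atomic(mylist)        # FALSE: but not an atomic one
sapply(mylist, typeof)   # "character" "integer" "logical" "double"
sapply(mylist, length)   # 1 4 2 4
```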

## Recursive lists

Lists can even contain other lists, meaning they don't have to be flat.

str( list(1, list(2, list(3))) )
## List of 2
##  $ : num 1
##  $ :List of 2
##   ..$ : num 2
##   ..$ :List of 1
##   .. ..$ : num 3

## Named lists

Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A"           "B"           "knock knock"
myotherlist$B
## [1] 1 2 3 4

## Data Frames

A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column and the elements of the vectors as rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
attributes(df)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "factor"
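
Because a data frame is a list of equal-length columns, the usual list functions apply to it. The sketch below checks this for a small data frame. The variable res is introduced only to capture the outcome of a deliberately invalid call.

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"))
typeof(df)    # "list"
is.list(df)   # TRUE
length(df)    # 2: the number of columns, not rows
names(df)     # "x" "y"
nrow(df)      # 3
ncol(df)      # 2

# Columns whose lengths cannot be recycled to a common size are rejected
res = tryCatch(data.frame(x = 1:3, y = c("a", "b")),
               error = function(e) "error")
res           # "error"
```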

## Strings (Characters) vs Factors

In R versions before 4.0.0, data.frame() converts character vectors into factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE, so the factor output shown in these examples reflects the older default. Sometimes the conversion is useful, sometimes it isn't; either way it is important to know what type/class you are working with. This behavior can be changed using the stringsAsFactors argument.

df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
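
To see why the distinction matters, it helps to look at how a factor is stored: integer codes plus a set of levels. The sketch below uses two made-up factors, f and g. The second shows a common pitfall when the labels look like numbers.

```r
f = factor(c("a", "b", "c", "a"))
class(f)          # "factor"
typeof(f)         # "integer": stored as codes
levels(f)         # "a" "b" "c"
as.integer(f)     # 1 2 3 1
as.character(f)   # "a" "b" "c" "a"

# Pitfall: converting a numeric-looking factor returns the codes
g = factor(c("10", "20", "10"))
as.numeric(g)                 # 1 2 1
as.numeric(as.character(g))   # 10 20 10
```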

## Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df = data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z=TRUE)
##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE
rbind(df, c(4,"b"))
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b
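
One caveat with the rbind call above: c(4,"b") is a character vector, so the x column is coerced to character as well. A possible alternative, sketched below under the assumption that x should stay an integer, is to bind a one-row data frame instead. The names bad and good are illustrative, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Binding a character vector coerces every column it touches
bad = rbind(df, c(4, "b"))
class(bad$x)    # "character"

# Binding a one-row data frame keeps the column types
good = rbind(df, data.frame(x = 4L, y = "b", stringsAsFactors = FALSE))
class(good$x)   # "integer"
nrow(good)      # 4
```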

## Combining data frames

df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
df3 = cbind(df1,df2)
df3
##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE
str(df3)
## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE

## Subsetting in General

R has several different subsetting operators ([, [[, and $).

The behavior of these operators will depend on the object they are being used with.

There are 4 main data types that can be used to subset:

• Inclusion (positive integers)

• Exclusion (negative integers)

• Logical values

• Character values (names)

## Subsetting vectors - inclusion

Returns elements at the given location. Note that R uses a 1-based indexing scheme.

x = c(8,4,7)
x[c(1,3)]
## [1] 8 7
x[c(1,1)]
## [1] 8 8

## Subsetting vectors - exclusion

Excludes elements at the given location.

x = c(8,4,7)
x[-1]
## [1] 4 7
x[-c(1,3)]
## [1] 4

## Subsetting vectors - logical values

Returns elements that correspond to TRUE in the logical vector.

x = c(-10,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
## [1] -10   4  12
x[x > 3]
## [1]  4  7 12
x[x < -2 | x > 4]
## [1] -10   7  12
x[x < -2 & x > 4]
## numeric(0)
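
One case the examples above do not cover is a missing value in the condition. The sketch below uses a made-up vector containing NA. An NA in the logical index produces an NA in the result, and which() or an explicit is.na() test avoids that.

```r
x = c(-10, NA, 7, 12)
x > 3                  # FALSE NA TRUE TRUE
x[x > 3]               # NA 7 12
x[which(x > 3)]        # 7 12: which() keeps only the TRUE positions
x[!is.na(x) & x > 3]   # 7 12
```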

## Logical operators and comparisons

| op | meaning | comp | meaning |
|---|---|---|---|
| x \| y | or | x < y | less than |
| x & y | and | x > y | greater than |
| !x | not x | x <= y | less than or equal to |
| %% | mod | x >= y | greater than or equal to |
| | | x != y | not equal to |
| | | x == y | equal to |
| | | x %in% y | x in y |
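
The operators in the table combine naturally with logical subsetting. A short sketch with a made-up vector x:

```r
x = c(3, 8, 10, 15, 22)
x[x %% 2 == 0]           # 8 10 22: even values
x[x %in% c(3, 15, 99)]   # 3 15: values found in the given set
x[!(x > 5)]              # 3
x[x >= 8 & x != 10]      # 8 15 22
```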

## Subsetting vectors - character values

If the vector has names, select elements whose names correspond to the character vector.

x = c(a=1,b=4,c=7)
x["a"]
## a
## 1
x[c("b","c")]
## b c
## 4 7

## Subsetting vectors - out of bound subsetting

x = c(1,4,7)
x[4]
## [1] NA
x["a"]
## [1] NA
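
The [ operator is forgiving about out-of-bound indexes, but [[ is not. The sketch below contrasts the two. tryCatch() is used only so the example runs to completion, and r1 and r2 are names introduced here to hold the outcome.

```r
x = c(1, 4, 7)
x[4]    # NA
x[0]    # numeric(0)
r1 = tryCatch(x[[4]], error = function(e) "error")
r1      # "error": [[ refuses an out-of-bound index

y = list(1, 4, 7)
y[4]    # a list of length 1 holding NULL
r2 = tryCatch(y[[4]], error = function(e) "error")
r2      # "error"
```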

## Vectors vs. lists - [ vs. [[

x = c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y = list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
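
The difference is easier to see by checking types: [ always returns an object of the same kind as the original, while [[ extracts a single element. The sketch below uses a named list (the names a, b, c are made up) so that $ can be shown as well.

```r
y = list(a = 8, b = 4, c = 7)
typeof(y["b"])      # "list": a sub-list of length 1
typeof(y[["b"]])    # "double": the element itself
y$b                 # 4: shorthand for y[["b"]]
length(y[c("a", "c")])   # 2: [ can select several elements at once
```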

## Application Exercise 4

Below are 100 values:

x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1,
3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82,
21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10,
5, 2, 4, 4, 14, 15, 4, 17, 1, 9)

Write down how you would create a subset to accomplish each of the following:

• Select all observations with values greater than or equal to 40

• Select all observations with values less than 30 or greater than 50

• Select all observations with values between 35 and 75

• Remove all observations with an odd index (e.g. 1, 3, etc.)

## Factor Subsetting

(x = factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
x[1:2]
## [1] BS MS
## Levels: BS MS PhD
x[1:2, drop=TRUE]
## [1] BS MS
## Levels: BS MS
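
If the subset has already been made, base R's droplevels() removes unused levels after the fact. A small sketch (y is a name introduced here):

```r
x = factor(c("BS", "MS", "PhD", "MS"))
y = x[1:2]
levels(y)               # "BS" "MS" "PhD": the unused level is kept
table(y)                # PhD appears with a count of 0
levels(droplevels(y))   # "BS" "MS"
```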

## Data Frame Subsetting

df = data.frame(a = 1:2, b = 3:4, c = 5:6)
df[1,]
##   a b c
## 1 1 3 5
df[,-2]
##   a c
## 1 1 5
## 2 2 6
df[, c("a","b")]
##   a b
## 1 1 3
## 2 2 4
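
Selecting a single column deserves a note: with the matrix-style [ , ] form the result is simplified to a vector unless drop = FALSE is given. The sketch below compares the common ways to pull out one column or a set of rows. The names v1, d1, d2, and rows are illustrative.

```r
df = data.frame(a = 1:2, b = 3:4, c = 5:6)

v1 = df[, "a"]                 # integer vector: 1 2
d1 = df[, "a", drop = FALSE]   # one-column data frame
d2 = df["a"]                   # list-style [ also returns a data frame
df[["a"]]                      # vector, same as df$a

rows = df[df$b > 3, ]          # logical row subset: only the second row
```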

## Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

x = c(1, 4, 7)
x[2] = 2
x
## [1] 1 2 7
x[1] = x[1] + 1
x
## [1] 2 2 7
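
Any of the subsetting styles shown earlier can appear on the left-hand side. The sketch below, using a made-up vector, assigns through a logical condition, assigns several positions at once, and shows that assigning past the end extends the vector with NA.

```r
x = c(1, 4, 7)
x[x > 3] = 0             # replace every value greater than 3
x                        # 1 0 0
x[c(1, 2)] = c(10, 20)   # several positions at once
x                        # 10 20 0
x[5] = 9                 # position 4 is filled with NA
x                        # 10 20 0 NA 9
```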

## Assignment with factors

x = c(1,2,1,3,2,1,2,1,3)
x[x == 1] = "male"
x[x == 2] = "female"
x[x == 3] = "other"
str(x)
##  chr [1:9] "male" "female" "male" "other" "female" "male" ...
x = factor(x, levels = c("male","female","other")); str(x)
##  Factor w/ 3 levels "male","female",..: 1 2 1 3 2 1 2 1 3
y = x[x != "other"]; str(y)
##  Factor w/ 3 levels "male","female",..: 1 2 1 2 1 2 1
w = x[x != "other", drop = TRUE]; str(w)
##  Factor w/ 2 levels "male","female": 1 2 1 2 1 2 1
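
A related point when assigning into an existing factor: a value that is not one of its levels becomes NA (with a warning). The sketch below shows this and one way around it, extending the levels first. The level "unknown" and the names x2 and x3 are made up for this example.

```r
x = factor(c("male", "female", "other"))

# Assigning a value that is not a level produces NA and a warning
x2 = x
suppressWarnings({ x2[1] = "unknown" })
x2[1]                      # NA

# Extending the levels first makes the assignment valid
x3 = x
levels(x3) = c(levels(x3), "unknown")
x3[1] = "unknown"
as.character(x3)           # "unknown" "female" "other"
```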