September 9, 2014

Data structures and dimensionality

Dimensions   Homogeneous              Heterogeneous
1d           Vector (atomic vector)   List (generic vector)
2d           Matrix                   Data Frame
nd           Array

Vectors

Vector types

R has six basic atomic vector types, but for now we'll only focus on the first four:

  • logical

  • double

  • integer

  • character

  • complex

  • raw

Vector types - examples

logical - boolean values TRUE and FALSE

typeof(TRUE)
## [1] "logical"

character - character strings

typeof("hello")
## [1] "character"
typeof('world')
## [1] "character"

Vector types - examples

double - floating point numerical values (default numerical type)

typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)

typeof( 7L )
## [1] "integer"
typeof( 1:3 )
## [1] "integer"
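For completeness, the two remaining atomic types from the list above can be inspected the same way. This is only a brief sketch; the values are arbitrary illustrations.

```r
typeof(1 + 2i)        # complex: a real and an imaginary part
## [1] "complex"
typeof(as.raw(255))   # raw: a single byte, printed in hexadecimal
## [1] "raw"
as.raw(255)
## [1] ff
```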

Concatenation

Vectors can be constructed using the c() function.

c(1,2,3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello"  "World!"
c(1,c(2, c(3)))
## [1] 1 2 3

Coercion

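R is a dynamically typed language: it will happily convert between the various types without complaint. When values of different types are combined in one atomic vector, they are all coerced to the most general type present, in the order logical, integer, double, character.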

c(1,"Hello")
## [1] "1"     "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
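The implicit conversions above can be checked with typeof(), and the same conversions can be requested explicitly with the as.*() functions. A small sketch with made-up values:

```r
# Implicit coercion moves toward the more general type
typeof(c(TRUE, 1L))
## [1] "integer"
typeof(c(1L, 1.5))
## [1] "double"
typeof(c(1.5, "a"))
## [1] "character"

# Logical values behave as 0/1 in arithmetic
sum(c(TRUE, FALSE, TRUE))
## [1] 2

# Explicit coercion
as.integer("7")
## [1] 7
as.character(c(1.5, 2))
## [1] "1.5" "2"
bad = suppressWarnings(as.numeric("abc"))   # cannot be parsed: NA (normally with a warning)
bad
## [1] NA
```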

Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)
## [1] "logical"
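In practice, missing values are usually detected with is.na() and skipped with an na.rm argument where a function offers one. The sketch below uses an arbitrary example vector; it also shows that NA takes on the type of the vector it sits in.

```r
x = c(1, NA, 3)
is.na(x)
## [1] FALSE  TRUE FALSE
mean(x)                 # NA propagates through most computations
## [1] NA
mean(x, na.rm = TRUE)   # drop missing values before computing
## [1] 2
NA == NA                # comparing with NA gives NA; use is.na() instead
## [1] NA
typeof(NA_integer_)
## [1] "integer"
typeof(NA_character_)
## [1] "character"
typeof(c(1.5, NA))      # the logical NA is coerced to match the other elements
## [1] "double"
```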

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity


pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
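These special values can be told apart with the predicate functions is.nan(), is.na(), is.finite() and is.infinite(). A short sketch on an example vector:

```r
x = c(1, NaN, NA, Inf, -Inf)
is.nan(x)
## [1] FALSE  TRUE FALSE FALSE FALSE
is.na(x)         # TRUE for both NA and NaN
## [1] FALSE  TRUE  TRUE FALSE FALSE
is.finite(x)
## [1]  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)
## [1] FALSE FALSE FALSE  TRUE  TRUE
```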

Application Exercise 3

What is the type of the following vectors? Explain why they have that type.

  • c(1, NA+1L, "C")
  • c(1L / 0, NA)
  • c(1:3, 5)
  • c(3L, NaN+1L)
  • c(NA, TRUE)

Lists

Lists are generic vectors: they are one-dimensional and can contain any combination of R objects.

mylist = list("A", 1:4, c(TRUE,FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2

Recursive lists

Lists can even contain other lists, meaning they don't have to be flat.

str( list(1, list(2, list(3))) )
## List of 2
##  $ : num 1
##  $ :List of 2
##   ..$ : num 2
##   ..$ :List of 1
##   .. ..$ : num 3
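Elements of a nested list can be reached by chaining [[ once per level. A minimal sketch (the name nested is just for illustration):

```r
nested = list(1, list(2, list(3)))
nested[[2]][[1]]
## [1] 2
nested[[2]][[2]][[1]]
## [1] 3
unlist(nested)   # flatten the nested list into an atomic vector
## [1] 1 2 3
```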

Named lists

Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A"           "B"           "knock knock"
myotherlist$B
## [1] 1 2 3 4
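Names that are not valid R identifiers, such as "knock knock", need [[ with a string or backticks after $. The sketch below also shows names on an atomic vector; the vector v is a made-up example.

```r
myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?")
myotherlist[["knock knock"]]
## [1] "who's there?"
myotherlist$`knock knock`
## [1] "who's there?"

# Atomic vectors can be named as well
v = c(first = 10, second = 20)
names(v)
## [1] "first"  "second"
v[["second"]]
## [1] 20
```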

Data Frames

The data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors can be used as well). Each vector is treated as a column, and the elements of the vectors as rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

attributes(df)
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "factor"
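That a data frame really is a list of columns can be checked directly. The sketch below uses a small example data frame with integer and logical columns only.

```r
d = data.frame(x = 1:3, z = c(TRUE, FALSE, TRUE))
typeof(d)
## [1] "list"
is.list(d)
## [1] TRUE
length(d)     # number of list elements, i.e. columns
## [1] 2
dim(d)        # rows, columns
## [1] 3 2
d[["z"]]      # list-style access returns the column vector
## [1]  TRUE FALSE  TRUE
```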

Strings (Characters) vs Factors

In versions of R before 4.0.0, data.frame() converts character vectors into factors by default; since R 4.0.0 the default is to leave them as character vectors. The examples here show the older default. Sometimes the conversion is useful, sometimes it isn't; either way it is important to know what type/class you are working with. This behavior is controlled by the stringsAsFactors argument.

df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
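One reason the distinction matters: a factor stores integer codes plus a set of levels, so converting a factor of number-like strings straight to numeric returns the codes, not the values. A small sketch with made-up data:

```r
f = factor(c("10", "20", "10"))
levels(f)
## [1] "10" "20"
as.integer(f)                  # integer codes, not the original values
## [1] 1 2 1
as.numeric(as.character(f))    # go through character to recover the values
## [1] 10 20 10
```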

Growing data frames

We can add rows or columns to a data frame using rbind and cbind respectively.

df = data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z=TRUE)
##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE
rbind(df, c(4,"b"))
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b
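Note that c(4, "b") is itself a character vector, so the new row's values arrive as strings. One way to keep control of column types, sketched here with stringsAsFactors set explicitly so the result does not depend on the R version, is to bind a one-row data frame instead:

```r
d = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
new_row = data.frame(x = 4L, y = "b", stringsAsFactors = FALSE)
d2 = rbind(d, new_row)
nrow(d2)
## [1] 4
class(d2$x)          # still integer
## [1] "integer"
class(d2$y)
## [1] "character"

d2$z = d2$x * 2      # columns can also be added by assignment
d2$z
## [1] 2 4 6 8
```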

Combining data frames

df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
df3 = cbind(df1,df2)
df3
##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE
str(df3)
## 'data.frame':    3 obs. of  4 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int  3 2 1
##  $ n: logi  TRUE TRUE FALSE

Subsetting

Subsetting in General

R has several different subsetting operators ([, [[, and $).

The behavior of these operators will depend on the object they are being used with.

There are four main kinds of index values that can be used to subset:

  • Inclusion (positive integers)

  • Exclusion (negative integers)

  • Logical values

  • Character values (names)

Subsetting vectors - inclusion

Returns elements at the given location. Note that R uses a 1-based indexing scheme.

x = c(8,4,7)
x[c(1,3)]
## [1] 8 7
x[c(1,1)]
## [1] 8 8

Subsetting vectors - exclusion

Excludes elements at the given location.

x = c(8,4,7)
x[-1]
## [1] 4 7
x[-c(1,3)]
## [1] 4

Subsetting vectors - logical values

Returns elements that correspond to TRUE in the logical vector.

x = c(-10,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]
## [1] -10   4  12
x[x > 3]
## [1]  4  7 12
x[x < -2 | x > 4]
## [1] -10   7  12
x[x < -2 & x > 4]
## numeric(0)

Logical operators and comparisons

op       meaning      comp       meaning
x | y    or           x < y      less than
x & y    and          x > y      greater than
!x       not x        x <= y     less than or equal to
%%       mod          x >= y     greater than or equal to
                      x != y     not equal to
                      x == y     equal to
                      x %in% y   x in y
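A few of these operators in use, as a sketch on the same example vector as in the logical subsetting slide:

```r
x = c(-10, 4, 7, 12)
x %% 2                # remainder after division by 2
## [1] 0 0 1 0
x[x %% 2 == 0]        # even values
## [1] -10   4  12
x %in% c(4, 12)       # is each element of x found in the set?
## [1] FALSE  TRUE FALSE  TRUE
x[!(x > 3)]
## [1] -10
x[x != 7 & x > 0]
## [1]  4 12
```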

Subsetting vectors - character values

If the vector has names, select elements whose names correspond to the character vector.

x = c(a=1,b=4,c=7)
x["a"]
## a 
## 1
x[c("b","c")]
## b c 
## 4 7

Subsetting vectors - out of bound subsetting

x = c(1,4,7)
x[4]
## [1] NA
x["a"]
## [1] NA
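Out-of-bounds behavior differs between operators and between reading and writing. The sketch below shows that [ pads with NA, that [[ signals an error (caught here with tryCatch so the script keeps running), and that assigning past the end extends the vector.

```r
x = c(1, 4, 7)
x[c(1, 5)]          # [ returns NA for the missing position
## [1]  1 NA
res = tryCatch(x[[4]], error = function(e) "error")   # [[ fails instead
res
## [1] "error"
x[5] = 9            # assignment past the end extends x, filling the gap with NA
x
## [1]  1  4  7 NA  9
```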

Vectors vs. lists - [ vs. [[

x = c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y = list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
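The difference is easiest to see by checking types: [ always returns the same kind of container, while [[ extracts a single element. A minimal sketch (the list z is a made-up example):

```r
y = list(8, 4, 7)
typeof(y[2])      # [ keeps the container: a list of length 1
## [1] "list"
typeof(y[[2]])    # [[ extracts the element itself
## [1] "double"
length(y[2:3])    # [ can select several elements at once
## [1] 2

z = list(a = 8, b = 4)
identical(z$a, z[["a"]])   # $ is shorthand for [[ with a name
## [1] TRUE
```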

Application Exercise 4

Below are 100 values:

x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1, 
      3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82, 
      21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10, 
      5, 2, 4, 4, 14, 15, 4, 17, 1, 9)

Write down how you would create a subset to accomplish each of the following:

  • Select all observations with values greater than or equal to 40

  • Select all observations with values less than 30 or greater than 50

  • Select all observations with values between 35 and 75

  • Remove all observations with an odd index (e.g. 1, 3, etc.)

Factor Subsetting

(x = factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
x[1:2]
## [1] BS MS
## Levels: BS MS PhD
x[1:2, drop=TRUE]
## [1] BS MS
## Levels: BS MS

Data Frame Subsetting

df = data.frame(a = 1:2, b = 3:4, c = 5:6)
df[1,]
##   a b c
## 1 1 3 5
df[,-2]
##   a c
## 1 1 5
## 2 2 6
df[, c("a","b")]
##   a b
## 1 1 3
## 2 2 4
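Selecting a single column with the matrix-style [ , ] form simplifies the result to a vector unless drop = FALSE is given, and rows can be selected with a logical condition. A short sketch on the same data frame:

```r
df = data.frame(a = 1:2, b = 3:4, c = 5:6)
df[, "a"]                  # single column: simplified to a vector
## [1] 1 2
df[, "a", drop = FALSE]    # keep the data frame structure
##   a
## 1 1
## 2 2
df[["b"]]                  # list-style extraction of one column
## [1] 3 4
df[df$a > 1, ]             # rows where a is greater than 1
##   a b c
## 2 2 4 6
```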

Subsetting and assignment

Subsets can also be used with assignment to update specific values within an object.

x = c(1, 4, 7)
x[2] = 2
x
## [1] 1 2 7
x[1] = x[1] + 1
x
## [1] 2 2 7
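Any of the subsetting styles can appear on the left-hand side of an assignment. The sketch below uses a logical condition and a data frame column; the values are arbitrary.

```r
x = c(1, 4, 7, 12)
x[x > 5] = 0               # replace every element greater than 5
x
## [1] 1 4 0 0
x[c(1, 2)] = c(10, 20)     # replace several positions at once
x
## [1] 10 20  0  0

d = data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
d$b[d$a == 2] = NA         # update one column where another meets a condition
d$b
## [1] 2.5  NA 4.5
```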

Assignment with factors

x = c(1,2,1,3,2,1,2,1,3)
x[x == 1] = "male"
x[x == 2] = "female"
x[x == 3] = "other"
str(x)
##  chr [1:9] "male" "female" "male" "other" "female" "male" ...
x = factor(x, levels = c("male","female","other")); str(x)
##  Factor w/ 3 levels "male","female",..: 1 2 1 3 2 1 2 1 3
y = x[x != "other"]; str(y)
##  Factor w/ 3 levels "male","female",..: 1 2 1 2 1 2 1
w = x[x != "other", drop = TRUE]; str(w)
##  Factor w/ 2 levels "male","female": 1 2 1 2 1 2 1
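As an alternative sketch, the same recoding can be done in one step by passing both levels and labels to factor(), which avoids the intermediate character vector. droplevels() is another way to discard unused levels after subsetting.

```r
x = c(1, 2, 1, 3, 2, 1, 2, 1, 3)
f = factor(x, levels = 1:3, labels = c("male", "female", "other"))
levels(f)
## [1] "male"   "female" "other"
as.integer(f)                      # underlying integer codes
## [1] 1 2 1 3 2 1 2 1 3
g = droplevels(f[f != "other"])    # subset, then drop the unused level
levels(g)
## [1] "male"   "female"
length(g)
## [1] 7
```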
