September 9, 2014

Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1d | Vector (atomic vector) | List (generic vector) |
2d | Matrix | Data Frame |
nd | Array | - |

R has six basic atomic vector types, but for now we'll only focus on the first four:

**logical**, **double**, **integer**, **character**, complex, raw

**logical** - boolean values `TRUE` and `FALSE`

typeof(TRUE)

## [1] "logical"

**character** - character strings

typeof("hello")

## [1] "character"

typeof('world')

## [1] "character"

**double** - floating point numerical values (default numerical type)

typeof(1.335)

## [1] "double"

typeof(7)

## [1] "double"

**integer** - integer numerical values (indicated with an `L` suffix)

typeof( 7L )

## [1] "integer"

typeof( 1:3 )

## [1] "integer"

Vectors can be constructed using the `c()` function.

c(1,2,3)

## [1] 1 2 3

c("Hello", "World!")

## [1] "Hello" "World!"

c(1,c(2, c(3)))

## [1] 1 2 3

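R is a dynamically typed language: it will happily convert (coerce) between the various types without complaint. When values of different types are combined in one atomic vector, all of them are coerced to the most flexible type present, in the order logical < integer < double < character.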

c(1,"Hello")

## [1] "1" "Hello"

c(FALSE, 3L)

## [1] 0 3

c(1.2, 3L)

## [1] 1.2 3.0
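The implicit conversions above can be checked with `typeof()`. The following is a minimal sketch with made-up values; the `as.*()` calls at the end show how to request a conversion explicitly instead of relying on implicit coercion.

```r
# Implicit coercion: the result takes the most flexible type present
# (logical < integer < double < character).
mixed_lgl_int <- c(TRUE, 2L)
mixed_int_dbl <- c(2L, 2.5)
mixed_dbl_chr <- c(2.5, "a")

typeof(mixed_lgl_int)  # "integer"
typeof(mixed_int_dbl)  # "double"
typeof(mixed_dbl_chr)  # "character"

# Explicit coercion with the as.*() family.
as.integer("7")        # 7
as.numeric(TRUE)       # 1
as.character(1.5)      # "1.5"
as.logical(0)          # FALSE
```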

R uses `NA` to represent missing values in its data structures.

typeof(NA)

## [1] "logical"
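A short sketch of how missing values behave in practice, using a made-up vector `x`. It shows `is.na()` for detection, the `na.rm` argument that many summary functions accept, and the typed `NA` constants that exist alongside the default logical `NA`.

```r
x <- c(1, NA, 3)

# Detect missing values; comparing with == does not work, because NA == NA is NA.
is.na(x)                # FALSE  TRUE FALSE
NA == NA                # NA

# Missing values propagate through computations unless removed.
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 2

# Typed NA constants for the other atomic types.
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"
```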

`NaN` - Not a number

`Inf` - Positive infinity

`-Inf` - Negative infinity

pi / 0

## [1] Inf

0 / 0

## [1] NaN

1/0 + 1/0

## [1] Inf

1/0 - 1/0

## [1] NaN

NaN / NA

## [1] NaN

NaN * NA

## [1] NaN
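These special values can be told apart with base R predicates. The sketch below uses a made-up vector `vals`; note that `is.na()` is also `TRUE` for `NaN`, while `is.nan()` is `TRUE` only for `NaN`.

```r
vals <- c(1, NA, NaN, Inf, -Inf)

is.na(vals)        # FALSE  TRUE  TRUE FALSE FALSE
is.nan(vals)       # FALSE FALSE  TRUE FALSE FALSE
is.finite(vals)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(vals)  # FALSE FALSE FALSE  TRUE  TRUE
```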

What is the type of the following vectors? Explain why they have that type.

`c(1, NA+1L, "C")`

`c(1L / 0, NA)`

`c(1:3, 5)`

`c(3L, NaN+1L)`

`c(NA, TRUE)`

Lists are *generic vectors*: 1d and can contain any combination of R objects.

mylist = list("A", 1:4, c(TRUE,FALSE), (1:4)/2)
mylist

## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1]  TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0

str(mylist)

## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2

Lists can even contain other lists, meaning they don't have to be flat.

str( list(1, list(2, list(3))) )

## List of 2
##  $ : num 1
##  $ :List of 2
##   ..$ : num 2
##   ..$ :List of 1
##   .. ..$ : num 3
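As a small illustration of working with a nested list, the sketch below chains `[[` to reach an inner element and uses `unlist()` to flatten the structure. The name `nested` is introduced here only for the example.

```r
nested <- list(1, list(2, list(3)))

# Each [[ ]] steps one level deeper into the structure.
inner <- nested[[2]][[2]][[1]]
inner                    # 3

# unlist() recursively flattens the list into an atomic vector.
flat <- unlist(nested)
flat                     # 1 2 3
```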

Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist = list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)

## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"

names(myotherlist)

## [1] "A" "B" "knock knock"

myotherlist$B

## [1] 1 2 3 4
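Names that are not syntactically valid, such as `knock knock`, need backticks with `$` or a quoted string with `[[`. The sketch below shows both, and also names the elements of an atomic vector; the vector `v` is a made-up example.

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# Access by name: $ needs backticks for names containing spaces.
myotherlist[["A"]]             # "hello"
myotherlist$`knock knock`      # "who's there?"
myotherlist[["knock knock"]]   # "who's there?"

# Atomic vectors can carry names too, and names can be changed in place.
v <- c(first = 1, second = 2)
names(v)                       # "first" "second"
names(v)[2] <- "last"
v[["last"]]                    # 2
```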

A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column, and the elements of the vectors as rows.

Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df = data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)

## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

attributes(df)

## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "data.frame"

class(df$x)

## [1] "integer"

class(df$y)

## [1] "factor"

By default, versions of R before 4.0.0 convert character vectors into factors when they are included in a data frame (from R 4.0.0 on, the default is `stringsAsFactors = FALSE`). Sometimes this is useful, sometimes it isn't; either way it is important to know what type/class you are working with. This behavior can be changed using the `stringsAsFactors` argument.

df = data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
str(df)

## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: chr "a" "b" "c"
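Because a data frame is a list of columns, the usual list tools apply to it. The sketch below sets `stringsAsFactors = FALSE` explicitly so that the result does not depend on the R version.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

is.list(df)        # TRUE
typeof(df)         # "list"
length(df)         # 2, the number of columns
dim(df)            # 3 2, rows then columns

# Columns can be pulled out like list elements.
df[["y"]]          # "a" "b" "c"

# Inspect the class of every column at once.
col_classes <- sapply(df, class)
col_classes        # x: "integer", y: "character"
```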

We can add rows or columns to a data frame using `rbind` and `cbind` respectively.

df = data.frame(x = 1:3, y = c("a","b","c"))
cbind(df, z=TRUE)

##   x y    z
## 1 1 a TRUE
## 2 2 b TRUE
## 3 3 c TRUE

rbind(df, c(4,"b"))

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 b

df1 = data.frame(x = 1:3, y = c("a","b","c"))
df2 = data.frame(m = 3:1, n = c(TRUE,TRUE,FALSE))
df3 = cbind(df1,df2)
df3

##   x y m     n
## 1 1 a 3  TRUE
## 2 2 b 2  TRUE
## 3 3 c 1 FALSE

str(df3)

## 'data.frame': 3 obs. of 4 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ m: int 3 2 1
##  $ n: logi TRUE TRUE FALSE
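One caveat when adding rows: an atomic vector such as `c(4, "b")` is itself coerced to character before `rbind` sees it, which can change column types. A hedged sketch of an alternative is to bind a one-row data frame instead, so each column keeps its type. The names `new_row`, `df_more`, and `df_vec` are introduced only for this example.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

# Binding a one-row data frame preserves the column types.
new_row <- data.frame(x = 4L, y = "b", stringsAsFactors = FALSE)
df_more <- rbind(df, new_row)
str(df_more)       # x stays integer, y stays character

# Binding a mixed atomic vector: inspect the column class afterwards,
# since c(4, "b") is the character vector c("4", "b").
df_vec <- rbind(df, c(4, "b"))
class(df_vec$x)    # typically "character" rather than "integer"
```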

R has several different subsetting operators (`[`, `[[`, and `$`).

The behavior of these operators will depend on the object they are being used with.

There are 4 main data types that can be used to subset:

- Inclusion (positive integers)
- Exclusion (negative integers)
- Logical values
- Character values (names)

Positive integers return the elements at the given locations. Note that R uses a 1-based indexing scheme.

x = c(8,4,7)
x[c(1,3)]

## [1] 8 7

x[c(1,1)]

## [1] 8 8

Negative integers exclude the elements at the given locations.

x = c(8,4,7)
x[-1]

## [1] 4 7

x[-c(1,3)]

## [1] 4

Logical values return the elements that correspond to `TRUE` in the logical vector.

x = c(-10,4,7,12)
x[c(TRUE,TRUE,FALSE,TRUE)]

## [1] -10 4 12

x[x > 3]

## [1] 4 7 12

x[x < -2 | x > 4]

## [1] -10 7 12

x[x < -2 & x > 4]

## numeric(0)
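A few helpers are often used together with logical subsetting. The sketch below reuses the same values of `x` and shows `which()` (positions of `TRUE`), `sum()` (count of `TRUE`), `any()`/`all()`, and the recycling of a logical index that is shorter than the vector.

```r
x <- c(-10, 4, 7, 12)

which(x > 3)         # 2 3 4, the positions where the condition holds
sum(x > 3)           # 3, TRUE counts as 1
any(x < 0)           # TRUE
all(x > 3)           # FALSE

# A shorter logical index is recycled: TRUE FALSE TRUE FALSE.
every_other <- x[c(TRUE, FALSE)]
every_other          # -10 7
```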

op | meaning | comp | meaning |
---|---|---|---|
`x \| y` | or | `x < y` | less than |
`x & y` | and | `x > y` | greater than |
`!x` | not x | `x <= y` | less than or equal to |
`%%` | mod | `x >= y` | greater than or equal to |
 | | `x != y` | not equal to |
 | | `x == y` | equal to |
 | | `x %in% y` | x in y |
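The operators in the table can be combined into subsetting conditions. The following sketch uses a made-up vector `x` of the integers 1 to 6.

```r
x <- 1:6

x %% 2                   # 1 0 1 0 1 0, remainder after division by 2
evens <- x[x %% 2 == 0]
evens                    # 2 4 6

x %in% c(2, 5)           # FALSE TRUE FALSE FALSE TRUE FALSE
others <- x[!(x %in% c(2, 5))]
others                   # 1 3 4 6

picked <- x[x >= 2 & x != 4]
picked                   # 2 3 5 6
```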

If the vector has names, select elements whose names correspond to the character vector.

x = c(a=1,b=4,c=7)
x["a"]

## a
## 1

x[c("b","c")]

## b c
## 4 7

x = c(1,4,7)
x[4]

## [1] NA

x["a"]

## [1] NA

x = c(8,4,7)
x[1]

## [1] 8

x[[1]]

## [1] 8

y = list(8,4,7)
y[2]

## [[1]]
## [1] 4

y[[2]]

## [1] 4
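The difference between the two operators matters most for lists: `[` returns a sub-list, while `[[` extracts the element itself. A minimal sketch, using `try()` to capture the error that arithmetic on a list produces:

```r
y <- list(8, 4, 7)

class(y[2])      # "list", still a list of length 1
class(y[[2]])    # "numeric", the element itself

y[[2]] + 1       # 5

# Arithmetic on the sub-list fails, because a list is not numeric.
result <- try(y[2] + 1, silent = TRUE)
inherits(result, "try-error")   # TRUE
```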

Below are 100 values:

x = c(56, 3, 17, 2, 4, 9, 6, 5, 19, 5, 2, 3, 5, 0, 13, 12, 6, 31, 10, 21, 8, 4, 1, 1, 2, 5, 16, 1, 3, 8, 1, 3, 4, 8, 5, 2, 8, 6, 18, 40, 10, 20, 1, 27, 2, 11, 14, 5, 7, 0, 3, 0, 7, 0, 8, 10, 10, 12, 8, 82, 21, 3, 34, 55, 18, 2, 9, 29, 1, 4, 7, 14, 7, 1, 2, 7, 4, 74, 5, 0, 3, 13, 2, 8, 1, 6, 13, 7, 1, 10, 5, 2, 4, 4, 14, 15, 4, 17, 1, 9)

Write down how you would create a subset to accomplish each of the following:

Select all observations with values greater than or equal to 40

Select all observations with values less than 30 or greater than 50

Select all observations with values between 35 and 75

Remove all observations with an odd index (e.g. 1, 3, etc.)

(x = factor(c("BS", "MS", "PhD", "MS")))

## [1] BS MS PhD MS
## Levels: BS MS PhD

x[1:2]

## [1] BS MS
## Levels: BS MS PhD

x[1:2, drop=TRUE]

## [1] BS MS
## Levels: BS MS

df = data.frame(a = 1:2, b = 3:4, c = 5:6)
df[1,]

##   a b c
## 1 1 3 5

df[,-2]

##   a c
## 1 1 5
## 2 2 6

df[, c("a","b")]

##   a b
## 1 1 3
## 2 2 4
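Two details of data frame subsetting are easy to miss: selecting a single column with `[` simplifies to a vector unless `drop = FALSE` is given, and rows can be filtered with a logical condition. A brief sketch using the same `df`:

```r
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)

# A single column is simplified to a vector by default.
col_vec <- df[, "a"]
col_vec                     # 1 2

# drop = FALSE keeps the result as a one-column data frame.
col_df <- df[, "a", drop = FALSE]
is.data.frame(col_df)       # TRUE

# Rows can be selected with a logical condition on a column.
big_rows <- df[df$a > 1, ]
big_rows                    # the single row with a = 2, b = 4, c = 6
```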

Subsets can also be used with assignment to update specific values within an object.

x = c(1, 4, 7)

x[2] = 2
x

## [1] 1 2 7

x[1] = x[1] + 1
x

## [1] 2 2 7
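Assignment works with every kind of index, not only single positions. The sketch below, with made-up values, replaces elements selected by a logical condition, updates several positions at once, and shows that assigning past the end of a vector pads the gap with `NA`.

```r
x <- c(-3, 5, -1, 8)

# Replace all negative values with zero.
x[x < 0] <- 0
x                  # 0 5 0 8

# Update several positions in one step.
x[c(1, 3)] <- c(10, 30)
x                  # 10 5 30 8

# Assigning beyond the current length extends the vector.
x[6] <- 1
x                  # 10 5 30 8 NA 1
```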

x = c(1,2,1,3,2,1,2,1,3)
x[x == 1] = "male"
x[x == 2] = "female"
x[x == 3] = "other"
str(x)

## chr [1:9] "male" "female" "male" "other" "female" "male" ...

x = factor(x, levels = c("male","female","other")); str(x)

## Factor w/ 3 levels "male","female",..: 1 2 1 3 2 1 2 1 3

y = x[x != "other"]; str(y)

## Factor w/ 3 levels "male","female",..: 1 2 1 2 1 2 1

w = x[x != "other", drop = TRUE]; str(w)

## Factor w/ 2 levels "male","female": 1 2 1 2 1 2 1
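Unused levels can also be removed after the fact with `droplevels()`, which is a base R function. A short sketch with a smaller made-up factor shows why this matters for summaries such as `table()`:

```r
x <- factor(c("male", "female", "other", "male"),
            levels = c("male", "female", "other"))

# Subsetting keeps all levels, even ones that no longer occur.
y <- x[x != "other"]
nlevels(y)               # 3
table(y)                 # "other" is still listed, with a count of 0

# droplevels() removes the unused levels.
z <- droplevels(y)
levels(z)                # "male" "female"
```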

See HW2.

Above materials are derived in part from the following sources:

- Hadley Wickham - Advanced R
- R Language Definition