read.csv
header
: a logical value indicating whether the file contains the names of the variables as its first line, default: TRUE
na.strings
: a character vector of strings which are to be interpreted as NA values, default: "NA"
stringsAsFactors
: a logical value indicating whether character vectors should be converted to factors, default: TRUE
For more, see ?read.csv
Go to https://stat.duke.edu/courses/Fall14/sta112.01/data/paris_paintings.html
Click on download data (requires NetID login)
Load the file using the following structure:
pp <- read.csv("[filepath]",
na.strings = c("NA","n/a", "", "#NAME?", " ", "Unknown", "X"),
stringsAsFactors = FALSE)
dplyr is based on the concepts of functions as verbs that manipulate data frames.
Single table functions / verbs:
filter()
: pick rows matching criteria
slice()
: pick rows using index(es)
select()
: pick columns by name
rename()
: rename specific columns
arrange()
: reorder rows
mutate()
: add new variables
transmute()
: create new data frame with variables
sample_n()
/ sample_frac()
: randomly sample rows
summarise()
: reduce variables to valuesFirst argument is a data frame
Subsequent arguments say what to do with data frame
Always return a data frame
Avoid modify in place
Nested:
f( g( h(x), z=1), y=1 )
Piped:
h(x) %>% g(z=1) %>% g(y=1)
library(dplyr)
pp = read.csv("~/Desktop/Teaching/Sta112FS_F14/Data/Paris Paintings/paris_paintings.csv",
na.strings = c("NA","n/a", "", "#NAME?", " ", "Unknown", "X"),
stringsAsFactors=FALSE) %>%
as.data.frame() %>%
tbl_df()
class(pp)
## [1] "tbl_df" "tbl" "data.frame"
pp
## Source: local data frame [3,393 x 57]
##
## name sale lot dealer year origin_author origin_cat school_pntg
## 1 L1764-2 L1764 2 L 1764 F O F
## 2 L1764-3 L1764 3 L 1764 I O I
## 3 L1764-4 L1764 4 L 1764 NA O D/FL
## 4 L1764-5a L1764 5 L 1764 F O F
## 5 L1764-5b L1764 5 L 1764 F O F
## 6 L1764-6 L1764 6 L 1764 NA O I
## 7 L1764-7a L1764 7 L 1764 F O F
## 8 L1764-7b L1764 7 L 1764 F O F
## 9 L1764-8 L1764 8 L 1764 NA O I
## 10 L1764-9a L1764 9 L 1764 D/FL O D/FL
## .. ... ... ... ... ... ... ... ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
## (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
## (chr), surface (dbl), material (chr), mat (chr), quantity (int),
## nfigures (int), engraved (int), original (int), prevcoll (int),
## othartist (int), paired (int), figures (int), finished (int), lrgfont
## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
## (int), singlefig (int), portrait (int), still_life (int), discauth
## (int), history (int), allegory (int), pastorale (int), other (int)
Return rows with matching conditions.
filter(pp, year > 1770, year < 1773, school_pntg == "D/FL")
## Source: local data frame [118 x 57]
##
## name sale lot dealer year origin_author origin_cat school_pntg
## 1 R1771-9 R1771 9 R 1771 D/FL D/FL D/FL
## 2 R1771-10a R1771 10 R 1771 D/FL D/FL D/FL
## 3 R1771-10b R1771 10 R 1771 D/FL D/FL D/FL
## 4 R1771-11 R1771 11 R 1771 D/FL D/FL D/FL
## 5 R1771-12 R1771 12 R 1771 D/FL D/FL D/FL
## 6 R1771-13 R1771 13 R 1771 NA D/FL D/FL
## 7 R1771-14 R1771 14 R 1771 D/FL D/FL D/FL
## 8 R1771-15 R1771 15 R 1771 D/FL D/FL D/FL
## 9 R1771-16a R1771 16 R 1771 D/FL D/FL D/FL
## 10 R1771-16b R1771 16 R 1771 D/FL D/FL D/FL
## .. ... ... ... ... ... ... ... ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
## (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
## (chr), surface (dbl), material (chr), mat (chr), quantity (int),
## nfigures (int), engraved (int), original (int), prevcoll (int),
## othartist (int), paired (int), figures (int), finished (int), lrgfont
## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
## (int), singlefig (int), portrait (int), still_life (int), discauth
## (int), history (int), allegory (int), pastorale (int), other (int)
filter(pp, school_pntg == "D/FL" | school_pntg == "F")
## Source: local data frame [2,878 x 57]
##
## name sale lot dealer year origin_author origin_cat school_pntg
## 1 L1764-2 L1764 2 L 1764 F O F
## 2 L1764-4 L1764 4 L 1764 NA O D/FL
## 3 L1764-5a L1764 5 L 1764 F O F
## 4 L1764-5b L1764 5 L 1764 F O F
## 5 L1764-7a L1764 7 L 1764 F O F
## 6 L1764-7b L1764 7 L 1764 F O F
## 7 L1764-9a L1764 9 L 1764 D/FL O D/FL
## 8 L1764-9b L1764 9 L 1764 D/FL O D/FL
## 9 L1764-11 L1764 11 L 1764 NA O D/FL
## 10 L1764-12a L1764 12 L 1764 D/FL O D/FL
## .. ... ... ... ... ... ... ... ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
## (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
## (chr), surface (dbl), material (chr), mat (chr), quantity (int),
## nfigures (int), engraved (int), original (int), prevcoll (int),
## othartist (int), paired (int), figures (int), finished (int), lrgfont
## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
## (int), singlefig (int), portrait (int), still_life (int), discauth
## (int), history (int), allegory (int), pastorale (int), other (int)
Select rows by position.
slice(pp, 3:8)
## Source: local data frame [6 x 57]
##
## name sale lot dealer year origin_author origin_cat school_pntg
## 1 L1764-4 L1764 4 L 1764 NA O D/FL
## 2 L1764-5a L1764 5 L 1764 F O F
## 3 L1764-5b L1764 5 L 1764 F O F
## 4 L1764-6 L1764 6 L 1764 NA O I
## 5 L1764-7a L1764 7 L 1764 F O F
## 6 L1764-7b L1764 7 L 1764 F O F
## Variables not shown: diff_origin (int), price (int), count (int), subject
## (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
## (chr), surface (dbl), material (chr), mat (chr), quantity (int),
## nfigures (int), engraved (int), original (int), prevcoll (int),
## othartist (int), paired (int), figures (int), finished (int), lrgfont
## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
## (int), singlefig (int), portrait (int), still_life (int), discauth
## (int), history (int), allegory (int), pastorale (int), other (int)
n()
: The number of observations in the current group
slice(pp, (n()-5):n())
## Source: local data frame [6 x 57]
##
## name sale lot dealer year origin_author origin_cat school_pntg
## 1 R1764-497 R1764 497 R 1764 F O F
## 2 R1764-498 R1764 498 R 1764 F O F
## 3 R1764-499 R1764 499 R 1764 F O F
## 4 R1764-500 R1764 500 R 1764 F O F
## 5 R1764-502a R1764 502 R 1764 F O F
## 6 R1764-502b R1764 502 R 1764 F O F
## Variables not shown: diff_origin (int), price (int), count (int), subject
## (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
## (chr), surface (dbl), material (chr), mat (chr), quantity (int),
## nfigures (int), engraved (int), original (int), prevcoll (int),
## othartist (int), paired (int), figures (int), finished (int), lrgfont
## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
## (int), singlefig (int), portrait (int), still_life (int), discauth
## (int), history (int), allegory (int), pastorale (int), other (int)
Equivalent to pp[(3393-5):3393,]
.
Selects variables with names that meet the conditions given.
select(pp, starts_with("h"))
## Source: local data frame [3,393 x 2]
##
## height_in history
## 1 37 0
## 2 18 0
## 3 13 0
## 4 14 0
## 5 14 0
## 6 7 0
## 7 6 0
## 8 6 0
## 9 15 0
## 10 9 0
## .. ... ...
select(pp, contains("_in"))
## Source: local data frame [3,393 x 4]
##
## type_intermed height_in width_in diam_in
## 1 NA 37 29.5 NA
## 2 NA 18 14.0 NA
## 3 NA 13 16.0 NA
## 4 NA 14 18.0 NA
## 5 NA 14 18.0 NA
## 6 NA 7 10.0 NA
## 7 NA 6 13.0 NA
## 8 NA 6 13.0 NA
## 9 NA 15 15.0 NA
## 10 NA 9 7.0 NA
## .. ... ... ... ...
select(pp, artistliving:author)
## Source: local data frame [3,393 x 3]
##
## artistliving authorstyle author
## 1 0 NA Grimou
## 2 0 NA Féty
## 3 0 copy after copié d'après Wandeik
## 4 0 NA Hérault
## 5 0 NA Hérault
## 6 0 copy after copié d'après l'Albanne
## 7 0 NA Nicolas Loire
## 8 0 NA Nicolas Loire
## 9 0 copy after copié d'après Paul Veronese
## 10 0 NA Rikaers
## .. ... ... ...
select(pp, -(artistliving:author))
## Source: local data frame [3,393 x 54]
##
## name sale lot dealer year origin_author origin_cat school_pntg
## 1 L1764-2 L1764 2 L 1764 F O F
## 2 L1764-3 L1764 3 L 1764 I O I
## 3 L1764-4 L1764 4 L 1764 NA O D/FL
## 4 L1764-5a L1764 5 L 1764 F O F
## 5 L1764-5b L1764 5 L 1764 F O F
## 6 L1764-6 L1764 6 L 1764 NA O I
## 7 L1764-7a L1764 7 L 1764 F O F
## 8 L1764-7b L1764 7 L 1764 F O F
## 9 L1764-8 L1764 8 L 1764 NA O I
## 10 L1764-9a L1764 9 L 1764 D/FL O D/FL
## .. ... ... ... ... ... ... ... ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
## (chr), authorstandard (chr), winningbidder (chr), winningbiddertype
## (chr), endbuyer (chr), interm (int), type_intermed (chr), height_in
## (dbl), width_in (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd
## (dbl), shape (chr), surface (dbl), material (chr), mat (chr), quantity
## (int), nfigures (int), engraved (int), original (int), prevcoll (int),
## othartist (int), paired (int), figures (int), finished (int), lrgfont
## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
## (int), singlefig (int), portrait (int), still_life (int), discauth
## (int), history (int), allegory (int), pastorale (int), other (int)
Rename columns, new_name = old_name
rename(pp, lands_all = landsALL)
## Source: local data frame [3,393 x 57]
##
## name sale lot dealer year origin_author origin_cat school_pntg
## 1 L1764-2 L1764 2 L 1764 F O F
## 2 L1764-3 L1764 3 L 1764 I O I
## 3 L1764-4 L1764 4 L 1764 NA O D/FL
## 4 L1764-5a L1764 5 L 1764 F O F
## 5 L1764-5b L1764 5 L 1764 F O F
## 6 L1764-6 L1764 6 L 1764 NA O I
## 7 L1764-7a L1764 7 L 1764 F O F
## 8 L1764-7b L1764 7 L 1764 F O F
## 9 L1764-8 L1764 8 L 1764 NA O I
## 10 L1764-9a L1764 9 L 1764 D/FL O D/FL
## .. ... ... ... ... ... ... ... ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
## (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
## (chr), surface (dbl), material (chr), mat (chr), quantity (int),
## nfigures (int), engraved (int), original (int), prevcoll (int),
## othartist (int), paired (int), figures (int), finished (int), lrgfont
## (int), relig (int), lands_all (int), lands_sc (int), lands_figs (int),
## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
## (int), singlefig (int), portrait (int), still_life (int), discauth
## (int), history (int), allegory (int), pastorale (int), other (int)
Arrange rows by variables, default: ascending order
select(pp, 7:10) %>%
arrange(price)
## Source: local data frame [3,393 x 4]
##
## origin_cat school_pntg diff_origin price
## 1 O NA 1 1
## 2 O NA 1 1
## 3 O NA 1 1
## 4 O F 1 1
## 5 O F 1 1
## 6 O F 1 1
## 7 O F 1 1
## 8 O F 1 1
## 9 I I 1 1
## 10 I I 1 1
## .. ... ... ... ...
Arrange rows by variables
select(pp, 7:10) %>%
arrange(desc(price))
## Source: local data frame [3,393 x 4]
##
## origin_cat school_pntg diff_origin price
## 1 D/FL D/FL 0 29000
## 2 D/FL D/FL 0 25800
## 3 D/FL D/FL 0 20000
## 4 D/FL D/FL 0 20000
## 5 D/FL D/FL 0 18030
## 6 S S 0 17535
## 7 D/FL D/FL 0 17210
## 8 D/FL D/FL 0 16700
## 9 D/FL D/FL 0 15500
## 10 D/FL D/FL 0 15000
## .. ... ... ... ...
Add new variables, preserve existing
median(pp$price)
## [1] 131
select(pp, 7:10) %>%
mutate(price_cat = ifelse(price < median(price), "lt median", "gt median"))
## Source: local data frame [3,393 x 5]
##
## origin_cat school_pntg diff_origin price price_cat
## 1 O F 1 360 gt median
## 2 O I 1 6 lt median
## 3 O D/FL 1 12 lt median
## 4 O F 1 6 lt median
## 5 O F 1 6 lt median
## 6 O I 1 9 lt median
## 7 O F 1 12 lt median
## 8 O F 1 12 lt median
## 9 O I 1 24 lt median
## 10 O D/FL 1 6 lt median
## .. ... ... ... ... ...
Add new variables, drop existing
select(pp, 7:10) %>%
transmute(price_cat = ifelse(price < median(price), "lt median", "gt median"))
## Source: local data frame [3,393 x 1]
##
## price_cat
## 1 gt median
## 2 lt median
## 3 lt median
## 4 lt median
## 5 lt median
## 6 lt median
## 7 lt median
## 8 lt median
## 9 lt median
## 10 lt median
## .. ...
Sample n rows from a table
select(pp, 1:6) %>%
sample_n(10)
## Source: local data frame [10 x 6]
##
## name sale lot dealer year origin_author
## 1 R1770-20b R1770 20 R 1770 D/FL
## 2 P1780-44b P1780 44 P 1780 D/FL
## 3 P1780-55a P1780 55 P 1780 D/FL
## 4 L1768-83 L1768 83 L 1768 F
## 5 R1767-235b R1767 235 R 1767 F
## 6 R1776-117a R1776 117 R 1776 D/FL
## 7 J1777-34d J1777 34 J 1777 F
## 8 R1777-175b R1777 175 R 1777 I
## 9 L1778b-151 L1778b 151 L 1778 NA
## 10 J1777-43b J1777 43 J 1777 F
Sample a certain proportion of the data
select(pp, 1:6) %>%
sample_frac(0.001)
## Source: local data frame [3 x 6]
##
## name sale lot dealer year origin_author
## 1 R1776-198 R1776 198 R 1776 F
## 2 L1764-4 L1764 4 L 1764 NA
## 3 J1776-7 J1776 7 J 1776 I
Summarise multiple values to a single value
summarise(pp, n(), min(year), max(price))
## Source: local data frame [1 x 3]
##
## n() min(year) max(price)
## 1 3393 1764 29000
select(pp, 5:10) %>%
group_by(origin_author) %>%
summarise(n(), min(year), max(price))
## Source: local data frame [7 x 4]
##
## origin_author n() min(year) max(price)
## 1 A 13 1764 190
## 2 D/FL 1355 1764 29000
## 3 F 1318 1764 15000
## 4 G 61 1767 1900
## 5 I 348 1764 12012
## 6 S 20 1764 17535
## 7 NA 278 1764 561
select(pp, origin_author:winningbiddertype) %>%
filter(winningbiddertype == "D") %>%
group_by(origin_author) %>%
summarise(n = n(), n_school_pntg = n_distinct(school_pntg))
## Source: local data frame [7 x 3]
##
## origin_author n n_school_pntg
## 1 A 4 2
## 2 D/FL 487 3
## 3 F 349 3
## 4 G 29 2
## 5 I 118 3
## 6 S 5 1
## 7 NA 66 4
school_pntg
. You can, if you like, conduct statistical inference, but the main goal of this application exercise is to familiarize yourself with the data and find similarities and differences between the various schools of paintings. Make sure your exploration includes some visualization and some summary.
Above materials are derived in part from the following sources: