November 4, 2014
read.csv
header
: a logical value indicating whether the file contains the names of the variables as its first line, default: TRUE
na.strings
: a character vector of strings which are to be interpreted as NA values, default: "NA"
stringsAsFactors
: a logical value indicating whether character vectors should be converted to factors, default: TRUE
For more, see ?read.csv
Go to https://stat.duke.edu/courses/Fall14/sta112.01/data/paris_paintings.html
Click on download data (requires NetID login)
Load the file using the following structure:
pp <- read.csv("[filepath]", na.strings = c("NA","n/a", "", "#NAME?", " ", "Unknown", "X"), stringsAsFactors = FALSE)
dplyr is based on the concepts of functions as verbs that manipulate data frames.
Single table functions / verbs:
filter()
: pick rows matching criteria
slice()
: pick rows using index(es)
select()
: pick columns by name
rename()
: rename specific columns
arrange()
: reorder rows
mutate()
: add new variables
transmute()
: create new data frame with variables
sample_n()
/ sample_frac()
: randomly sample rows
summarise()
: reduce variables to valuesFirst argument is a data frame
Subsequent arguments say what to do with data frame
Always return a data frame
Avoid modify in place
Nested:
f( g( h(x), z=1), y=1 )
Piped:
h(x) %>% g(z=1) %>% g(y=1)
library(dplyr) pp = read.csv("~/Desktop/Teaching/Sta112FS_F14/Data/Paris Paintings/paris_paintings.csv", na.strings = c("NA","n/a", "", "#NAME?", " ", "Unknown", "X"), stringsAsFactors=FALSE) %>% as.data.frame() %>% tbl_df() class(pp)
## [1] "tbl_df" "tbl" "data.frame"
pp
## Source: local data frame [3,393 x 57] ## ## name sale lot dealer year origin_author origin_cat school_pntg ## 1 L1764-2 L1764 2 L 1764 F O F ## 2 L1764-3 L1764 3 L 1764 I O I ## 3 L1764-4 L1764 4 L 1764 NA O D/FL ## 4 L1764-5a L1764 5 L 1764 F O F ## 5 L1764-5b L1764 5 L 1764 F O F ## 6 L1764-6 L1764 6 L 1764 NA O I ## 7 L1764-7a L1764 7 L 1764 F O F ## 8 L1764-7b L1764 7 L 1764 F O F ## 9 L1764-8 L1764 8 L 1764 NA O I ## 10 L1764-9a L1764 9 L 1764 D/FL O D/FL ## .. ... ... ... ... ... ... ... ... ## Variables not shown: diff_origin (int), price (int), count (int), subject ## (chr), authorstandard (chr), artistliving (int), authorstyle (chr), ## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer ## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in ## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape ## (chr), surface (dbl), material (chr), mat (chr), quantity (int), ## nfigures (int), engraved (int), original (int), prevcoll (int), ## othartist (int), paired (int), figures (int), finished (int), lrgfont ## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int), ## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre ## (int), singlefig (int), portrait (int), still_life (int), discauth ## (int), history (int), allegory (int), pastorale (int), other (int)
Return rows with matching conditions.
filter(pp, year > 1770, year < 1773, school_pntg == "D/FL")
## Source: local data frame [118 x 57] ## ## name sale lot dealer year origin_author origin_cat school_pntg ## 1 R1771-9 R1771 9 R 1771 D/FL D/FL D/FL ## 2 R1771-10a R1771 10 R 1771 D/FL D/FL D/FL ## 3 R1771-10b R1771 10 R 1771 D/FL D/FL D/FL ## 4 R1771-11 R1771 11 R 1771 D/FL D/FL D/FL ## 5 R1771-12 R1771 12 R 1771 D/FL D/FL D/FL ## 6 R1771-13 R1771 13 R 1771 NA D/FL D/FL ## 7 R1771-14 R1771 14 R 1771 D/FL D/FL D/FL ## 8 R1771-15 R1771 15 R 1771 D/FL D/FL D/FL ## 9 R1771-16a R1771 16 R 1771 D/FL D/FL D/FL ## 10 R1771-16b R1771 16 R 1771 D/FL D/FL D/FL ## .. ... ... ... ... ... ... ... ... ## Variables not shown: diff_origin (int), price (int), count (int), subject ## (chr), authorstandard (chr), artistliving (int), authorstyle (chr), ## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer ## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in ## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape ## (chr), surface (dbl), material (chr), mat (chr), quantity (int), ## nfigures (int), engraved (int), original (int), prevcoll (int), ## othartist (int), paired (int), figures (int), finished (int), lrgfont ## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int), ## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre ## (int), singlefig (int), portrait (int), still_life (int), discauth ## (int), history (int), allegory (int), pastorale (int), other (int)
filter(pp, school_pntg == "D/FL" | school_pntg == "F")
## Source: local data frame [2,878 x 57] ## ## name sale lot dealer year origin_author origin_cat school_pntg ## 1 L1764-2 L1764 2 L 1764 F O F ## 2 L1764-4 L1764 4 L 1764 NA O D/FL ## 3 L1764-5a L1764 5 L 1764 F O F ## 4 L1764-5b L1764 5 L 1764 F O F ## 5 L1764-7a L1764 7 L 1764 F O F ## 6 L1764-7b L1764 7 L 1764 F O F ## 7 L1764-9a L1764 9 L 1764 D/FL O D/FL ## 8 L1764-9b L1764 9 L 1764 D/FL O D/FL ## 9 L1764-11 L1764 11 L 1764 NA O D/FL ## 10 L1764-12a L1764 12 L 1764 D/FL O D/FL ## .. ... ... ... ... ... ... ... ... ## Variables not shown: diff_origin (int), price (int), count (int), subject ## (chr), authorstandard (chr), artistliving (int), authorstyle (chr), ## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer ## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in ## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape ## (chr), surface (dbl), material (chr), mat (chr), quantity (int), ## nfigures (int), engraved (int), original (int), prevcoll (int), ## othartist (int), paired (int), figures (int), finished (int), lrgfont ## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int), ## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre ## (int), singlefig (int), portrait (int), still_life (int), discauth ## (int), history (int), allegory (int), pastorale (int), other (int)
Select rows by position.
slice(pp, 3:8)
## Source: local data frame [6 x 57] ## ## name sale lot dealer year origin_author origin_cat school_pntg ## 1 L1764-4 L1764 4 L 1764 NA O D/FL ## 2 L1764-5a L1764 5 L 1764 F O F ## 3 L1764-5b L1764 5 L 1764 F O F ## 4 L1764-6 L1764 6 L 1764 NA O I ## 5 L1764-7a L1764 7 L 1764 F O F ## 6 L1764-7b L1764 7 L 1764 F O F ## Variables not shown: diff_origin (int), price (int), count (int), subject ## (chr), authorstandard (chr), artistliving (int), authorstyle (chr), ## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer ## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in ## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape ## (chr), surface (dbl), material (chr), mat (chr), quantity (int), ## nfigures (int), engraved (int), original (int), prevcoll (int), ## othartist (int), paired (int), figures (int), finished (int), lrgfont ## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int), ## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre ## (int), singlefig (int), portrait (int), still_life (int), discauth ## (int), history (int), allegory (int), pastorale (int), other (int)
n()
: The number of observations in the current group
slice(pp, (n()-5):n())
## Source: local data frame [6 x 57] ## ## name sale lot dealer year origin_author origin_cat school_pntg ## 1 R1764-497 R1764 497 R 1764 F O F ## 2 R1764-498 R1764 498 R 1764 F O F ## 3 R1764-499 R1764 499 R 1764 F O F ## 4 R1764-500 R1764 500 R 1764 F O F ## 5 R1764-502a R1764 502 R 1764 F O F ## 6 R1764-502b R1764 502 R 1764 F O F ## Variables not shown: diff_origin (int), price (int), count (int), subject ## (chr), authorstandard (chr), artistliving (int), authorstyle (chr), ## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer ## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in ## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape ## (chr), surface (dbl), material (chr), mat (chr), quantity (int), ## nfigures (int), engraved (int), original (int), prevcoll (int), ## othartist (int), paired (int), figures (int), finished (int), lrgfont ## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int), ## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre ## (int), singlefig (int), portrait (int), still_life (int), discauth ## (int), history (int), allegory (int), pastorale (int), other (int)
Equivalent to pp[(3393-5):3393,]
.
Selects variables with names that meet the conditions given.
select(pp, starts_with("h"))
## Source: local data frame [3,393 x 2] ## ## height_in history ## 1 37 0 ## 2 18 0 ## 3 13 0 ## 4 14 0 ## 5 14 0 ## 6 7 0 ## 7 6 0 ## 8 6 0 ## 9 15 0 ## 10 9 0 ## .. ... ...
select(pp, contains("_in"))
## Source: local data frame [3,393 x 4] ## ## type_intermed height_in width_in diam_in ## 1 NA 37 29.5 NA ## 2 NA 18 14.0 NA ## 3 NA 13 16.0 NA ## 4 NA 14 18.0 NA ## 5 NA 14 18.0 NA ## 6 NA 7 10.0 NA ## 7 NA 6 13.0 NA ## 8 NA 6 13.0 NA ## 9 NA 15 15.0 NA ## 10 NA 9 7.0 NA ## .. ... ... ... ...
select(pp, artistliving:author)
## Source: local data frame [3,393 x 3] ## ## artistliving authorstyle author ## 1 0 NA Grimou ## 2 0 NA Féty ## 3 0 copy after copié d'après Wandeik ## 4 0 NA Hérault ## 5 0 NA Hérault ## 6 0 copy after copié d'après l'Albanne ## 7 0 NA Nicolas Loire ## 8 0 NA Nicolas Loire ## 9 0 copy after copié d'après Paul Veronese ## 10 0 NA Rikaers ## .. ... ... ...
select(pp, -(artistliving:author))
## Source: local data frame [3,393 x 54] ## ## name sale lot dealer year origin_author origin_cat school_pntg ## 1 L1764-2 L1764 2 L 1764 F O F ## 2 L1764-3 L1764 3 L 1764 I O I ## 3 L1764-4 L1764 4 L 1764 NA O D/FL ## 4 L1764-5a L1764 5 L 1764 F O F ## 5 L1764-5b L1764 5 L 1764 F O F ## 6 L1764-6 L1764 6 L 1764 NA O I ## 7 L1764-7a L1764 7 L 1764 F O F ## 8 L1764-7b L1764 7 L 1764 F O F ## 9 L1764-8 L1764 8 L 1764 NA O I ## 10 L1764-9a L1764 9 L 1764 D/FL O D/FL ## .. ... ... ... ... ... ... ... ... ## Variables not shown: diff_origin (int), price (int), count (int), subject ## (chr), authorstandard (chr), winningbidder (chr), winningbiddertype ## (chr), endbuyer (chr), interm (int), type_intermed (chr), height_in ## (dbl), width_in (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd ## (dbl), shape (chr), surface (dbl), material (chr), mat (chr), quantity ## (int), nfigures (int), engraved (int), original (int), prevcoll (int), ## othartist (int), paired (int), figures (int), finished (int), lrgfont ## (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int), ## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre ## (int), singlefig (int), portrait (int), still_life (int), discauth ## (int), history (int), allegory (int), pastorale (int), other (int)
Rename columns, new_name = old_name
rename(pp, lands_all = landsALL)
## Source: local data frame [3,393 x 57] ## ## name sale lot dealer year origin_author origin_cat school_pntg ## 1 L1764-2 L1764 2 L 1764 F O F ## 2 L1764-3 L1764 3 L 1764 I O I ## 3 L1764-4 L1764 4 L 1764 NA O D/FL ## 4 L1764-5a L1764 5 L 1764 F O F ## 5 L1764-5b L1764 5 L 1764 F O F ## 6 L1764-6 L1764 6 L 1764 NA O I ## 7 L1764-7a L1764 7 L 1764 F O F ## 8 L1764-7b L1764 7 L 1764 F O F ## 9 L1764-8 L1764 8 L 1764 NA O I ## 10 L1764-9a L1764 9 L 1764 D/FL O D/FL ## .. ... ... ... ... ... ... ... ... ## Variables not shown: diff_origin (int), price (int), count (int), subject ## (chr), authorstandard (chr), artistliving (int), authorstyle (chr), ## author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer ## (chr), interm (int), type_intermed (chr), height_in (dbl), width_in ## (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape ## (chr), surface (dbl), material (chr), mat (chr), quantity (int), ## nfigures (int), engraved (int), original (int), prevcoll (int), ## othartist (int), paired (int), figures (int), finished (int), lrgfont ## (int), relig (int), lands_all (int), lands_sc (int), lands_figs (int), ## lands_ment (int), arch (int), mytho (int), peasant (int), othgenre ## (int), singlefig (int), portrait (int), still_life (int), discauth ## (int), history (int), allegory (int), pastorale (int), other (int)
Arrange rows by variables, default: ascending order
select(pp, 7:10) %>% arrange(price)
## Source: local data frame [3,393 x 4] ## ## origin_cat school_pntg diff_origin price ## 1 O NA 1 1 ## 2 O NA 1 1 ## 3 O NA 1 1 ## 4 O F 1 1 ## 5 O F 1 1 ## 6 O F 1 1 ## 7 O F 1 1 ## 8 O F 1 1 ## 9 I I 1 1 ## 10 I I 1 1 ## .. ... ... ... ...
Arrange rows by variables
select(pp, 7:10) %>% arrange(desc(price))
## Source: local data frame [3,393 x 4] ## ## origin_cat school_pntg diff_origin price ## 1 D/FL D/FL 0 29000 ## 2 D/FL D/FL 0 25800 ## 3 D/FL D/FL 0 20000 ## 4 D/FL D/FL 0 20000 ## 5 D/FL D/FL 0 18030 ## 6 S S 0 17535 ## 7 D/FL D/FL 0 17210 ## 8 D/FL D/FL 0 16700 ## 9 D/FL D/FL 0 15500 ## 10 D/FL D/FL 0 15000 ## .. ... ... ... ...
Add new variables, preserve existing
median(pp$price)
## [1] 131
select(pp, 7:10) %>% mutate(price_cat = ifelse(price < median(price), "lt median", "gt median"))
## Source: local data frame [3,393 x 5] ## ## origin_cat school_pntg diff_origin price price_cat ## 1 O F 1 360 gt median ## 2 O I 1 6 lt median ## 3 O D/FL 1 12 lt median ## 4 O F 1 6 lt median ## 5 O F 1 6 lt median ## 6 O I 1 9 lt median ## 7 O F 1 12 lt median ## 8 O F 1 12 lt median ## 9 O I 1 24 lt median ## 10 O D/FL 1 6 lt median ## .. ... ... ... ... ...
Add new variables, drop existing
select(pp, 7:10) %>% transmute(price_cat = ifelse(price < median(price), "lt median", "gt median"))
## Source: local data frame [3,393 x 1] ## ## price_cat ## 1 gt median ## 2 lt median ## 3 lt median ## 4 lt median ## 5 lt median ## 6 lt median ## 7 lt median ## 8 lt median ## 9 lt median ## 10 lt median ## .. ...
Sample n rows from a table
select(pp, 1:6) %>% sample_n(10)
## Source: local data frame [10 x 6] ## ## name sale lot dealer year origin_author ## 1 L1768-100 L1768 100 L 1768 F ## 2 R1765-49c R1765 49 R 1765 NA ## 3 L1778b-82 L1778b 82 L 1778 D/FL ## 4 R1776-297 R1776 297 R 1776 D/FL ## 5 L1778b-100 L1778b 100 L 1778 F ## 6 J1777b-19bis J1777b 19bis J 1777 F ## 7 R1773-47 R1773 47 R 1773 D/FL ## 8 R1776-304 R1776 304 R 1776 D/FL ## 9 R1767-186 R1767 186 R 1767 D/FL ## 10 R1767-24b R1767 24 R 1767 I
Sample a certain proportion of the data
select(pp, 1:6) %>% sample_frac(0.001)
## Source: local data frame [3 x 6] ## ## name sale lot dealer year origin_author ## 1 J1768-73a J1768 73 J 1768 F ## 2 L1774-100 L1774 100 L_R 1774 F ## 3 R1764-157a R1764 157 R 1764 F
Summarise multiple values to a single value
summarise(pp, n(), min(year), max(price))
## Source: local data frame [1 x 3] ## ## n() min(year) max(price) ## 1 3393 1764 29000
select(pp, 5:10) %>% group_by(origin_author) %>% summarise(n(), min(year), max(price))
## Source: local data frame [7 x 4] ## ## origin_author n() min(year) max(price) ## 1 A 13 1764 190 ## 2 D/FL 1355 1764 29000 ## 3 F 1318 1764 15000 ## 4 G 61 1767 1900 ## 5 I 348 1764 12012 ## 6 S 20 1764 17535 ## 7 NA 278 1764 561
select(pp, origin_author:winningbiddertype) %>% filter(winningbiddertype == "D") %>% group_by(origin_author) %>% summarise(n = n(), n_school_pntg = n_distinct(school_pntg))
## Source: local data frame [7 x 3] ## ## origin_author n n_school_pntg ## 1 A 4 2 ## 2 D/FL 487 3 ## 3 F 349 3 ## 4 G 29 2 ## 5 I 118 3 ## 6 S 5 1 ## 7 NA 66 4
school_pntg
. You can, if you like, conduct statistical inference, but the main goal of this application exercise is to familiarize yourself with the data and find similarities and differences between the various schools of paintings. Make sure your exploration includes some visualization and some summary.Above materials are derived in part from the following sources: