Paris Paintings

Additional arguments for read.csv

  • header: a logical value indicating whether the file contains the names of the variables as its first line, default: TRUE

  • na.strings: a character vector of strings which are to be interpreted as NA values, default: "NA"

  • stringsAsFactors: a logical value indicating whether character vectors should be converted to factors, default: TRUE

  • For more, see ?read.csv

Loading the data

pp <- read.csv("[filepath]", 
               na.strings = c("NA","n/a", "", "#NAME?", " ", "Unknown", "X"), 
               stringsAsFactors = FALSE)

dplyr

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.

Single table functions / verbs:

  • filter(): pick rows matching criteria

  • slice(): pick rows using index(es)

  • select(): pick columns by name

  • rename(): rename specific columns

  • arrange(): reorder rows

  • mutate(): add new variables

  • transmute(): create new data frame with variables

  • sample_n() / sample_frac(): randomly sample rows

  • summarise(): reduce variables to values

dplyr function rules

  • First argument is a data frame

  • Subsequent arguments say what to do with data frame

  • Always return a data frame

  • Avoid modify in place

magrittr magritte


Nested:

f( g( h(x), z=1), y=1 )


Piped:

h(x) %>% g(z=1) %>% g(y=1)

Read data

library(dplyr)

pp = read.csv("~/Desktop/Teaching/Sta112FS_F14/Data/Paris Paintings/paris_paintings.csv",
              na.strings = c("NA","n/a", "", "#NAME?", " ", "Unknown", "X"),   
              stringsAsFactors=FALSE) %>% 
       as.data.frame() %>%
       tbl_df()

class(pp)
## [1] "tbl_df"     "tbl"        "data.frame"

View data

pp
## Source: local data frame [3,393 x 57]
## 
##        name  sale lot dealer year origin_author origin_cat school_pntg
## 1   L1764-2 L1764   2      L 1764             F          O           F
## 2   L1764-3 L1764   3      L 1764             I          O           I
## 3   L1764-4 L1764   4      L 1764            NA          O        D/FL
## 4  L1764-5a L1764   5      L 1764             F          O           F
## 5  L1764-5b L1764   5      L 1764             F          O           F
## 6   L1764-6 L1764   6      L 1764            NA          O           I
## 7  L1764-7a L1764   7      L 1764             F          O           F
## 8  L1764-7b L1764   7      L 1764             F          O           F
## 9   L1764-8 L1764   8      L 1764            NA          O           I
## 10 L1764-9a L1764   9      L 1764          D/FL          O        D/FL
## ..      ...   ... ...    ...  ...           ...        ...         ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
##   (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
##   author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
##   (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
##   (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
##   (chr), surface (dbl), material (chr), mat (chr), quantity (int),
##   nfigures (int), engraved (int), original (int), prevcoll (int),
##   othartist (int), paired (int), figures (int), finished (int), lrgfont
##   (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
##   lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
##   (int), singlefig (int), portrait (int), still_life (int), discauth
##   (int), history (int), allegory (int), pastorale (int), other (int)

filter()

Return rows with matching conditions.

filter(pp, year > 1770, year < 1773, school_pntg == "D/FL")
## Source: local data frame [118 x 57]
## 
##         name  sale lot dealer year origin_author origin_cat school_pntg
## 1    R1771-9 R1771   9      R 1771          D/FL       D/FL        D/FL
## 2  R1771-10a R1771  10      R 1771          D/FL       D/FL        D/FL
## 3  R1771-10b R1771  10      R 1771          D/FL       D/FL        D/FL
## 4   R1771-11 R1771  11      R 1771          D/FL       D/FL        D/FL
## 5   R1771-12 R1771  12      R 1771          D/FL       D/FL        D/FL
## 6   R1771-13 R1771  13      R 1771            NA       D/FL        D/FL
## 7   R1771-14 R1771  14      R 1771          D/FL       D/FL        D/FL
## 8   R1771-15 R1771  15      R 1771          D/FL       D/FL        D/FL
## 9  R1771-16a R1771  16      R 1771          D/FL       D/FL        D/FL
## 10 R1771-16b R1771  16      R 1771          D/FL       D/FL        D/FL
## ..       ...   ... ...    ...  ...           ...        ...         ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
##   (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
##   author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
##   (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
##   (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
##   (chr), surface (dbl), material (chr), mat (chr), quantity (int),
##   nfigures (int), engraved (int), original (int), prevcoll (int),
##   othartist (int), paired (int), figures (int), finished (int), lrgfont
##   (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
##   lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
##   (int), singlefig (int), portrait (int), still_life (int), discauth
##   (int), history (int), allegory (int), pastorale (int), other (int)

filter(pp, school_pntg == "D/FL" | school_pntg == "F")
## Source: local data frame [2,878 x 57]
## 
##         name  sale lot dealer year origin_author origin_cat school_pntg
## 1    L1764-2 L1764   2      L 1764             F          O           F
## 2    L1764-4 L1764   4      L 1764            NA          O        D/FL
## 3   L1764-5a L1764   5      L 1764             F          O           F
## 4   L1764-5b L1764   5      L 1764             F          O           F
## 5   L1764-7a L1764   7      L 1764             F          O           F
## 6   L1764-7b L1764   7      L 1764             F          O           F
## 7   L1764-9a L1764   9      L 1764          D/FL          O        D/FL
## 8   L1764-9b L1764   9      L 1764          D/FL          O        D/FL
## 9   L1764-11 L1764  11      L 1764            NA          O        D/FL
## 10 L1764-12a L1764  12      L 1764          D/FL          O        D/FL
## ..       ...   ... ...    ...  ...           ...        ...         ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
##   (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
##   author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
##   (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
##   (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
##   (chr), surface (dbl), material (chr), mat (chr), quantity (int),
##   nfigures (int), engraved (int), original (int), prevcoll (int),
##   othartist (int), paired (int), figures (int), finished (int), lrgfont
##   (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
##   lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
##   (int), singlefig (int), portrait (int), still_life (int), discauth
##   (int), history (int), allegory (int), pastorale (int), other (int)

slice()

Select rows by position.

slice(pp, 3:8)
## Source: local data frame [6 x 57]
## 
##       name  sale lot dealer year origin_author origin_cat school_pntg
## 1  L1764-4 L1764   4      L 1764            NA          O        D/FL
## 2 L1764-5a L1764   5      L 1764             F          O           F
## 3 L1764-5b L1764   5      L 1764             F          O           F
## 4  L1764-6 L1764   6      L 1764            NA          O           I
## 5 L1764-7a L1764   7      L 1764             F          O           F
## 6 L1764-7b L1764   7      L 1764             F          O           F
## Variables not shown: diff_origin (int), price (int), count (int), subject
##   (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
##   author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
##   (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
##   (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
##   (chr), surface (dbl), material (chr), mat (chr), quantity (int),
##   nfigures (int), engraved (int), original (int), prevcoll (int),
##   othartist (int), paired (int), figures (int), finished (int), lrgfont
##   (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
##   lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
##   (int), singlefig (int), portrait (int), still_life (int), discauth
##   (int), history (int), allegory (int), pastorale (int), other (int)

n(): The number of observations in the current group

slice(pp, (n()-5):n())
## Source: local data frame [6 x 57]
## 
##         name  sale lot dealer year origin_author origin_cat school_pntg
## 1  R1764-497 R1764 497      R 1764             F          O           F
## 2  R1764-498 R1764 498      R 1764             F          O           F
## 3  R1764-499 R1764 499      R 1764             F          O           F
## 4  R1764-500 R1764 500      R 1764             F          O           F
## 5 R1764-502a R1764 502      R 1764             F          O           F
## 6 R1764-502b R1764 502      R 1764             F          O           F
## Variables not shown: diff_origin (int), price (int), count (int), subject
##   (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
##   author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
##   (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
##   (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
##   (chr), surface (dbl), material (chr), mat (chr), quantity (int),
##   nfigures (int), engraved (int), original (int), prevcoll (int),
##   othartist (int), paired (int), figures (int), finished (int), lrgfont
##   (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
##   lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
##   (int), singlefig (int), portrait (int), still_life (int), discauth
##   (int), history (int), allegory (int), pastorale (int), other (int)

Equivalent to pp[(3393-5):3393,].

select()

Selects variables with names that meet the conditions given.

select(pp, starts_with("h"))
## Source: local data frame [3,393 x 2]
## 
##    height_in history
## 1         37       0
## 2         18       0
## 3         13       0
## 4         14       0
## 5         14       0
## 6          7       0
## 7          6       0
## 8          6       0
## 9         15       0
## 10         9       0
## ..       ...     ...
select(pp, contains("_in"))
## Source: local data frame [3,393 x 4]
## 
##    type_intermed height_in width_in diam_in
## 1             NA        37     29.5      NA
## 2             NA        18     14.0      NA
## 3             NA        13     16.0      NA
## 4             NA        14     18.0      NA
## 5             NA        14     18.0      NA
## 6             NA         7     10.0      NA
## 7             NA         6     13.0      NA
## 8             NA         6     13.0      NA
## 9             NA        15     15.0      NA
## 10            NA         9      7.0      NA
## ..           ...       ...      ...     ...

select(pp, artistliving:author)
## Source: local data frame [3,393 x 3]
## 
##    artistliving authorstyle                      author
## 1             0          NA                      Grimou
## 2             0          NA                        Féty
## 3             0  copy after       copié d'après Wandeik
## 4             0          NA                     Hérault
## 5             0          NA                     Hérault
## 6             0  copy after     copié d'après l'Albanne
## 7             0          NA               Nicolas Loire
## 8             0          NA               Nicolas Loire
## 9             0  copy after copié d'après Paul Veronese
## 10            0          NA                     Rikaers
## ..          ...         ...                         ...

select(pp, -(artistliving:author))
## Source: local data frame [3,393 x 54]
## 
##        name  sale lot dealer year origin_author origin_cat school_pntg
## 1   L1764-2 L1764   2      L 1764             F          O           F
## 2   L1764-3 L1764   3      L 1764             I          O           I
## 3   L1764-4 L1764   4      L 1764            NA          O        D/FL
## 4  L1764-5a L1764   5      L 1764             F          O           F
## 5  L1764-5b L1764   5      L 1764             F          O           F
## 6   L1764-6 L1764   6      L 1764            NA          O           I
## 7  L1764-7a L1764   7      L 1764             F          O           F
## 8  L1764-7b L1764   7      L 1764             F          O           F
## 9   L1764-8 L1764   8      L 1764            NA          O           I
## 10 L1764-9a L1764   9      L 1764          D/FL          O        D/FL
## ..      ...   ... ...    ...  ...           ...        ...         ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
##   (chr), authorstandard (chr), winningbidder (chr), winningbiddertype
##   (chr), endbuyer (chr), interm (int), type_intermed (chr), height_in
##   (dbl), width_in (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd
##   (dbl), shape (chr), surface (dbl), material (chr), mat (chr), quantity
##   (int), nfigures (int), engraved (int), original (int), prevcoll (int),
##   othartist (int), paired (int), figures (int), finished (int), lrgfont
##   (int), relig (int), landsALL (int), lands_sc (int), lands_figs (int),
##   lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
##   (int), singlefig (int), portrait (int), still_life (int), discauth
##   (int), history (int), allegory (int), pastorale (int), other (int)

rename()

Rename columns, new_name = old_name

rename(pp, lands_all = landsALL)
## Source: local data frame [3,393 x 57]
## 
##        name  sale lot dealer year origin_author origin_cat school_pntg
## 1   L1764-2 L1764   2      L 1764             F          O           F
## 2   L1764-3 L1764   3      L 1764             I          O           I
## 3   L1764-4 L1764   4      L 1764            NA          O        D/FL
## 4  L1764-5a L1764   5      L 1764             F          O           F
## 5  L1764-5b L1764   5      L 1764             F          O           F
## 6   L1764-6 L1764   6      L 1764            NA          O           I
## 7  L1764-7a L1764   7      L 1764             F          O           F
## 8  L1764-7b L1764   7      L 1764             F          O           F
## 9   L1764-8 L1764   8      L 1764            NA          O           I
## 10 L1764-9a L1764   9      L 1764          D/FL          O        D/FL
## ..      ...   ... ...    ...  ...           ...        ...         ...
## Variables not shown: diff_origin (int), price (int), count (int), subject
##   (chr), authorstandard (chr), artistliving (int), authorstyle (chr),
##   author (chr), winningbidder (chr), winningbiddertype (chr), endbuyer
##   (chr), interm (int), type_intermed (chr), height_in (dbl), width_in
##   (dbl), surface_rect (dbl), diam_in (dbl), surface_rnd (dbl), shape
##   (chr), surface (dbl), material (chr), mat (chr), quantity (int),
##   nfigures (int), engraved (int), original (int), prevcoll (int),
##   othartist (int), paired (int), figures (int), finished (int), lrgfont
##   (int), relig (int), lands_all (int), lands_sc (int), lands_figs (int),
##   lands_ment (int), arch (int), mytho (int), peasant (int), othgenre
##   (int), singlefig (int), portrait (int), still_life (int), discauth
##   (int), history (int), allegory (int), pastorale (int), other (int)

arrange()

Arrange rows by variables, default: ascending order

select(pp, 7:10) %>% 
  arrange(price)
## Source: local data frame [3,393 x 4]
## 
##    origin_cat school_pntg diff_origin price
## 1           O          NA           1     1
## 2           O          NA           1     1
## 3           O          NA           1     1
## 4           O           F           1     1
## 5           O           F           1     1
## 6           O           F           1     1
## 7           O           F           1     1
## 8           O           F           1     1
## 9           I           I           1     1
## 10          I           I           1     1
## ..        ...         ...         ...   ...

arrange()

Arrange rows by variables

select(pp, 7:10) %>% 
  arrange(desc(price))
## Source: local data frame [3,393 x 4]
## 
##    origin_cat school_pntg diff_origin price
## 1        D/FL        D/FL           0 29000
## 2        D/FL        D/FL           0 25800
## 3        D/FL        D/FL           0 20000
## 4        D/FL        D/FL           0 20000
## 5        D/FL        D/FL           0 18030
## 6           S           S           0 17535
## 7        D/FL        D/FL           0 17210
## 8        D/FL        D/FL           0 16700
## 9        D/FL        D/FL           0 15500
## 10       D/FL        D/FL           0 15000
## ..        ...         ...         ...   ...

mutate()

Add new variables, preserve existing

median(pp$price)
## [1] 131
select(pp, 7:10) %>% 
  mutate(price_cat = ifelse(price < median(price), "lt median", "gt median"))
## Source: local data frame [3,393 x 5]
## 
##    origin_cat school_pntg diff_origin price price_cat
## 1           O           F           1   360 gt median
## 2           O           I           1     6 lt median
## 3           O        D/FL           1    12 lt median
## 4           O           F           1     6 lt median
## 5           O           F           1     6 lt median
## 6           O           I           1     9 lt median
## 7           O           F           1    12 lt median
## 8           O           F           1    12 lt median
## 9           O           I           1    24 lt median
## 10          O        D/FL           1     6 lt median
## ..        ...         ...         ...   ...       ...

transmute()

Add new variables, drop existing

select(pp, 7:10) %>% 
  transmute(price_cat = ifelse(price < median(price), "lt median", "gt median"))
## Source: local data frame [3,393 x 1]
## 
##    price_cat
## 1  gt median
## 2  lt median
## 3  lt median
## 4  lt median
## 5  lt median
## 6  lt median
## 7  lt median
## 8  lt median
## 9  lt median
## 10 lt median
## ..       ...

sample_n()

Sample n rows from a table

select(pp, 1:6) %>% 
  sample_n(10)
## Source: local data frame [10 x 6]
## 
##          name   sale lot dealer year origin_author
## 1   R1770-20b  R1770  20      R 1770          D/FL
## 2   P1780-44b  P1780  44      P 1780          D/FL
## 3   P1780-55a  P1780  55      P 1780          D/FL
## 4    L1768-83  L1768  83      L 1768             F
## 5  R1767-235b  R1767 235      R 1767             F
## 6  R1776-117a  R1776 117      R 1776          D/FL
## 7   J1777-34d  J1777  34      J 1777             F
## 8  R1777-175b  R1777 175      R 1777             I
## 9  L1778b-151 L1778b 151      L 1778            NA
## 10  J1777-43b  J1777  43      J 1777             F

sample_frac()

Sample a certain proportion of the data

select(pp, 1:6) %>% 
  sample_frac(0.001)
## Source: local data frame [3 x 6]
## 
##        name  sale lot dealer year origin_author
## 1 R1776-198 R1776 198      R 1776             F
## 2   L1764-4 L1764   4      L 1764            NA
## 3   J1776-7 J1776   7      J 1776             I

summarise()

Summarise multiple values to a single value

summarise(pp, n(), min(year), max(price))
## Source: local data frame [1 x 3]
## 
##    n() min(year) max(price)
## 1 3393      1764      29000

summarise() & group_by()

select(pp, 5:10) %>% 
  group_by(origin_author) %>%
  summarise(n(), min(year), max(price))
## Source: local data frame [7 x 4]
## 
##   origin_author  n() min(year) max(price)
## 1             A   13      1764        190
## 2          D/FL 1355      1764      29000
## 3             F 1318      1764      15000
## 4             G   61      1767       1900
## 5             I  348      1764      12012
## 6             S   20      1764      17535
## 7            NA  278      1764        561

select(pp, origin_author:winningbiddertype) %>% 
  filter(winningbiddertype == "D") %>%
  group_by(origin_author) %>%
  summarise(n = n(), n_school_pntg = n_distinct(school_pntg))
## Source: local data frame [7 x 3]
## 
##   origin_author   n n_school_pntg
## 1             A   4             2
## 2          D/FL 487             3
## 3             F 349             3
## 4             G  29             2
## 5             I 118             3
## 6             S   5             1
## 7            NA  66             4

Exercise

Application Exercise 15

Explore characteristics of various schools of paintings school_pntg. You can, if you like, conduct statistical inference, but the main goal of this application exercise is to familiarize yourself with the data and find similarities and differences between the various schools of paintings. Make sure your exploration includes some visualization and some summary.

Acknowledgments

Acknowledgments

Above materials are derived in part from the following sources: