class: center, middle, inverse, title-slide

# Review + recap 🧢

---

layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01
</a>
</span>
</div>

---

## Announcements

- MT 01 due Tuesday

---

class: center, middle

# group_by

---

## What does group_by() do?

`group_by()` takes an existing `tbl` and converts it into a grouped `tbl` where operations are performed "by group":

.pull-left[
```r
ucbadmit
```

```
## # A tibble: 4,526 x 3
##    admit    gender dept
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A
##  2 Admitted Male   A
##  3 Admitted Male   A
##  4 Admitted Male   A
##  5 Admitted Male   A
##  6 Admitted Male   A
##  7 Admitted Male   A
##  8 Admitted Male   A
##  9 Admitted Male   A
## 10 Admitted Male   A
## # ... with 4,516 more rows
```
]
.pull-right[
```r
ucbadmit %>%
  group_by(gender)
```

```
## # A tibble: 4,526 x 3
## # Groups:   gender [2]
##    admit    gender dept
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A
##  2 Admitted Male   A
##  3 Admitted Male   A
##  4 Admitted Male   A
##  5 Admitted Male   A
##  6 Admitted Male   A
##  7 Admitted Male   A
##  8 Admitted Male   A
##  9 Admitted Male   A
## 10 Admitted Male   A
## # ... with 4,516 more rows
```
]

---

## What does group_by() not do?

`group_by()` does not sort the data; `arrange()` does:

.pull-left[
```r
ucbadmit %>%
  group_by(gender)
```

```
## # A tibble: 4,526 x 3
## # Groups:   gender [2]
##    admit    gender dept
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A
##  2 Admitted Male   A
##  3 Admitted Male   A
##  4 Admitted Male   A
##  5 Admitted Male   A
##  6 Admitted Male   A
##  7 Admitted Male   A
##  8 Admitted Male   A
##  9 Admitted Male   A
## 10 Admitted Male   A
## # ... with 4,516 more rows
```
]
.pull-right[
```r
ucbadmit %>%
  arrange(gender)
```

```
## # A tibble: 4,526 x 3
##    admit    gender dept
##    <fct>    <fct>  <ord>
##  1 Admitted Female A
##  2 Admitted Female A
##  3 Admitted Female A
##  4 Admitted Female A
##  5 Admitted Female A
##  6 Admitted Female A
##  7 Admitted Female A
##  8 Admitted Female A
##  9 Admitted Female A
## 10 Admitted Female A
## # ... with 4,516 more rows
```
]

---

## What does group_by() not do?

`group_by()` does not create frequency tables; `count()` does:

.pull-left[
```r
ucbadmit %>%
  group_by(gender)
```

```
## # A tibble: 4,526 x 3
## # Groups:   gender [2]
##    admit    gender dept
##    <fct>    <fct>  <ord>
##  1 Admitted Male   A
##  2 Admitted Male   A
##  3 Admitted Male   A
##  4 Admitted Male   A
##  5 Admitted Male   A
##  6 Admitted Male   A
##  7 Admitted Male   A
##  8 Admitted Male   A
##  9 Admitted Male   A
## 10 Admitted Male   A
## # ... with 4,516 more rows
```
]
.pull-right[
```r
ucbadmit %>%
  count(gender)
```

```
## # A tibble: 2 x 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```
]

---

## Undo grouping with ungroup()

.pull-left[
```r
ucbadmit %>%
  count(gender, admit) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n)) %>%
  select(gender, prop_admit)
```

```
## # A tibble: 4 x 2
## # Groups:   gender [2]
##   gender prop_admit
##   <fct>       <dbl>
## 1 Female      0.696
## 2 Female      0.304
## 3 Male        0.555
## 4 Male        0.445
```
]
.pull-right[
```r
ucbadmit %>%
  count(gender, admit) %>%
  group_by(gender) %>%
  mutate(prop_admit = n / sum(n)) %>%
  select(gender, prop_admit) %>%
  ungroup()
```

```
## # A tibble: 4 x 2
##   gender prop_admit
##   <fct>       <dbl>
## 1 Female      0.696
## 2 Female      0.304
## 3 Male        0.555
## 4 Male        0.445
```
]

---

class: center, middle

# count

---

## count() is a short-hand

`count()` is a short-hand for `group_by()` and then `summarise()` to count the number of observations in each group:

.pull-left[
```r
ucbadmit %>%
  group_by(gender) %>%
  summarise(n = n())
```

```
## # A tibble: 2 x 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```
]
.pull-right[
```r
ucbadmit %>%
  count(gender)
```

```
## # A tibble: 2 x 2
##   gender     n
##   <fct>  <int>
## 1 Female  1835
## 2 Male    2691
```
]

---

## count() can take multiple arguments

.pull-left[
```r
ucbadmit %>%
  group_by(gender, admit) %>%
  summarise(n = n())
```

```
## # A tibble: 4 x 3
## # Groups:   gender [?]
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
```
]
.pull-right[
```r
ucbadmit %>%
  count(gender, admit)
```

```
## # A tibble: 4 x 3
##   gender admit        n
##   <fct>  <fct>    <int>
## 1 Female Rejected  1278
## 2 Female Admitted   557
## 3 Male   Rejected  1493
## 4 Male   Admitted  1198
```
]

--

.question[
What is the difference between the two outputs?
]

--

- `count()` ungroups after itself
- `summarise()` peels off one layer of grouping
- The question mark just means that the number of groups is unknown right now; it will only be computed when/if the next line is executed

---

## tally() is also a short-hand

`tally()` is a short-hand for `summarise(n = n())`:

.pull-left[
```r
ucbadmit %>%
  tally()
```

```
## # A tibble: 1 x 1
##       n
##   <int>
## 1  4526
```
]
.pull-right[
```r
ucbadmit %>%
  summarise(n = n())
```

```
## # A tibble: 1 x 1
##       n
##   <int>
## 1  4526
```
]

--

<br>

.question[
What is the relationship between `count()` and `tally()`?
]

---

## Relationship between count() and tally()

`count()` is also a short-hand for `group_by()` and then `tally()`:

.pull-left[
```r
ucbadmit %>%
  group_by(admit) %>%
  tally()
```

```
## # A tibble: 2 x 2
##   admit        n
##   <fct>    <int>
## 1 Rejected  2771
## 2 Admitted  1755
```
]
.pull-right[
```r
ucbadmit %>%
  count(admit)
```

```
## # A tibble: 2 x 2
##   admit        n
##   <fct>    <int>
## 1 Rejected  2771
## 2 Admitted  1755
```
]
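
---

## Checking the grouping yourself

The claims on the earlier slides (that `count()` ungroups after itself while `summarise()` peels off one layer of grouping) can be checked directly with `group_vars()`. The sketch below is only an illustration: the name `ucb` is hypothetical, and it assumes the course's `ucbadmit` can be approximated by expanding the built-in `UCBAdmissions` table with `tidyr::uncount()`. Factor levels in that stand-in are ordered differently than in `ucbadmit`, so rows may print in a different order than on the slides.

```r
library(dplyr)
library(tidyr)

# Assumption: `ucb` is a stand-in for the course's `ucbadmit`, rebuilt from
# the built-in UCBAdmissions table (one row per applicant after uncount()).
ucb <- as.data.frame(UCBAdmissions) %>%
  as_tibble() %>%
  uncount(Freq) %>%
  rename(admit = Admit, gender = Gender, dept = Dept)

# Two ways of counting the gender x admit combinations
by_summarise <- ucb %>%
  group_by(gender, admit) %>%
  summarise(n = n())

by_count <- ucb %>%
  count(gender, admit)

# summarise() dropped only the last grouping variable (admit)
group_vars(by_summarise)  # "gender"

# count() returned an ungrouped tibble
group_vars(by_count)      # character(0)

# count() and group_by() + tally() give the same counts
ucb %>% group_by(admit) %>% tally()
ucb %>% count(admit)
```

Recent versions of dplyr may print a message about the grouping of the `summarise()` result; it describes the same "one layer peeled off" behaviour.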