class: center, middle, inverse, title-slide # Tidytext analysis ### Dr. Maria Tackett ### 04.24.19 --- layout: true <div class="my-footer"> <span> <a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a> </span> </div> --- ## Announcements - Office hours: - TAs have regular office hours Thursday - Sunday - Professor Tackett: Thursday 10:30a - 12p and Friday 3:30p - 5p - Jose: Saturday, 12:30p - 2:30p - All grades (except final project) should be in by the end of the day on Monday, April 29 - At that point check Sakai to make sure your grades are correctly recorded - If you catch any issues (in recorded grade) email me -- for regrade requests use the usual regrade process, this is just for errors/missingness in recorded grades - Extra credit 2: If we can get *both* the [post-study survey](https://duke.qualtrics.com/jfe/form/SV_6WfAY1m6YiHUmIl) and course evaluation response rates to above 90%, everyone gets +5 pts on their total (not average) HW score. --- ## Announcements (cont.) - Presentation schedule: Mon, April 29 - Lab 01L: 2p - 3p - Lab 02L: 3p - 4p - Lab 03L: 4p - 5p - There will be one more peer eval, specifically for the project, due Wed, May 1 --- class: center, middle # Tidytext analysis --- ## Packages In addition to `tidyverse` we will be using a few other packages today ```r library(tidytext) library(genius) # https://github.com/JosiahParry/genius library(wordcloud) library(reshape2) library(gutenbergr) # repository of books ``` --- ## Tidytext - Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. - Learn more at https://www.tidytextmining.com/. --- ## What is tidy text? ```r text <- c("On your mark ready set let's go", "dance floor pro", "I know you know I go psycho", "When my new joint hit", "just can't sit", "Got to get jiggy wit it", "ooh, that's it") text ``` ``` ## [1] "On your mark ready set let's go" "dance floor pro" ## [3] "I know you know I go psycho" "When my new joint hit" ## [5] "just can't sit" "Got to get jiggy wit it" ## [7] "ooh, that's it" ``` --- ## What is tidy text? ```r text_df <- tibble(line = 1:7, text = text) text_df ``` ``` ## # A tibble: 7 x 2 ## line text ## <int> <chr> ## 1 1 On your mark ready set let's go ## 2 2 dance floor pro ## 3 3 I know you know I go psycho ## 4 4 When my new joint hit ## 5 5 just can't sit ## 6 6 Got to get jiggy wit it ## 7 7 ooh, that's it ``` --- ## What is tidy text? ```r text_df %>% unnest_tokens(word, text) ``` ``` ## # A tibble: 34 x 2 ## line word ## <int> <chr> ## 1 1 on ## 2 1 your ## 3 1 mark ## 4 1 ready ## 5 1 set ## 6 1 let's ## 7 1 go ## 8 2 dance ## 9 2 floor ## 10 2 pro ## # … with 24 more rows ``` --- class: center, middle ## Hamilton the `tidy` way! .center[  ] --- ## Let's get more data We'll use the `genius` package to get song lyric data from [Genius](https://genius.com/). - `genius_album()` allows you to download the lyrics for an entire album in a tidy format. - Input: Two arguments artists and album. Supply the quoted name of artist and the album (if it gives you issues check that you have the album name and artists as specified on [Genius](https://genius.com/)). - Output: A tidy data frame with three columns: - `title`: track name - `track_n`: track number - `text`: lyrics --- ### "What's your name, man?" ```r hamilton <- genius_album( artist = "Original Broadway Cast of Hamilton", album = "Hamilton (Original Broadway Cast Recording)" ) hamilton ``` ``` ## # A tibble: 3,319 x 4 ## track_title track_n line lyric ## <chr> <int> <int> <chr> ## 1 Alexander Hamil… 1 1 How does a bastard, orphan, son of a who… ## 2 Alexander Hamil… 1 2 Scotsman, dropped in the middle of a for… ## 3 Alexander Hamil… 1 3 Spot in the Caribbean by providence, imp… ## 4 Alexander Hamil… 1 4 Grow up to be a hero and a scholar? ## 5 Alexander Hamil… 1 5 The ten-dollar Founding Father without a… ## 6 Alexander Hamil… 1 6 Got a lot farther by working a lot harder ## 7 Alexander Hamil… 1 7 By being a lot smarter ## 8 Alexander Hamil… 1 8 By being a self-starter ## 9 Alexander Hamil… 1 9 By fourteen, they placed him in charge o… ## 10 Alexander Hamil… 1 10 And every day while slaves were being sl… ## # … with 3,309 more rows ``` --- ## Save for later ```r hamilton <- hamilton %>% mutate( album = "Hamilton (Original Broadway Cast Recording)", artist = "Original Broadway Cast of Hamilton" ) ``` --- ## What songs are in the album? ```r hamilton %>% distinct(track_title) ``` ``` ## # A tibble: 46 x 1 ## track_title ## <chr> ## 1 Alexander Hamilton ## 2 Aaron Burr, Sir ## 3 My Shot ## 4 The Story of Tonight ## 5 The Schuyler Sisters ## 6 Farmer Refuted ## 7 You'll Be Back ## 8 Right Hand Man ## 9 A Winter's Ball ## 10 Helpless ## # … with 36 more rows ``` --- ### How long are the songs in Hamilton? Length measured by number of lines ```r hamilton %>% count(track_title) %>% arrange(desc(n)) ``` ``` ## # A tibble: 46 x 2 ## track_title n ## <chr> <int> ## 1 Non-Stop 182 ## 2 The Room Where It Happens 173 ## 3 My Shot 164 ## 4 Right Hand Man 162 ## 5 Satisfied 132 ## 6 Take a Break 119 ## 7 The Election of 1800 117 ## 8 The World Was Wide Enough 108 ## 9 Say No to This 106 ## 10 Wait for It 100 ## # … with 36 more rows ``` --- ## Tidy up your lyrics! ```r hamilton_lyrics <- hamilton %>% unnest_tokens(word, lyric) hamilton_lyrics ``` ``` ## # A tibble: 21,122 x 6 ## track_title track_n line album artist word ## <chr> <int> <int> <chr> <chr> <chr> ## 1 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… how ## 2 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… does ## 3 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… a ## 4 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… bast… ## 5 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… orph… ## 6 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… son ## 7 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… of ## 8 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… a ## 9 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… whore ## 10 Alexander Ham… 1 1 Hamilton (Original … Original Broadw… and ## # … with 21,112 more rows ``` --- ### What are the most common words? ```r hamilton_lyrics %>% count(word) %>% arrange(desc(n)) ``` ``` ## # A tibble: 2,948 x 2 ## word n ## <chr> <int> ## 1 the 847 ## 2 i 634 ## 3 you 576 ## 4 to 544 ## 5 a 472 ## 6 and 383 ## 7 in 317 ## 8 it 294 ## 9 of 274 ## 10 my 257 ## # … with 2,938 more rows ``` --- ## Stop words - In computing, stop words are words which are filtered out before or after processing of natural language data (text). - They usually refer to the most common words in a language, but there is not a single list of stop words used by all natural language processing tools. --- ## Spanish stop words ```r get_stopwords(language = "es") ``` ``` ## # A tibble: 308 x 2 ## word lexicon ## <chr> <chr> ## 1 de snowball ## 2 la snowball ## 3 que snowball ## 4 el snowball ## 5 en snowball ## 6 y snowball ## 7 a snowball ## 8 los snowball ## 9 del snowball ## 10 se snowball ## # … with 298 more rows ``` --- ## Various lexicons See `?get_stopwords` for more info. ```r get_stopwords(source = "smart") ``` ``` ## # A tibble: 571 x 2 ## word lexicon ## <chr> <chr> ## 1 a smart ## 2 a's smart ## 3 able smart ## 4 about smart ## 5 above smart ## 6 according smart ## 7 accordingly smart ## 8 across smart ## 9 actually smart ## 10 after smart ## # … with 561 more rows ``` --- ### What are the most common words? ```r hamilton_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% count(word) %>% arrange(desc(n)) ``` ``` ## # A tibble: 2,591 x 2 ## word n ## <chr> <int> ## 1 da 89 ## 2 wait 81 ## 3 time 79 ## 4 hamilton 77 ## 5 hey 71 ## 6 room 71 ## 7 burr 63 ## 8 shot 58 ## 9 sir 56 ## 10 man 51 ## # … with 2,581 more rows ``` --- ### What are the most common words? ```r hamilton_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% count(word) %>% arrange(desc(n)) %>% top_n(20) %>% ggplot(aes(fct_reorder(word, n), n)) + geom_col() + coord_flip() + theme_minimal() + labs(title = "Frequency of Hamilton lyrics", y = "", x = "") ``` --- <!-- --> --- ## Sentiment analysis - One way to analyze the sentiment of a text is to consider the text as a combination of its individual words - and the sentiment content of the whole text as the sum of the sentiment content of the individual words --- ## Sentiment lexicons .pull-left[ ```r get_sentiments("afinn") ``` ``` ## # A tibble: 2,476 x 2 ## word score ## <chr> <int> ## 1 abandon -2 ## 2 abandoned -2 ## 3 abandons -2 ## 4 abducted -2 ## 5 abduction -2 ## 6 abductions -2 ## 7 abhor -3 ## 8 abhorred -3 ## 9 abhorrent -3 ## 10 abhors -3 ## # … with 2,466 more rows ``` ] .pull-right[ ```r get_sentiments("bing") ``` ``` ## # A tibble: 6,788 x 2 ## word sentiment ## <chr> <chr> ## 1 2-faced negative ## 2 2-faces negative ## 3 a+ positive ## 4 abnormal negative ## 5 abolish negative ## 6 abominable negative ## 7 abominably negative ## 8 abominate negative ## 9 abomination negative ## 10 abort negative ## # … with 6,778 more rows ``` ] --- ## Sentiment lexicons .pull-left[ ```r get_sentiments("nrc") ``` ``` ## # A tibble: 13,901 x 2 ## word sentiment ## <chr> <chr> ## 1 abacus trust ## 2 abandon fear ## 3 abandon negative ## 4 abandon sadness ## 5 abandoned anger ## 6 abandoned fear ## 7 abandoned negative ## 8 abandoned sadness ## 9 abandonment anger ## 10 abandonment fear ## # … with 13,891 more rows ``` ] .pull-right[ ```r get_sentiments("loughran") ``` ``` ## # A tibble: 4,149 x 2 ## word sentiment ## <chr> <chr> ## 1 abandon negative ## 2 abandoned negative ## 3 abandoning negative ## 4 abandonment negative ## 5 abandonments negative ## 6 abandons negative ## 7 abdicated negative ## 8 abdicates negative ## 9 abdicating negative ## 10 abdication negative ## # … with 4,139 more rows ``` ] --- ## Sentiments in Hamilton lyrics ```r hamilton_lyrics %>% inner_join(get_sentiments("bing")) %>% count(sentiment, word) %>% arrange(desc(n)) ``` ``` ## # A tibble: 482 x 3 ## sentiment word n ## <chr> <chr> <int> ## 1 positive like 73 ## 2 positive work 46 ## 3 positive right 44 ## 4 positive whoa 42 ## 5 positive well 38 ## 6 positive satisfied 35 ## 7 negative helpless 32 ## 8 positive enough 31 ## 9 positive nice 24 ## 10 positive love 20 ## # … with 472 more rows ``` --- ## Visualizing sentiments ```r hamilton_lyrics %>% inner_join(get_sentiments("bing")) %>% count(sentiment, word) %>% arrange(desc(n)) %>% group_by(sentiment) %>% top_n(10) %>% ungroup() %>% ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) + geom_col() + coord_flip() + facet_wrap(~ sentiment, scales = "free_y") + theme_minimal() + labs(title = "Sentiments in Hamilton Lyrics", x = "") ``` --- <!-- --> --- ## Hamilton word cloud ```r library(wordcloud) set.seed(04252019) hamilton_lyrics %>% anti_join(stop_words) %>% count(word) %>% with(wordcloud(word, n, max.words = 100)) ``` --- ## Hamilton word cloud <!-- --> --- ## Hamilton sentiment word cloud ```r library(reshape2) set.seed(04252019) hamilton_lyrics %>% inner_join(get_sentiments("bing")) %>% count(word, sentiment, sort = TRUE) %>% acast(word ~ sentiment, value.var = "n", fill = 0) %>% comparison.cloud(max.words = 100) ``` --- ## Hamilton sentiment word cloud <!-- --> --- class: center, middle ### What do Beyonce, Ariana Grande, and Taylor Swift have in common?    --- ## Get the data Get data from three artists: ```r beyonce <- genius_album(artist = "Beyonce", album = "Lemonade") %>% mutate(artist = "Beyonce", album = "Lemonade") ariana <- genius_album(artist = "Ariana Grande", album = "thank you, next") %>% mutate(artist = "Ariana Grande", album = "thank you, next") tswift <- genius_album(artist = "Taylor Swift", album = "reputation") %>% mutate(artist = "Taylor Swift", album = "reputation") ``` -- Combine data: ```r ldoc <- bind_rows(beyonce, ariana, tswift) ``` --- ## LDOC lyrics ```r ldoc_lyrics <- ldoc %>% unnest_tokens(word, lyric) ``` --- ## Common lyrics: Without stop words: ```r ldoc_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% count(artist, word, sort = TRUE) # alternative way to sort ``` ``` ## # A tibble: 1,776 x 3 ## artist word n ## <chr> <chr> <int> ## 1 Ariana Grande yeah 194 ## 2 Beyonce love 82 ## 3 Taylor Swift di 81 ## 4 Taylor Swift made 53 ## 5 Beyonce slay 49 ## 6 Taylor Swift call 46 ## 7 Ariana Grande eh 42 ## 8 Ariana Grande love 41 ## 9 Taylor Swift ooh 39 ## 10 Ariana Grande i'ma 37 ## # … with 1,766 more rows ``` --- ## Common lyrics With stop words: ```r ldoc_lyrics_counts <- ldoc_lyrics %>% count(artist, word, sort = TRUE) ``` --- ## What is a document about? - Term frequency - Inverse document frequency `$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$` tf-idf is about comparing **documents** within a **collection**. --- ## Calculating tf-idf This is not that exciting... What's the issue? ```r ldoc_words <- ldoc_lyrics_counts %>% bind_tf_idf(word, artist, n) ldoc_words ``` ``` ## # A tibble: 2,443 x 6 ## artist word n tf idf tf_idf ## <chr> <chr> <int> <dbl> <dbl> <dbl> ## 1 Taylor Swift you 322 0.0441 0 0 ## 2 Taylor Swift i 294 0.0403 0 0 ## 3 Ariana Grande i 241 0.0503 0 0 ## 4 Beyonce i 236 0.0514 0 0 ## 5 Beyonce you 233 0.0507 0 0 ## 6 Ariana Grande you 230 0.0480 0 0 ## 7 Taylor Swift the 226 0.0310 0 0 ## 8 Taylor Swift it 212 0.0291 0 0 ## 9 Ariana Grande yeah 194 0.0405 0 0 ## 10 Taylor Swift me 190 0.0260 0 0 ## # … with 2,433 more rows ``` --- ## Re-calculating tf-idf ```r ldoc_words %>% bind_tf_idf(word, artist, n) %>% arrange(-tf_idf) ``` ``` ## # A tibble: 2,443 x 6 ## artist word n tf idf tf_idf ## <chr> <chr> <int> <dbl> <dbl> <dbl> ## 1 Taylor Swift di 81 0.0111 1.10 0.0122 ## 2 Beyonce slay 49 0.0107 1.10 0.0117 ## 3 Beyonce okay 42 0.00914 1.10 0.0100 ## 4 Ariana Grande eh 42 0.00876 1.10 0.00962 ## 5 Ariana Grande thank 39 0.00814 1.10 0.00894 ## 6 Beyonce daddy 27 0.00588 1.10 0.00646 ## 7 Ariana Grande space 24 0.00501 1.10 0.00550 ## 8 Taylor Swift ha 34 0.00466 1.10 0.00512 ## 9 Taylor Swift da 30 0.00411 1.10 0.00452 ## 10 Beyonce catch 16 0.00348 1.10 0.00383 ## # … with 2,433 more rows ``` --- <!-- --> --- ### `tidy`ing The Mueller Report .center[  ] See more at [Using R To Analyze The Redacted Mueller Report](https://www.jlukito.com/blog/2019/4/20/using-r-to-analyze-the-redacted-mueller-report) --- ## Extra practice Use the **tidytext** project in RStudio Cloud for additional practice with song lyrics and books. --- ## Acknowledgements - Julia Silge: https://github.com/juliasilge/tidytext-tutorial - Julia Silge and David Robinson: https://www.tidytextmining.com/ - Josiah Parry: https://github.com/JosiahParry/geniusR --- class: middle, center ### Congrats on completing STA 199 🎉