class: center, middle, inverse, title-slide # Text analysis
📄 --- layout: true <div class="my-footer"> <span> Dr. Mine Çetinkaya-Rundel - <a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01 </a> </span> </div> --- ## Announcements - Please fill out the course (and TA) evaluations. - A little incentive — if everyone in the class turns in their course evaluation, you will each get 5 points added to your lowest non-dropped HW score. --- ## Tips for your final project .center[ <iframe width="560" height="315" src="https://www.youtube.com/embed/vGUNqq3jVLg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> ] --- class: center, middle # Text analysis --- ## Text analysis - Text mining (text analysis) is the process of deriving high-quality information from text. - Typical text mining tasks include - text categorization - text clustering - concept/entity extraction - production of granular taxonomies - sentiment analysis - document summarization - entity relation modeling (i.e., learning relations between named entities) --- ## Tidytext .pull-left[ - Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools we've already learned about - **Tidy text** = one-**token**-per-row - A token can be a word or an n-gram (e.g. bigram = 2 words) - Learn more at [tidytextmining.com](https://www.tidytextmining.com/) ] .pull-right[ .center[  ] ] ```r library(tidytext) ``` --- ## What is tidy text? 
```r text <- c("Take me out tonight", "Where there's music and there's people", "And they're young and alive", "Driving in your car", "I never never want to go home", "Because I haven't got one", "Anymore") text ``` ``` ## [1] "Take me out tonight" ## [2] "Where there's music and there's people" ## [3] "And they're young and alive" ## [4] "Driving in your car" ## [5] "I never never want to go home" ## [6] "Because I haven't got one" ## [7] "Anymore" ``` --- ## What is tidy text? ```r text_df <- tibble(line = 1:7, text = text) text_df ``` ``` ## # A tibble: 7 x 2 ## line text ## <int> <chr> ## 1 1 Take me out tonight ## 2 2 Where there's music and there's people ## 3 3 And they're young and alive ## 4 4 Driving in your car ## 5 5 I never never want to go home ## 6 6 Because I haven't got one ## 7 7 Anymore ``` --- ## What is tidy text? ```r text_df %>% unnest_tokens(word, text) ``` ``` ## # A tibble: 32 x 2 ## line word ## <int> <chr> ## 1 1 take ## 2 1 me ## 3 1 out ## 4 1 tonight ## 5 2 where ## 6 2 there's ## 7 2 music ## 8 2 and ## 9 2 there's ## 10 2 people ## # ... with 22 more rows ``` --- ## Ex. Trump's tweets **Text analysis of Trump's tweets confirms he writes only the (angrier) Android half**, by David Robinson .center[  ] .small[ [varianceexplained.org/r/trump-tweets/](http://varianceexplained.org/r/trump-tweets/) ] --- class: center, middle  --- ## Let's get more data We'll use the `geniusR` package to get song lyric data from [Genius](https://genius.com/). ```r library(geniusR) # https://github.com/JosiahParry/geniusR ``` - `genius_album()` allows you to download the lyrics for an entire album in a tidy format. - Input: Two arguments, `artist` and `album`. Supply the quoted names of the artist and the album (if it gives you issues, check that the album name and artist are spelled exactly as they appear on [Genius](https://genius.com/)). - Output: A tidy data frame with four columns: - `track_title`: track name - `track_n`: track number - `line`: line number within the track - `lyric`: a line of lyrics --- ## Hamilton!
```r hamilton <- genius_album( artist = "Lin-Manuel Miranda", album = "Hamilton: An American Musical (Off-Broadway)" ) ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` ```r glimpse(hamilton) ``` ``` ## Observations: 4,215 ## Variables: 4 ## $ track_title <chr> "Alexander Hamilton (Off-Broadway) (Ft. Anthony Ra... ## $ track_n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ lyric <chr> "How does a bastard, orphan, son of a whore and a"... ## $ line <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ``` --- ## Save for later ```r hamilton <- hamilton %>% mutate( album = "Hamilton", artist = "Lin-Manuel Miranda" ) ``` --- ## How many songs are in the album? ```r hamilton %>% distinct(track_title) %>% nrow() ``` ``` ## [1] 52 ``` --- ## How long are Hamilton songs? Length measured by number of lines ```r hamilton %>% count(track_title) %>% arrange(desc(n)) %>% select(n, track_title) ``` ``` ## # A tibble: 52 x 2 ## n track_title ## <int> <chr> ## 1 268 Non-Stop (Off-Broadway) (Ft. Isaiah Johnson, Leslie Odom Jr., Ph… ## 2 214 My Shot (Off-Broadway) (Ft. Anthony Ramos, Daveed Diggs, Leslie … ## 3 190 The Room Where It Happens (Off-Broadway) (Ft. Daveed Diggs, Lesl… ## 4 188 Satisfied (Off-Broadway) (Ft. Anthony Ramos, Phillipa Soo & René… ## 5 168 Helpless (Off-Broadway) (Ft. Phillipa Soo & Renée Elise Goldsber… ## 6 149 Right Hand Man (Off-Broadway) (Ft. Anthony Ramos, Daveed Diggs, … ## 7 146 Wait For it (Off-Broadway) (Ft. Leslie Odom Jr.) ## 8 143 Say No to This (Off-Broadway) (Ft. Ciara Renee, Phillipa Soo, Re… ## 9 137 Take a Break (Off-Broadway) (Ft. Anthony Ramos, Phillipa Soo & R… ## 10 124 The Election of 1800 (Off-Broadway) ## # ... with 42 more rows ``` --- ## Tidy up your lyrics! 
.small[ ```r hamilton_lyrics <- hamilton %>% unnest_tokens(word, lyric) hamilton_lyrics ``` ``` ## # A tibble: 22,611 x 6 ## track_title track_n line album artist word ## <chr> <int> <int> <chr> <chr> <chr> ## 1 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… how ## 2 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… does ## 3 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 4 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… bast… ## 5 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… orph… ## 6 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… son ## 7 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… of ## 8 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 9 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… whore ## 10 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… and ## # ... with 22,601 more rows ``` ] --- ## What are the most common words? ```r hamilton_lyrics %>% count(word) %>% arrange(desc(n)) ``` ``` ## # A tibble: 3,132 x 2 ## word n ## <chr> <int> ## 1 the 922 ## 2 i 655 ## 3 you 631 ## 4 to 570 ## 5 a 499 ## 6 and 418 ## 7 in 350 ## 8 it 325 ## 9 of 303 ## 10 my 287 ## # ... with 3,122 more rows ``` --- ## Stop words - **Stop words** are words which are filtered out before or after processing of natural language data (text) - They usually refer to the most common words in a language, but there is not a single list of stop words used by all natural language processing tools --- ## Spanish stop words ```r get_stopwords(language = "es") ``` ``` ## # A tibble: 308 x 2 ## word lexicon ## <chr> <chr> ## 1 de snowball ## 2 la snowball ## 3 que snowball ## 4 el snowball ## 5 en snowball ## 6 y snowball ## 7 a snowball ## 8 los snowball ## 9 del snowball ## 10 se snowball ## # ... with 298 more rows ``` --- ## Various lexicons See `?get_stopwords` for more info. 
```r get_stopwords(source = "smart") ``` ``` ## # A tibble: 571 x 2 ## word lexicon ## <chr> <chr> ## 1 a smart ## 2 a's smart ## 3 able smart ## 4 about smart ## 5 above smart ## 6 according smart ## 7 accordingly smart ## 8 across smart ## 9 actually smart ## 10 after smart ## # ... with 561 more rows ``` --- ## What are the most common words? ```r hamilton_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% count(word) %>% arrange(desc(n)) ``` ``` ## Joining, by = "word" ``` ``` ## # A tibble: 2,762 x 2 ## word n ## <chr> <int> ## 1 da 103 ## 2 hey 95 ## 3 wait 83 ## 4 hamilton 72 ## 5 room 67 ## 6 burr 66 ## 7 time 66 ## 8 alexander 58 ## 9 sir 54 ## 10 man 48 ## # ... with 2,752 more rows ``` --- ## What are the most common words? ```r hamilton_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% count(word) %>% arrange(desc(n)) %>% top_n(20) %>% ggplot(aes(fct_reorder(word, n), n)) + geom_col() + coord_flip() + theme_minimal() + labs(title = "Frequency of Hamilton lyrics", subtitle = "Da da da dat da dat da da da da ya da", y = "", x = "") ``` --- <!-- --> --- ## Sentiment analysis - One way to analyze the sentiment of a text is to consider the text as a combination of its individual words - and the sentiment content of the whole text as the sum of the sentiment content of the individual words --- ## Sentiment lexicons .pull-left[ ```r get_sentiments("afinn") ``` ``` ## # A tibble: 2,476 x 2 ## word score ## <chr> <int> ## 1 abandon -2 ## 2 abandoned -2 ## 3 abandons -2 ## 4 abducted -2 ## 5 abduction -2 ## 6 abductions -2 ## 7 abhor -3 ## 8 abhorred -3 ## 9 abhorrent -3 ## 10 abhors -3 ## # ... 
with 2,466 more rows ``` ] .pull-right[ ```r get_sentiments("bing") ``` ``` ## # A tibble: 6,788 x 2 ## word sentiment ## <chr> <chr> ## 1 2-faced negative ## 2 2-faces negative ## 3 a+ positive ## 4 abnormal negative ## 5 abolish negative ## 6 abominable negative ## 7 abominably negative ## 8 abominate negative ## 9 abomination negative ## 10 abort negative ## # ... with 6,778 more rows ``` ] --- ## Sentiment lexicons .pull-left[ ```r get_sentiments("nrc") ``` ``` ## # A tibble: 13,901 x 2 ## word sentiment ## <chr> <chr> ## 1 abacus trust ## 2 abandon fear ## 3 abandon negative ## 4 abandon sadness ## 5 abandoned anger ## 6 abandoned fear ## 7 abandoned negative ## 8 abandoned sadness ## 9 abandonment anger ## 10 abandonment fear ## # ... with 13,891 more rows ``` ] .pull-right[ ```r get_sentiments("loughran") ``` ``` ## # A tibble: 4,149 x 2 ## word sentiment ## <chr> <chr> ## 1 abandon negative ## 2 abandoned negative ## 3 abandoning negative ## 4 abandonment negative ## 5 abandonments negative ## 6 abandons negative ## 7 abdicated negative ## 8 abdicates negative ## 9 abdicating negative ## 10 abdication negative ## # ... with 4,139 more rows ``` ] --- ## Sentiments in Hamilton lyrics ```r hamilton_lyrics %>% inner_join(get_sentiments("bing")) %>% count(sentiment, word) %>% arrange(desc(n)) ``` ``` ## Joining, by = "word" ``` ``` ## # A tibble: 540 x 3 ## sentiment word n ## <chr> <chr> <int> ## 1 positive like 82 ## 2 positive right 45 ## 3 positive work 40 ## 4 negative helpless 34 ## 5 positive satisfied 34 ## 6 positive well 34 ## 7 positive enough 33 ## 8 positive nice 33 ## 9 negative shit 21 ## 10 positive love 19 ## # ... 
with 530 more rows ``` --- ## Visualizing sentiments <!-- --> --- ## Visualizing sentiments ```r hamilton_lyrics %>% inner_join(get_sentiments("bing")) %>% count(sentiment, word) %>% arrange(desc(n)) %>% group_by(sentiment) %>% top_n(10) %>% ungroup() %>% ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) + geom_col() + coord_flip() + facet_wrap(~ sentiment, scales = "free") + theme_minimal() + labs(title = "Sentiments in Hamilton lyrics", x = "") + guides(fill = FALSE) ``` --- class: center, middle # Comparing lyrics across artists --- ## Get more data Get data from two more artists: ```r american_teen <- genius_album(artist = "Khalid", album = "American Teen") %>% mutate(artist = "Khalid", album = "American Teen") ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` ```r sinister <- genius_album(artist = "Belle and Sebastian", album = "If You're Feeling Sinister") %>% mutate(artist = "Belle and Sebastian", album = "If You're Feeling Sinister") ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` ```r drones <- genius_album(artist = "Muse", album = "Drones") %>% mutate(artist = "Muse", album = "Drones") ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` --- ## Combine data: .pull-left[ ```r mixtape <- bind_rows( hamilton, american_teen, sinister, drones ) ``` ] .pull-right[  ] ```r glimpse(mixtape) ``` ``` ## Observations: 5,777 ## Variables: 6 ## $ track_title <chr> "Alexander Hamilton (Off-Broadway) (Ft. Anthony Ra... ## $ track_n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ lyric <chr> "How does a bastard, orphan, son of a whore and a"... ## $ line <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ## $ album <chr> "Hamilton", "Hamilton", "Hamilton", "Hamilton", "H... ## $ artist <chr> "Lin-Manuel Miranda", "Lin-Manuel Miranda", "Lin-M... 
``` --- ## All lyrics .small[ ```r mixtape_lyrics <- mixtape %>% unnest_tokens(word, lyric) mixtape_lyrics ``` ``` ## # A tibble: 31,952 x 6 ## track_title track_n line album artist word ## <chr> <int> <int> <chr> <chr> <chr> ## 1 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… how ## 2 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… does ## 3 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 4 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… bast… ## 5 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… orph… ## 6 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… son ## 7 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… of ## 8 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 9 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… whore ## 10 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… and ## # ... with 31,942 more rows ``` ] --- ## Common lyrics Without stop words: ```r mixtape_lyrics_counts <- mixtape_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% filter( # stop words that weren't in the lexicon !(word %in% c("da", "dadadadada", "la", "ooh", "em")) ) %>% count(artist, word, sort = TRUE) # alternative way to sort ``` ``` ## Joining, by = "word" ``` --- ## What is a document about? - Term frequency - Inverse document frequency `$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$` tf-idf is about comparing **documents** within a **collection**. --- ## Calculating tf-idf This is not that exciting... What's the issue? 
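Before trusting `bind_tf_idf()`, the formula can be sanity-checked by hand. Here term frequency (tf) is a term's share of the tokens in a document, and idf follows the formula above. A minimal sketch in base R; the two "documents" and their words below are made up for illustration:

```r
# Toy collection: two hypothetical "documents" (think: two artists' lyrics)
docs <- list(
  doc1 = c("love", "love", "night", "fire"),
  doc2 = c("night", "war", "war")
)
n_docs <- length(docs)

# tf(term, doc): share of the document's tokens that are this term
tf <- function(term, doc) sum(doc == term) / length(doc)

# idf(term) = ln(n_documents / n_documents containing the term)
idf <- function(term) {
  n_containing <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(n_docs / n_containing)
}

tf("love", docs$doc1) * idf("love")   # distinctive word: 0.5 * ln(2)
tf("night", docs$doc1) * idf("night") # appears in every document: idf = ln(1) = 0
```

A word that appears in every document gets idf = ln(1) = 0, so its tf-idf is 0 no matter how frequent it is; that is exactly what happens to words like "time", "love", and "man" in the output on the next slide.

---

## Calculating tf-idf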
```r mixtape_words <- mixtape_lyrics_counts %>% bind_tf_idf(word, artist, n) mixtape_words ``` ``` ## # A tibble: 4,001 x 6 ## artist word n tf idf tf_idf ## <chr> <chr> <int> <dbl> <dbl> <dbl> ## 1 Lin-Manuel Miranda hey 95 0.0112 0.288 0.00322 ## 2 Lin-Manuel Miranda wait 83 0.00977 0.693 0.00677 ## 3 Lin-Manuel Miranda hamilton 72 0.00847 1.39 0.0117 ## 4 Lin-Manuel Miranda room 67 0.00788 0.288 0.00227 ## 5 Lin-Manuel Miranda burr 66 0.00777 1.39 0.0108 ## 6 Lin-Manuel Miranda time 66 0.00777 0 0 ## 7 Lin-Manuel Miranda alexander 58 0.00683 1.39 0.00946 ## 8 Khalid love 57 0.0405 0 0 ## 9 Lin-Manuel Miranda sir 54 0.00635 0.693 0.00440 ## 10 Lin-Manuel Miranda man 48 0.00565 0 0 ## # ... with 3,991 more rows ``` --- ## Sorting tf-idf ```r mixtape_words %>% arrange(-tf_idf) ``` ``` ## # A tibble: 4,001 x 6 ## artist word n tf idf tf_idf ## <chr> <chr> <int> <dbl> <dbl> <dbl> ## 1 Belle and Sebastian track 19 0.0219 1.39 0.0304 ## 2 Belle and Sebastian stars 18 0.0208 1.39 0.0288 ## 3 Khalid turning 29 0.0206 1.39 0.0286 ## 4 Muse drones 15 0.0183 1.39 0.0253 ## 5 Muse psycho 14 0.0171 1.39 0.0236 ## 6 Belle and Sebastian judy 14 0.0162 1.39 0.0224 ## 7 Khalid blooded 18 0.0128 1.39 0.0177 ## 8 Belle and Sebastian snow 11 0.0127 1.39 0.0176 ## 9 Khalid promise 32 0.0227 0.693 0.0158 ## 10 Muse revolt 9 0.0110 1.39 0.0152 ## # ... with 3,991 more rows ``` --- <!-- --> --- class: center, middle  --- ## Get the data ```r library(gutenbergr) im <- gutenberg_download(5230) ```  --- ```r tidy_im <- im %>% mutate(line = row_number()) %>% unnest_tokens(word, text) %>% anti_join(stop_words) ``` ``` ## Joining, by = "word" ``` ```r tidy_im ``` ``` ## # A tibble: 17,592 x 3 ## gutenberg_id line word ## <int> <int> <chr> ## 1 5230 1 invisible ## 2 5230 3 grotesque ## 3 5230 3 romance ## 4 5230 9 contents ## 5 5230 11 strange ## 6 5230 11 man's ## 7 5230 11 arrival ## 8 5230 12 ii ## 9 5230 12 teddy ## 10 5230 12 henfrey's ## # ...
with 17,582 more rows ``` --- ## Word frequency ```r tidy_im %>% count(word, sort = TRUE) ``` ``` ## # A tibble: 5,306 x 2 ## word n ## <chr> <int> ## 1 kemp 213 ## 2 invisible 180 ## 3 door 169 ## 4 hall 149 ## 5 marvel 114 ## 6 voice 92 ## 7 suddenly 79 ## 8 stood 78 ## 9 window 78 ## 10 heard 76 ## # ... with 5,296 more rows ``` --- ## Bigram frequency ```r im %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% count(bigram, sort = TRUE) %>% separate(bigram, c("word1", "word2"), sep = " ") %>% anti_join(stop_words, by = c("word1" = "word")) %>% anti_join(stop_words, by = c("word2" = "word")) ``` ``` ## # A tibble: 4,361 x 3 ## word1 word2 n ## <chr> <chr> <int> ## 1 thomas marvel 19 ## 2 dr kemp 15 ## 3 front door 11 ## 4 teddy henfrey 11 ## 5 dressing gown 10 ## 6 empty sleeve 9 ## 7 extra ordinary 9 ## 8 parlour door 9 ## 9 port stowe 9 ## 10 jolly cricketers 8 ## # ... with 4,351 more rows ``` --- ## Acknowledgements - Julia Silge: https://github.com/juliasilge/tidytext-tutorial - Julia Silge and David Robinson: https://www.tidytextmining.com/ - Josiah Parry: https://github.com/JosiahParry/geniusR