
# Tidytext analysis
### Dr. Maria Tackett
### 04.24.19

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## Announcements

- Office hours: 
    - TAs have regular office hours Thursday - Sunday 
    - Professor Tackett: Thursday 10:30a - 12p and Friday 3:30p - 5p
    - Jose: Saturday, 12:30p - 2:30p
 
- All grades (except final project) should be in by the end of the day on Monday, April 29
  - At that point check Sakai to make sure your grades are correctly recorded
  - If you catch an error or a missing entry in a recorded grade, email me.
  For regrade requests, use the usual regrade process.
  
- Extra credit 2: If we can get *both* the [post-study survey](https://duke.qualtrics.com/jfe/form/SV_6WfAY1m6YiHUmIl) and course evaluation response rates above 90%, everyone gets +5 pts on their total (not average) HW score.

---

## Announcements (cont.)

- Presentation schedule: Mon, April 29
  - Lab 01L: 2p - 3p
  - Lab 02L: 3p - 4p
  - Lab 03L: 4p - 5p
  
- There will be one more peer eval, specifically for the project, due Wed, May 1

---

# Tidytext analysis

---

## Packages

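In addition to `tidyverse`, we will be using a few other packages today:

```r
library(tidyverse)  # tibble(), %>%, dplyr verbs, ggplot2
library(tidytext)   # unnest_tokens(), stop words, sentiment lexicons
library(genius)     # song lyrics, https://github.com/JosiahParry/genius
library(wordcloud)  # wordcloud(), comparison.cloud()
library(reshape2)   # acast()
library(gutenbergr) # repository of books
```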

---

## Tidytext

- Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use.

- Learn more at https://www.tidytextmining.com/.

---

## What is tidy text?

```r
text <- c("On your mark ready set let's go", 
          "dance floor pro",
          "I know you know I go psycho", 
          "When my new joint hit", 
          "just can't sit",
          "Got to get jiggy wit it", 
          "ooh, that's it")

text
```

```
## [1] "On your mark ready set let's go" "dance floor pro"                
## [3] "I know you know I go psycho"     "When my new joint hit"          
## [5] "just can't sit"                  "Got to get jiggy wit it"        
## [7] "ooh, that's it"
```

---

## What is tidy text?

```r
text_df <- tibble(line = 1:7, text = text)

text_df
```

```
## # A tibble: 7 x 2
##    line text                           
##   <int> <chr>                          
## 1     1 On your mark ready set let's go
## 2     2 dance floor pro                
## 3     3 I know you know I go psycho    
## 4     4 When my new joint hit          
## 5     5 just can't sit                 
## 6     6 Got to get jiggy wit it        
## 7     7 ooh, that's it
```

---

## What is tidy text?

```r
text_df %>%
  unnest_tokens(word, text)
```

```
## # A tibble: 34 x 2
##     line word 
##    <int> <chr>
##  1     1 on   
##  2     1 your 
##  3     1 mark 
##  4     1 ready
##  5     1 set  
##  6     1 let's
##  7     1 go   
##  8     2 dance
##  9     2 floor
## 10     2 pro  
## # … with 24 more rows
```
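
---

## Other tokens: bigrams

`unnest_tokens()` splits text into single words by default. As a minimal sketch, assuming the `token = "ngrams"` option of `unnest_tokens()` and the first three lines of the toy lyrics, the text can instead be split into overlapping pairs of words (bigrams). The column name `bigram` is only an illustrative choice.

```r
library(tibble)
library(dplyr)
library(tidytext)

# First three lines of the toy lyrics from the previous slides
text_df <- tibble(
  line = 1:3,
  text = c("On your mark ready set let's go",
           "dance floor pro",
           "I know you know I go psycho")
)

# One row per bigram: each pair of consecutive words within a line
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

bigrams
```

A line with k words yields k - 1 bigrams, so bigrams never span two lines.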

---

## Hamilton the `tidy` way!

---

## Let's get more data

We'll use the `genius` package to get song lyric data from [Genius](https://genius.com/).

- `genius_album()` allows you to download the lyrics for an entire album in a 
tidy format.

- Input: Two arguments, `artist` and `album`. Supply the quoted artist name
and album title (if you run into issues, check that both match the names used
on [Genius](https://genius.com/)).

- Output: A tidy data frame with four columns:
    - `track_title`: track name
    - `track_n`: track number
    - `line`: line number
    - `lyric`: text of the line

---

### "What's your name, man?"

```r
hamilton <- genius_album(
  artist = "Original Broadway Cast of Hamilton", 
  album = "Hamilton (Original Broadway Cast Recording)"
  )
hamilton
```

```
## # A tibble: 3,319 x 4
##    track_title      track_n  line lyric                                    
##    <chr>              <int> <int> <chr>                                    
##  1 Alexander Hamil…       1     1 How does a bastard, orphan, son of a who…
##  2 Alexander Hamil…       1     2 Scotsman, dropped in the middle of a for…
##  3 Alexander Hamil…       1     3 Spot in the Caribbean by providence, imp…
##  4 Alexander Hamil…       1     4 Grow up to be a hero and a scholar?      
##  5 Alexander Hamil…       1     5 The ten-dollar Founding Father without a…
##  6 Alexander Hamil…       1     6 Got a lot farther by working a lot harder
##  7 Alexander Hamil…       1     7 By being a lot smarter                   
##  8 Alexander Hamil…       1     8 By being a self-starter                  
##  9 Alexander Hamil…       1     9 By fourteen, they placed him in charge o…
## 10 Alexander Hamil…       1    10 And every day while slaves were being sl…
## # … with 3,309 more rows
```

---

## Save for later

```r
hamilton <- hamilton %>%
  mutate(
    album = "Hamilton (Original Broadway Cast Recording)",
    artist = "Original Broadway Cast of Hamilton"
    )
```

---

## What songs are in the album?

```r
hamilton %>%
  distinct(track_title)
```

```
## # A tibble: 46 x 1
##    track_title         
##    <chr>               
##  1 Alexander Hamilton  
##  2 Aaron Burr, Sir     
##  3 My Shot             
##  4 The Story of Tonight
##  5 The Schuyler Sisters
##  6 Farmer Refuted      
##  7 You'll Be Back      
##  8 Right Hand Man      
##  9 A Winter's Ball     
## 10 Helpless            
## # … with 36 more rows
```

---

### How long are the songs in Hamilton?

Length is measured by the number of lines.

```r
hamilton %>%
  count(track_title) %>%
  arrange(desc(n))
```

```
## # A tibble: 46 x 2
##    track_title                   n
##    <chr>                     <int>
##  1 Non-Stop                    182
##  2 The Room Where It Happens   173
##  3 My Shot                     164
##  4 Right Hand Man              162
##  5 Satisfied                   132
##  6 Take a Break                119
##  7 The Election of 1800        117
##  8 The World Was Wide Enough   108
##  9 Say No to This              106
## 10 Wait for It                 100
## # … with 36 more rows
```

---

## Tidy up your lyrics!

```r
hamilton_lyrics <- hamilton %>%
  unnest_tokens(word, lyric)

hamilton_lyrics
```

```
## # A tibble: 21,122 x 6
##    track_title    track_n  line album                artist           word 
##    <chr>            <int> <int> <chr>                <chr>            <chr>
##  1 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… how  
##  2 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… does 
##  3 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… a    
##  4 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… bast…
##  5 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… orph…
##  6 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… son  
##  7 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… of   
##  8 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… a    
##  9 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… whore
## 10 Alexander Ham…       1     1 Hamilton (Original … Original Broadw… and  
## # … with 21,112 more rows
```

---

### What are the most common words?

```r
hamilton_lyrics %>%
  count(word) %>%
  arrange(desc(n))
```

```
## # A tibble: 2,948 x 2
##    word      n
##    <chr> <int>
##  1 the     847
##  2 i       634
##  3 you     576
##  4 to      544
##  5 a       472
##  6 and     383
##  7 in      317
##  8 it      294
##  9 of      274
## 10 my      257
## # … with 2,938 more rows
```

---

## Stop words

- In computing, stop words are words which are filtered out before or after processing of natural language data (text).

- They usually refer to the most common words in a language, but there is not a single list of stop words used by all natural language processing tools.
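
---

## Stop word lists differ

To see that there is no single list, the sketch below counts the entries per lexicon in `stop_words`, a data frame bundled with `tidytext` that stacks several English stop word lists. It assumes only that `tidytext` and `dplyr` are installed. The name `lexicon_sizes` is an illustrative choice.

```r
library(dplyr)
library(tidytext)

# stop_words has one row per (word, lexicon) pair
lexicon_sizes <- stop_words %>%
  count(lexicon, sort = TRUE)

lexicon_sizes
```

The lexicons differ in size, so the choice of list changes which words survive filtering.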

---

## Spanish stop words

```r
get_stopwords(language = "es")
```

```
## # A tibble: 308 x 2
##    word  lexicon 
##    <chr> <chr>   
##  1 de    snowball
##  2 la    snowball
##  3 que   snowball
##  4 el    snowball
##  5 en    snowball
##  6 y     snowball
##  7 a     snowball
##  8 los   snowball
##  9 del   snowball
## 10 se    snowball
## # … with 298 more rows
```

---

## Various lexicons

See `?get_stopwords` for more info.

```r
get_stopwords(source = "smart")
```

```
## # A tibble: 571 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           smart  
##  2 a's         smart  
##  3 able        smart  
##  4 about       smart  
##  5 above       smart  
##  6 according   smart  
##  7 accordingly smart  
##  8 across      smart  
##  9 actually    smart  
## 10 after       smart  
## # … with 561 more rows
```

---

### What are the most common words?

```r
hamilton_lyrics %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(word) %>%
  arrange(desc(n))
```

```
## # A tibble: 2,591 x 2
##    word         n
##    <chr>    <int>
##  1 da          89
##  2 wait        81
##  3 time        79
##  4 hamilton    77
##  5 hey         71
##  6 room        71
##  7 burr        63
##  8 shot        58
##  9 sir         56
## 10 man         51
## # … with 2,581 more rows
```
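
---

### Removing filler words

Syllables such as "da" and "hey" still rank near the top after the SMART stop words are removed. One possible next step, sketched here with a hypothetical `custom_stop_words` table and the top five counts copied from the output above, is a second `anti_join()`.

```r
library(tibble)
library(dplyr)

# Top five counts copied from the output above
top_words <- tibble(
  word = c("da", "wait", "time", "hamilton", "hey"),
  n    = c(89L, 81L, 79L, 77L, 71L)
)

# Hypothetical, hand-picked list of filler syllables to drop
custom_stop_words <- tibble(word = c("da", "hey"))

cleaned <- top_words %>%
  anti_join(custom_stop_words, by = "word")

cleaned
```

Which words count as filler is a judgment call that depends on the question being asked.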

---

### What are the most common words?

```r
hamilton_lyrics %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  top_n(20, n) %>%
  ggplot(aes(fct_reorder(word, n), n)) +
    geom_col() +
    coord_flip() + 
    theme_minimal() +
    labs(title = "Frequency of Hamilton lyrics",
         y = "",
         x = "")
```

---

![](14b-text-analysis_files/figure-html/unnamed-chunk-15-1.png)

---

## Sentiment analysis

- One way to analyze the sentiment of a text is to treat the text as a combination of its individual words.

- The sentiment of the whole text is then the sum of the sentiment scores of its individual words.
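
---

## Sentiment as a sum of word scores

The idea of summing word-level sentiment can be sketched with a tiny, made-up lexicon. The scores in `toy_lexicon` are hypothetical and are not taken from any published lexicon. Words without a score are dropped by the `inner_join()`.

```r
library(tibble)
library(dplyr)

# Hypothetical word scores, for illustration only
toy_lexicon <- tibble(
  word  = c("love", "nice", "helpless", "abandon"),
  score = c(3, 2, -2, -2)
)

# Two already-tokenized lines, one row per word
tokens <- tibble(
  line = c(1, 1, 1, 2, 2, 2),
  word = c("i", "love", "you", "helpless", "and", "abandon")
)

# Sentiment of each line = sum of the scores of its scored words
line_sentiment <- tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(line) %>%
  summarise(sentiment = sum(score))

line_sentiment
```

Line 1 scores 3 (only "love" is in the lexicon) and line 2 scores -4 ("helpless" plus "abandon").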

---

## Sentiment lexicons

.pull-left[

```r
get_sentiments("afinn")
```

```
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # … with 2,466 more rows
```
]
.pull-right[

```r
get_sentiments("bing") 
```

```
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # … with 6,778 more rows
```
]

---

## Sentiment lexicons

.pull-left[

```r
get_sentiments("nrc")
```

```
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # … with 13,891 more rows
```
]
.pull-right[

```r
get_sentiments("loughran") 
```

```
## # A tibble: 4,149 x 2
##    word         sentiment
##    <chr>        <chr>    
##  1 abandon      negative 
##  2 abandoned    negative 
##  3 abandoning   negative 
##  4 abandonment  negative 
##  5 abandonments negative 
##  6 abandons     negative 
##  7 abdicated    negative 
##  8 abdicates    negative 
##  9 abdicating   negative 
## 10 abdication   negative 
## # … with 4,139 more rows
```
]

---

## Sentiments in Hamilton lyrics

```r
hamilton_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  arrange(desc(n))
```

```
## # A tibble: 482 x 3
##    sentiment word          n
##    <chr>     <chr>     <int>
##  1 positive  like         73
##  2 positive  work         46
##  3 positive  right        44
##  4 positive  whoa         42
##  5 positive  well         38
##  6 positive  satisfied    35
##  7 negative  helpless     32
##  8 positive  enough       31
##  9 positive  nice         24
## 10 positive  love         20
## # … with 472 more rows
```
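
---

## Net sentiment per line

Counting sentiment words can be taken one step further by subtracting negative from positive counts. This is a minimal sketch on three toy lines built from words in the output above, assuming `tidyr::pivot_wider()` is available. `toy_lines` and `net_sentiment` are illustrative names.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(tidytext)

# Toy lines built from words that appear in the bing output above
toy_lines <- tibble(
  line = 1:3,
  text = c("I am helpless", "love is nice", "satisfied")
)

net_sentiment <- toy_lines %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(line, sentiment) %>%
  # one column per sentiment; lines with no words of a kind get 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

net_sentiment
```

The same pipeline could be grouped by `track_title` instead of `line` to compare songs.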

---

## Visualizing sentiments

```r
hamilton_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(sentiment, word) %>%
  arrange(desc(n)) %>%
  group_by(sentiment) %>%
  top_n(10, n) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) +
    geom_col() +
    coord_flip() +
    facet_wrap(~ sentiment, scales = "free_y") +
    theme_minimal() +
    labs(title = "Sentiments in Hamilton Lyrics",
         x = "")
```

---

![](14b-text-analysis_files/figure-html/unnamed-chunk-22-1.png)

---

## Hamilton word cloud

```r
library(wordcloud)
set.seed(04252019)

hamilton_lyrics %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```

---

## Hamilton word cloud

![](14b-text-analysis_files/figure-html/unnamed-chunk-24-1.png)

---

## Hamilton sentiment word cloud

```r
library(reshape2)
set.seed(04252019)

hamilton_lyrics %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(max.words = 100)
```

---

## Hamilton sentiment word cloud

![](14b-text-analysis_files/figure-html/unnamed-chunk-26-1.png)

---

### What do Beyonce, Ariana Grande, and Taylor Swift have in common?

![](img/14b/beyonce.jpg)
![](img/14b/ariana.jpg)
![](img/14b/tswift.jpg)

---

## Get the data

Get data from three artists:

```r
beyonce <- genius_album(artist = "Beyonce", album = "Lemonade") %>%
  mutate(artist = "Beyonce", album = "Lemonade")

ariana <- genius_album(artist = "Ariana Grande", album = "thank you, next") %>%
  mutate(artist = "Ariana Grande", album = "thank you, next")

tswift <- genius_album(artist = "Taylor Swift", album = "reputation") %>%
  mutate(artist = "Taylor Swift", album = "reputation")
```

Combine data:

```r
ldoc <- bind_rows(beyonce, ariana, tswift)
```

---

## LDOC lyrics

```r
ldoc_lyrics <- ldoc %>%
  unnest_tokens(word, lyric)
```

---

## Common lyrics

Without stop words:

```r
ldoc_lyrics %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(artist, word, sort = TRUE) # alternative way to sort
```

```
## # A tibble: 1,776 x 3
##    artist        word      n
##    <chr>         <chr> <int>
##  1 Ariana Grande yeah    194
##  2 Beyonce       love     82
##  3 Taylor Swift  di       81
##  4 Taylor Swift  made     53
##  5 Beyonce       slay     49
##  6 Taylor Swift  call     46
##  7 Ariana Grande eh       42
##  8 Ariana Grande love     41
##  9 Taylor Swift  ooh      39
## 10 Ariana Grande i'ma     37
## # … with 1,766 more rows
```

---

## Common lyrics

With stop words:

```r
ldoc_lyrics_counts <- ldoc_lyrics %>%
  count(artist, word, sort = TRUE)
```

---

## What is a document about?

- Term frequency
- Inverse document frequency

`$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$`

tf-idf is about comparing **documents** within a **collection**.
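
---

## tf-idf by hand

The formula above can be checked on a toy collection of three documents. This sketch assumes term frequency is a word's count divided by the total word count of its document, and multiplies it by the idf defined above. The counts and the document labels `A`, `B`, `C` are made up.

```r
library(tibble)
library(dplyr)

# Made-up word counts for three documents
counts <- tibble(
  artist = c("A", "A", "B", "B", "C", "C"),
  word   = c("you", "slay", "you", "eh", "you", "di"),
  n      = c(3, 1, 2, 2, 1, 3)
)

n_docs <- n_distinct(counts$artist)

by_hand <- counts %>%
  group_by(artist) %>%
  mutate(tf = n / sum(n)) %>%                        # share of the document
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(artist))) %>% # ln(docs / docs with term)
  ungroup() %>%
  mutate(tf_idf = tf * idf)

by_hand
```

"you" appears in all three documents, so its idf is ln(3/3) = 0 and its tf-idf is 0 regardless of its count. A word found in only one of three documents has idf ln(3), about 1.10.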

---

## Calculating tf-idf

This is not that exciting... What's the issue?

```r
ldoc_words <- ldoc_lyrics_counts %>%
  bind_tf_idf(word, artist, n)

ldoc_words
```

```
## # A tibble: 2,443 x 6
##    artist        word      n     tf   idf tf_idf
##    <chr>         <chr> <int>  <dbl> <dbl>  <dbl>
##  1 Taylor Swift  you     322 0.0441     0      0
##  2 Taylor Swift  i       294 0.0403     0      0
##  3 Ariana Grande i       241 0.0503     0      0
##  4 Beyonce       i       236 0.0514     0      0
##  5 Beyonce       you     233 0.0507     0      0
##  6 Ariana Grande you     230 0.0480     0      0
##  7 Taylor Swift  the     226 0.0310     0      0
##  8 Taylor Swift  it      212 0.0291     0      0
##  9 Ariana Grande yeah    194 0.0405     0      0
## 10 Taylor Swift  me      190 0.0260     0      0
## # … with 2,433 more rows
```

---

## Re-calculating tf-idf

```r
ldoc_words %>%
  bind_tf_idf(word, artist, n) %>%
  arrange(-tf_idf)
```

```
## # A tibble: 2,443 x 6
##    artist        word      n      tf   idf  tf_idf
##    <chr>         <chr> <int>   <dbl> <dbl>   <dbl>
##  1 Taylor Swift  di       81 0.0111   1.10 0.0122 
##  2 Beyonce       slay     49 0.0107   1.10 0.0117 
##  3 Beyonce       okay     42 0.00914  1.10 0.0100 
##  4 Ariana Grande eh       42 0.00876  1.10 0.00962
##  5 Ariana Grande thank    39 0.00814  1.10 0.00894
##  6 Beyonce       daddy    27 0.00588  1.10 0.00646
##  7 Ariana Grande space    24 0.00501  1.10 0.00550
##  8 Taylor Swift  ha       34 0.00466  1.10 0.00512
##  9 Taylor Swift  da       30 0.00411  1.10 0.00452
## 10 Beyonce       catch    16 0.00348  1.10 0.00383
## # … with 2,433 more rows
```

---

![](14b-text-analysis_files/figure-html/unnamed-chunk-34-1.png)

---

### `tidy`ing The Mueller Report

See more at [Using R To Analyze The Redacted Mueller Report](https://www.jlukito.com/blog/2019/4/20/using-r-to-analyze-the-redacted-mueller-report)

---

## Extra practice

Use the **tidytext** project in RStudio Cloud for additional practice with song lyrics and books.

---

## Acknowledgements

- Julia Silge: https://github.com/juliasilge/tidytext-tutorial

- Julia Silge and David Robinson: https://www.tidytextmining.com/

- Josiah Parry: https://github.com/JosiahParry/geniusR

---

### Congrats on completing STA 199 🎉