class: center, middle, inverse, title-slide # Text analysis
📄 --- layout: true <div class="my-footer"> <span> Dr. Mine Çetinkaya-Rundel - <a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01 </a> </span> </div> --- ## Announcements - Please fill out the course (and TA) evaluations. - A little incentive — if everyone in the class turns in their course evaluation, you will each get 5 points added to your lowest non-dropped HW score. --- ## Tips for your final project .center[ <iframe width="560" height="315" src="https://www.youtube.com/embed/vGUNqq3jVLg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> ] --- class: center, middle # Text analysis --- ## Text analysis - Text mining (text analysis) is the process of deriving high-quality information from text. - Typical text mining tasks include - text categorization - text clustering - concept/entity extraction - production of granular taxonomies - sentiment analysis - document summarization - entity relation modeling (i.e., learning relations between named entities) --- ## Tidytext .pull-left[ - Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools we've already learned about - **Tidy text** = one-**token**-per-row - A token can be a word or an n-gram (e.g. bigram = 2 words) - Learn more at [tidytextmining.com](https://www.tidytextmining.com/) ] .pull-right[ .center[  ] ] ```r library(tidytext) ``` --- ## What is tidy text? 
```r text <- c("Take me out tonight", "Where there's music and there's people", "And they're young and alive", "Driving in your car", "I never never want to go home", "Because I haven't got one", "Anymore") text ``` ``` ## [1] "Take me out tonight" ## [2] "Where there's music and there's people" ## [3] "And they're young and alive" ## [4] "Driving in your car" ## [5] "I never never want to go home" ## [6] "Because I haven't got one" ## [7] "Anymore" ``` --- ## What is tidy text? ```r text_df <- tibble(line = 1:7, text = text) text_df ``` ``` ## # A tibble: 7 x 2 ## line text ## <int> <chr> ## 1 1 Take me out tonight ## 2 2 Where there's music and there's people ## 3 3 And they're young and alive ## 4 4 Driving in your car ## 5 5 I never never want to go home ## 6 6 Because I haven't got one ## 7 7 Anymore ``` --- ## What is tidy text? ```r text_df %>% unnest_tokens(word, text) ``` ``` ## # A tibble: 32 x 2 ## line word ## <int> <chr> ## 1 1 take ## 2 1 me ## 3 1 out ## 4 1 tonight ## 5 2 where ## 6 2 there's ## 7 2 music ## 8 2 and ## 9 2 there's ## 10 2 people ## # ... with 22 more rows ``` --- ## Ex. Trump's tweets **Text analysis of Trump's tweets confirms he writes only the (angrier) Android half**, by David Robinson .center[  ] .small[ [varianceexplained.org/r/trump-tweets/](http://varianceexplained.org/r/trump-tweets/) ] --- class: center, middle  --- ## Let's get more data We'll use the `geniusR` package to get song lyric data from [Genius](https://genius.com/). ```r library(geniusR) # https://github.com/JosiahParry/geniusR ``` - `genius_album()` allows you to download the lyrics for an entire album in a tidy format. - Input: Two arguments, `artist` and `album`. Supply the quoted names of the artist and the album (if it gives you issues, check that the album name and artist are spelled exactly as they appear on [Genius](https://genius.com/)). - Output: A tidy data frame with four columns: - `track_title`: track name - `track_n`: track number - `line`: line number within the track - `lyric`: a line of lyrics --- ## Hamilton!
```r hamilton <- genius_album( artist = "Lin-Manuel Miranda", album = "Hamilton: An American Musical (Off-Broadway)" ) ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` ```r glimpse(hamilton) ``` ``` ## Observations: 4,215 ## Variables: 4 ## $ track_title <chr> "Alexander Hamilton (Off-Broadway) (Ft. Anthony Ra... ## $ track_n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ lyric <chr> "How does a bastard, orphan, son of a whore and a"... ## $ line <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ``` --- ## Save for later ```r hamilton <- hamilton %>% mutate( album = "Hamilton", artist = "Lin-Manuel Miranda" ) ``` --- ## How many songs are in the album? ```r hamilton %>% distinct(track_title) %>% nrow() ``` ``` ## [1] 52 ``` --- ## How long are Hamilton songs? Length measured by number of lines ```r hamilton %>% count(track_title) %>% arrange(desc(n)) %>% select(n, track_title) ``` ``` ## # A tibble: 52 x 2 ## n track_title ## <int> <chr> ## 1 268 Non-Stop (Off-Broadway) (Ft. Isaiah Johnson, Leslie Odom Jr., Ph… ## 2 214 My Shot (Off-Broadway) (Ft. Anthony Ramos, Daveed Diggs, Leslie … ## 3 190 The Room Where It Happens (Off-Broadway) (Ft. Daveed Diggs, Lesl… ## 4 188 Satisfied (Off-Broadway) (Ft. Anthony Ramos, Phillipa Soo & René… ## 5 168 Helpless (Off-Broadway) (Ft. Phillipa Soo & Renée Elise Goldsber… ## 6 149 Right Hand Man (Off-Broadway) (Ft. Anthony Ramos, Daveed Diggs, … ## 7 146 Wait For it (Off-Broadway) (Ft. Leslie Odom Jr.) ## 8 143 Say No to This (Off-Broadway) (Ft. Ciara Renee, Phillipa Soo, Re… ## 9 137 Take a Break (Off-Broadway) (Ft. Anthony Ramos, Phillipa Soo & R… ## 10 124 The Election of 1800 (Off-Broadway) ## # ... with 42 more rows ``` --- ## Tidy up your lyrics! 
.small[ ```r hamilton_lyrics <- hamilton %>% unnest_tokens(word, lyric) hamilton_lyrics ``` ``` ## # A tibble: 22,611 x 6 ## track_title track_n line album artist word ## <chr> <int> <int> <chr> <chr> <chr> ## 1 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… how ## 2 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… does ## 3 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 4 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… bast… ## 5 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… orph… ## 6 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… son ## 7 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… of ## 8 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 9 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… whore ## 10 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… and ## # ... with 22,601 more rows ``` ] --- ## What are the most common words? ```r hamilton_lyrics %>% count(word) %>% arrange(desc(n)) ``` ``` ## # A tibble: 3,132 x 2 ## word n ## <chr> <int> ## 1 the 922 ## 2 i 655 ## 3 you 631 ## 4 to 570 ## 5 a 499 ## 6 and 418 ## 7 in 350 ## 8 it 325 ## 9 of 303 ## 10 my 287 ## # ... with 3,122 more rows ``` --- ## Stop words - **Stop words** are words which are filtered out before or after processing of natural language data (text) - They usually refer to the most common words in a language, but there is not a single list of stop words used by all natural language processing tools --- ## Spanish stop words ```r get_stopwords(language = "es") ``` ``` ## # A tibble: 308 x 2 ## word lexicon ## <chr> <chr> ## 1 de snowball ## 2 la snowball ## 3 que snowball ## 4 el snowball ## 5 en snowball ## 6 y snowball ## 7 a snowball ## 8 los snowball ## 9 del snowball ## 10 se snowball ## # ... with 298 more rows ``` --- ## Various lexicons See `?get_stopwords` for more info. 
```r get_stopwords(source = "smart") ``` ``` ## # A tibble: 571 x 2 ## word lexicon ## <chr> <chr> ## 1 a smart ## 2 a's smart ## 3 able smart ## 4 about smart ## 5 above smart ## 6 according smart ## 7 accordingly smart ## 8 across smart ## 9 actually smart ## 10 after smart ## # ... with 561 more rows ``` --- ## What are the most common words? ```r hamilton_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% count(word) %>% arrange(desc(n)) ``` ``` ## Joining, by = "word" ``` ``` ## # A tibble: 2,762 x 2 ## word n ## <chr> <int> ## 1 da 103 ## 2 hey 95 ## 3 wait 83 ## 4 hamilton 72 ## 5 room 67 ## 6 burr 66 ## 7 time 66 ## 8 alexander 58 ## 9 sir 54 ## 10 man 48 ## # ... with 2,752 more rows ``` --- ## What are the most common words? ```r hamilton_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% count(word) %>% arrange(desc(n)) %>% top_n(20) %>% ggplot(aes(fct_reorder(word, n), n)) + geom_col() + coord_flip() + theme_minimal() + labs(title = "Frequency of Hamilton lyrics", subtitle = "Da da da dat da dat da da da da ya da", y = "", x = "") ``` --- <!-- --> --- ## Sentiment analysis - One way to analyze the sentiment of a text is to consider the text as a combination of its individual words - and the sentiment content of the whole text as the sum of the sentiment content of the individual words --- ## Sentiment lexicons .pull-left[ ```r get_sentiments("afinn") ``` ``` ## # A tibble: 2,476 x 2 ## word score ## <chr> <int> ## 1 abandon -2 ## 2 abandoned -2 ## 3 abandons -2 ## 4 abducted -2 ## 5 abduction -2 ## 6 abductions -2 ## 7 abhor -3 ## 8 abhorred -3 ## 9 abhorrent -3 ## 10 abhors -3 ## # ... 
with 2,466 more rows ``` ] .pull-right[ ```r get_sentiments("bing") ``` ``` ## # A tibble: 6,788 x 2 ## word sentiment ## <chr> <chr> ## 1 2-faced negative ## 2 2-faces negative ## 3 a+ positive ## 4 abnormal negative ## 5 abolish negative ## 6 abominable negative ## 7 abominably negative ## 8 abominate negative ## 9 abomination negative ## 10 abort negative ## # ... with 6,778 more rows ``` ] --- ## Sentiment lexicons .pull-left[ ```r get_sentiments("nrc") ``` ``` ## # A tibble: 13,901 x 2 ## word sentiment ## <chr> <chr> ## 1 abacus trust ## 2 abandon fear ## 3 abandon negative ## 4 abandon sadness ## 5 abandoned anger ## 6 abandoned fear ## 7 abandoned negative ## 8 abandoned sadness ## 9 abandonment anger ## 10 abandonment fear ## # ... with 13,891 more rows ``` ] .pull-right[ ```r get_sentiments("loughran") ``` ``` ## # A tibble: 4,149 x 2 ## word sentiment ## <chr> <chr> ## 1 abandon negative ## 2 abandoned negative ## 3 abandoning negative ## 4 abandonment negative ## 5 abandonments negative ## 6 abandons negative ## 7 abdicated negative ## 8 abdicates negative ## 9 abdicating negative ## 10 abdication negative ## # ... with 4,139 more rows ``` ] --- ## Sentiments in Hamilton lyrics ```r hamilton_lyrics %>% inner_join(get_sentiments("bing")) %>% count(sentiment, word) %>% arrange(desc(n)) ``` ``` ## Joining, by = "word" ``` ``` ## # A tibble: 540 x 3 ## sentiment word n ## <chr> <chr> <int> ## 1 positive like 82 ## 2 positive right 45 ## 3 positive work 40 ## 4 negative helpless 34 ## 5 positive satisfied 34 ## 6 positive well 34 ## 7 positive enough 33 ## 8 positive nice 33 ## 9 negative shit 21 ## 10 positive love 19 ## # ... 
with 530 more rows ``` --- ## Visualizing sentiments <!-- --> --- ## Visualizing sentiments ```r hamilton_lyrics %>% inner_join(get_sentiments("bing")) %>% count(sentiment, word) %>% arrange(desc(n)) %>% group_by(sentiment) %>% top_n(10) %>% ungroup() %>% ggplot(aes(fct_reorder(word, n), n, fill = sentiment)) + geom_col() + coord_flip() + facet_wrap(~ sentiment, scales = "free") + theme_minimal() + labs(title = "Sentiments in Hamilton lyrics", x = "") + guides(fill = FALSE) ``` --- class: center, middle # Comparing lyrics across artists --- ## Get more data Get data from two more artists: ```r american_teen <- genius_album(artist = "Khalid", album = "American Teen") %>% mutate(artist = "Khalid", album = "American Teen") ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` ```r sinister <- genius_album(artist = "Belle and Sebastian", album = "If You're Feeling Sinister") %>% mutate(artist = "Belle and Sebastian", album = "If You're Feeling Sinister") ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` ```r drones <- genius_album(artist = "Muse", album = "Drones") %>% mutate(artist = "Muse", album = "Drones") ``` ``` ## Joining, by = c("track_title", "track_n", "track_url") ``` --- ## Combine data: .pull-left[ ```r mixtape <- bind_rows( hamilton, american_teen, sinister, drones ) ``` ] .pull-right[  ] ```r glimpse(mixtape) ``` ``` ## Observations: 5,777 ## Variables: 6 ## $ track_title <chr> "Alexander Hamilton (Off-Broadway) (Ft. Anthony Ra... ## $ track_n <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,... ## $ lyric <chr> "How does a bastard, orphan, son of a whore and a"... ## $ line <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,... ## $ album <chr> "Hamilton", "Hamilton", "Hamilton", "Hamilton", "H... ## $ artist <chr> "Lin-Manuel Miranda", "Lin-Manuel Miranda", "Lin-M... 
``` --- ## All lyrics .small[ ```r mixtape_lyrics <- mixtape %>% unnest_tokens(word, lyric) mixtape_lyrics ``` ``` ## # A tibble: 31,952 x 6 ## track_title track_n line album artist word ## <chr> <int> <int> <chr> <chr> <chr> ## 1 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… how ## 2 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… does ## 3 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 4 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… bast… ## 5 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… orph… ## 6 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… son ## 7 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… of ## 8 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… a ## 9 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… whore ## 10 Alexander Hamilton (Off-Broadway… 1 1 Hamil… Lin-Manue… and ## # ... with 31,942 more rows ``` ] --- ## Common lyrics Without stop words: ```r mixtape_lyrics_counts <- mixtape_lyrics %>% anti_join(get_stopwords(source = "smart")) %>% filter( # stop words that weren't in the lexicon !(word %in% c("da", "dadadadada", "la", "ooh", "em")) ) %>% count(artist, word, sort = TRUE) # alternative way to sort ``` ``` ## Joining, by = "word" ``` --- ## What is a document about? - Term frequency - Inverse document frequency `$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$` tf-idf is about comparing **documents** within a **collection**. --- ## Calculating tf-idf This is not that exciting... What's the issue? 
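Before trusting `bind_tf_idf()`, the formula can be sanity-checked by hand. Here term frequency (tf) is a term's share of the tokens in a document, and idf follows the formula above. A minimal sketch in base R; the two "documents" and their words below are made up for illustration:

```r
# Toy collection: two hypothetical "documents" (think: two artists' lyrics)
docs <- list(
  doc1 = c("love", "love", "night", "fire"),
  doc2 = c("night", "war", "war")
)
n_docs <- length(docs)

# tf(term, doc): share of the document's tokens that are this term
tf <- function(term, doc) sum(doc == term) / length(doc)

# idf(term) = ln(n_documents / n_documents containing the term)
idf <- function(term) {
  n_containing <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  log(n_docs / n_containing)
}

tf("love", docs$doc1) * idf("love")   # distinctive word: 0.5 * ln(2)
tf("night", docs$doc1) * idf("night") # appears in every document: idf = ln(1) = 0
```

A word that appears in every document gets idf = ln(1) = 0, so its tf-idf is 0 no matter how frequent it is; that is exactly what happens to words like "time", "love", and "man" in the output on the next slide.

---

## Calculating tf-idf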
```r mixtape_words <- mixtape_lyrics_counts %>% bind_tf_idf(word, artist, n) mixtape_words ``` ``` ## # A tibble: 4,001 x 6 ## artist word n tf idf tf_idf ## <chr> <chr> <int> <dbl> <dbl> <dbl> ## 1 Lin-Manuel Miranda hey 95 0.0112 0.288 0.00322 ## 2 Lin-Manuel Miranda wait 83 0.00977 0.693 0.00677 ## 3 Lin-Manuel Miranda hamilton 72 0.00847 1.39 0.0117 ## 4 Lin-Manuel Miranda room 67 0.00788 0.288 0.00227 ## 5 Lin-Manuel Miranda burr 66 0.00777 1.39 0.0108 ## 6 Lin-Manuel Miranda time 66 0.00777 0 0 ## 7 Lin-Manuel Miranda alexander 58 0.00683 1.39 0.00946 ## 8 Khalid love 57 0.0405 0 0 ## 9 Lin-Manuel Miranda sir 54 0.00635 0.693 0.00440 ## 10 Lin-Manuel Miranda man 48 0.00565 0 0 ## # ... with 3,991 more rows ``` --- ## Sorting tf-idf ```r mixtape_words %>% arrange(-tf_idf) ``` ``` ## # A tibble: 4,001 x 6 ## artist word n tf idf tf_idf ## <chr> <chr> <int> <dbl> <dbl> <dbl> ## 1 Belle and Sebastian track 19 0.0219 1.39 0.0304 ## 2 Belle and Sebastian stars 18 0.0208 1.39 0.0288 ## 3 Khalid turning 29 0.0206 1.39 0.0286 ## 4 Muse drones 15 0.0183 1.39 0.0253 ## 5 Muse psycho 14 0.0171 1.39 0.0236 ## 6 Belle and Sebastian judy 14 0.0162 1.39 0.0224 ## 7 Khalid blooded 18 0.0128 1.39 0.0177 ## 8 Belle and Sebastian snow 11 0.0127 1.39 0.0176 ## 9 Khalid promise 32 0.0227 0.693 0.0158 ## 10 Muse revolt 9 0.0110 1.39 0.0152 ## # ... with 3,991 more rows ``` --- <!-- --> --- class: center, middle  --- ## Get the data ```r library(gutenbergr) im <- gutenberg_download(5230) ```  --- ```r tidy_im <- im %>% mutate(line = row_number()) %>% unnest_tokens(word, text) %>% anti_join(stop_words) ``` ``` ## Joining, by = "word" ``` ```r tidy_im ``` ``` ## # A tibble: 17,592 x 3 ## gutenberg_id line word ## <int> <int> <chr> ## 1 5230 1 invisible ## 2 5230 3 grotesque ## 3 5230 3 romance ## 4 5230 9 contents ## 5 5230 11 strange ## 6 5230 11 man's ## 7 5230 11 arrival ## 8 5230 12 ii ## 9 5230 12 teddy ## 10 5230 12 henfrey's ## # ...
with 17,582 more rows ``` --- ## Word frequency ```r tidy_im %>% count(word, sort = TRUE) ``` ``` ## # A tibble: 5,306 x 2 ## word n ## <chr> <int> ## 1 kemp 213 ## 2 invisible 180 ## 3 door 169 ## 4 hall 149 ## 5 marvel 114 ## 6 voice 92 ## 7 suddenly 79 ## 8 stood 78 ## 9 window 78 ## 10 heard 76 ## # ... with 5,296 more rows ``` --- ## Bigram frequency ```r im %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% count(bigram, sort = TRUE) %>% separate(bigram, c("word1", "word2"), sep = " ") %>% anti_join(stop_words, by = c("word1" = "word")) %>% anti_join(stop_words, by = c("word2" = "word")) ``` ``` ## # A tibble: 4,361 x 3 ## word1 word2 n ## <chr> <chr> <int> ## 1 thomas marvel 19 ## 2 dr kemp 15 ## 3 front door 11 ## 4 teddy henfrey 11 ## 5 dressing gown 10 ## 6 empty sleeve 9 ## 7 extra ordinary 9 ## 8 parlour door 9 ## 9 port stowe 9 ## 10 jolly cricketers 8 ## # ... with 4,351 more rows ``` --- ## Acknowledgements - Julia Silge: https://github.com/juliasilge/tidytext-tutorial - Julia Silge and David Robinson: https://www.tidytextmining.com/ - Josiah Parry: https://github.com/JosiahParry/geniusR