Visualizing text data

Follow along with the code and results presented in the following section. We will be exploring song lyrics from two of the most critically acclaimed albums released in 2019: Ariana Grande's thank u, next and The National's I Am Easy to Find. Try to reproduce these results on your own before the team portion (there is no need to include them in your final write-up).

Packages

Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Learn more at https://www.tidytextmining.com/. In addition to the tidyverse, we will be using a few other packages. Run the following code to load them; you may need to install them first with install.packages():

library(tidyverse)
library(tidytext)
library(genius)
library(reshape2)

Tidy text

Remember that tidy data has a specific structure: each observation corresponds to a row, each variable corresponds to a column, and each type of observational unit corresponds to a table. Tidy text is text data in a specific format: a table with a single token in each row. A token is a meaningful unit of text, such as a word, a pair of words, or a sentence. To convert raw text into tokens, we tokenize it in R by "un-nesting" the tokens.

Let's go from raw text to a tidy text dataset. For illustration, let's take the first half of the pre-chorus of thank u, next (the song). Follow along in your own RStudio Cloud instance, comparing and contrasting the format of the data in each chunk.

First, the raw lyrics, read into R as a vector consisting of four strings:

lyrics <- c("One taught me love",
            "One taught me patience",
            "And one taught me pain",
            "Now, I'm so amazing")
lyrics
## [1] "One taught me love"     "One taught me patience" "And one taught me pain"
## [4] "Now, I'm so amazing"

Now a tidy tibble, with two variables, line and text, corresponding to the line number and the text of each line:

text_df <- tibble(line = 1:4, text = lyrics)
text_df
## # A tibble: 4 x 2
##    line text                  
##   <int> <chr>                 
## 1     1 One taught me love    
## 2     2 One taught me patience
## 3     3 And one taught me pain
## 4     4 Now, I'm so amazing

Now let’s unnest the tokens:

text_df2 <- text_df %>% 
  unnest_tokens(word, text)
text_df2
## # A tibble: 17 x 2
##     line word    
##    <int> <chr>   
##  1     1 one     
##  2     1 taught  
##  3     1 me      
##  4     1 love    
##  5     2 one     
##  6     2 taught  
##  7     2 me      
##  8     2 patience
##  9     3 and     
## 10     3 one     
## 11     3 taught  
## 12     3 me      
## 13     3 pain    
## 14     4 now     
## 15     4 i'm     
## 16     4 so      
## 17     4 amazing
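
As an aside, unnest_tokens() can produce tokens other than single words. For example, passing token = "ngrams" and n = 2 tokenizes into pairs of consecutive words (bigrams). A quick sketch, reusing text_df from above:

# each row becomes a two-word token: "one taught", "taught me", ...
text_df %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2)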

Getting data

We'll use the genius package (written by Josiah Parry) to scrape song lyric data from Genius. The function genius_album() lets us obtain lyrics for an entire album in a tidy format. We must specify the artist and album; if there are issues, check that the album name and artist match how they appear on Genius.

ariana <- genius_album(
  artist = "Ariana Grande", 
  album = "thank u, next"
)
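
A practical note: genius_album() scrapes the web, so it can be slow, and it may fail if Genius is unreachable or the page layout changes. One option (not required) is to cache the result locally with readr, which is loaded with the tidyverse; the file path below is just an example:

# save a local copy once, then reload it instead of re-scraping
write_csv(ariana, "data/ariana.csv")
ariana <- read_csv("data/ariana.csv")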

We see that the output is a tidy data frame. Treating words as tokens, unnest them and save the output as a new tidy data frame. What does this data frame look like? You can ignore the character <U+200B> in the track titles; it is a Unicode zero-width space left over from the scraping process.

ariana_lyrics <- ariana %>%
  unnest_tokens(word, lyric)
ariana_lyrics
## # A tibble: 4,793 x 4
##    track_n  line track_title word  
##      <int> <int> <chr>       <chr> 
##  1       1     1 <U+200B>imagine     step  
##  2       1     1 <U+200B>imagine     up    
##  3       1     1 <U+200B>imagine     the   
##  4       1     1 <U+200B>imagine     two   
##  5       1     1 <U+200B>imagine     of    
##  6       1     1 <U+200B>imagine     us    
##  7       1     1 <U+200B>imagine     nobody
##  8       1     1 <U+200B>imagine     knows 
##  9       1     1 <U+200B>imagine     us    
## 10       1     2 <U+200B>imagine     get   
## # ... with 4,783 more rows
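
If the zero-width space bothers you, one way to remove it is with stringr (also loaded with the tidyverse); "\u200b" is R's escape sequence for that character:

# strip the zero-width space from the track titles
ariana_lyrics <- ariana_lyrics %>%
  mutate(track_title = str_remove_all(track_title, "\u200b"))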

Stop words

Let's take a look at the most common words in ariana_lyrics:

ariana_lyrics %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 685 x 2
##    word      n
##    <chr> <int>
##  1 i       241
##  2 you     232
##  3 yeah    193
##  4 it      163
##  5 a        89
##  6 i'm      86
##  7 me       83
##  8 the      76
##  9 and      74
## 10 my       71
## # ... with 675 more rows

What do you notice?

Stop words are words that are filtered out before or after processing text data. They usually refer to the most common words in a language, but there is no single list of stop words used by all natural language processing tools (see ?get_stopwords for more information).

Let's take a look at one such list: the 571 stop words in the "smart" lexicon:

get_stopwords(source = "smart")
## # A tibble: 571 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           smart  
##  2 a's         smart  
##  3 able        smart  
##  4 about       smart  
##  5 above       smart  
##  6 according   smart  
##  7 accordingly smart  
##  8 across      smart  
##  9 actually    smart  
## 10 after       smart  
## # ... with 561 more rows
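
Other lexicons are available as well. For comparison, here is how much shorter the default "snowball" lexicon is:

nrow(get_stopwords(source = "snowball"))  # the default lexicon; considerably shorter
nrow(get_stopwords(source = "smart"))     # 571, as above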

Now let's save the smart stop words in a new vector called stopwords. Notice the use of pull() to extract a vector rather than a data frame; we only need the words themselves, so that we can exclude them from our analysis:

stopwords <- get_stopwords(source = "smart") %>% 
  select(word) %>% 
  pull()

Now let's look at the most common words in thank u, next with stop words removed (we'll do the same for I Am Easy to Find shortly). Pay attention to the way in which we've filtered:

ariana_lyrics %>%
  filter(!(word %in% stopwords)) %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 458 x 2
##    word           n
##    <chr>      <int>
##  1 yeah         193
##  2 eh            42
##  3 love          41
##  4 i'ma          37
##  5 girlfriend    31
##  6 imagine       30
##  7 forget        27
##  8 make          24
##  9 space         24
## 10 bad           19
## # ... with 448 more rows
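
For reference, the same result can be obtained with dplyr's anti_join(), which drops every row of ariana_lyrics whose word appears in the stop word data frame, with no intermediate vector needed:

ariana_lyrics %>%
  anti_join(get_stopwords(source = "smart"), by = "word") %>%
  count(word) %>%
  arrange(desc(n))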

Let’s save the top 20 most commonly used words (with stop words taken out) as a new data frame, ariana_top20_words:

ariana_top20_words <- ariana_lyrics %>%
  filter(!(word %in% stopwords)) %>%
  count(word) %>%
  arrange(desc(n)) %>% 
  top_n(20)
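
One subtlety: top_n() keeps ties, so the result may contain more than 20 rows if several words share the 20th-highest count. In newer versions of dplyr, slice_max() supersedes top_n() and makes tie handling explicit; a sketch:

# equivalent, with ties dropped explicitly
ariana_top20_words <- ariana_lyrics %>%
  filter(!(word %in% stopwords)) %>%
  count(word) %>%
  slice_max(n, n = 20, with_ties = FALSE)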

Visualizations

First, let's create a bar chart of Ariana Grande's top 20 most commonly used words:

# Note: axis labels are "backwards" due to coord_flip()
ggplot(data = ariana_top20_words, 
       mapping = aes(fct_reorder(word, n), n)) +
  geom_col() +
  coord_flip() + 
  theme_minimal() +
  labs(title = "Ariana Grande loves the word 'yeah'",
       y = "Count",
       x = "Words")

Now let's compare this bar chart to the top 20 most commonly used words from I Am Easy to Find:

national <- genius_album(
  artist = "The National",
  album = "I Am Easy To Find"
)

national_top20_words <- national %>%
  unnest_tokens(word, lyric) %>%
  filter(!(word %in% stopwords)) %>% 
  count(word) %>%
  arrange(desc(n)) %>%
  top_n(20)

ggplot(data = national_top20_words, 
       mapping = aes(fct_reorder(word, n), n)) +
  geom_col() +
  coord_flip() + 
  theme_minimal() +
  labs(title = "The National's lyrics suggest quiet introspection",
       y = "Count",
       x = "Words")

  1. Each team member: select an album of your choice and create a similar visualization.
  2. Examine the visualizations created by your team. Are there any interesting patterns, usages, or comparisons?