Follow along with the code and results presented in the following section. We will be exploring song lyrics from two of the most critically acclaimed albums released in 2019: Ariana Grande’s album thank u, next and The National’s album I Am Easy to Find. Try to reproduce these results on your own before the team portion (no need to include them in your final write-up).
Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Learn more at https://www.tidytextmining.com/. In addition to the tidyverse, we will be using a few other packages. Run the following code to load the needed packages. You may need to install.packages() them first:
library(tidyverse)
library(tidytext)
library(genius)
library(reshape2)
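If any of these packages are not yet installed, installing them once from the console should be enough. For example (assuming all four are available on CRAN, as they were when this was written):

# run once in the console; no need to include this in your write-up
install.packages(c("tidyverse", "tidytext", "genius", "reshape2"))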
Remember that tidy data has a specific structure: each observation corresponds to a row, each variable corresponds to a column, and each type of observational unit corresponds to a table. Tidy text is text data in a specific format: a table with a single token in each row. A token is a meaningful unit of text, such as a word, a pair of words, or a sentence. To convert raw text into this one-token-per-row format, we will tokenize it in R by “un-nesting” the tokens.
Let’s go from raw text to a tidy text dataset. For illustration, let’s take the first half of the pre-chorus of thank u, next (the song). Follow along in your own RStudio Cloud instance, comparing and contrasting the format of the data in each chunk.
First, the raw lyrics, read into R as a vector consisting of four strings:
lyrics <- c("One taught me love",
"One taught me patience",
"And one taught me pain",
"Now, I'm so amazing")
lyrics
## [1] "One taught me love" "One taught me patience" "And one taught me pain"
## [4] "Now, I'm so amazing"
Now a tidy tibble, where we have two variables, line and text, corresponding to the line number and the text of each line:
text_df <- tibble(line = 1:4, text = lyrics)
text_df
## # A tibble: 4 x 2
## line text
## <int> <chr>
## 1 1 One taught me love
## 2 2 One taught me patience
## 3 3 And one taught me pain
## 4 4 Now, I'm so amazing
Now let’s unnest the tokens. Note that by default, unnest_tokens() converts tokens to lowercase and strips punctuation:
text_df2 <- text_df %>%
  unnest_tokens(word, text)
text_df2
## # A tibble: 17 x 2
## line word
## <int> <chr>
## 1 1 one
## 2 1 taught
## 3 1 me
## 4 1 love
## 5 2 one
## 6 2 taught
## 7 2 me
## 8 2 patience
## 9 3 and
## 10 3 one
## 11 3 taught
## 12 3 me
## 13 3 pain
## 14 4 now
## 15 4 i'm
## 16 4 so
## 17 4 amazing
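As an aside (not needed for this lab), unnest_tokens() can also tokenize into units other than single words. For example, a sketch tokenizing the same lyrics into pairs of adjacent words (bigrams):

# tokenize into bigrams instead of single words
text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)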
We’ll use the genius package (written by Josiah Parry) to scrape song lyric data from Genius. The function genius_album() lets us obtain lyrics for an entire album in a tidy format. We must specify the artist and album (if there are issues, check that the artist and album names match how they appear on Genius).
ariana <- genius_album(
  artist = "Ariana Grande",
  album = "thank u, next"
)
We see that the output is a tidy data frame. Consider words as tokens, unnest them, and save the output as a new tidy data frame. What does this data frame look like? You can ignore the character <U+200B> (a Unicode zero-width space), as this is just an artifact of the data-scraping process.
ariana_lyrics <- ariana %>%
  unnest_tokens(word, lyric)
ariana_lyrics
## # A tibble: 4,793 x 4
## track_n line track_title word
## <int> <int> <chr> <chr>
## 1 1 1 <U+200B>imagine step
## 2 1 1 <U+200B>imagine up
## 3 1 1 <U+200B>imagine the
## 4 1 1 <U+200B>imagine two
## 5 1 1 <U+200B>imagine of
## 6 1 1 <U+200B>imagine us
## 7 1 1 <U+200B>imagine nobody
## 8 1 1 <U+200B>imagine knows
## 9 1 1 <U+200B>imagine us
## 10 1 2 <U+200B>imagine get
## # ... with 4,783 more rows
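If you would rather remove the zero-width space than ignore it, one optional way is to strip it with stringr (loaded as part of the tidyverse); a sketch:

# strip the zero-width space (U+200B) from the track titles
ariana_lyrics <- ariana_lyrics %>%
  mutate(track_title = str_remove_all(track_title, "\u200b"))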
Let’s take a look at the most common words in ariana_lyrics (we’ll do the same for The National’s album later):
ariana_lyrics %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 685 x 2
## word n
## <chr> <int>
## 1 i 241
## 2 you 232
## 3 yeah 193
## 4 it 163
## 5 a 89
## 6 i'm 86
## 7 me 83
## 8 the 76
## 9 and 74
## 10 my 71
## # ... with 675 more rows
What do you notice?
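As an aside, count() has a sort argument, so count(word, sort = TRUE) is shorthand for the count() plus arrange(desc(n)) pattern above:

# equivalent to count(word) %>% arrange(desc(n))
ariana_lyrics %>%
  count(word, sort = TRUE)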
Stop words are words that are filtered out before or after processing text data. They usually correspond to the most common words in a language, but there is no single list of stop words used by all natural language processing tools (run ?get_stopwords for more info).
Let’s take a look at one such list: the 571 stop words from the smart lexicon:
get_stopwords(source = "smart")
## # A tibble: 571 x 2
## word lexicon
## <chr> <chr>
## 1 a smart
## 2 a's smart
## 3 able smart
## 4 about smart
## 5 above smart
## 6 according smart
## 7 accordingly smart
## 8 across smart
## 9 actually smart
## 10 after smart
## # ... with 561 more rows
And let’s save them into a new vector called stopwords. Notice the use of the pull() function to save the result as a vector instead of a data frame; this is because we simply want a vector of words to exclude from our analysis:
stopwords <- get_stopwords(source = "smart") %>%
  select(word) %>%
  pull()
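Equivalently, pull() can take a column name directly, combining the select() and pull() steps into one:

# same result: extract the word column as a character vector
stopwords <- get_stopwords(source = "smart") %>%
  pull(word)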
Now let’s look at the most common words in thank u, next with stop words removed. Pay attention to the way in which we’ve filtered:
ariana_lyrics %>%
  filter(!(word %in% stopwords)) %>%
  count(word) %>%
  arrange(desc(n))
## # A tibble: 458 x 2
## word n
## <chr> <int>
## 1 yeah 193
## 2 eh 42
## 3 love 41
## 4 i'ma 37
## 5 girlfriend 31
## 6 imagine 30
## 7 forget 27
## 8 make 24
## 9 space 24
## 10 bad 19
## # ... with 448 more rows
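Another pattern you will often see in tidytext workflows is anti_join(), which keeps only the rows whose word does not appear in the stop word data frame. A sketch of the equivalent computation:

# remove stop words by anti-joining on the word column
ariana_lyrics %>%
  anti_join(get_stopwords(source = "smart"), by = "word") %>%
  count(word, sort = TRUE)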
Let’s save the top 20 most commonly used words (with stop words taken out) as a new data frame, ariana_top20_words:
ariana_top20_words <- ariana_lyrics %>%
  filter(!(word %in% stopwords)) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  top_n(20)
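Note that top_n() keeps ties, so this can occasionally return more than 20 rows. In newer versions of dplyr, top_n() has been superseded by slice_max(); a sketch of the equivalent (assuming dplyr 1.0.0 or later):

# same result with the newer slice_max() verb
ariana_lyrics %>%
  filter(!(word %in% stopwords)) %>%
  count(word) %>%
  slice_max(n, n = 20)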
Finally, let’s create a bar chart of her top 20 most commonly used words. We use fct_reorder(word, n) so the bars are ordered by count rather than alphabetically:
# Note: axis labels are "backwards" due to coord_flip()
ggplot(data = ariana_top20_words,
       mapping = aes(fct_reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(title = "Ariana Grande loves the word 'yeah'",
       y = "Count",
       x = "Words")
Now let’s compare this bar chart to the top 20 most commonly used words from I Am Easy to Find:
national <- genius_album(
  artist = "The National",
  album = "I Am Easy To Find"
)
national_top20_words <- national %>%
  unnest_tokens(word, lyric) %>%
  filter(!(word %in% stopwords)) %>%
  count(word) %>%
  arrange(desc(n)) %>%
  top_n(20)
ggplot(data = national_top20_words,
       mapping = aes(fct_reorder(word, n), n)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  labs(title = "The National's lyrics suggest quiet introspection",
       y = "Count",
       x = "Words")