class: center, middle, inverse, title-slide # Web scraping
🕸

---
layout: true

<div class="my-footer">
<span>
Dr. Mine Çetinkaya-Rundel -
<a href="http://www2.stat.duke.edu/courses/Fall18/sta112.01/schedule" target="_blank">stat.duke.edu/courses/Fall18/sta112.01
</a>
</span>
</div>

---
## Announcements

- Go to Sakai -> Tests & Quizzes -> MT 1 Reflection, complete by Thursday
- No IDC on Monday before Thanksgiving

---
class: center, middle

# Scraping the web

---
## Scraping the web: what? why?

- An increasing amount of data is available on the web

--

- These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors

--

- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset

--

- Two different scenarios:
    - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

---
class: center, middle

# Web Scraping with rvest

---
## Hypertext Markup Language

- Most of the data on the web is still largely available as HTML
- It is structured (hierarchical / tree based), but it's often not available in a form useful for analysis (flat / tidy).
```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---
## rvest

.pull-left[
- The **rvest** package makes basic processing and manipulation of HTML data straightforward
- It's designed to work with pipelines built with `%>%`
]

.pull-right[
<img src="img/rvest.png" width="230" style="display: block; margin: auto 0 auto auto;" />
]

---
## Core rvest functions

- `read_html` - Read HTML data from a URL or character string
- `html_node` - Select a specified node from an HTML document
- `html_nodes` - Select specified nodes from an HTML document
- `html_table` - Parse an HTML table into a data frame
- `html_text` - Extract the text content between tag pairs
- `html_name` - Extract tags' names
- `html_attrs` - Extract all of each tag's attributes
- `html_attr` - Extract tags' attribute value by name

---
## SelectorGadget

.pull-left[
- Open source tool that eases CSS selector generation and discovery
- Easiest to use with the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)
- Find out more in the [SelectorGadget vignette](https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html)
]

.pull-right[
<img src="img/selector-gadget.png" width="456" />
]

---
## Using the SelectorGadget

.pull-left[
- Click on the app logo next to the search bar
- A box will open in the bottom right of the website
]

.pull-right[
<img src="img/selector-gadget.gif" height="250" style="display: block; margin: auto;" />
]

--

- Click on a page element (it will turn green); SelectorGadget will generate a minimal CSS selector for that element and highlight (in yellow) everything that is matched by the selector

--

- Click on a highlighted element to remove it from the selector (it will turn red), or click on an unhighlighted element to add it to the selector

--

- Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your
needs

---
class: center, middle

# Top 250 movies on IMDB

---
## Top 250 movies on IMDB

Take a look at the source code, look for the `table` tag:
<br>
http://www.imdb.com/chart/top

---
## First check if you're allowed!

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## www.imdb.com No encoding supplied: defaulting to UTF-8.
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## www.facebook.com
```

```
## [1] FALSE
```

---
## Demo

<img src="img/demo.png" width="320" height="200" style="display: block; margin: auto;" />

<br><br>

.center[
Go to [rstudio.cloud](https://rstudio.cloud/spaces/3518/projects)

Make a copy of the project titled *Demo - Web scraping*

Open `scrape-250.R`
]

---
## Select and format pieces

.midi[
```r
library(tidyverse) # %>%, str_replace(), tibble()
library(rvest)     # read_html(), html_nodes(), html_text()

page <- read_html("http://www.imdb.com/chart/top")

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()

scores <- page %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()

imdb_top_250 <- tibble(
  title = titles,
  year = years,
  score = scores
)
```
]

---
<table> <thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> </tr> <tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.2 </td> </tr> <tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td
style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... 
</td> </tr> </tbody> </table>

---
## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Observations: 250
## Variables: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfat...
## $ year  <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 19...
## $ score <dbl> 9.2, 9.2, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8...
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(
    rank = 1:nrow(imdb_top_250)
  )
```

---
<table> <thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> <th style="text-align:left;"> rank </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;">
2003 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 7 </td> </tr> <tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 10 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 11 </td> </tr> <tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 12 </td> </tr> <tr> <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 14 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 15 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr> </tbody> </table> --- ## Analyze .question[ How would you go about answering this question: Which 1995 movies made the list? 
]

--

```r
imdb_top_250 %>%
  filter(year == 1995)
```

```
## # A tibble: 8 x 4
##   title               year score  rank
##   <chr>              <dbl> <dbl> <int>
## 1 Se7en               1995   8.6    21
## 2 The Usual Suspects  1995   8.5    26
## 3 Braveheart          1995   8.3    73
## 4 Toy Story           1995   8.3    91
## 5 Heat                1995   8.2   122
## 6 Casino              1995   8.2   143
## 7 Before Sunrise      1995   8.1   204
## 8 La Haine            1995   8     229
```

---
## Analyze

.question[
How would you go about answering this question: Which years have the most movies on the list?
]

--

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  1957     7
## 3  2000     6
## 4  2001     6
## 5  2003     6
```

---
## Visualize

.question[
How would you go about creating this visualization: Visualize the average yearly score for movies that made it on the top 250 list over time.
]

--

.small[
<!-- -->
]

---
## Potential challenges

- Unreliable formatting at the source
- Data broken into many pages
- ...

.question[
Compare the display of information at [raleigh.craigslist.org/search/apa](https://raleigh.craigslist.org/search/apa) to the IMDB top 250 list. What challenges can you foresee in scraping a list of the available apartments?
]

---
class: center, middle

# Application exercise

---
## <i class="fas fa-laptop"></i> AE 07 - Web scraping

- Clone your assignment repo in RStudio Cloud (`ae-07-web-scraping-TEAMNAME`)
- Open the R script called `scrape-tvshows.R`
- Scrape the names, scores, and years of the most popular TV shows on IMDB: [www.imdb.com/chart/tvmeter](http://www.imdb.com/chart/tvmeter)
- Create a data frame called `tvshows` with four variables (`rank`, `name`, `score`, `year`)
- Examine each of the **first three** TV shows to also obtain
    - Genre
    - Runtime
    - How many episodes so far
    - First five plot keywords
- Add this information to the `tvshows` data frame you created earlier
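
---
## Rehearsing the pipeline offline

Before scraping a live page for the exercise, the select-and-format steps from the IMDB demo can be rehearsed on a small HTML string. This is only a sketch under stated assumptions: the markup below is hypothetical and merely mimics the `.titleColumn a`, `.secondaryInfo`, and `#main strong` selectors used earlier, the names `html`, `page`, `movies`, and `older` are illustrative, and the three rows are copied from the top of the table shown in the demo. The real page may be structured differently, so selectors should be checked with SelectorGadget.

```r
library(rvest)   # read_html(), html_nodes(), html_text()
library(stringr) # str_remove_all()
library(dplyr)   # tibble(), mutate(), filter(), row_number()

# Hypothetical markup: it imitates the selectors from the demo,
# it is NOT a copy of the real IMDB page.
html <- '
<html><body>
<table id="main">
  <tr>
    <td class="titleColumn">
      <a href="#">The Shawshank Redemption</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
    <td><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn">
      <a href="#">The Godfather</a>
      <span class="secondaryInfo">(1972)</span>
    </td>
    <td><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn">
      <a href="#">The Godfather: Part II</a>
      <span class="secondaryInfo">(1974)</span>
    </td>
    <td><strong>9</strong></td>
  </tr>
</table>
</body></html>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(html)

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove_all("[()]") %>% # drop both parentheses in one step
  as.numeric()

scores <- page %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()

# Assemble the pieces and add a rank based on row order
movies <- tibble(title = titles, year = years, score = scores) %>%
  mutate(rank = row_number())

# Example query: movies released before 1990
older <- movies %>%
  filter(year < 1990)
```

Working from a string keeps the example independent of network access and of changes to the live site; swapping `html` for a URL in `read_html()` is the only change needed to point the same pipeline at a real page, provided the selectors still match.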