class: center, middle, inverse, title-slide

# Web scraping
### Dr. Çetinkaya-Rundel
### 2018-04-08

---

## Announcements

- HW 6 will be posted tonight

---

class: center, middle

# Scraping the web

---

## Scraping the web: what? why?

- An increasing amount of data is available on the web.

- These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors.

- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.

- Two different scenarios:
    - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

- Why R? It includes all the tools necessary for web scraping, you are already familiar with it, and you can analyze the data directly. But Python, Perl, and Java are also efficient tools.

---

class: center, middle

# Web Scraping with rvest

---

## Hypertext Markup Language

Most of the data on the web is still largely available as HTML. While it is structured (hierarchical / tree based), it often is not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

## rvest

`rvest` is a package that makes basic processing and manipulation of HTML data straightforward.

--

Core functions:

* `read_html` - read HTML data from a url or character string.
* `html_nodes` - select specified nodes from the HTML document using CSS selectors.
* `html_table` - parse an HTML table into a data frame.
* `html_text` - extract tag pairs' content.
* `html_name` - extract tags' names.
* `html_attrs` - extract all of each tag's attributes.
* `html_attr` - extract tags' attribute value by name.
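
As a minimal sketch of how these functions fit together, the small "Hello world!" HTML document shown earlier can be parsed directly from a character string, so no network access is needed. The object names `doc` and `p_nodes` are illustrative assumptions, not part of the original slides.

```r
library(rvest)

# Parse the small HTML document, supplied as a character string
doc <- read_html('<html>
  <head><title>This is a title</title></head>
  <body><p align="center">Hello world!</p></body>
</html>')

# Select every <p> node with a CSS selector
p_nodes <- doc %>% html_nodes("p")

html_name(p_nodes)           # tag name of each selected node
html_text(p_nodes)           # text between the opening and closing tag
html_attr(p_nodes, "align")  # value of a single attribute, by name
html_attrs(p_nodes)          # all attributes of each selected node
```

Here `html_text()` returns the content of the paragraph and `html_attr()` returns the value of its `align` attribute; on a real page the only change would be passing a URL to `read_html()`.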
---

## SelectorGadget

- SelectorGadget: Open source tool that eases CSS selector generation and discovery

- Install the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

- A box will open in the bottom right of the website. Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.

- Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.

```r
vignette("selectorgadget")
```

---

class: center, middle

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code and look for the `table` tag:
<br>
http://www.imdb.com/chart/top

---

## First check to make sure you're allowed!

```r
# install.packages("robotstxt")
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## www.imdb.com
```

```
## [1] TRUE
```

vs. e.g.

```r
paths_allowed("http://www.facebook.com")
```

```
## Warning: The implementation of future_lapply() in the 'future' package has
## been deprecated. Please use the one in the 'future.apply' package instead.
```

```
## www.facebook.com
```

```
## [1] FALSE
```

---

## Demo

Go to [rstudio.cloud](https://rstudio.cloud/) → Web scraping → Make a copy → `scrape-250.R`

---

## Select and format pieces

.small[
```r
library(rvest)
library(tidyverse) # for str_replace() and tibble()

page <- read_html("http://www.imdb.com/chart/top")

titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()

scores <- page %>%
  html_nodes("#main strong") %>%
  html_text() %>%
  as.numeric()

imdb_top_250 <- tibble(
  title = titles,
  year = years,
  score = scores
)
```
]

---

<table>
<thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.2 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;">
1994 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> </tr>
<tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> </tr>
<tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr>
</tbody>
</table>

---

## Clean up / enhance

This may or may not be a lot of work, depending on how messy the data are.

- See if you like what you got:

```r
glimpse(imdb_top_250)
```

```
## Observations: 250
## Variables: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfat...
## $ year  <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 19...
## $ score <dbl> 9.2, 9.2, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8...
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(
    rank = 1:nrow(imdb_top_250)
  )
```

---

<table>
<thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> <th style="text-align:left;"> rank </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 8 </td> </tr>
<tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 9
</td> </tr>
<tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 11 </td> </tr>
<tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 13 </td> </tr>
<tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 14 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 15 </td> </tr>
<tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr>
</tbody>
</table>

---

## CSS selectors

We will be using a tool called SelectorGadget to help us identify the HTML elements of interest. It does this by constructing a CSS selector, which can be used to subset the HTML document.

Selector          |  Example         | Description
------------      |------------------| ------------------------------------------------
element           |  `p`             | Select all `<p>` elements
element element   |  `div p`         | Select all `<p>` elements inside a `<div>` element
element>element   |  `div > p`       | Select all `<p>` elements with `<div>` as a parent
.class            |  `.title`        | Select all elements with class="title"
\#id              |  `#name`         | Select all elements with id="name"
[attribute]       |  `[class]`       | Select all elements with a class attribute
[attribute=value] |  `[class=title]` | Select all elements with class="title"

---

## Analyze

.question[
How would you go about answering this question: Which 1995 movies made the list?
]

---

```r
imdb_top_250 %>%
  filter(year == 1995)
```

```
## # A tibble: 9 x 4
##   title               year score  rank
##   <chr>              <dbl> <dbl> <int>
## 1 Se7en              1995.  8.60    22
## 2 The Usual Suspects 1995.  8.60    26
## 3 Braveheart         1995.  8.30    74
## 4 Toy Story          1995.  8.30    92
## 5 Heat               1995.  8.20   122
## 6 Casino             1995.  8.20   144
## 7 Before Sunrise     1995.  8.10   208
## 8 La Haine           1995.  8.00   230
## 9 Twelve Monkeys     1995.  8.00   243
```

---

## Analyze

.question[
How would you go about answering this question: Which years have the most movies on the list?
]

--

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1 1995.     9
## 2 1957.     7
## 3 1994.     6
## 4 2000.     6
## 5 2001.     6
```

---

## Visualize

.question[
How would you go about creating this visualization: Visualize the average yearly score for movies that made it on the top 250 list over time.
]

--

.small[
(Plot: average yearly score for movies on the top 250 list, over time.)
]

---

## Potential challenges

- Unreliable formatting at the source
- Data broken into many pages
- ...

Discussion: https://raleigh.craigslist.org/search/apa

---
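
## Sketch: data broken into many pages

One possible way to approach the "many pages" challenge is to write a function that scrapes a single page and then apply it to every page, stacking the results. The sketch below is only an illustration under stated assumptions: the two "pages" are made-up HTML strings so that the code runs offline, and the `.listing` selector and the names `pages`, `scrape_page`, and `all_listings` are hypothetical. On a real site, `pages` would hold one URL per results page, and the right selector would have to be found with SelectorGadget.

```r
library(rvest)
library(tibble)
library(dplyr)

# Two stand-in "pages" given as HTML strings (assumption: made-up content).
# In a real scrape these would be URLs, one per page of results.
pages <- c(
  '<html><body><p class="listing">Apartment A</p><p class="listing">Apartment B</p></body></html>',
  '<html><body><p class="listing">Apartment C</p></body></html>'
)

# Scrape one page into a tibble; ".listing" is a hypothetical selector.
scrape_page <- function(page_source) {
  page <- read_html(page_source)
  tibble(
    listing = page %>% html_nodes(".listing") %>% html_text()
  )
}

# Apply the same function to every page and stack the per-page tibbles.
all_listings <- pages %>%
  lapply(scrape_page) %>%
  bind_rows()
```

When the pages are real URLs, it is considerate to pause between requests (for example with `Sys.sleep()`) and to check `paths_allowed()` first.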