class: center, middle, inverse, title-slide

# Web scraping
### Dr. Çetinkaya-Rundel
### November 16, 2017

---

class: center, middle

# Scraping the web

---

## Scraping the web: what? why?

- An increasing amount of data is available on the web.

- These data are provided in an unstructured format: you can always copy & paste, but it's time-consuming and prone to errors.

- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.

- Two different scenarios:
    - Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
    - Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

- Why R? It includes all the tools necessary to do web scraping, it is familiar, and the data can be analyzed directly... But Python, Perl, and Java are also efficient tools.

---

class: center, middle

# Web Scraping with rvest

---

## Hypertext Markup Language

Most of the data on the web is still largely available as HTML - while it is structured (hierarchical / tree based), it often is not available in a form useful for analysis (flat / tidy).

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---

## rvest {.smaller}

`rvest` is a package from Hadley Wickham that makes basic processing and manipulation of HTML data straightforward.

<br/>

Core functions:

* `read_html` - read HTML data from a url or character string.

* `html_nodes` - select specified nodes from the HTML document using CSS selectors.

* `html_table` - parse an HTML table into a data frame.

* `html_text` - extract tag pairs' content.

* `html_name` - extract tags' names.

* `html_attrs` - extract all of each tag's attributes.

* `html_attr` - extract tags' attribute value by name.

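---

## rvest: a quick example

A minimal sketch of the core functions, applied to the small HTML document from the earlier slide and passed to `read_html` as a character string. The object names `page` and `p` are only illustrative.

```r
library(rvest)

# Parse HTML from a character string instead of a URL
page <- read_html('<html>
  <head><title>This is a title</title></head>
  <body><p align="center">Hello world!</p></body>
</html>')

# Select all <p> nodes with a CSS selector
p <- page %>% html_nodes("p")

html_name(p)           # tag name: "p"
html_text(p)           # content between the tags: "Hello world!"
html_attr(p, "align")  # value of one attribute: "center"
html_attrs(p)          # all attributes, one named character vector per node
```

The same calls work unchanged when `read_html` is given a URL.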
---

## css selectors

We will be using a tool called SelectorGadget to help us identify the html elements of interest - it does this by constructing a css selector which can be used to subset the html document.

Selector          |  Example         | Description
------------      |------------------| ------------------------------------------------
element           |  `p`             | Select all &lt;p&gt; elements
element element   |  `div p`         | Select all &lt;p&gt; elements inside a &lt;div&gt; element
element>element   |  `div > p`       | Select all &lt;p&gt; elements with &lt;div&gt; as a parent
.class            |  `.title`        | Select all elements with class="title"
\#id              |  `#name`         | Select all elements with id="name"
[attribute]       |  `[class]`       | Select all elements with a class attribute
[attribute=value] |  `[class=title]` | Select all elements with class="title"

---

## SelectorGadget

- SelectorGadget: Open source tool that eases CSS selector generation and discovery

- Install the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

- A box will open in the bottom right of the website. Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.

- Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.

```r
vignette("selectorgadget")
```

---

class: center, middle

# Top 250 movies on IMDB

---

## Top 250 movies on IMDB

Take a look at the source code, look for the `table` tag:
<br>
http://www.imdb.com/chart/top

![imdb_top](img/21/imdb_top_250.png)

---

## First check to make sure you're allowed!

```r
# install.packages("robotstxt")
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
## www.imdb.com
```

```
## [1] TRUE
```

---

## Select and format pieces

.small[
```r
library(rvest)
library(tidyverse) # str_replace(), tibble(), and the dplyr / ggplot2 functions used later

page <- read_html("http://www.imdb.com/chart/top")

# movie titles are the link text in the title column
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

# years are shown as "(1994)": strip the parentheses, then convert
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()

# scores are the bolded numbers
scores <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()

imdb_top_250 <- tibble(
  title = titles,
  year = years,
  score = scores
)
```
]

---

<table>
<thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.2 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> </tr>
<tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> </tr>
<tr> <td 
style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> </tr>
<tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> </tr>
<tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> </tr>
<tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr>
</tbody>
</table>

---

## Clean up / enhance

May or may not be a lot of work depending on how messy the data are

- See if you like what you got:

```r
str(imdb_top_250)
```

```
## Classes 'tbl_df', 'tbl' and 'data.frame':	250 obs. of  3 variables:
##  $ title: chr  "The Shawshank Redemption" "The Godfather" "The Godfather: Part II" "The Dark Knight" ...
##  $ year : num  1994 1972 1974 2008 1957 ...
##  $ score: num  9.2 9.2 9 9 8.9 8.9 8.9 8.9 8.8 8.8 ...
```

- Add a variable for rank

```r
imdb_top_250 <- imdb_top_250 %>%
  mutate(
    rank = 1:nrow(imdb_top_250)
  )
```

---

<table>
<thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> <th style="text-align:left;"> rank </th> </tr> </thead>
<tbody>
<tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 4 </td> </tr>
<tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 7 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 8 </td> </tr>
<tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 9 
</td> </tr>
<tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 10 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 11 </td> </tr>
<tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 12 </td> </tr>
<tr> <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 13 </td> </tr>
<tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 14 </td> </tr>
<tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 15 </td> </tr>
<tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr>
</tbody>
</table>

---

## Analyze

<div id="question">
How would you go about answering this question: Which 1995 movies made the list?
</div>

---

```r
imdb_top_250 %>%
  filter(year == 1995)
```

```
## # A tibble: 9 x 4
##   title               year score  rank
##   <chr>              <dbl> <dbl> <int>
## 1 Se7en               1995   8.6    22
## 2 The Usual Suspects  1995   8.6    26
## 3 Braveheart          1995   8.3    76
## 4 Toy Story           1995   8.3    93
## 5 Heat                1995   8.2   124
## 6 Casino              1995   8.2   146
## 7 Before Sunrise      1995   8.1   214
## 8 La Haine            1995   8.0   235
## 9 Twelve Monkeys      1995   8.0   239
```

---

## Analyze

<div id="question">
How would you go about answering this question: Which years have the most movies on the list?
</div>

--

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     9
## 2  1957     7
## 3  2003     7
## 4  2000     6
## 5  2001     6
```

---

## Visualize

<div id="question">
How would you go about creating this visualization: Visualize the average yearly score for movies that made it onto the top 250 list over time.
</div>

--

.small[
```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = year)) +
    geom_point() +
    geom_smooth(method = "lm") +
    xlab("year")
```

![](21-deck_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
]

---

## Potential challenges

- Unreliable formatting at the source

- Data broken into many pages

- ...

Discussion: https://raleigh.craigslist.org/search/apa

---

class: center, middle

# Case study

---

## Case study: Popular TV shows

- Scrape the list of most popular TV shows on IMDB: http://www.imdb.com/chart/tvmeter

- Examine the subpage of each of the *first three* (or however many you can get through) TV shows to also obtain
    - How many episodes so far
    - Certificate
    - First five plot keywords
    - Genres
    - Runtime
    - Country
    - Language
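
---

## Case study: a possible starting point

One way to get from a chart page to its subpages is to pull the `href` attribute of each link and turn it into a full URL. The sketch below runs on a made-up HTML fragment rather than the live IMDB page: the markup, the `.show` class, and the `tt000000x` ids are all invented for illustration, and the real page will need selectors found with SelectorGadget.

```r
library(rvest)

# Stand-in for the chart page (hypothetical markup, not IMDB's actual HTML)
chart <- read_html('<table>
  <tr><td class="show"><a href="/title/tt0000001/">Show A</a></td></tr>
  <tr><td class="show"><a href="/title/tt0000002/">Show B</a></td></tr>
  <tr><td class="show"><a href="/title/tt0000003/">Show C</a></td></tr>
  <tr><td class="show"><a href="/title/tt0000004/">Show D</a></td></tr>
</table>')

# One <a> node per show
links <- chart %>% html_nodes(".show a")

# Link text gives the title; the relative href is prefixed with the site root
shows <- data.frame(
  title = html_text(links),
  url   = paste0("http://www.imdb.com", html_attr(links, "href")),
  stringsAsFactors = FALSE
)

# Keep only the first three shows, as the exercise asks
first_three <- head(shows, 3)
first_three
```

On the real site, each URL in `first_three$url` would then be passed to `read_html()` in turn, ideally with a pause such as `Sys.sleep(1)` between requests.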