I have scraped the schedule for Duke men’s basketball from the goduke statsgeek website. The resulting data frame is not well formatted and needs a lot of TLC to be useful for any statistical or data science use case. You can load the data frame into R using the following code:
load(url("http://stat.duke.edu/~cr173/Sta112_Fa16/data/duke_sched.Rdata"))
To start, you are best off examining duke_sched using RStudio’s viewer to get a sense of the data.
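If you prefer the console to the viewer, a few base R calls give a similar first look. The sketch below runs on a small made-up data frame, `toy_sched`, whose column names and values are placeholders and not the real contents of duke_sched; swap in duke_sched once it is loaded.

```r
# Hypothetical stand-in for the scraped data; the real duke_sched
# may have different columns and values.
toy_sched <- data.frame(
  V1 = c("Date", "11/11/16", "12/31/16"),
  V2 = c("Opponent", "vs. Alpha State", "at #7 Beta Tech"),
  stringsAsFactors = FALSE
)

str(toy_sched)    # column types and a preview of each column
names(toy_sched)  # current (uninformative) column names
head(toy_sched)   # first few rows, including any stray header rows

# View() opens RStudio's viewer; only call it in an interactive session.
if (interactive()) View(toy_sched)
```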
Clean up duke_sched as best you can using stringr, dplyr, and any other tools you are familiar with such that:
duke_sched has meaningful column names
Add a column indicating if a game is a home game or an away game
Opponents are consistently formatted with only the school’s name
For statistics yet to be observed, make sure they are recorded as NAs
Report the city, state and venue in separate columns
Report Duke’s score and their opponent’s score as separate numeric columns
Report attendance as a numeric value
Remove any extraneous rows and/or columns
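As a rough illustration of how the requirements above might fit together, here is a minimal sketch using dplyr and stringr on a made-up data frame. Everything about `raw` is an assumption: the column names `V1` to `V5`, the "vs."/"at" opponent prefixes, the "City, State (Venue)" location layout, the "W 94-49" result layout with Duke's score first, and "-" for missing attendance. The real duke_sched may look quite different, so treat the patterns as a starting point and not as a solution.

```r
library(dplyr)
library(stringr)

# Hypothetical raw layout with fictional opponents; not the real scraped data.
raw <- data.frame(
  V1 = c("Date", "11/11/16", "11/15/16", "12/31/16"),
  V2 = c("Opponent", "vs. Alpha State", "at #7 Beta Tech", "vs. Gamma College"),
  V3 = c("Location",
         "Durham, N.C. (Cameron Indoor Stadium)",
         "Beta City, Kan. (Beta Arena)",
         "Durham, N.C. (Cameron Indoor Stadium)"),
  V4 = c("Result", "W 94-49", "L 75-77", ""),
  V5 = c("Attend", "9,314", "19,812", "-"),
  stringsAsFactors = FALSE
)

clean <- raw %>%
  filter(V1 != "Date") %>%  # drop header rows that were scraped as data
  transmute(
    date      = as.Date(V1, format = "%m/%d/%y"),
    # Assumes a "vs." prefix marks a home game; neutral sites are not handled.
    home_away = if_else(str_detect(V2, "^vs"), "home", "away"),
    # Strip the "vs."/"at" prefix, then any "#7"-style ranking.
    opponent  = V2 %>%
      str_remove("^(vs\\.?|at)\\s+") %>%
      str_remove("^#\\d+\\s+"),
    city      = str_trim(str_match(V3, "^([^,]+),")[, 2]),
    state     = str_trim(str_match(V3, ",\\s*([^(]+)\\(")[, 2]),
    venue     = str_match(V3, "\\(([^)]+)\\)")[, 2],
    # Unplayed games have no score, so str_match() yields NA for them.
    duke_score = as.numeric(str_match(V4, "(\\d+)-(\\d+)")[, 2]),
    opp_score  = as.numeric(str_match(V4, "(\\d+)-(\\d+)")[, 3]),
    # Remove thousands separators and turn the "-" placeholder into NA.
    attendance = as.numeric(na_if(str_remove_all(V5, ","), "-"))
  )

clean
```

Using transmute() keeps only the columns that are explicitly built, which also takes care of dropping extraneous columns.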
All of your code should be reproducible, such that if I went back later in the season and updated the scraped results, you would still be able to produce a clean and updated data frame at the end without revising your code.
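One way to keep the cleaning reproducible is to select rows and values by pattern and never by hard-coded row numbers, and to wrap the steps in a function. The sketch below is only an illustration: `clean_attendance()` and the `date` and `attendance` column names are hypothetical, as are the two toy data frames standing in for an early and a later scrape.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: works on any scrape that follows the assumed layout,
# because nothing depends on specific row positions.
clean_attendance <- function(sched) {
  sched %>%
    filter(!str_detect(date, "^Date$")) %>%  # drop header rows by content
    mutate(attendance = as.numeric(na_if(str_remove_all(attendance, ","), "-")))
}

# Toy early-season scrape: the last game has not been played yet.
early <- data.frame(
  date       = c("Date", "11/11/16", "12/31/16"),
  attendance = c("Attend", "9,314", "-"),
  stringsAsFactors = FALSE
)

# Toy later scrape: same layout, but the last game now has a value.
late <- early
late$attendance[3] <- "9,100"

clean_attendance(early)  # attendance for the unplayed game is NA
clean_attendance(late)   # same code, updated result
```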
Your submission should be an R Markdown file in your team App Ex repo, in a folder called AppEx_11_28_2016.Rmd.
Parts 1 and 2 are due Tuesday, Dec 5th, 5pm