class: center, middle, inverse, title-slide

# Introduction
## Intro to Data Science
### Shawn Santo
### 01-09-20

---

## What is Data Science?

>"Data science is a concept to unify statistics, data analysis, machine learning and their related methods in order to understand and analyze actual phenomena with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science."

<br/><br/><br/>

[Wikipedia](https://en.wikipedia.org/wiki/Data_science)

---

## Who am I?

- Shawn Santo

- [shawn.santo@duke.edu](mailto:shawn.santo@duke.edu)

- Office hours
    - Monday 9:00 – 10:00am, 207A Old Chemistry
    - Thursday 12:30 – 1:30pm, 207A Old Chemistry

---

## Who else is involved?

.pull-left[
- Salvador Arellano
    - [salvador.chavero.arellano@duke.edu](mailto:salvador.chavero.arellano@duke.edu)
    - Office hours: Wed 11:30am - 1:30pm, 203B Old Chemistry

- Max Bartlett
    - [maxwell.bartlett@duke.edu](mailto:maxwell.bartlett@duke.edu)
    - Office hours: Sun 12:00pm - 2:00pm, 203B Old Chemistry

- Kate Chen
    - [kathleen.chen@duke.edu](mailto:kathleen.chen@duke.edu)
    - Office hours: Tue and Thu 1:30pm - 2:30pm, 203B Old Chemistry
]

.pull-right[
- Bin Han
    - [bin.han@duke.edu](mailto:bin.han@duke.edu)
    - Office hours: Fri 3:00pm - 5:00pm, 025 Old Chemistry

- Frances Hung
    - [frances.hung@duke.edu](mailto:frances.hung@duke.edu)
    - Office hours: Tue 11:30am - 1:30pm, 025 Old Chemistry

- Becky Tang
    - [becky.tang@duke.edu](mailto:becky.tang@duke.edu)
    - Office hours: Mon and Thu 4:30pm - 5:30pm, 203B Old Chemistry
]

<br/><br/>

.small-text[
Office hours will begin next week.
]

---

## Course objectives

- Learn to explore, visualize, and analyze data in a reproducible and shareable manner

- Gain experience in data wrangling and munging, exploratory data analysis, predictive modeling, and data visualization

- Work on problems and case studies inspired by and based on real-world questions and data

- Learn to effectively communicate results through written assignments and final project presentation

---

## Some of what you will learn

.pull-left[
- Fundamentals of R
- Data visualization with package `ggplot2`
- Package `tidyverse`
- Web scraping
- Web-based applications with RShiny
- Spatial data visualization
]

.pull-right[
- Data types and functions
- Version control with GitHub
- Reproducible reports with R Markdown
- Regression and classification
- Statistical inference
]

<br><br>

[Full course schedule](https://www2.stat.duke.edu/courses/Spring20/sta199.001/schedule.html)

---

## Why this class matters

.middle.center[

]

*Source:* https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html

---

## Why this class matters

### Some 2020 internships:

- <b>Mayo Clinic:</b> <i>Interns will work with statisticians, bioinformaticists, and clinical investigators on research projects in areas such as clinical trials, statistical genetics, and bioinformatics. Experience with SAS and/or R preferred.</i>

- <b>Netflix (Science and Analytics):</b> <i>Comfortable coding in at least one language (e.g., R, Python, Java, Scala, C++), experience preferred with version control (e.g., git), great communication skills, both oral and written.</i>

- <b>Two Sigma:</b> <i>Use the scientific method to develop sophisticated investment models and shape our insights into how the markets will behave. Create and test complex investment ideas and partner with our engineers to test your theories.
  You should possess the following qualifications: Demonstrate intermediate skills in at least one programming language, performed an in-depth research project, examining real-world data, are an independent thinker who can creatively approach data analysis and communicate complex ideas clearly.</i>

*Source:* https://stattrak.amstat.org/2019/12/01/2020-internship-listings/

---
class: middle, center, inverse

# Examples of data science

---

## Examples

- Billboard Hot 100 analytics
    - Analysis: [http://bit.ly/bh-100](http://bit.ly/bh-100)
    - [GitHub repository](https://github.com/RosebudAnwuri/TheArtandScienceofData/tree/master/The%20Making%20of%20Great%20Music) with data and R scripts

--

<br/><br/>

- Atlas of redistricting
    - Analysis: [http://bit.ly/redistrictnc](http://bit.ly/redistrictnc)

--

<br/><br/>

- A year as told by Fitbit
    - Analysis: [http://bit.ly/fitbit-year](http://bit.ly/fitbit-year)

---
class: middle, center, inverse

# Your Turn!

---

## Create a GitHub account

Go to https://github.com/, and create an account (unless you already have one).

Tips for creating a username from [Happy Git with R](http://happygitwithr.com/github-acct.html#username-advice).

- Incorporate your actual name!
- Reuse your username from other contexts if you can, e.g., Twitter or Slack.
- Pick a username you will be comfortable revealing to your future boss.
- Shorter is better than longer.
- Be as unique as possible in as few characters as possible.
- Make it timeless. Don’t highlight your current university, employer, or place of residence.
- Avoid words laden with special meaning in programming, like `NA`.

---

## UN Votes Analysis

1. Log in to RStudio Cloud (create an account if needed); an email was sent with the sign-up link.

2. Under Projects, click the drop-down arrow next to `New Project` and select `New Project from Git Repo`.

3. Paste the link https://github.com/sta199-sp20-001/appex00.git in the dialogue box in RStudio Cloud.

4. 
   In the Files pane in the bottom right corner, spot the file called `unvotes.Rmd`. Open it, and then click on the "Knit" button.

5. Go back to the file and put your name in the author field at the top of the file (in the `yaml` -- we will discuss what this is at a later date). Knit again.

6. Change the country names in parts 1 and 2 to countries that interest you. The spelling and capitalization must match what's in the data, so you can use the Appendix to see the correct spelling and capitalization. Knit again.

<br/><br/>

You have made your first data visualization!

---

## Discussion

Discuss the following with a partner.

1. Start by introducing yourself! Name, year, major/academic interest, favorite hobby.

2. Consider the plot in Part 1. Describe how the voting behaviors of the four countries have changed over time.

3. Consider the plot in Part 2.
    - On which issues have the two countries voted most similarly in recent years?
    - On which issues have they voted most differently in recent years?
    - Has this changed over time?

---
class: middle, center, inverse

# Course policies

---

## Class meetings

Lecture

- Focus on concepts behind data analysis
- Interactive lecture that includes examples and hands-on exercises
- Bring fully-charged laptop to every lecture
    - Please let me know as soon as possible if you do not have access to a laptop

--

<br/><br/>

Lab

- Focus on computing using R `tidyverse` syntax
- Apply concepts from lecture to case study scenarios
- Work on labs in teams of 3 - 4
- Bring fully-charged laptop to every lab

---

## Textbooks

- [OpenIntro Statistics, 4th Edition](https://www.openintro.org/stat/textbook.php?stat_book=os)
    - Free PDF available online. Hard copy available for purchase.
    - Assigned readings about statistical content

- [R for Data Science](http://r4ds.had.co.nz/)
    - Free online version. Hard copy available for purchase.
    - Assigned readings and resource for R coding using `tidyverse` syntax.
---

## Assessments

- **Homework**: Individual assignments combining conceptual and computational skills. *Lowest score dropped.*

--

- **Labs**: Individual and team assignments focusing on computational skills. *Lowest score will be dropped.*

--

- **Exams**: Two individual take-home exams.

--

- **Final project**: Team project presented during the final exam period, **Wednesday, April 29, 2:00 - 5:00pm**. You must complete the project and present in class to pass the course.

--

- **Application exercises**: Exercises usually started in class and completed in teams by the next class. Check/no check grade.

--

- **Teamwork**: Teams of 3-4 based on survey results. Consistent throughout the semester. Periodic peer evaluations.

---

## Teams

I will construct teams based on the results of the survey you complete at the end of class today.

Expectations:

- Each member must commit to giving equal effort.
- Each member must read, run, and understand all code in a final submission.
- Each member must honestly complete the intragroup peer evaluation.

---

## Grading

| Component                             | Weight |
|--------------------------------------:|:------:|
| Homework                              | 25%    |
| Labs                                  | 15%    |
| Exam 1                                | 17.5%  |
| Exam 2                                | 17.5%  |
| Final Project                         | 15%    |
| Participation & Application Exercises | 5%     |
| Teamwork                              | 5%     |

<br>
<br>

- If you have a cumulative numerical average of 90 - 100, you are guaranteed at least an A-, 80 - 89 at least a B-, and 70 - 79 at least a C-.

- You are expected to attend lectures and labs. Excessive absences or tardiness can impact your final course grade.

---

## Excused absences

- Students who miss a class due to a scheduled varsity trip, religious holiday, or short-term illness should fill out the respective form.

- These excused absences do not excuse you from assigned work.

--

<br/><br/>

- If you have a personal or family emergency or chronic health condition that affects your ability to participate in class, please contact your academic dean’s office.
--

<br/><br/>

- Exam dates cannot be changed and no make-up exams will be given.

---

## Late work and regrade requests

- Homework assignments:
    - Late but within 24 hours of deadline: 20% penalty
    - Not accepted if submitted any later

<br/><br/>

- Late work will not be accepted for the take-home exams or final project.

<br/><br/>

- Regrade requests must be submitted, using the link posted in the course syllabus, within **48 hours** of when the assignment is returned.

---

## Policies - sharing / reusing code

- Similar reproducible examples (reprex) exist online that will help you answer many of the questions posed on in-class work, homework assignments, and midterm exams. Use of these resources is allowed unless the assignment explicitly states otherwise.

- You must always cite any code you copy or use as inspiration. Copied code without citation is plagiarism and will result in a 0 for the assignment.

- Copying and citing a large amount of code to satisfy a main objective of an assignment will result in a 0 for the assignment.

- Discussion with other students and groups is allowed unless the assignment explicitly states otherwise.

- Carefully read each assignment so you know what is permitted and what is not. If you are ever unsure what is allowed, please ask me or one of the TAs.

---

## Academic Honesty

All work for this class should be done in accordance with the Duke Community Standard.

> To uphold the Duke Community Standard:
>
> - I will not lie, cheat, or steal in my academic endeavors;
> - I will conduct myself honorably in all my endeavors; and
> - I will act if the Standard is compromised.

Any violations will automatically result in a grade of 0 on the assignment and will be reported to the [Office of Student Conduct](https://studentaffairs.duke.edu/conduct) for further action.
---

## Getting help

- Post your content- and course-related questions on [Piazza](https://piazza.com/signup)

- Email me your grade-related questions

- Set up a meeting

---

## Links to bookmark

- Course page: https://www2.stat.duke.edu/courses/Spring20/sta199.001/

- GitHub organization: https://github.com/sta199-sp20-001

---

## Before next lecture

- Verify you can access Piazza (link through Sakai)

- Complete the assigned reading
    - The murky tale of Flint's deceptive water data

---

## References

- Nolan, D., & Temple Lang, D. (2010). Computing in the Statistics Curricula. The American Statistician, 64(2), 97-107. DOI: 10.1198/tast.2010.09132

- Piatetsky, G. (2019). Python leads the 11 top Data Science, Machine Learning platforms: Trends and Analysis. Kdnuggets.com. Retrieved 21 August 2019, from https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html

- Which Programming Language Should Data Scientists Learn First? (2019). Medium. https://towardsdatascience.com/which-programming-language-should-data-scientists-learn-first-aac4d3fd3038