Instructions

  1. There is a GitHub repo you have been invited to
    • located in the organization for this class
    • name contains HW 1 and your name

Clone this repo in your local directory on gort. (Remember, the address is http://gort.stat.duke.edu:8787/.)

  1. Edit the README.md to include some relevant information about the repository, commit, and push. (This is just to check everything is working fine, and you know what you’re doing.)

  2. Open a new R Markdown file, name it the same name as your repository, and save it.

  3. Include answers to all exercises in your R Markdown file. Your answers should always include any summary and/or plot you use to answer that particular question.

Diamonds

The diamonds dataset that we will use in this application exercise consists of prices and quality information for about 54,000 diamonds, and is included in the ggplot2 package.

Since you already installed the ggplot2 and dplyr packages last time, you don’t need to install them again. However, each time you launch R you need to load the packages:

library(ggplot2)
library(dplyr)

To familiarize yourself with the dataset you can view the help file associated with it, or open up the dataset in the data viewer. To do so, run the following commands in the Console.

?diamonds
View(diamonds)
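View opens the data in a separate viewer pane, so it is best suited to interactive use. If you would rather print a few rows directly in the Console or in your R Markdown output, head is one option. A minimal sketch, assuming ggplot2 is installed as described above:

```r
library(ggplot2)  # provides the diamonds dataset

# Print the first six rows of the dataset (the default for head)
head(diamonds)

# Print the first ten rows instead
head(diamonds, 10)
```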

Another function that you’ll find very useful for quickly taking a peek at a dataset is str. This function compactly displays the internal structure of an R object.

str(diamonds)
## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

The output above tells us that there are 53,940 observations and 10 variables in the dataset. The variable names are listed, along with their types and the first few observations of each variable. Note: R calls categorical variables factors.
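The same facts can also be checked one at a time, which is handy when you only need a single piece of information for an answer. The sketch below is one possible way to do this with base R functions, again assuming ggplot2 is installed:

```r
library(ggplot2)  # provides the diamonds dataset

# Number of observations (rows) and variables (columns)
dim(diamonds)

# Names of the variables
names(diamonds)

# cut is an ordered factor: list its levels from lowest to highest
levels(diamonds$cut)
is.ordered(diamonds$cut)

# Count how many diamonds fall in each cut category
table(diamonds$cut)
```

The counts from table should add up to the total number of observations reported by str.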

More about the dataset

The dataset contains information on the prices of diamonds (in 2008 US dollars), as well as various attributes of diamonds, some of which are known to influence their price: the 4 Cs (carat, cut, color, and clarity), along with some physical measurements (depth, table, x, y, and z). The figure below shows what these measurements represent.