App Ex 2: Diamond prices

The diamonds dataset that we will use in this application exercise consists of prices and quality information from about 54,000 diamonds, and is included in the ggplot2 package.

Since you already installed the ggplot2 package last time, you don’t need to install it again. However, each time you launch R you need to load the package:

library(ggplot2)

To familiarize yourself with the dataset you can view the help file associated with it, or open up the dataset in the data viewer.

?diamonds
View(diamonds)

Another function that you’ll find very useful for quickly taking a peek at a dataset is str. This function compactly displays the internal structure of an R object.

str(diamonds)

## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

The output above tells us that there are 53,940 observations and 10 variables in the dataset. The variable names are listed, along with their types and the first few observations of each variable. Note: R calls categorical variables factors.
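A few other base R functions give a similar quick overview. The sketch below is only one way to check the dimensions and variable types reported by str; the object names n_obs, n_vars, and var_types are made up for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of observations (rows) and variables (columns)
n_obs  <- nrow(diamonds)
n_vars <- ncol(diamonds)
c(observations = n_obs, variables = n_vars)

# Variable names and the first three rows
names(diamonds)
head(diamonds, 3)

# First class of each variable (ordered factors report "ordered")
var_types <- sapply(diamonds, function(v) class(v)[1])
var_types
```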

In the rest of the activity you will work with your teammates and answer the exercises using RMarkdown. You will submit one document per team, so you should identify one team member to be responsible for compiling this document. The others are strongly encouraged to also keep a running document as you work on the exercises as a team. Remember that some of the exercises might require R code, and R code goes in chunks in the markdown documents.

More about the dataset

The dataset contains information on prices of diamonds (in 2008 $s), as well as various attributes of diamonds, some of which are known to influence their price: the 4 Cs (carat, cut, color, and clarity), as well as some physical measurements (depth, table, x, y, and z). The figure below shows what these measurements represent.

diamond_measurements

Carat is a unit of mass equal to 200 mg and is used for measuring gemstones and pearls. Cut grade is an objective measure of a diamond’s light performance, or what we generally think of as sparkle.
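To make the unit concrete, the conversion from carats to grams can be sketched as follows. The helper name carat_to_grams is hypothetical and is not part of ggplot2 or base R.

```r
# Hypothetical helper: 1 carat = 200 mg = 0.2 g
carat_to_grams <- function(carat) {
  carat * 200 / 1000
}

carat_to_grams(1)             # a one-carat stone weighs 0.2 g
carat_to_grams(c(0.23, 0.5))  # works on whole vectors of carat values
```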

The figure below shows color grading of diamonds:

diamond_colors

Lastly, the figure below shows clarity grading of diamonds:

diamond_clarity

Exploring variables individually

In the next section we will explore and visualize the distributions of the variables in this dataset individually. This type of analysis is also called univariate analysis.

First we should say a few words about classifying variables:

Numerical variables can be classified as continuous or discrete based on whether the variable can take on any value within a range or only distinct, countable values (typically non-negative whole numbers), respectively.
If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.
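R can report how a variable is stored, which is a useful starting point for classifying it, although the storage type does not settle the question on its own. A minimal sketch, where shape is a made-up unordered factor used only for contrast:

```r
library(ggplot2)  # provides the diamonds dataset

# carat is stored as a number
carat_is_numeric <- is.numeric(diamonds$carat)   # TRUE

# cut is a factor whose levels have a natural ordering
cut_is_factor  <- is.factor(diamonds$cut)        # TRUE
cut_is_ordered <- is.ordered(diamonds$cut)       # TRUE

# Hypothetical categorical variable with no natural ordering
shape <- factor(c("round", "oval", "round"))
shape_is_ordered <- is.ordered(shape)            # FALSE
```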

Exploring numerical data

When describing numerical distributions we highlight:

shape:
- right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
- unimodal, bimodal, multimodal, uniform
center: mean (mean), median (median), mode (not always useful)
spread: range (range), standard deviation (sd), inter-quartile range (IQR)
unusual observations
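The functions named in the list above can be tried on any numerical variable. The sketch below uses depth as an arbitrary choice. Note that range returns the minimum and maximum rather than a single number, so diff is applied to get the width.

```r
library(ggplot2)  # provides the diamonds dataset

x <- diamonds$depth

# Measures of center
center <- c(mean = mean(x), median = median(x))
center

# Measures of spread
rng <- range(x)           # minimum and maximum
rng_width <- diff(rng)    # maximum minus minimum
spread <- c(sd = sd(x), IQR = IQR(x), range_width = rng_width)
spread
```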

What type of variable is price? Would you expect its distribution to be symmetric, right-skewed, or left-skewed? Why?

The qplot function we learned about last time is also useful for visualizing distributions of single variables. For a numerical variable a histogram is a useful visual representation.

qplot(price, data = diamonds)

Does the shape of the distribution match your expectation?

Note that along with the plot, R printed out a warning for you:

stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Try out a few other bin widths by adding the binwidth argument to the plot, until you find a bin width that’s too narrow and one that is too wide to produce a meaningful histogram.

Earlier we learned about aesthetics and facets, which were attributes mapped to certain variables in the dataset. Unlike these, the argument binwidth is simply a parameter input that controls the appearance of the graph, but does not map appearance to data. Most parameters come with a default value, and different geoms use different aesthetics and parameters.
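The difference between an aesthetic and a parameter is easier to see in the full ggplot() syntax, where mappings go inside aes() and parameters are passed to the geom. This matters in practice because qplot was deprecated in ggplot2 3.4.0, so recent versions print a deprecation warning when it is used. The sketch below is one possible equivalent of the histogram above. The bin width of 500 is an arbitrary illustrative value, not a recommendation.

```r
library(ggplot2)

# x = price is an aesthetic: it maps a variable to the x axis.
# binwidth = 500 is a parameter: it is set on the geom, outside aes().
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500)

p_hist  # printing the object draws the plot

# The computed bins can be inspected without drawing anything
bins <- ggplot_build(p_hist)$data[[1]]
sum(bins$count)  # every diamond falls in exactly one bin
```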

Visualize the other numerical variables in the dataset and discuss any interesting features.

When describing distributions of numerical variables we might also want to view statistics like mean, median, etc. (Functions for these are provided above; try a few.) Another useful function is summary:

summary(diamonds$price)

Exploring categorical data

When working with categorical variables, like cut, we would first want to know what the levels of that variable are.

levels(diamonds$cut)

What type of variable is color? Are all colors of diamonds (shown in the figure above) represented in the dataset? Apply the table function to the color variable. What does the resulting output show?
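As a hint for how table behaves, here is a minimal sketch applied to clarity instead of color, so the exercise above stays open. prop.table is a base R function that converts the counts to proportions.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of diamonds at each clarity level
clarity_counts <- table(diamonds$clarity)
clarity_counts

# The same information as proportions of the whole dataset
clarity_props <- prop.table(clarity_counts)
round(clarity_props, 3)
```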

A useful representation for categorical variables is a bar plot.

qplot(color, data = diamonds)

Note that the qplot function decides on the default visualization (histogram or bar plot) depending on the type of the variable (numerical or categorical, respectively). We could, if we wanted, specifically ask for a certain geom as opposed to relying on the default.

qplot(price, data = diamonds, geom = "histogram")
qplot(color, data = diamonds, geom = "bar")

Make a bar plot of the distribution of cut, and describe its distribution.

Bivariate relationships

Now that we are familiar with the individual variables in the dataset, we can start evaluating relationships between them.

Adding another variable to a histogram

Let’s make a histogram of the depths of diamonds, with binwidth of 0.2%.

qplot(depth, data = diamonds, binwidth = 0.2)

For adding another variable (say, cut) to a visualization we can either use an aesthetic or a facet:

Using aesthetics: Use different colors to fill in for different cuts.

qplot(depth, data = diamonds, binwidth = 0.2, fill = cut)

Using facets: Split into different plots for different cuts.

qplot(depth, data = diamonds, binwidth = 0.2) +
  facet_wrap(~ cut)

Typical diamonds of which cut have the highest depth? On average, does depth increase or decrease as cut grade increases?

A simpler way of comparing depths across cuts would be to compare smoother distributions of depth, as opposed to the jagged histograms.

Using aesthetics: Use different line colors for different cuts.

qplot(depth, data = diamonds, geom = "freqpoly", color = cut, binwidth = 0.2)

Using facets: Split into different plots for different cuts.

qplot(depth, data = diamonds, geom = "freqpoly", color = cut, binwidth = 0.2) +
  facet_wrap(~ cut)

In the above frequency polygons we can see that peaks are higher for some cuts than others. What do these peaks indicate? Is this a useful piece of information for comparing average depth across cuts?

Instead of raw frequencies, we might want to focus on the data density.

qplot(depth, data = diamonds, geom = "density", color = cut)
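A density curve is scaled so that the area under it is 1, which is why groups of very different sizes become comparable. The sketch below checks this numerically for one cut using base R's density function. This is an assumption for illustration: its defaults may not match exactly what the density geom computes.

```r
library(ggplot2)  # provides the diamonds dataset

# Depths of the diamonds with an Ideal cut
ideal_depth <- diamonds$depth[diamonds$cut == "Ideal"]

# Kernel density estimate on an evenly spaced grid
d <- density(ideal_depth)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area  # close to 1, regardless of how many Ideal diamonds there are
```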

Compare the distribution of price for the different cuts. Does anything seem unusual? Describe.

More on plotting options

Position adjustments

Another parameter that can be passed to qplot is position. The options are

position = "identity": Don’t adjust position
position = "dodge": Adjust position by dodging overlaps to the side
position = "fill": Stack overlapping objects on top of one another, and standardise to have equal height
position = "stack": Stack overlapping objects on top of one another
position = "jitter": Jitter points to avoid overplotting
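In recent ggplot2 releases the position argument of qplot is, to our knowledge, deprecated, so it may be ignored with a warning. A sketch of the same idea in full ggplot() syntax, where position is passed to the geom, is shown below. The choice of color and cut is arbitrary.

```r
library(ggplot2)

# Bars of color, with each bar split by cut
base <- ggplot(diamonds, aes(x = color, fill = cut))

p_stack <- base + geom_bar(position = "stack")  # segments stacked
p_dodge <- base + geom_bar(position = "dodge")  # segments side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, rescaled to height 1

# Inspect the bar tops without drawing the plots
top_stack <- max(ggplot_build(p_stack)$data[[1]]$ymax)
top_fill  <- max(ggplot_build(p_fill)$data[[1]]$ymax)
c(stack = top_stack, fill = top_fill)
```

With "stack" the tallest bar reaches the count of the most common color, while with "fill" every bar reaches 1, so only the proportions within each color remain visible.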

Try to recreate the following plot:

Next, play around with the other position options to see how they behave.

Note that jittering applies to points, so it is most useful in scatterplots where many observations share the same coordinates. Let’s plot the relationship between price and carat for only diamonds with a Fair cut grading.

fair_diamonds = subset(diamonds, cut == "Fair")
qplot(carat, price, data = fair_diamonds)
qplot(carat, price, data = fair_diamonds, geom = "jitter")

Create a subset of diamonds that cost less than or equal to $1000 (<=), and plot the relationship between price and carat with and without jittering. Do the plots look the same or different? What does this mean?

Zooming

You can also adjust the limits of your axes to zoom in on certain portions of the distribution.

Run the two following pieces of code to see how this option works:

qplot(carat, price, data = diamonds)
qplot(carat, price, data = diamonds, ylim = c(0,2000), xlim = c(0,1))

What interesting feature is apparent in the above plot?
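Setting xlim and ylim this way changes the scale limits, so observations outside the limits are dropped before plotting, and R warns about the removed rows. An alternative, sketched below in full ggplot() syntax, is coord_cartesian, which zooms the view while keeping all the data. For a plain scatterplot the two look the same, but the difference matters once statistics such as smoothers are computed from the data.

```r
library(ggplot2)

# How many diamonds fall inside the zoom window?
in_window <- with(diamonds, carat <= 1 & price <= 2000)
n_in_window <- sum(in_window)
n_in_window

# Zoom in without discarding the points outside the window
p_zoom <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 2000))

# All observations are still part of the plot's data
n_plotted <- nrow(ggplot_build(p_zoom)$data[[1]])
n_plotted
```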

Further exploration

Explore the distribution of carat. What can you see? What might explain that pattern? What carat weights are most common? Make sure to experiment with bin width!

Look again at the relationship between price and carat. What do the data clustered in vertical lines in the plot tell us?

Write down two questions that you could answer with these data, and use appropriate visualizations and summary statistics to answer them. Make sure at least one of these requires using multiple variables (through aesthetics and/or facets) at once. Be creative!