The diamonds dataset that we will use in this application exercise consists of prices and quality information from about 54,000 diamonds, and is included in the ggplot2 package.
Since you already installed the ggplot2 package last time, you don’t need to install it again. However, each time you launch R you need to load the package:
library(ggplot2)
To familiarize yourself with the dataset you can view the help file associated with it, or open up the dataset in the data viewer.
?diamonds
View(diamonds)
Another function that you’ll find very useful for quickly taking a peek at a dataset is str. This function compactly displays the internal structure of an R object.
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
The output above tells us that there are 53,940 observations and 10 variables in the dataset. The variable names are listed, along with their types and the first few observations of each variable. Note: R calls categorical variables factors.
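A few other base R functions give a similarly quick overview. The sketch below is an optional addition to the exercise; it assumes ggplot2 is installed and uses only standard functions (dim, names, head, is.ordered).

```r
library(ggplot2)  # provides the diamonds dataset

dim(diamonds)        # number of rows, then number of columns
names(diamonds)      # variable names
head(diamonds, 3)    # first three observations

# cut is stored as an ordered factor, R's type for ordinal categorical data
is.ordered(diamonds$cut)
```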
In the rest of the activity you will work with your teammates and answer the exercises using RMarkdown. You will submit one document per team, so you should identify one team member to be responsible for compiling this document. The others are strongly encouraged to also keep a running document as you work on the exercises as a team. Remember that some of the exercises might require R code, and R code goes in chunks in the markdown documents.
The dataset contains information on prices of diamonds, as well as various attributes of diamonds, some of which are known to influence their price (in 2008 $s): the 4 Cs (carat, cut, color, and clarity), as well as some physical measurements (depth, table, x, y, and z). The figure below shows what these measurements represent.
Carat is a unit of mass equal to 200 mg and is used for measuring gemstones and pearls. Cut grade is an objective measure of a diamond’s light performance, or what we generally think of as sparkle.
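Since one carat equals 200 mg, converting carat weights to grams is a single multiplication. The helper below, carat_to_grams, is a hypothetical name introduced only for illustration; it is not part of ggplot2 or the dataset.

```r
# Hypothetical helper (not part of ggplot2): convert carats to grams.
# 1 carat = 200 mg = 0.2 g
carat_to_grams <- function(carat) {
  carat * 0.2
}

carat_to_grams(c(0.5, 1, 2))
```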
The figure below shows color grading of diamonds:
Lastly, the figure below shows clarity grading of diamonds:
In the next section we will explore and visualize the distributions of the variables in this dataset individually. This type of analysis is also called univariate analysis.
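First we should say a few words about classifying variables and describing their distributions. Useful summary statistics for numerical variables include:
* Center: mean (mean), median (median), mode (not always useful)
* Spread: range (range), standard deviation (sd), inter-quartile range (IQR)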
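As a quick sketch of how these functions are called, the code below applies them to the price variable. Two caveats: range returns the minimum and maximum rather than their difference, and base R has no built-in function for the statistical mode, so the last line uses a common workaround (tabulate, then pick the most frequent level), shown here on cut.

```r
library(ggplot2)  # provides the diamonds dataset

price <- diamonds$price

# Measures of center
mean(price)
median(price)

# Measures of spread
range(price)         # minimum and maximum
diff(range(price))   # width of the range
sd(price)
IQR(price)

# Workaround for the mode: most frequent level of a categorical variable
names(which.max(table(diamonds$cut)))
```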
The qplot function we learned about last time is also useful for visualizing distributions of single variables. For a numerical variable a histogram is a useful visual representation.
qplot(price, data = diamonds)
Does the shape of the distribution match your expectation?
Note that along with the plot, R printed out a warning for you:
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Try different values for the binwidth argument to the plot, until you find a bin width that’s too narrow and one that is too wide to produce a meaningful histogram.
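To see where the default comes from, the rule quoted in the message (range/30) can be computed by hand. This is only a sketch: the value 500 passed to binwidth is an arbitrary starting point, and newer ggplot2 versions word the message differently.

```r
library(ggplot2)

# Default described in the message above: range of the variable divided by 30
default_binwidth <- diff(range(diamonds$price)) / 30
default_binwidth

# Supplying binwidth explicitly (500 is an arbitrary illustrative value)
p_hist <- qplot(price, data = diamonds, binwidth = 500)
p_hist
```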
Earlier we learned about aesthetics and facets, which were attributes mapped to certain variables in the dataset. Unlike these, the argument binwidth is simply a parameter input that controls the appearance of the graph, but does not map appearance to data. Most parameters come with a default value, and different geoms use different aesthetics and parameters.
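The distinction is easier to see in the longer ggplot() syntax from the same package, where mapped aesthetics go inside aes() and parameters are passed to the geom. This is a sketch using an arbitrary bin width of 500, not a required part of the exercise.

```r
library(ggplot2)

# fill = cut is an aesthetic: bar color is mapped to a variable in the data.
# binwidth = 500 is a parameter: one fixed value applied to the whole plot.
p_mapped <- ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_histogram(binwidth = 500)
p_mapped
```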
When describing distributions of numerical variables we might also want to view statistics like mean, median, etc. (Functions for these are provided above, try a few.) Another useful function is summary:
summary(diamonds$price)
When working with categorical variables, like cut, we would first want to know what the levels of that variable are.
levels(diamonds$cut)
Apply the table function to the color variable. What does the resulting output show?
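As a sketch of the same idea on a different variable, the code below tabulates clarity and then converts the counts to proportions with prop.table. The variable choice and the rounding to three digits are only illustrative.

```r
library(ggplot2)

# Number of diamonds at each clarity level
clarity_counts <- table(diamonds$clarity)
clarity_counts

# Proportions instead of counts
round(prop.table(clarity_counts), 3)
```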
A useful representation for categorical variables is a bar plot.
qplot(color, data = diamonds)
Note that the qplot function decides on the default visualization (histogram or bar plot) depending on the type of the variable (numerical or categorical, respectively). We could, if we wanted, specifically ask for a certain geom as opposed to relying on the default.
qplot(price, data = diamonds, geom = "histogram")
qplot(color, data = diamonds, geom = "bar")
Make a bar plot of cut, and describe its distribution.
Now that we are familiar with the individual variables in the dataset, we can start evaluating relationships between them.
Let’s make a histogram of the depths of diamonds, with a binwidth of 0.2%.
qplot(depth, data = diamonds, binwidth = 0.2)
For adding another variable (say, cut) to a visualization we can either use an aesthetic or a facet:
* Using aesthetics: Use different colors to fill in for different cuts.
qplot(depth, data = diamonds, binwidth = 0.2, fill = cut)
* Using facets: Split the plot into separate panels, one for each cut.
qplot(depth, data = diamonds, binwidth = 0.2) +
facet_wrap(~ cut)
A simpler way of comparing depths across cuts would be to compare smoother distributions of depth, as opposed to the jagged histograms.
* Using aesthetics: Use different line colors for different cuts.
qplot(depth, data = diamonds, geom = "freqpoly", color = cut, binwidth = 0.2)
* Using facets: Split the plot into separate panels, one for each cut.
qplot(depth, data = diamonds, geom = "freqpoly", color = cut, binwidth = 0.2) +
facet_wrap(~ cut)
Instead of raw frequencies, we might want to focus on the data density.
qplot(depth, data = diamonds, geom = "density", color = cut)
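A density curve is scaled so that the area under it is 1, which is what makes groups of very different sizes comparable. The sketch below checks this numerically with base R's density function; it assumes default settings, which may not match the smoothing used in the plot exactly.

```r
library(ggplot2)  # provides the diamonds dataset

# Kernel density estimate of depth using base R's density()
d <- density(diamonds$depth)

# Approximate the area under the curve with a Riemann sum on the
# evenly spaced grid; it should be close to 1 whatever the sample size
area <- sum(d$y) * diff(d$x[1:2])
area
```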
Another parameter that can be passed to qplot is position. The options are
* position = "identity": Don’t adjust position
* position = "dodge": Adjust position by dodging overlaps to the side
* position = "fill": Stack overlapping objects on top of one another, and standardise to have equal height
* position = "stack": Stack overlapping objects on top of one another
* position = "jitter": Jitter points to avoid overplotting
Try to recreate the following plot:
Next, play around with the other position options to see how they behave.
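In recent ggplot2 releases the position argument of qplot is documented as deprecated, so it may be ignored there. As a hedged sketch, the same adjustments can be requested through the geom in the longer ggplot() syntax. The variables used here (color on the x-axis, cut as fill) are only illustrative and are not meant to reproduce the plot above.

```r
library(ggplot2)

# Shared data and mappings for all three variants
base_plot <- ggplot(diamonds, aes(x = color, fill = cut))

p_stack <- base_plot + geom_bar(position = "stack")  # segments stacked in one bar
p_dodge <- base_plot + geom_bar(position = "dodge")  # segments placed side by side
p_fill  <- base_plot + geom_bar(position = "fill")   # stacked, every bar scaled to height 1

p_dodge
```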
Note that jittering only works for numerical variables, not categorical. Let’s plot the relationship between price and carat for only diamonds with a Fair cut grading.
fair_diamonds = subset(diamonds, cut == "Fair")
qplot(carat, price, data = fair_diamonds)
qplot(carat, price, data = fair_diamonds, geom = "jitter")
<=
), and plot the relationship between price and carat with and without jittering. Do the plots look the same or different? What does this mean?
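For reference, subset accepts any logical condition, including comparisons such as <=. The sketch below uses a hypothetical cutoff of $1000 on price, chosen only to illustrate the syntax; it is not the condition asked for in the exercise.

```r
library(ggplot2)

# Keep only diamonds priced at or below a hypothetical cutoff of $1000
cheap_diamonds <- subset(diamonds, price <= 1000)
nrow(cheap_diamonds)

# Conditions can be combined with & (and) or | (or)
cheap_fair <- subset(diamonds, price <= 1000 & cut == "Fair")
nrow(cheap_fair)
```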
You can also adjust the limits of your axes to zoom in on certain portions of the distribution.
Run the following two pieces of code to see how this option works:
qplot(carat, price, data = diamonds)
qplot(carat, price, data = diamonds, ylim = c(0,2000), xlim = c(0,1))
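Axis limits set this way generally drop the observations that fall outside the window, and ggplot2 reports how many rows were removed. The sketch below counts the diamonds inside the window and shows an alternative, coord_cartesian, which zooms without discarding data. For a scatterplot the two look alike; the difference matters once summaries such as smoothers are added.

```r
library(ggplot2)

# How many diamonds fall inside the zoom window used above?
in_window <- diamonds$carat <= 1 & diamonds$price <= 2000
sum(in_window)

# Zoom in without discarding the points outside the window
p_zoom <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 2000))
p_zoom
```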