The main dataset we'll be focusing on here is in the file
fish.asc. Here's a brief
description of the dataset:
A study was recently conducted in the Wacamaw and Lumber
Rivers to investigate mercury levels in the tissues of largemouth
bass. At several stations along each river, a
group of fish was caught, weighed, and measured. In addition,
a filet from each fish caught was sent to the lab so that
the tissue concentration of mercury could be determined for each fish.
Each fish caught corresponds to a single row of the file.
In order, the recorded information for each fish is:
river, station, length in cm, weight in grams, and
mercury concentration in parts per million.
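The commands that follow assume the data have been read into a dataframe named fish, with columns named river, station, length, weight and conc. Here's a minimal sketch of one way that could be done, assuming the file is whitespace-delimited with no header line (check your copy of fish.asc before relying on this). The three rows below are made up for illustration and are not values from the study; the name fish.toy is also just for this example.

```r
# Made-up rows for illustration only (NOT values from the study).
toy <- c("lumber 1 40 1000 1.0",
         "lumber 1 50 2000 2.0",
         "wacamaw 2 30 500 0.5")
tmp <- tempfile()
cat(toy, file = tmp, sep = "\n")

# Assumed layout: whitespace-delimited, no header line,
# columns in the order listed in the description above.
fish.toy <- read.table(tmp, header = FALSE,
                       col.names = c("river", "station", "length",
                                     "weight", "conc"))
unlink(tmp)
print(fish.toy)

# Under the same assumptions, the call for the real file would be:
# fish <- read.table("fish.asc", header = FALSE,
#                    col.names = c("river", "station", "length",
#                                  "weight", "conc"))
```

Naming the columns when the file is read is what lets the later commands refer to fish$conc, fish$river and so on.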
We're going to make some plots, so tell Splus to put 4 plots
on each page:
> par(mfrow=c(2,2))
Make a histogram of conc; this can be done with the command:
> hist(fish$conc)
The default labels are a bit ugly; you can supply your own axis
labels and title in the histogram call:
> hist(fish$conc,xlab="concentration ppm",ylab="count",
main="mercury concentrations")
The mercury concentrations are skewed to the right (as you'd
expect). A log transformation may help spread out the data
better. Create the log concentrations using
> fish$logconc <- log(fish$conc,base=10)
To see what the dataframe fish now looks like, type:
> fish
fish now has an additional column. Make a histogram of the
log concentrations:
> hist(fish$logconc,xlab="log concentration ppm",ylab="count",
main="log mercury concentrations")
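If you want to convince yourself of what the base-10 log transformation does, here's a small sketch on a few made-up concentrations (conc.toy and logconc.toy are names invented for this example). Each factor of 10 in ppm becomes a step of 1 on the log scale, and raising 10 to the log value takes you back to ppm.

```r
# Made-up concentrations in ppm, for illustration only.
conc.toy <- c(0.1, 1, 10)

# Same call as used for fish$logconc, applied to the toy values.
logconc.toy <- log(conc.toy, base = 10)
print(logconc.toy)

# Undo the transformation to get back to ppm.
back <- 10^logconc.toy
print(back)
```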
Let's investigate how mercury concentration varies with station
and river. We'll focus on the new variable logconc. Let's first look
at logconc by river. One thing to try is to plot the data values against
a numeric code for the two rivers. First make the variable rivernum
(1 for Wacamaw, 2 for Lumber), then plot logconc against it:
> rivernum <- ifelse(fish$river=="wacamaw",1,2)
> plot(rivernum,fish$logconc,ylab="log concentration ppm",xlab="river",
main="concentration by river")
Check out the function ifelse(); it can be very useful. To
get more info on the function, type
> ?ifelse
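As a quick illustration, ifelse(test, yes, no) works element by element: wherever test is true it returns the yes value, and otherwise the no value. A tiny sketch on a made-up vector (river.toy and rivernum.toy are names invented for this example):

```r
# Made-up river labels, for illustration only.
river.toy <- c("wacamaw", "lumber", "wacamaw")

# 1 where the label is "wacamaw", 2 everywhere else.
rivernum.toy <- ifelse(river.toy == "wacamaw", 1, 2)
print(rivernum.toy)   # 1 2 1
```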
This plot isn't so hot. Boxplots may give a better picture:
> boxplot(split(fish$logconc,fish$river),
ylab="log concentration ppm",xlab="river",
main="concentration by river")
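It helps to know what split() hands to boxplot(): a list with one component per group, each holding the data values for that group. boxplot() then draws one box per component. A small sketch on made-up values (the .toy names and the numbers are invented for this example):

```r
# Made-up log concentrations and river labels, for illustration only.
logconc.toy <- c(-0.2, 0.1, 0.3, -0.5)
river.toy <- c("lumber", "wacamaw", "lumber", "wacamaw")

# One list component per river, named after the river.
groups <- split(logconc.toy, river.toy)
print(groups)

# A numeric summary of each group: the median is the line
# drawn inside each box.
med <- sapply(groups, median)
print(med)
```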
Now let's look at logconc by station -- a plot and a boxplot:
> plot(fish$station,fish$logconc,ylab="log concentration ppm",
xlab="station", main="concentration by station")
> boxplot(split(fish$logconc,fish$station),
ylab="log concentration ppm",xlab="station",
main="concentration by station")
Let's investigate the relationship between length and weight
of the fish. Plot length vs. weight:
> plot(fish$length,fish$weight,xlab="length in cm",
ylab="weight in gm", main="fish size")
The plot nearly follows a quadratic: y = x^2 (does this
make sense?).
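One way to probe the "does this make sense?" question is to fit a straight line on the log-log scale: if weight = a * length^b, then log(weight) = log(a) + b * log(length), so the slope estimates the exponent b. Here's a small sketch on synthetic, noise-free data. The constants 0.02 and 2.5 are arbitrary choices for illustration, not estimates from the fish data, and the .toy names are invented for this example.

```r
# Synthetic data generated from weight = a * length^b.
length.toy <- c(20, 30, 40, 50, 60)
a.true <- 0.02   # arbitrary
b.true <- 2.5    # arbitrary
weight.toy <- a.true * length.toy^b.true

# Straight-line fit on the log-log scale; the slope estimates b.
fit <- lm(log(weight.toy) ~ log(length.toy))
b.hat <- as.numeric(coef(fit)[2])
print(b.hat)

# For the real data the analogous fit would be:
# lm(log(weight) ~ log(length), data = fish)
```

With real, noisy data the slope will only be an estimate, so compare it with 2 rather than expecting an exact match.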
Some points seem to lie away from the main point
cloud. Let's identify the stations from which these fish
were taken using the identify() function:
> par(mfrow=c(1,1)) # make the plot bigger
> plot(fish$length,fish$weight,xlab="length in cm",
ylab="weight in gm", main="fish size")
> identify(fish$length,fish$weight,fish$station)
Now use the left mouse button to click on the points in the plot
that you want to identify. When you are through, click
the right button. Note that the fish from station 2 all seem to
lie away from the rest of the observations.
We can make this same plot using station number as the plotting
symbol. This can be done a couple of ways. We'll
use text() after making the initial plot. Make the plot of
length by weight but specify type="n". This tells Splus
to set up the axes without drawing anything inside the plotting region.
> par(mfrow=c(2,2))
> plot(fish$length,fish$weight,xlab="length in cm",type="n",
ylab="weight in gm", main="fish size by station")
Next use text() to fill in the plotting symbols. text() only adds
labels to an existing plot; it does not start a new one. Type ?text
to find out more about the text() function.
> text(fish$length,fish$weight,fish$station)
Finally, a simple way to get a good overview of your data
is to use the pairs scatterplot matrix. This plots each
pair of variables against each other. To do this, enter
the command:
> pairs(fish)
You may only want to focus on a subset of the variables in
the dataframe fish. This can be done by indexing the
columns of fish. The command below gives pairwise scatterplots
for the variables length, weight and logconc - the 3rd, 4th
and 6th columns of the dataframe:
> pairs(fish[,c(3,4,6)])
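Column numbers are easy to get wrong if the dataframe changes, so you may prefer to pick the columns by name. Here's a sketch on a tiny made-up dataframe (fish.toy and its values are invented for this example) showing that the two ways of indexing pick out the same columns:

```r
# Tiny made-up dataframe with the same column layout as fish.
fish.toy <- data.frame(river = c("lumber", "wacamaw"),
                       station = c(1, 2),
                       length = c(40, 50),
                       weight = c(1000, 2000),
                       conc = c(1, 0.1))
fish.toy$logconc <- log(fish.toy$conc, base = 10)

# Select the same three columns by position and by name.
by.position <- fish.toy[, c(3, 4, 6)]
by.name <- fish.toy[, c("length", "weight", "logconc")]
print(names(by.position))

# For the real data the by-name version would be:
# pairs(fish[, c("length", "weight", "logconc")])
```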
Assignment:
Turn in the following plots. Print them out 4 plots to a page.
The scatterplot matrix plot should be printed on a single page.
All plots should have thoughtful axis labels and a title.
- histogram of mercury concentrations.
- histogram of the log of the mercury concentrations.
- side by side boxplots of log concentration for the two rivers.
- side by side boxplots of log concentration for each station.
- plot of length by weight with outlying points identified with
their station number.
- plot of length by weight using station number as plotting symbol.
- plot of weight by log conc using different plotting symbols for the
two rivers.
- plot of length by log conc using different plotting symbols for the
two rivers.
- scatterplot matrix of length, weight and log concentration.
You should only have to turn in 3 pages - the first two with 4 plots
on each, and one with the 3 by 3 scatterplot matrix. Good luck!