
STA113: Lab 1 (Friday, 27 August, 2004)


Starting Matlab


Download the data file exp1-11.dat for Example 1.11 on page 20 and save it as a file named "exp1-11.dat" in your home directory.
data = load('exp1-11.dat');  % load data
Summary Statistics
n = length(data)  % sample size               

mean(data)               
median(data)

range(data)                      
max(data) - min(data)
var(data)
sum((data-mean(data)).^2)/(n-1)
std(data)
sqrt(var(data))

prctile(data, 0:25:100)  % five-number summary

% (You might also like to see what happens if you just type 0:25:100 with no semicolon.)
% (Or you might try prctile(data,25).)
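
% A minimal sketch (not part of the original lab): the interquartile
% range can be read off the same percentiles.  The vector x is made up
% for illustration; replace it with data once the file is loaded.
x = [3 7 8 5 12 14 21 13 18]';   % hypothetical sample
q = prctile(x, [25 75]);         % first and third quartiles
iqr_manual = q(2) - q(1)         % spread of the middle half of the sample
iqr(x)                           % Statistics Toolbox function; should agree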
Histogram

MATLAB provides two histogram commands, "hist" and "histc". Read their help files to learn the difference between them.

hist(data,7);  % draw histogram with 7 classes

edg =[2 4 6 8 12 20 30];
% remove the semicolon if you wish to stop suppression of the
% output and see what the variable edg is that you've created.
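[counts,bin] = histc(data, edg);  % counts(k) = number of values in [edg(k), edg(k+1))
rfreq = counts/length(data);      % relative frequencies
width = [diff(edg),1]';           % class widths; the trailing 1 is a placeholder for the last histc bin
h = bar(edg, rfreq./width , 'histc'); % draw density histogram with classes defined by edg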
set(h, 'facecolor', 'green');  % change the color of the bins

% change axes labels and ticks 
set(gca, 'XTick', edg);
xlabel('Bond Strength')
ylabel('Density');
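
% A minimal sketch (not part of the original lab) checking that a
% density histogram encloses total area 1, assuming every observation
% falls inside the classes.  The sample x is made up for illustration;
% the class limits are the ones used above.
x = [2.5 3 5 7 7.5 9 11 15 25]';      % hypothetical sample
cls = [2 4 6 8 12 20 30];             % same class limits as edg
cnt = histc(x, cls);                  % last entry counts values equal to 30
w = diff(cls)';                       % class widths
dens = cnt(1:end-1)/length(x) ./ w;   % relative frequency per unit width
total_area = sum(dens .* w)           % should be 1, up to rounding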
Boxplot

We will draw a boxplot for the hair color data shown in class.

% read data where 'NaN' means missing data
pain= [ 62 60 71 55 48; 63 57 52 41 43; 42 50 41 37 NaN; 32 39 51 30 35];
pain = pain'  % transpose of the original matrix
boxplot(pain)
 
set(gca, 'XTicklabel', {'Light Blonde', 'Dark Blonde', 'Light Brunette', 'Dark Brunette'})
xlabel('Hair Color');
ylabel('Pain threshold score');
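
% A minimal sketch (not part of the original lab): column medians of a
% matrix that contains missing values.  The matrix scores is made up
% for illustration; replace scores with pain to get the medians that
% should match the center lines of the boxes.
scores = [1 4; 2 NaN; 3 6];   % hypothetical 3-by-2 matrix, one value missing
med = zeros(1, size(scores,2));
for k = 1:size(scores,2)
    col = scores(:,k);
    med(k) = median(col(~isnan(col)));   % drop NaN before taking the median
end
med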
Regression

Old Faithful is a geyser in Yellowstone National Park in Wyoming, USA. Because it is a major tourist attraction, being able to predict the timing and length of the next eruption would be useful. The Old Faithful dataset records the date of each observation, the duration of an eruption (in minutes), and the time between eruptions (also in minutes).

Use the oldfaith data to investigate the relationship between the variables DURATION and TIME. We are interested in whether the relationship is linear and, if so, how well the best-fitting straight line predicts the TIME between eruptions given the DURATION of the current eruption.

[date, duration, time]=textread('oldfaith.dat', '%d%f%d', 'headerlines',1);

% In the call above, the '%d%f%d' argument gives the data types of
% the three columns.  MATLAB reads the first and last columns as
% signed integer values and the middle column as a floating point
% value.  The 'headerlines' argument says that the data are preceded
% by one row of non-data (the column titles), which is to be
% ignored.  Type help textread for more information.

% The predictor variable is DURATION
% and the response variable is TIME.
% First make a scatter plot of TIME against DURATION
plot(duration, time, '.');

% Now find the least squares regression line

X = [ones(length(duration),1) duration]; 
% X needs to be a matrix with a leading  column of ones.

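[b,bint,r,rint,stats] = regress(time, X, 0.05);
% b : estimates for the regression coefficients
% bint : 95% confidence intervals for the coefficients
% r : residuals
% rint : 95% intervals for the residuals (useful for spotting outliers)
% stats(1) : R-square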

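b                                 % intercept b(1) and slope b(2)
stats(1)                          % R-square reported by regress
SSE = sum(r.^2);                  % error (residual) sum of squares
SST = var(time)*(length(time)-1); % total sum of squares
1-SSE/SST                         % R-square by hand; should equal stats(1)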

% Now add the regression line to the scatter plot.
x = [1.5, 5.5];
y = b(1)+ b(2).*x;
hold on;
plot(x , y, 'r-');
hold off;
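
% A minimal sketch (not part of the original lab): least squares via
% the backslash operator, then a prediction from the fitted line.  The
% vectors d and t are made up for illustration.  With the lab's
% variables, X \ time should reproduce b from regress.
d = [2 3 4 5]';                 % hypothetical durations, in minutes
t = 30 + 10*d;                  % hypothetical times lying exactly on a line
A = [ones(length(d),1) d];      % design matrix with a leading column of ones
coef = A \ t                    % least squares estimates: intercept, slope
new_duration = 3;               % hypothetical eruption length, in minutes
predicted_time = coef(1) + coef(2)*new_duration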

who
% The command 'who' lists all the variables you have defined in the
% current MATLAB session.  (Just a useful tip.)

Graded Assignment

  • Wildlife biologists can fairly accurately determine the length of an alligator from aerial photographs or from a boat. Determining the weight of an alligator from a distance is much more difficult. Wildlife biologists in Florida captured 25 alligators in order to collect data and develop a model for predicting weight from length. The data set alligator.txt contains the resulting 25 measurements. The first variable is the alligator's weight (in pounds?) and the second is its length (in inches?).

  • Make histograms of the alligators' lengths and weights. Also determine the mean and median length and weight of the alligators.

  • Try the following two linear regression models: (i) x = length and y = weight; (ii) x = log(length) and y = log(weight). For each model, draw a regression plot (data points with the fitted regression line) and a scatter plot of the residuals vs. x, and report the R-square value. Which model do you think is better? Why? Write an equation estimating weight explicitly as a function of length. If you used log(length) and log(weight) to determine your model, be sure your final model is an expression for weight, not for log(weight).

  • Write a Word document containing your graphs and your concise answers to all the questions above. Be sure your graphs have titles and clear axis labels and that the answers to the questions above are clearly stated. Be sure your name is on your paper along with your lab section. Try to keep it limited to one page, but if you must use two pages, staple them together. This is to be turned in on Friday, 10 September at the beginning of the lab section. You will have two lab periods to work on it: 27 August and 3 September. If you realize you will not have enough time to finish the assignment by then, you need to work on it outside of lab time.

    Getting Help from MatLab

    Print Graphics

    Leaving Matlab