Commonly Used Commands in Splus

A Summary of S-Plus Commands for Windows and Command Line

This page will be updated periodically, so be sure to "Reload" or "Refresh".

Please email sandra at stat.duke.edu with any corrections/updates.

Last modified: Sun Feb 22 13:48:48 EST 2004

Getting Started -- PLEASE READ
Importing Data
Summary Statistics
Calculating a correlation coefficient
Histograms
Boxplot
Confidence limits for a mean
Normal and t quantiles and probabilities
Format data for 2-sample t-test and fixed effects ANOVA
2-sample t-test and Confidence Interval for Difference of two means
One-way and Two-Way Fixed Effects ANOVA
Interaction plots for 2-way ANOVA
Subsetting Data
Remove a column/row of data
Calculating Power and Sample Size for a 2-sample t-test
Transformations of Data
Residuals vs. Fitted Plots for fixed effects ANOVA or Regression menus
Coding Residuals vs. Fitted Plots according to a third "Dummy" or "Indicator" variable
Normal Probability Plots
Multiple comparison procedures in ANOVA
Coded Scatter Plots Using Splus Menus. See also Coded Scatterplot with Fitted Regression Lines Superimposed (using Command Line)
Fit a linear regression model
Regression Line Plots
Plotting a fitted regression for log-transformed data on their original scale in Splus
Creating a script file in Splus
Plotting a fitted model for a nonlinear function on the original scale of the data (command line)
Plot of prediction intervals, confidence intervals for the mean response, and confidence intervals for the regression line
Polynomial Regression Models
Coding an indicator variable in Splus
Coded Scatterplot with Fitted Regression Lines Superimposed
Labeling points on a graph with text
Matrix of scatterplots for multiple variables.
Saving residuals from a regression model.
Saving and referring to components of a fitted model object.
ESS F-Tests
Influence diagnostics for regression: Cooks Distance, leverage, and externally studentized residuals

For each assignment/project, create a directory in your Z: drive where you will store all Splus work for that assignment. This is critical for those who use cluster computers. When you start Splus, Splus will give you a window to specify a "Path" to the directory you just created. If Splus does not give you that option at startup, go to "File" - "Chapters" - "Attach/Create Chapter" and under "Chapter Folder" browse to the new directory you created for your assignment. Make sure that "Position" is set to "1". (As the semester wears on, those who don't carefully set up separate directories for homework problems may find their files corrupted or Splus may not function properly.)

Importing Data
Save the dataset to your computer. Go to "File" and "Import Data" and "From File". Then select the directory that the file is in and select the format under "Files of type" by choosing "ASCII". Click on the column name, "V1", which is the default. Type in an appropriate variable name, like "concentration".

Summary Statistics

Go to "Statistics" - "Data Summaries" - "Summary Statistics". Choose the variable name you would like to summarize, for example, "concentration". Click on the tab called "Statistics" and check these boxes, "mean", "standard error", "confidence interval for mean", "1^st quartile", "median", "3^rd quartile".

Summary statistics, grouped by a coding variable

We'll look at summary statistics where "code" is used as a grouping variable. Go to Statistics - Data Summaries - Statistics, and under "Group Variables"," choose "code". Summary stats will be printed out for each group.

Histograms

Go to "Graph" - "2D Plot" - "Histogram". Under "X column", select the variable of interest.

Y axis of counts: Go to the "Options" tab and chose "Output type" "Counts"

Y axis of frequencies: For the second, go to the “Options” tab and chose “Output type” “Freq”. Make sure you understand what these two presentation methods say.

Add a title to your histogram by going to “Insert” “Text”.

Boxplot

For the boxplot, go to "Graph" - "2D Plot" - "Boxplot". Under "Y column", select the variable of interest.

Click on the y-axis label to add units of measurements.

Side by Side Boxplots

Go to Graph - 2D plot. On the left, "Axes type" is "linear". On the right, "Plot type" is "boxplot (x, grouping optional)". Click "OK". For a variable called "choles", under "Y Column" select "choles" and click "OK" to produce a boxplot of all the cholesterol measurements. To boxplot the cholesterol data grouped by "urban" or "rural", select "choles" under "Y column" and select "code" under "X column". Refine the axis labels in your plots to include a descriptive variable name and the units of measurement.

Confidence limits for a mean

Finding a confidence interval for the mean in Splus:

Go to “Statistics” - “Data Summaries” - “Summary Statistics”. Select your variable of interest under the section of the window “Data” and “Variables”. Now select the “Statistics” tab at the top of the pop-up window. Click the box next to “Conf. Limits for the Mean” and specify the “Conf Level”. Now click “Apply” or “OK” and the answer will be appended to the summary statistics.

Normal and t quantiles and probabilities

Please note that you should know how to use the tables in the back of Statistical Sleuth to look these up during exams.

Use the command line in Splus by going to “Window” - “Commands Window”. A blank window will open with a prompt like this “>”.

To find the area under a standard normal curve to the left of a number c, type into the command window:

> pnorm(c)

Example:

> pnorm(1) (area or probability under normal curve to the left of 1)

[1] 0.8413447

To find the value of z on a standard normal curve that corresponds to a given area p under a normal curve (a given probability p) type into the command window:

> qnorm(p)

Example: What’s the value of z that cuts off an area of .975 to the left?

> qnorm(.975)

[1] 1.959964

Similarly, for the t-distribution, suppose 19 degrees of freedom, then the commands are:

for quantiles:

> qt(.975,19)

[1] 2.093024 (useful as a multiplier in a 95% CI)

for the area under the t-distribution (probability) on 19 d.f. to the left of 1.67:

> pt(1.67,19)

[1] 0.944344

Format data for 2-sample t-test and fixed effects ANOVA

If your data are grouped by a variable called "code", change the "code" variable to a factor by going to Data and Change Data Type. Select "code" variable and under "New Type" choose "factor".

2-sample t-test and Confidence Interval for Difference of two means

You need to have your data arranged with measurements in one column and a "code" variable in the other column that will determine group membership. For the Guatemalan dataset, the variable names are "choles" and "code" (indicating urban or rural).

STATISTICS - COMPARE SAMPLES - TWO SAMPLE - T-TEST

Variable 1: choles

Variable 2: code

Check the box "variable is a grouping variable"

Assume equal variances

Mean under the null: 0

Alternative: two-sided (or one-sided)

A confidence interval and the two means are printed out.

One-way and Two-way Fixed Effects ANOVA

Go to Statistics and ANOVA and Fixed Effects. Dependent variable is your response or Y variable; and Independent variable is a code for treatment. Make sure you have changed this code to a FACTOR using "Data" "Change Data Type" "Factor". This will create the following formula in the formula window: "Y~code".

Two-way anova is similar except the formula is something like " Y~code1+code2".

For both types of models you may wish to print out group means by clicking on ""means" under the "RESULTS" tab in the ANOVA dialog box. Residual plots such as "Residuals vs. Fit" and "Normal QQ" can be found under "Plot" in the ANOVA dialog box.

Formula statements are as follows:

one-way: Y~code1
two-way, additive: Y~code1+code2
two-way, interaction: Y~code1*code2 or Y~code1+code2+code1:code2

Interaction plots for 2-way ANOVA. These plots summarize the mean of Factor A at each level of Factor B and are very useful graphical depictions of interaction effects. Go to "Statistics" and "Design" and "Interaction Plot". Choose the dependent and independent variables. Click "Both orderings for each pair" if you want to see 2 plots, one with Factor A on the x-axis, and one with Factor B on the x-axis.

Subsetting Data

For a dataset that is called "autofilter", with variables measuring emissions and car size (small=1, medium=2, large=3). Create a subset of the emissions data for small car size:

DATA - SUBSET

Dataset: autofilter

Columns in subset <ALL>

Subset Rows with: SIZE==1

Result type: Dataset

Save in: smallcars

Now a new spreadsheet window should pop up with the dataset called “smallcars”. You won’t need the middle column, which is called “SIZE”, and has entries all equal to 1, so you can delete it:

How to remove a column/row of data:

Highlight the column (or row)

DATA - REMOVE

Remove: column or you can choose "row"

Calculating Power and Sample Size for a 2-sample t-test

For both power and required sample size we need to specify the standard deviation, the mean under the null and the mean under the alternative that we would like to be able to detect.

When calculating power and minimum difference, Splus assumes that the standard deviation you provide is known for the population. This allows the use of the normal distribution for related calculations. Splus shouldn't be used to make these calculations when there are less than 10 in each sample, because the results won't be accurate.

Note that for the two-sample test for independent samples, we assume that the variances are equal and we use a pooled estimate of s. You'll need to calculate this pooled value and enter it when prompted.

For a hypothesis test in which the null says that the mean is zero, we want the power of the two sample test above to detect a difference in means of 0.5 grams at ?=0.05:

STATISTICS - POWER AND SAMPLE SIZE - NORMAL MEAN
Check the box for “Power”
Two-sample test
Alpha=.05
Specify N1 and N2
Sigma1 : <Enter your value of s pooled>
Sigma2 : Sigma1 <assume standard deviations are equal>
Null: Mean1: 0
Alternative: Mean2: <specify this value>
Test type: <may be two sided, greater than or less than>

Similarly, you can calculate sample size for a given power.

Transformations of Data
In Splus, you will create a new column for a log transformed version of a variable. Call the variable (on its original scale) X. Go to "Data" and "Transform". Give your the log transformed X a name in "Target Column". We'll call it log.X. In the box next to "Expression", type the desired transformation formula, in this case it is "log(X)". A new column is created in your dataframe called log.X with the new log transformed values in it.

Create logit transformed data by creating a new column in Splus. Do this by going to "Data" and "Transform". Give your new variable a name in "Target Column". I'll refer to it as "newvar". In the box next to "Expression", type "log((newvar)/(1-newvar))". A new column is created in your dataframe.

Residuals vs. Fitted Plots for fixed effects ANOVA or Regression menus
Under Statistics and Anova and Fixed Effects, enter dependent variable (Y) and independent variable (X, or "code"). Go to the "Plot" Tab in the ANOVA menu and click "Residuals vs. Fit". This will give a plot analogous to that on page 126 of Sleuth, Display 5.13. Note that under "Options", Splus will label your plot with the observation numbers for a number of extreme points that you specify.The default is for Splus to label the x-axis, "Fitted:X Variable". A better x-axis label (which you can add by clicking on the text in the figure and replacing it) is "Estimated ".

Coded Scatter Plot of Residuals vs. Fitted Values: Coding points when a third variable, "SIZECODE", is of interest.

Thanks to Susannah King for providing this command line code:

First save the residuals and fitted values from your regression model

If your fitted and residual columns are fitted1 and residuals1

plot(fit1,residuals1,type="n",xlab="Fitted",ylab="Residuals")
points(fit1[SIZECODE==0],residuals1[SIZECODE==0],pch="0")
points(fit1[SIZECODE==1],residuals1[SIZECODE==1],pch="1")
title("Residuals vs. Fitted")
abline(0,0)

Normal Probability Plots
Another plot of interest in residual analysis is a normal probability plot, or qq-normal plot.

For a dataset where the variable of interest is "DO", go to "Graph" - "2D plot" - "QQ Normal with Line". In the "y" column, pick "DO".
QQ normal plots for residuals can also be produced within the regression and ANOVA windows by going to the "Results" tab and clicking the "QQ normal with line" option.

Multiple comparison procedures in ANOVA
First, fit your ANOVA model, and enter a model name under "Save Model Object".

Go to "Statistics" - "ANOVA" - "Multiple Comparisons".

Enter the name of your "Model Object".
Enter the variable name under "Levels Of".
Make sure "Plot Intervals" is checked.
"Comparison Type": For all pairwise procedures, enter "mca". If you just want CIs for each variables, enter "none".
"Method": Choose Bonferroni, Scheffe, or Tukey.
Error type can be specified as "family-wise" if you choose "Tukey" or "Scheffe". Bonferroni is a "comparison-wise" procedure.
Note that "best.fast" uses all possible multiple comparison procedures and finds the one with the narrowest confidence interval. Typically, you will want to choose "family-wise" if you choose this option.
It is possible to specify contrasts by entering the name of the "Contrast Matrix". But for this course, you'll do specific contrasts by hand.

Coded Scatter Plot
Coded Scatter Plot Directions

Fit a linear regression model
Once your data is loaded and in front of you on the Splus spreadsheet, go to "Statistics" - "Regression" - "Linear". Make sure the name of your dataset is entered and specify the dependent (Y) variable. Shift click to choose multiple explanatory (X) variables. Or you can enter the formula directly, "Y~X", for example.

Regression Line Plot
Regression Line Plot Directions

Linear Correlation Coefficient
Linear correlation Coefficient directions

Polynomial Regression Models
Directions for fitting a second order polynomial regression model.

Plotting: Go to "Graph" - "2D Plot" - "Fit-Polynomial Curvefit". Choose the X and Y variables. Click on the "Curve Fitting" tab and check that you are fitting "Curve Fit Type: Linear" and "Poly. Order 2". Click "OK" to produce the plot.
Next, fit the model by choosing "Statistics" - "Regression" - "Linear". In the "Formula" box, type "Y ~ X + I(X^2)".

Plot of prediction intervals, confidence intervals for the mean response, and confidence intervals for the regression line

CI for mean response: Under "Graph", "2D Plot", "Fit-Linear Least Squares", specify the x-Column and the y-column. Then click on the "By Conf Bound" tab. Choose "Confidence 0.90" at the bottom or whatever bounds you choose. Then choose a line style (other than "None"), color and width. Click OK.
A function which produces the 3 types of intervals

Plotting a fitted regression for log-transformed data on their original scale in Splus
Directions.

Creating a script file in Splus You can create a script file in Splus (like a macro in Excel) as follows:

Go to "File" "New" "Script". Type each command on separate lines in the file. To run the file, press "F10", and the output will be produced in the lower window.

Coding an indicator variable in Splus

There are 2 ways to code an indicator variable in Splus.

Manual coding into the Splus data sheet: Review section 9.3.2 on p. 235-8 of the Sleuth. In the meadowfoam example, Display 9.7 shows how variables for Light Level Indicators can be coded in terms of 0-1 variables. In Section 9.3.3, the Sleuth suggests coding a 6-level indicator variable using 5 of the indicator variables in Display 9.7. One variable is selected as the reference level, and that variable does not have an indicator assigned to it. Note that using this manual coding method, the indicator variables you create are not designated as "factor" but are "double" (under "Data"-"Change Data Type"). You can redesignate the reference level by recoding the indicator variables.
Letting Splus code the variables for you: Again, we wish to establish one level as the "reference level" and code other indicators relative to that reference level.
1. Open the Command Line Window (Select Command Window from the Windows menu)
2. To have S-Plus create all of our dummy variables automatically for us, we need to change one option on how S-Plus handles factors. In the command window, enter
  
  options(contrasts="contr.treatment")
  If we have a categorical or factor variable with k levels, this will tell S-Plus to create k-1 dummy or indicator variables, where the first dummy variable is an indicator for the 2nd level, the 2nd dummy variable is an indicator of the 3rd level, and the last dummy variable is an indicator of the last level. The first level corresponds to having all dummy variables equal to 0.
3. For an indicator variable called "code", you need to tell S-Plus to treat "code" as a factor rather than as a continuous variable. Go to the Data menu and select Change Data Type. Select the column for "code", and then for Column Type select "factor". Click on OK.
4. If "code" had entries with levels "1", "2", "3", then the reference level that Splus will pick will be "1". If "code" had entries of "B", "A" and "C", Splus will pick "A" as the reference level. If you are confused by this, work through the bat data in Case Study 2 of Chapter 10. In Display 10.9, you can fit models with different reference levels and check that you get the same intercept as the Sleuth authors.
Coded Scatterplot with Fitted Regression Lines Superimposed:Command line instructions
The following commands will produce a plot for a dataset, "data.asc", in which the columns are Y, X1 and code. Let "X1" and "Y" be continuous variables and let "code" be an indicator variable. We assume that "code" is coded as a "0" or a "1". The first command, "attach", allows you to refer to the variable names in the dataframe directly. Without it, Splus command line only knows about "data.asc" but not the names of the variables in it.

attach(data)
plot(X1,Y,type="n",xlab="XLABEL",ylab="YLABEL")
points(X1[code==0],Y[code==0],pch="0")
points(X1[code==1],Y[code==1],pch="1")
title("Whatever title you want")

Now you want to plot the regression lines for each level of the indicator variable. First, you'll need to calculate the slope and intercept of the lines you will add. To add a line to your plot, use the command "abline(intercept,slope)". So let's add a line to our plot with intercept 1.5 and slope 2.
abline(1.5,2)

You can add each line in a similar way.

Add a legend:
legend(xcoord,ycoord,legend=c("Code=1","Code=0"),pch="01")
Note that you need to enter the xcoord and ycoord yourself. This is the point that marks the upper left hand corner of the legend box.
You can always change the range of the X1 and Y axes by typing, for example:
xlim=c(100,200),ylim=c(300,400) as extra arguments, separated by commas, in the plot statement.
Getting Fancy: You can make a dotted line by adding "lty=2" to the abline command, so that it reads abline(1.5,2,lty=2).

Labeling points on a graph with text
Here is how to make a scatterplot where the points are the observation numbers:
plot(x, y)
text(x, y)
### this will type the observation number (instead of a dot) at each point.
Now plot the points with the observation number above them:
plot(x,y)
text(x,y+.01)
### the "+0.01" will add a tiny bit to the y coordinate so that the number appears above the point. You have to decide what this tiny bit is -- it will depend on the scale of the y-axis.

You can add text at any place on a graph you want. If I want to add "function 1" at x=.3 and y=.7, I would type:
text(.3, .7, "function 1")
Matrix of scatterplots for multiple variables.
Go to "Graph" - "2D Plot". The menu is divided into the left "Axis Type" side and the right "Plot Type" side. Under "Axis Type" choose "Matrix". Under "Plot Type" choose "Scatter plot Matrix". You will control-click on each variable you would like included in the matrix.

Plotting a fitted model for a nonlinear function on the original scale of the data (command line).
Let's say you have just fit the model log(p/(1-p))~log(d).
Write this equation in terms of p as follows: p= ( exp(beta0) X^beta1 ) / ( 1 + exp(beta0) X^beta1 ) ).
Create a new column in Splus using a transformation of data. In the "Expression Box", type: ( exp(beta0) X^beta1 ) / ( 1 + exp(beta0) X^beta1 ) ). Of course you will substitute estimates of beta0 and beta1 directly into the expression. We will name this new column Yhat for this example.
Follow the command line directions in Coded Scatterplot with Fitted Regression Lines Superimposed to create scatterplot of X on Y on their original scale. Perhaps it will be a scatterplot coded according to some grouping variable. Now, to add the line (instead of an "abline"), type
lines(X,Yhat)
Line types can be specified using "lty=" a number from 1 to 10 (1 is a solid line).
lines(X,Yhat,lty=2)
If you have added multiple lines, a legend can be produced as follows legend(xcoord,ycoord,legend=c("Code=1","Code=0"),lty=1:2)
This will produce a legend with a solid line (lty=1) representing Code=1 and a dotted line (lty=2) representing Code=0. Note that you need to enter the xcoord and ycoord yourself. This is the point that marks the upper left hand corner of the legend box.

Saving residuals and/or fitted values from a regression model.
Under "Statistics" "Regression" "Linear", select the "Results" tab in the "Linear Regression" pop-up box. Under "Saved Residuals", check the box labeled "residuals" and under "Save In", choose your dataset (or some other dataset if necessary.) See also Saving and referring to components of a fitted model object, noted below.

Saving and referring to components of a fitted model object.
Fit the model of interest, say, "flowers~time + intensity". Also in the same window you will see "Save Model Object" "Save As". Save the model object as "mymodel".
Once you have saved your model you can go to the command line and type "summary(mymodel)". Compare this result to that produced by the point-and-click regression fitting method. Another useful command for an anova table is "summary(aov(mymodel))" Other options in the command line include: "mymodel$coefficients", "mymodel$fitted.values", "summary(mymodel)$r.squared", "summary(mymodel)$sigma", "sqrt(sum(mymodel$residuals^2)/mymodel$df.residual)".

ESS F-tests
Let's say that you have 2 nested models that you wish to compare. To use the bee pollen data, consider these models:
- log(percent/(1-percent))~elevation and
- log(percent/(1-percent))~elevation*location.
Save the fitted regression models. (see above) Call the first "equal.lines.lm", and the second "sep.lines.lm". Go to "Statistics" and "Compare Models". Select "Model Class" as "lm", which means you are looking among linear models. Go to the "Model Objects" drop down menu, and shift-click on the 2 models. Make sure that "F" is checked and hit OK. You'll see the F-statistic of interest.

Influence diagnostics for regression: Cooks Distance, leverage, externally studentized residuals and DFFITS
- You will load a script file, called diagplot.s to load a function called "diagplot.fun". This function has 2 arguments, "x" and "y". The "x" is a matrix of all columns of "x" in your model, and the "y" is the response column.
- For example, a possible command might be: diagplot.fun(cbind(height, weight, age), IQ) if we wish to predict IQ from height, weight and age.
- You must create a new column for all interaction terms that may be of interest and put these within the "cbind" parentheses.
- The Statistical Sleuth covers leverage, Cook's distance, and externally studentized residuals, all of which are plotted using the script file. Note that a plot entitled "DFFITS" is also included. This measures the ith case's influence on the fitted value of Y_i. Case i is relatively influential if the absolute value of the ith value of DFFITS is greater than 2*sqrt(p/n). This statistic differs from Cook's Distance by a scale factor and the replacement of the estimated model error by an estimate of model error calculated without the ith case.