To read the data into S-Plus, choose from the S-Plus menus File, then Import Data, then From file. Choose the file that you have just saved. You should have a spreadsheet called "incomes" open with the data in one columns, labeled "V1". Note: By default, S-Plus will name the dataset according to the name of the file from which the data was imported. Your dataset will not be named "incomes" unless you named your data file "incomes" (plus file extension).
We're interested in assessing the mean income of the workers in this company. The data about the whole population is given in our data set. However, let's imagine that we don't have access to this whole file. Instead, we are forced to take a random sample of 40 employees, and make an estimate of the mean income using the incomes of the people in our sample.
Let's begin by taking this random sample of 40 people from our data set. Go to Data, Random sample. Choose "income" as your data set, and make sure to set the sample size as 40. This random sample should appear as a separate data sheet.
To find a 95% confidence interval for the population mean (based on the sample you just took), go to Statistics, Compare samples, One sample, t test. Note that we will be using this same menu for hypothesis tests (subject of the next few lectures), since these tests and the confidence interval methodology are closely related. Make sure to enter the name of the data set which contains the random sample and the correct column. You also need to choose 0 for the "mean under the null hypothesis", "two-sided" as the alternative hypothesis, and confidence level "0.95" In the report that is generated, you will find the confidence interval on the fourth and fifth lines.
Try this procedure again with other random samples, and see what happens.
numsamps _ 100 #How many samples will we take? sampsize _ 40 #How large will each sample be? alpha _ 0.05 #What is our level of significance? lower.bd _ rep(0,numsamps) upper.bd _ rep(0,numsamps) t.stat _ qt(1-(alpha/2), df=sampsize-1) #Calculates the t-statistic. for (i in 1:numsamps){ newsamp _ sample(incomes[,1], size=sampsize, replace=F) std.err _ sqrt(var(newsamp, unbiased=T)/sampsize) lower.bd[i] _ mean(newsamp) - (t.stat*std.err) upper.bd[i] _ mean(newsamp) + (t.stat*std.err) } #How many covered the true mean? covered _ rep(0, numsamps) covered[lower.bd < mean(incomes[,1]) & upper.bd > mean(incomes[,1])] _ 1 plot(c(min(lower.bd), max(upper.bd)), c(1,numsamps), xlab="", ylab="", type="n") for (i in 1:numsamps){ if (covered[i] == 1) segments(lower.bd[i], i, upper.bd[i], i, col=2) else segments(lower.bd[i], i, upper.bd[i], i, col=3) } abline(v=mean(incomes[,1])) title(sub=paste(sum(covered)," of ",numsamps," intervals covered the true mean of ",mean(incomes[,1])))