########################################################################
# This lab analyzes data from the US Table Tennis Association
#
# Each line in the file contains:
#
# 1. The winner's rating
# 2. The loser's rating
# 3. Which of eight tournaments the data is taken from
# 4. The difference between the two ratings
# 5. The average of the two ratings
#
# First we will go through some exercises to reinforce ideas about
# histograms and Normal curves. Then we will see how well the ratings
# predict the winner. Please feel free to try your own analyses. If
# you want to know how to carry out a computation or make a particular
# plot in S-Plus, just post a message to the newsgroup and I'll try to
# explain it. If you come up with something interesting, please post
# it to the group.

# How big is this data set? The answer is given as the number of rows
# and the number of columns

dim ( pingpong )

# Print out the first six rows, just to see what the data look like.
# Notice the notation. Square brackets are used for subsets of arrays.
# The 1:6 asks for rows 1-6. Then there is a comma. Then 1:5 asks for
# columns 1-5.

pingpong [ 1:6, 1:5 ]

# If you leave out either the rows or columns then you get all of them.
# So we could have written just

pingpong [ 1:6, ]

# and gotten the same thing.
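
# Here is a small sketch of the same subscript notation on a made-up
# matrix. The name "toy" and its values are invented for illustration;
# they are not part of the pingpong data.

toy <- matrix ( 1:12, nrow=4 )   # 4 rows, 3 columns, filled column by column
toy [ 1:2, ]                     # rows 1-2, all columns
toy [ , 2:3 ]                    # all rows, columns 2-3
toy [ 3, 2 ]                     # the single entry in row 3, column 2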

# Do player rankings follow the Normal curve? (We'll assume that all the
# rankings are for different players.) Begin by putting all the rankings
# into one vector. S-Plus thinks that pingpong is a "list", not a matrix.
# To turn it into a vector we have to "unlist" it.

rankings <- unlist ( pingpong[,1:2] )

# Notice the notation here. We left out the first subscript, so we're
# getting all rows. And the 1:2 says we're getting the first two columns,
# the rankings of the winners and losers.
# Do they follow the Normal curve? Look at the following histogram and
# see what you think.

hist ( rankings )

# To help us judge whether rankings follow the Normal curve, we can plot
# the Normal curve on top of the histogram. First we need the average and
# SD of the rankings.

Ave <- mean ( rankings )
SD <- sqrt ( mean ( ( rankings - Ave )^2 ) )
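
# A side note, offered as a sketch: the SD above divides by the number
# of values, n. The built-in var() divides by n-1 instead, so the two
# agree only after rescaling. The vector "v" below is made up for
# illustration.

v <- c ( 2, 4, 4, 4, 5, 5, 7, 9 )
n <- length ( v )
sd.n <- sqrt ( mean ( ( v - mean(v) )^2 ) )   # divides by n, as above
sd.n1 <- sqrt ( var ( v ) )                   # divides by n-1
sd.n                                          # 2
sd.n1 * sqrt ( (n-1)/n )                      # rescaled to match sd.n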

# Now we make the histogram over which we'll plot the Normal curve.
# We say prob=T to tell S-Plus to draw the histogram on the density scale,
# so that the areas of the bars add up to 1.

hist ( rankings, prob=T )

# Now we're ready to draw the Normal curve. We need a bunch of points
# along the x-axis where we'll plot the curve. I'll choose a sequence of
# 40 points spread out between 3 SD's below average and 3 SD's above
# average. Feel free to make a different choice.

x <- seq ( Ave-3*SD, Ave+3*SD, length=40 )

# The y-values will be the Normal density. (Why density? Because that's
# the scale on the y-axis.) Here's how to get them.
# The following command tells S-Plus to calculate the Normal curve
# above each of the x values and it says what average and SD to use.

y <- dnorm ( x, Ave, SD )

points ( x, y )

# To make the plot easier to see we'll draw the histogram without shading
# and we'll connect the dots with a line. The statement type="b" tells
# S-Plus to draw both points and lines.

hist ( rankings, prob=T, style.bar="old" )
points ( x, y, type="b" )

# Use the Normal curve to estimate the fraction of table tennis players
# with rankings above 2000. Do you think your estimate is too high, too
# low, or about right?
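
# One possible way to get the Normal-curve estimate, as a sketch: pnorm
# gives the area under the Normal curve to the left of a value, so one
# minus that is the area to the right. The helper name "frac.above" and
# its arguments are made up for this lab.

frac.above <- function ( cutoff, ave, s ) 1 - pnorm ( cutoff, ave, s )

# Try: frac.above ( 2000, Ave, SD )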

# Here's how to calculate the fraction directly from the data.

High <- rankings > 2000
Low <- rankings <= 2000

# Do you see what High and Low are? If not, display them, or just the
# first several elements of them.

sum ( High ) # The number of players with High rankings
sum ( Low ) # The number of players with Low rankings

sum(High)/length(rankings) # The fraction of players with High rankings
sum(Low)/length(rankings) # The fraction of players with Low rankings
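
# A shortcut, shown on a made-up vector: the average of a TRUE/FALSE
# vector is the fraction of TRUEs, because TRUE counts as 1 and FALSE
# as 0. The name "toy.r" and its values are invented for illustration.

toy.r <- c ( 1500, 2100, 1800, 2300 )
toy.r > 2000                      # one TRUE/FALSE value per entry
sum ( toy.r > 2000 )              # 2
mean ( toy.r > 2000 )             # 0.5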

# Was your estimate from the Normal curve too high, too low, or about right?

#------------------------------------------------------------------#

# Now let's compare winners and losers. One thing to notice is the average
# difference between the two groups compared to the spread within each
# group separately.

par ( mfrow=c(2,1) )
hist ( pingpong[,1] )
hist ( pingpong[,2] )

# It would be easier to compare these if they were drawn with the same
# x-axes. Redraw them with the same x-axes, without shading, and with
# Normal curves on each one. The first Normal curve should use the average
# and SD of the winners; the second Normal curve should use the average
# and SD of the losers.
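
# If you get stuck, here is one possible sketch, not the only way to do
# it. The helper name "normal.hist" is made up for this lab. Giving both
# histograms the same breaks forces them onto the same x-axis. The
# sketch does not turn off the shading; add style.bar="old" to the hist
# call for that, as we did earlier.

normal.hist <- function ( v, breaks )
{
    ave <- mean ( v )
    s <- sqrt ( mean ( ( v - ave )^2 ) )        # SD, dividing by n as before
    hist ( v, breaks=breaks, prob=T )           # density scale
    g <- seq ( min(breaks), max(breaks), length=40 )
    lines ( g, dnorm ( g, ave, s ) )            # Normal curve on top
    c ( ave, s )                                # return the average and SD
}

# Try:
#   b <- seq ( min(rankings), max(rankings), length=21 )
#   par ( mfrow=c(2,1) )
#   normal.hist ( pingpong[,1], b )
#   normal.hist ( pingpong[,2], b )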

#------------------------------------------------------------------#

# Now let's see how well the rankings can predict the winner. First
# we'll get a histogram of the absolute difference between the two
# players' rankings.

hist ( abs(pingpong[,4]) )

# This is just to see the range of differences. It looks sensible to
# group them by 100's. Write |D| for the absolute difference between
# the two rankings. Let's take all the matches in which |D| is less
# than 100, and see how often the higher ranked player wins. Then we'll
# do the same for |D| between 100 and 200, for |D| between 200 and 300,
# etc.

diff <- abs(pingpong[,4])
good1 <- diff < 100               # matches with |D| < 100
good2 <- diff < 200 & diff >= 100 # matches with |D| between 100 and 200
good3 <- diff < 300 & diff >= 200 # matches with |D| between 200 and 300
good4 <- diff < 400 & diff >= 300 # matches with |D| between 300 and 400
good5 <- diff >= 400              # matches with |D| >= 400

# Do you see what good1, good2, etc. are? If not, display them.

F.wins <- pingpong[,4] > 0 # matches where the favored player wins

# Find the number of matches in which the ranking difference was less than 100

sum ( good1 )

# Find the number of those matches in which the favored player won

sum ( good1 & F.wins )

# What's the ratio?

sum ( good1 & F.wins ) / sum ( good1 )

# Now repeat for the other four categories
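
# Once you have done one category by hand, a small helper can save
# typing. This is only a sketch; the name "win.rate" and its arguments
# are made up for this lab. It takes the signed differences d (assumed
# to be winner minus loser, as F.wins above assumes) and returns the
# fraction of matches with lo <= |d| < hi in which the favored player won.

win.rate <- function ( d, lo, hi )
{
    in.bin <- abs(d) >= lo & abs(d) < hi        # matches in this category
    sum ( in.bin & d > 0 ) / sum ( in.bin )     # fraction won by the favorite
}

# Try: win.rate ( pingpong[,4], 100, 200 )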




# That should tell you approximately how well the rankings can predict
# the winner. Do you see why?
#
# Here are some questions to think about. How would you make a plot or
# do a calculation to answer them?

# 1. Were the players equally strong in all 8 tournaments?

# 2. Were the rankings equally good predictors in all 8 tournaments?

# 3. Does the accuracy of the rankings depend on how good the players are?
# For example, suppose players whose rankings are around 1000 but whose
# difference is around 50 play each other. What's the chance that the
# higher ranked player wins? Now suppose players whose rankings are
# around 2000 but whose difference is around 50 play each other. What's
# the chance the higher ranked player wins? Is it the same in both cases?
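
# A possible starting point for questions 1 and 2, as a sketch: tapply
# splits a vector into groups and applies a function to each group.
# The vectors below are made up for illustration; with the real data
# the groups would come from pingpong[,3].

toy.avg <- c ( 1500, 1700, 2100, 1900, 1200, 1400 )  # average rating per match
toy.tourn <- c ( 1, 1, 2, 2, 3, 3 )                  # tournament of each match
tapply ( toy.avg, toy.tourn, mean )                  # 1600 2000 1300

# Try: tapply ( pingpong[,5], pingpong[,3], mean )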


# Try to answer these questions. Post your suggestions (and pleas for help)
# to the newsgroup.