Tuesday, November 18, 2008

Learning about Lengths of Baseball Games

One complaint about baseball is that it is a relatively long game, compared to American football or basketball.

To understand more about the lengths of baseball games, I collected the lengths (in minutes) of all games played during the 2007 season.  Here are my questions:

1.  Are the times of baseball games normally distributed?
2.  If not, how do the times differ from a normal curve?

I'll illustrate the R work. You'll need a couple of packages: LearnEDA and vcd (for the rootogram function).

1.  First I'll read in the baseball game times, set up the bins, and construct the rootogram (essentially a histogram that plots the square roots of the counts instead of the counts themselves).

library(LearnEDA)
# the file has a header row; the game lengths are in the column named time
data=read.table("game.time.07.txt",header=TRUE)
attach(data)

# set up bins and construct a rootogram

bins=seq(100,300,by=20)
bin.mids=(bins[-1]+bins[-length(bins)])/2
h=hist(time,breaks=bins)

h$counts=sqrt(h$counts)
plot(h,xlab="TIME",ylab="ROOT FREQUENCY",main="")
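
If you don't have the data file, roughly the same picture can be sketched in base R from the ten bin counts that appear in the output table further down in this post. This is only an illustration under stated assumptions: the vector counts is typed in by hand from that table, labels is a helper name I made up, and barplot stands in for the hist and plot calls above.

```r
# bin boundaries used in the post
bins <- seq(100, 300, by = 20)

# observed counts per bin, copied by hand from the output table in this post
counts <- c(1, 8, 22, 27, 20, 12, 6, 1, 1, 1)

# a rootogram plots the square roots of the counts
root <- sqrt(counts)

# label each bar with its bin, e.g. "100-120" (labels is a hypothetical helper)
labels <- paste(head(bins, -1), bins[-1], sep = "-")

# space = 0 makes the bars touch, as in a histogram
barplot(root, names.arg = labels, space = 0, cex.names = 0.7,
        xlab = "TIME", ylab = "ROOT FREQUENCY")
```

Taking roots shrinks the tall middle bars much more than the short tail bars, so the tails become easier to see.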

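2.  Now I want to fit a normal comparison curve.  I get the mean and standard deviation of the matching normal curve from the letter values: the mean is the midpoint of the two fourths, and the standard deviation is the fourth-spread divided by 1.349 (the fourth-spread of a standard normal curve).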

# find mean and standard deviation of matching Gaussian curve

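S=lval(time)                       # letter values of the game times
f=as.vector(S[2,2:3])              # row 2 holds the lower and upper fourths
m=as.integer((f[1]+f[2])/2)        # midpoint of the fourths, truncated to an integer
sd=as.integer((f[2]-f[1])/1.349)   # fourth-spread / 1.349, truncated to an integer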

The matching curve is approximately N(175, 24), that is, a mean of 175 minutes and a standard deviation of 24 minutes.
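
To see where the 1.349 comes from and how the fourths give a mean and standard deviation, here is a minimal base R sketch. It does not use LearnEDA or the real game times: x is simulated data that I made up as a stand-in, and I use fivenum (whose hinges play the role of the fourths) in place of lval.

```r
# simulated stand-in for the game times (an assumption, not the real data)
set.seed(1)
x <- rnorm(1000, mean = 175, sd = 24)

# fivenum() returns min, lower hinge, median, upper hinge, max;
# the two hinges play the role of the fourths
f <- fivenum(x)[c(2, 4)]

# the fourth-spread of a standard normal curve
round(2 * qnorm(0.75), 3)   # 1.349

m.hat <- mean(f)                           # midpoint of the fourths
sd.hat <- diff(f) / (2 * qnorm(0.75))      # fourth-spread / 1.349

c(m.hat, sd.hat)
```

Because the fourths ignore the extreme values, these estimates are not pulled around by a few very long games the way the ordinary mean and standard deviation would be.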

3.  Is this normal approximation adequate?  The function fit.gaussian computes the expected count for each bin and the residuals sqrt(observed) - sqrt(expected).

# function fit.gaussian.R fits Gaussian curve to the counts

s=fit.gaussian(time,bins,m,sd)

# output observed and expected counts

output=round(data.frame(s$counts, sqrt(s$counts), s$probs, s$expected, 
     sqrt(s$expected), sqrt(s$counts)-sqrt(s$expected)),2)
names(output)=c("count","root","prob","fit","root fit","residual")
output

   count root prob   fit root fit residual
1      1 1.00 0.01  1.00     1.00     0.00
2      8 2.83 0.06  6.08     2.47     0.36
3     22 4.69 0.19 19.17     4.38     0.31
4     27 5.20 0.32 31.34     5.60    -0.40
5     20 4.47 0.27 26.60     5.16    -0.69
6     12 3.46 0.12 11.72     3.42     0.04
7      6 2.45 0.03  2.67     1.64     0.81
8      1 1.00 0.00  0.32     0.56     0.44
9      1 1.00 0.00  0.02     0.14     0.86
10     1 1.00 0.00  0.00     0.02     0.98
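
For readers without LearnEDA, the table can be approximated in base R. This is a sketch of what I assume fit.gaussian is doing (total count times the normal probability of each bin), not its actual source; the counts are typed in from the table above, and s is a name I chose to avoid masking the sd function.

```r
bins <- seq(100, 300, by = 20)
counts <- c(1, 8, 22, 27, 20, 12, 6, 1, 1, 1)   # "count" column of the table
m <- 175   # mean of the matching normal curve
s <- 24    # standard deviation of the matching normal curve

# probability that a N(m, s) value falls in each bin
probs <- diff(pnorm(bins, mean = m, sd = s))

# expected count per bin, and residuals on the root scale
expected <- sum(counts) * probs
residual <- sqrt(counts) - sqrt(expected)

round(data.frame(count = counts, prob = probs,
                 fit = expected, residual = residual), 2)
```

Under that assumption the fit and residual columns agree closely with the output above, including the run of positive residuals in the last four bins.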


# place root expected counts on top of rootogram

lines(bin.mids,sqrt(s$expected),lwd=4,col="red")

4.  The rootogram function in vcd displays the residuals.  Both plots below show the same information, but in different ways.

# load library vcd that contains rootogram function

library(vcd)

# recompute the histogram object without plotting
# (h$counts was overwritten with the root counts earlier)

h=hist(time,breaks=bins,plot=FALSE)

# illustrate "hanging" style of rootogram

r=rootogram(h$counts,s$expected,type="hanging")  # this is the hanging style 

# illustrate "deviation" style of rootogram

r=rootogram(h$counts,s$expected,type="deviation") 
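
To make the link between the two styles concrete, here is a rough base R sketch that draws both from the counts in the output table, without vcd. The names bar.top, bar.bottom, and deviation are mine, the fit is recomputed with pnorm under the N(175, 24) curve, and the sign I use for the deviation bars (fitted minus observed) is my assumption, chosen so that they line up with the bottoms of the hanging bars.

```r
bins <- seq(100, 300, by = 20)
mids <- (bins[-1] + bins[-length(bins)]) / 2
counts <- c(1, 8, 22, 27, 20, 12, 6, 1, 1, 1)   # from the output table
expected <- sum(counts) * diff(pnorm(bins, mean = 175, sd = 24))

root.obs <- sqrt(counts)
root.fit <- sqrt(expected)

# hanging style: each bar hangs from the fitted curve and has length root.obs
bar.top <- root.fit
bar.bottom <- root.fit - root.obs

# deviation style: the same gaps, drawn as bars that start at zero
deviation <- root.fit - root.obs

op <- par(mfrow = c(1, 2))
plot(mids, root.fit, type = "b", col = "red",
     ylim = range(bar.bottom, bar.top),
     xlab = "TIME", ylab = "ROOT FREQUENCY", main = "hanging")
rect(mids - 8, bar.bottom, mids + 8, bar.top)
abline(h = 0)

plot(mids, root.fit, type = "b", col = "red",
     ylim = range(deviation, root.fit),
     xlab = "TIME", ylab = "ROOT FREQUENCY", main = "deviation")
rect(mids - 8, 0, mids + 8, deviation)
abline(h = 0)
par(op)
```

In both panels a bar that reaches below the zero line marks a bin with more games than the normal curve expects.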

What have we learned?  Although the lengths of baseball games are roughly normal in shape, the fit breaks down in the right tail.  The last four bins all have more games than the normal curve expects: their residuals sqrt(observed) - sqrt(expected) range from 0.44 to 0.98, and in the hanging rootogram they show up as bars that drop below the zero line.

What does this mean?  Baseball game lengths have a long right tail, which means very long games occur more often than a normal curve predicts.  If you know anything about baseball, you know that a game goes into extra innings when it is tied after 9 innings, and these extra-inning games cause the long right tail.
