In baseball, the objective is to win games, and a team wins a game by scoring more runs than its opponent. An interesting question is: how important is a single run towards the goal of winning a game? Suppose one collects the runs scored, the runs allowed, the number of wins, and the number of losses for a group of teams. Bill James, the writer and analyst who pioneered the statistical study of baseball, discovered the empirical relationship
log(wins/losses) = 2 log(runs scored/runs allowed)
He called this the Pythagorean Relationship.
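Exponentiating both sides puts the relationship in the form that is usually quoted. Here W, L, RS, and RA are just my shorthand for wins, losses, runs scored, and runs allowed.

```latex
\log\frac{W}{L} = 2\log\frac{RS}{RA}
\;\Longleftrightarrow\;
\frac{W}{L} = \left(\frac{RS}{RA}\right)^{2}
\;\Longleftrightarrow\;
\frac{W}{W+L} = \frac{RS^{2}}{RS^{2}+RA^{2}}
```

The sum of squares in the last denominator is the resemblance to the Pythagorean theorem that the name alludes to.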
Let's try to demonstrate this relationship by use of a resistant fit.
1. First, I collected data for all 30 major league teams in the 2008 season. The dataset teams2008.txt contains, for each team:
Team -- the name of the team
Wins -- the number of wins
Losses -- the number of losses
Runs.Scored -- the total number of runs scored
Runs.Allowed -- the total number of runs allowed
2. I read this dataset into R and compute the variables log.RR and log.WL.
data=read.table("http://bayes.bgsu.edu/eda/data/teams2008.txt",header=TRUE)
attach(data)
log.RR=log(Runs.Scored/Runs.Allowed)
log.WL=log(Wins/Losses)
3. I graph log.WL against log.RR and add team labels to the graph. As we hoped, the relationship looks pretty linear.
plot(log.RR,log.WL,pch=19)
text(log.RR,log.WL,Team,pos=2)
4. I next fit a resistant line using the rline function in the LearnEDA package. I add the fitted line to the graph.
library(LearnEDA)
the.fit=rline(log.RR,log.WL,iter=4)
curve(the.fit$a+the.fit$b*(x-the.fit$xC),add=TRUE)
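To give a feel for why a resistant line shrugs off unusual points, here is a rough sketch of the three-group idea: split the data into left, middle, and right groups by x, and run the line through the medians of the outer groups. This is only my simplified illustration, not the code inside rline. The function name resistant.line.sketch is made up, the choice of centering value is an assumption, and there is no iterative polishing like the iter argument provides.

```r
# A simplified three-group resistant line. Illustration only: this is
# not the LearnEDA implementation of rline, and the name is made up.
resistant.line.sketch <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 3)
  ord <- order(x)                    # sort the points by x
  x <- x[ord]
  y <- y[ord]
  n <- length(x)
  k <- floor(n / 3)                  # size of the left and right groups
  left <- seq_len(k)
  right <- seq(n - k + 1, n)
  # Summary points: medians of x and y within the two outer groups
  xL <- median(x[left]);  yL <- median(y[left])
  xR <- median(x[right]); yR <- median(y[right])
  b <- (yR - yL) / (xR - xL)         # slope through the outer summary points
  xC <- median(x)                    # centering value (an assumption of this sketch)
  a <- median(y - b * (x - xC))      # level: median of y after removing the slope
  list(a = a, b = b, xC = xC)
}

# Made-up data on the line y = 1 + 2x, with one wild value at the end
x <- 1:9
y <- 1 + 2 * x
y[9] <- 100
fit <- resistant.line.sketch(x, y)
fit$b                                # slope is unaffected by the wild value
```

Because the slope and level come from medians, the single wild value does not move them, whereas a least-squares line would be pulled toward it.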
5. If Bill James' relationship holds, the slope of the resistant line should be close to 2.
the.fit
$a
[1] 0.01006079
$b
[1] 1.801718
The relationship holds approximately: the fitted slope of about 1.8 is close to 2.
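How much does it matter whether the exponent is 2 or 1.8? Here is a small sketch that turns a run ratio into a winning fraction for any exponent. The team (800 runs scored, 700 allowed), the 162-game season, and the function name win.pct are all made up for illustration.

```r
# Winning fraction implied by log(W/L) = k * log(RS/RA),
# that is, W/L = (RS/RA)^k. The function name is hypothetical.
win.pct <- function(runs.scored, runs.allowed, k = 2) {
  ratio <- (runs.scored / runs.allowed)^k   # implied wins per loss
  ratio / (1 + ratio)                       # convert to W / (W + L)
}

games <- 162                                # assumed season length
# A made-up team that scores 800 runs and allows 700
wins.k2  <- games * win.pct(800, 700, k = 2)
wins.k18 <- games * win.pct(800, 700, k = 1.8)
wins.k2 - wins.k18                          # difference in predicted wins
```

For this made-up team the two exponents differ by only about one win over the season, which is one way to see that 1.8 and 2 tell nearly the same story.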
6. To see if this is a reasonable fit, we compute the fit and the residuals.
FIT=the.fit$a+the.fit$b*(log.RR-the.fit$xC)
RESIDUAL=log.WL-FIT
and then plot the residuals, adding the team labels.
plot(log.RR,RESIDUAL,pch=19)
abline(h=0)
text(log.RR,RESIDUAL,Team,pos=2)
7. What are we looking for in the residual plot? First, we look for general patterns that we didn't see earlier in the first plot. I don't see any trend, so it appears that we removed the tilt by fitting a line.
Also, we are looking for unusually small or large residuals. Here a "lucky team" corresponds to a team that won more games than one would expect based on its runs scored and runs allowed, that is, a team with a large positive residual.
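Residuals on the log(wins/losses) scale are hard to read directly, so it can help to translate them into games. The sketch below does that conversion. The function name wins.from.log.ratio is hypothetical, the 162-game season is an assumption, and the fitted value and residual are made up rather than taken from the 2008 data.

```r
# Convert a value on the log(wins/losses) scale into wins over a season.
# games = 162 is an assumed season length; the name is hypothetical.
wins.from.log.ratio <- function(log.wl, games = 162) {
  games * exp(log.wl) / (1 + exp(log.wl))
}

# A made-up team with a fitted value of 0.10 and a residual of 0.15
fitted.wins <- wins.from.log.ratio(0.10)
actual.wins <- wins.from.log.ratio(0.10 + 0.15)
actual.wins - fitted.wins   # wins above what the run ratio predicts
```

Applying the same function to FIT and to log.WL would give each team's "extra" wins, assuming every team played the same number of games.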
Which team was unusually lucky in the 2008 season? A hint: they were a "heavenly" team from the west coast.