Tuesday, August 26, 2008

Example of a data analysis in R



Here is a typical data analysis in R. I'm a baseball fan, and I'm interested in the wins and losses of the current Major League teams and in how those wins and losses relate to runs scored and runs allowed.

In Excel, I create a dataset that contains the wins, losses, runs scored, and runs allowed for all 30 teams. Here are the first four rows of the dataset.

Team        League    W   L   RS   RA
Tampa Bay   American  79  50  597  515
Boston      American  75  55  670  559
NY Yankees  American  70  60  632  585
Toronto     American  67  63  574  510

I save this dataset as "baseball2008.txt" in tab-delimited text format. I put the file in a folder called "eda" and make that folder the current R working directory so that R can find the file.
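If you want to try the file-reading step without building the whole spreadsheet, here is a minimal sketch. It assumes only the four rows shown above (not the full 30-team file) and writes them to a temporary file, so the names rows, path, and sample.data are my own placeholders rather than anything from the real analysis.

```r
# Stand-in for baseball2008.txt: only the four rows shown above,
# written to a temporary tab-delimited file so the example is self-contained.
rows <- c("Team\tLeague\tW\tL\tRS\tRA",
          "Tampa Bay\tAmerican\t79\t50\t597\t515",
          "Boston\tAmerican\t75\t55\t670\t559",
          "NY Yankees\tAmerican\t70\t60\t632\t585",
          "Toronto\tAmerican\t67\t63\t574\t510")
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# Same kind of call as in the analysis, pointed at the temporary file.
# sep = "\t" matters here: team names such as "Tampa Bay" contain spaces.
sample.data <- read.table(path, header = TRUE, sep = "\t")

# Check that the columns came through with the expected names and types.
str(sample.data)
head(sample.data)
```

Looking at str() right after reading is a cheap way to catch a wrong separator or a missing header line before doing any arithmetic.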

Here is my analysis:

# read the datafile into R

data=read.table("baseball2008.txt",header=TRUE,sep="\t")

# attach the data frame so its columns (W, L, RS, RA, ...) can be used by name

attach(data)

# compute the winning proportion for all teams

win.prop=W/(W+L)

# what was the largest winning proportion?

max(win.prop)

[1] 0.6183206

# which team had the largest winning proportion?

Team[win.prop==max(win.prop)]

[1] Chicago Cubs
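Another way to pick out the top team is which.max(), which returns the position of the largest value. This is only a sketch on the four teams listed in the table above, so the answer it gives is the best of those four and not the league-wide leader.

```r
# Only the four teams shown in the table above, not the full dataset.
Team <- c("Tampa Bay", "Boston", "NY Yankees", "Toronto")
W <- c(79, 75, 70, 67)
L <- c(50, 55, 60, 63)

win.prop <- W / (W + L)

# which.max() gives the index of the first largest value
best <- which.max(win.prop)

Team[best]
win.prop[best]
```

One difference worth knowing: the comparison win.prop==max(win.prop) returns every team tied for the top spot, while which.max() returns only the first one.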

# for each team, compute the number of runs scored per game

runs.game=RS/(W+L)

# construct a stemplot of the runs scored per game
# (I'm assuming you have the aplpack package installed)

library(aplpack)
stem.leaf(runs.game)

  1 | 2: represents 1.2
  leaf unit: 0.1
  n: 30

    1     s | 7
    4    3. | 889
    7    4* | 011
    8     t | 2
   12     f | 4555
   (6)    s | 666667
   12    4. | 88899
    7    5* | 00011
          t |
    2     f | 45

# a histogram gives a similar display

hist(runs.game)

# two teams stand out with respect to runs scored per game

Team[runs.game>5.3]

[1] Texas Chicago Cubs

# what is the average number of runs per game scored by a team?

mean(runs.game)

[1] 4.621464

# do American League teams score more than National League teams?

mean(runs.game[League=="American"])

[1] 4.756021

mean(runs.game[League=="National"])

[1] 4.503728
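The two league means can also be computed in one step with tapply(). The numbers below are made up purely to show the call (they are not real team data), so don't read anything into the result.

```r
# Invented toy values for illustration only, NOT the real 2008 data.
runs.game <- c(5.0, 4.0, 4.5, 3.5)
League <- c("American", "American", "National", "National")

# tapply() splits runs.game by League and applies mean() to each group
league.means <- tapply(runs.game, League, mean)
league.means

# difference between the two group means
league.means["American"] - league.means["National"]
```

With the real data attached, the same idea would be tapply(runs.game, League, mean), which avoids typing each league name by hand.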

# there is an interesting relationship between a team's winning
# proportion and the number of runs scored and runs allowed
#
# the relationship is log(W/L) = k log(RS/RA)
# where k is typically near 2
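In case the log form looks unfamiliar: as far as I know it is the same thing as what baseball analysts usually call the Pythagorean formula for winning proportion (that name is my addition, it isn't used elsewhere in this post). A short derivation, with k as the exponent:

```latex
\frac{W}{W+L} = \frac{RS^{k}}{RS^{k} + RA^{k}}
\quad\Longrightarrow\quad
\frac{L}{W+L} = \frac{RA^{k}}{RS^{k} + RA^{k}}
\quad\Longrightarrow\quad
\frac{W}{L} = \left(\frac{RS}{RA}\right)^{k}
\quad\Longrightarrow\quad
\log\frac{W}{L} = k \,\log\frac{RS}{RA}
```

So on the log scale the relationship is a straight line through the origin with slope k, which is why a least-squares line is a natural way to estimate k.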

# we'll compute log(W/L), log(RS/RA)
# and fit a least-squares line -- we'll see if the slope is close to 2

log.WL=log(W/L)
log.RR=log(RS/RA)

lm(log.WL ~ log.RR)

Coefficients:
(Intercept) log.RR
-0.0003097 1.8377065

# here we see the slope is 1.84, which is relatively close to 2
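Two follow-ups you might try: look at the uncertainty in the slope with confint(), and fit the line with no intercept, since the relationship log(W/L) = k log(RS/RA) has none. This is just a sketch using the four teams from the table at the top, so its estimates will not match the 30-team fit above; fit and fit0 are names I made up for the example.

```r
# Only the four teams shown in the table at the top, not the full dataset.
W  <- c(79, 75, 70, 67)
L  <- c(50, 55, 60, 63)
RS <- c(597, 670, 632, 574)
RA <- c(515, 559, 585, 510)

log.WL <- log(W / L)
log.RR <- log(RS / RA)

# usual least-squares line with an intercept
fit <- lm(log.WL ~ log.RR)
coef(fit)

# 95% confidence intervals for the intercept and slope
confint(fit)

# line forced through the origin: the slope estimates k directly
fit0 <- lm(log.WL ~ 0 + log.RR)
coef(fit0)
```

With the full dataset, checking whether the interval for the slope covers 2 is a more careful version of saying the slope is "relatively close to 2".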
