Here is a typical data analysis in R. I'm a baseball fan, and I'm interested in the wins and losses of the current Major League teams and how they relate to runs scored and runs allowed.
In Excel, I create a dataset that contains the wins, losses, runs scored, and runs allowed for all 30 teams. Here are the first four rows of the dataset.
Team | League | W | L | RS | RA |
Tampa Bay | American | 79 | 50 | 597 | 515 |
Boston | American | 75 | 55 | 670 | 559 |
NY Yankees | American | 70 | 60 | 632 | 585 |
Toronto | American | 67 | 63 | 574 | 510 |
I save this dataset as "baseball2008.txt" in tab-delimited text format. I put the file in a folder called "eda" and make that folder the current R working directory so that R can find the file.
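If you want to try the reading step without the full spreadsheet, here is a minimal sketch that writes only the four rows shown above to a tab-delimited file and reads it back. The name first4 and the temporary file location are my own choices for illustration; the real file holds all 30 teams.

```r
# the four rows shown in the table above (not the full 30-team dataset)
first4 = data.frame(
  Team   = c("Tampa Bay", "Boston", "NY Yankees", "Toronto"),
  League = rep("American", 4),
  W  = c(79, 75, 70, 67),
  L  = c(50, 55, 60, 63),
  RS = c(597, 670, 632, 574),
  RA = c(515, 559, 585, 510)
)
# write a tab-delimited text file to a temporary folder (hypothetical location)
path = file.path(tempdir(), "baseball2008_sample.txt")
write.table(first4, path, sep="\t", row.names=FALSE, quote=FALSE)
# read it back with the same kind of call used for the full file
check = read.table(path, header=TRUE, sep="\t")
# look at the column names and types that R picked up
str(check)
```

If str(check) shows six columns, with W, L, RS, and RA read as numbers, the same read.table call should work on the full file.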
Here is my analysis:
# read the datafile into R
data=read.table("baseball2008.txt",header=TRUE,sep="\t")
# attach the data to make the variable names available in R
attach(data)
# compute the winning proportion for all teams
win.prop=W/(W+L)
# what was the largest winning proportion?
max(win.prop)
[1] 0.6183206
# which team had the largest winning proportion?
Team[win.prop==max(win.prop)]
[1] Chicago Cubs
# for each team, compute the number of runs scored per game
runs.game=RS/(W+L)
# construct a stemplot of the runs scored per game
# (I'm assuming you have the aplpack package installed)
library(aplpack)
stem.leaf(runs.game)
1 | 2: represents 1.2
 leaf unit: 0.1
            n: 30
    1     s | 7
    4    3. | 889
    7    4* | 011
    8     t | 2
   12     f | 4555
   (6)    s | 666667
   12    4. | 88899
    7    5* | 00011
          t |
    2     f | 45
# you get a similar display by using a histogram
hist(runs.game)
# there are two teams who stand out with respect to runs scored
Team[runs.game>5.3]
[1] Texas Chicago Cubs
# what is the average number of runs per game scored by a team?
mean(runs.game)
[1] 4.621464
# do American League teams score more than National League teams?
mean(runs.game[League=="American"])
[1] 4.756021
mean(runs.game[League=="National"])
[1] 4.503728
# there is an interesting relationship between a team's winning
# proportion and the number of runs scored and runs allowed
#
# the relationship is log(W/L) = k log(RS/RA)
# where k is typically near 2
# we'll compute log(W/L), log(RS/RA)
# and fit a least-squares line -- we'll see if the slope is close to 2
log.WL=log(W/L)
log.RR=log(RS/RA)
lm(log.WL ~ log.RR)
Coefficients:
(Intercept)       log.RR
 -0.0003097    1.8377065
# here we see the slope is 1.84, which is relatively close to 2
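The relationship log(W/L) = k log(RS/RA) can be rearranged: W/L = (RS/RA)^k, so the implied winning proportion is W/(W+L) = RS^k/(RS^k + RA^k). As a rough sketch, here is that prediction for the four teams listed at the top, using the fitted slope k = 1.84. The names pred.win.prop and first4 are mine, and the four rows are typed in by hand, so this illustrates the formula; it is not a check on the whole league.

```r
# winning proportion implied by log(W/L) = k log(RS/RA)
# (k = 1.84 is the rounded slope from the fit above)
pred.win.prop = function(RS, RA, k = 1.84) RS^k / (RS^k + RA^k)

# the four teams from the table at the top, entered by hand
first4 = data.frame(
  W  = c(79, 75, 70, 67),
  L  = c(50, 55, 60, 63),
  RS = c(597, 670, 632, 574),
  RA = c(515, 559, 585, 510),
  row.names = c("Tampa Bay", "Boston", "NY Yankees", "Toronto")
)

# actual and predicted winning proportions, side by side
actual = with(first4, W / (W + L))
predicted = with(first4, pred.win.prop(RS, RA))
round(data.frame(actual, predicted, row.names = rownames(first4)), 3)
```

I keep the data in a data frame and use with() here so that these four rows don't mask the attached variables W, L, RS, and RA from the full dataset.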