Saturday, January 31, 2009

Letter values by group

The next topic is comparing groups.  The first step is to construct a graph (such as a stemplot, boxplot, or dotplot) for each group, and the next step is to compute summaries (such as letter values) for each group.  Here are some comments about doing this in R.

Let's return to the Boston Marathon data, which has two variables, time and age.  We are interested in comparing the completion times for different ages.

data(boston.marathon)
attach(boston.marathon)

By the way, this data frame is organized with two variables, response (here time) and group (here age).

1.  GRAPHS

There is no simple way to construct parallel stemplots in R for different groups.  Parallel boxplots are easy to construct by typing, say, 

boxplot(time~age,ylab="Time",xlab="Group")

Also, it is easy to construct parallel dotplots by use of the stripchart function.

stripchart(time~age,ylab="Group")

Personally, I prefer dots that are solid black and are stacked:

stripchart(time~age,method="stack",pch=19,ylab="Group")
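
As an aside, the formula interface of both boxplot and stripchart also accepts a data argument, so the same plots can be drawn without attaching the data frame.  Here is a minimal sketch; the data frame toy and its values are made up for illustration and are not the marathon data.

```r
# Hypothetical toy data in the same (response, group) layout.
toy <- data.frame(
  time = c(150, 222, 231, 240, 274, 194, 213, 235, 259, 330),
  age  = rep(c(20, 30), each = 5)
)

# data = toy tells R to look up time and age inside toy,
# so attach() is not needed.
b <- boxplot(time ~ age, data = toy, ylab = "Time", xlab = "Group")

# Stacked, solid dots, one row of dots per age group.
stripchart(time ~ age, data = toy, method = "stack", pch = 19, ylab = "Group")
```

The object b returned by boxplot holds the numbers behind the plot (for example b$stats and b$names), which is what the next section uses.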

2.  SUMMARIES BY GROUP

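If you call the boxplot function with the plot = FALSE option, it will give you five-number summaries (almost) for each group.  By default the extremes it reports are the ends of the whiskers, not the minimum and maximum; setting range = 0 makes the whiskers extend to the data extremes.  But the output isn't very descriptive, so I wrote a simple "wrapper" function where the output is easier to follow.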

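lval.by.group=function(response,group)
{
  # range=0 extends the whiskers to the extremes, so LO and HI are the min and max
  B=boxplot(response~group, plot=FALSE, range=0)
  # B$stats has one column of five summary values per group
  S=as.matrix(B$stats)
  dimnames(S)[[1]]=c("LO","QL","M","QH","HI")
  dimnames(S)[[2]]=B$names
  S
}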

To use this in R, you just type the function into the R Console window.  Or if you store this function in a file called lval.by.group.R, then you read the function into R by typing

source("lval.by.group.R")

Anyway, let me apply this function for the marathon data.

lval.by.group(time,age)
    20  30  40  50  60
LO 150 194 163 222 219
QL 222 213 224 251 264
M  231 235 239 262 274
QH 240 259 262 281 279
HI 274 330 346 349 338
attr(,"class")
       20 
"integer" 

I think the output is pretty clear.
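
As a cross-check, the same kind of table can be sketched without boxplot by applying fivenum to each group.  With range = 0, boxplot reports exactly the fivenum values, so the two routes should agree.  The vectors response and group below are made-up toy data, not the marathon data.

```r
# Hypothetical toy data: five values in group "a", four in group "b".
response <- c(1, 2, 3, 4, 5, 10, 20, 30, 40)
group    <- rep(c("a", "b"), times = c(5, 4))

# Tukey's five-number summary for each group, one column per group.
S <- sapply(split(response, group), fivenum)
rownames(S) <- c("LO", "QL", "M", "QH", "HI")
S

# The boxplot() route with range = 0 gives the same numbers.
B <- boxplot(response ~ group, plot = FALSE, range = 0)
all.equal(unname(S), B$stats)
```

For group "b" the lower and upper values QL and QH come out as 15 and 35, the averages of neighboring data values, which is how hinges are defined for an even number of observations.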

3.  WHAT IF YOUR DATA IS ORGANIZED AS (GROUP1, GROUP2, ...)?

Sometimes, it is convenient to read data in a matrix format, where the different groups are in different columns.  An example of this is the population densities dataset.

data(pop.densities.1920.2000)
pop.densities.1920.2000[1:4,]
  STATE y1920 y1960 y1980 y1990 y2000
1    AL  45.8  64.2  76.6  79.6  87.6
2    AK   0.1   0.4   0.7   1.0   1.1
3    AZ   2.9  11.5  23.9  32.3  45.2
4    AR  33.4  34.2  43.9  45.1  51.3

You see that the 1920 densities are in the second column, the 1960 densities in the third column, etc.

Anyway, you want to put this in the

[Response, Group]

format.  You can do this with the stack function.  The input is the data frame (with the first column, STATE, removed) and the output is a data frame with the responses in values and the group labels in ind.

d=stack(pop.densities.1920.2000[,-1])
d[1:5,]
  values   ind
1   45.8 y1920
2    0.1 y1920
3    2.9 y1920
4   33.4 y1920
5   22.0 y1920
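
To see exactly what stack does, here is a minimal sketch on a tiny made-up data frame.  The name wide and its values are hypothetical, chosen only to mimic the layout of the densities data.

```r
# Hypothetical wide-format data: one row per state, one column per year.
wide <- data.frame(
  STATE = c("AA", "BB", "CC"),
  y1920 = c(1, 2, 3),
  y1960 = c(4, 5, 6)
)

# Drop the label column, then stack the remaining columns on top of each other.
long <- stack(wide[, -1])
long

# values holds the responses; ind records which column each value came from.
table(long$ind)
```

Each column of the wide data becomes a block of rows in the long data, so the number of rows is the number of states times the number of years.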

Now we can compute the letter values by group (year) by typing

lval.by.group(d$values,d$ind)
     y1920   y1960   y1980   y1990   y2000
LO    0.10     0.4     0.7    1.00    1.10
QL   17.30    22.5    28.4   32.05   41.40
M    39.90    67.2    80.8   79.60   88.60
QH   62.35   114.6   157.7  181.65  202.85
HI 7292.90 12523.9 10132.3 9882.80 9378.00

Note that since I didn't attach the data frame d, I'm referring to the response variable as d$values and the grouping variable by d$ind.


