Exploratory Data Analysis: Letter values by group

The next topic is comparing groups. The first step is to construct a graph (such as a stemplot, boxplot, or dotplot) for each group, and the next step is to compute summaries (such as letter values) for each group. Here are some comments about using R to do this.

Let's return to the Boston Marathon data, where there are two variables, time and age. We are interested in comparing the completion times for different ages.

data(boston.marathon)

attach(boston.marathon)

By the way, this data frame is organized with two variables: a response (here time) and a group (here age).

1. GRAPHS

There is no simple way to construct parallel stemplots in R for different groups. Parallel boxplots are easy to construct by typing, say,

boxplot(time~age,ylab="Time",xlab="Group")

Also, it is easy to construct parallel dotplots by use of the stripchart function.

stripchart(time~age,ylab="Group")

Personally, I prefer dots that are solid black and are stacked:

stripchart(time~age,method="stack",pch=19,ylab="Group")
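Returning to stemplots: although there is no built-in parallel display, one workaround is to split the response by group and call stem once per group. The sketch below uses made-up data (the names resp, grp and by.group are only for illustration, and the values are not the marathon times). Note that stem picks the stems for each group separately, so the displays are not guaranteed to share a common scale.

```r
# Toy data (hypothetical, not the marathon data): a response and a grouping variable
set.seed(1)
resp <- round(rnorm(30, mean = 240, sd = 25))
grp  <- rep(c(20, 30, 40), each = 10)

# Split the response into a list with one vector per group
by.group <- split(resp, grp)

# Draw one stemplot per group, with a label in front of each
for (g in names(by.group)) {
  cat("\nGroup", g, "\n")
  stem(by.group[[g]])
}
```

If the groups have very different spreads, the scale argument of stem can be adjusted by hand to make the displays easier to compare.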

2. SUMMARIES BY GROUP

If you call the boxplot function with the plot = FALSE option, it will give you five-number summaries (almost) for each group. But the output isn't very descriptive, so I wrote a simple "wrapper" function where the output is easier to follow.

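lval.by.group=function(response,group)
{
  # compute the boxplot statistics without drawing; range=0 extends the whiskers to the extremes
  B=boxplot(response~group, plot=FALSE, range=0)
  S=as.matrix(B$stats)
  # label the rows with letter-value names and the columns with the group names
  dimnames(S)=list(c("LO","QL","M","QH","HI"), B$names)
  # return the labeled matrix
  S
}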

To use this in R, you just type the function into the R Console window. Or if you store this function in a file called lval.by.group.R, then you read the function into R by typing

source("lval.by.group.R")

Anyway, let me apply this function for the marathon data.

lval.by.group(time,age)

    20  30  40  50  60
LO 150 194 163 222 219
QL 222 213 224 251 264
M  231 235 239 262 274
QH 240 259 262 281 279
HI 274 330 346 349 338

attr(,"class")

"integer"

I think the output is pretty clear.
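As a cross-check on summaries of this kind, the same five numbers can be computed without boxplot by applying fivenum to each group. With range=0 the whiskers run out to the extremes, so the two approaches should agree. This is only a sketch on made-up values; resp, grp and Fv are illustrative names, not part of the marathon data.

```r
# Toy data (hypothetical values): a response and a two-level grouping variable
resp <- c(150, 222, 231, 240, 274, 260, 194, 213, 235, 259, 330, 300)
grp  <- rep(c("20", "30"), each = 6)

# Five-number summaries via boxplot(), whiskers extended to the extremes
B <- boxplot(resp ~ grp, plot = FALSE, range = 0)

# The same summaries computed directly: one fivenum() call per group
Fv <- sapply(split(resp, grp), fivenum)
rownames(Fv) <- c("LO", "QL", "M", "QH", "HI")
print(Fv)

# TRUE if the two approaches give the same numbers
print(isTRUE(all.equal(B$stats, Fv, check.attributes = FALSE)))
```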

3. WHAT IF YOUR DATA IS ORGANIZED AS (GROUP1, GROUP2, ...)?

Sometimes, it is convenient to read data in a matrix format, where the different groups are in different columns. An example of this is the population densities dataset.

data(pop.densities.1920.2000)

pop.densities.1920.2000[1:4,]

  STATE y1920 y1960 y1980 y1990 y2000
1    AL  45.8  64.2  76.6  79.6  87.6
2    AK   0.1   0.4   0.7   1.0   1.1
3    AZ   2.9  11.5  23.9  32.3  45.2
4    AR  33.4  34.2  43.9  45.1  51.3

You see that the 1920 densities are in the second column, the 1960 densities in the third column, etc.

Anyway, you want to put this in the [Response, Group] format. You can do this with the stack command. The input is the data frame (with the first column, STATE, removed) and the output is what you want.

d=stack(pop.densities.1920.2000[,-1])

d[1:5,]

  values   ind
1   45.8 y1920
2    0.1 y1920
3    2.9 y1920
4   33.4 y1920
5   22.0 y1920
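To see what stack does on something small enough to check by eye, here is a sketch with a made-up three-row data frame (wide, long and back are hypothetical names, and the state labels are placeholders). It also shows unstack, which goes back the other way.

```r
# Hypothetical wide data: one row per state, one column per year
wide <- data.frame(STATE = c("A", "B", "C"),
                   y1920 = c(45.8, 0.1, 2.9),
                   y1960 = c(64.2, 0.4, 11.5))

# Drop the label column, then stack the numeric columns into
# a two-column data frame: values (response) and ind (group)
long <- stack(wide[, -1])
print(long)

# unstack() reverses the operation, giving one column per group again
back <- unstack(long)
print(back)
```

The stacked version has one row per (state, year) pair, so three states and two years give six rows.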

Now we can compute the letter values by group (year) by typing

lval.by.group(d$values,d$ind)

     y1920   y1960   y1980   y1990   y2000
LO    0.10     0.4     0.7    1.00    1.10
QL   17.30    22.5    28.4   32.05   41.40
M    39.90    67.2    80.8   79.60   88.60
QH   62.35   114.6   157.7  181.65  202.85
HI 7292.90 12523.9 10132.3 9882.80 9378.00

Note that since I didn't attach the data frame d, I'm referring to the response variable as d$values and the grouping variable as d$ind.
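Another way to avoid both attach and the d$ prefix is to pass the data frame through the data argument of a formula-based function, or to wrap the call in with. A small sketch, assuming a toy data frame with the same column names that stack produces (the numbers are made up):

```r
# Toy stacked data frame (hypothetical values) with columns values and ind
d <- data.frame(values = c(1, 3, 5, 7, 9, 2, 4, 6, 8, 10),
                ind = factor(rep(c("y1920", "y1960"), each = 5)))

# The formula interface looks up values and ind inside d
B <- boxplot(values ~ ind, data = d, plot = FALSE, range = 0)
print(B$stats)

# with() evaluates an expression using the columns of d as variables
meds <- with(d, tapply(values, ind, median))
print(meds)
```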

Saturday, January 31, 2009
