Saturday, January 31, 2009

Letter values by group

In the next topic, we talk about comparing groups.  The first step is to construct a graph (such as a stemplot, boxplot, or dotplot) across groups, and the next step is to compute summaries (such as letter values) for each group.  Here are some comments about using R to do this.

Let's return to the Boston Marathon data, where there are two variables, time and age.  We are interested in comparing the completion times for different ages.

library(LearnEDA)   # assuming the LearnEDA package (the source of this dataset) is installed
data(boston.marathon)
attach(boston.marathon)

By the way, this data frame is organized with two variables, response (here time) and
group (here age).
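If you want to check this organization for yourself, here is a quick peek (str and head are standard R utilities):

str(boston.marathon)    # two variables: time (response) and age (group)
head(boston.marathon)   # first few rows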

1.  GRAPHS

There is no simple way to construct parallel stemplots in R for different groups.  Parallel boxplots are easy to construct by typing, say, 

boxplot(time~age,ylab="Time",xlab="Group")

Also, it is easy to construct parallel dotplots using the stripchart function.

stripchart(time~age,ylab="Group")

Personally, I prefer dots that are solid black and are stacked:

stripchart(time~age,method="stack",pch=19,ylab="Group")

2.  SUMMARIES BY GROUP

If you apply the boxplot function with the plot = FALSE option, it will return five-number summaries (almost) for each group.  But the output isn't very descriptive, so I wrote a simple "wrapper" function whose output is easier to follow.

lval.by.group=function(response,group)
{
# boxplot with plot=FALSE computes the statistics without drawing;
# range=0 extends the whiskers to the extremes, so the first and
# last rows are the lowest and highest values rather than fences
B=boxplot(response~group, plot=FALSE, range=0)
S=as.matrix(B$stats)
# label the rows with the letter values and the columns with the group names
dimnames(S)[[1]]=c("LO","QL","M","QH","HI")
dimnames(S)[[2]]=B$names
S
}

To use this in R, you can just type the function into the R Console window.  Or, if you store the function in a file called lval.by.group.R, you can read it into R by typing

source("lval.by.group.R")

Anyway, let me apply this function to the marathon data.

lval.by.group(time,age)
    20  30  40  50  60
LO 150 194 163 222 219
QL 222 213 224 251 264
M  231 235 239 262 274
QH 240 259 262 281 279
HI 274 330 346 349 338
attr(,"class")
       20 
"integer" 

I think the output is pretty clear.
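One nice byproduct of returning a matrix: you can compute a spread for each group directly from the letter values.  A minimal sketch, using the row names assigned by the function above:

S=lval.by.group(time,age)
S["QH",]-S["QL",]    # fourth spread for each age group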

3.  WHAT IF YOUR DATA IS ORGANIZED AS (GROUP1, GROUP2, ...)?

Sometimes, it is convenient to read data in a matrix format, where the different groups are in different columns.  An example of this is the population densities dataset.

data(pop.densities.1920.2000)
pop.densities.1920.2000[1:4,]
  STATE y1920 y1960 y1980 y1990 y2000
1    AL  45.8  64.2  76.6  79.6  87.6
2    AK   0.1   0.4   0.7   1.0   1.1
3    AZ   2.9  11.5  23.9  32.3  45.2
4    AR  33.4  34.2  43.9  45.1  51.3

You see that the 1920 densities are in the second column, the 1960 densities in the third column, and so on.

Anyway, you want to put this in the

[Response, Group]

format.  You can do this with the stack command: the input is the data frame (with the first column, STATE, removed), and the output is exactly what you want.

d=stack(pop.densities.1920.2000[,-1])
d[1:5,]
  values   ind
1   45.8 y1920
2    0.1 y1920
3    2.9 y1920
4   33.4 y1920
5   22.0 y1920

Now we can compute the letter values by group (year) by typing

lval.by.group(d$values,d$ind)
     y1920   y1960   y1980   y1990   y2000
LO    0.10     0.4     0.7    1.00    1.10
QL   17.30    22.5    28.4   32.05   41.40
M    39.90    67.2    80.8   79.60   88.60
QH   62.35   114.6   157.7  181.65  202.85
HI 7292.90 12523.9 10132.3 9882.80 9378.00

Note that since I didn't attach the data frame d, I'm referring to the response variable as d$values and the grouping variable by d$ind.
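By the way, if you don't like the $ notation, the with function (standard in R) gives the same result:

with(d, lval.by.group(values, ind))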



Wednesday, January 28, 2009

How to Bin?

One of you asked me a question about the Fathom assignment on constructing histograms.  Let me describe the issues by considering a histogram for a new dataset.

Today I was thinking about snow and wondering if 6 inches is really a lot of snowfall.  So I thought I would look for some data showing levels of snowfall for different locations in the United States.

At the National Climatic Data Center site http://www.ncdc.noaa.gov/ussc/USSCAppController?action=map, one can download a "1.1 MB ASCII text file containing the maximum 1-day, 2-day, and 3-day snowfall for all available stations in the Lower 48 States and Alaska based on data through December 2006."  This sounded interesting so I downloaded this data, created a text datafile, and then read this into R.  (I had to do some data cleaning -- a little tedious.)

data=read.table("snowclim-fema-sep2006.txt",header=T,sep="\t")

This dataset contains the maximum 1-day, 2-day, and 3-day snowfall for 9087 weather stations in 49 states in the U.S.

Let's focus on the max 1-day snowfall for all of the stations in Ohio -- remember, I want to figure out how extreme 6 inches really is.

Using the subset command, I create a new data frame ohio that contains the data for the Ohio stations.

ohio=subset(data,data$State=="OH")

The variable of interest is called Max.1.Day.

Let's say that I wish to construct a histogram by hand -- that is, choose the bins by hand. 

First, I use the summary command to figure out the extremes.  The variable falls between 6.5 and 21.

summary(ohio$Max.1.Day)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.50   11.25   13.00   13.28   15.00   21.00 

Let's try using 7 bins (why not?).  The range of the data is 21 - 6.5 = 14.5.  If I divide this range by 7, I get 14.5/7 = 2.07, which is approximately 2 (a prettier number than 2.07).
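This arithmetic is easy to check in R (a small sketch using the ohio data frame from above):

diff(range(ohio$Max.1.Day))     # range of the data: 14.5
diff(range(ohio$Max.1.Day))/7   # width per bin: about 2.07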

So let's try using the bins

(6, 8),  (8, 10), (10, 12), (12, 14), (14, 16), (16, 18), (18, 20), (20, 22)

(Actually this is 8 bins, but that's ok -- it is close to 7 bins.)

In R, I create a vector that gives these bin endpoints -- we call these my.breaks.

my.breaks=seq(6,22,2)

I use the hist command with the breaks option.  Also, I want to show "6 inches" on the graph.  I do this by adding a vertical line (the abline command) and labeling it with the text command. 

hist(ohio$Max.1.Day,breaks=my.breaks)
abline(v=6,lwd=3,col="red")
text(8,45,"6 inches",col="red")

[Histogram of ohio$Max.1.Day with bins of width 2; a red vertical line marks 6 inches]

Is this a good-looking histogram?  In other words, did I make a reasonable choice of bins?  
This histogram is pretty smooth, and I get a clear idea of the shape of the data (a little right-skewed).  But maybe I need more bins to see finer structure in the histogram.

Let's try using bins of length 1.

my.breaks=seq(6,22,1)
hist(ohio$Max.1.Day,breaks=my.breaks)
abline(v=6,lwd=3,col="red")
text(8,30,"6 inches",col="red")

[Histogram of ohio$Max.1.Day with bins of width 1; a red vertical line marks 6 inches]

Well, I think we went too far.  The histogram is less smooth -- I see some bumps that may just be a byproduct of random noise.  I definitely prefer the first histogram to this one.

I am guessing that an "optimal" bin width is between 1 and 2.

If you use the hist command without a breaks option, it will use Sturges' rule to choose a reasonable number of bins for this particular sample size.  This method is similar to the one described in the Fathom lab.

I tried it.  Surprisingly (really), it turns out that R used the same breaks as I did in the first histogram.  
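If you are curious, here is a sketch of that check; nclass.Sturges is the base R helper that the default hist call uses to suggest a number of bins:

hist(ohio$Max.1.Day)             # no breaks option -- Sturges' rule
nclass.Sturges(ohio$Max.1.Day)   # suggested number of classes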

Wrap-up:  How does this discussion relate to the Fathom homework?  In Fathom, it is very easy to play with the bin width and see the effect of the choice of bins on the histogram.  In the famous children's tale "Goldilocks and the Three Bears", remember that Goldilocks tried several bowls of porridge -- one bowl was "too hot", a second bowl was "too cold", and a third bowl was "just right".  Similarly, you'll get bad histograms when you choose too few bins or too many bins, and there will be a middle number of bins that is "just right".  By playing with different histograms, you will decide (by eye) what choice of bin width is "just right", and then you will see if your choice of best bin width corresponds to the bin choice given by the optimal histogram rule.

By the way, how does the "optimal rule" 

  bin width = 2 (fourth spread) / n^(1/3)

perform here?  Here the quartiles are 11.25 and 15, so the fourth spread is 15 - 11.25 = 3.75, and the optimal bin width is 2 (3.75)/179^(1/3) = 1.33, which is a bit smaller than the bin width of 2 used in the hist command.
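Here is that computation in R (a sketch; IQR returns the distance between the quartiles, which plays the role of the fourth spread here):

n=length(ohio$Max.1.Day)
2*IQR(ohio$Max.1.Day)/n^(1/3)    # about 1.33 when n = 179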

Final thought -- "6 inches" is really not a lot of snow, although it seemed to cripple the city where I live.


Wednesday, January 21, 2009

President Ages

As you know, yesterday was a big day in the history of the US as Barack Obama became our 44th president.  In honor of that event, I collected the ages at inauguration and lifespans of all 44 presidents and read this dataset into R.

data=read.table("president.ages.txt",header=T,sep="\t")

I attach the data frame to make the variables visible.

attach(data)

Then, using the stem.leaf function in the aplpack package, I construct a stemplot of the presidents' ages:

> stem.leaf(Age)
1 | 2: represents 12
 leaf unit: 1
            n: 44
   2     t | 23
         f | 
   6     s | 6677
   9    4. | 899
  15    5* | 011111
  17     t | 22
  (9)    f | 444445555
  18     s | 6667777
  11    5. | 8
  10    6* | 0111
   6     t | 2
   5     f | 445
         s | 
   2    6. | 89

What do I see?  Here are some things I notice:

  • The ages at inauguration look pretty symmetric, centered about 54 or 55.
  • President ages range from 42 to 69.  (Recall that the Constitution requires a president to be at least 35 years old, so 42 is not far above that minimum.)
  • Barack Obama, at 48, is one of the youngest presidents in history.

I didn't play with the options for the stemplot -- I just used the default settings in the stem.leaf function.  Could I produce a better stemplot?

Let's break between the tens and ones with five leaves per stem.

> stem.leaf(Age,unit=1,m=2)
1 | 2: represents 12
 leaf unit: 1
            n: 44
    2    4* | 23
    9    4. | 6677899
   22    5* | 0111112244444
  (12)   5. | 555566677778
   10    6* | 0111244
    3    6. | 589

Or I could go the other way and split between the ones and tenths with 10 leaves per stem.

> stem.leaf(Age,unit=.1,m=1)
1 | 2: represents 1.2
 leaf unit: 0.1
            n: 44
   1    42 | 0
   2    43 | 0
        44 | 
        45 | 
   4    46 | 00
   6    47 | 00
   7    48 | 0
   9    49 | 00
  10    50 | 0
  15    51 | 00000
  17    52 | 00
        53 | 
  22    54 | 00000
  (4)   55 | 0000
  18    56 | 000
  15    57 | 0000
  11    58 | 0
        59 | 
  10    60 | 0
   9    61 | 000
   6    62 | 0
        63 | 
   5    64 | 00
   3    65 | 0
HI: 68 69

I think it is pretty obvious that the first stemplot is the best.   If I have too few lines, then I lose some of the structure of the distribution; with too many lines, I don't see any structure at all.




Wednesday, January 14, 2009

Learning R by Viewing Movies

As you are learning R, you should try out the movies that I have made that are posted at

http://bayes.bgsu.edu/eda

(I call them R Encounters.) You'll see movies that show

-- how to create a dataset and read it into R
-- how to install the LearnEDA package in R
-- how to run a simple data analysis in R

I'm assuming that most of you are working in Windows. Let me know if you work on a Macintosh -- things work a little differently in R on a Mac.

Students have struggled in the past with creating data files and reading them into R -- let me know if you are experiencing problems with this.

Also, let me know if you like the R movies. They are easy to make and I can make more of them if you find them useful.

Tuesday, January 13, 2009

Welcome to Exploratory Data Analysis

Hi EDA students.

I will be using this blog to provide advice on using R, give you new examples, and offer help on the homework.

This week, you'll be learning R and Fathom and creating a couple of datasets that you'll be reading into R.

I have provided help on R on my website http://bayes.bgsu.edu/eda

You'll find guidance on installing R and the LearnEDA package, and several R lessons that you can work through.  Also, I made some movies that illustrate some basic R stuff.

I know that learning R can be difficult, so let me know if you have specific concerns.