Exploratory Data Analysis: How to Bin?

One of you asked me a question about the Fathom assignment on constructing histograms. Let me describe the issues by considering a histogram for a new dataset.

Today I was thinking about snow and wondering if 6 inches is really a lot of snowfall. So I thought I would look for some data showing levels of snowfall for different locations in the United States.

At the National Climatic Data Center site http://www.ncdc.noaa.gov/ussc/USSCAppController?action=map, one can download a "1.1 MB ASCII text file containing the maximum 1-day, 2-day, and 3-day snowfall for all available stations in the Lower 48 States and Alaska based on data through December 2006." This sounded interesting so I downloaded this data, created a text datafile, and then read this into R. (I had to do some data cleaning -- a little tedious.)

data=read.table("snowclim-fema-sep2006.txt",header=T,sep="\t")

This dataset contains the maximum 1-day snowfall, the maximum 2-day snowfall, the maximum 3-day snowfall for 9087 weather stations in 49 states in the U.S.

Let's focus on the max 1-day snowfall for all of the stations in Ohio -- remember, I wanted to figure out the extreme nature of 6 inches.

Using the subset command, I create a new data frame ohio that contains the data for the Ohio stations.

ohio=subset(data,data$State=="OH")

The variable of interest is called Max.1.Day.

Let's say that I wish to construct a histogram by hand -- that is, choose the bins by hand.

First, I use the summary command to figure out the extremes. So the variable falls betweeen 6.5 and 21.

Min. 1st Qu. Median Mean 3rd Qu. Max.

6.50 11.25 13.00 13.28 15.00 21.00

Let's try using 7 bins (why not?). The range of the data is 21-6.50 = 14.5. If I divide this range by 7, I get 14,5/7 = 2.07 which is approximately 2 (a prettier number than 2.07).

So let's try using the bins

(6, 8), (8, 10), (10, 12), (12, 14), (14, 16), (16, 18), (18, 20), (20, 22)

(Actually this is 8 bins, but that's ok -- it is close to 7 bins.)

In R, I create a vector that gives these bin endpoints -- we call these my.breaks.

my.breaks=seq(6,22,2)

I use the hist command with the breaks option. Also I want to show "6 inches" on the graph. I do this by adding a vertical line to the graph (abline command) and label it by the text command.

hist(ohio$Max.1.Day,breaks=my.breaks)

abline(v=6,lwd=3,col="red")

text(8,45,"6 inches",col="red")

Is this a good looking histogram? In other words, did I make a reasonable choice of bins?

This is a pretty smooth and I have a clear idea of the shape of the data (a little right-skewed). But maybe I need more bins to see finer structure in the histogram.

Let's try using bins of length 1.

my.breaks=seq(6,22,1)

hist(ohio$Max.1.Day,breaks=my.breaks)

abline(v=6,lwd=3,col="red")

text(8,30,"6 inches",col="red")

Well, I think we went too far. The histogram is less smooth -- I see some bumps in the histogram that may be a byproduct of random noise. I definitely prefer the first histogram to this one.

I am guessing that an "optimal" bin width is between 1 and 2.

If you use the hist command without a breaks option, it will use the "Sturges" method to determine the optimal bin width for this particular sample size. This method is similar to the one described in the Fathom lab.

I tried it. Suprisingly (really), it turns out that R used the same breaks as I did in the first histogram.

Wrap-up: How does this discussion relate to the Fathom homework? In Fathom, it is very easy to play with the bin width and see the effect of the choice of bins on the histograms. In the famous kid's tale "Goldilocks and the Three Bears", remember that Goldilocks tried several bowls of porridge -- one bowl was "too hot", a second bowl was "too cold", and a third bowl was "just right". Similarly, you'll get bad histograms when you choose too few bins or too many bins, and there will be a middle number of bins that will be "just right". By playing with different histograms, you will decide (by eye) what choice of bin width is "just right", and then you will see if your choice of best bin width corresponds to the bin choice using the optimal histogram rule.

By the way, how does the "optimal rule"

bin width = 2 (fourth spread) / n^(1/3)

perform here? Here the quartiles are 11.25 and 15, the fourth spread is 15-11.25 = 3.75, so the optimal bin width is 2 (3.75)/179^(1/3) = 1,33 which is a bit smaller than the bin width of 2 used in the hist command.

Final thought -- "6 inches" is really not a lot of snow, although it seemed to cripple the city where I live.

Wednesday, January 28, 2009

How to Bin?

1 comment:

Blog Archive

About Me