Wednesday, April 8, 2009

Binning Baseball Ages

To motivate the issues on the next EDA topic, I collected the ages of all 651 pitchers who played major league baseball during the 2008 season.

I have a vector age that contains the ages in years for these pitchers.

A histogram is a standard way of graphing this data.  Before I graph, I should select reasonable bins; I could use the default bins chosen by the R hist command, but I typically like more control over my graphical displays.

Here I'm interested in the number of players at each possible age 20, 21, 22, and so on.  So I choose cutpoints 19.5, 20.5, ..., 46.5 that cover the range of the data; since the ages are whole numbers, no observation can fall on a bin boundary.

cutpoints=seq(19.5,46.5)

Now I can call the hist function with the optional breaks argument.

hist(age,breaks=cutpoints)

[Figure: histogram of the ages of the 651 MLB pitchers, using the bins above]
What do I see in this display?
  • The shape of the data looks a bit right-skewed.  I'm a little surprised about the number of pitchers who are 40 or older.
  • The most popular ages are 25 and 27 among MLB pitchers.
  • Looking more carefully, it might seem a little odd that we have 79 pitchers of ages 25 and 27, but only 66 pitchers who are age 26.
What is causing this odd behavior in the frequencies for popular ages?  We don't see this behavior for the bins with small counts.

Actually, this "odd behavior" is just an implication of the basic EDA idea that

LARGE COUNTS HAVE LARGER VARIABILITY THAN SMALL COUNTS

So we typically will see this type of behavior whenever we construct a histogram.

When we plot a histogram, it would seem desirable to remove this "variability problem" so it is easier to make comparisons.  For example, when we compare the observed counts to the expected counts under a Gaussian model, it will be hard to compare residuals across bins with large counts and bins with small counts because of this unequal variability.

This discussion motivates the construction of a rootogram and eventually a suspended rootogram to make comparisons with a symmetric curve.
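If you're curious, here is a minimal base-R sketch of the rootogram idea (not necessarily the function we'll use in the notes) -- it just plots the square roots of the bin counts:

h=hist(age,breaks=cutpoints,plot=FALSE)   # bin the ages without drawing the histogram
plot(h$mids,sqrt(h$counts),type="h",lwd=3,
     xlab="Age",ylab="Root Count")        # root counts have roughly equal variability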

By the way, can we fit a Gaussian curve to our histogram?  In R, we

  • first plot the histogram using the freq=FALSE option -- the vertical scale will be DENSITY rather than COUNTS
  • use the curve command to add a normal curve where the mean and standard deviation are found by the mean and sd of the ages
Here are the commands and the resulting graph.

hist(age,breaks=cutpoints,freq=FALSE)
curve(dnorm(x,mean(age),sd(age)),add=TRUE,col="red",lwd=3)

[Figure: density-scale histogram of pitcher ages with the fitted Gaussian curve superimposed in red]
It should be clear that a Gaussian curve is not a good model for baseball ages.

Wednesday, March 25, 2009

Looking at Hits on my Web Page

The next topic is two-way tables. To illustrate the application of median polish, I took some convenient data, that is, a data that was readily available to me.

A couple of years ago, I wrote a book on Bayesian computing using R. I have a website that gives resource material for the book and I use Google Analytics to monitor hits on this particular website. Each day I observe the number and location of hits; it is interesting data partly since it seems that statisticians from many countries are interested in my book.

Anyway, here is the data -- a number in the table represents the number of hits for a particular day of the week for a particular week.

           Week 1  Week 2  Week 3  Week 4
Sunday         22      12      17      15
Monday         23      15      27      17
Tuesday        17      26      21      14
Wednesday      26      13      18      18
Thursday       24      27      28      13
Friday         28      17      17      19
Saturday       14      11      13      13
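(If you want to follow along in R, here is one way the table could be entered as a matrix -- I'm calling it web.hits to match the commands below; the original data file isn't posted.)

hits=c(22,12,17,15, 23,15,27,17, 17,26,21,14, 26,13,18,18,
       24,27,28,13, 28,17,17,19, 14,11,13,13)
web.hits=matrix(hits,nrow=7,byrow=TRUE,
  dimnames=list(c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"),
                c("Week 1","Week 2","Week 3","Week 4")))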

I am interested in how the website hits vary across days of the week and also how the website hits vary across weeks. I can explore these patterns by means of an additive fit that I do by the median polish algorithm.

Since the data is stored as a matrix, a median polish is done by the medpolish function:

fit=medpolish(web.hits)

The variable "fit" stores the output of medpolish. Let's look at each component of medpolish.

> fit$overall
[1] 18

This tells us that the typical number of hits per day on my website was 18.

> fit$row
Sunday Monday Tuesday Wednesday Thursday Friday
-2.00 0.50 -0.50 0.00 5.50 1.75
Saturday
-5.25

These are the row effects. For Sunday, the row effect is -2 -- this means that on this day, the number of hits tends to be 2 smaller than average. Comparing Sunday and Monday, there tends to be 0.50 - (-2.00) = 2.5 more hits on Monday.

> fit$col
Week 1 Week 2 Week 3 Week 4
4.50 -2.75 1.00 -1.00

These are the column effects. It looks like my website hits across the four weeks were HIGH, LOW, high, low. On average, there were 4.50 - (-2.75) = 7.25 more hits in Week 1 than in Week 2.

The remaining component of the additive fit is the set of residuals. These tell us how the hit values deviate from the fitted values (from the additive model).

> fit$residuals
Week 1 Week 2 Week 3 Week 4
Sunday 1.50 -1.25 0.00 0.00
Monday 0.00 -0.75 7.50 -0.50
Tuesday -5.00 11.25 2.50 -2.50
Wednesday 3.50 -2.25 -1.00 1.00
Thursday -4.00 6.25 3.50 -9.50
Friday 3.75 0.00 -3.75 0.25
Saturday -3.25 1.00 -0.75 1.25

If the residual values are generally small (small compared to the row and column effects), then the additive model is a good description of the patterns in the data. Actually, the residuals look large to me, so I'm not sure I'd get that excited about this additive fit. Specifically, the residual for Tuesday, Week 2 is 11.25 -- for some reason, this particular day had many hits -- many more than one would expect based on its day of the week and week number.
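As a quick check on how the pieces fit together: the fitted value for any cell is overall + row effect + column effect, and the residual is the observed count minus this fitted value.  For the Tuesday, Week 2 cell:

fit.value=fit$overall+fit$row["Tuesday"]+fit$col["Week 2"]  # 18 + (-0.50) + (-2.75) = 14.75
web.hits["Tuesday","Week 2"]-fit.value                      # 26 - 14.75 = 11.25, the residual above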

Tuesday, March 17, 2009

Smoothing on R


One of you asked how to produce a "3RSSH, twice" smooth in R.  It seems that my R notes on the web could be clearer on this point.  Here I illustrate a simple function to do the smooth that we want.

Here is a new function that you can use called smooth.3RSSH.twice.

smooth.3RSSH.twice=function(data)
{
SMOOTH=han(smooth(data,kind="3RSS"))   # 3RSSH smooth of the data
ROUGH=data-SMOOTH                      # computes the rough (residuals)
SMOOTH+han(smooth(ROUGH,kind="3RSS"))  # twicing: add the smoothed rough to the smooth
}

This program does three things:

1.  First, one computes a 3RSSH smooth using the smooth command in R and the han function from the LearnEDA package.

2.  Then one computes the rough (the residuals) from this smooth.

3.  Then one smooths the rough (using the same 3RSSH smooth) and adds this "smoothed rough" to the first smooth to get a "twiced smooth".

Here is an illustration of how it works for the Braves attendance data from the notes.
(I am assuming the above function has been read into R.)

library(LearnEDA)
data(brave.at)
attach(brave.at)
plot(game,attend)
the.smooth=smooth.3RSSH.twice(attend)
lines(game,the.smooth,lwd=3,col="green")


Monday, March 16, 2009

Smoothing Free Throw Percentages

I hope you had a nice spring break. My son's tennis team is flying to Florida this week -- I wish I could join him.

Since March Madness is starting, I thought it would be appropriate to talk about basketball data. There was an interesting article about free-throw shooting that recently appeared in the New York Times. See the article here.

The main message was that free-throw shooting in professional basketball has hovered about 75% for many years. Unlike other athletic performances such as running and swimming, basketball players don't seem to be getting better in shooting free throws.

Is that really true? Has free-throw shooting accuracy remained constant for all of the years of professional basketball?

To answer this question, I collected some data. For each of the seasons 1949-50 through the current season 2008-09, I collected the overall free-throw shooting percentage in the NBA.

Here are the shooting percentages graphed as a function of year.

[Figure: scatterplot of NBA free-throw shooting percentage against season, 1949-50 through 2008-09]
Actually, although the overall free-throw shooting percentage is approximately 75%, there seem to be some interesting patterns in the graph.

To better see the patterns, I use the command

han(smooth(FTP,kind="3RSS"))

to superimpose a 3RSSH smooth on the graph and I get the following:

[Figure: the same scatterplot with the 3RSSH smooth superimposed]
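In case you want to reproduce this kind of picture with your own data, here is a sketch of the commands behind it.  I'm assuming the seasons and percentages sit in vectors that I'll call year and FTP (the names are mine).

plot(year,FTP,xlab="Season",ylab="Free Throw Percentage")
the.smooth=han(smooth(FTP,kind="3RSS"))   # the 3RSSH smooth
lines(year,the.smooth,lwd=3,col="red")    # superimpose the smooth on the scatterplot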
What patterns do we see?

1. In the early days between 1950-1970, the shooting percentages were relatively low with a valley around 72% in the late 1960's.

2. The shooting percentages increased through the 1970's, had a small valley and hit a peak of about 76% in 1990.

3. Then the percentages decreased again and had a local minimum of 74% around 1995.

4. In recent years, the percentages are increasing. It is interesting that the current free-throw shooting percentage 77.2 is the highest in NBA history.

So in reality, the shooting percentage has not stayed flat across years. But it is surprising that NBA players haven't learned to shoot free throws better in the last 60 years.

Monday, March 2, 2009

Fitting a Line by Eye

You did fine on the latest Fathom "fitting line" homework. But I sensed a little confusion and I should make a few comments about fitting a line by eye.

Let's return to that homework problem where you are plotting the OBP of the baseball players for two consecutive years.

Here is a plot that many of you produced.

[Figure: Fathom scatterplot of the two years of OBP values, with much white space around the points]
Is this a good graph? Actually, NO -- there is too much white space around the points.

You can improve this in Fathom by using the hand tool to fill up the space. Here is a better plot.

[Figure: the same scatterplot rescaled with the hand tool to fill the plotting window]
Second, it is not easy to fit a movable line. To make this process easier, plot the movable line and then add a residual graph.

[Figure: scatterplot with a movable line and the corresponding residual graph, showing a downward tilt]
A good line will remove any tilt pattern in the residual graph. I still have a downward tilt in this graph - this suggests I have to try a little harder.

[Figure: adjusted movable line with a residual graph showing no clear tilt]
This looks better -- I don't see much of an increasing or decreasing pattern in the residuals.

Does my best fit correspond to a least-squares or resistant fit? (By the way, the resistant line is called a median-median line in Fathom.) I show all three lines below. Least-squares is blue, median-median is purple, and my line is brown.

[Figure: scatterplot with the least-squares line (blue), the median-median line (purple), and my eye-fitted line (brown)]
It looks like my line is closer to the resistant fit.

Generally, when one has outliers, I would anticipate that my line would be closer to the resistant median-median line.

Friday, February 27, 2009

Graphical User Interface for R

Some of you seem to be struggling with R's command-line interface, where you type commands and have to know the names of the R functions.

There is an attractive package called R Commander which provides a menu interface for R. You work in a special window and most of the useful R functions are available as options in menus.

On the left, you see a snapshot of R Commander (on a Macintosh).

How do you get R Commander? It is easy -- you just install the package Rcmdr from CRAN. This package uses a number of other R packages. If you don't have these other packages installed yet, then R will automatically install these (it takes a little while -- be patient).

One of my former students who teaches at Youngstown State uses R Commander in his statistics classes. He says "Rcmdr is really cool, and the students eat it up."

Anyway, you might want to try this. You can run the functions in LearnEDA by first loading it and typing commands (like lval) in the top window.

Let me know if you find this helpful, since I haven't done much with it.

Thursday, February 26, 2009

Plotting and Straightening on R

To encourage you to use R for next week's assignment, I put together a new movie showing R in action.

As you probably know, one of the fastest growing companies in the U.S. is Google.  I found an interesting graph showing Google's growth (as measured by the number of employees):

[Figure: graph of the number of Google employees over a two-year period]
Obviously there has been a substantial increase in Google employees over this two-year period.  But we wish to say more.  We'd like to describe the size of this growth.  Also, we'd like to look at the residuals, which may reveal interesting patterns beyond the obvious growth.

Here's the plan.

1.  We start with plotting the data, fitting a resistant line, and looking at the residuals.

2.  By looking at the half-slope ratio and the residuals, we decide if the graph is straight.

3.  If we see curvature in the graph, then we try to reexpress one or more of the variables (by a power transformation) to straighten the graph.  We use half-slope ratios and residual plots to see if we are successful in straightening the graph.

4.  Once the graph is straight, then we summarize the fit (interpret the slope of the resistant line) and examine the residuals.

The key function in the LearnEDA package is rline.
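Here is a rough sketch of what steps 1 and 2 might look like in R.  I'm guessing at the column names in google.txt and at the exact components that rline returns, so check the data file and the rline help page before running this.

library(LearnEDA)
google=read.table("http://bayes.bgsu.edu/eda/data/google.txt",header=TRUE)
# suppose the columns are named time and employees (check the file!)
plot(google$time,google$employees)           # step 1: plot the data
fit=rline(google$time,google$employees)      # resistant line -- see ?rline for its output
plot(google$time,fit$residual)               # step 2: residual plot (component name may differ)
abline(h=0)
# step 3 (if the plot is curved): reexpress, e.g. plot(google$time,log(google$employees))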

I illustrate using rline in my new movie 


The dataset can be found at http://bayes.bgsu.edu/eda/data/google.txt 
and my script of R commands for this analysis can be found at

By the way, I'm using a new version of the LearnEDA package that you can download at


Hope this is helpful in your work next week.


Wednesday, February 25, 2009

Sexy Jobs

Since many of you will be thinking about jobs soon, here is an interesting posting about desirable jobs. Many of you are in the right area!

In a recent talk by Google's chief economist Hal Varian, he says this:

"I keep saying the sexy job in the next ten years will be statisticians.
People think I’m joking, but who would’ve guessed that computer engineers
would’ve been the sexy job of the 1990s? . . .
The ability to take data—to be able to understand it, to process it,
to extract value from it, to visualize it, to communicate it—that’s going to
be a hugely important skill in the next decades, not only at the
professional level but even at the educational level for elementary school
kids, for high school kids, for college kids. Because now we really do have
essentially free and ubiquitous data. So the complimentary scarce factor is
the ability to understand that data and extract value from it.
I think statisticians are part of it, but it’s just a part. You also
want to be able to visualize the data, communicate the data, and utilize it
effectively. But I do think those skills—of being able to access,
understand, and communicate the insights you get from data analysis—are
going to be extremely important. Managers need to be able to access and
understand the data themselves. "

The full article is at

http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286

Monday, February 23, 2009

Reducing Skewness

Do you want to see me quickly reduce skewness?

skewness

More seriously, here are some comments after looking at your homework (the grades are posted and you did generally well on this homework).

1. Why is symmetry important?

Some of you may be wondering why it is useful to make datasets symmetric. I don't think making a dataset symmetric is as important as making datasets have equal spreads, but here are a couple of reasons why symmetry is helpful.

-- Symmetric datasets are simpler to summarize. There is an obvious "average". Also if the data is bell-shaped, one can use the empirical rule (2/3 of the data fall within one standard deviation of the mean).

-- Many statistical procedures like ANOVA assume normally distributed data. One might wish to reexpress data before applying these procedures.

2. Monotone increasing reexpressions.

-- In our definition of power transformations, recall that when p is negative, we consider

"minus data raised to a power"

We did that so that all of the transformations are monotone increasing. So when you have a right-skewed dataset and you move from p=1 to p=0 to p=-1, you should be moving toward symmetry and left-skewness.
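Here is a little function that spells out this matched family of power transformations (just a sketch -- logs at p = 0, and minus the power for negative p):

power.tran=function(x,p)
{
  if (p > 0) return(x^p)      # e.g. p = 0.5 gives roots
  if (p == 0) return(log(x))  # p = 0 gives logs
  return(-x^p)                # p < 0: minus the data raised to the power
}

With this matched family, every member is monotone increasing, so moving down the ladder always pulls in the right tail.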

3. Skewness in the middle and skewness in the tails.

Sometimes it is tough to make a dataset symmetric, since there will be different behavior in the middle half and in the tails. You can detect this with a symmetry plot. The points to the left (from the middle of the data) may be close to the line while the points to the right (from the tails) fall off the line. This indicates that the middle of the data is symmetric, but there is a long tail to the right (or the left).
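If you want to draw your own, here is a bare-bones symmetry plot in base R (a sketch; the version in the notes may differ in the details):

symmetry.plot=function(x)
{
  x=sort(x)
  n=length(x)
  m=median(x)
  k=floor(n/2)
  u=x[n+1-(1:k)]-m    # distances of the upper half above the median
  v=m-x[1:k]          # distances of the lower half below the median
  plot(v,u,xlab="distance below median",ylab="distance above median")
  abline(0,1)         # points above the line suggest a longer right tail
}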

4. When does reexpression work?

You need sufficient spread in the data as measured by HI/LO. If this ratio is not much different from 1, reexpressions won't help much.

5. Are some of you "R resistant"?

Many of you seem to prefer using Fathom or Minitab. That's okay, but the best way to get comfortable using R is to practice using it. I made the package LearnEDA to make R easier to use, but some of you aren't taking advantage of the special functions.

Monday, February 16, 2009

Last Homework and Dreamtowns


We've been talking about taking power reexpressions of data to achieve particular objectives, such as equalizing spreads between batches or making a batch symmetric.

When do these reexpressions work?  

1.  First, you need data with sufficient spread, as measured by the ratio HI/LO, for the power transformations to have much effect.

2.  You have to be taking a "significant" power.  In the last Fathom homework, most of you tried p = .82 (from a starting value of p = 1), which typically has little effect on the data.  Usually we try reexpressions in steps of 0.5, so we move from raw to roots to logs, and so on.

This week, we are reexpressing data to achieve approximate symmetry.  Here's an example.

Last summer, there was an interesting article posted on bizjournals.com that ranked 140 "dreamtowns" -- these towns offer refuge from big cities and congested traffic.  I got interested in the article since my town, Findlay, made the list.

Anyway, they collected a number of variables from each city, including the percentage of adults (25 and over) who hold college degrees.

Here's a histogram of these percentages from the 140 towns.

[Figure: histogram of the percentage of adults with college degrees in the 140 dreamtowns]
This looks right-skewed with one outlier (I never knew that Bozeman, Montana had a lot of highly educated people) and this is a good candidate for reexpression.

In the notes, I give several methods (plotting the mids, using a symmetry plot, using Hinkley's method) for choosing the "right" reexpression.

I'd suggest that you should try at least 2 of these methods in your homework.
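To get you started, here is a sketch of one of them -- Hinkley's quick method -- applied down the ladder of powers.  I'm assuming the 140 percentages are in a vector I call pct, and I'm measuring skewness by (mean - median) divided by the fourth spread; the notes may use a slightly different scale measure.

hinkley=function(x)
{
  q=quantile(x,c(.25,.75))
  (mean(x)-median(x))/(q[2]-q[1])   # near 0 when the batch is roughly symmetric
}
for (p in c(1,.5,0,-.5,-1))
{
  tx=if (p==0) log(pct) else if (p>0) pct^p else -pct^p   # matched reexpression
  cat("p =",p,"  Hinkley d =",round(hinkley(tx),3),"\n")
}

The power whose d is closest to zero is a reasonable candidate for the reexpression.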

Monday, February 9, 2009

Simple Statistical Comparisons

I just finished grading your "comparing batches" homework.  You did fine on the mechanics (computing the summaries, constructing a spread versus level graph, and deciding on an appropriate reexpression), but it wasn't clear that you understood WHY we are doing all of this work.

Let's review the main points of this section.

1.  We wish to compare two batches.

2.  What does compare mean?  Well, it could mean many things.  Batch 1 has a larger spread, batch 2 has three more outliers than batch 1, and so on.  You illustrated many types of comparisons in your homework writeups.

Here we wish to compare the general or average locations of the two batches.   A simple statistical comparison is something like

"batch 2 is 10 units larger than batch 1"

Maybe we should call this an SSC (statisticians love to use acronyms).

What this could mean is that if I added 10 units to each value in batch 1, then I would get a new dataset that resembles batch 2.

3.  Is it always appropriate to make a SSC?

No.

It won't work if the two batches have unequal spreads.  If they have unequal spreads, then adding a number to batch 1 will NOT resemble batch 2.

4.  So if the batches have different spreads, we give up?

No.

It is possible that we can reexpress the batches to a new scale, so that the reexpressed batches have approximately equal spreads.

5.  So the plan is to (1) try to find a suitable reexpression and (2) do an SSC on the reexpressed data.

The snowfall data example using Fathom is one case where our strategy works.  But generally, you did a poor job of making an SSC on the reexpressed data.
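To make step (2) concrete, here is a tiny sketch.  Suppose batch1 and batch2 (hypothetical names) are the two batches after a suitable reexpression, so their spreads are roughly equal; an SSC is then just a difference in typical values:

shift=median(batch2)-median(batch1)       # "batch 2 is `shift' units larger than batch 1"
boxplot(list(batch1+shift,batch2),
  names=c("batch 1 + shift","batch 2"))   # the shifted batch 1 should now resemble batch 2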




Sunday, February 1, 2009

A Statistical Salute to the Steelers

I just finished watching the Super Bowl and I'm happy that the Steelers won.  The winning quarterback Ben Roethlisberger is from my home town (Findlay) and it was very exciting seeing him direct his team to the winning touchdown at the end of the game.

As I was watching this game, I wondered: 

"Which team has the better offense?"

To answer this question, I collected the yards gained for every single play by the two teams, the Steelers and the Cardinals, in this Super Bowl.  I created a datafile with two variables, Yards and Team, and read this data into R.

Using the R function

boxplot(Yards~Team,horizontal=T,xlab="Yards Gained",ylab="Team",col="gold")

I created the following boxplot.

[Figure: parallel horizontal boxplots of yards gained per play for the Cardinals and Steelers]
What do we see in this boxplot display?
  • Both batches of yards gained look a bit right-skewed.  
  • There are four outliers for the Steelers and two for the Cardinals.  These correspond to big plays for the teams that gained a lot of yards.
Why did I draw this boxplot display?  Several comments:
  • We wish to make a comparison between the two batches.
  • What is a comparison?  Well, we could say that one batch tends to be larger than the second batch.  For this example, if it were true, then I would say that the Cardinals gained more yards (per play) than the Steelers.
  • But that isn't really saying much.   When I say "make a comparison", this means that I want to say that one batch is a particular number larger or smaller than the second batch.
  • When can we say 
    "batch 1 is 10 larger than batch 2"?
  • As you'll read in the notes, we can only make this type of comparison "batch 1 is 10 larger than batch 2" when the two batches have similar spreads.
Returning to my football example, it doesn't appear that the two teams were very different with respect to yards gained per play.  Both teams had approximately the same median.
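A quick numerical check of that last point, using the Yards and Team variables from above:

tapply(Yards,Team,median)   # median yards gained per play for each team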

Saturday, January 31, 2009

Letter values by group

In the next topic, we are talking about comparing groups.  The first step is to construct some graph (such as a stemplot or boxplot or dotplot) across groups and the next step is to compute summaries (such as letter values) for each group.  Here are some comments about using R to do this stuff.

Let's return to the boston marathon data where there are two variables, time and age.  We are interested in comparing the completion times for different ages.

data(boston.marathon)
attach(boston.marathon)

By the way, this data frame is organized with two variables, response (here time) and
group (here age).

1.  GRAPHS

There is no simple way to construct parallel stemplots in R for different groups.  Parallel boxplots are easy to construct by typing, say, 

boxplot(time~age,ylab="Time",xlab="Group")

Also, it is easy to construct parallel dotplots by use of the stripchart function.

stripchart(time~age,ylab="Group")

Personally, I prefer dots that are solid black and are stacked:

stripchart(time~age,method="stack",pch=19,ylab="Group")

2.  SUMMARIES BY GROUP

If you apply the boxplot function with the plot=FALSE option, it will give you (almost) a five-number summary for each group.  But the output isn't very descriptive, so I wrote a simple "wrapper" function whose output is easier to follow.

lval.by.group=function(response,group)
{
B=boxplot(response~group, plot=FALSE, range=0)
S=as.matrix(B$stats)
dimnames(S)[[1]]=c("LO","QL","M","QH","HI")
dimnames(S)[[2]]=B$names
S
}

To use this in R, you just type the function into the R Console window.  Or if you store this function in a file called lval.by.group.R, then you read the function into R by typing

source("lval.by.group.R")

Anyway, let me apply this function for the marathon data.

lval.by.group(time,age)
    20  30  40  50  60
LO 150 194 163 222 219
QL 222 213 224 251 264
M  231 235 239 262 274
QH 240 259 262 281 279
HI 274 330 346 349 338
attr(,"class")
       20 
"integer" 

I think the output is pretty clear.

3.  WHAT IF YOUR DATA IS ORGANIZED AS (GROUP1, GROUP2, ...)?

Sometimes, it is convenient to read data in a matrix format, where the different groups are in different columns.  An example of this is the population densities dataset.

data(pop.densities.1920.2000)
pop.densities.1920.2000[1:4,]
  STATE y1920 y1960 y1980 y1990 y2000
1    AL  45.8  64.2  76.6  79.6  87.6
2    AK   0.1   0.4   0.7   1.0   1.1
3    AZ   2.9  11.5  23.9  32.3  45.2
4    AR  33.4  34.2  43.9  45.1  51.3

You see that the 1920 densities are in the 2nd column, the 1960 densities in the 3rd column, etc.

Anyway, you want to put this in the

[Response, Group]

format.  You can do this by the stack command.  The input is the matrix of data (with the first column removed) and the output is what you want.

d=stack(pop.densities.1920.2000[,-1])
d[1:5,]
  values   ind
1   45.8 y1920
2    0.1 y1920
3    2.9 y1920
4   33.4 y1920
5   22.0 y1920

Now we can compute the letter values by group (year) by typing

lval.by.group(d$values,d$ind)
     y1920   y1960   y1980   y1990   y2000
LO    0.10     0.4     0.7    1.00    1.10
QL   17.30    22.5    28.4   32.05   41.40
M    39.90    67.2    80.8   79.60   88.60
QH   62.35   114.6   157.7  181.65  202.85
HI 7292.90 12523.9 10132.3 9882.80 9378.00

Note that since I didn't attach the data frame d, I'm referring to the response variable as d$values and the grouping variable by d$ind.



Wednesday, January 28, 2009

How to Bin?

One of you asked me a question about the Fathom assignment on constructing histograms.  Let me describe the issues by considering a histogram for a new dataset.

Today I was thinking about snow and wondering if 6 inches is really a lot of snowfall.  So I thought I would look for some data showing levels of snowfall for different locations in the United States.

At the National Climatic Data Center site http://www.ncdc.noaa.gov/ussc/USSCAppController?action=map, one can download a "1.1 MB ASCII text file containing the maximum 1-day, 2-day, and 3-day snowfall for all available stations in the Lower 48 States and Alaska based on data through December 2006."  This sounded interesting so I downloaded this data, created a text datafile, and then read this into R.  (I had to do some data cleaning -- a little tedious.)

data=read.table("snowclim-fema-sep2006.txt",header=T,sep="\t")

This dataset contains the maximum 1-day snowfall, the maximum 2-day snowfall, and the maximum 3-day snowfall for 9087 weather stations in 49 states in the U.S.

Let's focus on the max 1-day snowfall for all of the stations in Ohio -- remember, I wanted to figure out the extreme nature of 6 inches.

Using the subset command, I create a new data frame ohio that contains the data for the Ohio stations.

ohio=subset(data,data$State=="OH")

The variable of interest is called Max.1.Day.

Let's say that I wish to construct a histogram by hand -- that is, choose the bins by hand. 

First, I use the summary command to figure out the extremes.  So the variable falls between 6.5 and 21.

  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.50   11.25   13.00   13.28   15.00   21.00 

Let's try using 7 bins (why not?).  The range of the data is 21 - 6.50 = 14.5.  If I divide this range by 7, I get 14.5/7 = 2.07, which is approximately 2 (a prettier number than 2.07).

So let's try using the bins

(6, 8),  (8, 10), (10, 12), (12, 14), (14, 16), (16, 18), (18, 20), (20, 22)

(Actually this is 8 bins, but that's ok -- it is close to 7 bins.)

In R, I create a vector that gives these bin endpoints -- we call these my.breaks.

my.breaks=seq(6,22,2)

I use the hist command with the breaks option.  Also I want to show "6 inches" on the graph.  I do this by adding a vertical line to the graph (the abline command) and labeling it with the text command.

hist(ohio$Max.1.Day,breaks=my.breaks)
abline(v=6,lwd=3,col="red")
text(8,45,"6 inches",col="red")
[Figure: histogram of Ohio maximum 1-day snowfalls with bin width 2; a red vertical line marks 6 inches]
Is this a good-looking histogram?  In other words, did I make a reasonable choice of bins?
This histogram is pretty smooth and I have a clear idea of the shape of the data (a little right-skewed).  But maybe I need more bins to see finer structure in the histogram.






Let's try using bins of length 1.

my.breaks=seq(6,22,1)
hist(ohio$Max.1.Day,breaks=my.breaks)
abline(v=6,lwd=3,col="red")
text(8,30,"6 inches",col="red")

[Figure: histogram of Ohio maximum 1-day snowfalls with bin width 1]
Well, I think we went too far.  The histogram is less smooth -- I see some bumps in the histogram that may be a byproduct of random noise.  I definitely prefer the first histogram to this one.







I am guessing that an "optimal" bin width is between 1 and 2.

If you use the hist command without a breaks option, it will use the "Sturges" rule to choose the number of bins (and hence the bin width) for this particular sample size.  This rule is similar to the one described in the Fathom lab.

I tried it.  Surprisingly (really), it turns out that R used the same breaks as I did in the first histogram.

Wrap-up:  How does this discussion relate to the Fathom homework?  In Fathom, it is very easy to play with the bin width and see the effect of the choice of bins on the histograms.  In the famous kid's tale "Goldilocks and the Three Bears", remember that Goldilocks tried several bowls of porridge -- one bowl was "too hot", a second bowl was "too cold", and a third bowl was "just right".  Similarly, you'll get bad histograms when you choose too few bins or too many bins, and there will be a middle number of bins that will be "just right".  By playing with different histograms, you will decide (by eye) what choice of bin width is "just right", and then you will see if your choice of best bin width corresponds to the bin choice using the optimal histogram rule.

By the way, how does the "optimal rule" 

  bin width = 2 (fourth spread) / n^(1/3)

perform here?  Here the quartiles are 11.25 and 15, so the fourth spread is 15 - 11.25 = 3.75, and the optimal bin width is 2(3.75)/179^(1/3) = 1.33, which is a bit smaller than the bin width of 2 used in the hist command.
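If you want R to do this arithmetic, here is a quick check (and the hist function can apply essentially the same rule for you through the breaks="FD" option):

fs=IQR(ohio$Max.1.Day)      # fourth spread (about 3.75 here)
n=length(ohio$Max.1.Day)    # number of Ohio stations
2*fs/n^(1/3)                # optimal bin width, about 1.3
# hist(ohio$Max.1.Day,breaks="FD")   # let R apply the Freedman-Diaconis rule directly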

Final thought -- "6 inches" is really not a lot of snow, although it seemed to cripple the city where I live.












Wednesday, January 21, 2009

President Ages

As you know, yesterday was a big day in the history of the US as Barack Obama became our 44th president.  In honor of that event, I collected the ages at inauguration and lifespans of all 44 presidents and read this dataset into R.

data=read.table("president.ages.txt",header=T,sep="\t")

I attach the dataframe to make the variables visible.

attach(data)

Then using the stem.leaf function in the aplpack package, I construct a stemplot of the president ages:

> stem.leaf(Age)
1 | 2: represents 12
 leaf unit: 1
            n: 44
   2     t | 23
         f | 
   6     s | 6677
   9    4. | 899
  15    5* | 011111
  17     t | 22
  (9)    f | 444445555
  18     s | 6667777
  11    5. | 8
  10    6* | 0111
   6     t | 2
   5     f | 445
         s | 
   2    6. | 89

What do I see?  Here are some things I notice:

  • The ages at inauguration look pretty symmetric, centered about 54 or 55.
  • President ages range from 42 to 69.  Actually, the Constitution requires a president to be at least 35 years old, so even the youngest presidents have been well above the minimum age.
  • Barack Obama, at 48, is one of the youngest presidents in history.
I didn't play with the options for stemplot -- I just used the default settings in the stem.leaf function.  Could I produce a better stemplot?

Let's break between the tens and ones with five leaves per stem.

> stem.leaf(Age,unit=1,m=2)
1 | 2: represents 12
 leaf unit: 1
            n: 44
    2    4* | 23
    9    4. | 6677899
   22    5* | 0111112244444
  (12)   5. | 555566677778
   10    6* | 0111244
    3    6. | 589

Or I could go the other way and split between the ones and tenths digits, with one line per stem (so each age gets its own line).

> stem.leaf(Age,unit=.1,m=1)
1 | 2: represents 1.2
 leaf unit: 0.1
            n: 44
   1    42 | 0
   2    43 | 0
        44 | 
        45 | 
   4    46 | 00
   6    47 | 00
   7    48 | 0
   9    49 | 00
  10    50 | 0
  15    51 | 00000
  17    52 | 00
        53 | 
  22    54 | 00000
  (4)   55 | 0000
  18    56 | 000
  15    57 | 0000
  11    58 | 0
        59 | 
  10    60 | 0
   9    61 | 000
   6    62 | 0
        63 | 
   5    64 | 00
   3    65 | 0
HI: 68 69

I think it is pretty obvious that the first stemplot is the best.   If I have too few lines, then I lose some of the structure of the distribution; with too many lines, I don't see any structure at all.




Wednesday, January 14, 2009

Learning R by Viewing Movies

As you are learning R, you should try out the movies that I have made that are posted at

http://bayes.bgsu.edu/eda

(I call them R Encounters.) You'll see movies that show

-- how to create a dataset and read it into R
-- how to install the LearnEDA package in R
-- how to carry out a simple data analysis in R

I'm assuming that most of you are working in Windows. Let me know if you work on a Macintosh -- R works a little differently on a Mac.

Students have struggled in the past with creating datafiles and reading them into R -- let me know if you are experiencing problems with this.

Also, let me know if you like the R movies. They are easy to make and I can make more of them if you find them useful.

Tuesday, January 13, 2009

Welcome to Exploratory Data Analysis

Hi EDA students.

I will be using this blog to provide advice on using R, give you new examples and help on the homework.

This week, you'll be learning R and Fathom and creating a couple of datasets that you'll be reading into R.

I have provided help on R on my website http://bayes.bgsu.edu/eda

You'll find guidance on installing R and the LearnEDA package, and several R lessons that you can work through.  Also I made some movies that illustrate some basic R stuff.

I know that learning R can be difficult, so let me know if you have specific concerns.