Exploratory Data Analysis: March 2009

Wednesday, March 25, 2009

Looking at Hits on my Web Page

The next topic is two-way tables. To illustrate the application of median polish, I took some convenient data, that is, a data that was readily available to me.

A couple of years ago, I wrote a book on Bayesian computing using R. I have a website that gives resource material for the book and I use Google Analytics to monitor hits on this particular website. Each day I observe the number and location of hits; it is interesting data partly since it seems that statisticians from many countries are interested in my book.

Anyway, here is the data -- a number in the table represents the number of hits for a particular day of the week for a particular week.

Week 1 Week 2 Week 3 Week 4
Sunday 22 12 17 15
Monday 23 15 27 17
Tuesday 17 26 21 14
Wednesday 26 13 18 18
Thursday 24 27 28 13
Friday 28 17 17 19
Saturday 14 11 13 13

I am interested in how the website hits vary across days of the week and also how the website hits vary across weeks. I can explore these patterns by means of an additive fit that I do by the median polish algorithm.

Since the data is stored as a matrix, a median polish is done by the medpolish function:

fit=medpolish(web.hits)

The variable "fit" stores the output of medpolish. Let's look at each component of medpolish.

> fit$overall
[1] 18

This tells that the average number of hits (per day) on my website was 18.

> fit$row
Sunday Monday Tuesday Wednesday Thursday Friday
-2.00 0.50 -0.50 0.00 5.50 1.75
Saturday
-5.25

These are the row effects. For Sunday, the row effect is -2 -- this means that on this day, the number of hits tends to be 2 smaller than average. Comparing Sunday and Monday, there tends to be 0.50 - (-2.00) = 2.5 more hits on Monday.

> fit$col
Week 1 Week 2 Week 3 Week 4
4.50 -2.75 1.00 -1.00

These are the column effects. It looks like my website hits across weeks where HIGH, LOW, high, low. On average, there were 4.50 - (2.75) = 7.25 more hits on Week 1 than Week 2.

The remaining component in the additive fit are the residuals. These tell us how the hit values deviate from the fitted values (from the additive model).

> fit$residuals
Week 1 Week 2 Week 3 Week 4
Sunday 1.50 -1.25 0.00 0.00
Monday 0.00 -0.75 7.50 -0.50
Tuesday -5.00 11.25 2.50 -2.50
Wednesday 3.50 -2.25 -1.00 1.00
Thursday -4.00 6.25 3.50 -9.50
Friday 3.75 0.00 -3.75 0.25
Saturday -3.25 1.00 -0.75 1.25

If the residual values are generally small (small compared to the row and column effects), then the additive model is a good description of the patterns in the data. Actually, the residuals look large to me, so I'm not sure I'd get that excited about this additive fit. Specifically, the residual for Tuesday, Week 2 is 11.25 -- for some reason, this particular day had many hits -- many more than one would expect based on its day of the week and week number.

Tuesday, March 17, 2009

Smoothing on R

One of you asked how to produce a "3RSSH, twice" smooth on R. It seems that my R notes on the web could be clarified. Here I illustrate a simple function to do the smooth that we want.

Here is a new function that you can use called smooth.3RSSH.twice.

smooth.3RSSH.twice=function(data)

{

SMOOTH=han(smooth(attend,kind="3RSS")) # 3RSSH smooth

ROUGH=data-SMOOTH # computes the rough

SMOOTH+han(smooth(ROUGH,kind="3RSS")) # twicing operation

}

This program does three things:

1. First, one computes a 3RSSH smooth using the smooth command in R and the han function from the LearnEDA package.

2. Then one computes the rough (the residuals) from this smooth.

3. Then one smooths the rough (using the same 3RSSH smooth) and adds this "smoothed rough" to the first smooth to get a "twiced smooth".

Here is an illustration of how it works for the Braves attendance data from the notes.

(I am assuming the above function has been read into R.)

library(LearnEDA)

data(brave.at)

attach(brave.at)

plot(game,attend)

the.smooth=smooth.3RSSH.twice(attend)

lines(game,the.smooth,lwd=3,col="green")

Monday, March 16, 2009

Smoothing Free Throw Percentages

I hope you had a nice spring break. My son's tennis team is flying to Florida this week -- I wish I could join him.

Since March Madness is starting, I thought it would be appropriate to talk about basketball data. There was an interesting article about free-throw shooting that recently appeared in the New York Times. See the article here.

The main message was that free-throw shooting in professional basketball has hovered about 75% for many years. Unlike other athletic performances such as running and swimming, basketball players don't seem to be getting better in shooting free throws.

Is that really true? Has free-throw shooting accuracy remained constant for all of the years of professional basketball?

To answer this question, I collected some data. For each of the seasons 1949-50 through the current season 2008-09, I collected the overall free-throw shooting percentage in the NBA.

Here are the shooting percentages graphed as a function of year.

Actually, although the overall free-throw shooting percentage is approximately 75%, there seems to some interesting patterns in the graph.

To better see the patterns, I use the command

han(smooth(FTP,kind="3RSS"))

to superimpose a 3RSSH smooth on the graph and I get the following:

What patterns do we see?

1. In the early days between 1950-1970, the shooting percentages were relatively low with a valley around 72% in the late 1960's.

2. The shooting percentages increased through the 1970's, had a small valley and hit a peak of about 76% in 1990.

3. Then the percentages decreased again and had a local minimum of 74% around 1995.

4. In recent years, the percentages are increasing. It is interesting that the current free-throw shooting percentage 77.2 is the highest in NBA history.

So in reality, the shooting percentage has not stayed flat across years. But it is surprising that NBA players haven't learned to shoot free throws better in the last 60 years.

Monday, March 2, 2009

Fitting a Line by Eye

You did fine on the latest Fathom "fitting line" homework. But I sensed a little confusion and I should make a few comments about fitting a line by eye.

Let's return to that homework problem where you are plotting the OBP of the baseball players for two consecutive years.

Here is a plot that many of you produced.

Is this a good graph? Actually, NO since there is too much white space around the points.

You can improve this in Fathom by using the hand tool to fill up the space. Here is a better plot.

Second, it is not easy to fit a movable line. To make this process easier, plot the movable line and then add a residual graph.

A good line will remove any tilt pattern in the residual graph. I still have a downward tilt in this graph - this suggests I have to try a little harder.

This looks better -- I don't see much of an increasing or decreasing pattern in the residuals.

Does my best fit correspond to a least-squares or resistant fit? (By the way, the resistant line is called a median-median line in Fathom.) I show all three lines below. Least-squares is blue, median-median is purple, and my line is brown.

It looks like my line is closer to the resistant fit.

Generally, when one has outliers, I would anticipate that my line would be closer to the resistant median-median line.