Exploratory Data Analysis: October 2008

Thursday, October 30, 2008

My Smoothing Tribute to the Phillies

Last night was an exciting night. My team, the Philadelphia Phillies, won the World Series for only the second time in a history that began in 1883.

So it seems appropriate to give a statistical tribute to the Phillies.

I collected the Phillies' winning percentage (percentage of games won) for each season in their history. Below I plot the winning percentage against the season year.

It is hard to see the general pattern in this graph, so it makes sense to smooth. I could use Tukey's resistant smooth described in the EDA notes, but I'll illustrate the use of an alternative smoothing method called lowess. Here is the R code to graph the scatterplot and overlay the lowess smooth.

plot(Year,Win.Pct,pch=19,col="red")

lines(lowess(Year,Win.Pct,f=.2),lwd=3,col="red")

title("PHILLIES WINNING PERCENTAGES",col.main="red")

abline(h=50,lwd=2)

I added the horizontal line at Win.Pct = 50 that corresponds to an average season (half of the games won).
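The f argument of lowess is the fraction of the data used in each local fit, so it controls how smooth the curve is. Here is a small sketch of that effect. The data are simulated for illustration only (they are not the Phillies data), and the names year and win_pct are made up for this example.

```r
# Simulated "winning percentages" -- for illustration only, not the Phillies data
set.seed(1)
year <- 1900:2000
win_pct <- 50 + 8 * sin((year - 1900) / 15) + rnorm(length(year), sd = 5)

# Small f: each local fit uses few points, so the smooth follows the wiggles
rough <- lowess(year, win_pct, f = 0.1)
# Large f: each local fit uses many points, so the smooth is much flatter
smooth <- lowess(year, win_pct, f = 0.6)

plot(year, win_pct, pch = 19, col = "gray")
lines(rough, lwd = 2, col = "red")
lines(smooth, lwd = 2, col = "blue")
abline(h = 50, lwd = 2)
legend("topright", legend = c("f = 0.1", "f = 0.6"),
       col = c("red", "blue"), lwd = 2)
```

A value like f = .2 sits between these two extremes; it is worth trying a few values and picking the one that shows the general pattern without chasing every bump.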

Looking at the graph, you should see why the Philadelphia fans are so excited about the Phillies winning this season.

1. The Phillies teams generally have been crummy, especially between 1920 and 1950. The team hasn't experienced much success.

2. But there have been two recent periods where the Phillies have been successful. One period was in the 1970s, and the climax of this success was the Phillies' first World Series win in 1980. The second period is in recent history, and of course the Phillies won their second World Series in 2008.

Sunday, October 26, 2008

Comments on Plotting Homework

Here are some general and specific comments on your efforts on Homework 4 on plotting and straightening.

First, most of you did well on this homework. You did a good job of describing the fit and the pattern in the residuals. Also you made reasonable choices in reexpressing the x and/or the y variable so that the graph looked pretty straight.

But there were some things that caused you to lose points.

1. What are you looking for in a residual plot? In this assignment, the focus was looking for nonlinear patterns. For example, is it appropriate to fit a line to the (year, population) data for the England and Wales dataset? We can answer this question by looking for a nonlinear pattern in the plot of residuals against year. There is significant curvature (that is, a quadratic pattern) in the residual plot which tells you that the population growth is not linear.

By the way, some of you plotted log population against log year -- why did you take the log of year? It doesn't make any sense to me.

2. Always talk about the fit and the residuals in the context of the data. Someone plotted x against y without telling me the variables. The fun part of statistics is that you can always talk about the application.

3. Remember that funny problem where the scatterplot shows two groups of points with a clear separation? This is one type of nonlinear pattern that you won't be able to straighten with a single choice of power transformation. But since the graph clearly divides into two parts, it makes sense to treat this as two independent problems and try to straighten each part.

4. Should one fit a line by least-squares or by a resistant line? In many situations, it won't make a difference -- either fit will work. But least-squares can give you relatively poor fits when there are outliers.

How can you tell if least squares isn't the best fit? Look at the residual plot. If you still see some increasing or decreasing pattern, then this tells you that least-squares hasn't explained all of the "tilt" pattern in the graph.
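Here is a small sketch of point 4. It uses made-up data with one planted outlier, and it uses base R's line() function (Tukey's resistant line) as the resistant fit; that choice is an assumption of this sketch, and any resistant line function could be substituted.

```r
# Made-up data: a line with slope 3 plus a small wiggle, and one wild point
x <- 1:20
y <- 2 + 3 * x + sin(x)
y[20] <- 200   # planted outlier

ls_fit <- lm(y ~ x)           # least-squares line
resistant_fit <- line(x, y)   # Tukey's resistant line (base R, stats package)

ls_slope <- unname(coef(ls_fit)[2])
resistant_slope <- unname(coef(resistant_fit)[2])
c(least_squares = ls_slope, resistant = resistant_slope)

# Residuals from the least-squares fit: apart from the outlier, the points
# still drift downward, so some "tilt" has been left behind
plot(x, resid(ls_fit), pch = 19)
abline(h = 0)
```

The single outlier pulls the least-squares slope well away from 3, while the resistant slope stays close to it.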

Monday, October 20, 2008

Straightening

This week, the main topic is reexpressing either the x or y variable to straighten a nonlinear pattern that we see in a scatterplot. Although the manipulation may be straightforward, it is possible to miss the main message in this material. Here are some questions that may help in your explanations in the homework.

Question 1: What is a simple description of the pattern in a scatterplot?

Question 2: Why do we prefer to fit lines instead of more complicated curves like quadratic or cubic?

Question 3: Is it possible to straighten all nonlinear patterns by power transformations?

Question 4: Can you think of a situation or an example where it is not possible to straighten by power transformations?

Don't forget to look at your data first -- you may see right away that it is not possible to straighten the graph.
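To make the idea concrete, here is a minimal sketch of straightening by reexpressing y. The data are simulated (roughly exponential growth) and are for illustration only. The helper curvature_r2 is a made-up name for this example: it measures how much of the residual variation a quadratic in x can explain, which is just one rough way to put a number on "curvature in the residual plot".

```r
# Simulated data with roughly exponential growth -- for illustration only
set.seed(2008)
x <- 1:30
y <- 5 * exp(0.1 * x) * exp(rnorm(length(x), sd = 0.05))

raw_fit <- lm(y ~ x)        # line fit to the raw data
log_fit <- lm(log(y) ~ x)   # line fit after reexpressing y by logs (p = 0)

# Residual plots: curved for the raw fit, patternless after taking logs
par(mfrow = c(1, 2))
plot(x, resid(raw_fit), pch = 19, main = "Raw y")
abline(h = 0)
plot(x, resid(log_fit), pch = 19, main = "log(y)")
abline(h = 0)

# Hypothetical helper: share of residual variation explained by a quadratic in x
curvature_r2 <- function(fit) summary(lm(resid(fit) ~ x + I(x^2)))$r.squared
r2_raw <- curvature_r2(raw_fit)
r2_log <- curvature_r2(log_fit)
c(raw = r2_raw, log = r2_log)
```

The eye is still the main tool here; the number only backs up what the residual plots show.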

Wednesday, October 15, 2008

Example of Resistant Fitting

In baseball, the objective is to win games and a team wins a game by scoring more runs than its opponent. An interesting question is "how important is a single run" towards the goal of winning a game? Suppose one collects the runs scored, the runs allowed, the number of wins and the number of losses for a group of teams. Bill James (a famous guy who works on baseball data) discovered the empirical relationship

log(wins/losses) = 2 log(runs scored/runs allowed)

He called this the Pythagorean Relationship.
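To see what this relationship says on the original scale, exponentiate both sides: wins/losses = (runs scored/runs allowed)^2, so the fraction of games won is RS^2/(RS^2 + RA^2). Here is a small sketch of that calculation. The function name pythag_win_frac and the run totals are made up for illustration; they are not taken from the dataset.

```r
# Winning fraction implied by log(W/L) = k * log(RS/RA); k = 2 is James' value
pythag_win_frac <- function(runs_scored, runs_allowed, k = 2) {
  runs_scored^k / (runs_scored^k + runs_allowed^k)
}

# Hypothetical team that scores 800 runs and allows 700
frac <- pythag_win_frac(800, 700)
frac          # about 0.566
162 * frac    # roughly 92 wins in a 162-game season
```

A team that scores exactly as many runs as it allows gets a fraction of 0.5, which is a quick sanity check on the formula.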

Let's try to demonstrate this relationship by use of a resistant fit.

1. First, I collected data for all baseball teams in the 2008 season. The dataset teams2008.txt contains for each of the 30 teams ...

Team -- the name of the team
Wins -- the number of wins
Losses -- the number of losses
Runs.Scored -- the total number of runs scored
Runs.Allowed -- the total number of runs allowed

2. I read this dataset into R and compute the variables log.RR and log.WL.

data=read.table("http://bayes.bgsu.edu/eda/data/teams2008.txt",header=TRUE)
attach(data)

log.RR=log(Runs.Scored/Runs.Allowed)
log.WL=log(Wins/Losses)

3. I graph log.WL against log.RR and add team labels to the graph. As we hoped, the relationship looks pretty linear.

plot(log.RR,log.WL,pch=19)
text(log.RR,log.WL,Team,pos=2)

4. I next fit a resistant line using the rline function in the LearnEDA package. I add the fitted line to the graph.

library(LearnEDA)
the.fit=rline(log.RR,log.WL,iter=4)
curve(the.fit$a+the.fit$b*(x-the.fit$xC),add=TRUE)

5. If Bill James' relationship holds, the slope of the resistant line should be close to 2.

the.fit
$a
[1] 0.01006079
$b
[1] 1.801718

It approximately holds since the slope of 1.8 is close to 2.

6. To see if this is a reasonable fit, we compute the fit and the residuals.

FIT=the.fit$a+the.fit$b*(log.RR-the.fit$xC)
RESIDUAL=log.WL-FIT

and then plot the residuals, adding the team labels.

plot(log.RR,RESIDUAL,pch=19)
abline(h=0)
text(log.RR,RESIDUAL,Team,pos=2)

7. What are we looking for in the residual plot? First, we look for general patterns that we didn't see earlier in the first plot. I don't see any trend, so it appears that we removed the tilt by fitting a line.

Also we are looking for unusually small or large residuals. Here a "lucky team" corresponds to a large positive residual: a team that won more games than one would expect based on its runs scored and runs allowed.

Which team was unusually lucky in the 2008 season? A hint: they were a "heavenly" team from the west coast.
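To get a feel for how big a residual is, it helps to translate it from the log(wins/losses) scale into wins. If log(W/L) = f, then the fraction of games won is exp(f)/(1 + exp(f)), which is what R's plogis function computes. The numbers below are hypothetical values chosen for illustration (they are not taken from teams2008.txt), and the 162-game season is an assumption of the sketch.

```r
# Hypothetical values for illustration only -- not taken from teams2008.txt
fit_log_wl <- 0.10   # fitted value of log(wins/losses) for some team
residual <- 0.15     # that team's residual on the log scale
games <- 162         # assumed length of the season

# Convert log(W/L) to a number of wins: W/(W+L) = exp(f) / (1 + exp(f))
expected_wins <- games * plogis(fit_log_wl)
actual_wins <- games * plogis(fit_log_wl + residual)
extra_wins <- actual_wins - expected_wins
extra_wins   # wins beyond what the run totals would predict
```

A positive value of extra_wins is the "luck" we are talking about, expressed in games.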

Monday, October 6, 2008

Reexpressing for Symmetry

Your instructor is currently in "Phillies Heaven". His baseball team rarely has a chance to win the World Series (they have won one World Series in over 120 seasons of competing) and they are currently in the National League Championship Series against the Dodgers!

Oh right -- I'm supposed to talk about the EDA class.

I finished the grading on your "symmetry" homework and Fathom assignment. You generally did fine, but I'll explain some issues that may have caused you to lose points.

What was I looking for?

In the notes, we talked about several methods for assessing symmetry of a batch, including looking at the sequence of midsummaries, using a symmetry plot, and Hinkley's quick method. For each dataset you consider, here's an outline of what you should do:

1. Demonstrate (using some method) that your raw data looks nonsymmetric.

2. Experiment with power transformations with different choices of p to try to make the reexpressed data approximately symmetric. Use one of our methods to see if the "p-reexpression" works.

3. Convince me that you have found a reasonable transformation by graphing the reexpressed data (say by a histogram or a stemplot).

It is best if you explain (with words) your process of finding the best reexpression. I'm much more interested in your thought process than your computer output.
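Here is a rough sketch of the three steps on a made-up right-skewed batch. Hinkley's d is computed as (mean - median)/spread; using the interquartile range as the spread is an assumption of this sketch, so substitute the scale used in the notes. The function name hinkley_d and the data are for illustration only.

```r
# Made-up right-skewed batch (evenly spaced lognormal quantiles) -- illustration only
x <- qlnorm(ppoints(200), meanlog = 3, sdlog = 0.8)

# Hinkley's d; the IQR is used as the scale here (an assumption of this sketch)
hinkley_d <- function(x) (mean(x) - median(x)) / IQR(x)

# Step 1: the raw data (p = 1) should give a clearly positive d
d_raw <- hinkley_d(x)
# Step 2: move down the ladder of powers and watch d shrink toward 0
d_root <- hinkley_d(sqrt(x))   # p = 0.5
d_log <- hinkley_d(log(x))     # p = 0
round(c(p1 = d_raw, p0.5 = d_root, p0 = d_log), 3)

# Step 3: graph the reexpressed data to confirm it looks symmetric
hist(log(x), main = "log(x)")
```

The output is only the starting point; what I want to read is why you moved from one value of p to the next.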

Here are a few other pitfalls.

1. Some of you were confused when you considered reexpressions with negative values of p.

p = 1 -- data is really right-skewed (big positive value of Hinkley's d)

p = 0 -- data is slightly right-skewed (smaller positive value of d)

p = -1 -- data is right skewed (positive value of d)

What is going on? The problem is that you were defining a reexpression like p = -1/2 as

data^(-1/2)

when you should have used

-data^(-1/2)

By adding the negative sign, all of your transformations are increasing functions, and then you can make better sense of the reexpressed graphs and the methods.

2. Reexpression only works when there is sufficient spread in your data. Suppose you have data that ranges from 50 to 60 -- here the ratio of the largest value to the smallest is only 60/50 = 1.2. Reexpression using any value of p won't work -- that is, it will hardly change the shape of the data.

3. One of you used a normal probability plot to determine if the data was symmetric. What's wrong with this? First, it is not one of the methods we discussed in the notes. Second, checking for normality is different from (but related to) checking for symmetry. Data can be symmetric but not normally distributed.
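The first two pitfalls can be checked with a short sketch. The function names power_reexpress and hinkley_d are made up for this example, the data are simulated, and using the interquartile range as the scale in Hinkley's d is an assumption of the sketch.

```r
# Hinkley's d with the IQR as the scale (an assumption of this sketch)
hinkley_d <- function(x) (mean(x) - median(x)) / IQR(x)

# Power reexpression that is increasing for every p (positive data only)
power_reexpress <- function(x, p) {
  if (p > 0) x^p else if (p == 0) log(x) else -(x^p)
}

# Made-up right-skewed batch -- for illustration only
x <- qlnorm(ppoints(200), meanlog = 3, sdlog = 0.8)

# Pitfall 1: without the minus sign, p = -1 still looks right-skewed (d > 0);
# with the minus sign, d is negative, so p = -1 has gone too far
d_unsigned <- hinkley_d(x^(-1))
d_signed <- hinkley_d(power_reexpress(x, -1))
c(unsigned = d_unsigned, signed = d_signed)

# Pitfall 2: squeeze the same batch into the range 50 to 60 (ratio 1.2);
# taking logs now barely changes d
narrow <- 50 + 10 * (x - min(x)) / (max(x) - min(x))
d_narrow_raw <- hinkley_d(narrow)
d_narrow_log <- hinkley_d(power_reexpress(narrow, 0))
c(raw = d_narrow_raw, log = d_narrow_log)
```

With the signed version, d moves steadily from positive to negative as p decreases, which is the behavior the ladder of powers is supposed to have.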