Monday, October 6, 2008

Reexpressing for Symmetry




Your instructor is currently in "Phillies Heaven".  His baseball team rarely has a chance to win the World Series (they have won one World Series in over 120 seasons of competing), and they are currently in the National League Championship Series against the Dodgers!

Oh right -- I'm supposed to talk about the EDA class.

I finished the grading on your "symmetry" homework and Fathom assignment.    You generally did fine, but I'll explain some issues that may have caused you to lose points.

What was I looking for?

In the notes, we talked about several methods for assessing the symmetry of a batch, including looking at the sequence of midsummaries, using a symmetry plot, and Hinkley's quick method.   For each dataset you consider, here's an outline of what you should do:

1.  Demonstrate (using some method) that your raw data looks nonsymmetric.

2.  Experiment with power transformations with different choices of p to try to make the reexpressed data approximately symmetric.  Use one of our methods to see if the "p-reexpression" works.

3.  Convince me that you have found a reasonable transformation by graphing the reexpressed data (say by a histogram or a stemplot).
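Here is a rough sketch of steps 1 and 2 in Python with NumPy (not something we used in class, so treat it as an illustration only). The function names `reexpress` and `hinkley_d` are my own, the batch `raw` is made up, and I am assuming Hinkley's d takes the form (mean - median) / spread, with the interquartile range standing in for the fourth-spread. Your notes may use a different measure of spread.

```python
import numpy as np

def reexpress(x, p):
    """Power reexpression, signed so that every choice of p is increasing."""
    x = np.asarray(x, dtype=float)
    if p == 0:
        return np.log(x)          # p = 0 is the log by convention
    y = x ** p
    return y if p > 0 else -y     # negative sign for negative powers

def hinkley_d(x):
    """Assumed form of Hinkley's d: (mean - median) / spread.

    The spread here is the interquartile range, used as a stand-in
    for the fourth-spread.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    return (np.mean(x) - np.median(x)) / (q3 - q1)

# Made-up right-skewed batch (positive values only)
raw = np.exp(np.linspace(0.0, 3.0, 41))

# Step 1: d for the raw data (p = 1); Step 2: try other powers
for p in (1, 0.5, 0, -0.5, -1):
    print(f"p = {p:5}: d = {hinkley_d(reexpress(raw, p)):+.3f}")
```

For this particular batch, d is positive at p = 1, essentially zero at p = 0, and negative at p = -1, so the log is the reexpression to graph in step 3. Real data will rarely be this tidy.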

It is best if you explain (with words) your process of finding the best reexpression.  I'm much more interested in your thought process than your computer output.

Here are a few other pitfalls.

1.  Some of you were confused when you considered reexpressions with negative values of p.  A typical set of results looked like this:

p = 1  -- data is really right-skewed (big positive value of Hinkley's d)

p = 0 -- data is slightly right-skewed (smaller positive value of d)

p = -1 -- data is right-skewed again (positive value of d)

What is going on?  The problem is that you were defining a reexpression like p = -1/2 as

data^(-1/2)

when you should have used

-data^(-1/2)

With the negative sign for negative values of p, every power transformation is an increasing function of the data.  The order of the observations is preserved, so you can make better sense of the reexpressed graphs and the methods.
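To see what the missing sign does, here is a small Python/NumPy sketch. The batch `raw` is made up, and `hinkley_d` is my own helper that assumes d = (mean - median) / spread, with the interquartile range standing in for the fourth-spread.

```python
import numpy as np

def hinkley_d(x):
    """Assumed form of Hinkley's d, using the IQR as the spread."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    return (np.mean(x) - np.median(x)) / (q3 - q1)

# Made-up right-skewed batch, sorted from smallest to largest
raw = np.exp(np.linspace(0.0, 3.0, 41))

unsigned = raw ** -1.0       # data^(-1): reverses the order of the data
signed = -(raw ** -1.0)      # -data^(-1): keeps the order of the data

is_decreasing = np.all(np.diff(unsigned) < 0)
is_increasing = np.all(np.diff(signed) > 0)

d_unsigned = hinkley_d(unsigned)
d_signed = hinkley_d(signed)

print("unsigned is decreasing:", is_decreasing, " d =", round(d_unsigned, 3))
print("signed is increasing:  ", is_increasing, " d =", round(d_signed, 3))
```

The two values of d have the same size and opposite signs. Without the negative sign the batch is flipped end for end, so a reexpression that has actually gone too far (left-skewed) shows up looking right-skewed.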

2.  Reexpression only works when there is sufficient spread in your data.  Suppose you have data that ranges from 50 to 60 -- here the ratio of the largest value to the smallest is only 60/50 = 1.2.  Reexpression using any value of p won't work -- that is, it will hardly change the shape of the data.
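One way to check this for yourself, sketched in Python/NumPy with made-up batches: if the reexpressed values are almost perfectly correlated with the raw values, the reexpression is nearly a straight line over the range of the data and cannot change the shape. The helper names `max_min_ratio` and `corr_with_log` are mine, and I use the log only as one representative reexpression.

```python
import numpy as np

def max_min_ratio(x):
    """Ratio of the largest value to the smallest value in the batch."""
    x = np.asarray(x, dtype=float)
    return x.max() / x.min()

def corr_with_log(x):
    """Correlation between the raw data and its log reexpression."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x, np.log(x))[0, 1]

# Two made-up batches: one with little relative spread, one with a lot
narrow = np.linspace(50, 60, 200)
wide = np.linspace(1, 1000, 200)

for name, batch in (("narrow", narrow), ("wide", wide)):
    print(f"{name}: max/min = {max_min_ratio(batch):.1f}, "
          f"corr(data, log data) = {corr_with_log(batch):.4f}")
```

For the narrow batch the correlation is very close to 1, so the log barely bends the data. For the wide batch the correlation drops noticeably, which is the situation where a reexpression can actually change the shape.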

3.  One of you used a normal probability plot to determine if the graph was symmetric.  What's wrong with this?  First, it is not one of the methods we discussed in the notes.  Second, checking for normality is different from (but related to) checking for symmetry.  Data can be symmetric but not normally distributed.
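A made-up example of a symmetric batch that is nowhere near normal, sketched in Python/NumPy: two clusters that are mirror images of each other around 5. I approximate the letter values with percentiles, which is an assumption; the depth-based letter values from the notes can differ slightly.

```python
import numpy as np

# Made-up batch: a cluster near 0 and its mirror image near 10
left = np.linspace(0.0, 2.0, 21)
right = 10.0 - left
batch = np.concatenate([left, right])

# Midsummaries: median, midfourth, mideighth, midextreme
# (percentiles are used here as a stand-in for the letter values)
mids = [
    (np.percentile(batch, q) + np.percentile(batch, 100 - q)) / 2
    for q in (50, 25, 12.5, 0)
]
print("midsummaries:", np.round(mids, 3))

# Fraction of the batch within half a standard deviation of the mean.
# A normal batch would have roughly 38% of its values in this interval.
center, sd = batch.mean(), batch.std()
frac_center = np.mean(np.abs(batch - center) < 0.5 * sd)
print("fraction near the center:", frac_center)
```

The midsummaries are all equal to 5, so the batch passes the symmetry check, yet there are no values at all near the center, which a bell-shaped batch could never show.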

