Exploratory Data Analysis: February 2009

Friday, February 27, 2009

Graphical User Interface for R

Some of you seem to be struggling with R's interface, where you type commands and have to know the R functions.

There is an attractive package called R Commander which provides a menu interface for R. You work in a special window and most of the useful R functions are available as options in menus.

On the left, you see a snapshot of R Commander (on a Macintosh).

How do you get R Commander? It is easy -- you just install the package Rcmdr from CRAN. This package uses a number of other R packages. If you don't have these other packages installed yet, then R will automatically install these (it takes a little while -- be patient).
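At the R prompt, the installation comes down to roughly two commands. Here is a minimal sketch: the helper name ensure_installed is made up for illustration, and the only real R functions used are requireNamespace, install.packages and library.

```r
# Hypothetical helper (not part of R or Rcmdr): install a package from CRAN
# only if it is not already available, then report whether it can be loaded.
ensure_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection; may take a while
  }
  requireNamespace(pkg, quietly = TRUE)
}

# Only try this in an interactive session, since loading Rcmdr opens its window.
if (interactive() && ensure_installed("Rcmdr")) {
  library(Rcmdr)
}
```

If you prefer, simply typing install.packages("Rcmdr") followed by library(Rcmdr) should do the same job.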

One of my former students who teaches at Youngstown State uses R Commander in his statistics classes. He says "Rcmdr is really cool, and the students eat it up."

Anyway, you might want to try this. You can run the functions in LearnEDA by first loading it and typing commands (like lval) in the top window.

Let me know if you find this helpful, since I haven't done much with it.

Thursday, February 26, 2009

Plotting and Straightening on R

To encourage you to use R for next week's assignment, I put together a new movie showing R in action.

As you probably know, one of the fastest growing companies in the U.S. is Google. I found an interesting graph showing Google's growth (as measured by the number of employees):

Obviously there has been a substantial increase in Google employees over this two-year period. But we wish to say more. We'd like to describe the size of this growth. Also, we'd like to look at the residuals to detect any interesting patterns beyond the obvious growth.

Here's the plan.

1. We start with plotting the data, fitting a resistant line, and looking at the residuals.

2. By looking at the half-slope ratio and the residuals, we decide if the graph is straight.

3. If we see curvature in the graph, then we try to reexpress one or more of the variables (by a power transformation) to straighten the graph. We use half-slope ratios and residual plots to see if we are successful in straightening the graph.

4. Once the graph is straight, then we summarize the fit (interpret the slope of the resistant line) and examine the residuals.

The key function in the LearnEDA package is rline.
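To make the plan concrete, here is a rough base-R sketch of a three-group resistant line and its half-slope ratio. The function resistant_line is a hypothetical helper written for this illustration; it is not LearnEDA's rline and may differ from it in details such as how the intercept is chosen. The data are made up (not the Google data) so that the curvature is obvious.

```r
# Hypothetical helper: a three-group resistant line written in base R.
# It is NOT LearnEDA's rline and may differ from it in details.
resistant_line <- function(x, y) {
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  n <- length(x)
  k <- floor(n / 3)                 # size of the two outer groups
  left  <- seq_len(k)
  right <- (n - k + 1):n
  mid   <- (k + 1):(n - k)

  # Summary points: (median x, median y) in each third of the data
  xL <- median(x[left]);  yL <- median(y[left])
  xM <- median(x[mid]);   yM <- median(y[mid])
  xR <- median(x[right]); yR <- median(y[right])

  slope       <- (yR - yL) / (xR - xL)
  left_slope  <- (yM - yL) / (xM - xL)
  right_slope <- (yR - yM) / (xR - xM)

  # Level at x = xM: average of the three summary points after removing the slope
  level  <- mean(c(yL - slope * (xL - xM), yM, yR - slope * (xR - xM)))
  fitted <- level + slope * (x - xM)

  # Residuals are returned in order of increasing x
  list(slope = slope, level = level, center = xM,
       half_slope_ratio = right_slope / left_slope,
       residuals = y - fitted)
}

# Made-up curved data (not the Google data): y grows like x squared.
x <- 1:9
y <- x^2

raw  <- resistant_line(x, y)
root <- resistant_line(x, sqrt(y))   # reexpress y with the p = 0.5 power

raw$half_slope_ratio    # well above 1, so the plot bends upward
root$half_slope_ratio   # 1, so the reexpressed plot is straight
```

A half-slope ratio near 1 together with patternless residuals suggests the graph is straight; a ratio far from 1 suggests trying another power.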

I illustrate using rline in my new movie:

http://bayes.bgsu.edu/eda/straightening.google.swf

The dataset can be found at http://bayes.bgsu.edu/eda/data/google.txt

and my script of R commands for this analysis can be found at

http://bayes.bgsu.edu/eda/R/Ch5.google.work.R

By the way, I'm using a new version of the LearnEDA package that you can download at

http://bayes.bgsu.edu/eda/LearnEDA_1.01.zip

Hope this is helpful in your work next week.

Wednesday, February 25, 2009

Sexy Jobs

Since many of you will be thinking about jobs soon, here is an interesting posting about desirable jobs. Many of you are in the right area!

In a recent talk, Google's chief economist Hal Varian said this:

"I keep saying the sexy job in the next ten years will be statisticians.
People think I’m joking, but who would’ve guessed that computer engineers
would’ve been the sexy job of the 1990s? . . .
The ability to take data—to be able to understand it, to process it,
to extract value from it, to visualize it, to communicate it—that’s going to
be a hugely important skill in the next decades, not only at the
professional level but even at the educational level for elementary school
kids, for high school kids, for college kids. Because now we really do have
essentially free and ubiquitous data. So the complimentary scarce factor is
the ability to understand that data and extract value from it.
I think statisticians are part of it, but it’s just a part. You also
want to be able to visualize the data, communicate the data, and utilize it
effectively. But I do think those skills—of being able to access,
understand, and communicate the insights you get from data analysis—are
going to be extremely important. Managers need to be able to access and
understand the data themselves."

The full article is at

http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286

Monday, February 23, 2009

Reducing Skewness

Do you want to see me quickly reduce skewness?

skewness

More seriously, here are some comments after looking at your homework (the grades are posted and you did generally well on this homework).

1. Why is symmetry important?

Some of you may be wondering why it is useful to make datasets symmetric. I don't think making a dataset symmetric is as important as making datasets have equal spreads, but here are a couple of reasons why symmetry is helpful.

-- Symmetric datasets are simpler to summarize. There is an obvious "average". Also, if the data are bell-shaped, one can use the empirical rule (about 2/3 of the data fall within one standard deviation of the mean).

-- Many statistical procedures like ANOVA assume normally distributed data. One might wish to reexpress data before applying these procedures.

2. Monotone increasing reexpressions.

-- In our definition of power transformations, recall that when p is negative, we consider

"minus data raised to a power"

We did that so that all of the transformations are monotone increasing. So when you have a right-skewed dataset and you move from p=1 to p=0 to p=-1, you should be moving first toward symmetry and then, if you go too far, toward left-skewness.
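Here is a small sketch of this family in R. The function name power_reexpress is hypothetical, p = 0 is taken to mean the log (as in the usual "raw, roots, logs" ladder), and the data values are made up.

```r
# Hypothetical helper: the power family with the sign flip for negative p.
# p = 0 is taken to be the log.
power_reexpress <- function(x, p) {
  stopifnot(all(x > 0))
  if (p > 0) {
    x^p
  } else if (p == 0) {
    log(x)
  } else {
    -(x^p)      # "minus data raised to a power" keeps the order of the data
  }
}

x <- c(1, 2, 5, 10, 50)          # made-up positive values, already sorted

power_reexpress(x, 1)            # raw data
power_reexpress(x, 0)            # logs
power_reexpress(x, -1)           # minus reciprocals: still increasing
x^(-1)                           # plain reciprocals: the order is reversed
```

Without the minus sign, the largest raw value would become the smallest reexpressed value, and the direction of the skewness would flip for the wrong reason.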

3. Skewness in the middle and skewness in the tails.

Sometimes it is tough to make a dataset symmetric, since the middle half and the tails may behave differently. You can detect this with a symmetry plot. The points to the left may be close to the line and the points to the right may fall off the line. This indicates that the middle of the data is symmetric, but there is a long tail to the right (or the left).
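As an illustration, here is one common way to build the points of a symmetry plot in base R. The helper symmetry_points is hypothetical (LearnEDA has its own plotting function, which may be laid out differently), and the batch is made up.

```r
# Hypothetical helper: the points of a symmetry plot.
# For the i-th smallest and i-th largest values, compare their distances
# from the median; symmetric data give points on the line y = x.
symmetry_points <- function(x) {
  s <- sort(x)
  n <- length(s)
  m <- median(s)
  i <- seq_len(floor(n / 2))
  # Reverse so that the pairs run from the middle of the data out to the tails
  data.frame(below = rev(m - s[i]),
             above = rev(s[n + 1 - i] - m))
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 20)     # made-up batch with a long right tail
pts <- symmetry_points(x)
pts

# Draw the plot when working at the console
if (interactive()) {
  plot(pts$below, pts$above,
       xlab = "Distance below median", ylab = "Distance above median")
  abline(0, 1)     # points above this line signal a longer right tail
}
```

For this batch the first three pairs sit exactly on the line (the middle is symmetric), while the last pair is far above it (the right tail is long).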

4. When does reexpression work?

You need sufficient spread in the data as measured by HI/LO. If this ratio is not much different from 1, reexpressions won't help much.
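A quick way to see this is to compare two made-up batches, one with HI/LO close to 1 and one with a large HI/LO. The helper hi_lo is hypothetical, and using the correlation between the raw values and their logs is just one informal way to judge how much a reexpression changes the shape.

```r
# Hypothetical helper: the HI/LO ratio of a batch of positive values.
hi_lo <- function(x) max(x) / min(x)

narrow <- seq(100, 110, by = 1)      # made-up batch, HI/LO = 1.1
wide   <- seq(1, 1000, by = 1)       # made-up batch, HI/LO = 1000

hi_lo(narrow)
hi_lo(wide)

# If the logs are almost a straight-line function of the raw values, taking
# logs barely changes the shape of the batch.
cor(narrow, log(narrow))    # very close to 1
cor(wide, log(wide))        # noticeably below 1
```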

5. Are some of you "R resistant"?

Many of you seem to prefer using Fathom or Minitab. That's okay, but the best way to get comfortable using R is to practice using it. I made the package LearnEDA to make R easier to use, but some of you aren't taking advantage of the special functions.

Monday, February 16, 2009

Last Homework and Dreamtowns

We've been talking about taking power reexpressions of data to achieve particular objectives, such as equalizing spreads between batches or making a batch symmetric.

When do these reexpressions work?

1. First, you need data with sufficient spread, which we measure by the ratio HI/LO, for the power transformations to have much effect.

2. You have to take a "significant" power. In the last Fathom homework, most of you tried p = .82 (from a starting value of p = 1), which would typically have little effect on the data. We usually try reexpressions in steps of 0.5, so we move from raw to roots to logs, and so on.

This week, we are reexpressing data to achieve approximate symmetry. Here's an example.

Last summer, there was an interesting article posted on bizjournals.com that ranked 140 "dreamtowns" -- these towns offer refuge from big cities and congested traffic. I got interested in the article since my town, Findlay, made the list.

Anyway, they collected a number of variables from each city, including the percentage of adults (25 and over) who hold college degrees.

Here's a histogram of these percentages from the 140 towns.

This looks right-skewed with one outlier (I never knew that Bozeman, Montana had a lot of highly educated people) and this is a good candidate for reexpression.

In the notes, I give several methods (plotting the mids, using a symmetry plot, using Hinkley's method) for choosing the "right" reexpression.

I'd suggest that you try at least two of these methods in your homework.
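As one illustration, Hinkley's method compares the mean and the median on each scale: d = (mean - median) / spread, and we look for the power where d is close to 0. The sketch below uses a hypothetical helper hinkley_d with the IQR as the measure of spread (an assumption; the notes may use a different spread) and a made-up right-skewed batch, not the dreamtowns data.

```r
# Hypothetical helper: Hinkley's quick check for symmetry,
# d = (mean - median) / spread. The IQR is used as the spread here,
# which is an assumption; other measures of spread would also work.
hinkley_d <- function(x) (mean(x) - median(x)) / IQR(x)

# Made-up right-skewed batch (not the dreamtowns data)
x <- exp(seq(0, 4, length.out = 21))

hinkley_d(x)          # positive: right-skewed on the raw scale (p = 1)
hinkley_d(sqrt(x))    # still positive, but closer to 0 (p = 0.5)
hinkley_d(log(x))     # essentially 0: the logs are symmetric (p = 0)
hinkley_d(-1 / x)     # negative: we have gone too far (p = -1)
```

Note how the sign of d changes as we step down the ladder of powers; the power where d crosses 0 is the one to choose.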

Monday, February 9, 2009

Simple Statistical Comparisons

I just finished grading your "comparing batches" homework. You did fine in the mechanics (computing the summaries, constructing a spread versus level graph, and deciding on an appropriate reexpression), but it wasn't clear that you understand WHY we are doing all of this work.

Let's review the main points of this section.

1. We wish to compare two batches.

2. What does compare mean? Well, it could mean many things. Batch 1 has a larger spread, batch 2 has three more outliers than batch 1, and so on. You illustrated many types of comparisons in your homework writeups.

Here we wish to compare the general or average locations of the two batches. A simple statistical comparison is something like

"batch 2 is 10 units larger than batch 1"

Maybe we should call this an SSC (statisticians love to use acronyms).

What this could mean is that if I added 10 units to each value in batch 1, then I would get a new dataset that resembles batch 2.

3. Is it always appropriate to make a SSC?

No.

It won't work if the two batches have unequal spreads. If they have unequal spreads, then adding a number to each value in batch 1 will NOT give a dataset that resembles batch 2.

4. So if the batches have different spreads, we give up?

No.

It is possible that we can reexpress the batches on a new scale, so that the reexpressed batches have approximately equal spreads.

5. So the plan is to (1) try to find a suitable reexpression and (2) do a SSC on the reexpressed data.
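The two-step plan can be sketched in R as follows. The batches are made up (not the snowfall data) and chosen so that the spread grows with the level, and the IQR stands in for whatever measure of spread you prefer.

```r
# Made-up batches (not the snowfall data): batch2 is batch1 multiplied by 3,
# so its spread grows with its level.
batch1 <- c(2, 3, 5, 8, 13, 21, 34)
batch2 <- 3 * batch1

# Raw scale: the spreads differ, so "add a constant" does not work
IQR(batch1)
IQR(batch2)

# Log scale: the spreads agree, so a simple comparison makes sense
IQR(log(batch1))
IQR(log(batch2))

# SSC on the log scale: batch 2 is larger by this constant amount
shift <- median(log(batch2)) - median(log(batch1))
shift                      # equals log(3), up to rounding

# Adding the shift to every log value of batch 1 reproduces batch 2's logs
all.equal(log(batch1) + shift, log(batch2))
```

On the log scale the statement "batch 2 is log(3) units larger than batch 1" is a fair summary, which is the same as saying batch 2 is about 3 times batch 1 on the raw scale.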

The snowfall data example using Fathom is one case where our strategy works. But generally, you did a poor job in making an SSC on the reexpressed data.

Sunday, February 1, 2009

A Statistical Salute to the Steelers

I just finished watching the Super Bowl and I'm happy that the Steelers won. The winning quarterback Ben Roethlisberger is from my home town (Findlay) and it was very exciting seeing him direct his team to the winning touchdown at the end of the game.

As I was watching this game, I wondered:

"Which team has the better offense?"

To answer this question, I collected the yards gained for every single play of the two teams, the Steelers and the Cardinals, for this Super Bowl. I created a datafile with two variables, Yards and Team, and entered this data into R.

Using the R function

boxplot(Yards ~ Team, horizontal=TRUE, xlab="Yards Gained", ylab="Team", col="gold")

I created the following boxplot.

What do we see in this boxplot display?

-- Both batches of yards gained look a bit right-skewed.

-- There are four outliers for the Steelers and two for the Cardinals. These correspond to big plays for the teams that gained a lot of yards.

Why did I draw this boxplot display? Several comments:

-- We wish to make a comparison between the two batches.

-- What is a comparison? Well, we could say that one batch tends to be larger than the second batch. For this example, if it were true, then I would say that the Cardinals gained more yards (per play) than the Steelers.

-- But that isn't really saying much. When I say "make a comparison", this means that I want to say that one batch is a particular number larger or smaller than the second batch.

-- When can we say "batch 1 is 10 larger than batch 2"? As you'll read in the notes, we can only make this type of comparison when the two batches have similar spreads.

Returning to my football example, it doesn't appear that the two teams were very different with respect to yards gained per play. Both teams had approximately the same median.