Friday, November 21, 2008

Binning homework

Some of you appear to be confused about this "binning" homework.  Let's outline what you are supposed to do in this assignment.

1.  First you bin your data and construct a histogram.  This is easy (I hope).  The histogram gives you counts of each bin.

By the way, a rootogram is just like a histogram, except that you graph the square roots of the counts against the bins.  (Why do you do this?  Look at the notes.)

2.  You find a Gaussian curve that fits your data -- you have a mean and a standard deviation.

3.  Now you want to find the expected or fitted counts from the Gaussian curve.  You do this by running the function fit.gaussian -- its inputs are the vector of raw data, your bins, and the mean and standard deviation of the Gaussian curve.  Suppose you save the output of this function in the variable s.  Then

s$counts are the observed bin counts
s$expected are the expected counts
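
Putting steps 1-3 together, here is a minimal sketch of the call (x is just my stand-in name for whatever your raw data vector is called; m and sd are the mean and standard deviation from step 2):

s = fit.gaussian(x, bins, m, sd)   # x = raw data, bins = vector of bin endpoints
s$counts                           # observed bin counts
s$expected                         # expected (fitted) counts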

4.  The question at this point is -- does the Gaussian curve provide a good fit to the histogram?  Well, maybe yes, and maybe no.

How can you tell?

You look at the residuals, which are essentially the deviations of the observed counts from the expected counts.

I defined several residuals in the notes.

Simple rootogram residuals are (writing d for an observed bin count and e for the corresponding expected count)

r.simple = sqrt(d) - sqrt(e)

These are graphed by the rootogram function.

The double root residuals are defined by

DRR = sqrt(2+4d) - sqrt(1+4e).

How do I compute these?  Well, you already have computed the d's and the e's -- you can use R to compute the DRR's.
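
For example, if you saved the output of fit.gaussian in s, both types of residuals take one line each (d and e are just shorthand names for the observed and expected counts):

d = s$counts                               # observed counts
e = s$expected                             # expected counts
r.simple = sqrt(d) - sqrt(e)               # simple rootogram residuals
DRR = sqrt(2 + 4 * d) - sqrt(1 + 4 * e)    # double root residuals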

5.  Once I've computed the residuals and graphed them, am I done?  Well, the point of graphing the residuals is to see if they show any systematic pattern -- if there is a pattern, then the Gaussian curve is not a good fit.

6.  Well, we are almost done.  Part 1 (d) asks you to fit a G comparison curve to the root data.  (Boy, your instructor seems to like taking roots.)

All I'm asking here is to FIRST transform the data by a root, and then repeat all of the above with the root data.
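
In R this is just one extra line before the binning.  A sketch, with x again standing in for your raw data vector:

root.x = sqrt(x)   # transform the data by a root first ...
# ... then bin root.x, fit a Gaussian curve to it, and compute residuals as before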

7.  Lastly, in #2, I'm asking you to fit a G curve to the women's heights.  If you have loaded the studentdata dataset, then to get the female heights, you type

library(LearnEDA)
data(studentdata)
f.heights=studentdata$Height[studentdata$Gender=="female"]
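
If some heights are missing, you may also want to drop the NA values before you bin:

f.heights = f.heights[!is.na(f.heights)]   # drop any missing heights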

Good luck!



Tuesday, November 18, 2008

Learning about Lengths of Baseball Games

One complaint about baseball is that it is a relatively long game, compared to American football or basketball.

To understand more about the lengths of baseball games, I collected the lengths (in minutes) of all games played during the 2007 season.  Here are my questions:

1.  Are the times of baseball games normally distributed?
2.  If not, how do the times differ from a normal curve?

I'll illustrate the R work. You'll need a couple of packages: LearnEDA and vcd (for the rootogram function).

1.  First I'll read in the baseball game times, construct bins and the rootogram (this is essentially a histogram where you plot the roots of the counts instead of the counts).

library(LearnEDA)
data=read.table("game.time.07.txt",header=T)
attach(data)

# set up bins and construct a rootogram

bins=seq(100,300,by=20)
bin.mids=(bins[-1]+bins[-length(bins)])/2
h=hist(time,breaks=bins)

h$counts=sqrt(h$counts)
plot(h,xlab="TIME",ylab="ROOT FREQUENCY",main="")

[Figure: rootogram of the 2007 game times (TIME vs. ROOT FREQUENCY)]

2.  Now I want to fit a normal comparison curve.  I figure out the mean and standard deviation of the normal curve by using letter values.

# find mean and standard deviation of matching Gaussian curve

S=lval(time)                       # letter-value display of the game times
f=as.vector(S[2,2:3])              # lower and upper fourths (row 2 of the display)
m=as.integer((f[1]+f[2])/2)        # midpoint of the fourths estimates the mean
sd=as.integer((f[2]-f[1])/1.349)   # fourth-spread / 1.349 estimates the standard
                                   # deviation (a normal fourth-spread is 1.349 sd)

I get that the times are approximately N(175, 24).
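
As a quick sanity check, you could compare these resistant letter-value estimates with the ordinary (non-resistant) summaries; they should be close if the data are not too wild:

mean(time)   # ordinary mean of the game times
sd(time)     # ordinary standard deviation (this still calls the sd function,
             # even though we stored a number in the variable sd above)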

3.  Is this normal approximation ok?  The function fit.gaussian computes the expected count for each bin and the residuals sqrt(observed) - sqrt(expected).

# function fit.gaussian.R fits Gaussian curve to the counts

s=fit.gaussian(time,bins,m,sd)

# output observed and expected counts

output=round(data.frame(s$counts, sqrt(s$counts), s$probs, s$expected, 
     sqrt(s$expected), sqrt(s$counts)-sqrt(s$expected)),2)
names(output)=c("count","root","prob","fit","root fit","residual")
output

   count root prob   fit root fit residual
1      1 1.00 0.01  1.00     1.00     0.00
2      8 2.83 0.06  6.08     2.47     0.36
3     22 4.69 0.19 19.17     4.38     0.31
4     27 5.20 0.32 31.34     5.60    -0.40
5     20 4.47 0.27 26.60     5.16    -0.69
6     12 3.46 0.12 11.72     3.42     0.04
7      6 2.45 0.03  2.67     1.64     0.81
8      1 1.00 0.00  0.32     0.56     0.44
9      1 1.00 0.00  0.02     0.14     0.86
10     1 1.00 0.00  0.00     0.02     0.98


# place root expected counts on top of rootogram

lines(bin.mids,sqrt(s$expected),lwd=4,col="red")

[Figure: rootogram of game times with the root expected counts overlaid in red]

4.  The rootogram function plots the residuals.  Both plots below show the same information, but in different ways.

# load library vcd that contains rootogram function

library(vcd)

# again save histogram object

h=hist(time,breaks=bins,plot=F)

# illustrate "hanging" style of rootogram

r=rootogram(h$counts,s$expected,type="hanging")

[Figure: hanging rootogram of game times]

# illustrate "deviation" style of rootogram

r=rootogram(h$counts,s$expected,type="deviation") 

[Figure: deviation rootogram of game times]

What have we learned?  Although the lengths of baseball games are roughly normal in shape, we see some large positive residuals in the right tail (the observed counts there are larger than the fitted counts).

What does this mean?  Baseball game lengths have a long right tail, which means one tends to see many long games.  If you know anything about baseball, you know that a game goes into extra innings when it is tied after nine innings, and these extra-inning games produce the long right tail.

Sunday, November 9, 2008

Median Polishing Student Grades


Our department was recently interested in exploring the relationship between student class attendance and performance in a number of 100-level math classes.  We all know that missing math classes has an adverse effect on grades, but we wanted to learn more about this relationship.

For many students in four 100-level math classes, we collected each student's attendance (the percentage of classes attended) and performance (a percentage where 90% or above corresponds to A work, 80-90% to B work, and so on).  We categorized attendance into the groups 0-50%, 50-70%, 70-90%, and 90-100%, and then found the mean performance of the students in each combination of attendance category and course.  We got the following two-way table.

          Attendance Pct
COURSE   0-50 50-70 70-90 90-100  
---------------------------------       
MATH 112  61.9  68.0  68.8  75.5
MATH 115  54.3  71.8  76.6  83.3
MATH 122  56.1  64.8  73.1  78.1
MATH 126  50.6  66.3  73.9  78.1

I'll demonstrate median polish with this data.

1.  First, I have to get this data into matrix form in R.  I'll use the matrix command to form the matrix (by default, data are entered column by column) and then add row and column labels.

grade=matrix(c(61.9,54.3,56.1,50.6,
               68.0,71.8,64.8,66.3,
               68.8,76.6,73.1,73.9,
               75.5,83.3,78.1,78.1),nrow=4,ncol=4)

dimnames(grade)=list(c("MATH 112","MATH 115","MATH 122","MATH 126"),
                     c("0-50","50-70","70-90","90-100"))

To check that I've read in the data correctly, I'll display "grade":

> grade
         0-50 50-70 70-90 90-100
MATH 112 61.9  68.0  68.8   75.5
MATH 115 54.3  71.8  76.6   83.3
MATH 122 56.1  64.8  73.1   78.1
MATH 126 50.6  66.3  73.9   78.1

2.  Now I can implement median polish to get an additive fit.  I'll store the output in the variable "my.fit" and then I'll display the different components.

> my.fit=medpolish(grade)

Here's the common value.

> my.fit$overall
[1] 69.7

Here are the row effects.

> my.fit$row
MATH 112 MATH 115 MATH 122 MATH 126 
 -0.7375   4.5000   0.2875  -0.2875 

Here are the column effects.

> my.fit$col
     0-50     50-70     70-90    90-100 
-16.35000  -2.75625   2.75625   8.40000 
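
Since the additive fit says FIT = common value + row effect + column effect, you can reconstruct the whole table of fitted values in one line (the residuals in step 4 below are just the data minus this fit):

# fitted values = common value + row effect + column effect
fit = my.fit$overall + outer(my.fit$row, my.fit$col, "+")
round(fit, 1)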

I'll interpret this additive fit.

  • The average performance of these students is 69.7%.

  • Looking at the row effects, we see that MATH 115 students get grades that are 4.5 - 0.2875 ≈ 4.2 points higher than MATH 122 students.  MATH 122 students tend to be 0.2875 - (-0.2875) ≈ 0.6 points higher than MATH 126 students, and MATH 112 students are about half a point lower than MATH 126 students.

  • Looking at the column effects, we see a clear relationship between attendance and performance.  The best attenders (90-100%) do 8.4 - 2.75 = 5.65 points better on average than the 70-90% group, 8.4 - (-2.75) = 11.15 points better than the 50-70% group, and a whopping 8.4 - (-16.35) = 24.75 points better than the "no shows" (the under-50% attendance group).

3.  It might be helpful to plot this fit.  I wrote a function plot2way in the LearnEDA package:

> library(LearnEDA)
> plot2way(my.fit$overall+my.fit$row,my.fit$col,dimnames(grade)[[1]],
  dimnames(grade)[[2]])

Note that the plot2way function has four arguments:  the row part, the column part, the vector of names of the rows, and the vector of names of the columns.  Here's the figure.

[Figure: plot of the additive fit produced by plot2way]

Actually, this plot would look better if I had figured out how to rotate the figure so that the FIT lines are horizontal.  (I'll give extra credit points to anyone who can fix my function to do that.)

From this graph we see that the best performances are the MATH 115 students who attend over 90% of the classes; the worst performances are the MATH 112 students who have under 50% attendance.

4.  Are we done?  Not quite.  We have looked at the fit, but we have not looked at the residuals -- the differences between the performances and the fit.  They are stored in my.fit$residual -- I will round the values so they are easier to view.

> round(my.fit$residual)
         0-50 50-70 70-90 90-100
MATH 112    9     2    -3     -2
MATH 115   -4     0     0      1
MATH 122    2    -2     0      0
MATH 126   -2     0     2      0

I see one large residual: the 9 in the MATH 112 / 0-50 cell.  MATH 112 students who don't come to class (under 50% attendance) seem to do 9 points better than one would expect based on the additive fit.  This might deserve further study.