Exploratory Data Analysis: December 2008

I just posted an example of using flogs to compare two years of placement scores. But some of you may be confused about why we take flogs in the first place. Here are the main points.

1. We want to compare two proportions, say p1 and p2. There are two problems with a direct comparison like p1/p2.

a) This measure depends on whether we consider the proportions p1, p2 or the proportions 1-p1, 1-p2, and that's a problem. The choice of p1, p2 or 1-p1, 1-p2 shouldn't change our comparison.

b) Small proportions tend to have smaller variation than large proportions (close to 0.5). The ratio p1/p2 = 1.5 is more meaningful (significant) if the p's are close to zero than if the p's are close to 0.5.

So we want to transform a proportion p so that

-- it doesn't matter if we consider p or 1-p

-- the variance of the transformed p will be roughly the same for p near 0 and p near 0.5.

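The flog (folded log) reexpression log(p/(1-p)) achieves these two goals. In particular, replacing p by 1-p only flips the sign of the flog, so the size of a comparison is unchanged.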

By the way, the flog reexpression is the basis for logistic regression models. You will likely be using logistic regression in your statistical life, but we are trying here to motivate why we use the flog reexpression.
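Here's a small Python sketch of point (a), to make the symmetry concrete. It is not part of the original analysis: the proportions 0.15 and 0.10 are made up for illustration, and the function name flog is my own. It uses the natural log, which is what the worked numbers later in this post use.

```python
import math

def flog(p):
    """Folded log of a proportion p, as defined in the text: log(p / (1 - p))."""
    return math.log(p / (1 - p))

# Hypothetical proportions, chosen only for illustration.
p1, p2 = 0.15, 0.10

# The ratio gives a different answer when we switch from p to 1 - p.
ratio_p = p1 / p2                  # about 1.5
ratio_q = (1 - p1) / (1 - p2)      # about 0.94

# A difference of flogs only changes sign when we switch from p to 1 - p.
diff_p = flog(p1) - flog(p2)
diff_q = flog(1 - p1) - flog(1 - p2)

print(ratio_p, ratio_q)
print(diff_p, diff_q)
```

The two ratios tell different stories, while the two flog differences are the same size with opposite signs.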

Here's an illustration of reexpressing proportion data.

Every summer, a math placement test is given to over 3000 entering freshmen. The score on this test is used to determine which math course they are allowed to take in the fall.

The score on the placement test is the number of the course that the student is allowed to take in the fall. Here are the counts of the placement scores for the freshmen in the last six years:

Course   2003  2004  2005  2006  2007  2008
131H       33    54    51    57    44    41
131       196   364   342   361   320   251
130       192   245   236   248   211   208
128       557   707   647   700   603   618
126       428   518   489   480   442   428
122       501   612   580   565   464   498
215       661   912   888   838   792   747
112       207   216   230   208   212   180
095       418   524   545   419   407   335
090        46    78    89    58    62    46

How can we compare the scores for different years? A first step is to convert each column to percentages of that year's total.

Course   2003  2004  2005  2006  2007  2008
131H      1.0   1.3   1.2   1.4   1.2   1.2
131       6.1   8.6   8.3   9.2   9.0   7.5
130       5.9   5.8   5.8   6.3   5.9   6.2
128      17.2  16.7  15.8  17.8  17.0  18.4
126      13.2  12.2  11.9  12.2  12.4  12.8
122      15.5  14.5  14.2  14.4  13.0  14.9
215      20.4  21.6  21.7  21.3  22.3  22.3
112       6.4   5.1   5.6   5.3   6.0   5.4
095      12.9  12.4  13.3  10.7  11.4  10.0
090       1.4   1.8   2.2   1.5   1.7   1.4

We see in 2003 that 6.1% of the students placed in MATH 131 and 20.4% placed in MATH 215.
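If you want to reproduce the percentage table, here is one way to do it in Python. This is just a sketch: the counts are copied from the table of counts above, and the names counts, courses, and column_percentages are mine.

```python
# Placement counts by year, in the row order of the counts table.
courses = ["131H", "131", "130", "128", "126", "122", "215", "112", "095", "090"]
counts = {
    "2003": [33, 196, 192, 557, 428, 501, 661, 207, 418, 46],
    "2004": [54, 364, 245, 707, 518, 612, 912, 216, 524, 78],
    "2005": [51, 342, 236, 647, 489, 580, 888, 230, 545, 89],
    "2006": [57, 361, 248, 700, 480, 565, 838, 208, 419, 58],
    "2007": [44, 320, 211, 603, 442, 464, 792, 212, 407, 62],
    "2008": [41, 251, 208, 618, 428, 498, 747, 180, 335, 46],
}

def column_percentages(column):
    """Convert one year's counts to percentages of that year's total."""
    total = sum(column)
    return [round(100 * c / total, 1) for c in column]

percents = {year: column_percentages(col) for year, col in counts.items()}

# Print one row per course, one column per year.
for i, course in enumerate(courses):
    row = "  ".join(f"{percents[year][i]:5.1f}" for year in counts)
    print(f"{course:5s} {row}")
```

Because each entry is rounded to one decimal place, a column may add up to slightly more or less than 100.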

Let's follow Tukey's strategy for comparing percentage vectors. To make this simple to explain, let's focus on comparing the percentages of 2006 and 2007.

Course   2006  2007
131H      1.4   1.2
131       9.2   9.0
130       6.3   5.9
128      17.8  17.0
126      12.2  12.4
122      14.4  13.0
215      21.3  22.3
112       5.3   6.0
095      10.7  11.4
090       1.5   1.7

1. First we cut the data by some row. Let's try cutting the data after the second row.

Course   2006  2007
131H      1.4   1.2
131       9.2   9.0
-------------------
130       6.3   5.9
128      17.8  17.0
126      12.2  12.4
122      14.4  13.0
215      21.3  22.3
112       5.3   6.0
095      10.7  11.4
090       1.5   1.7

2. We compute a folded log (flog) for each year. For 2006, we see that 1.4 + 9.2 = 10.6% are above the line and 100 - 10.6 = 89.4% are below the line, so the flog is

FLOG for 2006 = log(10.6/89.4) = -2.13

Likewise the flog for 2007 is given by

FLOG for 2007 = log(10.2/89.8) = -2.18

3. To compare the years 2006 and 2007, we look at the difference in flogs:

Change in FLOG from 2006 to 2007 is -2.18 - (-2.13) = -0.05

The interpretation is that, on the flog scale, the 2007 students did 0.05 worse than the 2006 students: a slightly smaller share of them placed into MATH 131 or higher.
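Steps 1 to 3 can be sketched in a few lines of Python. The percentages come from the 2006 and 2007 columns above; the function name flog_at_cut and its cut argument (the number of rows above the line) are my own choices, not anything from Tukey. Like the hand calculation, the sketch takes "below the line" to be 100 minus the percentage above, and it rounds each flog to two decimals before subtracting.

```python
import math

# Percentages for 2006 and 2007, in row order (131H first, 090 last).
pct_2006 = [1.4, 9.2, 6.3, 17.8, 12.2, 14.4, 21.3, 5.3, 10.7, 1.5]
pct_2007 = [1.2, 9.0, 5.9, 17.0, 12.4, 13.0, 22.3, 6.0, 11.4, 1.7]

def flog_at_cut(percents, cut):
    """Flog when the table is cut after the first `cut` rows."""
    above = sum(percents[:cut])   # percent above the line
    below = 100 - above           # percent below the line
    return math.log(above / below)

# Cut after the second row, as in the text.
flog_2006 = round(flog_at_cut(pct_2006, 2), 2)
flog_2007 = round(flog_at_cut(pct_2007, 2), 2)
change = round(flog_2007 - flog_2006, 2)

print(flog_2006, flog_2007, change)
```

Subtracting the unrounded flogs instead gives a change closer to -0.04, so the second decimal here is sensitive to rounding.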

What if we cut the table by a different row? We can repeat the procedure using all possible cuts.

Here is the table of flogs:

Cut     2006   2007
[1,]   -4.22  -4.38
[2,]   -2.13  -2.17
[3,]   -1.59  -1.65
[4,]   -0.63  -0.70
[5,]   -0.12  -0.18
[6,]    0.46   0.35
[7,]    1.56   1.44
[8,]    1.98   1.88
[9,]    4.20   4.03

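To understand these values, -4.22 is the 2006 flog if we cut after the first row, -2.13 is the 2006 flog if we cut after the second row, etc. These flogs are computed from the raw counts rather than from the rounded percentages, which is why the second-row value for 2007 is -2.17 here instead of the -2.18 we found earlier.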

To compare the years 2006 and 2007, we look at the difference in flogs:

Cut    2007 FLOG - 2006 FLOG
[1,]   -0.16
[2,]   -0.04
[3,]   -0.06
[4,]   -0.07
[5,]   -0.06
[6,]   -0.11
[7,]   -0.12
[8,]   -0.10
[9,]   -0.17

What have we learned? Note that all of the flog differences are negative and the median flog difference is -0.10. So it is clear the 2007 students did a little worse than the 2006 students.
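Here's a Python sketch of the all-cuts calculation, in case you want to try it on another pair of years. It works from the raw 2006 and 2007 counts in the first table, which is an assumption on my part about how the flog table was produced, though the rounded results do agree with it. The name flogs_all_cuts is mine.

```python
import math
from statistics import median

# Raw placement counts for 2006 and 2007, in row order (131H first, 090 last).
counts_2006 = [57, 361, 248, 700, 480, 565, 838, 208, 419, 58]
counts_2007 = [44, 320, 211, 603, 442, 464, 792, 212, 407, 62]

def flogs_all_cuts(counts):
    """Flog for every possible cut: after row 1, after row 2, ..., after row n-1."""
    total = sum(counts)
    flogs = []
    above = 0
    for c in counts[:-1]:          # no cut after the last row
        above += c                 # running count above the line
        flogs.append(math.log(above / (total - above)))
    return flogs

flogs_2006 = flogs_all_cuts(counts_2006)
flogs_2007 = flogs_all_cuts(counts_2007)

# 2007 flog minus 2006 flog at each cut.
diffs = [f07 - f06 for f06, f07 in zip(flogs_2006, flogs_2007)]

print([round(f, 2) for f in flogs_2006])
print([round(f, 2) for f in flogs_2007])
print([round(d, 2) for d in diffs])
print(round(median(diffs), 2))
```

Taking the median of the differences, rather than the mean, keeps the summary from being pulled around by the extreme cuts at the top and bottom of the table, where the counts are small.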

Monday, December 1, 2008

Why do we flog?

Flogging placement data
