Monday, December 1, 2008

Why do we flog?

I just posted an example of using flogs to compare two years of placement scores.  But some of you may be confused about why we take flogs in the first place.  Here are the main points.

1.  We want to compare two proportions, say p1 and p2.  There are two problems with a direct comparison like p1/p2.

a)  This measure depends on whether we consider the proportions p1, p2 or the proportions 1-p1, 1-p2, and that's a problem.  The choice of p1, p2 or 1-p1, 1-p2 shouldn't change our comparison.

b)  Small proportions tend to have smaller variation than proportions close to 0.5.  So the ratio p1/p2 = 1.5 is more meaningful (significant) if the p's are close to zero than if the p's are close to 0.5.

So we want to transform a proportion p so that 

-- it doesn't matter if we consider p or 1-p (switching should change only the sign of the transformed value)
-- the variance of the transformed p will be roughly the same for p near 0 and p near 0.5.

By using the flog (folded log) reexpression log(p/(1-p)), where log is the natural log, we achieve these two goals.
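To see the first point numerically, here is a small Python sketch.  It is not part of the original analysis, and the proportions 0.15 and 0.10 are made up purely for illustration.  It compares the ratio computed from p with the ratio computed from 1-p, and then makes the same comparison on the flog scale.

```python
import math

def flog(p):
    """Folded log of a proportion p, with 0 < p < 1 (natural log)."""
    return math.log(p / (1 - p))

# Hypothetical proportions, chosen only for illustration.
p1, p2 = 0.15, 0.10

# The ratio depends on which side of the split we look at.
ratio_p = p1 / p2                # about 1.5
ratio_q = (1 - p1) / (1 - p2)    # about 0.94

# On the flog scale, switching p to 1 - p only flips the sign.
diff_p = flog(p1) - flog(p2)
diff_q = flog(1 - p1) - flog(1 - p2)

print(f"ratio using p:      {ratio_p:.3f}")
print(f"ratio using 1 - p:  {ratio_q:.3f}")
print(f"flog difference using p:      {diff_p:+.3f}")
print(f"flog difference using 1 - p:  {diff_q:+.3f}")
```

The two ratios tell different stories (a 50% increase versus a 6% decrease), while the two flog differences have the same size and opposite signs.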

By the way, the flog reexpression is the basis for logistic regression models.  You will likely be using logistic regression in your statistical life, but we are trying here to motivate why we use the flog reexpression.


Flogging placement data

Here's an illustration of reexpressing proportion data.

Every summer, a math placement test is given to over 3000 entering freshmen.  The score on this test is used to determine which math course they are allowed to take in the fall.

The score on the placement test is a course number that indicates that the student can take that course in the fall.  Here are the counts of the placement scores for the freshmen in the last six years:

     2003 2004 2005 2006 2007 2008
131H   33   54   51   57   44   41
131   196  364  342  361  320  251
130   192  245  236  248  211  208
128   557  707  647  700  603  618
126   428  518  489  480  442  428
122   501  612  580  565  464  498
215   661  912  888  838  792  747
112   207  216  230  208  212  180
095   418  524  545  419  407  335
090    46   78   89   58   62   46

How can we compare the scores for different years?  A first step is to compute the percentages within each column.

     2003 2004 2005 2006 2007 2008
131H  1.0  1.3  1.2  1.4  1.2  1.2
131   6.1  8.6  8.3  9.2  9.0  7.5
130   5.9  5.8  5.8  6.3  5.9  6.2
128  17.2 16.7 15.8 17.8 17.0 18.4
126  13.2 12.2 11.9 12.2 12.4 12.8
122  15.5 14.5 14.2 14.4 13.0 14.9
215  20.4 21.6 21.7 21.3 22.3 22.3
112   6.4  5.1  5.6  5.3  6.0  5.4
095  12.9 12.4 13.3 10.7 11.4 10.0
090   1.4  1.8  2.2  1.5  1.7  1.4

We see in 2003 that 6.1% of the students placed in MATH 131 and 20.4% placed in MATH 215.
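As a check on the percentage table, here is a minimal Python sketch for the 2003 column.  The counts are copied from the table of counts; the variable names are only for this illustration.

```python
# Placement counts for 2003, copied from the table of counts.
courses = ["131H", "131", "130", "128", "126", "122", "215", "112", "095", "090"]
counts_2003 = [33, 196, 192, 557, 428, 501, 661, 207, 418, 46]

# Each percentage is the row count divided by the column total.
total = sum(counts_2003)
percents_2003 = [100 * c / total for c in counts_2003]

for course, pct in zip(courses, percents_2003):
    print(f"{course:>4} {pct:5.1f}")
```

The same calculation, repeated for each year, gives the other columns.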

Let's follow Tukey's strategy for comparing percentage vectors.  To make this simple to explain, let's focus on comparing the percentages of 2006 and 2007.

     2006 2007
131H  1.4  1.2
131   9.2  9.0
130   6.3  5.9
128  17.8 17.0
126  12.2 12.4
122  14.4 13.0
215  21.3 22.3
112   5.3  6.0
095  10.7 11.4
090   1.5  1.7

1.  First we cut the data by some row.  Let's try cutting the data after the second row.

     2006 2007
131H  1.4  1.2
131   9.2  9.0
---------------
130   6.3  5.9
128  17.8 17.0
126  12.2 12.4
122  14.4 13.0
215  21.3 22.3
112   5.3  6.0
095  10.7 11.4
090   1.5  1.7

2.  We compute a folded log for each year.  For 2006, we see that 1.4 + 9.2 = 10.6% are above the line and 100 - 10.6 = 89.4% are below the line, so the flog is

FLOG for 2006 = log(10.6/89.4) = -2.13

Likewise the flog for 2007 is given by

FLOG for 2007 = log(10.2/89.8) = -2.18

3.  To compare the years 2006 and 2007, we look at the difference in flogs:

Change in FLOG from 2006 to 2007 is -2.18 - (-2.13) = -0.05

The interpretation is that the 2007 students did about 0.05 worse than the 2006 students on the flog scale.
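The three steps for this one cut can be sketched in Python as follows.  This is only an illustration using the rounded percentages from the table; the helper name flog is my own choice.

```python
import math

def flog(pct_above):
    """Folded log from the percentage (0-100) of students above the cut."""
    return math.log(pct_above / (100 - pct_above))

# Rounded percentages for the top two rows (131H and 131).
above_2006 = 1.4 + 9.2   # 10.6% above the line in 2006
above_2007 = 1.2 + 9.0   # 10.2% above the line in 2007

flog_2006 = flog(above_2006)
flog_2007 = flog(above_2007)

print(f"FLOG for 2006 = {flog_2006:.2f}")
print(f"FLOG for 2007 = {flog_2007:.2f}")

# Subtracting the two-decimal values reproduces the hand calculation.
change = round(flog_2007, 2) - round(flog_2006, 2)
print(f"Change in FLOG = {change:.2f}")
```

If the flogs are not rounded before subtracting, the change comes out closer to -0.04, so the second decimal place here is sensitive to rounding.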

What if we cut the table by a different row?  We can repeat the procedure using all possible cuts.

Here is the table of flogs:

       2006  2007
 [1,] -4.22 -4.38
 [2,] -2.13 -2.17
 [3,] -1.59 -1.65
 [4,] -0.63 -0.70
 [5,] -0.12 -0.18
 [6,]  0.46  0.35
 [7,]  1.56  1.44
 [8,]  1.98  1.88
 [9,]  4.20  4.03

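To understand these values, look at the 2006 column: -4.22 is the flog if we cut after the first row, -2.13 is the flog if we cut after the second row, etc.  These flogs are computed from the unrounded proportions, so the 2007 flog for the second cut (-2.17) differs slightly from the -2.18 we found above using rounded percentages.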
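Here is one way the table of flogs could be computed, as a Python sketch working directly from the 2006 and 2007 counts.  The function name and data layout are assumptions made for this illustration, not the code that produced the table.

```python
import math
from itertools import accumulate

# Counts for 2006 and 2007, copied from the table of counts (rows 131H to 090).
counts = {
    2006: [57, 361, 248, 700, 480, 565, 838, 208, 419, 58],
    2007: [44, 320, 211, 603, 442, 464, 792, 212, 407, 62],
}

def flogs_for_all_cuts(column):
    """Flog log(above / below) for a cut after each row except the last."""
    total = sum(column)
    running = list(accumulate(column))[:-1]   # number of students above each cut
    return [math.log(above / (total - above)) for above in running]

flogs = {year: flogs_for_all_cuts(col) for year, col in counts.items()}

for i, (a, b) in enumerate(zip(flogs[2006], flogs[2007]), start=1):
    print(f"cut after row {i}: {a:6.2f} {b:6.2f}")
```

With ten rows there are nine possible cuts, which is why the table has nine rows.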

To compare the years 2006 and 2007 at every cut, we again look at the difference in flogs:

     2007 FLOG - 2006 FLOG
 [1,] -0.16
 [2,] -0.04
 [3,] -0.06
 [4,] -0.07
 [5,] -0.06
 [6,] -0.11
 [7,] -0.12
 [8,] -0.10
 [9,] -0.17

What have we learned?  Note that all nine flog differences are negative and the median flog difference is -0.10.  So no matter where we cut the table, the 2007 students did a little worse than the 2006 students.
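The differences and their median can be checked with a short Python sketch.  Again this is an illustration under my own naming, working from the 2006 and 2007 counts rather than the rounded flogs.

```python
import math
import statistics
from itertools import accumulate

# Counts for 2006 and 2007, copied from the table of counts.
counts_2006 = [57, 361, 248, 700, 480, 565, 838, 208, 419, 58]
counts_2007 = [44, 320, 211, 603, 442, 464, 792, 212, 407, 62]

def flogs_for_all_cuts(column):
    """Flog log(above / below) for a cut after each row except the last."""
    total = sum(column)
    return [math.log(above / (total - above))
            for above in list(accumulate(column))[:-1]]

# 2007 flog minus 2006 flog, one value per cut.
diffs = [b - a for a, b in zip(flogs_for_all_cuts(counts_2006),
                               flogs_for_all_cuts(counts_2007))]

print([round(d, 2) for d in diffs])
print("all negative:", all(d < 0 for d in diffs))
print("median:", round(statistics.median(diffs), 2))
```

Using the median rather than the mean keeps the summary from being pulled around by the extreme cuts, where very few students fall on one side of the line.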