Bad Statistics in D.C.? How can this be?
There was an article I caught days ago that I’m just now getting around to. I hope no one has any expectations that I’m ever going to blog anything in real time. Anyway, there was a big teabagger (suppresses childish giggles) rally in D.C. on September 12th. Someone had the rather clever idea of using the daily D.C. Metro (light rail) usage statistics to estimate the number of people at the rally. That someone also just happens to be blatantly dishonest, as pointed out by Jesse Taylor at Pandagon.
David Freddoso compares the light rail usage for September 12, 2009, to the average for the summer. This is nothing more than a dishonest move to artificially inflate his estimate; there is no compelling reason to use this as a point of comparison. Jesse had a much better idea: use the same day the previous year. But this still isn’t exactly what you want. You want a set of days that fall on the same day of the week, had similar weather, and had some number of local events drawing a net crowd similar to that of the events taking place on 9/12/2009 other than the teabagger (snicker) rally. Average the ridership over those days and you will have a number truly worthy of comparison.
But you’re still not done. You need to know how confident you can be in your number. David points out (correctly) that not all attendees need take the metro, so the actual attendance could be higher. Jesse points out (also correctly) that there could have been other extra people on the light rail that day. We’ve attempted to factor the latter into our estimate of average ridership for a day with similar conditions, but it’s not perfect. We need to know how uncertain we should be about our number, for higher or lower results.
And thus, the standard deviation. That equation up in the title bar compares each data point to the mean, squares the result so that it’s always positive, averages over the number of data points, and then takes the square root to get the typical spread of the data points about the mean. We could then go on to find the standard error (the standard deviation divided by the square root of the number of data points), which would tell us how well we know the mean itself, but what we’re interested in here is just the standard deviation.
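For anyone reading this somewhere the title bar doesn’t show up, here is the formula as I’ve described it in words. The symbols are my own choice for this sketch: x_i is the ridership on the i-th comparison day, x-bar is their mean, and N is the number of comparison days. This is the version that divides by N, matching the description above; the version that divides by N - 1 is also in common use.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
```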
One standard deviation describes the range around the mean in which roughly 68% of the data points should fall, assuming the data are approximately normally distributed. So if the 9/12/2009 point is within one deviation of the mean, then it tells us that the teabagger (chortle) rally wasn’t a significant contribution to the DC Metro traffic, at least no more than the usual variations between similar days without the rally. The further it is away from the mean, the less likely the traffic could be said to be typical for those conditions aside from the rally.
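To make the comparison concrete, here is a minimal sketch in Python of how I’d run the numbers. Everything in it is an assumption for illustration: the function name rally_excess is mine, and the ridership figures are made up, not real Metro data. It computes the mean and standard deviation of the comparison days and then reports how many standard deviations the rally day sits from that mean.

```python
import statistics


def rally_excess(baseline, observed):
    """Return (mean, sigma, z) for one day's ridership against comparable days.

    baseline: ridership counts for comparable days without the rally
    observed: ridership count on the day of interest
    """
    mean = statistics.fmean(baseline)
    # Population standard deviation: divides by N, as described in the text.
    sigma = statistics.pstdev(baseline)
    # Number of standard deviations the observed day sits from the mean.
    z = (observed - mean) / sigma
    return mean, sigma, z


# Hypothetical numbers for illustration only, NOT real Metro ridership.
baseline = [350_000, 362_000, 341_000, 355_000, 347_000]
observed = 437_000

mean, sigma, z = rally_excess(baseline, observed)
print(f"mean = {mean:.0f}, sigma = {sigma:.0f}, z = {z:.1f}")
print(f"excess riders over a typical comparable day: {observed - mean:.0f}")
```

If z comes out around 1 or less, the day looks like any other comparable day. If it is much larger, the excess over the mean is a rough estimate of the extra riders, with sigma as the uncertainty on it.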
So if you want to use those statistics, that is how I think you should do it. I’m sure you could also cook up some way of using counting (Poisson) statistics, but that sounds harder to me.