The horror!

Can it really be? That which has been hammered into me for years as the scientific standard; that which I assumed to be at the very basis of science, linked tightly with the concept of falsifiable hypotheses, the very core of the much-lauded scientific method – is wrong?

I thought 0.05 was the golden rule of science, the objective measure of significance. In fact, the 5% rule became so ingrained that over time I had come to think of 5% as a universal threshold of significance for everything. I mean, if a 5% threshold is good enough for science, then surely it is good enough for me. For example, if something had a 5% discount, I wouldn’t consider it a significant enough difference from the regular price. If a shortcut was only 5% shorter, it wasn’t worth the trouble.

Only now to find out that it is arbitrary, meaningless, useless. That each study should be evaluated and judged on its own peculiarities; that there can be no universal threshold of significance for statistical results, only relative support for a given hypothesis, one piece of evidence out of many possibilities. My world will never be the same.

But I get the feeling I’m not the only one. Reading through many scientific papers (especially ecological ones), although I have only a cursory understanding of statistics, I often got the sense that the authors have at most a slightly better idea than I do of why they’re reporting these statistics and p values, and are often just following convention without really knowing why. Many simply state the test statistic and “p<0.05”, without any follow-up. At most, they are giddy (in the usual, reserved way befitting serious scientific writing) over a very small p value, and saddened by a large one.

Here I have to go on a bit of a tangent. Following conventions without understanding why has always been a pet peeve of mine. In everything from cultural traditions to wet lab procedures, I have always asked ‘why’. And if my mentors/supervisors couldn’t give a satisfying answer (“because everybody does it this way” is just about as unsatisfying as “because god says so”), I had to go and figure it out for myself. I had to convince myself that what we were doing made logical sense and could be justified scientifically. I would look up the reference, work through the calculations myself, read up on the product’s website for concentrations, instructions, and so on. But until now I have failed to apply the same standard to statistics. I would have been perfectly happy just running my data through whatever test someone else recommended and reporting the p value, unthinkingly. Statistics has just been a dense, incomprehensible alphabet soup. As a fellow graduate student put it: “it’s button-pushing magic!” – just another button in the statistical software package.

One thing that I find still somewhat missing from all of these discussions is a clear and practical example that illustrates the difference between an arbitrary threshold and a biologically meaningful one, and what considerations are needed to determine a meaningful critical effect size given a particular study population and research question. To paraphrase a well-known expression, one concrete example is worth a thousand words of abstract advice. Of course, what is meaningful would be different for every case, but one specific example would help clarify the general idea.
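For what it’s worth, here is one toy sketch of the kind of example I have in mind (pure Python, with invented numbers and a simple two-sample z-test, not anything from a real study): a difference of 0.05 standard deviations, presumably trivial in most biological contexts, sails past p<0.05 once the sample is large enough, which is exactly why an arbitrary threshold can’t substitute for thinking about what effect size actually matters.

```python
import math

def two_sided_p(z):
    """Two-sided p value for a z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))

def z_for_mean_difference(d, n_per_group):
    """z statistic for a difference of d standard deviations between two
    groups of n observations each, assuming a known sd of 1 in both groups."""
    return d / math.sqrt(2.0 / n_per_group)

# A biologically trivial difference: 0.05 standard deviations.
d = 0.05
for n in (100, 1_000, 10_000):
    p = two_sided_p(z_for_mean_difference(d, n))
    print(f"n = {n:>6} per group: p = {p:.4f}")
# n =    100 per group: p ≈ 0.72   (nowhere near "significant")
# n =  1_000 per group: p ≈ 0.26
# n = 10_000 per group: p ≈ 0.0004 (same trivial effect, now p << 0.05)
```

The effect never changes; only the sample size does. Whether 0.05 standard deviations is worth caring about is a biological question the p value cannot answer.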

The readings over the past couple of weeks have helped clear away a lot of the clouds; certain concepts are starting to come together into a coherent picture. After Hurlbert, I certainly have a much more specific idea of what a p value really can and cannot tell us. And I will never look at p>0.05 the same way again.

At least now, when my genetics students show me a high p value and ask, “so we accept the null hypothesis, right?”, I make sure to correct them: “you failed to reject the null.” But I don’t go into it any further. Maybe one day in the future they too will be enlightened.
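If I ever do go further with them, a quick simulation might help. The sketch below (my own toy example in Python, not anything from the course readings) draws many small “studies” in which the null is genuinely false, since the two group means really differ by half a standard deviation, and counts how often a permutation test still returns p > 0.05. With only ten observations per group, most studies “fail to reject” a false null, which is precisely why a high p value is not evidence that the null is true.

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def perm_p(a, b, n_perm=2000):
    """Two-sided permutation p value for a difference in group means:
    the fraction of random relabelings whose mean difference is at
    least as extreme as the one observed."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# A real effect (means differ by 0.5 sd) that a small study often misses.
misses = 0
trials = 50  # kept small so the simulation runs quickly
for _ in range(trials):
    a = [random.gauss(0.0, 1.0) for _ in range(10)]
    b = [random.gauss(0.5, 1.0) for _ in range(10)]
    if perm_p(a, b) > 0.05:
        misses += 1
print(f"{misses}/{trials} small studies 'failed to reject' a null that is false")
```

A high p value here mostly reflects low power, not a true null; “failed to reject” is exactly the right phrase.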

Reference:

Hurlbert, S. H. & Lombardi, C. M. (2009) Annales Zoologici Fennici 46: 311–349.

Deborah Mayo is the best person I know of for articulating why we do frequentist statistics the way we do (well, the way we *should* do them; Mayo is an acute critic of the sort of mindlessness you discuss). See her blog at http://errorstatistics.com/, and see her 2011 paper with Aris Spanos (linked on her blog) for a short introduction to her approach, which she calls “error statistics”.