I bumped into a surprising article recently:
• Jonah Lehrer, The truth wears off: is there something wrong with the scientific method?, New Yorker, 13 December 2010.
It starts with a bit of a bang:
Before the effectiveness of a drug can be confirmed, it must be tested and tested again. Different scientists in different labs need to repeat the protocols and publish their results. The test of replicability, as it’s known, is the foundation of modern research. Replicability is how the community enforces itself. It’s a safeguard for the creep of subjectivity. Most of the time, scientists know what results they want, and that can influence the results they get. The premise of replicability is that the scientific community can correct for these flaws.
But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. This phenomenon doesn’t yet have an official name, but it’s occurring across a wide range of fields, from psychology to ecology. In the field of medicine, the phenomenon seems extremely widespread, affecting not only antipsychotics but also therapies ranging from cardiac stents to vitamin E and antidepressants: Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades.
This phenomenon does have a name now: it’s called the decline effect. The article tells some amazing stories about it. If you’re in the mood for some fun, I suggest going to your favorite couch or café now, and reading them!
One of them concerns the work of John Ioannidis. He took three prestigious medical journals and looked at the 49 most cited clinical research studies they had published. 45 of them used randomized controlled trials and reported positive results. But of the 34 that people tried to replicate, 41% were either directly contradicted or had their effect sizes significantly downgraded.
For more examples, read the article or listen to this radio show:
• Cosmic Habituation, Radiolab, May 3, 2011.
It’s a bit sensationalistic… but it’s fun. It features Jonathan Schooler, who discovered a famous effect in psychology called verbal overshadowing. It doesn’t really matter what this effect is. What matters is that it showed up very strongly in his first experiments… but as he and others continued to study it, it gradually diminished! He got freaked out. Then he looked around, and saw that this sort of decline was happening in lots of other cases, all over science.
What could cause this ‘decline effect’? There are lots of possible explanations.
At one extreme, maybe the decline effect doesn’t really exist. Maybe this sort of decline just happens sometimes purely by chance. Maybe there are equally many cases where effects seem to get stronger each time they’re measured!
At the other extreme, a very disturbing possibility has been proposed by Jonathan Schooler. He suggests that somehow the laws of reality change when they’re studied, in such a way that initially strong effects gradually get weaker.
I don’t believe this. It’s logically possible, but there are lots of less radical explanations to rule out first.
But if it were true, maybe we could make the decline effect go away by studying it. The decline effect would itself decline!
Unless of course, you started studying the decline of the decline effect.
Okay. On to some explanations that are interesting but less far-out.
One plausible explanation is significance chasing. Scientists work really hard to find something that’s ‘statistically significant’ according to the widely-used criterion of having a p-value of less than 0.05.
That sounds technical, but all it means is this: if the expected situation were true, there would have been less than a 5% chance of finding a deviation from it at least as big as the one you actually found.
(To play this game, you have to say ahead of time what the ‘expected situation’ is: this is your null hypothesis.)
Why is significance chasing dangerous? How can it lead to the decline effect?
Well, here’s how to write a paper with a statistically significant result. Go through 20 different colors of jelly bean and see if people who eat them have more acne than average. Even if no color does anything, each test has a 1-in-20 chance of a false alarm, since 0.05 = 1/20. So the chance that at least one of your 20 experiments says ‘yes’ with a p-value of less than 0.05 is 1 − 0.95^20, or about 64%. Report that one, and you have a statistically significant result!
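A quick way to see the danger, using nothing beyond Python’s standard library: simulate the jelly-bean lab. Under the null hypothesis a well-behaved p-value is uniformly distributed, so we can just draw random numbers (all the numbers here are purely illustrative):

```python
import random

random.seed(42)

# Under the null hypothesis, a well-behaved p-value is uniformly
# distributed on [0, 1].  Simulate many rounds of testing 20 jelly-bean
# colors, none of which actually causes acne, and count how often at
# least one color looks 'significant' at the 0.05 level.
rounds = 100_000
false_alarms = 0
for _ in range(rounds):
    p_values = [random.random() for _ in range(20)]
    if min(p_values) < 0.05:
        false_alarms += 1

rate = false_alarms / rounds
print(rate)   # close to 1 - 0.95**20, i.e. about 0.64
```

So roughly two times out of three, the lab gets a publishable ‘discovery’ out of pure noise.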
I took this example from Randall Munroe’s cartoon strip xkcd:
It’s funny… but it’s actually sad: some testing of drugs is not much better than this! Clearly a result obtained this way is junk, so when you try to replicate it, the ‘decline effect’ will kick in.
Another possible cause of the decline effect is publication bias: scientists and journals prefer positive results over null results, where no effect is found. And surely there are other explanations, too: for starters, all the ways people can fool themselves into thinking they’ve discovered something interesting.
For suggestions on how to avoid the evils of ‘publication bias’, try these:
• Jonathan Schooler, Unpublished results hide the decline effect, Nature 470 (2011), 437.
Putting an end to ‘significance chasing’ may require people to learn more about statistics:
• Geoff Cumming, Significant does not equal important: why we need the new statistics, 9 October 2011.
He explains the problem in simple language:
Consider a psychologist who’s investigating a new therapy for anxiety. She randomly assigns anxious clients to the therapy group, or a control group. You might think the most informative result would be an estimate of the benefit of therapy: the average improvement, as a number of points on the anxiety scale, together with the uncertainty in that estimate, which is the confidence interval around that average. But psychology typically uses significance testing rather than estimation.
Introductory statistics books often introduce significance testing as a step-by-step recipe:
Step 1. Assume the new therapy has zero effect. You don’t believe this and you fervently hope it’s not true, but you assume it.
Step 2. You use that assumption to calculate a strange thing called a ‘p value’, which is the probability that, if the therapy really has zero effect, the experiment would have given a difference as large as you observed, or even larger.
Step 3. If the p value is small, in particular less than the hallowed criterion of .05 (that’s 1 chance in 20), you are permitted to reject your initial assumption—which you never believed anyway—and declare that the therapy has a ‘significant’ effect.
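The three steps can be sketched in code. This is only an illustration, not anyone’s official procedure: it uses a simple two-sided z-test (a normal approximation, reasonable for largish samples), and the anxiety scores are invented for the example:

```python
import math
import statistics

def two_sample_p_value(group_a, group_b):
    """Steps 1-3 as code.  Step 1: assume the true difference is zero.
    Step 2: compute how likely a difference this large (or larger)
    would be under that assumption, via a two-sided z-test.
    Step 3 happens below, where we compare the result to 0.05."""
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    z = mean_diff / se
    # P(|Z| >= |z|) for a standard normal Z, via the error function:
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Invented anxiety-improvement scores, purely for illustration:
therapy = [12, 9, 11, 14, 8, 10, 13, 9, 11, 12]
control = [8, 10, 7, 9, 11, 6, 8, 10, 7, 9]

p = two_sample_p_value(therapy, control)
significant = p < 0.05   # Step 3: the hallowed criterion
print(p, significant)
```

Notice that the number you end up reporting says nothing about how big the improvement is, only how surprising it would be if there were none.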
If that’s confusing, you’re in good company. Significance testing relies on weird backward logic. No wonder countless students every year are bamboozled by their introduction to statistics! Why this strange ritual, they ask, and what does a p value actually mean? Why don’t we focus on how large an improvement the therapy gives, and whether people actually find it helpful? These are excellent questions, and estimation gives the best answers.
For half a century distinguished scholars have published damning critiques of significance testing, and explained how it hampers research progress. There’s also extensive evidence that students, researchers, and even statistics teachers often don’t understand significance testing correctly. Strangely, the critiques of significance testing have hardly prompted any defences by its supporters. Instead, psychology and other disciplines have simply continued with the significance testing ritual, which is now deeply entrenched. It’s used in more than 90% of published research in psychology, and taught in every introductory textbook.
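Estimation, the alternative Cumming advocates, is easy to sketch too. Again a hedged illustration with invented data: instead of delivering a significant/not-significant verdict, report the average improvement together with an approximate 95% confidence interval, using the same normal approximation a large-sample z-test would use:

```python
import math
import statistics

def mean_diff_ci(group_a, group_b, z_crit=1.96):
    """Report an estimate instead of a verdict: the average improvement
    plus an approximate 95% confidence interval (normal approximation,
    reasonable for largish samples)."""
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    se = math.sqrt(statistics.variance(group_a) / len(group_a)
                   + statistics.variance(group_b) / len(group_b))
    return diff - z_crit * se, diff + z_crit * se

# Invented anxiety-improvement scores, purely for illustration:
therapy = [12, 9, 11, 14, 8, 10, 13, 9, 11, 12]
control = [8, 10, 7, 9, 11, 6, 8, 10, 7, 9]

low, high = mean_diff_ci(therapy, control)
print(low, high)   # the whole interval lies above zero
```

The reader learns not just that the effect is ‘significant’ but how big it plausibly is: here, somewhere around one to four points on the anxiety scale.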
For more discussion and references, try my co-blogger:
• Tom Leinster, Fetishizing p-values, n-Category Café.
He gives some good examples of how significance testing can lead us astray. Anyone who uses the p-test should read these! He also discusses this book:
• Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance, University of Michigan Press, Ann Arbor, 2008. (Online summary here.)
Now, back to the provocative title of that New Yorker article: “Is there something wrong with the scientific method?”
The answer is yes if we mean science as actually practiced, now. Lots of scientists are using cookbook recipes they learned in statistics class without understanding them, or investigating the alternatives. Worse, some are treating statistics as a necessary but unpleasant piece of bureaucratic red tape, and then doing whatever it takes to achieve the appearance of a significant result!
This is a bit depressing. There’s a student I know, who is taking an introductory statistics course. After she read about this stuff she said:
So, what I’m gleaning here is that what I’m studying is basically bull. It struck me as bull to start with, admittedly, but since my grade depended on it, I grinned and swallowed. At least my eyes are open now, I guess.
But there’s some good news, buried in her last sentence. Science has the marvelous ability to notice and correct its own mistakes. It’s scientists who noticed the decline effect and significance chasing. They’ll eventually figure out what’s going on, and learn how to fix any mistakes that they’ve been making. So ultimately, I don’t find this story depressing. It’s actually inspiring!
The scientific method is not a fixed rulebook handed down from on high. It’s a work in progress. It’s only been around for a few centuries—not very long, in the grand scheme of things. The widespread use of statistics in science has been around for less than one century. And computers, which make heavy-duty number-crunching easy, have only been cheap for 30 years! No wonder people still use primitive cookbook methods for analyzing data, when they could do better.
So science is still evolving. And I think that’s fun, because it means we can help it along. If you see someone claim their results are statistically significant, you can ask them what they mean, exactly… and what they had to do to get those results.
I thank a lot of people on Google+ for discussions on this topic, including (but not limited to) John Forbes, Roko Mijic, Heather Vandagriff, and Willie Wong.