The Decline Effect

I bumped into a surprising article recently:

• Jonah Lehrer, Is there something wrong with the scientific method?, New Yorker, 13 December 2010.

It starts with a bit of a bang:

Before the effectiveness of a drug can be confirmed, it must be tested and tested again. Different scientists in different labs need to repeat the protocols and publish their results. The test of replicability, as it’s known, is the foundation of modern research. Replicability is how the community enforces itself. It’s a safeguard for the creep of subjectivity. Most of the time, scientists know what results they want, and that can influence the results they get. The premise of replicability is that the scientific community can correct for these flaws.

But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. This phenomenon doesn’t yet have an official name, but it’s occurring across a wide range of fields, from psychology to ecology. In the field of medicine, the phenomenon seems extremely widespread, affecting not only antipsychotics but also therapies ranging from cardiac stents to vitamin E and antidepressants: Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades.

This phenomenon does have a name now: it’s called the decline effect. The article tells some amazing stories about it. If you’re in the mood for some fun, I suggest going to your favorite couch or café now, and reading them!

For example: John Ioannidis is the author of the most heavily downloaded paper in the open-access journal PLoS Medicine. It’s called Why most published research findings are false.

In a related study, Ioannidis took three prestigious medical journals and looked at the 49 most cited clinical research studies. 45 of them used randomized controlled trials and reported positive results. But of the 34 that people tried to replicate, 41% were either directly contradicted or had their effect sizes significantly downgraded.

For more examples, read the article or listen to this radio show:

Cosmic Habituation, Radiolab, May 3, 2011.

It’s a bit sensationalistic… but it’s fun. It features Jonathan Schooler, who discovered a famous effect in psychology, called verbal overshadowing. It doesn’t really matter what this effect is. What matters is that it showed up very strongly in his first experiments… but as he and others continued to study it, it gradually diminished! He got freaked out. Then he looked around, and saw that this sort of decline happens all over the place.

What could cause this ‘decline effect’? There are lots of possible explanations.

At one extreme, maybe the decline effect doesn’t really exist. Maybe this sort of decline just happens sometimes purely by chance. Maybe there are equally many cases where effects seem to get stronger each time they’re measured!

At the other extreme, a very disturbing possibility has been proposed by Jonathan Schooler. He suggests that somehow the laws of reality change when they’re studied, in such a way that initially strong effects gradually get weaker.

I don’t believe this. It’s logically possible, but there are lots of less radical explanations to rule out first.

But if it were true, maybe we could make the decline effect go away by studying it. The decline effect would itself decline!

Unless of course, you started studying the decline of the decline effect.

Okay. On to some explanations that are interesting but less far-out.

One plausible explanation is significance chasing. Scientists work really hard to find something that’s ‘statistically significant’ according to the widely-used criterion of having a p-value of less than 0.05.

That sounds technical, but basically all it means is this: if the expected situation really held, there would be at most a 5% chance of finding a deviation from it at least as big as the one you found.

(To play this game, you have to say ahead of time what the ‘expected situation’ is: this is your null hypothesis.)
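
To make this concrete, here is a minimal sketch of a p-value calculation. The setup is my own toy example, not something from the article: the null hypothesis is "this coin is fair", and the observation is 60 heads in 100 flips. The function name `binomial_p_value` is made up for illustration.

```python
from math import comb


def binomial_p_value(heads: int, flips: int) -> float:
    """One-sided p-value under the null hypothesis of a fair coin.

    Returns the probability of seeing at least `heads` heads
    in `flips` tosses if heads and tails are equally likely.
    """
    # Count the outcomes that are at least as extreme as the observation.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    # Every one of the 2**flips sequences is equally likely under the null.
    return favorable / 2 ** flips


p = binomial_p_value(60, 100)
print(f"p = {p:.4f}")
print("statistically significant" if p < 0.05 else "not significant")
```

With these numbers the p-value comes out below 0.05, so the result would count as 'statistically significant'. Note what the number does and doesn't say: it is the chance of data this extreme given a fair coin, not the chance that the coin is fair given the data.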

Why is significance chasing dangerous? How can it lead to the decline effect?

Well, here’s how to write a paper with a statistically significant result. Go through 20 different colors of jelly bean and see if people who eat them have more acne than average. There’s a good chance that one of your experiments will say ‘yes’ with a p-value of less than 0.05, just because 0.05 = 1/20. If so, this experiment gives a statistically significant result!

I took this example from Randall Munroe’s cartoon strip xkcd.

It’s funny… but it’s actually sad: some testing of drugs is not much better than this! Clearly a result obtained this way is junk, so when you try to replicate it, the ‘decline effect’ will kick in.
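
How good is that 'good chance'? Here is a rough sketch, assuming the 20 jelly bean tests are independent and that no color has any real effect. Under those assumptions each test's p-value is uniformly distributed between 0 and 1, which is what the simulation uses. The names `ALPHA`, `N_COLORS` and `simulate` are just for this illustration.

```python
import random

ALPHA = 0.05    # significance threshold
N_COLORS = 20   # number of jelly bean colors tested

# Exact answer: one minus the chance that all 20 tests come up 'not significant'.
p_any = 1 - (1 - ALPHA) ** N_COLORS


def simulate(trials: int = 20000, seed: int = 0) -> float:
    """Fraction of simulated 20-color studies with at least one false positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # With no real effect, each p-value is a uniform random number in [0, 1).
        if any(rng.random() < ALPHA for _ in range(N_COLORS)):
            hits += 1
    return hits / trials


print(f"exact:     {p_any:.3f}")
print(f"simulated: {simulate():.3f}")
```

The exact value is about 0.64, so under these assumptions you'd get at least one publishable 'discovery' roughly two times out of three.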

Another possible cause of the decline effect is publication bias: scientists and journals prefer positive results over null results, where no effect is found. And surely there are other explanations, too: for starters, all the ways people can fool themselves into thinking they’ve discovered something interesting.
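
Here is a toy simulation of how publication bias alone could produce something that looks like a decline effect. Everything in it is an assumption made for illustration: a small true effect of 0.2, a standard error of 0.15 for every study, normally distributed estimates, and journals that publish only first results that are positive and significant. It is not a model taken from any of the papers mentioned here.

```python
import random
from statistics import mean


def simulate_decline(true_effect=0.2, std_error=0.15, n_labs=20000, seed=1):
    """Return (mean published first estimate, mean replication estimate)."""
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% threshold, normal approximation
    published, replications = [], []
    for _ in range(n_labs):
        first = rng.gauss(true_effect, std_error)
        # Only 'significant' positive first results make it into print...
        if first / std_error > z_crit:
            published.append(first)
            # ...and each published result later gets one unfiltered replication.
            replications.append(rng.gauss(true_effect, std_error))
    return mean(published), mean(replications)


first_mean, replication_mean = simulate_decline()
print(f"average published effect:   {first_mean:.2f}")
print(f"average replication effect: {replication_mean:.2f}")
```

In this sketch the published estimates are inflated because only the lucky draws pass the filter, while the replications average out near the true effect. Nothing about reality changed between the first study and the second: the 'decline' comes entirely from the filter.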

For suggestions on how to avoid the evils of ‘publication bias’, try these:

• Jonathan Schooler, Unpublished results hide the decline effect, Nature 470 (2011), 437.

Putting an end to ‘significance chasing’ may require people to learn more about statistics:

• Geoff Cumming, Significant does not equal important: why we need the new statistics, 9 October 2011.

He explains the problem in simple language:

Consider a psychologist who’s investigating a new therapy for anxiety. She randomly assigns anxious clients to the therapy group, or a control group. You might think the most informative result would be an estimate of the benefit of therapy – the average improvement as a number of points on the anxiety scale – together with the amount that’s the confidence interval around that average. But psychology typically uses significance testing rather than estimation.

Introductory statistics books often introduce significance testing as a step-by-step recipe:

Step 1. Assume the new therapy has zero effect. You don’t believe this and you fervently hope it’s not true, but you assume it.

Step 2. You use that assumption to calculate a strange thing called a ‘p value’, which is the probability that, if the therapy really has zero effect, the experiment would have given a difference as large as you observed, or even larger.

Step 3. If the p value is small, in particular less than the hallowed criterion of .05 (that’s 1 chance in 20), you are permitted to reject your initial assumption—which you never believed anyway—and declare that the therapy has a ‘significant’ effect.

If that’s confusing, you’re in good company. Significance testing relies on weird backward logic. No wonder countless students every year are bamboozled by their introduction to statistics! Why this strange ritual, they ask, and what does a p value actually mean? Why don’t we focus on how large an improvement the therapy gives, and whether people actually find it helpful? These are excellent questions, and estimation gives the best answers.

For half a century distinguished scholars have published damning critiques of significance testing, and explained how it hampers research progress. There’s also extensive evidence that students, researchers, and even statistics teachers often don’t understand significance testing correctly. Strangely, the critiques of significance testing have hardly prompted any defences by its supporters. Instead, psychology and other disciplines have simply continued with the significance testing ritual, which is now deeply entrenched. It’s used in more than 90% of published research in psychology, and taught in every introductory textbook.
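
To show what the 'estimation' approach in this passage looks like in practice, here is a minimal sketch. The improvement scores are made up, the function name `mean_difference_ci` is hypothetical, and the interval uses the normal approximation (z = 1.96). With samples this small, an interval based on the t distribution would be somewhat wider.

```python
from math import sqrt
from statistics import mean, stdev


def mean_difference_ci(treated, control, z=1.96):
    """Estimate the benefit of therapy with an approximate 95% confidence interval."""
    diff = mean(treated) - mean(control)
    # Standard error of the difference between two independent sample means.
    se = sqrt(stdev(treated) ** 2 / len(treated)
              + stdev(control) ** 2 / len(control))
    return diff, (diff - z * se, diff + z * se)


# Hypothetical improvements, in points on an anxiety scale.
therapy = [8, 5, 9, 4, 7, 6, 10, 3, 7, 6]
control = [4, 3, 6, 2, 5, 4, 5, 1, 4, 3]

diff, (low, high) = mean_difference_ci(therapy, control)
print(f"estimated benefit: {diff:.1f} points, 95% CI ({low:.1f}, {high:.1f})")
```

Instead of a bare yes-or-no verdict, this reports how big the benefit seems to be and how uncertain that estimate is.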

For more discussion and references, try my co-blogger:

• Tom Leinster, Fetishizing p-values, n-Category Café.

He gives some good examples of how significance testing can lead us astray. Anyone who uses the p-test should read these! He also discusses this book:

• Stephen T. Ziliak and Deirdre N. McCloskey, The Cult of Statistical Significance, University of Michigan Press, Ann Arbor, 2008. (Online summary here.)

Now, back to the provocative title of that New Yorker article: “Is there something wrong with the scientific method?”

The answer is yes if we mean science as actually practiced, now. Lots of scientists are using cookbook recipes they learned in statistics class without understanding them, or investigating the alternatives. Worse, some are treating statistics as a necessary but unpleasant piece of bureaucratic red tape, and then doing whatever it takes to achieve the appearance of a significant result!

This is a bit depressing. There’s a student I know, who is taking an introductory statistics course. After she read about this stuff she said:

So, what I’m gleaning here is that what I’m studying is basically bull. It struck me as bull to start with, admittedly, but since my grade depended on it, I grinned and swallowed. At least my eyes are open now, I guess.

But there’s some good news, buried in her last sentence. Science has the marvelous ability to notice and correct its own mistakes. It’s scientists who noticed the decline effect and significance chasing. They’ll eventually figure out what’s going on, and learn how to fix any mistakes that they’ve been making. So ultimately, I don’t find this story depressing. It’s actually inspiring!

The scientific method is not a fixed rulebook handed down from on high. It’s a work in progress. It’s only been around for a few centuries—not very long, in the grand scheme of things. The widespread use of statistics in science has been around for less than one century. And computers, which make heavy-duty number-crunching easy, have only been cheap for 30 years! No wonder people still use primitive cookbook methods for analyzing data, when they could do better.

So science is still evolving. And I think that’s fun, because it means we can help it along. If you see someone claim their results are statistically significant, you can ask them what they mean, exactly… and what they had to do to get those results.

I thank a lot of people on Google+ for discussions on this topic, including (but not limited to) John Forbes, Roko Mijic, Heather Vandagriff, and Willie Wong.

43 Responses to The Decline Effect

  1. Hudson Luce says:

    Was there any study of funding bias effects, i.e. studies paid for by pharmaceutical companies to show that their product was safe and effective, which find said products to be safe and effective, even if they’re unsafe or even dangerous?

    • John Baez says:

      I’m sure there must have been – we can look for them. This is an incredibly important effect, which for the purposes of this short article I was lumping under ‘publication bias’: the bias toward publishing so-called ‘successful’ experiments, meaning those that get the result you want. The example of testing different colors of jelly beans is apparently not much worse than the way some people look for genes that are linked to cancer, or drugs that cure some disease.

  2. John Baez says:

    Here’s another quick intro to publication bias:

    • J. M. Scholey and J. E. Harrison, Publication bias: raising awareness of a potential problem in dental research, British Dental Journal 194 (2003), 235–237.

    You don’t need to care about dentistry: the findings are very general.

    By the way, people have even looked for publication bias in studies of publication bias:

    • Hans-Hermann Dubben and Hans-Peter Beck-Bornholdt, Systematic review of publication bias in studies on publication bias, BMJ 331 (3 June 2005).

    They didn’t find an… umm… statistically significant effect, perhaps because of a small sample. But it’s interesting to see how they looked for it.

  3. John Baez says:

    Here are some comments over on Google+. I’m rather sad about the splintering of discussion with different people talking in different places. So, for your reading pleasure:

    Betsy McCall – From the article, it sounds like it’s most prevalent in fields where there is a highly complex system being studied: medicine, psychology, ecology, etc., that are poorly understood. Poorly relative to, say, physics. Controlling responses of the body, brain or environment are a bit like controlling electrons with a club.

    Matt McIrvin – Isn’t this, explained as a consequence of significance chasing, just the classical phenomenon of “regression to the mean”? Linear regression was actually named after it, so it’s hardly radical statistics.

    John Baez – If you listen to the RadioLab show, you’ll hear some discussion of “regression to the mean”.

    Besides just that, I think we can imagine that at first people are very eager to publish studies confirming an exciting new effect. But then, when that gets to be old hat, people become more interested in casting doubt on it, or perhaps just studying it more objectively.

    Matt McIrvin – I think Jonathan Schooler’s cosmic habituation effect and Rupert Sheldrake’s morphogenetic fields are exactly canceling each other out, resulting in a world governed by probability theory.

    Miguel Angel – Just to point out that in the ABC podcast you linked, Geoff Cumming admits that significance testing and estimation are equivalent (“they are based on the same statistical models, so you can translate between the two” [starts at 11:00]), however, for psychological reasons, estimation is less prone to errors of judgment and understanding.

    Matt McIrvin – The problem I see the most often, especially in mass-media reports, is people explicitly or implicitly taking a tiny p-value as the probability that the effect is not real, which is wrong for all sorts of reasons. (I guess the whole terminology of a “confidence level” tends to lead one in that direction.)

    Matt Austern – You know the story about replication of the Millikan oil drop experiment, I assume.

    aimee whitcroft – Talking of dodgy/inappropriate use of stats, did you see Ben Goldacre’s post about bad stats in neuroscience?

    Matt McIrvin – Feynman liked to show people the Millikan one. (I also remember my own classroom attempts at replication of the Millikan oil-drop experiment, which mostly served as demonstrations of how high I could jump when subjected to electric shock.)

    Jane Shevtsov – I think what’s going on is a mixture of significance chasing and pushing the hypothesis outside its range of applicability. The latter is perfectly reasonable — it’s how we know where the limits are. I think this happened with some of the studies in Lehrer’s article.

    • John Baez says:

      Feynman describes a “gradual convergence to the truth” after Millikan’s original experiment, with each experiment being in the error bars of the previous one:

      We have learned a lot from experience about how to handle some of the ways we fool ourselves. One example: Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It’s a little bit off because he had the incorrect value for the viscosity of air. It’s interesting to look at the history of measurements of the charge of an electron, after Millikan. If you plot them as a function of time, you find that one is a little bit bigger than Millikan’s, and the next one’s a little bit bigger than that, and the next one’s a little bit bigger than that, until finally they settle down to a number which is higher.

      Why didn’t they discover the new number was higher right away? It’s a thing that scientists are ashamed of – this history – because it’s apparent that people did things like this: When they got a number that was too high above Millikan’s, they thought something must be wrong – and they would look for and find a reason why something might be wrong. When they got a number close to Millikan’s value they didn’t look so hard. And so they eliminated the numbers that were too far off, and did other things like that…

      This is a bit similar to the “decline effect”, but different.

      There’s also another aspect to this story: Millikan discarded a lot of data when carrying out his Nobel-prize winning experiment:

      … later inspection of Millikan’s lab notebooks by historians and scientists has revealed that between February and April 1912, he took data on many more oil drops than he reported in the paper. This is troubling, since the August 1913 paper explicitly states at one point, “It is to be remarked, too, that this is not a selected group of drops, but represents all the drops experimented upon during 60 consecutive days.” However, at another point in the paper he writes that the 58 drops reported are those “upon which a complete series of observations were made.” Furthermore, the margins of his notebook contain notes such as, “beauty publish” or “something wrong.”

      Did Millikan deliberately disregard data that didn’t fit the results he wanted? Perhaps because he was under pressure from a rival and eager to make his mark as a scientist, Millikan misrepresented his data. Some have called this a clear case of scientific fraud. However, other scientists and historians have looked closely at his notebooks, and concluded that Millikan was striving for accuracy by reporting only his most reliable data, not trying to deliberately mislead others. For instance, he rejected drops that were too big, and thus fell too quickly to be measured accurately with his equipment, or too small, which meant they would have been overly influenced by Brownian motion. Some drops don’t have complete data sets, indicating they were aborted during the run.

      It’s difficult to know today whether Millikan intended to misrepresent his results, though some scientists have examined Millikan’s data and calculated that even if he had included all the drops in his analysis, his measurement for the elementary charge would not have changed much at all.

  4. Just looking at this article today: Significance testing as perverse probabilistic reasoning

    In their survey, 93% of medical diagnostics professionals answered this question incorrectly:

    Consider a typical medical research study, for example designed to test the efficacy of a drug, in which a null hypothesis H0 (‘no effect’) is tested against an alternative hypothesis H1 (‘some effect’). Suppose that the study results pass a test of statistical significance (that is P-value <0.05) in favor of H1. What has been shown?
    1. H0 is false.
    2. H1 is true.
    3. H0 is probably false.
    4. H1 is probably true.
    5. Both (1) and (2).
    6. Both (3) and (4).
    7. None of the above.

  5. Jeff Tansley says:

    In my opinion there is nothing wrong with the scientific method. It is people who are the problem.

    Every now and again this issue gets raised. Bertrand Russell’s writings are littered with such well expressed revelations. “People would rather die than think” etc, etc. You can even get a Nobel Prize in Economics for it.

    Why should we assume the Placebo effect just works in drug studies?

    What to do about it? Just keep hammering away and hoping. I can’t even guarantee the truth will out.

    • John Baez says:

      Jeff wrote:

      In my opinion there is nothing wrong with the scientific method.

      Well, for this to mean much, one needs to say what “the scientific method” is. I was using it to mean science as actually practiced, which changes with time. If you mean science as done right, your statement might be a tautology… or maybe you could write a big book about how to do it right.

      It’s not so easy to know what that book should say. For example, nowadays the scientific method involves a lot of p-tests, but that wasn’t always true, and someday p-tests may be less widely used, due to their limitations. As I said:

      The scientific method is not a fixed rulebook handed down from on high. It’s a work in progress. It’s only been around for a few centuries—not very long, in the grand scheme of things. The widespread use of statistics in science has been around for less than one century. And computers, which make heavy-duty number-crunching easy, have only been cheap for 30 years! No wonder people still use primitive cookbook methods for analyzing data, when they could do better.

      • Jeff Tansley says:

        Points taken – however occasionally it may be worth stating what’s important in current scientific method.

        For me it’s (a) transparent publication & access to information on experiments and data (b) a comprehensible model and (c) a forum like this for arguing the toss.

      • Tim van Beek says:

        I don’t like to use the phrase “scientific method”, and not just because I liked the book “Against Method” by Paul Feyerabend.

        Feyerabend inspired the work of Karin Knorr-Cetina, who wrote some interesting field studies about how scientists actually work (which is very very different from what is written about epistemology and the “scientific method” on Wikipedia :-)

        See, for example, “Epistemic Cultures: How the Sciences Make Knowledge”.

  6. Tim van Beek says:

    Here is a personal anecdote: While studying physics in Göttingen, we had to perform an experiment. Point a Geiger counter to a wall, measure the hits, and confirm that the time between hits has a Poisson distribution.

    Our data said that the hypothesis of the Poisson distribution was false. The lab assistant wanted us to repeat the experiment. I tried to explain to her for two hours, roughly, that according to the level of statistical significance one group in twenty was expected to reject the Poisson distribution. She did not understand it, but gave up from exhaustion.

    • Tim van Beek says:

      Maybe I should explain that the “lab assistant” was a postdoc with a PhD in experimental physics from the physics faculty of the university of Göttingen.

      My confidence in the results of experimental physics has much decreased due to this experience (and hasn’t recovered, I have quite a few more anecdotes along these lines).

  7. Tobias Fritz says:

    In particle physics, “significance chasing” seems to go under the name “look-elsewhere effect”. Particle physicists have apparently come up with ways to take it into account. Tommaso Dorigo has a blog post about this, which points to this paper.

  8. neil clutterbuck says:

    I work in a clinical area (audiology) and am also a “manager”. In both areas I am faced with making decisions on limited data (audiology because the new hearing aids are obsolete before independent, controlled, randomised blah blah studies are published…and managers always have to decide before all the data is available). So in both areas I have to decide NOW, not wait for 95% confidence. Try crossing the road sometime – you’ll do it my way, or stay where you are without your chicken.

    Neil Clutterbuck

  9. davidtweed says:

    Note that in the straightforward case of multiple hypotheses specified in advance there are techniques such as “The Bonferroni Adjustment” (another good name for an airport thriller) and variants to attempt to account for this effect. However, as is common with most statistics, it’s based on assumptions about the experimental situation, and there’s always a tradeoff between making conservative assumptions about the experimental situation and your ability to “detect weak signals”. It’s not per se wrong to act on weaker statistical support than, e.g., “statistical significance”, particularly in exploratory phases, but it’s important to be aware that’s what you’re doing with respect to the statistical support and not distort things.
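
As a rough sketch of the simplest form of the Bonferroni adjustment mentioned above: with m hypotheses tested at overall level alpha, each individual p-value is compared to alpha / m instead of alpha. The function name `bonferroni` and the p-values below are invented for illustration, echoing the jelly bean example with one 'lucky' color out of 20.

```python
def bonferroni(p_values, alpha=0.05):
    """Return, for each p-value, whether it survives the Bonferroni adjustment."""
    # Divide the overall significance level evenly among all the tests.
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical p-values: one color looks 'significant', the other 19 do not.
p_values = [0.030] + [0.40] * 19

unadjusted = [p < 0.05 for p in p_values]
adjusted = bonferroni(p_values)

print("significant without adjustment:", sum(unadjusted))
print("significant with Bonferroni:   ", sum(adjusted))
```

Here the lone p = 0.03 result passes the usual 0.05 cutoff but not the adjusted cutoff of 0.0025. The price is exactly the tradeoff described above: a genuinely weak signal would be thrown out too.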

  10. Jeff Tansley says:

    Obviously Tim should have continued – increasing the sample of experiments (joke).

    Seriously what are we looking at here – a people problem or is the ‘decline effect’ the discovery of a system that unambiguously defies the common experience of the laws of physics?

    I am inclined toward the former, but this is not to say I would deny the existence of complex systems with strange behaviour. However we should continually remind ourselves how we can be fooled ‘a la Nic Taub’. I guess for most people on this list this is not a problem – as JB would say, even amusing. The issue is Tim’s Göttingen RA. Here is someone with a belief system that is rather fixed. She is not alone.

  11. Thomas says:

    “a very disturbing possibility” – funnily there is an old SciFi novel about this: (you know after whom “Vecherovsky” is modeled)

    • Thomas says:

      Hmm … even funnier: What if such a thing occurred in mathematics? How would it show up and what would it tell about mathematics?

    • Todd Trimble says:

      you know after whom “Vecherovsky” is modeled

      No, I don’t. Velikovsky?

      • Frederik De Roo says:

        FYI, I was curious too. You can directly google “(you know after whom “Vecherovsky” is modeled)”. The first hit, which led me here (!) already explains it – after a new search on that page (or some tedious scrolling instead).

        • Todd Trimble says:

          Well, I found the cryptic comment by the same Thomas on the Café page, but I confess I’m still unenlightened, and now feeling both stupid and irritated as a result. “As most readers of this blog surely know” … groan. Is there some member of the “Grothendieck school” with a Russian surname that should leap to mind? I’m drawing a blank.

        • Todd Trimble says:

          Thomas very kindly let me in on the secret by email, and I don’t feel so stupid now for not having guessed, although perhaps I should anyway. :-) As a partial hint, the real person being alluded to is still alive, which explains why Thomas chose to be slightly cryptic here. :-)

  12. There seems to be quite a blend in Lehrer’s article between cases of what is known to be far from best practice (amounting in some cases to malpractice), cases where more caution could be hoped for in difficult young sciences (psychology, ethology, ecology), and cases where good caution is shown in mature sciences. Where he writes

    …from the disappearing benefits of second-generation antipsychotics to the weak coupling ratio exhibited by decaying neutrons, which appears to have fallen by more than ten standard deviations between 1969 and 2001. Even the law of gravity hasn’t always been perfect at predicting real-world phenomena. (In one test, physicists measuring gravity by means of deep boreholes in the Nevada desert found a two-and-a-half-per-cent discrepancy between the theoretical predictions and the actual data.) Despite these findings, second-generation antipsychotics are still widely prescribed, and our model of the neutron hasn’t changed. The law of gravity remains the same,

    he surely doesn’t think the rational response to those borehole tests is to throw out the law of gravity (do we take that to be the Newtonian approximation to General Relativity?).

    • Bruce Bartlett says:

      I’m fascinated by that weak coupling claim. Physicists generally like to think they don’t manhandle statistics. But ten standard deviations? Does anyone know anything more about that point?

    • John Baez says:

      I don’t think Lehrer is a philosopher or historian of science, but he’s quite good for a journalist. By the way, I remember this episode:

      Even the law of gravity hasn’t always been perfect at predicting real-world phenomena. (In one test, physicists measuring gravity by means of deep boreholes in the Nevada desert found a two-and-a-half-per-cent discrepancy between the theoretical predictions and the actual data.)

      According to Wikipedia:

      Occasionally, physicists have postulated the existence of a fifth force in addition to the four known fundamental forces. The force is generally believed to have roughly the strength of gravity (i.e. it is much weaker than electromagnetism or the nuclear forces) and to have a range of anywhere from less than a millimeter to cosmological scales.

      The idea is difficult to test, because gravity is such a weak force: the gravitational interaction between two objects is only significant when one has a great mass. Therefore, it takes very precise equipment to measure gravitational interactions between objects that are small compared to the Earth. Nonetheless, in the late 1980s a fifth force, operating on municipal scales (i.e. with a range of about 100 meters), was reported by researchers (Fischbach et al.) who were reanalyzing results of Loránd Eötvös from earlier in the century. The force was believed to be linked with hypercharge.

      Neutrons have different hypercharge than protons, so a fifth force depending on hypercharge would have different strengths for different kinds of matter: light elements have a smaller fraction of their mass in neutrons than heavy elements. Also, a fifth force carried by a particle of nonzero mass would look like an inverse-square force law at short distances but decay exponentially at large distances, with a range inversely proportional to the mass of that particle.

      So, theories of a fifth force can mimic deviations from the equivalence principle (objects of the same inertial mass produce the same gravitational force) and also deviations from the inverse square law for gravity.

      So when such deviations seemed to show up, theorists went wild!

      So did experimentalists. But it’s much easier to write a theoretical paper than to do a careful measurement of the force of gravity. So, as usual, theorists come across looking like the excitable teenagers, with experimentalists being the adults of the family:

      Australian researchers, attempting to measure the gravitational constant deep in a mine shaft, found a discrepancy between the predicted and measured value, with the measured value being two percent too small. They concluded that the results may be explained by a repulsive fifth force with a range from a few centimetres to a kilometre. Similar experiments have been carried out onboard a submarine (USS Dolphin (AGSS-555)) while deeply submerged. A further experiment measuring the gravitational constant in a deep borehole in the Greenland ice sheet found discrepancies of a few percent, but it was not possible to eliminate a geological source for the observed signal.

      Some experiments used a lake and a 320m high tower. A comprehensive review suggested there is no compelling evidence for the fifth force, though scientists still search for it.

      The “comprehensive review”, called “Six years of the fifth force”, was written by Fischbach and Talmadge in 1992. I can’t find a free online version, but there’s a free version of their later paper, “Ten years of the fifth force”.

      It would be an interesting episode for a historian of science. I don’t think you can say that physicists lightly shrugged off the experimental discrepancy! But these experiments are hard to do, so if various teams get inconsistent results, with many seeing no effect, and the theory they’re questioning is well-established, most physicists wind up deciding that experimental errors were to blame.

  13. Scott McKuen says:

    Geoff Cumming’s excerpt reminds me of the thing I find most interesting about the “reject the null” viewpoint in hypothesis testing: in some sense, it’s a surrender to intuitionism. We analyze some data and see a statistical reason to claim NOT(null hypothesis). Since the null hypothesis is a placeholder for NOT(alternate hypothesis), we have to fall back on the principle of double negation to declare an effect.

  14. don't contact me says:

    Do you think that Planck’s constant will suffer from the decline effect?

  15. An article that touches upon aspects of this study has just been published in Molecular Systems Biology with the provocative title: The self-assessment trap: can we all be better than average?

    The article covers several interesting interactions between authors, editors, and journal policies. For any search or analytic program, many journals require authors to compare their program against others and demonstrate that their approach works better. This of course leads to publication of comparisons that show the superiority of the submitted program. As in Lake Wobegon, all programs are above average.

    On the application side, if your desire is to build support for your favorite hypothesis (a la the xkcd cartoon) or more modestly trying to publish a positive result, you will tend to choose among several different analytic tools the ones that favor your desired outcome. No malice or hidden agendas need be lurking beneath the surface. Ideally, statistical analysis should be used to help us check our confirmation biases and our supercharged ability to see patterns. Problems will always arise with large, complex, dynamic and self-referential data sets. For me, these analyses are merely guide posts and not final proofs to be enshrined in the Hall of Science. A colleague once quipped that these large data sets and their analyses are not hypothesis proving machines but rather are hypothesis generating machines.

    For me, statistical analysis, regardless of the outcome, is always beneficial since it requires examination of the key assumptions underlying any system: are the assumptions required for the analysis to work as advertised valid? When we overlook these, we are headed towards fooling ourselves yet again, but this time with a veneer of sophistication that can mask shaky foundations.

  16. Thomas says:

    The “decline effect” reminds me a bit of my work as a private tutor with students: one always has to take care, especially with the quick-brained students, that instead of learning they don’t just adapt to the tutor and the situation. If one misses such a transition, the students stop really learning or drift away in other ways (e.g. the tutor and the situation may become boring). That is very nicely described here:

    mcl-5-prev.pdf

    In the “decline effect”, one doesn’t have students and learning, but other complex, sensitive systems with a natural tendency to adapt. But in most cases, I guess it is just thoughtlessness produced by bad science teaching.

  17. Bruce Bartlett says:

    Say it ain’t so John, say it ain’t so!

    • Thomas says:

      “It struck me as bull to start with, admittedly, but since my grade depended on it, I grinned and swallowed.” sounds to me like a description of the switching from “learning” to “adapting” I had in mind. “At least my eyes are open now” sounds to me more like a strengthening of it (leading to “treating statistics as a necessary but unpleasant piece of bureaucratic red tape, and then doing whatever it takes to achieve the appearance of a significant result”), than like a switching back.

      • John Baez says:

        No, the person who wrote this is strong-willed and will not merely “adapt”. She will, however, do what it takes to pass the course!

      • The frustrated and quoted statistics student says:

        Hi Thomas: I have spent a significant amount of my life rebelling, and I have the abysmal GPA with little income to show it, so I’m very motivated to be a ‘good student’. I am halfway through the semester, I’ve memorized two formulas, and none – zero – of the problems I’ve been presented have required the usage of those formulae, nor has my instructor managed to give me any idea of a standard or typical situation in which to utilize them, or any standard methodology I should use in applying them. I have flung my book at the wall several times, and am just learning it from YT at this point. Frustrated. MUCH. I don’t understand why, and I’m not getting how, and so yeah I’ve given up trying to make any sense of what I’m being told in class.

  18. Florifulgurator says:

    “the disappearing benefits of second-generation antipsychotics” reminds me of a psychiatric case (or two, if not three from hearsay) I’ve accompanied for a year: Severe bipolar disorder with psychotic episodes of hypomania. “Psychotic” according to court documents. The poor clueless docs (I’ve witnessed the patient verbally ripping the balls off one) administered psychopharmaca worth about 1000€/month. Instead, some simple valium would have been of much more help to chill the patient down from hypomania (incl. and above all the pseudopsychotic episodes due to simple but severe sleep deprivation). Of course (…) instead a complete breakdown had to be suffered with an inevitably ensuing severe episode of clinical depression. But the story could have ended worse, n-th time. A genius witch doctor managed to smuggle a joint (0 €) into the closed ward of that asylum, instantly curing the patient. Alas, the patient didn’t get the point, entering an (n+1)-th asylum round due to THC abuse during hypomania.

  19. In most controlled trials of medical treatments you’re almost certain beforehand that the null hypothesis is wrong, so rejecting it actually adds no new information. In most cases the assumption that the treatment is no different than placebo can be rejected as absurd. You wouldn’t be interested enough to perform the experiment if you didn’t already have significant evidence against the null hypothesis.

    Null hypothesis testing is a worthless ritual that lends false authority to the results… What we really want to know is the strength of the effect, severity of side effects, and if it’s better than existing medical treatments.

  20. This one came out last year:

    “Retractions in the scientific literature: is the incidence of research fraud increasing?”

    Total papers retracted per year have increased sharply over the decade (r=0.96; p<0.001), as have retractions specifically for fraud (r=0.89; p<0.001).

    Another one of those weird situations where they may be castigating others based on poor statistics of their own. For example, the total number of papers may have increased dramatically over the past decade, so it’s possible that the retraction rate as a proportion of papers published has stayed the same. Don’t know since I can’t get at the full text.

  21. Tom Leinster says:

    In our discussion at the n-Category Café, Lou Jost mentioned a really excellent (and entertaining) note on the habitual abuse of statistics:

    Jacob Cohen, The earth is round (p < 0.05). American Psychologist 49 (1994), no. 12, 997-1003.

    It’s short, and a great read.

  22. Steve Wenner says:

    I’m an applied statistician with 35 years’ experience trying to explain these concepts to various types of clients. The jargon “statistically significant” was invented by Sir Ronald Fisher in the early 20th century to describe his concept of hypothesis testing. By his choice of terminology he condemned millions of statistics students and users of statistics to misunderstanding and confusion. I think if he had simply used the phrase “an effect is ‘statistically detectable’ (or not) in the data at hand” most of this confusion would have been avoided.

    By the way, I think every statistician would agree that confidence intervals (error bars) are much more informative and intuitive than p-values, and these concepts can be regarded as technically interchangeable if computed appropriately (i.e., given an estimate and p-value we can compute the confidence interval, or vice versa).
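The interchangeability mentioned above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the comment itself: the estimate is taken to be normally distributed, the p-value is two-sided, and the null hypothesis is "the effect is 0". The function names `ci_from_p` and `p_from_ci` are made up for this sketch.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def ci_from_p(estimate, p_value, level=0.95):
    """Recover a confidence interval from an estimate and a two-sided p-value.

    Assumes a normal sampling distribution and a null value of 0.
    Requires 0 < p_value < 1 and a nonzero estimate.
    """
    # The observed z-score is |estimate| / standard error.
    z_obs = _STD_NORMAL.inv_cdf(1 - p_value / 2)
    std_err = abs(estimate) / z_obs
    # Critical value for the requested level (about 1.96 for 95%).
    z_crit = _STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    return estimate - z_crit * std_err, estimate + z_crit * std_err


def p_from_ci(lower, upper, level=0.95):
    """Recover the two-sided p-value (null value 0) from a confidence interval."""
    estimate = (lower + upper) / 2
    z_crit = _STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    std_err = (upper - lower) / (2 * z_crit)
    z_obs = abs(estimate) / std_err
    return 2 * (1 - _STD_NORMAL.cdf(z_obs))


if __name__ == "__main__":
    low, high = ci_from_p(estimate=2.0, p_value=0.05)
    print(low, high)             # the interval just touches 0 when p = 0.05
    print(p_from_ci(low, high))  # round trip back to the p-value
```

With other sampling distributions (a t-distribution for small samples, say) the same idea applies, but the quantile function changes.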

    I also think that “error bar” is a more appropriate term than “confidence interval,” since the concept has nothing to do with one’s psychological state of mind.

  23. […] ~10 signs that you’re a mathematician, CTK Insights studied curved dissections and, finally, Azimuth is a little late to the party of  Jonah Lehrer’s “Decline Effect” article but […]

  24. mthamm says:

    From my perspective, it’s all about the selling of the point. Saying something is “Statistically Significant” gets across that the effect is not attributable to chance more easily than showing just how much of the effect is attributable to chance.

    I use the term to communicate things like the response to a marketing campaign. If I showed error bars and used words like “detectable” to my audience, it would not simply be accepted so that we could move on. It would become a point of discussion and probably a tangent.

    Just came across your blog and enjoy it.


  25. Luftiq says:

    So what about the CERN experiments and the search for the Higgs boson? Are they chasing significance too? Can we trust statistics when it comes to natural laws? It seems to me this is a very tricky subject and that not even scientists understand the complications.

    • John Baez says:

      If the folks at CERN are chasing significance, and I’m not saying they are, they are doing it in a vastly more subtle way than the experiments I’ve been discussing here. For starters, they are not satisfied with the usual criterion, namely that an effect is real if there’s only a 5% chance that it’s a coincidence. Particle physicists demand a ‘5-σ result’ before claiming they discovered a new particle. In other words, they say an effect is real if there’s only a 0.00005% chance that it’s a coincidence.

      They have not reached this point, so they did not, yesterday, claim to have discovered the Higgs. They said they found ‘tantalizing hints’ of the Higgs. For example, the ATLAS detector has gotten a 3.6-σ result, meaning there’s a chance of less than 0.1% of being a coincidence. The CMS detector, doing an independent experiment, has gotten a 2.8-σ result, meaning there’s a chance of less than 1% of being a coincidence. So, they will continue the experiment and collect more data to see if these hints hold up.
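The conversion from "n-σ" to "chance of a coincidence" can be sketched as a Gaussian tail probability. This is a rough illustration, not the analysis the experiments actually perform: it assumes a plain normal distribution and ignores subtleties such as the look-elsewhere effect. The helper name `one_sided_p` is made up for this sketch.

```python
import math


def one_sided_p(n_sigma):
    """Probability that a standard normal variable exceeds n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


if __name__ == "__main__":
    # 1.96 sigma is the usual 5% criterion when both tails are counted.
    for n_sigma in (1.96, 2.8, 3.6, 5.0):
        p = one_sided_p(n_sigma)
        print(f"{n_sigma}-sigma: one-sided {100 * p:.7f}%, "
              f"two-sided {200 * p:.7f}%")
```

Whether the one-sided or the two-sided tail is quoted changes the figures by a factor of two, which is one reason different sources give slightly different percentages for the same σ.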

      For a good detailed but nontechnical intro, I recommend this:

      • Ethan Siegel, The Large Hadron Collider, the Higgs, and hope, Starts With a Bang!, 13 December 2011.

      My figures come from here:

      • Peter Woit, Today’s Higgs results, Not Even Wrong, 13 December 2011.

      You ask:

      Can we trust statistics when it comes to natural laws?

      We can’t trust anything except this: repeatedly re-examining all our beliefs in an intelligent and well-balanced way, using all the tools of reason at our disposal, trying hard to lapse neither into complacency nor paranoia. If we keep doing this, we can become quite sure (though never 100% sure) about some rather unobvious things.

      And when we’re doing this, there’s something a lot worse than using statistics: namely, not using statistics.

  26. John Baez says:

    Jonah Lehrer, the author of this essay ‘Is there something wrong with the scientific method?’, has been dismissed from The New Yorker for making up quotes by Bob Dylan for a book he wrote. Luckily, there’s a lot of stuff in this blog post here (and all your comments) that does not rely on anything he said someone said.
