El Niño Project (Part 6)

guest post by Steven Wenner

Hi, I’m Steve Wenner.

I’m an industrial statistician with over 40 years of experience in a wide range of applications (quality, reliability, product development, consumer research, biostatistics); but, somehow, time series only rarely crossed my path. Currently I’m working for a large consumer products company.

My undergraduate degree is in physics, and I also have a master’s in pure math. I never could reconcile how physicists used math (explain that Dirac delta function to me again in math terms? Heaviside calculus? On the other hand, I thought category theory was abstract nonsense until John showed me otherwise!). Anyway, I had to admit that I lacked the talent to pursue pure math or theoretical physics, so I became a statistician. I never regretted it—statistics has provided a very interesting and intellectually challenging career.

I got interested in Ludescher et al’s paper on El Niño prediction by reading Part 3 of this series. I have no expertise in climate science, except for an intense interest in the subject as a concerned citizen. So, I won’t talk about things like how Ludescher et al use a nonstandard definition of ‘El Niño’—that’s a topic for another time. Instead, I’ll look at some statistical aspects of their paper:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Very early warning of next El Niño, Proceedings of the National Academy of Sciences, February 2014. (Click title for free version, journal name for official version.)

Analysis

I downloaded the NOAA adjusted monthly temperature anomaly data and compared the El Niño periods with the charts in this paper. I found what appear to be two errors (“phantom” El Niños) and noted some interesting situations. Some of these are annotated on the images below. Click to enlarge them:

 

I also listed, for each year, whether an El Niño initiation was predicted, or not, and whether one actually happened. I did the predictions five ways: first, I listed the authors’ “arrows” as they appeared on their charts, and then I tried to match their predictions by following, in turn, four sets of rules. However, I could not come up with any detailed rules that exactly reproduced the authors’ results.

These were the rules I used:

An El Niño initiation is predicted for a calendar year if during the preceding year the average link strength crossed above the 2.82 threshold. However, we could also invoke additional requirements. Two possibilities are:

1. Preemption rule: the prediction of a new El Niño is canceled if the preceding year ends in an El Niño period.

2. End-of-year rule: the link strength must be above 2.82 at year’s end.

I counted the predictions using all four combinations of these two rules and compared the results to the arrows on the charts.
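To make the rule combinations concrete, here is a minimal R sketch of how such rules could be encoded. The data layout and variable names are hypothetical (a yearly summary table, not Ludescher et al’s code or my spreadsheet calculations); only the logic of the threshold rule and the two optional rules comes from the description above.

```r
# Hypothetical sketch of the prediction rules described above.
# 'link' is an assumed yearly summary data frame with columns:
#   year, crossed_above_theta (logical), S_end_of_year (numeric),
#   el_nino_at_year_end (logical).
predict_initiations <- function(link,
                                use_preemption  = FALSE,
                                use_end_of_year = FALSE,
                                theta = 2.82) {
  # Base rule: predict an initiation for year y + 1 if the average link
  # strength crossed above theta at some point during year y.
  pred <- link$crossed_above_theta

  # Optional end-of-year rule: also require S > theta at year's end.
  if (use_end_of_year) pred <- pred & (link$S_end_of_year > theta)

  # Optional preemption rule: cancel the prediction if year y ends
  # inside an El Niño period.
  if (use_preemption) pred <- pred & !link$el_nino_at_year_end

  # The prediction applies to the following calendar year.
  data.frame(year = link$year + 1, predicted = pred)
}
```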

I defined an “El Niño initiation month” to be a month where the monthly average adjusted temperature anomaly rises to at least 0.5 °C and remains at or above 0.5 °C for at least five months. Note that the NOAA El Niño monthly temperature estimates are rounded to hundredths; and, on occasion, the anomaly is reported as exactly 0.5 °C. I found slightly better agreement with the authors’ El Niño periods if I counted an anomaly of exactly 0.5 °C as satisfying the threshold criterion, instead of using the strictly “greater than” condition.
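Here is a similar hedged sketch of the initiation-month rule, again with a hypothetical data layout (a plain numeric vector of monthly anomalies). The comparison uses “greater than or equal”, so an anomaly of exactly 0.5 °C counts as meeting the threshold, as in the variant I found to agree slightly better with the authors’ periods.

```r
# Hypothetical sketch of the 'El Niño initiation month' definition above.
# 'anom' is an assumed numeric vector of monthly NINO3.4 anomalies in °C.
initiation_months <- function(anom, threshold = 0.5, run = 5) {
  warm <- anom >= threshold          # exactly 0.5 °C counts as warm
  n <- length(warm)
  starts <- logical(n)
  for (m in seq_len(n - run + 1)) {
    in_run  <- all(warm[m:(m + run - 1)])   # at least 'run' warm months ahead
    new_run <- (m == 1) || !warm[m - 1]     # not merely a continuation
    starts[m] <- in_run && new_run
  }
  which(starts)                      # indices of initiation months
}
```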

Anyway, I did some formal hypothesis testing and estimation under all five scenarios. The good news is that under most scenarios the prediction method gave better results than merely guessing. (But, I wonder how many things the authors tried before they settled on their final method? Also, did they do all their work on the learning series, and then only at the end check the validation series—or were they checking both as they went about their investigations?)

The bad news is that the predictions varied with the method, and the methods were rather weak. For instance, in the training series there were 9 El Niño periods in 30 years; the authors’ rules (whatever they were, exactly) found five of the nine. At the same time, they had three false alarms in the 21 years in which no El Niño was initiated.

I used Fisher’s exact test to compute some p-values. Suppose (as our ‘null hypothesis’) that Ludescher et al’s method does not improve the odds of a successful prediction of an El Niño initiation. What’s the probability of that method getting at least as many predictions right just by chance? Answer: 0.032 – marginally more significant than the conventional 1-in-20 threshold usually required for rejecting a null hypothesis, but still not terribly convincing. This was, by the way, the most significant of the five p-values for the alternative rule sets applied to the learning series.
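This calculation is easy to reproduce in R. The 2×2 table below is reconstructed from the counts quoted above (5 hits and 4 misses among the 9 El Niño years, 3 false alarms and 18 correct non-alarms among the other 21 years); the one-sided Fisher exact p-value for it comes out near 0.032, and the same table gives the relative-risk point estimate quoted in the next paragraph.

```r
# Sketch of the Fisher exact calculation (not the JMP output itself),
# using the 2x2 table reconstructed from the counts quoted in the text.
tab <- matrix(c(5, 3,     # predicted:     5 hits, 3 false alarms
                4, 18),   # not predicted: 4 misses, 18 correct non-alarms
              nrow = 2, byrow = TRUE,
              dimnames = list(prediction = c("yes", "no"),
                              initiation = c("yes", "no")))

# One-sided test of the null hypothesis that the prediction carries no
# information; for this table the p-value is about 0.032.
fisher.test(tab, alternative = "greater")

# Relative risk of an initiation given a prediction vs. no prediction;
# about 3.4 (JMP's interval may be computed by a different method).
(5 / 8) / (4 / 22)
```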

I also computed the “relative risk” statistics for all scenarios; for instance, we are more than three times as likely to see an El Niño initiation if Ludescher et al predict one, than if they predict otherwise (the 90% confidence interval for that ratio is 1.2 to 9.7, with the point estimate 3.4). Here is a screen shot of some statistics for that case:

Here is a screen shot of part of the spreadsheet list I made. In the margin on the right I made comments about special situations of interest.

Again, click to enlarge—but my whole working spreadsheet is available with more details for anyone who wishes to see it. I did the statistical analysis with a program called JMP, a product of the SAS corporation.

My overall impression from all this is that Ludescher et al are suggesting a somewhat arbitrary (and not particularly well-defined) method for revealing the relationship between link strength and El Niño initiation, if, indeed, a relationship exists. Slight variations in the interpretation of their criteria and slight variations in the data result in appreciably different predictions. I wonder if there are better ways to analyze these two correlated time series.

29 Responses to El Niño Project (Part 6)

  1. Are you in touch with Ludescher et al? It would be interesting to hear their response to your points.

    • John Baez says:

      Nadja Kutz contacted them about the issue mentioned in the section “Mathematical nuances” in Part 3, before that blog article came out. She never heard back. I suppose we should give them another try now.

      Eventually some of us may try to publish a paper on this subject.

  2. I have documented my reservations regarding the contingency table part of this analysis and its interpretation separately.

  3. I also would be more comfortable about the communication of the findings if the heading reporting the Fisher exact test said “p-value” instead of “Prob”.

  4. davetweed says:

    This is very interesting and impressive stuff you’ve all been working on; I’m sorry I’m not in a position to help out much at the moment.

    Anyway, one thing that would be interesting: since this is about predicting a binary event over a relatively small time series, it should be possible to compute all the Θ values where the classifier gets a different score on the training set. If the 2.82 value is far below the Θ value with the next-lowest score, then I’d take this as a suggestion that the precise 2.82 value is significantly overfitting to accommodate the weakest El Niño events in what’s a relatively small dataset. If there’s a more even progression of “score changepoints” as Θ rises from 2.82, then that suggests it’s a more robust threshold.

    • John Baez says:

      We would be happy to have as much or little help as you can give. I’m thinking of taking our work and turning it into a publishable paper at some stage. If so, everyone who wants to help out is welcome to do so, and some subset can be coauthors, and everyone else will be thanked, and nobody has to do more work than they want—except, I guess, me.

      So what about the sensitivity of Ludescher et al’s work to their choice of Θ, and the danger of overfitting?

      For what it’s worth, in their 2014 paper Ludescher et al write:

      In the first part (1950–1980), which represents the learning phase, all thresholds above the temporal mean of S(t) are considered, and the optimal ones, i.e., those ones that lead to the best predictions in the learning phase, are determined. We found that Θ-values between 2.815 and 2.834 lead to the best performance (31). In the second part of the data set (1981–2011), which represents the prediction (hindcasting) phase, the performance of these thresholds was tested. We found that the thresholds between 2.815 and 2.826 gave the best results (Fig. 2, where Θ = 2.82). The alarms were correct in 76% and the nonalarms in 86% of all cases. For Θ values between 2.827 and 2.834, the performance was only slightly weaker.

      The reference (31) is to their earlier paper, which has this chart:

      (Click to enlarge.)

      They write:

      Next, we use the thresholds selected in the learning phase to predict El Niño episodes in the second half of the dataset between 1982 and 2011, where we have 9 episodes and 21 nonepisodes.

      For Θ between 2.816 and 2.822, which is depicted in Fig. 2B, the hit rate is D = 6/9 ≅ 0.667—at a false-alarm rate α = 1/21 ≅ 0.048. For Θ between 2.805 and 2.816, the hit rate is also D = 6/9, but the false-alarm rate is α = 2/21 ≅ 0.095. For 2.780 < Θ ≤ 2.792, we have D = 5/9 ≅ 0.556 at α = 1/21 ≅ 0.048. These results are highly significant because the prediction efficiency is considerably better than for the shuffled data.

      For comparison, we show also the results for 6- and 12-mo forecasts based on state-of-the-art climate models (21, 37). In ref. 21, an ensemble of model trajectories has been used, whereas for the forecast of ref. 37, only a single trajectory has been used. In both references, the forecast has been compared with the NINO3.4 index, as in the current analysis. Fig. 3B shows that the method suggested here for predicting El Niño episodes more than 1 y ahead considerably outperforms the conventional 6-mo and 1-y forecasts.

      Figure 2B is one we’ve already seen repeatedly; it’s the second part here:


      I don’t deeply understand their discussion, since I don’t know what “receiver operating characteristic” analysis is, or the point of “reshuffling” the data. (Reshuffling something makes it sound like you’re testing to see if your predictions do better than completely random predictions, but you can do a lot better than random and still be really bad.)

      • davetweed says:

        Thanks. The quick upshot is that it sounds like Θ=2.8-ish is more robust than it looked from the graphs.

        Incidentally, the simplest part of Receiver Operating Characteristic analysis is just an expansion of the idea that, when comparing classifiers in a system doing binary prediction, there are two kinds of error, false positives and false negatives, and a learning procedure can often trade off between the two, so there isn’t one number that represents accuracy but a curve. If you plot, say, hit rate (i.e. true positive rate) vs false positive rate then you get a curve that should have a shape similar-ish to an upper-right quadrant of a circle.

        When comparing different classifiers you often don’t know what precise (false-positive, false-negative) trade-off you’ll want in deployment, but comparing the ROC curves you’ll often find that one classifier is “outermost” on the graph, and hence better, for a wide range of (false positive, false negative) values. (If you’re really lucky there’s one classifier that’s outermost for all the values, in which case it’s always the best classifier.) This is difficult to describe in words rather than with a diagram like this plot of some ROC curves from Wikipedia’s ROC article.
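        Here’s a rough R sketch of that construction. The score and outcome vectors below are made up purely for illustration (they’re not the link-strength data); the point is just that every threshold at which the score changes gives one (false-alarm rate, hit rate) point on the empirical ROC curve.

```r
# Illustrative only: 'score' and 'truth' are made-up stand-ins for the
# yearly link-strength summaries and the observed El Niño initiations.
roc_points <- function(score, truth) {
  thresholds <- sort(unique(score), decreasing = TRUE)
  t(sapply(thresholds, function(th) {
    pred <- score >= th
    c(threshold        = th,
      hit_rate         = sum(pred & truth)  / sum(truth),    # true positive rate
      false_alarm_rate = sum(pred & !truth) / sum(!truth))   # false positive rate
  }))
}

set.seed(1)
score <- runif(30, 2.5, 3.1)      # fake link-strength values, one per year
truth <- runif(30) < 0.3          # fake indicator of an actual initiation
pts <- roc_points(score, truth)
plot(pts[, "false_alarm_rate"], pts[, "hit_rate"], type = "s",
     xlab = "false alarm rate", ylab = "hit rate")   # empirical ROC curve
```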

        • davetweed says:

          Ugh, that is why I hate trying to describe this in words. I meant the ROC curve is an “upper left quadrant of a circle” (a shape like a letter \Gamma but with a lot more steps).

        • John Baez says:

          Thanks, I think I get it! I don’t care much about left versus right, since I’m left-handed and the distinction is arbitrary. But now I get why Ludescher et al are graphing the “hit rate” versus the “false alarm rate”. Before it was a complete mystery to me; now it makes perfect sense. I guess they get a reflected version of the “false positive” versus “false negative” graph you describe.

        • davetweed says:

          It’s conventional in ROC analysis to have “a quantity you want to maximise, generally the true positive rate” on the vertical y-axis and “a quantity you want to minimise, generally the false positive rate” on the horizontal x-axis, so that above or to the left corresponds to “better”. As you can probably tell, this mixed convention confuses me, so I always end up thinking about what should be happening in terms of two “errors that should be minimised”, i.e. false positives and false negatives, and then convert that statement into one in terms of the ROC curve variables.

  5. John Baez says:

    Hypergeometric wrote:

    I have documented my reservations regarding the contingency table part of this analysis and its interpretation separately.

    Let me show people here what you did:

    Okay, I’ve done the analysis of Steve Wenner’s contingency table. I don’t know the JMP facility well enough to tell how it presents the Fisher exact test, but there may be a misunderstanding of its result involved. Given the Bayesian analysis, the output the Fisher exact test in JMP is giving is a significance-test p-value, not a probability. Someone familiar with JMP would need to check this. If so, that 0.0318 in Steve’s report is the p-value of the data, and thus indicates that the pattern of counts is unlikely to be due to random variation. I’m not going to defend or explain significance tests or the Fisher exact test, apart from saying the idea is to enumerate all possible 2×2 tables having marginals agreeing with the given table, and to see what fraction are “consistent with” the given table. Another way to look at it is to say that, if rows and columns are independent, the cell values should be well estimated by the total number of counts in the table times the products of the row and column densities. Thus, given the table, and assuming fixed marginals, the “independence version” of the table is 15 and 7 in the first row, and 6 and 2 in the second row. I’ll leave more to others.

    Now, I’m callous about the Fisher exact test here because I’ve studied it, and the idea of fixed marginals is inconsistent with reality in this case. In some situations, such as controlled experiments, the marginals are indeed fixed, and the analysis should proceed conditioning on them. But that’s not the case here: for example, Steve did not know in advance that the total number of “Author Arrows = No” would be 22, nor that the total number of counts would be 30. Thus, these could have been nearly anything, and the results of the analysis depend in part upon considering the possibility that they, too, are subject to random variation.

    Moving on to the Bayesian version, which is done using the code available, the spread of possible counts for the cells in this table, without assuming fixed marginals, is shown in the figure:


    (Click any of these figures to enlarge them.)

    That’s quite a spread. It reflects the small number of overall events.

    Nevertheless, it’s possible to consider, given the data we have, the probabilities of various scenarios. In particular, if we look at the odds ratio of having an ENSO activation given a Ludescher forecast to not having one, that is, P(AI=Yes|AA=Yes)/P(AI=No|AA=Yes), we get the posterior density:

    Essentially, the odds go from about unity to almost 5. That’s a big range, attributable to the low number of successful instances, but it hardly dismisses the Ludescher et al algorithm as poor.

    The other “relative risks”, another name for “odds ratios”, are shown below:



    Consider especially


    That last one, the odds of an ENSO starting given that Ludescher et al have predicted it, relative to it starting without their having predicted it, agrees more or less with the Fisher exact estimate of 3.4; but the Fisher exact test markedly understates the possible range of control the Ludescher et al algorithm has over the future, given the uncertainties and limited data. Indeed, the fact that the Fisher exact estimate is 3.4 is why I think any interpretation of that 0.0318 p-value as dismissive of the Ludescher et al results may come from misreading the JMP documentation.

    The entire package is available for download as a gzipped tarball. If you are interested in re-running the example, you’ll need R and JAGS installed, and the code is currently configured to use 4 cores of a system. Final results and diagnostics are available.

    If you want to reproduce this and need help, give a yell. I love to help people learn about the modern methods of Bayesian analysis, including things like JAGS.

    Some technical details… The counts were modeled using Poisson densities having means which are exponentials of the sum of a factor common to all cells, factors common to each column, factors common to each row, and factors specific to each cell. Uniform priors were used in lieu of the sometimes-recommended Normal priors. The posterior densities of the odds ratios were obtained from the posterior densities of the estimated cell means (the Poisson means), using Bayes’ rule, and were calculated along with the rest of the Gibbs sampler run in JAGS. Gelman–Rubin PSRF statistics were computed over 10 separately initialized chains. There were 20,000 burn-in steps and 10,000 adaptation steps, and 750,000 primary steps were done in each chain, thinned to every 15th sample to improve the correlation values in the results for the factors. Execution took 3.7 minutes on a 4-core 3.2 GHz AMD 64-bit system, with essentially no memory constraints, running under Windows 7 Home Premium.
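    For readers curious what a model of this general shape looks like in code, here is a minimal sketch of a saturated log-linear Poisson model for the 2×2 table, written for JAGS and driven from R via the rjags package. It is only an illustration of the structure Jan describes: the cell counts are the ones implied by Steve’s table, the uniform priors and chain settings are simplified stand-ins, and it is not the code from the downloadable tarball.

```r
# Minimal sketch (not the original code): a saturated log-linear Poisson
# model for the 2x2 table, with uniform priors on the log-scale factors.
library(rjags)   # requires a JAGS installation

# Assumed cell counts: rows = authors' arrows (no, yes),
# columns = El Niño initiation (no, yes).
y <- matrix(c(18, 4,
               3, 5), nrow = 2, byrow = TRUE)

model_string <- "
model {
  for (i in 1:2) {
    for (j in 1:2) {
      y[i, j] ~ dpois(lambda[i, j])
      # common factor + row factor + column factor + cell-specific factor
      log(lambda[i, j]) <- mu + a[i] + b[j] + ab[i, j]
    }
  }
  mu ~ dunif(-10, 10)
  for (i in 1:2) { a[i] ~ dunif(-10, 10) }
  for (j in 1:2) { b[j] ~ dunif(-10, 10) }
  for (i in 1:2) { for (j in 1:2) { ab[i, j] ~ dunif(-10, 10) } }

  # 'Relative risk' of an initiation given an arrow vs. no arrow.
  # The individual factors are not separately identifiable, but the
  # cell means lambda (and functions of them) are.
  rr <- (lambda[2, 2] / (lambda[2, 1] + lambda[2, 2])) /
        (lambda[1, 2] / (lambda[1, 1] + lambda[1, 2]))
}
"

jm <- jags.model(textConnection(model_string), data = list(y = y),
                 n.chains = 4, n.adapt = 1000)
update(jm, 5000)                                        # burn-in
post <- coda.samples(jm, c("lambda", "rr"), n.iter = 50000, thin = 10)
summary(post)
```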

  6. Thanks so much, John! Steve concludes, as I do, that the contingency-table analysis supports the basic correctness of the Ludescher et al algorithm, but he’s underwhelmed by the significance of the p-value, and I say the Bayesian analysis shows that a sharper conclusion needs more time and more El Niño instances. I suspect, but have not done the analysis and so do not know, that a frequentist analysis of statistical power would show the same thing.

  7. John Baez says:

    Hypergeometric wrote:

    I don’t know the JMP facility well enough to tell how it presents the Fisher exact test, but there may be a misunderstanding of its result involved. Given the Bayesian analysis, the output the Fisher exact test in JMP is giving is a significance-test p-value, not a probability.

    Steve Wenner was reporting significance test p-values. The word “probability” was part of an explanation introduced by me, because I think many readers here will not understand the phrase “Fisher exact test p-value”. Steve Wenner polished my explanation, and I think it’s correct:

    I used Fisher’s exact test to compute some p-values. Suppose (as our ‘null hypothesis’) that Ludescher et al’s method does not improve the odds of a successful prediction of an El Niño initiation. What’s the probability of that method getting at least as many predictions right just by chance? Answer: 0.032.

    That number is the p-value. Of course this explanation is a bit rough; the link provides a more detailed one… but I think it’s clear enough that a mathematician could reinvent the Fisher exact p-test starting from this description. (I know I could.)

    I completely defer to the experts on how happy or unhappy I should be about a p-value of 0.032, whether the Fisher exact p-test is the best thing to be doing here, and what might be better. I wish I understood all the tests you ran! If I/we ever write a paper about the work of Ludescher et al, or some other El Niño prediction algorithm of our own, that would give me a great excuse to learn about these tests.

    My dream would be something like this. Simplify Ludescher et al’s method without making it work any less well (we already know some ways to do that). Use it to make predictions starting not just from surface air temperatures but from ocean temperatures, which may be more important. Try a number of variations. Use really good statistical techniques to rate these different variations. Also use the same techniques to rate some of the standard methods of predicting El Niños.

    (Nonexperts should always remember that this “climate network” method is very unusual and not what most meteorologists would want to use! So, it will be interesting to see how it compares.)

    Well, I said I would defer to experts about whether the Fisher exact p-test is the best thing to be doing here, but that’s not quite true. When I’m evaluating a weather forecast, my main concern is not the p-value: I’m willing to concede immediately that the weather forecast is significantly better than a “random” one of some sort, so I won’t be happier being told there’s a 99.999% chance that the weatherman is doing better than random.

    My main concern is “how good” the forecast is. But this concept requires further clarification; it’s actually a bunch of different concepts rolled into one. Some of the other tests you and Steve ran seem more like measures of “how good” Ludescher et al’s forecast is.

    • (Did not see your later comments ’til I put in my previous reply.)

      In many respects the Fisher exact test and the Bayesian posteriors are comparable. They give similar results. BUT where the Fisher exact test gives p-values and odds ratios and confidence intervals for them based on some dubious assumptions, the Bayesian integration (and it does derive from calculating integrals, albeit numerically) gives full marginal posteriors for these, relaxing all those assumptions. Each is just done using the integral form of Bayes Rule, with the kernel of the integral being a likelihood expression for the data assuming a Poisson model of counts for each cell, conditioned on parameters of actual cell probabilities AND on values for the margins of the 2×2 treated as random parameters. These are assigned exponential priors, with their exponents being linear expressions of row-only terms, of column-only terms, and of cross terms. The terms have parameters to which uninformative hyperpriors are assigned. (I could write the math but I’m limited to my Kindle for the moment and that’s painful enough.) This gets rolled into separate big expectation calculations which produce, e.g., the posterior estimates of the cell counts.

      It’s almost harder to describe than do. The Markov Chain model, given the model structure and data, “becomes” the various posteriors because of the Rosenbluth-Hastings theorem, and, once checked for convergence, offers samples from each of these posteriors to plot, calculate with, predict, etc.

  8. Thanks for the clarification. My concern is the casual reader will misinterpret “probability” to mean “likelihood”, and “p-value” is at least something they’ll need to look up.

    Moreover, my broader concern is conceptual, not only because only the Bayesian approach gives the full picture, but because the arguments suggest p-values are an index to plausibility: that, somehow, the smaller they are, the more plausible the Ludescher et al claim of accuracy. This is quite incorrect, even from a frequentist perspective. P-values, like it or not, are random variables, and a properly done frequentist significance test has the student of the problem establish a cutoff for acceptance, based upon determined test specificity, power, and effect size, BEFORE the p-value is calculated, then drawing an unambiguous conclusion based upon that threshold.

    It is also particularly noxious, in my opinion, to do something which Steve Wenner DID NOT, but which Fyfe, Gillett, and Zwiers, for example, from “Warming Slowdown?” definitely did, which was to interpret a p-value as an event probability and claim it equivalent to a random excursion happening once every so many years.

    I’m not being pedantic about this. I’m trying to circumscribe bad frequentist practice. Doing things a Bayesian way is one way of doing that. Keeping to the narrow interpretations of frequentist p-values is another.

  9. Steve Wenner says:

    Oh my gosh, in little more time than it took me to fly across the Atlantic my guest post acquired quite a string of expert comments!

    I agree completely with Hypergeometric’s criticisms and concerns about the assumption of fixed marginals in the Fisher exact test. I am really a Bayesian at heart and actually prefer Hypergeometric’s philosophy and analytical approach to my own frequentist analysis. I must point out, however, that the Bayesian approach does not reduce our list of assumptions – as you can see from Hypergeometric’s technical notes, the list of assumptions necessary to perform a Bayesian analysis is very long! (Hypergeometric – could you chart cumulative probabilities, instead of or in addition to densities? And with grid lines? I think that would help us in extracting more information).

    I’m not sure I share Hypergeometric’s concern over the interpretation of p-values in my post. I do think we can use p-values as a measure of the credibility of a null hypothesis: if the p-value is very small then we are compelled to discard the null hypothesis (that doesn’t mean that any one particular alternative hypothesis is necessarily valid or useful!). Of course, in the broad Bayesian perspective p-values are random variables, but I don’t think that invalidates all frequentist interpretations.

    A comment about (my attitude toward) null hypotheses: except in rare circumstances null hypotheses are always false, and we don’t need any data to confirm that! The only issue is the size of an effect and its direction. So, does Ludescher et al’s approach help or hurt? I think there is weak evidence that it might help a little.

    • Thanks, Steve.

      The notes were provided to support repeatability of result, and to provide minimal documentation of the cited code.

      As far as assumptions go, the only serious one is of Poisson counts. That is, the distribution could be under- or overdispersed, something that could be checked by trying an alternative negative binomial model. But that, in my opinion, would be silly. There are only 30 damn events! It could be done, though.

      I’ve promised JohnB I’ll (eventually) provide gory details of the analysis, not that it requires them, but in the form of an explanatory tutorial on my blog.

      This all does suggest some thought should be given to how anyone’s going to be able to tell if an Azimuth-derived improvement is actually an improvement over Ludescher et al. I’m not saying they can’t. I’m saying it should be given some thought, and perhaps some of us more test-engineering-oriented people should invest in that.

      The output presented consists of standard coda diagnostics and results for MCMC/Gibbs samplers, as is customary. Next time I cook something there, I’ll think about doing it with cumulatives and then adding a grid to that.

      Tuning out … Back to work problems and to sea-level rise. It’s been fun.

  10. Steve Wenner says:

    “null hypotheses are always false” – from a Bayesian viewpoint, this is usually true except for a set of measure zero (or “almost everywhere”, as a probabilist might say).

  11. lee bloomquist says:

    Steve wrote “…my whole working spreadsheet is available with more details for anyone who wishes to see it.” But I get the Not Found page.(?)

    • John Baez says:

      Whoops! Thanks, I fixed it. You can find Steve Wenner’s spreadsheet here. The annotations are interesting.

      (My link called the file “ElNinoTemps.xslx”, but it’s “ElNinoTemps.xlsx”.)

  12. John Baez says:

    Steve wrote:

    A comment about (my attitude toward) null hypotheses: except in rare circumstances null hypotheses are always false, and we don’t need any data to confirm that!

    Right! You and Hypergeometric (= Jan Galkowski) already know this stuff… but since we have lots of non-statistician readers here, I urge them to read this great blog article:

    • Tom Leinster, Fetishizing p-values, The n-Category Café, 23 September 2010.

    For those too lazy to click the link, here’s the basic point:

    Suppose that you and I are each trying to develop a drug to speed up the healing of broken legs. You do some trials and run some statistics and conclude that your drug works—in other words, you reject the null hypothesis that patients taking your drug heal no faster than a control group. I do the same for my drug. Your p-value is 0.1 (a certainty of 90%), and mine is 0.001 (a certainty of 99.9%).

    What does this tell us about the comparative usefulness of our drugs? Nothing. That’s because we know nothing about the magnitude of the effect. I can now reveal that my wonder drug for broken legs is… an apple a day. Who knows, maybe this does have a minuscule positive effect; for the sake of argument, let’s say that it does. [….] Given enough time to conduct trials, I can truthfully claim any degree of statistical significance I like.

    The danger is that someone who buys into the ‘cult of statistical significance’ might simply look at the numbers and say ‘90% is less than 99.9%, so the second drug must be better’.

    So, I agree completely with Jan’s point:

    This all does suggest some thought should be given to how anyone’s going to be able to tell if an Azimuth-derived improvement is actually an improvement over Ludescher et al. I’m not saying they can’t. I’m saying it should be given some thought, and perhaps some of us more test-engineering-oriented people should invest in that.

    And here’s a reason why it’s really worthwhile. The question “how good is this method of El Niño prediction” is not only important for the methods we come up with here at Azimuth. It’s also important for rating the methods the ‘big boys’ use!

    Look at this huge range of predictions the ‘big boys’ are making:



    Which of these models should we trust? That’s an important question.

    Of course, the ‘big boys’ have also thought about this question:

    • Anthony G. Barnston, Michael K. Tippett, Michelle L. L’Heureux, Shuhua Li and David G. Dewitt, Skill of real-time seasonal ENSO model predictions during 2002–2011 — is our capability increasing?, Science and Technology Infusion Climate Bulletin, NOAA’s National Weather Service, 36th NOAA Annual Climate Diagnostics and Prediction Workshop, Fort Worth, TX, 3–6 October 2011.

    So, we’d need to understand what they’re doing before we could do something better.

  13. Steve Wenner says:

    “Barnston et al.” is interesting. It is disappointing that the newer models have improved so little, if at all, over the older ones. I was struck by the great difficulty the statistical models had overcoming the “northern spring predictability barrier”; I think I can see how this could happen with ARIMA models, since the lag terms might swing a model in the wrong direction during periods of transition (but this is pure speculation on my part – I wonder what the experts would say). Has anyone tried composite models – weighting the dynamical forecasts more heavily during the late-spring and early-summer months and relying more on the statistical models at other times?

  14. In case you’re interested, here are the other parts of this series:

    El Niño project (part 1): basic introduction to El Niño and our project here.

    El Niño project (part 2): introduction to the physics of El Niño.

    El Niño project (part 3): summary of the work of Ludescher et al.

    El Niño project (part 4): how Graham Jones replicated the work by Ludescher et al, using software written in R.

    El Niño project (part 5): how to download R and use it to get files of climate data.

    El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher et al.

    El Niño project (part 7): the definition of El Niño.

  15. In the work by Ludescher et al that we’ve been looking at, they don’t feed all the daily time series values into their classifier but take the […]

  16. Steve Wenner, a statistician helping the Azimuth Project, noted some ambiguities in Ludescher et al’s El Niño prediction rules and disambiguated them in a number of ways. For each way he used Fisher’s exact test to compute the p-value of the null hypothesis that Ludescher et al’s El Niño prediction does not improve the odds of a successful prediction of an El Niño.

  17. John Baez says:

    In email Steve Wenner wrote:

    Hi John,

    I computed Bayes probability intervals for the success probabilities of Ludescher’s predictions (and for my variations). My method is not perfect, since I assumed independence across years, which is surely not correct. However, taking into account the serial correlations would only increase the length of the error bars, and they are plenty long enough already! I’m not certain how to do better, but I suspect I would need to use Monte Carlo methods, and they have their own problems.

    Ludescher’s method has an expected posterior probability of successfully predicting an El Niño initiation event, when one actually occurs, of 0.601. This is very close to the frequentist estimate (11 successes / 18 El Niños = 0.611); so, the prior distribution has little effect on the estimate of the mean. The 95% confidence interval is from 0.387 to 0.798; so, the data and method did succeed in narrowing the prior uniform interval (see the next paragraph) that extends from 0.286 to 1. The intervals for “non” El Niño events are shorter: 0.768 to 0.951 for the probability of successfully predicting a “non” event; however, the prior for the non-event is from 0.714 to 1, so Ludescher’s method doesn’t narrow the range very much!

    If we don’t condition on the outcome, then the estimate of the mean success probability is 0.795 using Ludescher’s method; but, if we simply use the “dumb” rule (always predict “no El Niño”) then we will be right with probability 0.714 – the data and Ludescher gain us very little!

    Truncated Uniform Prior:

    I assume that any reasonable method will do at least as well as chance. Over many years we have experienced an El Niño initiation event in 28.6% of those years. So, a dumb method that simply declares “El Niño will happen next year!” with probability 28.6% and “will not happen!” with probability 71.4% will be successful in “predicting” 28.6% of all El Niños. So, I set the minimum success probability at p0 = 28.6%, given that an El Niño actually occurs. Similarly, the dumb method successfully predicts 71.4% of the “no El Niño” years; so, I set the minimum success probability at p0 = 71.4% for any prediction method, given that the outcome is “no El Niño”. In both cases the upper limit is p1 = 1 for a perfect prediction method.

    For a binomial sampling situation with a truncated uniform prior the posterior density is expressible with the help of the beta distribution function (normalized incomplete beta function). The formulas can be found in Bayesian Reliability Analysis by Martz & Waller, Wiley, 1982, pp. 262–264. The posterior mean has a closed form, but the Bayes probability intervals must be found by iterative methods (I used the Excel “solver” add-in).

    The details are on the spreadsheet. I greyed out superfluous stuff I used to help me get the formulas straight.

    Cheers,

    Steve

    He added:

    Hi, I just wanted to add a couple of comments:

    The reason I used both the training and the validation data in estimating confidence limits was because the validation data show a better fit to the model than the training data; so, it seemed more fair to use both data sets for these calculations.

    I did some rough calculations to estimate the increase in the error bars that might result if I were to take the serial correlations into account. For instance, I think that the lower limit for the probability of successfully predicting an El Niño with Ludescher’s method is actually closer to 0.366, rather than the 0.387 reported above, and the upper limit would increase from 0.798 to 0.815. Since my ideas for this adjustment are only half-baked, I won’t go into the details here.
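    For anyone who wants to check these numbers without the spreadsheet, here is a minimal R sketch of the truncated-uniform-prior calculation Steve describes, using the counts he quotes (11 successes in 18 El Niño years, with the prior truncated below at p0 = 0.286). It illustrates the approach; it is not the spreadsheet’s exact formulas.

```r
# Sketch of the truncated-uniform-prior calculation (not the spreadsheet).
# With a uniform prior on [p0, 1] and s successes in s + f binomial trials,
# the posterior is a Beta(s + 1, f + 1) density restricted to [p0, 1].
s  <- 11                       # El Niño initiations correctly predicted
f  <- 18 - 11                  # El Niño initiations missed
p0 <- 0.286                    # lower end of the truncated uniform prior
a  <- s + 1
b  <- f + 1

mass_below <- pbeta(p0, a, b)  # Beta mass excluded by the truncation

# Posterior mean of the truncated Beta, via an incomplete-beta identity.
post_mean <- (a / (a + b)) * (1 - pbeta(p0, a + 1, b)) / (1 - mass_below)

# Equal-tailed 95% interval of the truncated posterior.
ci <- qbeta(mass_below + c(0.025, 0.975) * (1 - mass_below), a, b)

round(c(mean = post_mean, lower = ci[1], upper = ci[2]), 3)
# roughly 0.60 and (0.39, 0.80): close to the 0.601 and (0.387, 0.798)
# quoted in the email above.
```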
