In February, this paper claimed there’s a 75% chance the next El Niño will arrive by the end of 2014:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Very early warning of next El Niño, *Proceedings of the National Academy of Sciences*, February 2014. (Click title for free version, journal name for official version.)

Since it was published in a reputable journal, it created a big stir! Being able to predict an El Niño more than 6 months in advance would be a big deal. El Niños can cause billions of dollars of damage.

But that’s not the only reason we at the Azimuth Project want to analyze, criticize and improve this paper. Another reason is that it uses a *climate network*—and we like network theory.

Very roughly, the idea is this. Draw a big network of dots representing different places in the Pacific Ocean. For each pair of dots, compute a number saying how strongly correlated the temperatures are at those two places. The paper claims that when an El Niño is getting ready to happen, the average of these numbers is big. In other words, temperatures in the Pacific tend to go up and down in synch!

Whether this idea is right or wrong, it’s interesting—and it’s not very hard for programmers to dive in and study it.

Two Azimuth members have done just that: David Tanzer, a software developer who works for financial firms in New York, and Graham Jones, a self-employed programmer who also works on genomics and Bayesian statistics. These guys have really brought new life to the Azimuth Code Project in the last few weeks, and it’s exciting! It’s even gotten me to do some programming myself.

Soon I’ll start talking about the programs they’ve written, and how you can help. But today I’ll summarize the paper by Ludescher *et al*. Their methodology is also explained here:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Improved El Niño forecasting by cooperativity detection, *Proceedings of the National Academy of Sciences*, 30 May 2013.

### The basic idea

The basic idea is to use a climate network. There are lots of variants on this idea, but here’s a simple one. Start with a bunch of dots representing different places on the Earth. For any pair of dots, compute the cross-correlation of temperature histories at those two places. Call some function of this the ‘link strength’ for that pair of dots. Compute the average link strength… and get excited when this gets bigger than a certain value.
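To make the simple variant concrete, here is a minimal Python sketch. It is not the method of Ludescher *et al*: it takes the ‘link strength’ of a pair to be the absolute value of the ordinary (zero-lag) correlation, which is just one possible choice. The place names and temperature values are made up.

```python
from itertools import combinations
from statistics import mean, pstdev

def correlation(x, y):
    """Pearson correlation of two equally long series (population form)."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))

def average_abs_correlation(series_by_place):
    """Average |correlation| over all pairs of places (an assumed link strength)."""
    pairs = combinations(series_by_place.values(), 2)
    return mean([abs(correlation(x, y)) for x, y in pairs])

# Toy temperature histories (made-up numbers, not real data).
toy_histories = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],  # rises in step with A
    "C": [4.0, 3.0, 2.0, 1.0],  # falls as A rises
}
print(average_abs_correlation(toy_histories))
```

In this toy example every pair is perfectly correlated or anti-correlated, so the average comes out very close to 1.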

The papers by Ludescher *et al* use this strategy to predict El Niños. They build their climate network using correlations between daily temperature data for 14 grid points in the El Niño basin and 193 grid points outside this region, as shown here:

The red dots are the points in the El Niño basin.

Starting from this temperature data, they compute an ‘average link strength’ in a way I’ll describe later. When this number is bigger than a certain fixed value, they claim an El Niño is coming.

How do they decide if they’re right? How do we tell when an El Niño actually arrives? One way is to use the ‘Niño 3.4 index’. This is the area-averaged sea surface temperature anomaly in the yellow region here:

**Anomaly** means the temperature minus its average over time: how much *hotter than usual* it is. When the Niño 3.4 index is over 0.5°C for at least 5 months, Ludescher *et al* say there’s an El Niño. (By the way, this is not the standard definition. But we will discuss that some other day.)
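Here is a small Python sketch of that criterion, assuming the index comes as a list of monthly values and reading ‘over 0.5°C for at least 5 months’ as five or more consecutive months strictly above 0.5. The function name and the sample values are mine, not from the paper.

```python
def el_nino_episodes(index, threshold=0.5, min_months=5):
    """Return (first, last) positions of runs where the monthly index stays
    above `threshold` for at least `min_months` consecutive months."""
    episodes = []
    start = None
    for k, value in enumerate(index):
        if value > threshold:
            if start is None:
                start = k  # a new run begins
        else:
            if start is not None and k - start >= min_months:
                episodes.append((start, k - 1))
            start = None
    # A run may still be open at the end of the record.
    if start is not None and len(index) - start >= min_months:
        episodes.append((start, len(index) - 1))
    return episodes

# Made-up monthly values: a 2-month blip, then a 5-month episode.
toy_index = [0.1, 0.6, 0.7, 0.2, 0.6, 0.8, 1.1, 0.9, 0.7, 0.3]
print(el_nino_episodes(toy_index))  # [(4, 8)]
```

The short blip in months 1 and 2 is ignored because it lasts less than 5 months.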

Here is what they get:

The blue peaks are El Niños: episodes where the Niño 3.4 index is over 0.5°C for at least 5 months.

The red line is their ‘average link strength’. Whenever this exceeds a certain threshold and the Niño 3.4 index is not *already* over 0.5°C, they predict an El Niño will start in the following calendar year.

The green arrows show their successful predictions. The dashed arrows show their false alarms. A little letter n appears next to each El Niño that they failed to predict.

You’re probably wondering where the threshold, 2.82, came from. They get it from a learning algorithm that finds this threshold by optimizing the predictive power of their model. Chart A here shows the ‘learning phase’ of their calculation. In this phase, they adjusted the threshold so their procedure would do a good job. Chart B shows the ‘testing phase’. Here they used the threshold value chosen in the learning phase, and checked to see how good a job it did. I’ll let you read their paper for more details on how they chose it.

But what about their prediction now? That’s the green arrow at far right here:

On 17 September 2013, the red line went above the threshold! So, their scheme predicts an El Niño sometime in 2014. The chart at right is a zoomed-in version that shows the red line in August, September, October and November of 2013.

### The details

Now I mainly need to explain how they compute their ‘average link strength’.

Let $i$ stand for any point in this 9 × 23 grid:

For each day $t$ between June 1948 and November 2013, let $\tilde{T}_i(t)$ be the average surface air temperature at the point $i$ on day $t$.

Let $T_i(t)$ be $\tilde{T}_i(t)$ minus its **climatological average**. For example, if $t$ is June 1st 1970, we average the temperature at location $i$ over all June 1sts from 1948 to 2013, and subtract that from $\tilde{T}_i(t)$ to get $T_i(t)$.

They call $T_i(t)$ the **temperature anomaly**.

(A subtlety here: when we are doing prediction we can’t know the future temperatures, so the climatological average is only the average over *past* days meeting the above criteria.)
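Here is a rough Python sketch of the anomaly calculation at a single grid point, with a switch for the ‘past only’ version. It assumes the raw temperatures sit in a dictionary keyed by date, and that ‘past’ includes the current year. Both are my choices for illustration. Leap days get no special treatment.

```python
from datetime import date

def anomaly(raw, day, past_only=False):
    """Temperature on `day` minus the mean over the same calendar day
    in the years covered by `raw` (a dict: datetime.date -> temperature).

    past_only=True averages only over years up to and including day.year.
    """
    same_day = [
        temp
        for d, temp in raw.items()
        if (d.month, d.day) == (day.month, day.day)
        and (not past_only or d.year <= day.year)
    ]
    return raw[day] - sum(same_day) / len(same_day)

# Made-up June 1st temperatures for four years.
toy_raw = {
    date(1968, 6, 1): 20.0,
    date(1969, 6, 1): 21.0,
    date(1970, 6, 1): 25.0,
    date(1971, 6, 1): 26.0,
}
print(anomaly(toy_raw, date(1970, 6, 1)))                  # 2.0
print(anomaly(toy_raw, date(1970, 6, 1), past_only=True))  # 3.0
```

The two answers differ because the warm 1971 value is in the future as seen from 1970, so the past-only average leaves it out.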

For any function of time $f(t)$, denote its moving average over the last 365 days by:

$$ \langle f(t) \rangle = \frac{1}{365} \sum_{d = 0}^{364} f(t - d) $$
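In code, the moving average is a one-liner. This sketch assumes the function of time is stored as a list indexed by day number. The window length is a parameter so that small examples are easy to check by hand.

```python
def moving_average(f, t, window=365):
    """Mean of f[t - d] for d = 0, ..., window - 1, where f is indexed by day."""
    if t - (window - 1) < 0:
        raise ValueError("not enough history before day t")
    return sum(f[t - d] for d in range(window)) / window

toy_f = [float(k) for k in range(10)]
print(moving_average(toy_f, t=9, window=4))  # mean of 9, 8, 7, 6 = 7.5
```

The explicit check matters in Python, since a negative list index would otherwise silently wrap around to the end of the list.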

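Let $i$ be a point in the El Niño basin, and $j$ be a point outside it. For any time lag $\tau$ between 0 and 200 days, define the **time-delayed cross-covariance** by:

$$ \langle T_i(t) \, T_j(t - \tau) \rangle - \langle T_i(t) \rangle \, \langle T_j(t - \tau) \rangle $$

Note that this is a way of studying the linear correlation between the temperature anomaly at node $i$ and the temperature anomaly a time $\tau$ earlier at some node $j$. So, it’s about how temperature anomalies inside the El Niño basin are correlated to temperature anomalies outside this basin at earlier times.

Ludescher *et al* then normalize this, defining the **time-delayed cross-correlation** $C_{i,j}^{t}(-\tau)$ to be the time-delayed cross-covariance divided by

$$ \sqrt{\langle (T_i(t) - \langle T_i(t) \rangle)^2 \rangle} \; \sqrt{\langle (T_j(t - \tau) - \langle T_j(t - \tau) \rangle)^2 \rangle} $$

This is something like the standard deviation of $T_i(t)$ times the standard deviation of $T_j(t - \tau)$. Dividing by standard deviations is what people usually do to turn covariances into correlations. But there are some potential problems here, which I’ll discuss later.

They define $C_{i,j}^{t}(\tau)$ in a similar way, by taking

$$ \langle T_i(t - \tau) \, T_j(t) \rangle - \langle T_i(t - \tau) \rangle \, \langle T_j(t) \rangle $$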

and normalizing it. So, this is about how temperature anomalies outside the El Niño basin are correlated to temperature anomalies inside this basin at earlier times.
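Here is one way the time-delayed cross-correlation could be sketched in Python. Note the hedge: instead of the normalization exactly as written by Ludescher *et al*, this version divides by the ordinary standard deviations over the same window, which is only one reading of their formula. The sign convention for `tau` follows the text: a negative `tau` delays node $j$, a positive one delays node $i$. The function name and test series are made up.

```python
from math import sin, sqrt

def avg(values):
    return sum(values) / len(values)

def cross_correlation(Ti, Tj, t, tau, window=365):
    """Time-delayed cross-correlation of anomaly series Ti, Tj (lists by day).

    tau <= 0: Ti at day t against Tj taken |tau| days earlier.
    tau > 0:  Ti taken tau days earlier against Tj at day t.
    Normalized by plain windowed standard deviations (an assumption).
    """
    lag_i, lag_j = (tau, 0) if tau > 0 else (0, -tau)
    if t - max(lag_i, lag_j) - (window - 1) < 0:
        raise ValueError("not enough history before day t")
    x = [Ti[t - lag_i - d] for d in range(window)]
    y = [Tj[t - lag_j - d] for d in range(window)]
    cov = avg([a * b for a, b in zip(x, y)]) - avg(x) * avg(y)
    sd_x = sqrt(avg([a * a for a in x]) - avg(x) ** 2)
    sd_y = sqrt(avg([b * b for b in y]) - avg(y) ** 2)
    return cov / (sd_x * sd_y)

# Made-up series: node j shows today what node i will show 3 days later.
toy_i = [sin(0.7 * k) + 0.3 * sin(2.1 * k) for k in range(40)]
toy_j = toy_i[3:]
print(cross_correlation(toy_i, toy_j, t=30, tau=-3, window=10))
```

Since the toy series at $j$ is just the series at $i$ shifted by 3 days, the correlation at `tau=-3` comes out essentially equal to 1.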

Next, for nodes $i$ and $j$, and for each time $t$, they determine the maximum, the mean and the standard deviation of $|C_{i,j}^{t}(\tau)|$, as $\tau$ ranges from -200 to 200 days.

They define the **link strength** $S_{i j}(t)$ as the difference between the maximum and the mean value, divided by the standard deviation.

Finally, they let $S(t)$ be the **average link strength**, calculated by averaging $S_{i j}(t)$ over all pairs $(i,j)$ where $i$ is a node in the El Niño basin and $j$ is a node outside.
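These last two steps are easy to sketch in Python, given the cross-correlations for one pair of nodes at all 401 lags. Whether the standard deviation should be the population or the sample version isn’t stated above, so the population version here is an assumption. The function names and numbers are made up.

```python
from statistics import mean, pstdev

def link_strength(correlations):
    """(max - mean) / standard deviation of |C| over all lags.

    `correlations` holds the cross-correlation for each lag of one node pair.
    """
    magnitudes = [abs(c) for c in correlations]
    return (max(magnitudes) - mean(magnitudes)) / pstdev(magnitudes)

def average_link_strength(strengths):
    """Mean link strength over all pairs; `strengths` maps (i, j) -> value."""
    return mean(strengths.values())

# Made-up correlations for one pair: one lag stands out from the rest.
toy_correlations = [0.1, -0.1, 0.1, -0.5, 0.1]
print(link_strength(toy_correlations))

# Made-up link strengths for two pairs.
toy_strengths = {(0, 0): 2.0, (0, 1): 3.0}
print(average_link_strength(toy_strengths))
```

So a link is ‘strong’ when the correlation at some particular lag sticks out well above the typical level across all lags. With 14 nodes in the basin and 193 outside, the real average runs over 14 × 193 = 2702 pairs.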

They compute $S(t)$ for every 10th day between January 1950 and November 2013. When $S(t)$ goes over 2.82, and the Niño 3.4 index is not *already* over 0.5°C, they predict an El Niño in the next calendar year.
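The alarm rule can be sketched like this in Python. I’m reading ‘goes over’ as crossing the threshold from below between two consecutive samples, and I assume the Niño 3.4 index has already been resampled to the same times as the link strength. Both are assumptions, and all the sample values are made up.

```python
def alarms(times, strength, nino34, threshold=2.82, nino_limit=0.5):
    """Times at which the average link strength rises above `threshold`
    while the Niño 3.4 index is not already above `nino_limit`."""
    result = []
    for k in range(1, len(times)):
        crossed = strength[k - 1] <= threshold < strength[k]
        if crossed and nino34[k] <= nino_limit:
            result.append(times[k])
    return result

# Made-up samples taken every 10th day.
toy_times = [0, 10, 20, 30, 40, 50]
toy_strength = [2.70, 2.85, 2.80, 2.90, 2.75, 2.95]
toy_nino34 = [0.1, 0.2, 0.3, 0.8, 0.9, 0.4]
print(alarms(toy_times, toy_strength, toy_nino34))  # [10, 50]
```

The crossing at day 30 raises no alarm, because the index is already over 0.5°C there.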

There’s more to say about their methods. We’d like you to help us check their work and improve it. Soon I want to show you Graham Jones’ software for replicating their calculations! But right now I just want to conclude by:

• mentioning a potential problem in the math, and

• telling you where to get the data used by Ludescher *et al*.

### Mathematical nuances

Ludescher *et al* normalize the time-delayed cross-covariance in a somewhat odd way. They claim to divide it by

$$ \sqrt{\langle (T_i(t) - \langle T_i(t) \rangle)^2 \rangle} \; \sqrt{\langle (T_j(t - \tau) - \langle T_j(t - \tau) \rangle)^2 \rangle} $$

This is a strange thing, since it has nested angle brackets. The angle brackets are defined as a running average over the last 365 days, so this quantity involves data going back twice as long: 730 days. Furthermore, the ‘link strength’ involves the above expression where $\tau$ goes up to 200 days.

So, taking their definitions at face value, Ludescher *et al* could not actually compute their ‘link strength’ until 930 days after the surface temperature data first starts at the beginning of 1948. That would be *late 1950*. But their graph of the link strength starts at the *beginning* of 1950!

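Perhaps they actually normalized the time-delayed cross-covariance by dividing it by this:

$$ \sqrt{\langle T_i(t)^2 \rangle - \langle T_i(t) \rangle^2} \; \sqrt{\langle T_j(t - \tau)^2 \rangle - \langle T_j(t - \tau) \rangle^2} $$

This simpler expression avoids nested angle brackets, and it makes more sense conceptually. It is the standard deviation of $T_i(t)$ over the last 365 days, times the standard deviation of $T_j(t - \tau)$ over the last 365 days.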

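As Nadja Kutz noted, the expression written by Ludescher *et al* does not equal this simpler expression, since:

$$ \langle T_i(t) \, \langle T_i(t) \rangle \rangle \neq \langle T_i(t) \rangle \, \langle T_i(t) \rangle $$

The reason is that

$$ \langle T_i(t) \, \langle T_i(t) \rangle \rangle = \frac{1}{365} \sum_{d = 0}^{364} T_i(t - d) \, \langle T_i(t - d) \rangle = \frac{1}{365} \sum_{d = 0}^{364} \frac{1}{365} \, T_i(t - d) \sum_{D = 0}^{364} T_i(t - d - D) $$

which is generically different from

$$ \langle T_i(t) \rangle \, \langle T_i(t) \rangle = \left( \frac{1}{365} \sum_{d = 0}^{364} T_i(t - d) \right) \left( \frac{1}{365} \sum_{D = 0}^{364} T_i(t - D) \right) $$

since the terms in the former expression contain products such as $T_i(t - 364) \, T_i(t - 728)$, reaching back more than 364 days, that can’t appear in the latter.

Moreover:

$$ \langle (T_i(t) - \langle T_i(t) \rangle)^2 \rangle = \langle T_i(t)^2 - 2 \, T_i(t) \langle T_i(t) \rangle + \langle T_i(t) \rangle^2 \rangle = \langle T_i(t)^2 \rangle - 2 \, \langle T_i(t) \langle T_i(t) \rangle \rangle + \langle \langle T_i(t) \rangle^2 \rangle $$

But since $\langle T_i(t) \langle T_i(t) \rangle \rangle \neq \langle T_i(t) \rangle \langle T_i(t) \rangle$, as was just shown, those terms do not cancel out in the above expression. In particular, this means that

$$ -2 \, \langle T_i(t) \langle T_i(t) \rangle \rangle + \langle \langle T_i(t) \rangle^2 \rangle $$

contains terms involving $T_i(t - 364 - 364)$ which do not appear in $\langle T_i(t) \rangle^2$, hence

$$ \langle (T_i(t) - \langle T_i(t) \rangle)^2 \rangle \neq \langle T_i(t)^2 \rangle - \langle T_i(t) \rangle^2 $$

So at least for the case of the standard deviation it is clear that those two definitions are not the same for a running mean. For the covariances this would still need to be shown.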
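A quick numerical check makes the difference visible. This Python sketch compares the nested-bracket version of the variance with the simple one, using a window of 3 days instead of 365 so the arithmetic can be checked by hand. The function names and the series are made up.

```python
def running_mean(f, t, window):
    """Mean of f[t - d] for d = 0, ..., window - 1."""
    return sum(f[t - d] for d in range(window)) / window

def nested_variance(f, t, window):
    """Running mean of (f - running mean of f)^2, where the inner mean
    is re-evaluated on each day of the outer window."""
    return sum(
        (f[t - d] - running_mean(f, t - d, window)) ** 2 for d in range(window)
    ) / window

def simple_variance(f, t, window):
    """Running mean of f^2 minus the square of the running mean of f."""
    mean_sq = running_mean([v * v for v in f], t, window)
    return mean_sq - running_mean(f, t, window) ** 2

toy_series = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]  # made-up values
print(nested_variance(toy_series, t=5, window=3))
print(simple_variance(toy_series, t=5, window=3))
```

The two printed numbers are close but not equal. Note also that the nested version reads values as far back as day `t - 2 * (window - 1)`, while the simple version only goes back to day `t - (window - 1)`.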

### Surface air temperatures

Remember that $\tilde{T}_i(t)$ is the average surface air temperature at the grid point $i$ on day $t$. You can get these temperatures from here:

• Earth System Research Laboratory, NCEP Reanalysis Daily Averages Surface Level, or ftp site.

These sites will give you worldwide daily average temperatures on a 2.5° latitude × 2.5° longitude grid (144 × 73 grid points), from 1948 to now. The website will help you get data from within a chosen rectangle in a grid, for a chosen time interval. Alternatively, you can use the ftp site to download temperatures worldwide one year at a time. Either way, you’ll get ‘NetCDF files’, a format we will discuss later, when we get into more details about programming!
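To see where the 144 × 73 comes from, and how a latitude and longitude might map to a position in such a grid, here is a small Python sketch. The ordering it assumes (rows running from 90°N down to 90°S, columns from 0°E up to 357.5°E) is my assumption and should be checked against the coordinate variables in the actual files.

```python
STEP = 2.5                    # grid spacing in degrees
N_LON = int(360 / STEP)       # 144 longitudes: 0E, 2.5E, ..., 357.5E
N_LAT = int(180 / STEP) + 1   # 73 latitudes: both poles are included

def grid_index(lat, lon_east):
    """(row, column) of the nearest grid point, assuming rows run
    90N -> 90S and columns run 0E -> 357.5E."""
    row = round((90.0 - lat) / STEP)
    col = round((lon_east % 360.0) / STEP) % N_LON
    return row, col

print(N_LON, N_LAT)              # 144 73
print(grid_index(0.0, 190.0))    # equator at 170W, i.e. 190E
print(grid_index(0.0, -170.0))   # same point, longitude given as negative
```

Longitudes west of Greenwich are handled by wrapping them into the range from 0° to 360° east.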

### Niño 3.4

**Niño 3.4** is the area-averaged sea surface temperature anomaly in the region 5°S-5°N and 170°-120°W. You can get Niño 3.4 data here:

• Niño 3.4 data since 1870 calculated from the HadISST1, NOAA. Discussed in N. A. Rayner *et al*, Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century, *J. Geophys. Res.* **108** (2003), 4407.

You can also get Niño 3.4 data here:

• Monthly Niño 3.4 index, Climate Prediction Center, National Weather Service.

The actual temperatures in Celsius are close to those at the other website, but the anomalies are rather different, because they’re computed in a way that takes global warming into account. See the website for details.

Niño 3.4 is just one of several official regions in the Pacific:

• Niño 1: 80°W-90°W and 5°S-10°S.

• Niño 2: 80°W-90°W and 0°S-5°S.

• Niño 3: 90°W-150°W and 5°S-5°N.

• Niño 3.4: 120°W-170°W and 5°S-5°N.

• Niño 4: 160°E-150°W and 5°S-5°N.
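For programming purposes, the regions listed above can be written as boxes in degrees east and degrees north. The Python sketch below does that and computes an area average over a region. The weighting by the cosine of latitude is a common convention that I’m assuming here, not something taken from the sources above, and the sample field is made up.

```python
from math import cos, radians

# (west, east) in degrees east, then (south, north) in degrees north.
NINO_REGIONS = {
    "Niño 1": ((270.0, 280.0), (-10.0, -5.0)),
    "Niño 2": ((270.0, 280.0), (-5.0, 0.0)),
    "Niño 3": ((210.0, 270.0), (-5.0, 5.0)),
    "Niño 3.4": ((190.0, 240.0), (-5.0, 5.0)),
    "Niño 4": ((160.0, 210.0), (-5.0, 5.0)),
}

def in_region(name, lat, lon_east):
    """True if the point lies inside the named box (edges included)."""
    (west, east), (south, north) = NINO_REGIONS[name]
    return west <= lon_east % 360.0 <= east and south <= lat <= north

def area_average(name, field):
    """cos(latitude)-weighted mean of field[(lat, lon_east)] over the
    grid points that fall inside the named region."""
    total = weight_sum = 0.0
    for (lat, lon), value in field.items():
        if in_region(name, lat, lon):
            weight = cos(radians(lat))
            total += weight * value
            weight_sum += weight
    return total / weight_sum

# Made-up anomalies at three equatorial points; the last is outside Niño 3.4.
toy_field = {(0.0, 200.0): 1.0, (0.0, 230.0): 3.0, (0.0, 100.0): 50.0}
print(area_average("Niño 3.4", toy_field))  # 2.0
```

For example, 170°W becomes 190°E and 120°W becomes 240°E, which gives the Niño 3.4 box used above.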

For more details, read this:

• Kevin E. Trenberth, The definition of El Niño, *Bulletin of the American Meteorological Society* **78** (1997), 2771–2777.