Networks in Climate Science

29 November, 2014

What follows is a draft of a talk I’ll be giving at the Neural Information Processing Seminar on December 10th. The actual talk may contain more stuff—for example, more work that Dara Shayda has done. But I’d love comments, so I’m posting this draft now in the hope that you can help out.

You can click on any of the pictures to see where it came from or get more information.

Preliminary throat-clearing

I’m very flattered to be invited to speak here. I was probably invited because of my abstract mathematical work on networks and category theory. But when I got the invitation, instead of talking about something I understood, I thought I’d learn about something a bit more practical and talk about that. That was a bad idea. But I’ll try to make the best of it.

I’ve been trying to learn climate science. There’s a subject called ‘complex networks’ where people do statistical analyses of large graphs like the world-wide web or Facebook and draw conclusions from them. People are trying to apply these ideas to climate science. So that’s what I’ll talk about. I’ll be reviewing a lot of other people’s work, but also describing some work by a project I’m involved in, the Azimuth Project.

The Azimuth Project is an all-volunteer project involving scientists and programmers, many outside academia, who are concerned about environmental issues and want to use their skills to help. This talk is based on the work of many people in the Azimuth Project, including Jan Galkowski, Graham Jones, Nadja Kutz, Daniel Mahler, Blake Pollard, Paul Pukite, Dara Shayda, David Tanzer, David Tweed, Steve Wenner and others. Needless to say, I’m to blame for all the mistakes.

Climate variability and El Niño

Okay, let’s get started.

You’ve probably heard about the ‘global warming pause’. Is this a real thing? If so, is it due to ‘natural variability’, heat going into the deep oceans, some combination of both, a massive failure of our understanding of climate processes, or something else?

Here is a chart of global average air temperatures at sea level, put together by NASA’s Goddard Institute for Space Studies:


You can see a lot of fluctuations, including a big dip after 1940 and a tiny dip after 2000. That tiny dip is the so-called ‘global warming pause’. What causes these fluctuations? That’s a big, complicated question.

One cause of temperature fluctuations is a kind of cycle whose extremes are called El Niño and La Niña.

A lot of things happen during an El Niño. For example, during the big El Niño of 1997-1998 we saw all these events:

El Niño is part of an irregular cycle that happens every 3 to 7 years, called the El Niño Southern Oscillation or ENSO. Two strongly correlated signs of an El Niño are:

1) Increased sea surface temperatures in a patch of the Pacific called the Niño 3.4 region. The temperature anomaly in this region—how much warmer it is than usual for that time of year—is called the Niño 3.4 index.

2) A decrease in air pressures in the western side of the Pacific compared to those further east. This is measured by the Southern Oscillation Index or SOI.

You can see the correlation here:

El Niños are important because they can cause billions of dollars of economic damage. They also seem to bring heat stored in the deeper waters of the Pacific into the atmosphere. So, one reason for the ‘global warming pause’ may be that we haven’t had a strong El Niño since 1998. The global warming pause might end with the next El Niño. For a while it seemed we were due for a big one this fall, but that hasn’t happened.

Teleconnections

The ENSO cycle is just one of many cycles involving teleconnections: strong correlations between weather at distant locations, typically thousands of kilometers apart. People have systematically looked for these teleconnections using principal component analysis of climate data, and also other techniques.

The ENSO cycle shows up automatically when you do this kind of study. It stands out as the biggest source of climate variability on time scales greater than a year and less than a decade. Some others include:

• The Pacific-North America Oscillation.
• The Pacific Decadal Oscillation.
• The North Atlantic Oscillation.
• The Arctic Oscillation.

For example, the Pacific Decadal Oscillation is a longer-period relative of the ENSO, centered in the north Pacific:

Complex network theory

Recently people have begun to study teleconnections using ideas from ‘complex network theory’.

What’s that? In complex network theory, people often start with a weighted graph: that is, a set N of nodes and for any pair of nodes i, j \in N, a weight A_{i j}, which can be any nonnegative real number.

Why is this called a weighted graph? It’s really just a matrix of nonnegative real numbers!

The reason is that we can turn any weighted graph into a graph by drawing an edge from node j to node i whenever A_{i j} >0. This is a directed graph, meaning that we should draw an arrow pointing from j to i. We could have an edge from i to j but not vice versa! Note that we can also have an edge from a node to itself.

Conversely, if we have any directed graph, we can turn it into a weighted graph by choosing the weight A_{i j} = 1 when there’s an edge from j to i, and A_{i j} = 0 otherwise.

For example, we can make a weighted graph where the nodes are web pages and A_{i j} is the number of links from the web page j to the web page i.

People in complex network theory like examples of this sort: large weighted graphs that describe connections between web pages, or people, or cities, or neurons, or other things. The goal, so far, is to compute numbers from weighted graphs in ways that describe interesting properties of these complex networks—and then formulate and test hypotheses about the complex networks we see in real life.

The El Niño basin

Here’s a very simple example of what we can do with a weighted graph. For any node i, we can sum up the weights of edges going into i:

\sum_{j \in N} A_{i j}

This is called the degree of the node i. For example, if lots of people have web pages with lots of links to yours, your webpage will have a high degree. If lots of people like you on Facebook, you will have a high degree.

So, the degree is some measure of how ‘important’ a node is.
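
For concreteness, here is a minimal R sketch of this computation, using a small made-up weight matrix (the matrix is invented for illustration):

# A small weighted graph on 4 nodes. Following the convention above,
# A[i, j] is the weight of the edge from node j to node i.
A <- matrix(c(0, 2, 0, 1,
              0, 0, 3, 0,
              1, 1, 0, 0,
              0, 4, 0, 0),
            nrow = 4, byrow = TRUE)

# The degree of node i is the sum of the weights of edges going into i,
# which is the sum of the ith row of A.
degree <- rowSums(A)
degree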

People have constructed climate networks where the nodes are locations on the Earth’s surface, and the weight A_{i j} measures how correlated the weather is at the ith and jth location. Then, the degree says how ‘important’ a given location is for the Earth’s climate—in some vague sense.

For example, in Complex networks in climate dynamics, Donges et al take surface air temperature data on a grid and compute the correlation between grid points.

More precisely, let T_i(t) be the temperature at the ith grid point at month t after the average for that month in all years under consideration has been subtracted off, to eliminate some seasonal variations. They compute the Pearson correlation A_{i j} of T_i(t) and T_j(t) for each pair of grid points i, j. The Pearson correlation is the simplest measure of linear correlation, normalized to range between -1 and 1.

We could construct a weighted graph this way, and it would be symmetric, or undirected:

A_{i j} = A_{j i}

However, Donges et al prefer to work with a graph rather than a weighted graph. So, they create a graph where there is an edge from i to j (and also from j to i) when |A_{i j}| exceeds a certain threshold, and no edge otherwise.

They can adjust this threshold so that any desired fraction of pairs i, j actually have an edge between them. After some experimentation they chose this fraction to be 0.5%.
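
Here is a rough R sketch of this construction. The objects temps (a months × grid points matrix of surface air temperatures) and month (the calendar month of each row) are assumed to exist; this is just an illustration of the recipe, not Donges et al’s actual code.

# 1. Subtract the average for each calendar month to get the anomalies T_i(t).
anom <- temps
for (m in 1:12) {
  rows <- which(month == m)
  anom[rows, ] <- sweep(temps[rows, , drop = FALSE], 2,
                        colMeans(temps[rows, , drop = FALSE]))
}

# 2. Pearson correlation A_{ij} between every pair of grid points.
A <- cor(anom)

# 3. Put an edge between i and j when |A_{ij}| exceeds a threshold chosen
#    so that 0.5% of the pairs are linked.
threshold <- quantile(abs(A[upper.tri(A)]), 0.995)
edge <- abs(A) > threshold
diag(edge) <- FALSE

# 4. The degree of each grid point: how many other grid points it is linked to.
degree <- rowSums(edge)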


A certain patch dominates the world! This is the El Niño basin. The Indian Ocean comes in second.

(Some details, which I may not say:

The Pearson correlation is the covariance

\Big\langle \left( T_i - \langle T_i \rangle \right) \left( T_j - \langle T_j \rangle \right) \Big\rangle

normalized by dividing by the standard deviation of T_i and the standard deviation of T_j.

The reddest shade of red in the above picture shows nodes that are connected to 5% or more of the other nodes. These nodes are connected to at least 10 times as many nodes as average.)

The Pearson correlation detects linear correlations. A more flexible measure is mutual information: how many bits of information the temperature at time t at grid point i gives you about the temperature at the same time at grid point j.
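
Estimating mutual information from data requires some choices; a crude but standard approach is to bin the two temperature series and use the empirical joint distribution. This is just one possible estimator, not necessarily the one Donges et al use, and the bin count and toy series below are arbitrary.

# Mutual information (in bits) between two series x and y, estimated by
# binning each series into nbins equal-width bins.
mutual_info <- function(x, y, nbins = 16) {
  joint <- table(cut(x, nbins), cut(y, nbins)) / length(x)  # joint distribution
  px <- rowSums(joint)
  py <- colSums(joint)
  terms <- joint * log2(joint / outer(px, py))
  sum(terms[joint > 0])                                      # skip empty bins
}

# Toy example with two correlated series:
x <- rnorm(1000)
y <- x + rnorm(1000)
mutual_info(x, y)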

Donges et al create a climate network this way as well, putting an edge between nodes if their mutual information exceeds a certain cutoff. They choose this cutoff so that 0.5% of node pairs have an edge between them, and get the following map:


The result is almost indistinguishable in the El Niño basin. So, this feature is not just an artifact of focusing on linear correlations.

El Niño breaks climate links

We can also look at how climate networks change with time—and in particular, how they are affected by El Niños. This is the subject of a 2008 paper by Tsonis and Swanson, Topology and predictability of El Niño and La Niña networks.

They create a climate network in a way that’s similar to the one I just described. The main differences are that they:

  1. separately create climate networks for El Niño and La Niña time periods;
  2. create a link between grid points when their Pearson correlation has absolute value greater than 0.5;
  3. only use temperature data from November to March in each year, claiming that summertime introduces spurious links.

They get this map for La Niña conditions:


and this map for El Niño conditions:


They conclude that “El Niño breaks climate links”.

This may seem to contradict what I just said a minute ago. But it doesn’t! While the El Niño basin is a region where the surface air temperatures are highly correlated to temperatures at many other points, when an El Niño actually occurs it disrupts correlations between temperatures at different locations worldwide—and even in the El Niño basin!

For the rest of the talk I want to focus on a third claim: namely, that El Niños can be predicted by means of an increase in correlations between temperatures within the El Niño basin and temperatures outside this region. This claim was made in a recent paper by Ludescher et al. I want to examine it somewhat critically.

Predicting El Niños

People really want to predict El Niños, because they have huge effects on agriculture, especially around the Pacific ocean. However, it’s generally regarded as very hard to predict El Niños more than 6 months in advance. There is also a spring barrier: it’s harder to predict El Niños through the spring of any year.

It’s controversial how much of the unpredictability in the ENSO cycle is due to chaos intrinsic to the Pacific ocean system, and how much is due to noise from outside the system. Both may be involved.

There are many teams trying to predict El Niños, some using physical models of the Earth’s climate, and others using machine learning techniques. There is a kind of competition going on, which you can see at a National Oceanic and Atmospheric Administration website.

The most recent predictions give a sense of how hard this job is:

When the 3-month running average of the Niño 3.4 index exceeds 0.5°C for 5 months, we officially declare that there is an El Niño.

As you can see, it’s hard to be sure if there will be an El Niño early next year! However, the consensus forecast is yes, a weak El Niño. This is the best we can do right now: multi-model ensembles have better predictive skill than any single model.

The work of Ludescher et al

The Azimuth Project has carefully examined a 2014 paper by Ludescher et al called Very early warning of next El Niño, which uses a climate network for El Niño prediction.

They build their climate network using correlations between daily surface air temperatures at points inside the El Niño basin and at certain points outside this region, as shown here:


The red dots are the points in their version of the El Niño basin.

(Next I will describe Ludescher’s procedure. I may omit some details in the actual talk, but let me include them here.)

The main idea of Ludescher et al is to construct a climate network that is a weighted graph, and to say an El Niño will occur if the average weight of edges between points in the El Niño basin and points outside this basin exceeds a certain threshold.

As in the other papers I mentioned, Ludescher et al let T_i(t) be the surface air temperature at the ith grid point at time t minus the average temperature at that location at that time of year in all years under consideration, to eliminate the most obvious seasonal effects.

They consider a time-delayed covariance between temperatures at different grid points:

\langle T_i(t) T_j(t - \tau) \rangle - \langle T_i(t) \rangle \langle T_j(t - \tau) \rangle

where \tau is a time delay, and the angle brackets denote a running average over the last year, that is:

\displaystyle{ \langle f(t) \rangle = \frac{1}{365} \sum_{d = 0}^{364} f(t - d) }

where t is the time in days.

They normalize this to define a correlation C_{i,j}^t(\tau) that ranges from -1 to 1.

Next, for any pair of nodes i and j, and for each time t, they determine the maximum, the mean and the standard deviation of |C_{i,j}^t(\tau)|, as the delay \tau ranges from -200 to 200 days.

They define the link strength S_{i,j}(t) as the difference between the maximum and the mean value of |C_{i,j}^t(\tau)|, divided by its standard deviation.

Finally, they let S(t) be the average link strength, calculated by averaging S_{i j}(t) over all pairs i,j where i is a grid point inside their El Niño basin and j is a grid point outside this basin, but still in their larger rectangle.
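
Here is a simplified R sketch of this procedure for a single pair of grid points, as we understand it; Ti and Tj are assumed to be daily temperature anomaly series at the two points, with enough data on both sides of the day t. This is just a sketch of the recipe above, not Ludescher et al’s actual code.

# Time-delayed, normalized correlation C_{i,j}^t(tau). The angle brackets are
# running averages over the 365 days ending at day t, and T_j is shifted by the
# delay tau (so a negative tau compares T_i with later values of T_j).
cross_corr <- function(Ti, Tj, t, tau) {
  days <- (t - 364):t
  a <- Ti[days]
  b <- Tj[days - tau]
  (mean(a * b) - mean(a) * mean(b)) / (sd(a) * sd(b))
}

# Link strength S_{ij}(t): (maximum - mean) / standard deviation of |C|,
# taken over delays tau from -200 to 200 days.
link_strength <- function(Ti, Tj, t) {
  absC <- sapply(-200:200, function(tau) abs(cross_corr(Ti, Tj, t, tau)))
  (max(absC) - mean(absC)) / sd(absC)
}

# The average link strength S(t) is then the mean of link_strength(Ti, Tj, t)
# over all pairs with i inside the El Niño basin and j outside it.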

Here is what they get:


The blue peaks are El Niños: episodes where the Niño 3.4 index is over 0.5°C for at least 5 months.

The red line is their ‘average link strength’. Whenever this exceeds a certain threshold \Theta = 2.82, and the Niño 3.4 index is not already over 0.5°C, they predict an El Niño will start in the following calendar year.

Ludescher et al chose their threshold for El Niño prediction by training their algorithm on climate data from 1948 to 1980, and tested it on data from 1981 to 2013. They claim that with this threshold, their El Niño predictions were correct 76% of the time, and their predictions of no El Niño were correct in 86% of all cases.

On this basis they claimed—when their paper was published in February 2014—that the Niño 3.4 index would exceed 0.5 by the end of 2014 with probability 3/4.

The latest data as of 1 December 2014 seems to say: yes, it happened!

Replication and critique

Graham Jones of the Azimuth Project wrote code implementing Ludescher et al’s algorithm, as best as we could understand it, and got results close to theirs, though not identical. The code is open-source; one goal of the Azimuth Project is to do science ‘in the open’.

More interesting than the small discrepancies between our calculation and theirs is the question of whether ‘average link strengths’ between points in the El Niño basin and points outside are really helpful in predicting El Niños.

Steve Wenner, a statistician helping the Azimuth Project, noted some ambiguities in Ludescher et al’s El Niño prediction rules and disambiguated them in a number of ways. For each way he used Fisher’s exact test to compute the p-value of the null hypothesis that Ludescher et al’s El Niño prediction does not improve the odds that what they predict will occur.

The best he got (that is, the lowest p-value) was 0.03. This is just a bit more significant than the conventional 0.05 threshold for rejecting a null hypothesis.
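
For concreteness, here is what such a test looks like in R. The 2×2 table below is a made-up example of the kind of contingency table involved (predictions versus outcomes), not Steve Wenner’s actual counts.

# Rows: whether an El Niño was predicted for a given year.
# Columns: whether one actually occurred. The counts are invented for illustration.
outcomes <- matrix(c(7, 2,     # predicted:     7 El Niños occurred, 2 did not
                     5, 20),   # not predicted: 5 occurred, 20 did not
                   nrow = 2, byrow = TRUE,
                   dimnames = list(predicted = c("yes", "no"),
                                   el_nino   = c("yes", "no")))

# One-sided test of the null hypothesis that prediction does not improve the odds.
fisher.test(outcomes, alternative = "greater")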

Do high average link strengths between points in the El Niño basin and points elsewhere in the Pacific really increase the chance that an El Niño is coming? It is hard to tell from the work of Ludescher et al.

One reason is that they treat El Niño as a binary condition, either on or off depending on whether the Niño 3.4 index for a given month exceeds 0.5 or not. This is not the usual definition of El Niño, but the real problem is that they are only making a single yes-or-no prediction each year for 65 years: does an El Niño occur during this year, or not? 31 of these years (1950-1980) are used for training their algorithm, leaving just 34 retrodictions and one actual prediction (1981-2013, and 2014).

So, there is a serious problem with small sample size.

We can learn a bit by taking a different approach, and simply running some linear regressions between the average link strength and the Niño 3.4 index for each month. There are 766 months from 1950 to 2013, so this gives us more data to look at. Of course, it’s possible that the relation between average link strength and Niño is highly nonlinear, so a linear regression may not be appropriate. But it is at least worth looking at!

Daniel Mahler and Dara Shayda of the Azimuth Project did this and found the following interesting results.

Simple linear models

Here is a scatter plot showing the Niño 3.4 index as a function of the average link strength on the same month:


(Click on these scatter plots for more information.)

The coefficient of determination, R^2, is 0.0175. In simple terms, this means that the average link strength in a given month explains just 1.75% of the variance of the Niño 3.4 index. That’s quite low!

Here is a scatter plot showing the Niño 3.4 index as a function of the average link strength six months earlier:


Now R^2 is 0.088. So, the link strength explains 8.8% of the variance in the Niño 3.4 index 6 months later. This is still not much—but interestingly, it’s much more than when we try to relate them at the same moment in time! And the p-value is less than 2.2 \cdot 10^{-16}, so the effect is statistically significant.

Of course, we could also try to use Niño 3.4 to predict itself. Here is the Niño 3.4 index plotted against the Niño 3.4 index six months earlier:

Now R^2 = 0.162. So, this is better than using the average link strength!

That doesn’t sound good for average link strength. But now let’s try to predict Niño 3.4 using both itself and the average link strength 6 months earlier. Here is a scatter plot showing that:

Here the x axis is an optimally chosen linear combination of average link strength and Niño 3.4: one that maximizes R^2.

In this case we get R^2 = 0.22.
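
In sketch form, the R calculations look like this. The data frame d, with one row per month and columns nino34 (the Niño 3.4 index) and S (the average link strength), is assumed to exist.

n <- nrow(d)
k <- 6   # lag in months

future_nino <- d$nino34[(k + 1):n]    # Niño 3.4 index, 6 months later
now_nino    <- d$nino34[1:(n - k)]
now_S       <- d$S[1:(n - k)]

summary(lm(future_nino ~ now_S))$r.squared               # link strength alone
summary(lm(future_nino ~ now_nino))$r.squared            # Niño 3.4 alone
summary(lm(future_nino ~ now_nino + now_S))$r.squared    # both together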

Conclusions

What can we conclude from this?

Using a linear model, the average link strength in a given month accounts for only about 8% of the variance of the Niño 3.4 index 6 months in the future. That sounds bad, and indeed it is.

However, there are more interesting things to say than this!

Both the Niño 3.4 index and the average link strength can be computed from the surface air temperature of the Pacific during some window in time. The Niño 3.4 index explains 16% of its own variance 6 months into the future; the average link strength explains 8%, and taken together they explain 22%. So, these two variables contain a fair amount of independent information about the Niño 3.4 index 6 months in the future.

Furthermore, they explain a surprisingly large amount of its variance for just 2 variables.

For comparison, Mahler used a random forest variant called ExtraTreesRegressor to predict the Niño 3.4 index 6 months into the future from much larger collections of data. Out of the 778 months available he trained the algorithm on the first 400 and tested it on the remaining 378.

The result: using a full world-wide grid of surface air temperature values at a given moment in time explains only 23% of the variance of the Niño 3.4 index 6 months into the future. A full grid of surface air pressure values does considerably better, but still explains only 34% of the variance. Using twelve months of the full grid of pressure values only gets around 37%.

From this viewpoint, explaining 22% of the variance with just two variables doesn’t look so bad!

Moreover, while the Niño 3.4 index is maximally correlated with itself at the same moment in time, for obvious reasons, the average link strength is maximally correlated with the Niño 3.4 index 10 months into the future:

(The lines here occur at monthly intervals.)
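
One way to see this is with a cross-correlation function, which computes the correlation at every lag at once. A rough R sketch, using the same hypothetical data frame d as in the regression sketch above:

# Correlation between the Niño 3.4 index and the average link strength at
# lags of -24 to +24 months.
cc <- ccf(d$nino34, d$S, lag.max = 24, plot = FALSE)

# The lag at which the correlation is largest; with the arguments in this order,
# a positive lag means the link strength leads the Niño 3.4 index by that many months.
best <- which.max(cc$acf)
cc$lag[best]
cc$acf[best]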

However, we have not tried to determine if the average link strength as Ludescher et al define it is optimal in this respect. Graham Jones has shown that simplifying their definition of this quantity doesn’t change it much. Maybe modifying their definition could improve it. There seems to be a real phenomenon at work here, but I don’t think we know exactly what it is!

My talk has avoided discussing physical models of the ENSO, because I wanted to focus on very simple, general ideas from complex network theory. However, it seems obvious that really understanding the ENSO requires a lot of ideas from meteorology, oceanography, physics, and the like. I am not advocating a ‘purely network-based approach’.


El Niño Project (Part 8)

14 October, 2014

So far we’ve rather exhaustively studied a paper by Ludescher et al which uses climate networks for El Niño prediction. This time I’d like to compare another paper:

• Y. Berezin, A. Gozolchiani, O. Guez and S. Havlin, Stability of climate networks with time, Scientific Reports 2 (2012).

Some of the authors are the same, and the way they define climate networks is very similar. But their goal here is different: they want to see how stable climate networks are over time. This is important, since the other paper wants to predict El Niños by changes in climate networks.

They divide the world into 9 zones:

For each zone they construct several climate networks. Each one is an array of numbers W_{l r}^y, one for each year y and each pair of grid points l, r in that zone. They call W_{l r}^y a link strength: it’s a measure of how correlated the weather is at those two grid points during that year.

I’ll say more later about how they compute these link strengths. In Part 3 we explained one method for doing it. This paper uses a similar but subtly different method.

The paper’s first big claim is that W_{l r}^y doesn’t change much from year to year, “in complete contrast” to the pattern of local daily air temperature and pressure fluctuations. In simple terms: the strength of the correlation between weather at two different points tends to be quite stable.

Moreover, the definition of link strength involves an adjustable time delay, \tau. We can measure the correlation between the weather at point l at any given time and point r at a time \tau days later. The link strength is computed by taking a maximum over time delays \tau. Naively speaking, the value of \tau that gives the maximum correlation is “how long it typically takes for weather at point l to affect weather at point r”. Or the other way around, if \tau is negative.

This is a naive way of explaining the idea, because I’m mixing up correlation with causation. But you get the idea, I hope.

Their second big claim is that when the link strength between two points l and r is big, the value of \tau that gives the maximum correlation doesn’t change much from year to year. In simple terms: if the weather at two locations is strongly correlated, the amount of time it takes for weather at one point to reach the other point doesn’t change very much.

The data

How do Berezin et al define their climate network?

They use data obtained from here:

NCEP-DOE Reanalysis 2.

This is not exactly the same data set that Ludescher et al use, namely:

NCEP/NCAR Reanalysis 1.

“Reanalysis 2” is a newer attempt to reanalyze and fix up the same pile of data. That’s a very interesting issue, but never mind that now!

Berezin et al use data for:

• the geopotential height for six different pressures

and

• the air temperature at those different heights

The geopotential height for some pressure says roughly how high you have to go for air to have that pressure. Click the link if you want a more precise definition! Here’s the geopotential height field for the pressure of 500 millibars on some particular day of some particular year:



The height is in meters.

Berezin et al use daily values for this data for:

• locations world-wide on a grid with a resolution of 5° × 5°,

during:

• the years from 1948 to 2006.

They divide the globe into 9 zones, and separately study each zone:


So, they’ve got twelve different functions of space and time, where space is a rectangle discretized using a 5° × 5° grid, and time is discretized in days. From each such function they build a ‘climate network’.

How do they do it?

The climate networks

Berezin et al’s method of defining a climate network is similar to Ludescher et al’s, but different. Compare Part 3 if you want to think about this.

Let \tilde{S}^y_l(t) be any one of their functions, evaluated at the grid point l on day t of year y.

Let S_l^y(t) be \tilde{S}^y_l(t) minus its climatological average. For example, if t is June 1st and y is 1970, we average the temperature at location l over all June 1sts from 1948 to 2006, and subtract that from \tilde{S}^y_l(t) to get S^y_l(t). In other words:

\displaystyle{  S^y_l(t) = \tilde{S}^y_l(t) - \frac{1}{N} \sum_y \tilde{S}^y_l(t)  }

where N is the number of years considered.

For any function of time f, let \langle f^y(t) \rangle be the average of the function over all days in year y. This is different from the ‘running average’ used by Ludescher et al, and I can’t even be 100% sure that Berezin et al mean what I just said: they just use the notation \langle f^y(t) \rangle.

Let l and r be two grid points, and \tau any number of days in the interval [-\tau_{\mathrm{max}}, \tau_{\mathrm{max}}]. Define the cross-covariance for year y by:

\Big\langle \Big(f_l(t) - \langle f_l(t) \rangle\Big) \; \Big( f_r(t + \tau) - \langle f_r(t + \tau) \rangle \Big) \Big\rangle

I believe Berezin et al mean to consider this quantity, because they mention two grid points l and r. Their notation omits the subscripts l and r, so it is impossible to be completely sure what they mean! But what I wrote is the reasonable quantity to consider here, so I’ll assume this is what they meant.

They normalize this quantity and take its absolute value, forming:

\displaystyle{ X_{l r}^y(\tau) = \frac{\Big|\Big\langle \Big(f_l(t) - \langle f_l(t) \rangle\Big) \; \Big( f_r(t + \tau) - \langle f_r(t + \tau) \rangle \Big)\Big\rangle\Big|}   {\sqrt{\Big\langle \Big(f_l(t)      - \langle f_l(t)\rangle \Big)^2 \Big\rangle  }  \; \sqrt{\Big\langle \Big(f_r(t+\tau) - \langle f_r(t+\tau)\rangle\Big)^2 \Big\rangle  } }  }

They then take the maximum value of X_{l r}^y(\tau) over delays \tau \in [-\tau_{\mathrm{max}}, \tau_{\mathrm{max}}], subtract its mean over delays in this range, and divide by the standard deviation. They write something like this:

\displaystyle{ W_{l r}^y = \frac{\mathrm{MAX}\Big( X_{l r}^y - \langle X_{l r}^y\rangle \Big) }{\mathrm{STD} X_{l r}^y} }

and say that the maximum, mean and standard deviation are taken over the (not written) variable \tau \in [-\tau_{\mathrm{max}}, \tau_{\mathrm{max}}].

Each number W_{l r}^y is called a link strength. For each year, the matrix of numbers W_{l r}^y where l and r range over all grid points in our zone is called a climate network.
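
In sketch form, and with the same caveats about the ambiguities above, the computation for one pair of grid points and one year might look like this in R. Here fl and fr are assumed to be daily anomaly series at the two points, days is the set of day indices belonging to year y, and tau_max is a placeholder value.

tau_max <- 200   # placeholder; not necessarily the value used in the paper

X <- sapply(-tau_max:tau_max, function(tau) {
  a <- fl[days]
  b <- fr[days + tau]
  abs(mean(a * b) - mean(a) * mean(b)) / (sd(a) * sd(b))
})

# Link strength for this pair and this year:
# (maximum - mean) / standard deviation, taken over the delays tau.
W <- (max(X) - mean(X)) / sd(X)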

We can think of a climate network as a weighted complete graph with the grid points l as nodes. Remember, an undirected graph is one without arrows on the edges. A complete graph is an undirected graph with one edge between any pair of nodes:



A weighted graph is an undirected graph where each edge is labelled by a number called its weight. But right now we’re also calling the weight the ‘link strength’.

A lot of what’s usually called ‘network theory’ is the study of weighted graphs. You can learn about it here:

• Ernesto Estrada, The Structure of Complex Networks: Theory and Applications, Oxford U. Press, Oxford, 2011.

Suffice it to say that given a weighted graph, there are a lot of quantities you can compute from it, which are believed to tell us interesting things!

The conclusions

I will not delve into the real meat of the paper, namely what they actually do with their climate networks! The paper is free online, so you can read this yourself.

I will just quote their conclusions and show you a couple of graphs.

The conclusions touch on an issue that’s important for the network-based approach to El Niño prediction. If climate networks are ‘stable’, not changing much in time, why would we use them to predict a time-dependent phenomenon like the El Niño Southern Oscillation?

We have established the stability of the network of connections between the dynamics of climate variables (e.g. temperatures and geopotential heights) in different geographical regions. This stability stands in fierce contrast to the observed instability of the original climatological field pattern. Thus the coupling between different regions is, to a large extent, constant and predictable. The links in the climate network seem to encapsulate information that is missed in analysis of the original field.

The strength of the physical connection, W_{l r}, that each link in this network represents, changes only between 5% to 30% over time. A clear boundary between links that represent real physical dependence and links that emerge due to noise is shown to exist. The distinction is based on both the high link average strength \overline{W_{l r}} and on the low variability of time delays \mathrm{STD}(T_{l r}).

Recent studies indicate that the strength of the links in the climate network changes during the El Niño Southern Oscillation and the North Atlantic Oscillation cycles. These changes are within the standard deviation of the strength of the links found here. Indeed in Fig. 3 it is clearly seen that the coefficient of variation of links in the El Niño basin (zone 9) is larger than other regions such as zone 1. Note that even in the El Niño basin the coefficient of variation is relatively small (less than 30%).

Beside the stability of single links, also the hierarchy of the link strengths in the climate network is preserved to a large extent. We have shown that this hierarchy is partially due to the two dimensional space in which the network is embedded, and partially due to pure physical coupling processes. Moreover the contribution of each of these effects, and the level of noise was explicitly estimated. The spatial effect is typically around 50% of the observed stability, and the noise reduces the stability value by typically 5%–10%.

The network structure was further shown to be consistent across different altitudes, and a monotonic relation between the altitude distance and the correspondence between the network structures is shown to exist. This yields another indication that the observed network structure represents effects of physical coupling.

The stability of the network and the contributions of different effects were summarized in specific relation to different geographical areas, and a clear distinction between equatorial and off–equatorial areas was observed. Generally, the network structure of equatorial regions is less stable and more fluctuative.

The stability and consistence of the network structure during time and across different altitudes stands in contrast to the known unstable variability of the daily anomalies of climate variables. This contrast indicates an analogy between the behavior of nodes in the climate network and the behavior of coupled chaotic oscillators. While the fluctuations of each coupled oscillators are highly erratic and unpredictable, the interactions between the oscillators is stable and can be predicted. The possible outreach of such an analogy lies in the search for known behavior patterns of coupled chaotic oscillators in the climate system. For example, existence of phase slips in coupled chaotic oscillators is one of the fingerprints for their cooperated behavior, which is evident in each of the individual oscillators. Some abrupt changes in climate variables, for example, might be related to phase slips, and can be understood better in this context.

On the basis of our measured coefficient of variation of single links (around 15%), and the significant overall network stability of 20–40%, one may speculatively assess the extent of climate change. However, for this assessment our current available data is too short and does not include enough time from periods before the temperature trends. An assessment of the relation between the network stability and climate change might be possible mainly through launching of global climate model “experiments” realizing other climate conditions, which we indeed intend to perform.

A further future outreach of our work can be a mapping between network features (such as network motifs) and known physical processes. Such a mapping was previously shown to exist between an autonomous cluster in the climate network and El Niño. Further structures without such a climate interpretation might point towards physical coupling processes which were not observed earlier.

(I have expanded some acronyms and deleted some reference numbers.)

Finally, here are two nice graphs showing the average link strength as a function of distance. The first is based on four climate networks for Zone 1, the southern half of South America:



The second is based on four climate networks for Zone 9, a big patch of the Pacific north of the Equator which roughly corresponds to the ‘El Niño basin’:



As we expect, temperatures and geopotential heights get less correlated at points further away. But the rate at which the correlation drops off conveys interesting information! Graham Jones has made some interesting charts of this for the rectangle of the Pacific that Ludescher et al use for El Niño prediction, and I’ll show you those next time.

The series so far

El Niño project (part 1): basic introduction to El Niño and our project here.

El Niño project (part 2): introduction to the physics of El Niño.

El Niño project (part 3): summary of the work of Ludescher et al.

El Niño project (part 4): how Graham Jones replicated the work by Ludescher et al, using software written in R.

El Niño project (part 5): how to download R and use it to get files of climate data.

El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher et al.

El Niño project (part 7): the definition of El Niño.

El Niño project (part 8): Berezin et al on the stability of climate networks.


Exploring Climate Data (Part 2)

16 September, 2014

guest post by Blake Pollard

I have been learning to make animations using R. This is an animation of the profile of the surface air temperature at the equator. So, the x axis here is the longitude, approximately from 120° E to 280° E. I pulled the data from the region that Graham Jones specified in his code on github: it’s the equatorial line in the region that Ludescher et al used:

For this animation I tried to show the 1997-1998 El Niño. Typically the Pacific is much cooler near South America, due to the upwelling of deep cold water:

(Click for more information.) That part of the Pacific gets even cooler during La Niña:

But it warms up during El Niños:


You can see that in the surface air temperature during the 1997-1998 El Niño, although by summer of 1998 things seem to be getting back to normal:

I want to practice making animations like this. I could make a much prettier and better-labelled animation that ran all the way from 1948 to today, but I wanted to think a little about what exactly is best to plot if we want to use it as an aid to understanding some of this El Niño business.


El Niño Project (Part 7)

18 August, 2014

So, we’ve seen that Ludescher et al have a way to predict El Niños. But there’s something a bit funny: their definition of El Niño is not the standard one!

Precisely defining a complicated climate phenomenon like El Niño is a tricky business. Lots of different things tend to happen when an El Niño occurs. In 1997-1998, we saw these:


But what if just some of these things happen? Do we still have an El Niño or not? Is there a right answer to this question, or is it partially a matter of taste?

A related puzzle: is El Niño a single phenomenon, or several? Could there be several kinds of El Niño? Some people say there are.

Sometime I’ll have to talk about this. But today let’s start with the basics: the standard definition of El Niño. Let’s see how this differs from Ludescher et al’s definition.

The standard definition

The most standard definitions use the Oceanic Niño Index or ONI, which is the running 3-month mean of the Niño 3.4 index:

• An El Niño occurs when the ONI is over 0.5 °C for at least 5 months in a row.

• A La Niña occurs when the ONI is below -0.5 °C for at least 5 months in a row.

Of course I should also say exactly what the ‘Niño 3.4 index’ is, and what the ‘running 3-month mean’ is.

The Niño 3.4 index is the area-averaged, time-averaged sea surface temperature anomaly for a given month in the region 5°S-5°N and 170°-120°W:

Here anomaly means that we take the area-averaged, time-averaged sea surface temperature for a given month — say February — and subtract off the historical average of this quantity — that is, for Februaries of other years on record.

If you’re clever you can already see room for subtleties and disagreements. For example, you can get sea surface temperatures in the Niño 3.4 region here:

Niño 3.4 data since 1870 calculated from the HadISST1, NOAA. Discussed in N. A. Rayner et al, Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century, J. Geophys. Res. 108 (2003), 4407.

However, they don’t actually provide the Niño 3.4 index.

You can get the Niño 3.4 index here:

TNI (Trans-Niño Index) and N3.4 (Niño 3.4 Index), NCAR.

You can also get it from here:

Monthly Niño 3.4 index, Climate Prediction Center, National Weather Service.

The actual temperatures in Celsius on the two websites are quite close — but the anomalies are rather different, because the second one ‘subtracts off the historical average’ in a way that takes global warming into account. For example, to compute the Niño 3.4 index in June 1952, instead of taking the average temperature that month and subtracting off the average temperature for all Junes on record, they subtract off the average for Junes in the period 1936-1965. Averages for different periods are shown here:

You can see how these curves move up over time: that’s global warming! It’s interesting that they go up fastest during the cold part of the year. It’s also interesting to see how gentle the seasons are in this part of the world. In the old days, the average monthly temperatures ranged from 26.2 °C in the winter to 27.5 °C in the summer — a mere 1.3 °C fluctuation.

Finally, to compute the ONI in a given month, we take the average of the Niño 3.4 index in that month, the month before, and the month after. This definition of running 3-month mean has a funny feature: we can’t know the ONI for this month until next month!
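
To make this concrete, here is a small R sketch of the standard definition. It assumes a monthly vector nino34 of Niño 3.4 index values; everything else follows the rules above.

n <- length(nino34)

# ONI: the running 3-month mean, centered on each month
# (undefined for the first and last month).
oni <- c(NA, (nino34[1:(n - 2)] + nino34[2:(n - 1)] + nino34[3:n]) / 3, NA)

# El Niño months: part of a run of at least 5 consecutive months with ONI > 0.5.
warm <- oni > 0.5
warm[is.na(warm)] <- FALSE
runs <- rle(warm)
runs$values <- runs$values & runs$lengths >= 5
el_nino <- inverse.rle(runs)   # TRUE for months belonging to an El Niño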

You can get a table of the ONI here:

Cold and warm episodes by season, Climate Prediction Center, National Weather Service.

It’s not particularly computer-readable.

Ludescher et al

Now let’s compare Ludescher et al. They say there’s an El Niño when the Niño 3.4 index is over 0.5°C for at least 5 months in a row. By not using the ONI — by using the Niño 3.4 index instead of its 3-month running mean — they could be counting some short ‘spikes’ in the Niño 3.4 index as El Niños, that wouldn’t count as El Niños by the usual definition.

I haven’t carefully checked to see how much changing the definition would affect the success rate of their predictions. To be fair, we should also let them change the value of their parameter θ, which is tuned to be good for predicting El Niños in their setup. But we can see that there could be some ‘spike El Niños’ in this graph of theirs, that might go away with a different definition. These are places where the Niño 3.4 index stays over 0.5 °C for 5 months, but not much longer:



Let’s look at the spike around 1975. See that green arrow at the beginning of 1975? That means Ludescher et al are claiming to successfully predict an El Niño sometime in the next calendar year. We can zoom in for a better look:



The tiny blue bumps are where the Niño 3.4 index exceeds 0.5.

Let’s compare the ONI as computed by the National Weather Service, month by month, with El Niños in red and La Niñas in blue:

1975: 0.5, -0.5, -0.6, -0.7, -0.8, -1.0, -1.1, -1.2, -1.4, -1.5, -1.6, -1.7

1976: -1.5, -1.1, -0.7, -0.5, -0.3, -0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.8

1977: 0.6, 0.6, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.7, 0.8, 0.8

1978: 0.7, 0.5, 0.1, -0.2, -0.3, -0.3, -0.3, -0.4, -0.4, -0.3, -0.1, -0.1

So indeed an El Niño started in September 1976. The ONI only stayed above 0.5 for 6 months, but that’s enough. Ludescher and company luck out!

Just for fun, let’s look at the National Weather service Niño 3.4 index to see what that’s like:

1975: -0.33, -0.48, -0.72, -0.54, -0.68, -1.17, -1.07, -1.19, -1.36, -1.69, -1.45, -1.76

1976: -1.78, -1.10, -0.55, -0.53, -0.33, -0.10, 0.20, 0.39, 0.49, 0.88, 0.85, 0.63

So, this exceeded 0.5 in October 1976. That’s when Ludescher et al would say the El Niño starts, if they used the National Weather Service data.

Let’s also compare the NCAR Niño 3.4 index:

1975: -0.698, -0.592, -0.579, -0.801, -1.025, -1.205, -1.435, -1.620, -1.699, -1.855, -2.041, -1.960

1976: -1.708, -1.407, -1.026, -0.477, -0.095, 0.167, 0.465, 0.805, 1.039, 1.137, 1.290, 1.253

It’s pretty different! But it also gives an El Niño in 1976 according to Ludescher et al’s definition: the Niño 3.4 index exceeds 0.5 starting in August 1976.

For further study

This time we didn’t get into the interesting question of why one definition of El Niño is better than another. For that, try:

• Kevin E. Trenberth, The definition of El Niño, Bulletin of the American Meteorological Society 78 (1997), 2771–2777.

There could also be fundamentally different kinds of El Niño. For example, besides the usual sort where high sea surface temperatures are centered in the Niño 3.4 region, there could be another kind centered farther west near the International Date Line. This is called the dateline El Niño or El Niño Modoki. For more, try this:

• Nathaniel C. Johnson, How many ENSO flavors can we distinguish?, Journal of Climate 26 (2013), 4816-4827.

which has lots of references to earlier work. Here, to whet your appetite, is his picture showing the 9 most common patterns of sea surface temperature anomalies in the Pacific:

At the bottom of each is a percentage showing how frequently that pattern has occurred from 1950 to 2011. To get these pictures Johnson used something called a ‘self-organizing map analysis’ – a fairly new sort of cluster analysis done using neural networks. This is the sort of thing I hope we get into as our project progresses!

The series so far

Just in case you want to get to old articles, here’s the story so far:

El Niño project (part 1): basic introduction to El Niño and our project here.

El Niño project (part 2): introduction to the physics of El Niño.

El Niño project (part 3): summary of the work of Ludescher et al.

El Niño project (part 4): how Graham Jones replicated the work by Ludescher et al, using software written in R.

El Niño project (part 5): how to download R and use it to get files of climate data.

El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher et al.

El Niño project (part 7): the definition of El Niño.

El Niño project (part 8): Berezin et al on the stability of climate networks.


Exploring Climate Data (Part 1)

1 August, 2014

joint with Dara O Shayda

Emboldened by our experiments in El Niño analysis and prediction, people in the Azimuth Code Project have been starting to analyze weather and climate data. A lot of this work is exploratory, with no big conclusions. But it’s still interesting! So, let’s try some blog articles where we present this work.

This one will be about the air pressure on the island of Tahiti and in a city called Darwin in Australia: how they’re correlated, and how each one varies. This article will also be a quick introduction to some basic statistics, as well as ‘continuous wavelet transforms’.

Darwin, Tahiti and El Niños

The El Niño Southern Oscillation is often studied using the air pressure in Darwin, Australia versus the air pressure in Tahiti. When there’s an El Niño, it gets stormy in the eastern Pacific, so the air pressures tend to be lower in Tahiti and higher in Darwin. When there’s a La Niña, it’s the other way around:



The Southern Oscillation Index or SOI is a normalized version of the monthly mean air pressure anomaly in Tahiti minus that in Darwin. Here anomaly means we subtract off the mean, and normalized means that we divide by the standard deviation.
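
In code, the definition just given amounts to something like this; p_tahiti and p_darwin are assumed to be monthly mean air pressure series, and the official SOI is computed with a more elaborate month-by-month standardization.

# Anomalies: subtract off the mean at each station.
anom_tahiti <- p_tahiti - mean(p_tahiti)
anom_darwin <- p_darwin - mean(p_darwin)

# SOI: the difference, normalized by its standard deviation.
difference <- anom_tahiti - anom_darwin
soi <- difference / sd(difference)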

So, the SOI tends to be negative when there’s an El Niño. On the other hand, when there’s an El Niño the Niño 3.4 index tends to be positive—this says it’s hotter than usual in a certain patch of the Pacific.

Here you can see how this works:



When the Niño 3.4 index is positive, the SOI tends to be negative, and vice versa!

It might be fun to explore precisely how well correlated they are. You can get the data to do that by clicking on the links above.

But here’s another question: how similar are the air pressure anomalies in Darwin and in Tahiti? Do we really need to take their difference, or are they so strongly anticorrelated that either one would be enough to detect an El Niño?

You can get the data to answer such questions here:

Southern Oscillation Index based upon annual standardization, Climate Analysis Section, NCAR/UCAR. This includes links to monthly sea level pressure anomalies in Darwin and Tahiti, in either ASCII format (click the second two links) or netCDF format (click the first one and read the explanation).

In fact this website has some nice graphs already made, which I might as well show you! Here’s the SOI and also the sum of the air pressure anomalies in Darwin and Tahiti, normalized in some way:


(Click to enlarge.)

If the sum were zero, the air pressure anomalies in Darwin and Tahiti would contain the same information and life would be simple. But it’s not!

How similar in character are the air pressure anomalies in Darwin and Tahiti? There are many ways to study this question. Dara tackled it by taking the air pressure anomaly data from 1866 to 2012 and computing some ‘continuous wavelet transforms’ of these air pressure anomalies. This is a good excuse for explaining how a continuous wavelet transform works.

Very basic statistics

It helps to start with some very basic statistics. Suppose you have a list of numbers

x = (x_1, \dots, x_n)

You probably know how to take their mean, or average. People often write this with angle brackets:

\displaystyle{ \langle x \rangle = \frac{1}{n} \sum_{i = 1}^n x_i }

You can also calculate the mean of their squares:

\displaystyle{  \langle x^2 \rangle = \frac{1}{n} \sum_{i = 1}^n x_i^2 }

If you were naive you might think \langle x^2 \rangle = \langle x \rangle^2, but in fact we have:

\langle x^2 \rangle \ge \langle x \rangle^2

and they’re equal only if all the x_i are the same. The point is that if the numbers x_i are spread out, the squares of the big ones (positive or negative) contribute more to the average of the squares than if we had averaged them out before squaring. The difference

\langle x^2 \rangle - \langle x \rangle^2

is called the variance; it says how spread out our numbers are. The square root of the variance is the standard deviation:

\sigma_x = \sqrt{\langle x^2 \rangle - \langle x \rangle^2 }

and this has the slight advantage that if you multiply all the numbers x_i by some constant c, the standard deviation gets multiplied by |c|. (The variance gets multiplied by c^2.)

We can generalize the variance to a situation where we have two lists of numbers:

x = (x_1, \dots, x_n)

y = (y_1, \dots, y_n)

Namely, we can form the covariance

\langle x y \rangle - \langle x \rangle \langle y \rangle

This reduces to the variance when x = y. It measures how much x and y vary together — ‘hand in hand’, as it were. A bit more precisely: if x_i is greater than its mean value mainly for i such that y_i is greater than its mean value, the covariance is positive. On the other hand, if x_i tends to be greater than average when y_i is smaller than average — like with the air pressures at Darwin and Tahiti — the covariance will be negative.

For example, if

x = (1,-1), \quad y = (1,-1)

then they ‘vary hand in hand’, and the covariance

\langle x y \rangle - \langle x \rangle \langle y \rangle = 1 - 0 = 1

is positive. But if

x = (1,-1), \quad y = (-1,1)

then one is positive when the other is negative, so the covariance

\langle x y \rangle - \langle x \rangle \langle y \rangle = -1 - 0 = -1

is negative.

Of course the covariance will get bigger if we multiply both x and y by some big number. If we don’t want this effect, we can normalize the covariance and get the correlation:

\displaystyle{ \frac{ \langle x y \rangle - \langle x \rangle \langle y \rangle }{\sigma_x \sigma_y} }

which will always be between -1 and 1.

For example, if we compute the correlation between the air pressure anomalies at Darwin and Tahiti, measured monthly from 1866 to 2012, we get
-0.253727. This indicates that when one goes up, the other tends to go down. But since we’re not getting -1, it means they’re not completely locked into a linear relationship where one is some negative number times the other.
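
These formulas are easy to play with directly. Here is a tiny R example with made-up lists x and y; R’s built-in cov uses an n − 1 convention rather than n, but the correlation comes out the same either way.

x <- c(1, 3, 2, 5, 4)
y <- c(2, 1, 4, 3, 5)

covariance  <- mean(x * y) - mean(x) * mean(y)
sigma_x     <- sqrt(mean(x^2) - mean(x)^2)
sigma_y     <- sqrt(mean(y^2) - mean(y)^2)
correlation <- covariance / (sigma_x * sigma_y)

correlation
cor(x, y)   # the built-in version agrees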

Okay, we’re almost ready for continuous wavelet transforms! Here is the main thing we need to know. If the mean of either x or y is zero, the formula for covariance simplifies a lot, to

\displaystyle{  \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }

So, this quantity says how much the numbers x_i ‘vary hand in hand’ with the numbers y_i, in the special case when one (or both) has mean zero.

We can do something similar if x, y : \mathbb{R} \to \mathbb{R} are functions of time defined for all real numbers t. The sum becomes an integral, and we have to give up on dividing by n. We get:

\displaystyle{  \int_{-\infty}^\infty x(t) y(t)\; d t }

This is called the inner product of the functions x and y, and often it’s written \langle x, y \rangle, but it’s a lot like the covariance.

Continuous wavelet transforms

What are continuous wavelet transforms, and why should we care?

People have lots of tricks for studying ‘signals’, like series of numbers x_i or functions x : \mathbb{R} \to \mathbb{R}. One method is to ‘transform’ the signal in a way that reveals useful information. The Fourier transform decomposes a signal into sines and cosines of different frequencies. This lets us see how much power the signal has at different frequencies, but it doesn’t reveal how the power at different frequencies changes with time. For that we should use something else, like the Gabor transform explained by Blake Pollard in a previous post.

Sines and cosines are great, but we might want to look for other patterns in a signal. A ‘continuous wavelet transform’ lets us scan a signal for appearances of a given pattern at different times and also at different time scales: a pattern could go by quickly, or in a stretched out slow way.

To implement the continuous wavelet transform, we need a signal and a pattern to look for. The signal could be a function x : \mathbb{R} \to \mathbb{R}. The pattern would then be another function y: \mathbb{R} \to \mathbb{R}, usually called a wavelet.

Here’s an example of a wavelet:


If we’re in a relaxed mood, we could call any function that looks like a bump with wiggles in it a wavelet. There are lots of famous wavelets, but this particular one is the fourth derivative of a certain Gaussian. Mathematica calls this particular wavelet DGaussianWavelet[4], and you can look up the formula under ‘Details’ on their webpage.

However, the exact formula doesn’t matter at all now! If we call this wavelet y, all that matters is that it’s a bump with wiggles on it, and that its mean value is 0, or more precisely:

\displaystyle{ \int_{-\infty}^\infty y(t) \; d t = 0 }

As we saw in the last section, this fact lets us take our function x and the wavelet y and see how much they ‘vary hand in hand’ simply by computing their inner product:

\displaystyle{ \langle x , y \rangle = \int_{-\infty}^\infty x(t) y(t)\; d t }

Loosely speaking, this measures the ‘amount of y-shaped wiggle in the function x’. It’s amazing how hard it is to say something in plain English that perfectly captures the meaning of a simple formula like the above one—so take the quoted phrase with a huge grain of salt. But it gives a rough intuition.

Our wavelet y happens to be centered at t  = 0. However, we might be interested in y-shaped wiggles that are centered not at zero but at some other number s. We could detect these by shifting the function y before taking its inner product with x:

\displaystyle{ \int_{-\infty}^\infty x(t) y(t-s)\; d t }

We could also be interested in measuring the amount of some stretched-out or squashed version of a y-shaped wiggle in the function x. Again we could do this by changing y before taking its inner product with x:

\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t}{P}\right) \; d t }

When P is big, we get a stretched-out version of y. People sometimes call P the period, since the period of the wiggles in y will be proportional to this (though usually not equal to it).

Finally, we can combine these ideas, and compute

\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t- s}{P}\right)\; dt }

This is a function of the shift s and period P which says how much of the s-shifted, P-stretched wavelet y is lurking in the function x. It’s a version of the continuous wavelet transform!

Mathematica implements this idea for time series, meaning lists of numbers x = (x_1,\dots,x_n) instead of functions x : \mathbb{R} \to \mathbb{R}. The idea is that we think of the numbers as samples of a function x:

x_1 = x(\Delta t)

x_2 = x(2 \Delta t)

and so on, where \Delta t is some time step, and replace the integral above by a suitable sum. Mathematica has a function ContinuousWaveletTransform that does this, giving

\displaystyle{  w(s,P) = \frac{1}{\sqrt{P}} \sum_{i = 1}^n x_i \; y\left(\frac{i \Delta t - s}{P}\right) }

The factor of 1/\sqrt{P} in front is a useful extra trick: it’s the right way to compensate for the fact that when you stretch out your wavelet y by a factor of P, it gets bigger. So, when we’re doing integrals, we should define the continuous wavelet transform of x by:

\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \int_{-\infty}^\infty x(t) \; y\left(\frac{t- s}{P}\right)\; dt }
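
Here is a small R sketch of this discretized transform. The wavelet below is the fourth derivative of a Gaussian, up to sign and overall normalization (a stand-in for Mathematica’s DGaussianWavelet[4]); the toy signal, the time step and the grids of shifts and periods are all arbitrary choices for illustration.

# Fourth derivative of exp(-t^2/2), up to an overall constant.
wavelet <- function(t) (t^4 - 6 * t^2 + 3) * exp(-t^2 / 2)

# Continuous wavelet transform of a sampled signal x, following the formula above.
cwt <- function(x, dt, shifts, periods) {
  i <- seq_along(x)
  outer(shifts, periods, Vectorize(function(s, P)
    sum(x * wavelet((i * dt - s) / P)) / sqrt(P)))
}

# Toy signal: a wiggle whose period doubles halfway through.
t <- 1:512
x <- sin(2 * pi * t / 32) * (t <= 256) + sin(2 * pi * t / 64) * (t > 256)

w <- cwt(x, dt = 1, shifts = seq(1, 512, by = 8), periods = seq(4, 64, by = 2))
image(abs(w))   # a crude scalogram: shifts along one axis, periods along the other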

The results

Dara Shayda started with the air pressure anomaly at Darwin and Tahiti, measured monthly from 1866 to 2012. Taking DGaussianWavelet[4] as his wavelet, he computed the continuous wavelet transform w(s,P) as above. To show us the answer, he created a scalogram:


This is a 2-dimensional color plot showing roughly how big the continuous wavelet transform w(s,P) is for different shifts s and periods P. Blue means it’s very small, green means it’s bigger, yellow means even bigger and red means very large.

Tahiti gave this:


You’ll notice that the patterns at Darwin and Tahiti are similar in character, but notably different in detail. For example, the red spots, where our chosen wavelet shows up strongly with period of order ~100 months, occur at different times.

Puzzle 1. What is the meaning of the ‘spikes’ in these scalograms? What sort of signal would give a spike of this sort?

Puzzle 2. Do a Gabor transform, also known as a ‘windowed Fourier transform’, of the same data. Blake Pollard explained the Gabor transform in his article Milankovitch vs the Ice Ages. This is a way to see how much a signal wiggles at a given frequency at a given time: we multiply the signal by a shifted Gaussian and then takes its Fourier transform.

Puzzle 3. Read about continuous wavelet transforms. If we want to reconstruct our signal x from its continuous wavelet transform, why should we use a wavelet y with

\displaystyle{\int_{-\infty}^\infty y(t) \; d t = 0 ? }

In fact we want a somewhat stronger condition, which is implied by the above equation when the Fourier transform of y is smooth and integrable:

Continuous wavelet transform, Wikipedia.

Another way to understand correlations

David Tweed mentioned another approach from signal processing to understanding the quantity

\displaystyle{  \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }

If we’ve got two lists of data x and y that we want to compare to see if they behave similarly, the first thing we ought to do is multiplicatively scale each one so they’re of comparable magnitude. There are various possibilities for assigning a scale, but a reasonable one is to ensure they have equal ‘energy’

\displaystyle{  \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 }

(This can be achieved by dividing each list by its standard deviation, which is equivalent to what was done in the main derivation above.) Once we’ve done that, it’s clear that looking at

\displaystyle{  \sum_{i=1}^n (x_i-y_i)^2 }

gives small values when they have a very good match and progressively bigger values as they become less similar. Observe that

\begin{array}{ccl}  \displaystyle{\sum_{i=1}^n (x_i-y_i)^2 }  &=& \displaystyle{ \sum_{i=1}^n (x_i^2 - 2 x_i y_i + y_i^2) }\\  &=& \displaystyle{ \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 }  \end{array}

Since we’ve scaled things so that \sum_{i=1}^n x_i^2 and \sum_{i=1}^n y_i^2 are constants, we can see that when \sum_{i=1}^n x_i y_i becomes bigger,

\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }

becomes smaller. So,

\displaystyle{\sum_{i=1}^n x_i y_i}

serves as a measure of how close the lists are, under these assumptions.
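
Here is a tiny R illustration of this argument. I’m scaling by dividing each list by the square root of its energy, which is one of the ‘various possibilities’ mentioned above; after that, the sum of squared differences and the inner product determine each other exactly.

# Scale two lists to equal energy, then compare the sum of squared
# differences with the inner product.
normalize.energy <- function(v) v / sqrt(sum(v^2))

set.seed(1)
x <- rnorm(100)
y <- 0.7 * x + 0.3 * rnorm(100)   # y resembles x, plus some noise

xs <- normalize.energy(x)
ys <- normalize.energy(y)

sum((xs - ys)^2)        # small when the lists match well
2 - 2 * sum(xs * ys)    # the same number, by the algebra above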


El Niño Project (Part 6)

23 July, 2014

guest post by Steven Wenner

Hi, I’m Steve Wenner.

I’m an industrial statistician with over 40 years of experience in a wide range of applications (quality, reliability, product development, consumer research, biostatistics); but, somehow, time series only rarely crossed my path. Currently I’m working for a large consumer products company.

My undergraduate degree is in physics, and I also have a master’s in pure math. I never could reconcile how physicists used math (explain that Dirac delta function to me again in math terms? Heaviside calculus? On the other hand, I thought category theory was abstract nonsense until John showed me otherwise!). Anyway, I had to admit that I lacked the talent to pursue pure math or theoretical physics, so I became a statistician. I never regretted it—statistics has provided a very interesting and intellectually challenging career.

I got interested in Ludescher et al’s paper on El Niño prediction by reading Part 3 of this series. I have no expertise in climate science, except for an intense interest in the subject as a concerned citizen. So, I won’t talk about things like how Ludescher et al use a nonstandard definition of ‘El Niño’—that’s a topic for another time. Instead, I’ll look at some statistical aspects of their paper:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Very early warning of next El Niño, Proceedings of the National Academy of Sciences, February 2014. (Click title for free version, journal name for official version.)

Analysis

I downloaded the NOAA adjusted monthly temperature anomaly data and compared the El Niño periods with the charts in this paper. I found what appear to be two errors (“phantom” El Niños) and noted some interesting situations. Some of these are annotated on the images below. Click to enlarge them:

 

I also listed for each year whether an El Niño initiation was predicted, or not, and whether one actually happened. I did the predictions five ways: first, I listed the authors’ “arrows” as they appeared on their charts, and then I tried to match their predictions by following, in turn, four sets of rules. Even so, I could not come up with any detailed rules that exactly reproduced the authors’ results.

These were the rules I used:

An El Niño initiation is predicted for a calendar year if during the preceding year the average link strength crossed above the 2.82 threshold. However, we could also invoke additional requirements. Two possibilities are:

1. Preemption rule: the prediction of a new El Niño is canceled if the preceding year ends in an El Niño period.

2. End-of-year rule: the link strength must be above 2.82 at year’s end.

I counted the predictions using all four combinations of these two rules and compared the results to the arrows on the charts.

I defined an “El Niño initiation month” to be a month where the monthly average adjusted temperature anomaly rises to at least 0.5 °C and remains at or above 0.5 °C for at least five months. Note that the NOAA El Niño monthly temperature estimates are rounded to hundredths; and, on occasion, the anomaly is reported as exactly 0.5 °C. I found slightly better agreement with the authors’ El Niño periods if I counted an anomaly of exactly 0.5 °C as satisfying the threshold criterion, instead of using the strictly “greater than” condition.
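
To make that concrete, here is one way the rule could be written in R. (I actually did my analysis in JMP, so this is just for illustration; it treats the initiation month as the first of the five, which is my reading of the rule.)

# Flag initiation months: the anomaly reaches at least 0.5 °C, having been
# below that the month before, and stays at or above 0.5 °C for at least
# five consecutive months (counting the initiation month itself).
initiation.months <- function(anom, threshold = 0.5, run = 5) {
  n <- length(anom)
  sapply(seq_len(n), function(i) {
    i > 1 && i + run - 1 <= n &&
      anom[i - 1] < threshold &&
      all(anom[i:(i + run - 1)] >= threshold)
  })
}

# Example: months 4 through 9 are at or above 0.5 °C, so month 4 is flagged.
anom <- c(0.1, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9, 0.7, 0.5, 0.2)
which(initiation.months(anom))   # returns 4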

Anyway, I did some formal hypothesis testing and estimation under all five scenarios. The good news is that under most scenarios the prediction method gave better results than merely guessing. (But, I wonder how many things the authors tried before they settled on their final method? Also, did they do all their work on the learning series, and then only at the end check the validation series—or were they checking both as they went about their investigations?)

The bad news is that the predictions varied with the method, and the methods were rather weak. For instance, in the training series there were 9 El Niño periods in 30 years; the authors’ rules (whatever they were, exactly) found five of the nine. At the same time, they had three false alarms in the 21 years that did not have an El Niño initiated.

I used Fisher’s exact test to compute some p-values. Suppose (as our ‘null hypothesis’) that Ludescher et al’s method does not improve the odds of a successful prediction of an El Niño initiation. What’s the probability of that method getting at least as many predictions right just by chance? Answer: 0.032. That’s just under the conventional 1-in-20 (0.05) threshold for rejecting a null hypothesis, but still not terribly convincing. This was, by the way, the most significant of the five p-values for the alternative rule sets applied to the learning series.
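
For anyone who wants to check the arithmetic, the 2×2 table implied by those counts can be fed to R’s built-in fisher.test. The split of the non-predicted years (4 missed El Niños, 18 correctly quiet years) is my reconstruction from the figures above; the relative-risk estimate quoted below falls out of the same table.

# 2x2 table for the learning years: rows = prediction made / not made,
# columns = El Niño initiated / not initiated.
counts <- matrix(c( 5,  3,
                    4, 18),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(prediction = c("yes", "no"),
                                 initiation = c("yes", "no")))

fisher.test(counts, alternative = "greater")   # one-sided p-value, about 0.032

# Relative risk: chance of an initiation when predicted vs. when not.
(5 / 8) / (4 / 22)                             # about 3.4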

I also computed the “relative risk” statistics for all scenarios; for instance, we are more than three times as likely to see an El Niño initiation if Ludescher et al predict one, than if they predict otherwise (the 90% confidence interval for that ratio is 1.2 to 9.7, with the point estimate 3.4). Here is a screen shot of some statistics for that case:

Here is a screen shot of part of the spreadsheet list I made. In the margin on the right I made comments about special situations of interest.

Again, click to enlarge—but my whole working spreadsheet is available with more details for anyone who wishes to see it. I did the statistical analysis with a program called JMP, a product of the SAS corporation.

My overall impression from all this is that Ludescher et al are suggesting a somewhat arbitrary (and not particularly well-defined) method for revealing the relationship between link strength and El Niño initiation, if, indeed, a relationship exists. Slight variations in the interpretation of their criteria and slight variations in the data result in appreciably different predictions. I wonder if there are better ways to analyze these two correlated time series.


El Niño Project (Part 5)

12 July, 2014

 

And now for some comic relief.

Last time I explained how to download some weather data and start analyzing it, using programs written by Graham Jones. When you read that, did you think “Wow, that’s easy!” Or did you think “Huh? Run programs in R? How am I supposed to do that?”

If you’re in the latter group, you’re like me. But I managed to do it. And this is the tale of how. It’s a blow-by-blow account of my first steps, my blunders, my fears.

I hope that if you’re intimidated by programming, my tale will prove that you too can do this stuff… provided you have smart friends, or read this article.

More precisely, this article explains how to:

• download and run software that runs the programming language R;

• download temperature data from the National Center for Atmospheric Research;

• use R to create a file of temperature data for a given latitude/longitude rectangle for a given time interval.

I will not attempt to explain how to program in R.

If you want to copy what I’m doing, please remember that a few details depend on the operating system. Since I don’t care about operating systems, I use a Windows PC. If you use something better, some details will differ for you.

Also: at the end of this article there are some very basic programming puzzles.

A sad history

First, let me explain a bit about my relation to computers.

I first saw a computer at the Lawrence Hall of Science in Berkeley, back when I was visiting my uncle in the summer of 1978. It was really cool! They had some terminals where you could type programs in BASIC and run them.

I got especially excited when he gave me the book Computer Lib/Dream Machines by Ted Nelson. It espoused the visionary idea that people could write texts on computers all around the world—“hypertexts” where you could click on a link in one and hop to another!

I did more programming the next year in high school, sitting in a concrete block room with a teletype terminal that was connected to a mainframe somewhere far away. I stored my programs on paper tape. But my excitement gradually dwindled, because I was having more fun doing math and physics using just pencil and paper. My own brain was easier to program than the machine. I did not start a computer company. I did not get rich. I learned quantum mechanics, and relativity, and Gödel’s theorem.

Later I did some programming in APL in college, and still later I did a bit in Mathematica in the early 1990s… but nothing much, and nothing sophisticated. Indeed, none of these languages would be the ones you’d choose to explore sophisticated ideas in computation!

I’ve just never been very interested… until now. I now want to do a lot of data analysis. It will be embarrassing to keep asking other people to do all of it for me. I need to learn how to do it myself.

Maybe you’d like to do this stuff too—or at least watch me make a fool of myself. So here’s my tale, from the start.

Downloading and running R

To use the programs written by Graham, I need to use R, a language currently popular among statisticians. It is not the language my programmer friends would want me to learn—they’d want me to use something like Python. But tough! I can learn that later.

To download R to my Windows PC, I cleverly type download R into Google, and go to the top website it recommends:

http://cran.r-project.org/bin/windows/base/

I click the big fat button on top saying

Download R 3.1.0 for Windows

and get asked to save a file R-3.1.0-win.exe. I save it in my Downloads folder; it takes a while to download, since it’s 57 megabytes. When I get it, I click on it and follow the easy default installation instructions. My Desktop window now has a little icon on it that says R.

Clicking this, I get an interface where I can type commands after a red

>

symbol. Following Graham’s advice, I start by trying

> 2^(1:8)

which generates a list of powers of 2, from 2^1 to 2^8, like this:

[1] 2 4 8 16 32 64 128 256

Then I try

> mean(2^(1:8))

which gives the arithmetic mean of this list. Somewhat more fun is

> plot(rnorm(20))

which plots a bunch of points, apparently 20 standard normal deviates.

When I hear “20 standard normal deviates” I think of the members of a typical math department… but no, those are deviants. Standard normal deviates are random numbers chosen from a Gaussian distribution of mean zero and variance 1.

Downloading climate data

To do something more interesting, I need to input data.

The papers by Ludescher et al use surface air temperatures in a certain patch of the Pacific, so I want to get ahold of those. They’re here:

NCEP/NCAR Reanalysis 1: Surface.

NCEP is the National Centers for Environmental Prediction, and NCAR is the National Center for Atmospheric Research. They have a bunch of files here containing worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid (that’s 144 × 73 grid points), from 1948 to 2010. And if you go here, the website will help you get data from within a chosen rectangle in a grid, for a chosen time interval.

These are NetCDF files. NetCDF stands for Network Common Data Form:

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

According to my student Blake Pollard:

… the method of downloading a bunch of raw data via ftp (file transfer protocol) is a great one to become familiar with. If you poke around on ftp://ftp.cdc.noaa.gov/Datasets or some other ftp servers maintained by government agencies you will find all the data you could ever want. Examples of things you can download for free: raw multispectral satellite images, processed data products, ‘re-analysis’ data (which is some way of combining analysis/simulation to assimilate data), sea surface temperature anomalies at resolutions much higher than 2.5 degrees (although you pay for that in file size). Also, believe it or not, people actually use NetCDF files quite widely, so once you know how to play around with those you’ll find the world quite literally at your fingertips!

I know about ftp: I’m so old that I know this was around before the web existed. Back then it meant “faster than ponies”. But I need to get R to accept data from these NetCDF files: that’s what scares me!

Graham said that R has a “package” called RNetCDF for using NetCDF files. So, I need to get ahold of this package, download some files in the NetCDF format, and somehow get R to eat those files with the help of this package.

At first I was utterly clueless! However, after a bit of messing around, I notice that right on top of the R interface there’s a menu item called Packages. I boldly click on this and choose Install Package(s).

I am rewarded with an enormous alphabetically ordered list of packages… obviously statisticians have lots of stuff they like to do over and over! I find RNetCDF, click on that and click something like “OK”.

I’m asked if I want to use a “personal library”. I click “no”, and get an error message. So I click “yes”. The computer barfs out some promising text:

utils:::menuInstallPkgs()
trying URL 'http://cran.stat.nus.edu.sg/bin/windows/contrib/3.1/RNetCDF_1.6.2-3.zip'
Content type 'application/zip' length 548584 bytes (535 Kb)
opened URL
downloaded 535 Kb

package ‘RNetCDF’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\JOHN\AppData\Local\Temp\Rtmp4qJ2h8\downloaded_packages

Success!

But now I need to figure out how to download a file and get R to eat it and digest it with the help of RNetCDF.

At this point my deus ex machina, Graham, descends from the clouds and says:

You can download the files from your browser. It is probably easiest to do that for starters. Put
ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/
into the browser, then right-click a file and Save link as…

This code will download a bunch of them:

for (year in 1950:1979) {
download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}

It will put them into the “working directory”, probably C:\Users\JOHN\Documents. You can find the working directory using getwd(), and change it with setwd(). But you must use / not \ in the filepath.

Compared to UNIX, the Windows operating system has the peculiarity of using \ instead of / in path names, but R uses the UNIX conventions even on Windows.

So, after some mistakes, in the R interface I type

> setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")

and then type

> getwd()

to see if I’ve succeeded. I’m rewarded with

[1] "C:/Users/JOHN/Documents/My Backups/azimuth/el nino"

Good!

Then, following Graham’s advice, I cut-and-paste this into the R interface:

for (year in 1950:1979) {
download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}

It seems to be working! A little bar appears showing how each year’s data is getting downloaded. It chugs away, taking a couple minutes for each year’s worth of data.
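
Once a year’s file has arrived, here is roughly how I can peek inside it with the RNetCDF package I installed earlier. I’m guessing the temperature variable is called “air”, since that’s what the file names suggest; print.nc will list the real names if I’ve guessed wrong.

library(RNetCDF)

nc <- open.nc("air.sig995.1950.nc")
print.nc(nc)                   # lists the file's dimensions and variables

air <- var.get.nc(nc, "air")   # the temperature array (assuming it's called "air")
dim(air)                       # something like 144 x 73 x 365: longitude x latitude x day
close.nc(nc)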

Using R to process NetCDF files

Okay, now I’ve got all the worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid from 1950 to 1979.

The world is MINE!

But what do I do with it? Graham’s advice is again essential, along with a little R program, or script, that he wrote:

The R script netcdf-convertor.R from

https://github.com/azimuth-project/el-nino/tree/master/R

will eat the file, digest it, and spit it out again. It contains instructions.

I go to this URL, which is on GitHub, a popular free web-based service for software development. You can store programs here, edit them, and GitHub will help you keep track of the different versions. I know almost nothing about this stuff, but I’ve seen it before, so I’m not intimidated.

I click on the blue thing that says netcdf-convertor.R and see something that looks like the right script. Unfortunately I can’t see how to download it! I eventually see a button I’d overlooked, cryptically labelled “Raw”. I realize that since I don’t want a roasted or oven-broiled piece of software, I should click on this. I indeed succeed in downloading netcdf-convertor.R this way. Graham later says I could have done something better, but oh well. I’m just happy nothing has actually exploded yet.

Once I’ve downloaded this script, I open it using a text processor and look at it. At top are a bunch of comments written by Graham:


######################################################
######################################################

# You should be able to use this by editing this
# section only.

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

lat.range <- 13:14
lon.range <- 142:143

firstyear <- 1957
lastyear <- 1958

outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")

######################################################
######################################################

# Explanation

# 1. Use setwd() to set the working directory
# to the one containing the .nc files such as
# air.sig995.1951.nc.
# Example:
# setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

# 2. Supply the latitude and longitude range. The
# NOAA data is every 2.5 degrees. The ranges are
# supplied as the number of steps of this size.
# For latitude, 1 means North Pole, 73 means South
# Pole. For longitude, 1 means 0 degrees East, 37
# is 90E, 73 is 180, 109 is 90W or 270E, 144 is
# 2.5W.

# These roughly cover Scotland.
# lat.range <- 13:14
# lon.range <- 142:143

# These are the area used by Ludescher et al,
# 2013. It is 27x69 points which are then
# subsampled to 9 by 23.
# lat.range <- 24:50
# lon.range <- 48:116

# 3. Supply the years
# firstyear <- 1950
# lastyear <- 1952

# 4. Supply the output name as a text string.
# paste0() concatenates strings which you may find
# handy:
# outputfilename <- paste0("Pacific-", firstyear, "-", lastyear, ".txt")

######################################################
######################################################

# Example of output

# S013E142 S013E143 S014E142 S014E143
# Y1950P001 281.60000272654 281.570002727211 281.60000272654 280.970002740622
# Y1950P002 280.740002745762 280.270002756268 281.070002738386 280.49000275135
# Y1950P003 280.100002760068 278.820002788678 281.120002737269 280.070002760738
# Y1950P004 281.070002738386 279.420002775267 281.620002726093 280.640002747998
# ...
# Y1950P193 285.450002640486 285.290002644062 285.720002634451 285.75000263378
# Y1950P194 285.570002637804 285.640002636239 286.070002626628 286.570002615452
# Y1950P195 285.92000262998 286.220002623275 286.200002623722 286.620002614334
# ...
# Y1950P364 276.100002849475 275.350002866238 276.37000284344 275.200002869591
# Y1950P365 276.990002829581 275.820002855733 276.020002851263 274.72000288032
# Y1951P001 278.220002802089 277.470002818853 276.700002836064 275.870002854615
# Y1951P002 277.750002812594 276.890002831817 276.650002837181 275.520002862439
# ...
# Y1952P365 280.35000275448 280.120002759621 280.370002754033 279.390002775937

# There is one row for each day, and 365 days in
# each year (leap days are omitted). In each row,
# you have temperatures in Kelvin for each grid
# point in a rectangle.

# S13E142 means 13 steps South from the North Pole
# and 142 steps East from Greenwich. The points
# are in reading order, starting at the top-left
# (Northmost, Westmost) and going along the top
# row first.

# Y1950P001 means year 1950, day 1. (P because
# longer periods might be used later.)

######################################################
######################################################

The instructions are admirably detailed concerning what I should do, but they don't say where the output will appear when I do it. This makes me nervous. I guess I should just try it. After all, the program is not called DestroyTheWorld.

Unfortunately, at this point a lot of things start acting weird.

It's too complicated and boring to explain in detail, but basically, I keep getting a file missing error message. I don't understand why this happens under some conditions and not others. I try lots of experiments.

Eventually I discover that one year of temperature data failed to download—the year 1949, right after the first year available! So, I'm getting the error message whenever I try to do anything involving that year of data.

To fix the problem, I simply download the 1949 data by hand from here:

ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/

(You can open ftp addresses in a web browser just like http addresses.) I put it in my working directory for R, and everything is fine again. Whew!

By the time I get this file, I sort of know what to do—after all, I’ve spent about an hour trying lots of different things.

I decide to create a file listing temperatures near where I live in Riverside from 1948 to 1979. To do this, I open Graham's script netcdf-convertor.R in a word processor and change this section:

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")
lat.range <- 13:14
lon.range <- 142:143
firstyear <- 1957
lastyear <- 1958
outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")

to this:

setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")
lat.range <- 23:23
lon.range <- 98:98
firstyear <- 1948
lastyear <- 1979
outputfilename <- paste0("Riverside-", firstyear, "-", lastyear, ".txt")

Why? Well, I want it to put the file in my working directory. I want the years from 1948 to 1979. And I want temperature data from where I live!

Googling the info, I see Riverside, California is at 33.9481° N, 117.3961° W. 34° N is about 56 degrees south of the North Pole, which is 22 steps of size 2.5°. And because some idiot decided everyone should count starting at 1 instead of 0 even in contexts like this, the North Pole itself is step 1, not step 0… so Riverside is latitude step 23. That's why I write:

lat.range <- 23:23

Similarly, 117.5° W is 242.5° E, which is 97 steps of size 2.5°… which counts as step 98 according to this braindead system. That's why I write:

lon.range <- 98:98
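
In case you want to do this arithmetic for your own location, here is a little function that encodes the counting convention above. It’s just my attempt to automate the reasoning I did by hand, so check its output against the comments in Graham’s script.

# Convert a latitude/longitude (degrees North and East; use negative numbers
# for South and West) into the 1-based 2.5-degree grid steps the script expects.
grid.steps <- function(lat, lon) {
  lat.step <- round((90 - lat) / 2.5) + 1    # step 1 = North Pole, step 73 = South Pole
  lon.step <- round((lon %% 360) / 2.5) + 1  # step 1 = 0 degrees East
  c(lat = lat.step, lon = lon.step)
}

grid.steps(33.9481, -117.3961)   # Riverside: lat 23, lon 98, as worked out above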

Having done this, I save the file netcdf-convertor.R under another name, Riverside.R.

And then I do some stuff that it took some fiddling around to discover.

First, in my R interface I go to the menu item File, at far left, and click on Open script. It lets me browse around, so I go to my working directory for R and choose Riverside.R. A little window called R editor opens up in my R interface, containing this script.

I'm probably not doing this optimally, but I can now right-click on the R editor and see a menu with a choice called Select all. If I click this, everything in the window turns blue. Then I can right-click again and choose Run line or selection. And the script runs!

Voilà!

It huffs and puffs, and then stops. I peek in my working directory, and see that a file called

Riverside-1948-1979.txt

has been created. When I open it, it has lots of lines, starting with these:

S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97
Y1948P005 284.27
Y1948P006 286.97

As Graham promised, each line has a year and day label, followed by a vector… which in my case is just a single number, since I only wanted the temperature in one location. I’m hoping this is the temperature near Riverside, in kelvin.

A small experiment

To see if this is working, I’d like to plot these temperatures and see if they make sense. Unfortunately I have no idea how to get R to take a file containing data of the sort I have and plot it! I need to learn how, but right now I’m exhausted, so I use another method to get the job done—a method that’s too suboptimal and embarrassing to describe here. (Hint: it involves the word “Excel”.)

I do a few things, but here’s the most interesting one—namely, not very interesting. I plot the temperatures for 1963:


I compare it to some publicly available data, not from Riverside, but from nearby Los Angeles:


As you can see, there was a cold day on January 13th, when the temperature dropped to 33°F. That seems to be visible on the graph I made, and looking at the data from which I made the graph, I see the temperature dropped to 251.4 kelvin on the 13th: that’s -7°F, very cold for here. It does get colder around Riverside than in Los Angeles in the winter, since it’s a desert, with temperatures not buffered by the ocean. So, this does seem compatible with the public records. That’s mildly reassuring.

But other features of the graph don’t match, and I’m not quite sure if they should or not. So, all this is very tentative and unimpressive. However, I’ve managed to get over some of my worst fears, download some temperature data, and graph it! Now I need to learn how to use R to do statistics with this data, and graph it in a better way.

Puzzles

You can help me out by answering these puzzles. Later I might pose puzzles where you can help us write really interesting programs. But for now it’s just about learning R.

Puzzle 1. Given a text file with lots of lines of this form:

S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97

write an R program that creates a huge vector, or list of numbers, like this:

279.95, 280.14, 282.27, 283.97, ...

Puzzle 2. Extend the above program so that it plots this list of numbers, or outputs it to a new file.

If you want to test your programs, here’s the actual file:

Riverside-1948-1979.txt

More puzzles

If those puzzles are too easy, here are two more. I gave these last time, but everyone was too wimpy to tackle them.

Puzzle 3. Modify the software so that it uses the same method to predict El Niños from 1980 to 2013. You’ll have to adjust two lines in netcdf-convertor-ludescher.R:

firstyear <- 1948
lastyear <- 1980

should become

firstyear <- 1980
lastyear <- 2013

or whatever range of years you want. You’ll also have to adjust names of years in ludescher-replication.R. Search the file for the string 19 and make the necessary changes. Ask me if you get stuck.

Puzzle 4. Right now we average the link strength over all pairs (i,j) where i is a node in the El Niño basin defined by Ludescher et al and j is a node outside this basin. The basin consists of the red dots here:


What happens if you change the definition of the El Niño basin? For example, can you drop those annoying two red dots that are south of the rest, without messing things up? Can you get better results if you change the shape of the basin?

To study these questions you need to rewrite ludescher-replication.R a bit. Here’s where Graham defines the El Niño basin:

ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats,lons=lons)
}

These are lists of latitude and longitude coordinates: (5,11), (5,12), (5,13), etc. A coordinate like (5,11) means the little circle that’s 5 down and 11 across in the grid on the above map. So, that’s the leftmost point in Ludescher’s El Niño basin. By changing these lists, you can change the definition of the El Niño basin.
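
To get you started, here is one way to modify this function so it leaves out the two dots in latitude row 6, the ones south of the rest. Whether this makes the predictions better or worse is exactly what the puzzle asks you to find out.

smaller.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats,lons=lons)
}

You’d then have ludescher-replication.R call smaller.basin() wherever it now calls ludescher.basin().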

Next time I’ll discuss some criticisms of Ludescher et al’s paper, but later we will return to analyzing temperature data, looking for interesting patterns.

