## El Niño Project (Part 4)

As the first big step in our El Niño prediction project, Graham Jones replicated the paper by Ludescher et al that I explained last time. Let’s see how this works!

Graham did it using R, a programming language that’s good for statistics. If you prefer another language, go ahead and write software for that… and let us know! We can add it to our repository.

Today I’ll explain this stuff to people who know their way around computers. But I’m not one of those people! So, next time I’ll explain the nitty-gritty details in a way that may be helpful to people more like me.

### Getting temperature data

Say you want to predict El Niños from 1950 to 1980 using Ludescher et al’s method. To do this, you need daily average surface air temperatures in this grid in the Pacific Ocean:

Each square here is 7.5° × 7.5°. To compute these temperatures, you have to start with temperatures on a grid with smaller squares that are 2.5° × 2.5° in size:

• Earth System Research Laboratory, NCEP Reanalysis Daily Averages Surface Level, or ftp site.

This website will give you daily average surface air temperatures in whatever rectangle and time interval you want. It delivers this data in a format called NetCDF, meaning Network Common Data Form.

We’ll take a different approach. We’ll download all the temperatures in this database, and then extract the data we need using R scripts. That way, when we play other games with temperature data later, we’ll already have it.

So, go ahead and download all the files from air.sig995.1948.nc to air.sig995.2013.nc. It will take a while… but you’ll own the world.

There are different ways to do this. If you have R fired up, just cut-and-paste this into the console:

for (year in 1950:1979) {
  download.file(url=paste0(
    "ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.",
    year, ".nc"),
    destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}


### Getting the temperatures you need

Now you have files of daily average temperatures on a 2.5° by 2.5° grid from 1948 to 2013. Make sure all these files are in your working directory for R, and download this R script from GitHub:

You can use this to get the temperatures in any time interval and any rectangle of grid points you want. The details are explained in the script. But the defaults are set to precisely what you need now!

So, just run this script. You should get a file called Pacific-1948-1980.txt. This has daily average temperatures in the region we care about, from 1948 to 1980. It should start with a really long line listing locations in a 27 × 69 grid, starting with S024E048 and ending with S050E116. I’ll explain this coordinate scheme at the end of this post. Then come thousands of lines listing temperatures in kelvin at those locations on successive days. The first of these lines should start with Y1948P001, meaning the first day of 1948.

And I know what you’re dying to ask: yes, leap days are omitted! This annoys the perfectionist in me… but leap years make data analysis more complicated, so Ludescher et al ignore leap days, and we will too.
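With leap days dropped, every year has exactly 365 days, which makes date arithmetic easy. Here is a small helper of my own (not part of Graham’s scripts) that converts a ‘day number’, with day 1 = 1 January 1948, into a year and day-of-year:

```r
# Convert a day number (day 1 = 1 January 1948, leap days omitted,
# so every year has exactly 365 days) to a year and day-of-year.
day.to.year.doy <- function(day) {
  year <- 1948 + (day - 1) %/% 365
  doy  <- (day - 1) %% 365 + 1
  c(year=year, doy=doy)
}
```

For example, day 12040 lands on the 360th day of 1980, which (skipping February 29th) is December 26th.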

### Getting the El Niño data

You’ll use this data to predict El Niños, so you also want a file of the Niño 3.4 index. Remember from last time, this says how much hotter the surface of this patch of seawater is than usual for this time of year:

You can download the file from here:

This is a copy of the monthly Niño 3.4 index from the US National Weather Service, which I discussed last time. It has monthly Niño 3.4 data in the column called ANOM.

Put this file in your working directory.

### Predicting El Niños

Now for the cool part. Last time I explained the ‘average link strength’, which Ludescher et al use to predict El Niños. Now you’ll compute it.

You’ve got Pacific-1948-1980.txt and nino3.4-anoms.txt in your working directory. Download this R script written by Graham Jones, and run it:

It takes about 45 minutes on my laptop. It computes the average link strength $S$ at ten-day intervals. Then it plots $S$ in red and the Niño 3.4 index in blue, like this:

(Click to enlarge.) The shaded region is where the Niño 3.4 index is below 0.5°C. When the blue curve escapes this region and then stays above 0.5°C for at least 5 months, Ludescher et al say that there’s an El Niño.
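In code, this episode rule is easy to sketch. This is my own minimal R illustration, not part of ludescher-replication.R; it takes a vector of monthly Niño 3.4 anomalies and returns the indices of the months belonging to an El Niño episode:

```r
# Ludescher et al's episode rule: an El Niño is declared when the
# monthly Niño 3.4 anomaly stays above 0.5 °C for at least 5
# consecutive months.  Returns the indices of qualifying months.
el.nino.months <- function(anom, threshold=0.5, min.run=5) {
  above <- anom > threshold
  r <- rle(above)
  # mark every month belonging to a run of at least min.run warm months
  keep <- rep(r$values & (r$lengths >= min.run), r$lengths)
  which(keep)
}
```

Note that a run of four warm months followed by one cool month counts for nothing under this rule.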

The horizontal red line shows the threshold $\theta = 2.82.$ When $S$ exceeds this, and the Niño 3.4 index is not already over 0.5°C, Ludescher et al predict that there will be an El Niño in the next calendar year!
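The prediction rule is just as easy to sketch. Again this is my own illustration (the function name is invented): given $S$ and the Niño 3.4 index sampled on the same days, it returns the indices where an alarm is raised:

```r
# Alarm rule: predict an El Niño in the next calendar year whenever S
# crosses the threshold theta from below while the Niño 3.4 index is
# not already above 0.5 °C.
el.nino.alarms <- function(S, nino34, theta=2.82) {
  # TRUE where S crosses above theta between consecutive samples
  crossing <- c(FALSE, S[-1] > theta & S[-length(S)] <= theta)
  which(crossing & nino34 <= 0.5)
}
```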

Our graph almost agrees with theirs:

Here the green arrows show their successful predictions, dashed arrows show false alarms, and a little letter n appears next to each El Niño they failed to predict.

The graphs don’t match perfectly. For the blue curves, we could be using Niño 3.4 from different sources. Differences in the red curves are more interesting, since that’s where all the work is involved, and we’re starting with the same data. Besides actual bugs, which are always possible, I can think of various explanations. None of them are extremely interesting, so I’ll stick them in the last section!

If you want to get ahold of our output, you can do so here:

This has the average link strength $S$ at 10-day intervals, starting from day 730 and going until day 12040, where day 1 is the first of January 1948.

So, you don’t actually have to run all these programs to get our final result. However, these programs will help you tackle some programming challenges which I’ll list now!

### Programming challenges

There are lots of variations on the Ludescher et al paper which we could explore. Here are a few easy ones to get you started. If you do any of these, or anything else, let me know!

I’ll start with a really easy one, and work on up.

Challenge 1. Repeat the calculation with temperature data from 1980 to 2013. You’ll have to adjust two lines in netcdf-convertor-ludescher.R:

firstyear <- 1948
lastyear <- 1980


should become

firstyear <- 1980
lastyear <- 2013


or whatever range of years you want. You’ll also have to adjust names of years in ludescher-replication.R. Search the file for the string 19 and make the necessary changes. Ask me if you get stuck.

Challenge 2. Repeat the calculation with temperature data on a 2.5° × 2.5° grid instead of the coarser 7.5° × 7.5° grid Ludescher et al use. You’ve got the data you need. Right now, the program ludescher-replication.R averages out the temperatures over little 3 × 3 squares. It starts with temperatures on a 27 × 69 grid and averages them out to obtain temperatures on the 9 × 23 grid shown here:

Here’s where that happens:

# the data per day is reduced from e.g. 27x69 to 9x23.

subsample.3x3 <- function(vals) {
  stopifnot(dim(vals)[2] %% 3 == 0)
  stopifnot(dim(vals)[3] %% 3 == 0)
  n.sslats <- dim(vals)[2]/3
  n.sslons <- dim(vals)[3]/3
  ssvals <- array(0, dim=c(dim(vals)[1], n.sslats, n.sslons))
  for (d in 1:dim(vals)[1]) {
    for (slat in 1:n.sslats) {
      for (slon in 1:n.sslons) {
        ssvals[d, slat, slon] <- mean(vals[d, (3*slat-2):(3*slat), (3*slon-2):(3*slon)])
      }
    }
  }
  ssvals
}


So, you need to eliminate this and change whatever else needs to be changed. What new value of the threshold $\theta$ looks good for predicting El Niños now? Most importantly: can you get better at predicting El Niños this way?

The calculation may take a lot longer: you’ve got 9 times as many grid points, and since you’re calculating correlations between pairs of points, that means roughly 81 times as many pairs. So if this is too tough, you can go the other way: use a coarser grid and see how much that degrades your ability to predict El Niños.
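If you’d rather experiment with several grid resolutions at once, here is a generalized version of the subsampling function that averages over k × k squares: k = 3 reproduces subsample.3x3, and k = 1 leaves the grid untouched. This is my own sketch, not part of Graham’s script:

```r
# Average temperatures over kxk squares of grid points.
# vals has dimensions (days, lats, lons); k must divide both
# spatial dimensions exactly.
subsample.kxk <- function(vals, k) {
  stopifnot(dim(vals)[2] %% k == 0, dim(vals)[3] %% k == 0)
  n.lat <- dim(vals)[2] / k
  n.lon <- dim(vals)[3] / k
  out <- array(0, dim=c(dim(vals)[1], n.lat, n.lon))
  for (d in 1:dim(vals)[1])
    for (i in 1:n.lat)
      for (j in 1:n.lon)
        out[d, i, j] <- mean(vals[d, (k*i-k+1):(k*i), (k*j-k+1):(k*j)])
  out
}
```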

Challenge 3. Right now we average the link strength over all pairs $(i,j)$ where $i$ is a node in the El Niño basin defined by Ludescher et al and $j$ is a node outside this basin. The basin consists of the red dots here:

What happens if you change the definition of the El Niño basin? For example, can you drop those annoying two red dots that are south of the rest, without messing things up? Can you get better results if you change the shape of the basin?

To study these questions you need to rewrite ludescher-replication.R a bit. Here’s where Graham defines the El Niño basin:

ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats, lons=lons)
}


These are lists of latitude and longitude coordinates: (5,11), (5,12), (5,13), etc. A coordinate like (5,11) means the little circle that’s 5 down and 11 across in the grid on the above map. So, that’s the leftmost point in Ludescher’s El Niño basin. By changing these lists, you can change the definition of the El Niño basin. You’ll also have to change these lists if you tackle Challenge 2.
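As a concrete starting point for the ‘drop the southern dots’ experiment, here is a sketch. The first function is copied from ludescher-replication.R; the trimmed variant (my own name) simply filters out the two points with latitude coordinate 6:

```r
# Graham's basin, as in ludescher-replication.R.
ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
  list(lats=lats, lons=lons)
}

# The same basin without the two southern points (those with lats == 6).
trimmed.basin <- function() {
  b <- ludescher.basin()
  keep <- b$lats != 6
  list(lats=b$lats[keep], lons=b$lons[keep])
}
```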

There’s a lot more you can do… the sky’s the limit! In the weeks to come, I’ll show you lots of things we’ve actually done.

### Annoying nuances

Here are two reasons our average link strengths could differ from Ludescher’s.

Last time I mentioned that Ludescher et al claim to normalize their time-delayed cross-covariances in a sort of complicated way. I explained why I don’t think they could have actually used this method. In ludescher-replication.R, Graham used the simpler normalization described last time: namely, dividing by

$\sqrt{\langle T_i(t)^2 \rangle - \langle T_i(t) \rangle^2} \; \sqrt{\langle T_j(t-\tau)^2 \rangle - \langle T_j(t-\tau) \rangle^2}$

or equivalently by

$\sqrt{ \langle (T_i(t) - \langle T_i(t)\rangle)^2 \rangle} \; \sqrt{ \langle (T_j(t-\tau) - \langle T_j(t-\tau)\rangle)^2 \rangle}$

Since we don’t really know what Ludescher et al did, they might have done something else.

We might also have used a different ‘subsampling’ procedure. That’s a name for how we get from the temperature data on a 27 × 69 grid to temperatures on a 9 × 23 grid. While the original data files give temperatures named after grid points, each is really an area-averaged temperature for a 2.5° × 2.5° square. Is this square centered at the grid point, or does the square have that grid point as its north-west corner, or what? I don’t know.

This data is on a grid where the coordinates are the number of steps of 2.5 degrees, counting from 1. So, for latitude, 1 means the North Pole, 73 means the South Pole. For longitude, 1 means the prime meridian, 37 means 90° east, 73 means 180° east, 109 means 270°E or 90°W, and 144 means 2.5° west. It’s an annoying system, as far as I’m concerned.
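To keep the convention straight, here is a tiny converter (my own helper, not in the scripts) from these grid indices to degrees:

```r
# NCEP grid convention: indices count in steps of 2.5° starting from 1.
# Latitude index 1 is the North Pole (90°N); longitude index 1 is the
# prime meridian (0°E), increasing eastwards.
index.to.degrees <- function(lat.i, lon.i) {
  c(lat = 90 - 2.5 * (lat.i - 1),   # degrees north
    lon = 2.5 * (lon.i - 1))        # degrees east
}
```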

In ludescher-replication.R we use this range of coordinates:

lat.range <- 24:50
lon.range <- 48:116


That’s why your file Pacific-1948-1980.txt has locations starting with S024E048 and ending with S050E116. Maybe Ludescher et al used a slightly different range or subsampling procedure!

There are probably lots of other nuances I haven’t noticed. Can you think of some?

### 20 Responses to El Niño Project (Part 4)

1. Graham Jones says:

Typo: Challenge 2 is labelled 1. In a comment to part 3 you predicted average link strengths will go up in challenge 2. Why? I make the same prediction but probably for a different reason!

• John Baez says:

I think they will go up because there will be more nodes $i$ inside the El Niño basin that are very close to nodes outside it, and the link strength should be largest at short distances.

• Graham Jones says:

I think they will go up because of

(1) what I have been thinking about in the section ‘Understanding the link strength’ in http://www.azimuthproject.org/azimuth/show/Climate+network.

(2) thinking that the data from a finer grid will have more day-to-day fluctuations. (I think the spatial smoothing also effectively does some smoothing in time.)

I think the greater day-to-day fluctuations will result in bigger S. But it is very difficult to get an intuition for what their link strength actually measures! It is something like ‘the extent to which some delay (tau value) gives a higher correlation than other delays’.

• Graham Jones says:

I tested my idea by using the middle grid point of each 3×3 square in the data. So I am using one 2.5 degree pixel compared to using an average over nine. The distances between the centres of the pixels are the same as before, and the number of pixels is the same too.

And I was right, the signal strength goes up, though not by much. I used the 1950-1979 range, starting day 730, and every 100 days, for a total of 103 signal values. In 91 of 103 cases, the ‘centre pixel’ value of the signal was higher. The mean value of the signal was about 0.05 higher in the ‘centre pixel’ version.

2. Steve Wenner says:

Guys, if you skip leap days, then is day 12040 = 19-Dec-78 (incrementing the date 10 days each step), or what?

3. Steve Wenner says:

I figured it out: the last day is 27-Dec-78. I got the dates by adding an extra day after each leap year. A pain.

• John Baez says:

In this setup there are 365 days each year: February 29ths don’t exist. I believe 12040 = 365 × 32 + 360. So, if “day 1” is 1 January 1948, we end up 12040 days later on “day 12041”, which is the 361st day (not counting leap days!) of the year 1948+32 = 1980.

The 365th day (not counting leap days) would be December 31, so the 361st day would be four days earlier, December 27th.

(At first I was off a day, because I was figuring out “day 12040” instead of “day 12041”. That’s called a fencepost error.)

• John Baez says:

Actually, you want to know “day 12040”, not the day 12040 days after 1 January 1948. Right? I believe that’s December 26th.

4. Steve Wenner says:

Is the listed day the first or last day of each 10-day interval?

• John Baez says:

I’ll get Graham to tackle this question. I think something about my wording confused you.

I believe the answer is this: the average link strength $S$ is something that can be computed on any day (after there’s been enough time for running averages to make sense). His program computes $S$ on day 730, day 740, etcetera, where “day 1” is 1 January 1948.

• Graham Jones says:

The code is the final arbiter of details like this. But if it’s doing what I intended, John is right. The first day in the output is day 730, where “day 1” is 1 January 1948. So the first day in the output is actually the last day of 1949. S(d) uses data from 365+200 days ending in day d.

5. Steve Wenner says:

I’m trying to match up, in the most reasonable way, the monthly El Nino averages with the link strength 10-day running averages. So, I can either turn the monthly averages into 10-day chunks, or consolidate the 10-day running averages into monthly averages. I want to get three columns in a spreadsheet, with dates (monthly, or 10-day increments) in the first column, El Nino in the second column and link strength in the 3rd column. Then I can do some time series analysis. The 10-day periods that bleed from one month to the next are one of several complications.

• John Baez says:

If you’ve got a good computer or some patience you could solve this problem by running a version of Graham’s program that computes the link strength every day. It took 45 minutes on my laptop to compute it every 10th day, so maybe it would take 7.5 hours to compute it every day. I could easily tweak the program to do this. You would not need to learn any R to run this: my next post will tell you all you need to know. If I knew more R, I could tweak it to compute the link strength every month. Or, perhaps I could beg some programmer friends to do it.

(Who were the idiots who decided what the lengths of the months should be? The Romans. I recently learned how idiotic they truly were: in the Julian calendar they got mixed up and put in a leap year every third year due to a fencepost error!)

But what I really want to know is this: what do you mean by “match up, in the most reasonable way, the monthly El Niño averages with the link strength 10-day running averages.” What sort of “reasonableness” criterion do you have in mind?

I’ll be more inclined to bite the bullet and write a loop that computes link strengths on the 1st day of each month if I feel you’re doing something good with it.

6. John Baez says:

Over on G+, Jim Carver wrote:

The definition in that paper you are using is not accurate. They say something to the effect that an ENSO event is when 5 consecutive months exceed thresholds by +- 0.5C.

That’s not the traditional definition. It is when 5 three month averages exceed the thresholds. That’s not the same as five months.

Other definitions do exist, such as conditions forecast in Nino 3.4 to be above thresholds for three months after a three month period has been validated but that’s only if atmospheric anomalies are present.

If you look at the real models output which take into account all parameters (not just the ones here) you will find the slope of trends are down and most forecasters are calling for a weak to moderate event by N. Hemisphere winter.

I am not. My outlook is for a 50% chance for it to remain stable below thresholds for the foreseeable future and no El Niño event occurring. If it does, it will be weak, and the strength of the event is more important than its actual occurrence. ﻿

I replied:

You’re right, Ludescher et al are not using the standard definition of El Niño – I don’t know why. Of course they’re allowed to predict whatever they want to predict. But in Part 6 of this series we’ll present serious criticisms of their ability to predict what they claim to predict.

Regarding current El Niño predictions, we should note that Ludescher’s paper was published in February and written before that: it’s claiming to do “long-range” El Niño forecasts, so it’s a different thing than the best forecasts we have now of what’ll happen this fall.

The National Weather Service came out with new predictions on July 10th:

They claim the chance of El Niño is about 70% during the Northern Hemisphere summer and is close to 80% during the fall and early winter. They expect it to peak at weak-to-moderate strength during the late fall and early winter (3-month averages of the Niño 3.4 index between 0.5°C and 1.4°C).

7. Last time I explained how to download some weather data and start analyzing it, using programs written by Graham Jones. When you read that, did you think “Wow, that’s easy!” Or did you think “Huh? Run programs in R? How am I supposed to do that?”

If you’re in the latter group, you’re like me. But I managed to do it. And this is the tale of how. It’s a blow-by-blow account of my first steps, my blunders, my fears.

8. Benjamin Antieau says:

Is it just me or do all of the pictures in this post fail to display?

• John Baez says:

The U. C. Riverside math department website where all my pictures are kept seems to be down tonight—for repairs, I hope. So, most pictures in most posts here won’t display now.

Right now it’s 12:47 Monday morning in California, shortly after midnight. If the pictures still aren’t working when the sun comes up, I’ll contact someone.

9. John,
a couple of things.

First, you said to download all the files from 1948 to 2013, but in your initial code it looks like you only download files from 1950 to 1979.

Second, I think downloading and installing R is not enough, since it appears that you also need to install the RNetCDF library; otherwise you get an error saying that there is no such package and, understandably, open.nc does not work.

So you might want to insert a line or two explaining how to install that library …

10. In case you’re interested, here are the other parts of this series:

El Niño project (part 1): basic introduction to El Niño and our project here.

El Niño project (part 2): introduction to the physics of El Niño.

El Niño project (part 3): summary of the work of Ludescher et al.

El Niño project (part 4): how Graham Jones replicated the work by Ludescher et al, using software written in R.

El Niño project (part 5): how to download R and use it to get files of climate data.

El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher et al.

El Niño project (part 7): the definition of El Niño.

11. Graham Jones of the Azimuth Project wrote code implementing Ludescher et al’s algorithm, as best as we could understand it, and got results close to theirs, though not identical. […]