El Niño Project (Part 6)

23 July, 2014

guest post by Steven Wenner

Hi, I’m Steve Wenner.

I’m an industrial statistician with over 40 years of experience in a wide range applications (quality, reliability, product development, consumer research, biostatistics); but, somehow, time series only rarely crossed my path. Currently I’m working for a large consumer products company.

My undergraduate degree is in physics, and I also have a master’s in pure math. I never could reconcile how physicists used math (explain that Dirac delta function to me again in math terms? Heaviside calculus? On the other hand, I thought category theory was abstract nonsense until John showed me otherwise!). Anyway, I had to admit that I lacked the talent to pursue pure math or theoretical physics, so I became a statistician. I never regretted it—statistics has provided a very interesting and intellectually challenging career.

I got interested in Ludescher et al’s paper on El Niño prediction by reading Part 3 of this series. I have no expertise in climate science, except for an intense interest in the subject as a concerned citizen. So, I won’t talk about things like how Ludescher et al use a nonstandard definition of ‘El Niño’—that’s a topic for another time. Instead, I’ll look at some statistical aspects of their paper:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Very early warning of next El Niño, Proceedings of the National Academy of Sciences, February 2014. (Click title for free version, journal name for official version.)

Analysis

I downloaded the NOAA adjusted monthly temperature anomaly data and compared the El Niño periods with the charts in this paper. I found what appear to be two errors (“phantom” El Niños) and noted some interesting situations. Some of these are annotated on the images below. Click to enlarge them:

I also listed for each year whether an El Niño initiation was predicted, or not, and whether one actually happened. I did the predictions five ways: first, I listed the author’s “arrows” as they appeared on their charts, and then I tried to match their predictions by following in turn four sets of rules. Nevertheless, I could not come up with any detailed rules that exactly reproduced the author’s results.

These were the rules I used:

An El Niño initiation is predicted for a calendar year if during the preceding year the average link strength crossed above the 2.82 threshold. However, we could also invoke additional requirements. Two possibilities are:

1. Preemption rule: the prediction of a new El Niño is canceled if the preceding year ends in an El Niño period.

2. End-of-year rule: the link strength must be above 2.82 at year’s end.

I counted the predictions using all four combinations of these two rules and compared the results to the arrows on the charts.

I defined an “El Niño initiation month” to be a month where the monthly average adjusted temperature anomaly rises up to at least 0.5 C and remains above or equal to 0.5 °C for at least five months. Note that the NOAA El Niño monthly temperature estimates are rounded to hundredths; and, on occasion, the anomaly is reported as exactly 0.5 °C. I found slightly better agreement with the authors’ El Niño periods if I counted an anomaly of exactly 0.5 °C as satisfying the threshold criterion, instead of using the strictly “greater than” condition.

Anyway, I did some formal hypothesis testing and estimation under all five scenarios. The good news is that under most scenarios the prediction method gave better results than merely guessing. (But, I wonder how many things the authors tried before they settled on their final method? Also, did they do all their work on the learning series, and then only at the end check the validation series—or were they checking both as they went about their investigations?)

The bad news is that the predictions varied with the method, and the methods were rather weak. For instance, in the training series there were 9 El Niño periods in 30 years; the authors’ rules (whatever they were, exactly) found five of the nine. At the same time, they had three false alarms in the 21 years that did not have an El Niño initiated.

I used Fisher’s exact test to compute some p-values. Suppose (as our ‘null hypothesis’) that Ludescher et al’s method does not improve the odds of a successful prediction of an El Nino initiation. What’s the probability of that method getting at least as many predictions right just by chance? Answer: 0.032 – this is marginally more significant than the conventional 1 in 20 chance that is the usual threshold for rejecting a null hypothesis, but still not terribly convincing. This was, by the way, the most significant of the five p-values for the alternative rule sets applied to the learning series.

I also computed the “relative risk” statistics for all scenarios; for instance, we are more than three times as likely to see an El Niño initiation if Ludescher et al predict one, than if they predict otherwise (the 90% confidence interval for that ratio is 1.2 to 9.7, with the point estimate 3.4). Here is a screen shot of some statistics for that case:

Here is a screen shot of part of the spreadsheet list I made. In the margin on the right I made comments about special situations of interest.

Again, click to enlarge—but my whole working spreadsheet is available with more details for anyone who wishes to see it. I did the statistical analysis with a program called JMP, a product of the SAS corporation.

My overall impression from all this is that Ludescher et al are suggesting a somewhat arbitrary (and not particularly well-defined) method for revealing the relationship between link strength and El Niño initiation, if, indeed, a relationship exists. Slight variations in the interpretation of their criteria and slight variations in the data result in appreciably different predictions. I wonder if there are better ways to analyze these two correlated time series.

The Harmonograph

18 July, 2014

Anita Chowdry is an artist based in London. While many are exploring electronic media and computers, she’s going in the opposite direction, exploring craftsmanship and the hands-on manipulation of matter. I find this exciting, perhaps because I spend most of my days working on my laptop, becoming starved for richer sensations. She writes:

Today, saturated as we are with the ephemeral intangibility of virtual objects and digital functions, there is a resurgence of interest in the ingenious mechanical contraptions of pre-digital eras, and in the processes of handcraftsmanship and engagement with materials. The solid corporality of analogue machines, the perceivable workings of their kinetic energy, and their direct invitation to experience their science through hands-on interaction brings us back in touch with our humanity.

The ‘steampunk’ movement is one way people are expressing this renewed interest, but Anita Chowdry goes a bit deeper than some of that. For starters, she’s studied all sorts of delightful old-fashioned crafts, like silverpoint, a style of drawing used before the invention of graphite pencils. The tool is just a piece of silver wire mounted on a writing implement; a bit of silver rubs off and creates a gray line. The effect is very subtle:

In January she went to Cairo and worked with a master calligrapher, Ahmed Fares, to recreate the title page of a 16th-century copy of Avicenna’s Canon of Medicine, or al-Qanun fi’l Tibb:

This required making gold ink:

The secret is actually pure hard work; rubbing it by hand with honey for hours on end to break up the particles of gold into the finest powder, and then washing it thoroughly in distilled water to remove all impurities.

The results:

I met her in Oxford this March, and we visited the Museum of the History of Science together. This was a perfect place, because it’s right next to the famous Bodleian, and it’s full of astrolabes, sextants, ancient slide rules and the like…

… and one of Anita Chowdry’s new projects involves another piece of romantic old technology: the harmonograph!

The harmonograph

A harmonograph is a mechanical apparatus that uses pendulums to draw a geometric image. The simplest so-called ‘lateral’ or ‘rectilinear’ harmonograph uses two pendulums: one moves a pen back and forth along one axis, while the other moves the drawing surface back and forth along a perpendicular axis. By varying their amplitudes, frequencies and the phase difference, we can get quite a number of different patterns. In the linear approximation where the pendulums don’t swing to high, we get Lissajous curves:

$x(t) = A \sin(a t + \delta)$

$y(t) = B \sin(b t)$

For example, when the amplitudes $A$ and $B$ are both 1, the frequencies are $a = 3$ and $b = 4,$ and the phase difference $\delta$ is $\pi/2,$ we get this:

Harmonographs don’t serve any concrete practical purpose that I know; they’re a diversion, an educational device, or a form of art for art’s sake. They go back to the mid-1840s.

It’s not clear who invented the harmonograph. People often credit Hugh Blackburn, a professor of mathematics at the University of Glasgow who was a friend of the famous physicist Kelvin. He is indeed known for studying a pendulum hanging on a V-shaped string, in 1844. This is now called the Blackburn pendulum. But it’s not used in any harmonograph I know about.

On the other hand, Anita Chowdry has a book called The Harmonograph. Illustrated by Designs actually Drawn by the Machine, written in 1893 by one H. Irwine Whitty. This book says the harmonograph

was first constructed by Mr. Tisley, of the firm Tisley and Spiller, the well-known opticians…

So, it remains mysterious.

The harmonograph peaked in popularity in the 1890s. I have no idea how popular it ever was; it seems a rather cerebral form of entertainment. As the figures from Whitty’s book show, it was sometimes used to illustrate the Pythagorean theory of chords as frequency ratios. Indeed, this explains the name ‘harmomograph’:

At left the frequencies are exactly $a = 3, b = 2,$ just as we’d have in two notes making a major fifth. Three choices of phase difference are shown. In the pictures at right, actually drawn by the machine, the frequencies aren’t perfectly tuned, so we get more complicated Lissajous curves.

How big was the harmonograph craze, and how long did it last? It’s hard for me to tell, but this book published in 1918 gives some clues:

• Archibald Williams, Things to Make: Home-made harmonographs (part 1, part 2, part 3), Thomas Nelson and Sons, Ltd., 1918.

It discusses the lateral harmonograph. Then it treats Joseph Goold’s ‘twin elliptic pendulum harmonograph’, which has a pendulum free to swing in all directions connected to a pen, and second pendulum free to swing in all directions affecting the motion of the paper. It also shows a miniature version of the same thing, and how to build it yourself. It explains the connection with harmony theory. And it explains the value of the harmonograph:

Value of the harmonograph

A small portable harmonograph will be found to be a good means of entertaining friends at home or elsewhere. The gradual growth of the figure, as the card moves to and fro under the pen, will arouse the interest of the least scientifically inclined person; in fact, the trouble is rather to persuade spectators that they have had enough than to attract their attention. The cards on which designs have been drawn are in great request, so that the pleasure of the entertainment does not end with the mere exhibition. An album filled with picked designs, showing different harmonies and executed in inks of various colours, is a formidable rival to the choicest results of the amateur photographer’s skill.

“In great request”—this makes it sound like harmonographs were all the rage! On the other hand, I smell a whiff of desperate salesmanship, and he begins the chapter by saying:

Have you ever heard of the harmonograph? If not, or if at the most you have very hazy ideas as to what it is, let me explain.

So even at its height of popularity, I doubt most people knew what a harmonograph was. And as time passed, more peppy diversions came along and pushed it aside. The phonograph, for example, began to catch on in the 1890s. But the harmonograph never completely disappeared. If you look on YouTube, you’ll find quite a number.

The harmonograph project

Anita Chowdry got an M.A. from Central Saint Martin’s college of Art and Design. That’s located near St. Pancras Station in London.

She built a harmonograph as part of her course work, and it worked well, but she wanted to make a more elegant, polished version. Influenced by the Victorian engineering of St. Pancras Station, she decided that “steel would be the material of choice.”

So, starting in 2013, she began designing a steel harmonograph with the help of her tutor Eleanor Crook and the engineering metalwork technician Ricky Lee Brawn.

Artist and technician David Stewart helped her make the steel parts. Learning to work with steel was a key part of this art project:

The first stage of making the steel harmonograph was to cut out and prepare all the structural components. In a sense, the process is a bit like tailoring—you measure and cut out all the pieces, and then put them together an a logical order, investing each stage with as much care and craftsmanship as you can muster. For the flat steel components I had medium-density fibreboard forms cut on the college numerical control machine, which David Stewart used as patterns to plasma-cut the shapes out of mild carbon-steel. We had a total of fifteen flat pieces for the basal structure, which were to be welded to a large central cylinder.

My job was to ‘finish’ the plasma-cut pieces: I refined the curves with an angle-grinder, drilled the holes that created the delicate openwork patterns, sanded everything to smooth the edges, then repeatedly heated and quenched each piece at the forge to darken and strengthen them. When Dave first placed the angle-grinder in my hands I was terrified—the sheer speed and power and noise of the monstrous thing connecting with the steel with a shower of sparks had a brutality and violence about it that I had never before experienced. But once I got used to the heightened energy of the process it became utterly enthralling. The grinder began to feel as fluent and expressive as a brush, and the steel felt responsive and alive. Like all metalwork processes, it demands a total, immersive concentration—you can get lost in it for hours!

Ricky Lee Brawn worked with her to make the brass parts:

Below you can see the brass piece he’s making, called a finial, among the steel legs of the partially finished harmonograph:

There are three legs, each with three feet.

The groups of three look right, because I conceived the entire structure on the basis of the three pendulums working at angles of 60 degrees in relation to one another (forming an equilateral triangle)—so the magic number is three and its multiples.

With three pendulums you can generate more complicated generalizations of Lissajous curves. In the language of music, three frequencies gives you a triplet!

Things become still more complex if we leave the linear regime, where motions are described by sines and cosines. I don’t understand Anita Chowdry’s harmonograph well enough to know if nonlinearity plays a crucial role. But it gives patterns like these:

Here is the completed harmonograph, called the ‘Iron Genie’, in action in the crypt of the St. Pancras Church:

And now, I’m happy to say, it’s on display at the Museum of the History of Science, where we met in Oxford. If you’re in the area, give it a look! She’s giving free public talks about it at 3 pm on

• Saturday July 19th
• Saturday August 16th
• Saturday September 20th

in 2014. And if you can’t visit Oxford, you can still visit her blog!

The mathematics

I think the mathematics of harmonographs deserves more thought. The basic math required for this was developed by the Irish mathematician William Rowan Hamilton around 1834. Hamilton was just the sort of character who would have enjoyed the harmonograph. But other crucial ideas were contributed by Jacobi, Poincaré and many others.

In a simple ‘lateral’ device, the position and velocity of the machine takes 4 numbers to describe: the two pendulum’s angles and angular velocities. In the language of classical mechanics, the space of states of the harmonograph is a 4-dimensional symplectic manifold, say $X.$ Ignoring friction, its motion is described by Hamilton’s equations. These equations can give behavior ranging from completely integrable (as orderly as possible) to chaotic.

For small displacements our lateral harmonograph about the state of rest, I believe its behavior will be completely integrable. If so, for any initial conditions, its motion will trace out a spiral on some 2-dimensional torus $T$ sitting inside $X.$ The position of pen on paper provides a map

$f : X \to \mathbb{R}^2$

and so the spiral is mapped to some curve on the paper!

We can ask what sort of curves can arise. Lissajous curves are the simplest, but I don’t know what to say in general. We might be able to understand their qualitative features without actually solving Hamilton’s equations. For example, there are two points where the curves seem to ‘focus’ here:

That’s the kind of thing mathematical physicists can try to understand, a bit like caustics in optics.

If we have a ‘twin elliptic pendulum harmonograph’, the state space $X$ becomes 8-dimensional, and $T$ becomes 4-dimensional if the system is completely integrable. I don’t know the dimension of the state space for Anita Chowdry’s harmonograph, because I don’t know if her 3 pendulum can swing in just one direction each, or two!

But the big question is whether a given harmonograph is completely integrable… in which case the story I’m telling goes through… or whether it’s chaotic, in which case we should expect it to make very irregular pictures. A double pendulum—that is, a pendulum hanging on another pendulum—will be chaotic if it starts far enough from its point of rest.

Here is a chaotic ‘double compound pendulum’, meaning that it’s made of two rods:

Acknowledgements

Almost all the pictures here were taken by Anita Chowdry, and I thank her for letting me use them. The photo of her harmonograph in the Museum of the History of Science was taken by Keiko Ikeuchi, and the copyright for this belongs to the Museum of the History of Science, Oxford. The video was made by Josh Jones. The image of a Lissajous curve was made by Alessio Damato and put on Wikicommons with a Creative Commons Attribution-Share Alike license. The double compound pendulum was made by Catslash and put on Wikicommons in the public domain.

El Niño Project (Part 5)

12 July, 2014

And now for some comic relief.

Last time I explained how to download some weather data and start analyzing it, using programs written by Graham Jones. When you read that, did you think “Wow, that’s easy!” Or did you think “Huh? Run programs in R? How am I supposed to do that?”

If you’re in the latter group, you’re like me. But I managed to do it. And this is the tale of how. It’s a blow-by-blow account of my first steps, my blunders, my fears.

I hope that if you’re intimidated by programming, my tale will prove that you too can do this stuff… provided you have smart friends, or read this article.

More precisely, this article explains how to:

• download and run software that runs the programming language R;

• download temperature data from the National Center for Atmospheric Research;

• use R to create a file of temperature data for a given latitude/longitude rectangle for a given time interval.

I will not attempt to explain how to program in R.

If you want to copy what I’m doing, please remember that a few details depend on the operating system. Since I don’t care about operating systems, I use a Windows PC. If you use something better, some details will differ for you.

Also: at the end of this article there are some very basic programming puzzles.

First, let me explain a bit about my relation to computers.

I first saw a computer at the Lawrence Hall of Science in Berkeley, back when I was visiting my uncle in the summer of 1978. It was really cool! They had some terminals where you could type programs in BASIC and run them.

I got especially excited when he gave me the book Computer Lib/Dream Machines by Ted Nelson. It espoused the visionary idea that people could write texts on computers all around the world—”hypertexts” where you could click on a link in one and hop to another!

I did more programming the next year in high school, sitting in a concrete block room with a teletype terminal that was connected to a mainframe somewhere far away. I stored my programs on paper tape. But my excitement gradually dwindled, because I was having more fun doing math and physics using just pencil and paper. My own brain was more easy to program than the machine. I did not start a computer company. I did not get rich. I learned quantum mechanics, and relativity, and Gödel’s theorem.

Later I did some programming in APL in college, and still later I did a bit in Mathematica in the early 1990s… but nothing much, and nothing sophisticated. Indeed, none of these languages would be the ones you’d choose to explore sophisticated ideas in computation!

I’ve just never been very interested… until now. I now want to do a lot of data analysis. It will be embarrassing to keep asking other people to do all of it for me. I need to learn how to do it myself.

Maybe you’d like to do this stuff too—or at least watch me make a fool of myself. So here’s my tale, from the start.

To use the programs written by Graham, I need to use R, a language currently popular among statisticians. It is not the language my programmer friends would want me to learn—they’d want me to use something like Python. But tough! I can learn that later.

To download R to my Windows PC, I cleverly type download R into Google, and go to the top website it recommends:

I click the big fat button on top saying

Download R 3.1.0 for Windows

and get asked to save a file R-3.1.0-win.exe. I save it in my Downloads folder; it takes a while to download since it was 57 megabytes. When I get it, I click on it and follow the easy default installation instructions. My Desktop window now has a little icon on it that says R.

Clicking this, I get an interface where I can type commands after a red

>

symbol. Following Graham’s advice, I start by trying

> 2^(1:8)

which generates a list of powers of 2 from 21 to 28, like this:

[1] 2 4 8 16 32 64 128 256

Then I try

> mean(2^(1:8))

which gives the arithmetic mean of this list. Somewhat more fun is

> plot(rnorm(20))

which plots a bunch of points, apparently 20 standard normal deviates.

When I hear “20 standard normal deviates” I think of the members of a typical math department… but no, those are deviants. Standard normal deviates are random numbers chosen from a Gaussian distribution of mean zero and variance 1.

To do something more interesting, I need to input data.

The papers by Ludescher et al use surface air temperatures in a certain patch of the Pacific, so I want to get ahold of those. They’re here:

NCEP is the National Centers for Environmental Prediction, and NCAR is the National Center for Atmospheric Research. They have a bunch of files here containing worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid (that’s 144 × 73 grid points), from 1948 to 2010. And if you go here, the website will help you get data from within a chosen rectangle in a grid, for a chosen time interval.

These are NetCDF files. NetCDF stands for Network Common Data Form:

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

According to my student Blake Pollard:

I know about ftp: I’m so old that I know this was around before the web existed. Back then it meant “faster than ponies”. But I need to get R to accept data from these NetCDF files: that’s what scares me!

Graham said that R has a “package” called RNetCDF for using NetCDF files. So, I need to get ahold of this package, download some files in the NetCDF format, and somehow get R to eat those files with the help of this package.

At first I was utterly clueless! However, after a bit of messing around, I notice that right on top of the R interface there’s a menu item called Packages. I boldly click on this and choose Install Package(s).

I am rewarded with an enormous alphabetically ordered list of packages… obviously statisticians have lots of stuff they like to do over and over! I find RNetCDF, click on that and click something like “OK”.

I’m asked if I want to use a “personal library”. I click “no”, and get an error message. So I click “yes”. The computer barfs out some promising text:

utils:::menuInstallPkgs() trying URL 'http://cran.stat.nus.edu.sg/bin/windows/contrib/3.1/RNetCDF_1.6.2-3.zip' Content type 'application/zip' length 548584 bytes (535 Kb) opened URL downloaded 535 Kb

 package ‘RNetCDF’ successfully unpacked and MD5 sums checked 

The downloaded binary packages are in C:\Users\JOHN\AppData\Local\Temp\Rtmp4qJ2h8\downloaded_packages

Success!

But now I need to figure out how to download a file and get R to eat it and digest it with the help of RNetCDF.

At this point my deus ex machina, Graham, descends from the clouds and says:

You can download the files from your browser. It is probably easiest to do that for starters. Put
ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/
into the browser, then right-click a file and Save link as…

This code will download a bunch of them:
 for (year in 1950:1979) { download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb") } 

It will put them into the “working directory”, probably C:\Users\JOHN\Documents. You can find the working directory using getwd(), and change it with setwd(). But you must use / not \ in the filepath.

Compared to UNIX, the Windows operating system has the peculiarity of using \ instead of / in path names, but R uses the UNIX conventions even on Windows.

So, after some mistakes, in the R interface I type

> setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")

and then type

> getwd()

to see if I’ve succeeded. I’m rewarded with

[1] "C:/Users/JOHN/Documents/My Backups/azimuth/el nino"

Good!

Then, following Graham’s advice, I cut-and-paste this into the R interface:

for (year in 1950:1979) { download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb") }

It seems to be working! A little bar appears showing how each year’s data is getting downloaded. It chugs away, taking a couple minutes for each year’s worth of data.

Using R to process NetCDF files

Okay, now I’ve got all the worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid from 1950 to 1970.

The world is MINE!

But what do I do with it? Graham’s advice is again essential, along with a little R program, or script, that he wrote:

The R script netcdf-convertor.R from

https://github.com/azimuth-project/el-nino/tree/master/R

will eat the file, digest it, and spit it out again. It contains instructions.

I go to this URL, which is on GitHub, a popular free web-based service for software development. You can store programs here, edit them, and GitHub will help you keep track of the different versions. I know almost nothing about this stuff, but I’ve seen it before, so I’m not intimidated.

I click on the blue thing that says netcdf-convertor.R and see something that looks like the right script. Unfortunately I can’t see how to download it! I eventually see a button I’d overlooked, cryptically labelled “Raw”. I realize that since I don’t want a roasted or oven-broiled piece of software, I should click on this. I indeed succeed in downloading netcdf-convertor.R this way. Graham later says I could have done something better, but oh well. I’m just happy nothing has actually exploded yet.

Once I’ve downloaded this script, I open it using an text processor and look at it. At top are a bunch of comments written by Graham:

 ###################################################### ######################################################

 # You should be able to use this by editing this # section only. setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino") lat.range <- 13:14 lon.range <- 142:143 firstyear <- 1957 lastyear <- 1958 outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt") ###################################################### ###################################################### # Explanation # 1. Use setwd() to set the working directory # to the one containing the .nc files such as # air.sig995.1951.nc. # Example: # setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino") # 2. Supply the latitude and longitude range. The # NOAA data is every 2.5 degrees. The ranges are # supplied as the number of steps of this size. # For latitude, 1 means North Pole, 73 means South # Pole. For longitude, 1 means 0 degrees East, 37 # is 90E, 73 is 180, 109 is 90W or 270E, 144 is # 2.5W. # These roughly cover Scotland. # lat.range <- 13:14 # lon.range <- 142:143 # These are the area used by Ludescher et al, # 2013. It is 27x69 points which are then # subsampled to 9 by 23. # lat.range <- 24:50 # lon.range <- 48:116 # 3. Supply the years # firstyear <- 1950 # lastyear <- 1952 # 4. Supply the output name as a text string. # paste0() concatenates strings which you may find # handy: # outputfilename <- paste0("Pacific-", firstyear, "-", lastyear, ".txt") ###################################################### ###################################################### # Example of output # S013E142 S013E143 S014E142 S014E143 # Y1950P001 281.60000272654 281.570002727211 281.60000272654 280.970002740622 # Y1950P002 280.740002745762 280.270002756268 281.070002738386 280.49000275135 # Y1950P003 280.100002760068 278.820002788678 281.120002737269 280.070002760738 # Y1950P004 281.070002738386 279.420002775267 281.620002726093 280.640002747998 # ... # Y1950P193 285.450002640486 285.290002644062 285.720002634451 285.75000263378 # Y1950P194 285.570002637804 285.640002636239 286.070002626628 286.570002615452 # Y1950P195 285.92000262998 286.220002623275 286.200002623722 286.620002614334 # ... # Y1950P364 276.100002849475 275.350002866238 276.37000284344 275.200002869591 # Y1950P365 276.990002829581 275.820002855733 276.020002851263 274.72000288032 # Y1951P001 278.220002802089 277.470002818853 276.700002836064 275.870002854615 # Y1951P002 277.750002812594 276.890002831817 276.650002837181 275.520002862439 # ... # Y1952P365 280.35000275448 280.120002759621 280.370002754033 279.390002775937 # There is one row for each day, and 365 days in # each year (leap days are omitted). In each row, # you have temperatures in Kelvin for each grid # point in a rectangle. # S13E142 means 13 steps South from the North Pole # and 142 steps East from Greenwich. The points # are in reading order, starting at the top-left # (Northmost, Westmost) and going along the top # row first. # Y1950P001 means year 1950, day 1. (P because # longer periods might be used later.) 

###################################################### ###################################################### 

The instructions are admirably detailed concerning what I should do, but they don't say where the output will appear when I do it. This makes me nervous. I guess I should just try it. After all, the program is not called DestroyTheWorld.

Unfortunately, at this point a lot of things start acting weird.

It's too complicated and boring to explain in detail, but basically, I keep getting a file missing error message. I don't understand why this happens under some conditions and not others. I try lots of experiments.

Eventually I discover that one year of temperature data failed to download—the year 1949, right after the first year available! So, I'm getting the error message whenever I try to do anything involving that year of data.

To fix the problem, I simply download the 1949 data by hand from here:

(You can open ftp addresses in a web browser just like http addresses.) I put it in my working directory for R, and everything is fine again. Whew!

By the time things I get this file, I sort of know what to do—after all, I've spent about an hour trying lots of different things.

I decide to create a file listing temperatures near where I live in Riverside from 1948 to 1979. To do this, I open Graham's script netcdf-convertor.R in a word processor and change this section:
 setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino") lat.range <- 13:14 lon.range <- 142:143 firstyear <- 1957 lastyear <- 1958 outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt") 

to this:
 setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino") lat.range <- 23:23 lon.range <- 98:98 firstyear <- 1948 lastyear <- 1979 outputfilename <- paste0("Riverside-", firstyear, "-", lastyear, ".txt") 

Why? Well, I want it to put the file in my working directory. I want the years from 1948 to 1979. And I want temperature data from where I live!

Googling the info, I see Riverside, California is at 33.9481° N, 117.3961° W. 34° N is about 56 degrees south of the North Pole, which is 22 steps of size 2.5°. And because some idiot decided everyone should count starting at 1 instead of 0 even in contexts like this, the North Pole itself is step 1, not step 0… so Riverside is latitude step 23. That's why I write:

lat.range <- 23:23

Similarly, 117.5° W is 242.5° E, which is 97 steps of size 2.5°… which counts as step 98 according to this braindead system. That's why I write:

lon.range <- 98:98

Having done this, I save the file netcdf-convertor.R under another name, Riverside.R.

And then I do some stuff that it took some fiddling around to discover.

First, in my R interface I go to the menu item File, at far left, and click on Open script. It lets me browse around, so I go to my working directory for R and choose Riverside.R. A little window called R editor opens up in my R interface, containing this script.

I'm probably not doing this optimally, but I can now right-click on the R editor and see a menu with a choice called Select all. If I click this, everything in the window turns blue. Then I can right-click again and choose Run line or selection. And the script runs!

Voilà!

It huffs and puffs, and then stops. I peek in my working directory, and see that a file called

Riverside.1948-1979.txt

has been created. When I open it, it has lots of lines, starting with these:
 S023E098 Y1948P001 279.95 Y1948P002 280.14 Y1948P003 282.27 Y1948P004 283.97 Y1948P005 284.27 Y1948P006 286.97 

As Graham promised, each line has a year and day label, followed by a vector… which in my case is just a single number, since I only wanted the temperature in one location. I’m hoping this is the temperature near Riverside, in kelvin.

A small experiment

To see if this is working, I’d like to plot these temperatures and see if they make sense. Unfortunately I have no idea how to get R to take a file containing data of the sort I have and plot it! I need to learn how, but right now I’m exhausted, so I use another method to get the job done— a method that’s too suboptimal and embarrassing to describe here. (Hint: it involves the word “Excel”.)

I do a few things, but here’s the most interesting one—namely, not very interesting. I plot the temperatures for 1963:

I compare it to some publicly available data, not from Riverside, but from nearby Los Angeles:

As you can see, there was a cold day on January 13th, when the temperature dropped to 33°F. That seems to be visible on the graph I made, and looking at the data from which I made the graph, I see the temperature dropped to 251.4 kelvin on the 13th: that’s -7°F, very cold for here. It does get colder around Riverside than in Los Angeles in the winter, since it’s a desert, with temperatures not buffered by the ocean. So, this does seem compatible with the public records. That’s mildly reassuring.

But other features of the graph don’t match, and I’m not quite sure if they should or not. So, all this very tentative and unimpressive. However, I’ve managed to get over some of my worst fears, download some temperature data, and graph it! Now I need to learn how to use R to do statistics with this data, and graph it in a better way.

Puzzles

You can help me out by answering these puzzles. Later I might pose puzzles where you can help us write really interesting programs. But for now it’s just about learning R.

Puzzle 1. Given a text file with lots of lines of this form:
 S023E098 Y1948P001 279.95 Y1948P002 280.14 Y1948P003 282.27 Y1948P004 283.97 

write an R program that creates a huge vector, or list of numbers, like this:
 279.95, 280.14, 282.27, 283.97, ... 

Puzzle 2: Extend the above program so that it plots this list of numbers, or outputs it to a new file.

If you want to test your programs, here’s the actual file:

More puzzles

If those puzzles are too easy, here are two more. I gave these last time, but everyone was too wimpy to tackle them.

Puzzle 3. Modify the software so that it uses the same method to predict El Niños from 1980 to 2013. You’ll have to adjust two lines in netcdf-convertor-ludescher.R:

firstyear <- 1948
lastyear <- 1980


should become

firstyear <- 1980
lastyear <- 2013


or whatever range of years you want. You’ll also have to adjust names of years in ludescher-replication.R. Search the file for the string 19 and make the necessary changes. Ask me if you get stuck.

Puzzle 4. Right now we average the link strength over all pairs $(i,j)$ where $i$ is a node in the El Niño basin defined by Ludescher et al and $j$ is a node outside this basin. The basin consists of the red dots here:

What happens if you change the definition of the El Niño basin? For example, can you drop those annoying two red dots that are south of the rest, without messing things up? Can you get better results if you change the shape of the basin?

To study these questions you need to rewrite ludescher-replication.R a bit. Here’s where Graham defines the El Niño basin:

ludescher.basin <- function() {
lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
stopifnot(length(lats) == length(lons))
list(lats=lats,lons=lons)
}


These are lists of latitude and longitude coordinates: (5,11), (5,12), (5,13), etc. A coordinate like (5,11) means the little circle that’s 5 down and 11 across in the grid on the above map. So, that’s the leftmost point in Ludescher’s El Niño basin. By changing these lists, you can change the definition of the El Niño basin.

Next time I’ll discuss some criticisms of Ludescher et al’s paper, but later we will return to analyzing temperature data, looking for interesting patterns.

El Niño Project (Part 4)

8 July, 2014

As the first big step in our El Niño prediction project, Graham Jones replicated the paper by Ludescher et al that I explained last time. Let’s see how this works!

Graham did it using R, a programming language that’s good for statistics. If you prefer another language, go ahead and write software for that… and let us know! We can add it to our repository.

Today I’ll explain this stuff to people who know their way around computers. But I’m not one of those people! So, next time I’ll explain the nitty-gritty details in a way that may be helpful to people more like me.

Getting temperature data

Say you want to predict El Niños from 1950 to 1980 using Ludescher et al’s method. To do this, you need daily average surface air temperatures in this grid in the Pacific Ocean:

Each square here is 7.5° × 7.5°. To compute these temperatures, you have to start with temperatures on a grid with smaller squares that are 2.5° × 2.5° in size:

• Earth System Research Laboratory, NCEP Reanalysis Daily Averages Surface Level, or ftp site.

This website will give you daily average surface air temperatures in whatever rectangle and time interval you want. It delivers this data in a format called NetCDF, meaning Network Common Data Form.

We’ll take a different approach. We’ll download all the temperatures in this database, and then extract the data we need using R scripts. That way, when we play other games with temperature data later, we’ll already have it.

So, go ahead and download all the files from air.sig995. 1948.nc to air.sig995.2013.nc. It will take a while… but you’ll own the world.

There are different ways to do this. If you have R fired up, just cut-and-paste this into the console:

for (year in 1950:1979) {
"ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.",
year, ".nc"),
destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}


Getting the temperatures you need

Now you have files of daily average temperatures on a 2.5° by 2.5° grid from 1948 to 2013. Make sure all these files are in your working directory for R, and download this R script from GitHub:

You can use this to get the temperatures in any time interval and any rectangle of grid points you want. The details are explained in the script. But the defaults are set to precisely what you need now!

So, just run this script. You should get a file called Pacific-1948-1980.txt. This has daily average temperatures in the region we care about, from 1948 to 1980. It should start with a really long line listing locations in a 27 × 69 grid, starting with S024E48 and ending with S248E116. I’ll explain this coordinate scheme at the end of this post. Then come hundreds of lines listing temperatures in kelvin at those locations on successive days. The first of these lines should start with Y1948P001, meaning the first day of 1948.

And I know what you’re dying to ask: yes, leap days are omitted! This annoys the perfectionist in me… but leap years make data analysis more complicated, so Ludescher et al ignore leap days, and we will too.

Getting the El Niño data

You’ll use this data to predict El Niños, so you also want a file of the Niño 3.4 index. Remember from last time, this says how much hotter the surface of this patch of seawater is than usual for this time of year:

You can download the file from here:

This is a copy of the monthly Niño 3.4 index from the US National Weather Service, which I discussed last time. It has monthly Niño 3.4 data in the column called ANOM.

Put this file in your working directory.

Predicting El Niños

Now for the cool part. Last time I explained the average link strength’, which Ludescher et al use to predict El Niños. Now you’ll compute it.

You’ve got Pacific-1948-1980.txt and nino3.4-anoms.txt in your working directory. Download this R script written by Graham Jones, and run it:

It takes about 45 minutes on my laptop. It computes the average link strength $S$ at ten-day intervals. Then it plots $S$ in red and the Niño 3.4 index in blue, like this:

(Click to enlarge.) The shaded region is where the Niño 3.4 index is below 0.5°C. When the blue curve escapes this region and then stays above 0.5°C for at least 5 months, Ludescher et al say that there’s an El Niño.

The horizontal red line shows the threshold $\theta = 2.82.$ When $S$ exceeds this, and the Niño 3.4 index is not already over 0.5°C, Ludescher et al predict that there will be an El Niño in the next calendar year!

Our graph almost agrees with theirs:

Here the green arrows show their successful predictions, dashed arrows show false alarms, and a little letter n appears next to each El Niño they failed to predict.

The graphs don’t match perfectly. For the blue curves, we could be using Niño 3.4 from different sources. Differences in the red curves are more interesting, since that’s where all the work is involved, and we’re starting with the same data. Besides actual bugs, which are always possible, I can think of various explanations. None of them are extremely interesting, so I’ll stick them in the last section!

If you want to get ahold of our output, you can do so here:

This has the average link strength $S$ at 10-day intervals, starting from day 730 and going until day 12040, where day 1 is the first of January 1948.

So, you don’t actually have to run all these programs to get our final result. However, these programs will help you tackle some programming challenges which I’ll list now!

Programming challenges

There are lots of variations on the Ludescher et al paper which we could explore. Here are a few easy ones to get you started. If you do any of these, or anything else, let me know!

I’ll start with a really easy one, and work on up.

Challenge 1. Repeat the calculation with temperature data from 1980 to 2013. You’ll have to adjust two lines in netcdf-convertor-ludescher.R:

firstyear <- 1948
lastyear <- 1980


should become

firstyear <- 1980
lastyear <- 2013


or whatever range of years you want. You’ll also have to adjust names of years in ludescher-replication.R. Search the file for the string 19 and make the necessary changes. Ask me if you get stuck.

Challenge 2. Repeat the calculation with temperature data on a 2.5° × 2.5° grid instead of the coarser 7.5° × 7.5° grid Ludescher et al use. You’ve got the data you need. Right now, the program ludescher-replication.R averages out the temperatures over little 3 × 3 squares. It starts with temperatures on a 27 × 69 grid and averages them out to obtain temperatures on the 9 × 23 grid shown here:

Here’s where that happens:

# the data per day is reduced from e.g. 27x69 to 9x23.

subsample.3x3 <- function(vals) {
stopifnot(dim(vals)[2] %% 3 == 0)
stopifnot(dim(vals)[3] %% 3 == 0)
n.sslats <- dim(vals)[2]/3
n.sslons <- dim(vals)[3]/3
ssvals <- array(0, dim=c(dim(vals)[1], n.sslats, n.sslons))
for (d in 1:dim(vals)[1]) {
for (slat in 1:n.sslats) {
for (slon in 1:n.sslons) {
ssvals[d, slat, slon] <- mean(vals[d, (3*slat-2):(3*slat), (3*slon-2):(3*slon)])
}
}
}
ssvals
}


So, you need to eliminate this and change whatever else needs to be changed. What new value of the threshold $\theta$ looks good for predicting El Niños now? Most importantly: can you get better at predicting El El Niños this way?

The calculation may take a lot longer, since you’ve got 9 times as many grid points and you’re calculating correlations between pairs. So if this is too tough, you can go the other way: use a coarser grid and see how much that degrades your ability to predict El Niños.

Challenge 3. Right now we average the link strength over all pairs $(i,j)$ where $i$ is a node in the El Niño basin defined by Ludescher et al and $j$ is a node outside this basin. The basin consists of the red dots here:

What happens if you change the definition of the El Niño basin? For example, can you drop those annoying two red dots that are south of the rest, without messing things up? Can you get better results if you change the shape of the basin?

To study these questions you need to rewrite ludescher-replication.R a bit. Here’s where Graham defines the El Niño basin:

ludescher.basin <- function() {
lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
stopifnot(length(lats) == length(lons))
list(lats=lats,lons=lons)
}


These are lists of latitude and longitude coordinates: (5,11), (5,12), (5,13), etc. A coordinate like (5,11) means the little circle that’s 5 down and 11 across in the grid on the above map. So, that’s the leftmost point in Ludescher’s El Niño basin. By changing these lists, you can change the definition of the El Niño basin. You’ll also have to change these lists if you tackle Challenge 2.

There’s a lot more you can do… the sky’s the limit! In the weeks to come, I’ll show you lots of things we’ve actually done.

Annoying nuances

Here are two reasons our average link strengths could differ from Ludescher’s.

Last time I mentioned that Ludescher et al claim to normalize their time-delayed cross-covariances in a sort of complicated way. I explained why I don’t think they could have actually used this method. In ludescher-replication.R, Graham used the simpler normalization described last time: namely, dividing by

$\sqrt{\langle T_i(t)^2 \rangle - \langle T_i(t) \rangle^2} \; \sqrt{\langle T_j(t-\tau)^2 \rangle - \langle T_j(t-\tau) \rangle^2}$

$\sqrt{ \langle (T_i(t) - \langle T_i(t)\rangle)^2 \rangle} \; \sqrt{ \langle (T_j(t-\tau) - \langle T_j(t-\tau)\rangle)^2 \rangle}$

Since we don’t really know what Ludescher et al did, they might have done something else.

We might also have used a different ‘subsampling’ procedure. That’s a name for how we get from the temperature data on a 9 × 69 grid to temperatures on a 3 × 23 grid. While the original data files give temperatures named after grid points, each is really an area-averaged temperature for a 2.5° × 2.5° square. Is this square centered at the grid point, or is the square having that grid point as its north-west corner, or what? I don’t know.

This data is on a grid where the coordinates are the number of steps of 2.5 degrees, counting from 1. So, for latitude, 1 means the North Pole, 73 means the South Pole. For longitude, 1 means the prime meridian, 37 means 90° east, 73 means 180° east, 109 means 270°E or 90°W, and 144 means 2.5° west. It’s an annoying system, as far as I’m concerned.

In ludescher-replication.R we use this range of coordinates:

lat.range <- 24:50
lon.range <- 48:116


That’s why your file Pacific-1948-1980.txt` has locations starting with S024E48 and ending with S248E116. Maybe Ludescher et al used a slightly different range or subsampling procedure!

There are probably lots of other nuances I haven’t noticed. Can you think of some?

Entropy and Information in Biological Systems (Part 2)

4 July, 2014

John Harte, Marc Harper and I are running a workshop! Now you can apply here to attend:

Information and entropy in biological systems, National Institute for Mathematical and Biological Synthesis, Knoxville Tennesee, Wednesday-Friday, 8-10 April 2015.

Click the link, read the stuff and scroll down to “CLICK HERE” to apply. The deadline is 12 November 2014.

Financial support for travel, meals, and lodging is available for workshop attendees who need it. We will choose among the applicants and invite 10-15 of them.

The idea

Information theory and entropy methods are becoming powerful tools in biology, from the level of individual cells, to whole ecosystems, to experimental design, model-building, and the measurement of biodiversity. The aim of this investigative workshop is to synthesize different ways of applying these concepts to help systematize and unify work in biological systems. Early attempts at “grand syntheses” often misfired, but applications of information theory and entropy to specific highly focused topics in biology have been increasingly successful. In ecology, entropy maximization methods have proven successful in predicting the distribution and abundance of species. Entropy is also widely used as a measure of biodiversity. Work on the role of information in game theory has shed new light on evolution. As a population evolves, it can be seen as gaining information about its environment. The principle of maximum entropy production has emerged as a fascinating yet controversial approach to predicting the behavior of biological systems, from individual organisms to whole ecosystems. This investigative workshop will bring together top researchers from these diverse fields to share insights and methods and address some long-standing conceptual problems.

So, here are the goals of our workshop:

• To study the validity of the principle of Maximum Entropy Production (MEP), which states that biological systems – and indeed all open, non-equilibrium systems – act to produce entropy at the maximum rate.

• To familiarize all the participants with applications to ecology of the MaxEnt method: choosing the probabilistic hypothesis with the highest entropy subject to the constraints of our data. We will compare MaxEnt with competing approaches and examine whether MaxEnt provides a sufficient justification for the principle of MEP.

• To clarify relations between known characterizations of entropy, the use of entropy as a measure of biodiversity, and the use of MaxEnt methods in ecology.

• To develop the concept of evolutionary games as “learning” processes in which information is gained over time.

• To study the interplay between information theory and the thermodynamics of individual cells and organelles.

For more details, go here.

If you’ve got colleagues who might be interested in this, please let them know. You can download a PDF suitable for printing and putting on a bulletin board by clicking on this:

El Niño Project (Part 3)

1 July, 2014

In February, this paper claimed there’s a 75% chance the next El Niño will arrive by the end of 2014:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Very early warning of next El Niño, Proceedings of the National Academy of Sciences, February 2014. (Click title for free version, journal name for official version.)

Since it was published in a reputable journal, it created a big stir! Being able to predict an El Niño more than 6 months in advance would be a big deal. El Niños can cause billions of dollars of damage.

But that’s not the only reason we at the Azimuth Project want to analyze, criticize and improve this paper. Another reason is that it uses a climate network—and we like network theory.

Very roughly, the idea is this. Draw a big network of dots representing different places in the Pacific Ocean. For each pair of dots, compute a number saying how strongly correlated the temperatures are at those two places. The paper claims that when a El Niño is getting ready to happen, the average of these numbers is big. In other words, temperatures in the Pacific tend to go up and down in synch!

Whether this idea is right or wrong, it’s interesting—and it’s not very hard for programmers to dive in and study it.

Two Azimuth members have done just that: David Tanzer, a software developer who works for financial firms in New York, and Graham Jones, a self-employed programmer who also works on genomics and Bayesian statistics. These guys have really brought new life to the Azimuth Code Project in the last few weeks, and it’s exciting! It’s even gotten me to do some programming myself.

Soon I’ll start talking about the programs they’ve written, and how you can help. But today I’ll summarize the paper by Ludescher et al. Their methodology is also explained here:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Improved El Niño forecasting by cooperativity detection, Proceedings of the National Academy of Sciences, 30 May 2013.

The basic idea

The basic idea is to use a climate network. There are lots of variants on this idea, but here’s a simple one. Start with a bunch of dots representing different places on the Earth. For any pair of dots $i$ and $j,$ compute the cross-correlation of temperature histories at those two places. Call some function of this the ‘link strength’ for that pair of dots. Compute the average link strength… and get excited when this gets bigger than a certain value.

The papers by Ludescher et al use this strategy to predict El Niños. They build their climate network using correlations between daily temperature data for 14 grid points in the El Niño basin and 193 grid points outside this region, as shown here:

The red dots are the points in the El Niño basin.

Starting from this temperature data, they compute an ‘average link strength’ in a way I’ll describe later. When this number is bigger than a certain fixed value, they claim an El Niño is coming.

How do they decide if they’re right? How do we tell when an El Niño actually arrives? One way is to use the ‘Niño 3.4 index’. This the area-averaged sea surface temperature anomaly in the yellow region here:

Anomaly means the temperature minus its average over time: how much hotter than usual it is. When the Niño 3.4 index is over 0.5°C for at least 5 months, Ludescher et al say there’s an El Niño. (By the way, this is not the standard definition. But we will discuss that some other day.)

Here is what they get:

The blue peaks are El Niños: episodes where the Niño 3.4 index is over 0.5°C for at least 5 months.

The red line is their ‘average link strength’. Whenever this exceeds a certain threshold $\Theta = 2.82,$ and the Niño 3.4 index is not already over 0.5°C, they predict an El Niño will start in the following calendar year.

The green arrows show their successful predictions. The dashed arrows show their false alarms. A little letter n appears next to each El Niño that they failed to predict.

You’re probably wondering where the number $2.82$ came from. They get it from a learning algorithm that finds this threshold by optimizing the predictive power of their model. Chart A here shows the ‘learning phase’ of their calculation. In this phase, they adjusted the threshold $\Theta$ so their procedure would do a good job. Chart B shows the ‘testing phase’. Here they used the value of $\Theta$ chosen in the learning phase, and checked to see how good a job it did. I’ll let you read their paper for more details on how they chose $\Theta.$

But what about their prediction now? That’s the green arrow at far right here:

On 17 September 2013, the red line went above the threshold! So, their scheme predicts an El Niño sometime in 2014. The chart at right is a zoomed-in version that shows the red line in August, September, October and November of 2013.

The details

Now I mainly need to explain how they compute their ‘average link strength’.

Let $i$ stand for any point in this 9 × 23 grid:

For each day $t$ between June 1948 and November 2013, let $\tilde{T}_i(t)$ be the average surface air temperature at the point $i$ on day $t.$

Let $T_i(t)$ be $\tilde{T}_i(t)$ minus its climatological average. For example, if $t$ is June 1st 1970, we average the temperature at location $i$ over all June 1sts from 1948 to 2013, and subtract that from $\tilde{T}_i(t)$ to get $T_i(t).$

They call $T_i(t)$ the temperature anomaly.

(A subtlety here: when we are doing prediction we can’t know the future temperatures, so the climatological average is only the average over past days meeting the above criteria.)

For any function of time, denote its moving average over the last 365 days by:

$\displaystyle{ \langle f(t) \rangle = \frac{1}{365} \sum_{d = 0}^{364} f(t - d) }$

Let $i$ be a point in the El Niño basin, and $j$ be a point outside it. For any time lag $\tau$ between 0 and 200 days, define the time-delayed cross-covariance by:

$\langle T_i(t) T_j(t - \tau) \rangle - \langle T_i(t) \rangle \langle T_j(t - \tau) \rangle$

Note that this is a way of studying the linear correlation between the temperature anomaly at node $i$ and the temperature anomaly a time $\tau$ earlier at some node $j.$ So, it’s about how temperature anomalies inside the El Niño basin are correlated to temperature anomalies outside this basin at earlier times.

Ludescher et al then normalize this, defining the time-delayed cross-correlation $C_{i,j}^{t}(-\tau)$ to be the time-delayed cross-covariance divided by

$\sqrt{\Big{\langle} (T_i(t) - \langle T_i(t)\rangle)^2 \Big{\rangle}} \; \sqrt{\Big{\langle} (T_j(t-\tau) - \langle T_j(t-\tau)\rangle)^2 \Big{\rangle}}$

This is something like the standard deviation of $T_i(t)$ times the standard deviation of $T_j(t - \tau).$ Dividing by standard deviations is what people usually do to turn covariances into correlations. But there are some potential problems here, which I’ll discuss later.

They define $C_{i,j}^{t}(\tau)$ in a similar way, by taking

$\langle T_i(t - \tau) T_j(t) \rangle - \langle T_i(t - \tau) \rangle \langle T_j(t) \rangle$

and normalizing it. So, this is about how temperature anomalies outside the El Niño basin are correlated to temperature anomalies inside this basin at earlier times.

Next, for nodes $i$ and $j,$ and for each time point $t,$ they determine the maximum, the mean and the standard deviation of $|C_{i,j}^t(\tau)|,$ as $\tau$ ranges from -200 to 200 days.

They define the link strength $S_{i j}(t)$ as the difference between the maximum and the mean value, divided by the standard deviation.

Finally, they let $S(t)$ be the average link strength, calculated by averaging $S_{i j}(t)$ over all pairs $(i,j)$ where $i$ is a node in the El Niño basin and $j$ is a node outside.

They compute $S(t)$ for every 10th day between January 1950 and November 2013. When $S(t)$ goes over 2.82, and the Niño 3.4 index is not already over 0.5°C, they predict an El Niño in the next calendar year.

There’s more to say about their methods. We’d like you to help us check their work and improve it. Soon I want to show you Graham Jones’ software for replicating their calculations! But right now I just want to conclude by:

• mentioning a potential problem in the math, and

• telling you where to get the data used by Ludescher et al.

Mathematical nuances

Ludescher et al normalize the time-delayed cross-covariance in a somewhat odd way. They claim to divide it by

$\sqrt{\Big{\langle} (T_i(t) - \langle T_i(t)\rangle)^2 \Big{\rangle}} \; \sqrt{\Big{\langle} (T_j(t-\tau) - \langle T_j(t-\tau)\rangle)^2 \Big{\rangle}}$

This is a strange thing, since it has nested angle brackets. The angle brackets are defined as a running average over the 365 days, so this quantity involves data going back twice as long: 730 days. Furthermore, the ‘link strength’ involves the above expression where $\tau$ goes up to 200 days.

So, taking their definitions at face value, Ludescher et al could not actually compute their ‘link strength’ until 930 days after the surface temperature data first starts at the beginning of 1948. That would be late 1950. But their graph of the link strength starts at the beginning of 1950!

Perhaps they actually normalized the time-delayed cross-covariance by dividing it by this:

$\sqrt{\big{\langle} T_i(t)^2 \big{\rangle} - \big{\langle} T_i(t)\big{\rangle}^2} \; \sqrt{\big{\langle} T_j(t-\tau)^2 \big{\rangle} - \big{\langle} T_j(t-\tau)\big{\rangle}^2}$

This simpler expression avoids nested angle brackets, and it makes more sense conceptually. It is the standard deviation of $T_i(t)$ over the last 365 days, times of the standard deviation of $T_i(t-\tau)$ over the last 365 days.

As Nadja Kutz noted, the expression written by Ludescher et al does not equal this simpler expression, since:

$\Big{\langle} T_i(t) \; \langle T_i(t) \rangle \Big{\rangle} \neq \big{\langle} T_i(t) \big{\rangle} \; \big{\langle} T_i(t) \big{\rangle}$

The reason is that

$\begin{array}{ccl} \Big{\langle} T_i(t) \; \langle T_i(t) \rangle \Big{\rangle} &=& \displaystyle{ \frac{1}{365} \sum_{d = 0}^{364} T_i(t-d) \langle T_i(t-d) \rangle} \\ \\ &=& \displaystyle{ \frac{1}{365} \sum_{d = 0}^{364} \frac{1}{365} \sum_{D = 0}^{364} T_i(t-d) T_i(t-d-D)} \end{array}$

which is generically different from

$\Big{\langle} \langle T_i(t) \rangle \;\langle T_i(t) \rangle \Big{\rangle} =$

$\displaystyle{ \frac{1}{365} \sum_{D = 0}^{364} (\frac{1}{365} \sum_{d = 0}^{364} T_i(t-d-D))(\frac{1}{365} \sum_{d = 0}^{364} T_i(t-d-D) ) }$

since the terms in the latter expression contain products $T_i(t-364-364)T_i(t-364-364)$ that can’t appear in the former.

Moreover:

$\begin{array}{ccl} \Big{\langle} (T_i(t) - \langle T_i(t) \rangle)^2 \Big{\rangle} &=& \Big{\langle} T_i(t)^2 - 2 T_i(t) \langle T_i(t) \rangle + \langle T_i(t) \rangle^2 \Big{\rangle} \\ \\ &=& \langle T_i(t)^2 \rangle - 2 \big{\langle} T_i(t) \langle T_i(t) \rangle \big{\rangle} + \big{\langle} \langle T_i(t) \rangle^2 \big{\rangle} \end{array}$

But since $\big{\langle} T_i(t) \langle T_i(t) \rangle \big{\rangle} \neq \big{\langle} \langle T_i(t) \rangle \; \langle T_i(t) \rangle \big{\rangle},$ as was just shown, those terms do not cancel out in the above expression. In particular, this means that

$-2 \big{\langle} T_i(t) \langle T_i(t) \rangle \big{\rangle} + \big{\langle} \langle T_i(t) \rangle \langle T_i(t) \rangle \big{\rangle}$

contains terms $T_i(t-364-364)$ which do not appear in $\langle T_i(t)\rangle^2,$ hence

$\Big{\langle} (T_i(t) - \langle T_i(t) \rangle)^2 \Big{\rangle} \neq \langle T_i(t)^2\rangle - \langle T_i(t)\rangle^2$

So at least for the case of the standard deviation it is clear that those two definitions are not the same for a running mean. For the covariances this would still need to be shown.

Surface air temperatures

Remember that $\tilde{T}_i(t)$ is the average surface air temperature at the grid point $i$ on day $t.$ You can get these temperatures from here:

• Earth System Research Laboratory, NCEP Reanalysis Daily Averages Surface Level, or ftp site.

These sites will give you worldwide daily average temperatures on a 2.5° latitude × 2.5° longitude grid (144 × 73 grid points), from 1948 to now. Ihe website will help you get data from within a chosen rectangle in a grid, for a chosen time interval. Alternatively, you can use the ftp site to download temperatures worldwide one year at a time. Either way, you’ll get ‘NetCDF files’—a format we will discuss later, when we get into more details about programming!

Niño 3.4

Niño 3.4 is the area-averaged sea surface temperature anomaly in the region 5°S-5°N and 170°-120°W. You can get Niño 3.4 data here:

You can get Niño 3.4 data here:

Niño 3.4 data since 1870 calculated from the HadISST1, NOAA. Discussed in N. A. Rayner et al, Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century, J. Geophys. Res. 108 (2003), 4407.

You can also get Niño 3.4 data here:

Monthly Niño 3.4 index, Climate Prediction Center, National Weather Service.

The actual temperatures in Celsius are close to those at the other website, but the anomalies are rather different, because they’re computed in a way that takes global warming into account. See the website for details.

Niño 3.4 is just one of several official regions in the Pacific:

• Niño 1: 80°W-90°W and 5°S-10°S.

• Niño 2: 80°W-90°W and 0°S-5°S

• Niño 3: 90°W-150°W and 5°S-5°N.

• Niño 3.4: 120°W-170°W and 5°S-5°N.

• Niño 4: 160°E-150°W and 5°S-5°N.

For more details, read this:

• Kevin E. Trenberth, The definition of El Niño, Bulletin of the American Meteorological Society 78 (1997), 2771–2777.

Chemical Reaction Network Talks

26 June, 2014

A while ago I blogged about David Soloveichik’s talk at this workshop:

Programming with Chemical Reaction Networks: Mathematical Foundations, Banff International Research Station, 8-13 June 2014.

Now the slides for his talk are available:

• David Soloveichik, U.C. San Francisco, The computational power of chemical reaction networks.

And now I’d like to tell you about three more talks!

The first two are about ways one chemical reaction network can simulate another. This is important for a couple of reasons. First, in biology, a bunch of different chemical reactions can ‘accomplish the same task’—and we’d like to make this idea precise. That’s what Luca Cardelli spoke about. Second, people trying to do computation with chemistry are starting to simulate quite general reactions using DNA! That’s what Sheung Woo Shin spoke about.

Luca Cardelli

Luca Cardelli was at Oxford when I was visiting this spring, but unfortunately I didn’t meet him! He works for Microsoft on the interface of biology and computation. At Banff, he talked about ways one chemical reaction network can simulate another. His slides are here:

• Luca Cardelli, Morphisms of reaction networks.

He has a paper that gives a more detailed explanation of these ideas:

• Luca Cardelli, Morphisms of reaction networks that couple structure to function.

Here is my own disorganized explanation… with lots of informative but confusing digressions. A population protocol is a chemical reaction with only 2-in, 2-out reactions. For example, this paper presents a population protocol that does ‘approximate majority detection’:

• Dana Angluin, James Aspnes, and David Eisenstat, A simple population protocol for fast robust approximate majority, Distributed Computing 21 (2008), 87–102.

What’s the idea? We start with two kinds of molecules, say $x$’s and $y$’s, and we want to see which one is in the majority, so we run these chemical reactions:

$x + y \to x + b$

$x + y \to y + b$

$x + b \to 2x$

$y + b \to 2y$

See? All the reactions have 2 molecules going in and 2 going out. The $b$ molecules act as ‘undecided voters’ who become either an $x$ or a $y,$ depending on who they meet first.

If we start with about $n$ molecules, in $O(n \log n)$ time these reactions are very likely to convert all $x$’s and $y$’s to whatever kind of molecule was in the majority initially… at least if the gap in the number of $x$’s and $y$’s is big enough.

Here’s another population protocol that also does the job:

$x + y \to 2b$

$x + b \to 2x$

$y + b \to 2y$

And here’s a proof that one of these algorithms actually works—most of the time, when the initial difference in populations is big enough:

• Etienne Perron, Dinkar Vasudevan, and Milan Vojonvic, Using three states for binary consensus on complete graphs, Technical Report, MSR-TR-2008-114, Microsoft, September 2008.

If we use a discrete-time formalism to describe the dynamics, the proof seems to get harder. See the paper by Angluin, Aspnes, and Eisenstat for the only known proof!

Anyway, Luca Cardelli is interested in chemical reaction networks actually found in biology. This approximate majority algorithm is seen quite clearly in a certain biological system: a certain ‘epigenetic switch’. However, it is usually ‘obfuscated’ or ‘encrypted’: hidden in a bigger, more complicated chemical reaction network. For example, see:

• Luca Cardelli and Attila Csikász-Nagy, The cell cycle switch is approximate majority obfuscated, Scientific Reports 2 (2012).

This got him interested in developing a theory of morphisms between reaction networks, which could answer questions like: When can one CRN emulate another? But these questions turn out to be much easier if we use the rate equation than with the master equation. So, he asks: when can one CRN give the same rate equation as another?

He found a sufficient condition that’s ‘purely syntactic’: you can tell if it holds by looking at the reaction networks, regardless of the rate constants.

Here’s the idea. We say one network emulates another if for any rate constants of the second, we can find rate constants for the first that makes its rate equation have solutions exactly mimicking that of the second, but where several species in the first correspond to one in the second.

For this to make sense, we assume there is a map sending:

• species to species
• reactions to reactions

In a chemical reaction network homomorphism, the map on reactions is determined by the map on species in the obvious way. For example, if species $A$ is sent to $f(A)$ and species $B$ is sent to $f(B)$ then the reaction

$2A + B \to 3B$

is sent to the reaction

$2f(A) + f(B) \to 3 f(B)$

In this situation, to make the first network emulate the second, we need to set equal the initial concentrations of all species in the inverse image of a given species.

A reactant homomorphism from one chemical reaction network to another is more general: it sends species to species, and for any reaction in the first chemical reaction network with input

$A + B + C \cdots$

there’s a reaction in the second with input

$f(A) + f(B) + f(C) + \cdots$

(Reactant is another name for input.)

A stoichiomorphism is a kind of morphism that takes rate constants into account. See Cardelli’s paper for the definition.

The main theorem: given a stoichiomorphism from one chemical reaction network to another that’s also a reactant homomorphism, then the first emulates the second.

For a better explanation, read his paper! Here’s a cool picture from his paper showing a bunch of chemical reaction networks including the approximate majority network (labelled AM), many of which show up in biology, and morphisms between them:

Click to enlarge! These chemical reaction networks are drawn in a special style: as influence networks, consisting of ‘gates’ where process activates or deactivates another. Each gate is a chemical reaction network of a certain form, schematically like this:

$\mathrm{off} \leftrightarrow \mathrm{intermediate} \leftrightarrow \mathrm{on}$

where the forward reactions are catalyzed by one chemical and the reverse reactions are catalyzed by another. A gate is like a switch that can be turned on or off.

While listening to this talk, I thought the way in which one CRN emulates another in Cardelli’s formalism looks awfully similar to the way one dynamical system emulates another in Eugene Lerman’s formalism:

• Eugene Lerman, Networks of dynamical systems, Azimuth, 18 March 2014.

The following picture from Cardelli’s paper shows that one of his morphisms of reaction networks is like ‘covering map’. This reminds me a lot of what’s happening in Lerman’s work.

Again, click to enlarge!

Seung Woo Shin

Seung Woo Shin was actually Brendan Fong’s roommate at the National University of Singapore while Brendan was working with me on chemical reaction networks. Apparently they never talked about their work!

Shin spoke about some other concepts of ‘morphism’ between chemical reaction networks. These other concepts do not involve reaction rates, just which chemicals can turn into which. You can see his slides here:

• Seung Woo Shin, Verifying CRN implementations.

and read his thesis for more details:

• Seung Woo Shin, Compiling and verifying DNA-based chemical reaction network implementations, Masters thesis, Caltech, 2012.

Abstract: One goal of molecular programming and synthetic biology is to build chemical circuits that can control chemical processes at the molecular level. Remarkably, it has been shown that synthesized DNA molecules can be used to construct complex chemical circuits that operate without any enzyme or cellular component. However, designing DNA molecules at the individual nucleotide base level is often difficult and laborious, and thus chemical reaction networks (CRNs) have been proposed as a higher-level programming language. So far, several general-purpose schemes have been described for designing synthetic DNA molecules that simulate the behavior of arbitrary CRNs, and many more are being actively investigated.

Here, we solve two problems related to this topic. First, we present a general-purpose CRN-to-DNA compiler that can apply user-defined compilation schemes for translating formal CRNs to domain-level specifications for DNA molecules. In doing so, we develop a language in which such schemes can be concisely and precisely described. This compiler can greatly reduce the amount of tedious manual labor faced by researchers working in the field. Second, we present a general method for the formal verification of the correctness of such compilation. We first show that this problem reduces to testing a notion of behavioral equivalence between two CRNs, and then we construct a mathematical formalism in which that notion can be precisely defined. Finally, we provide algorithms for testing that notion. This verification process can be thought of as an equivalent of model checking in molecular computation, and we hope that the generality of our verification techniques will eventually allow us to apply them not only to DNA-based CRN implementations but to a wider class of molecular programs.

His thesis built on this earlier paper:

• David Soloveichik, Georg Seelig and Erik Winfree, DNA as a universal substrate for chemical kinetics, Proceedings of the National Academy of Sciences (2010).

I think this work is fascinating and deeply related to category theory, so I talked to Shin and Winfree about it, and this is what we came up with:

CRN equivalences: progress report.

This is one of several reports on progress people at the workshop made on various open problems.

David Anderson

Brendan Fong and I wrote about David Anderson’s work in Part 9 of the network theory series. It’s so impressive that I expected him to be older… older than me, I guess. He’s not!

In his tutorial, he gave an overview of chemical reaction networks with an emphasis on the deficiency zero theorem. Since many people were puzzled by the ‘deficiency’ concept, they asked lots of questions. But I’ve already explained that idea in Part 21. So, I’ll just mention a couple of cool theorems he told us about!

Theorem (Horn and Jackson). If a reaction network has a complex balanced equilibrium, then:

1. It has no equilibria that are not complex balanced.

2. The reaction network must be weakly reversible.

3. Every stochiometric compatibility class contains precisely one complex balanced equilibrium.

I should have known this, since this work is classic. But I don’t think I knew that the existence of one complex balanced equilibrium implied all this stuff!

He also mentioned this paper:

• Guy Shinar and Martin Feinberg, Structural sources of robustness in biochemical reaction networks, Science (2010).

which contains this amazing theorem:

Theorem (Shinar and Feinberg). Suppose there is a chemical reaction network such that:

1. its deficiency equals one;

2. it has a positive steady state;

3. it has two “non-terminal complexes” that differ only in one species $S.$ (“Non-terminal” is a concept that’s easier to explain with a picture of a reaction network).

Then the species $S$ is absolutely robust: with any initial conditions, the rate equation will approach an equilibrium where the concentration of $S$ approaches a specific fixed value $c,$ independent of the initial conditions!

However, things work very differently if we treat the system stochastically, using the master equation:

• David F. Anderson, German A. Enciso and Matthew D. Johnston, Stochastic analysis of biochemical reaction networks with absolute concentration robustness.

More

A lot more happened at this workshop! There was a huge amount of discussion of the leader election problem, which is about how to cook up chemical reactions that create a ‘leader’: a single molecule of some sort.

Leader election: the problem, and references.

Leader election: progress report.

As I explained before, David Soloveichik talked about various forms of digital computation with chemical reaction networks. David Doty talked about the very important flip side of the coin: analog computation.

• David Doty, Rate-independent computation by real-valued chemistry.

There were also great talks by Lulu Qian and Erik Winfree, which I won’t try to summarize. Qian does a lot of work in the lab making things actually happen, so if you’re a practical sort this is the talk to look at:

All in all, a very stimulating workshop. The diversity of things one can ask about chemical reaction networks is quite exciting!