I have been learning to make animations using R. This is an animation of the profile of the surface air temperature at the equator. So, the *x* axis here is the longitude, approximately from 120° E to 280° E. I pulled the data from the region that Graham Jones specified in his code on GitHub: it’s the equatorial line in the region that Ludescher *et al.* used:

For this animation I tried to show the 1997-1998 El Niño. Typically the Pacific is much cooler near South America, due to the upwelling of deep cold water:

(Click for more information.) That part of the Pacific gets even cooler during La Niña:

But it warms up during El Niños:

You can see that in the surface air temperature during the 1997-1998 El Niño, although by summer of 1998 things seem to be getting back to normal:

I want to practice making animations like this. I could make a much prettier and better-labelled animation that ran all the way from 1948 to today, but I wanted to think a little about what exactly is best to plot if we want to use it as an aid to understanding some of this El Niño business.


There’s something about logic that’s both fascinated and terrified me ever since I was a kid: it’s how we can’t fully pin down infinite structures, like the real or complex number systems, using a language with finitely many symbols and a theory with finitely many axioms.

It’s terrifying that we don’t fully know what we’re talking about when we’re talking about numbers! But it’s fascinating that we can understand a lot about the limitations.

There are many different things to say about this, depending on what features of these number systems we want to describe, and what kind of logic we want to use.

Maybe I should start with the natural numbers, since that story is more famous. This can also serve as a lightning review of some basic concepts which I’ll pretend you already vaguely know: first-order versus second-order logic, proofs versus models, and so on. If you don’t know these, you can either fake it or read some of the many links in this article!

When Peano originally described the natural numbers he did so using axioms phrased in second-order logic. In first-order logic we can quantify over variables: for example, we can say

∀x P(x) ⇒ P(y)

which means that if the predicate P holds for all x, it holds for any variable y. In second-order logic we can also quantify over predicates: for example, we can say

x = y ⟺ ∀P (P(x) ⟺ P(y))

which says that x = y if and only if for every predicate P, P(x) is true precisely when P(y) is true. Leibniz used this principle, called the identity of indiscernibles, to *define* equality… and this is a nice example of the greater power of second-order logic. In first-order logic we typically include equality as part of the language and add axioms describing its properties, like

∀x (x = x)

In second-order logic we can *define* equality and *prove* these properties starting from the logical properties we already have.

Anyway, in his axioms for the natural numbers, Peano used second-order logic to formulate the principle of mathematical induction in this sort of way:

∀P [ ( P(0) ∧ ∀n (P(n) ⇒ P(n+1)) ) ⇒ ∀n P(n) ]

This says that if you’ve got any predicate P that’s true for 0, and is true for n+1 whenever it’s true for n, then it’s true for all natural numbers.

In 1888, Dedekind showed that Peano’s original axioms for the natural numbers are **categorical**, meaning that all their models are isomorphic.

The concept of ‘model’ involves set theory. In a model you pick a set S for your variables to range over, pick a subset of S for each predicate—namely the subset where that predicate is true—and so on, in such a way that all the axioms in the theory are satisfied. If two models are isomorphic, they’re the same for all practical purposes.

So, in simple rough terms, a categorical theory is one that gives a *full* description of the mathematical structure it’s talking about.

This makes Dedekind’s result sound like great news. It sounds like Peano’s original second-order axioms for arithmetic completely describe the natural numbers.

However, there’s an important wrinkle. There are many inherently undetermined things about set theory! So in fact, a categorical theory only gives a full description of the mathematical structure it’s talking about *relative to a choice of what sets are like*.

So, Dedekind’s result just shoves everything mysterious and undetermined about the natural numbers under the carpet: they become mysterious and undetermined things about set theory. This became clear much later, thanks to Gödel and others. And in the process, it became clear that second-order logic is a bit problematic compared to first-order logic.

You see, first-order logic has a set of deduction rules that are:

• **sound**: Every provable sentence holds in every model.

• **semantically complete:** Every sentence that holds in every model is provable.

• **effective:** There is an algorithm that can correctly decide whether any given sequence of symbols is a proof.

Second-order logic does not! It’s ‘too powerful’ to also have all three of these nice properties.

So, these days people often work with a first-order version of Peano’s axioms for arithmetic. Instead of writing down a single axiom for mathematical induction:

∀P [ ( P(0) ∧ ∀n (P(n) ⇒ P(n+1)) ) ⇒ ∀n P(n) ]

we write down an axiom schema—an infinite list of axioms—with one axiom like this:

( φ(0) ∧ ∀n (φ(n) ⇒ φ(n+1)) ) ⇒ ∀n φ(n)

for each formula φ that we can actually write down using the language of arithmetic.

This first-order version of Peano arithmetic is *not* categorical: it has lots of nonisomorphic models. People often pretend there’s one ‘best’ model: they call it the ‘standard’ natural numbers, and call all the others ‘nonstandard’. But there’s something a bit fishy about this.

Indeed, Gödel’s first incompleteness theorem says there are many statements about natural numbers that can neither be proved nor disproved starting from Peano’s axioms. It follows that for any such statement we can find a model of the Peano axioms in which that statement holds, and also a model in which it does not.

Furthermore, this remains true even if we add any list of extra axioms to Peano arithmetic, as long as there’s some algorithm that can list all these axioms.

So, I’d prefer to say there are many different ‘versions’ of the natural numbers, just as there are many different groups.

We can study these different versions, and it’s a fascinating subject:

• Wikipedia, Nonstandard models of arithmetic.

However, I want to talk about the situation for other number systems!

The situation is better for the real numbers—at least if we are willing to think about them in a ‘purely algebraic’ way, leaving most analysis behind.

To do this, we can use the theory of a ‘real closed field’. This is a list of axioms, formulated in first-order logic, which describe how +, ×, 0 and 1 work for the real numbers. You can think of these axioms as consisting of three parts:

• the **field** axioms: the usual algebraic identities involving +, ×, 0 and 1, together with laws saying that everything has an additive inverse and everything except 0 has a multiplicative inverse.

• the **formally real field** axiom, saying that −1 is not the square of anything. This implies that we can equip the field with a concept of ≤ that makes it into an ordered field—but not necessarily in a unique way.

• the **real closed field** axioms, which say that also for any number x, either x or −x has a square root, and every polynomial of odd degree has a root. Among other things this implies our field can be made into an ordered field in a unique way. To do this, we say x ≤ y if and only if y − x has a square root.

Tarski showed this theory is **complete**: any first-order sentence involving only the operations +, ×, 0 and 1 and the relation ≤ can either be proved or disproved starting from the above axioms.

Nonetheless, the theory of real closed fields is not categorical: besides the real numbers, there are many other models! These models are all **elementarily equivalent**: any sentence involving just +, × and first-order logic that holds in one model holds in all the rest. But these models are not all isomorphic: we can’t get a bijection between them that preserves + and ×.

Indeed, only finite-sized mathematical structures can be ‘nailed down’ up to isomorphism by theories in first-order logic. You see, the **Löwenheim–Skolem theorem** says that if a first-order theory in a countable language has an infinite model, it has at least one model of each infinite cardinality. So, if we’re trying to use this kind of theory to describe an infinitely big mathematical structure, the most we can hope for is that *after we specify its cardinality*, the axioms completely determine it.

However, the real closed field axioms aren’t even this good. For starters, they have infinitely many nonisomorphic *countable* models. Here are a few:

• the **algebraic real numbers**: these are the real numbers that obey polynomial equations with integer coefficients.

• the **computable real numbers**: these are the real numbers that can be computed to arbitrary precision by a computer program.

• the **arithmetical real numbers**: these are the numbers definable in the language of arithmetic. More precisely, a real number x is **arithmetical** if there is a formula φ in the language of first-order Peano arithmetic, with two free variables, such that φ(p, q) holds precisely when the rational number p/q is less than x.

Every computable real number is arithmetical, but not vice versa: just because you can define a real number in the above way does not mean you can actually compute it to arbitrary precision!
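To make ‘computable to arbitrary precision’ concrete, here’s a minimal Python sketch (my illustration, not part of the original discussion) of such a program for √2, which is algebraic and hence also computable: given n, it returns a rational within 10⁻ⁿ of √2.

```python
from fractions import Fraction

def sqrt2(n):
    """Return a rational within 10**-n of the square root of 2,
    using Newton's method in exact rational arithmetic."""
    tol = Fraction(1, 10 ** n)
    x = Fraction(3, 2)  # initial guess, already bigger than sqrt(2)
    # Since x > 1 and sqrt(2) > 1, we have |x - sqrt(2)| <= |x*x - 2| / 2,
    # so this loop stops once x is within tol of sqrt(2).
    while abs(x * x - 2) > 2 * tol:
        x = (x + 2 / x) / 2
    return x

approx = sqrt2(30)
print(abs(float(approx) ** 2 - 2) < 1e-15)  # True
```

The point is that a single finite program answers *every* precision request; a merely arithmetical number only comes with a formula defining it, not a procedure like this.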

And indeed, there are other even bigger countable real closed fields, consisting of real numbers that are definable using more powerful methods, like second-order Peano arithmetic.

We can also get countable real closed fields using tricks like this: take the algebraic real numbers and throw in the number π, along with just enough other numbers to get a real closed field again. Or, we could throw in both π and e. This probably gives a bigger real closed field—but nobody knows, because for all we know, e could equal π plus some rational number! Everyone *believes* this is false, but nobody has proved it.

There are also lots of nonisomorphic *uncountable* real closed fields, including ones that include the usual real numbers.

For example, we can take the real numbers and throw in an element ∞ that is bigger than 1, 2, 3, and so on—and then do what it takes to get another real closed field. This involves throwing in elements like

∞ − 1, ∞², 1/∞, √∞

and so on. So, we get lots of infinities and infinitesimals.

It gets a bit confusing here, trying to figure out what equals what. But there’s another real closed field containing an infinite element that seems easier to manage. It’s called the field of **real Puiseux series**. These are series of the form

∑_{i=k}^∞ a_i x^{i/n}

where k is any integer, perhaps negative, n is any positive integer, and the coefficients a_i are real.

What’s x? It’s just a formal variable. But the real Puiseux series form a real closed field, and x acts like it’s positive, but smaller than any positive real number.

With considerably more work, we can make up a real closed field that:

• contains the real numbers,

• contains an element ∞ bigger than 1, 2, 3, and so on, and

• obeys the **transfer principle**, which says that a first-order statement phrased in the usual language of set theory holds for the real numbers if and only if it holds for this other number system.

Any real closed field with these properties is called a system of **hyperreal numbers**. In the 1960s, the logician Abraham Robinson used them to make Leibniz’s old idea of infinitesimals in calculus fully rigorous. The resulting theory is called **nonstandard analysis**.

So, I hope you see there’s an exciting—or perhaps appalling—diversity of real closed fields. But don’t forget: they’re all elementarily equivalent. If a sentence involving just +, × and first-order logic holds in any one of these real closed fields, it holds in all of them!

You might wonder what second-order logic has to say about this.

Here the situation looks very different. In second-order logic we can do analysis, because we can quantify over *predicates*, which allows us to talk about subsets of real numbers. And in second-order logic we can write down a theory of real numbers that’s categorical! It’s called the theory of a **Dedekind-complete ordered field**. Again, we can group the axioms in three bunches:

• the **field** axioms: the usual algebraic identities involving +, ×, 0 and 1, together with laws saying that everything has an additive inverse and everything except 0 has a multiplicative inverse.

• the **ordered field** axiom, saying there is a total ordering ≤ such that x ≤ y implies x + z ≤ y + z, and 0 ≤ x and 0 ≤ y implies 0 ≤ xy.

• the **Dedekind completeness** axiom, which says that every nonempty subset with an upper bound has a least upper bound. But instead of talking about subsets, we talk about the predicates that hold on those subsets, so we say “for all predicates P such that…”

Because they’re categorical, people often use these axioms to define the real numbers. But because they’re second-order, the problem of many nonisomorphic models has really just been swept under the rug. If we use second-order logic, we won’t have a concept of ‘proof’ that’s sound, semantically complete and effective. And if we use first-order axioms for set theory to explicitly talk about subsets instead of predicates, then our set theory will have many models! *Each model* will have a version of the real numbers in it that’s unique up to isomorphism… but the versions in different models will be really different.

In fact, there’s a precise sense in which the ‘standard real numbers’ in one model of set theory can be the ‘hyperreals’ in another. This was first shown by Abraham Robinson.

I mentioned that when we’re studying an infinite mathematical structure using first-order logic, the best we can hope for is to have one model *of each size* (up to isomorphism). The real numbers are far from being this nice… but the complex numbers come much closer!

More precisely, say κ is some cardinal. A first-order theory describing structure on a single set is called **κ-categorical** if it has a unique model of cardinality κ. And in 1965, a logician named Michael Morley showed that if a list of axioms is κ-categorical for *some* uncountable κ, it’s κ-categorical for *every* uncountable κ. I haven’t worked my way through the proof, which seems to be full of interesting ideas. But such theories are called **uncountably categorical**.

A great example is the ‘purely algebraic’ theory of the complex numbers. By this I mean we only write down axioms involving +, ×, 0 and 1. We don’t include anything about ≤ this time, nor anything about complex conjugation. You see, if we start talking about complex conjugation we can pick out the real numbers inside the complex numbers, and then we’re more or less back to the story we had for real numbers.

This theory is called the theory of an **algebraically closed field of characteristic zero**. Yet again, the axioms come in three bunches:

• the **field** axioms.

• the **characteristic zero** axioms: these are an infinite list of axioms saying that 1 + 1 ≠ 0, 1 + 1 + 1 ≠ 0, and so on.

• the **algebraically closed** axioms: these say that every non-constant polynomial has a root.

Pretty much any mathematician worth their salt knows that the complex numbers are a model of these axioms, whose cardinality is that of the continuum. There are lots of different countable models: the algebraic complex numbers, the computable complex numbers, and so on. But because the above theory is uncountably categorical, there is *exactly one* algebraically closed field of characteristic zero of *each* uncountable cardinality… up to isomorphism.

This implies some interesting things.

For example, we can take the complex numbers, throw in an extra element, and let it freely generate a bigger algebraically closed field. It’s ‘bigger’ in the sense that it contains the complex numbers as a proper subset, indeed a subfield. But since it has the same cardinality as the complex numbers, it’s *isomorphic* to the complex numbers!

And then, because this ‘bigger’ field is isomorphic to the complex numbers, we can turn this argument around. We can take the complex numbers, remove a lot of carefully chosen elements, and get a subfield that’s isomorphic to the complex numbers.

Or, if we like, we can take the complex numbers, adjoin a *really huge* set of extra elements, and let them freely generate an algebraically closed field of characteristic zero. The cardinality of this field can be as big as we want. It will be determined up to isomorphism by its cardinality.

One piece of good news is that thanks to a result of Tarski, the theory of an algebraically closed field of characteristic zero is complete, and thus all its models are elementarily equivalent. In other words, all the same first-order sentences written in the language of +, ×, 0 and 1 hold in every model.

But here’s a piece of *strange* news.

As I already mentioned, the theory of a real closed field is *not* uncountably categorical. This implies something really weird. Besides the ‘usual’ real numbers ℝ we can choose another real closed field ℝ′ not isomorphic to ℝ but with the same cardinality. We can build the complex numbers ℂ using pairs of real numbers. We can use the same trick to build a field ℂ′ using pairs of guys in ℝ′. But it’s easy to check that this funny field ℂ′ is algebraically closed and of characteristic zero. Since it has the same cardinality as ℂ, it must be isomorphic to ℂ.
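Concretely, ‘building a field using pairs’ here means copying the usual rules for complex arithmetic, which make sense over any real closed field:

```latex
(a,b) + (c,d) = (a + c,\; b + d), \qquad
(a,b) \cdot (c,d) = (ac - bd,\; ad + bc)
```

With these rules the pair (0, 1) is a square root of −1, and the standard proof that this construction gives an algebraically closed field of characteristic zero only uses the real closed field axioms, which hold in the alternative model just as well as in the usual real numbers.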

In short, different ‘versions’ of the real numbers can give rise to the *same* version of the complex numbers!

So, I hope you see that the logical foundations of the real and complex number systems are quite slippery… yet with work, we can understand a lot about this slipperiness.

Besides the references I’ve given, I just want to mention two more. First, here’s a free introductory calculus textbook based on nonstandard analysis:

• H. Jerome Keisler, *Elementary Calculus: an Infinitesimal Approach*, available as a website or in PDF.

And here’s an expository paper that digs deeper into uncountably categorical theories:

• Nick Ramsey, Morley’s categoricity theorem.


The members of the Azimuth Project have been working on both predicting and understanding the El Niño phenomenon, along with writing expository articles. So far we’ve mostly talked about the physics and data of the El Niño, along with looking at one method of actually trying to predict El Niño events. Since there’s going to be more data exploration using methods more typical of machine learning, it’s a good time to briefly describe the mindset and highlight some of the differences between different kinds of predictive models. Here we’ll concentrate on the concepts rather than the fine details and particular techniques.

We also stress that there’s no fundamental distinction between **machine learning (ML)** and **statistical modelling and inference**. There are certainly differences in culture, background and terminology, but in terms of the actual algorithms and mathematics used there’s a great deal of commonality. Throughout the rest of the article we’ll talk about ‘machine learning models’, but we could equally have used ‘statistical models’.

For our purposes here, a **model** is any object which provides a systematic procedure for taking some input data and producing a prediction of some output. There’s a spectrum of models, ranging from physically based models at one end to purely data driven models at the other. As a very simple example, suppose you commute by car from your place of work to your home and you want to leave work in order to arrive home at 6:30 pm. You can tackle this by building a model which takes as input the day of the week and gives you back a time to leave.

• There’s the data driven approach, where you try various leaving times on various days and record whether or not you get home by 6:30 pm. You might find that the traffic is lighter on weekend days so you can leave at 6:10 pm, while on weekdays you have to leave at 5:45 pm, except on Wednesdays when you have to leave at 5:30 pm. Since you’ve just crunched the data you have no idea why this works, but it’s a very reliable rule when you use it to predict when you need to leave.

• There’s the physical model approach, where you attempt to infer how many people are doing what on any given day and then figure out what that implies for the traffic levels and thus what time you need to leave. In this case you find out that there’s a mid-week sports game on Wednesday evenings which leads to even higher traffic. This not only predicts that you’ve got to leave at 5:30 pm on Wednesdays but also lets you understand why. (Of course this is just an illustrative example; in climate modelling a physical model would be based upon actual physical laws such as conservation of energy, conservation of momentum, Boyle’s law, etc.)
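As a toy illustration of the data driven approach, here’s a Python sketch (all the observations are invented for this example) that ‘learns’ a per-day leaving-time rule purely from recorded attempts, with no idea of *why* the rule works:

```python
# Toy 'data driven' model of the commuting example: for each day of the
# week, record whether a tried departure time got us home by 6:30 pm,
# then predict the latest departure time that ever succeeded.
observations = [
    # (day, departure in minutes after noon, arrived by 6:30 pm?)
    ("Mon", 345, True), ("Mon", 360, False),   # 5:45 pm works, 6:00 pm doesn't
    ("Wed", 330, True), ("Wed", 345, False),   # Wednesdays need 5:30 pm
    ("Sat", 370, True), ("Sat", 380, False),   # weekends are lighter: 6:10 pm
]

def fit(obs):
    """Latest successful departure time seen for each day."""
    model = {}
    for day, minutes, ok in obs:
        if ok:
            model[day] = max(model.get(day, 0), minutes)
    return model

model = fit(observations)
print(model["Wed"])  # 330, i.e. 5:30 pm
```

The physical model approach would instead start from facts about traffic (the Wednesday sports game) and *derive* these numbers.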

There are trade-offs between the two types of approach. Data driven modelling is a relatively simple process. In contrast, by proceeding from first principles you’ve got a more detailed framework which is equally predictive but at the cost of having to investigate a lot of complicated underlying effects. Physical models have one interesting advantage: nothing in the data driven model prevents it violating physical laws (e.g., not conserving energy, etc) whereas a physically based model obeys the physical laws by design. This is seldom a problem in practice, but worth keeping in mind.

The situation with data driven techniques is analogous to one of those American medication adverts: there’s the big message about how “using a data driven technique can change your life for the better” while the voiceover gabbles out all sorts of small print. The remainder of this post will describe some of the basic principles in that small print.

There’s a popular misconception that machine learning works well when you simply collect some data and throw it into a machine learning algorithm. In practice that kind of approach often yields a model that is quite poor. Almost all successful machine learning applications are preceded by some form of data preprocessing. Sometimes this is simply rescaling so that different variables have similar magnitudes, are zero centred, etc.

However, there are often steps that are more involved. For example, many machine learning techniques have what are called ‘kernel variants’ which involve (in a way whose details don’t matter here) using a nonlinear mapping from the original data to a new space which is more amenable to the core algorithm. There are various kernels with the right mathematical properties, and frequently the choice of a good kernel is made either by experimentation or knowledge of the physical principles. Here’s an example (from Wikipedia’s entry on the support vector machine) of how a good choice of kernel can convert a dataset that is not linearly separable into one that is:
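The same point can be made without a full SVM. Below is a small Python sketch (with invented data, not Wikipedia’s figure) where two classes arranged in concentric rings can’t be separated by any line in the plane, but the nonlinear map (x, y) ↦ (x, y, x² + y²)—the kind of remapping a polynomial-style kernel performs implicitly—makes a single threshold suffice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes that are not linearly separable in the plane:
# class 0 near the origin, class 1 on a ring around it.
r0 = rng.uniform(0.0, 1.0, 100)
r1 = rng.uniform(2.0, 3.0, 100)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([r0, r1])
x, y = r * np.cos(theta), r * np.sin(theta)
labels = np.array([0] * 100 + [1] * 100)

# In the mapped space, the third coordinate alone separates the classes:
z = x ** 2 + y ** 2
threshold = 2.0  # any value between 1 and 4 works for this data
predicted = (z > threshold).astype(int)
print((predicted == labels).all())  # True
```

No line in the original (x, y) plane can achieve this, which is exactly why the remapping earns its keep.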

An extreme example of preprocessing is explicitly extracting **features** from the data. In ML jargon, a feature “boils down” some observed data into a directly useful value. For example, in the work by Ludescher *et al.* that we’ve been looking at, they don’t feed all the daily time series values into their classifier but take the correlation between different points over a year as the basic features to consider. Since the individual days’ temperatures are incredibly noisy and there are *so* many of them, extracting features from them gives more useful input data. While these extraction functions could theoretically be learned by the ML algorithm, this is quite a complicated function to learn. By explicitly choosing to represent the data using this feature, the amount the algorithm has to discover is reduced and hence the likelihood of it finding an excellent model is dramatically increased.
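As a sketch of feature extraction in this spirit (using made-up series, not Ludescher *et al.*’s actual construction, which involves many site pairs and time delays), here is a Python fragment that boils a year of noisy daily values at two sites down to one correlation number:

```python
import numpy as np

def correlation_feature(series_a, series_b):
    """Pearson correlation between two equal-length daily series:
    a single number summarising a year of noisy values."""
    a = series_a - series_a.mean()
    b = series_b - series_b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

rng = np.random.default_rng(1)
days = np.arange(365)
common = np.sin(2 * np.pi * days / 365)           # shared underlying signal
site1 = common + 0.3 * rng.standard_normal(365)   # plus independent noise
site2 = common + 0.3 * rng.standard_normal(365)
print(correlation_feature(site1, site2))          # high despite the daily noise
```

Two 365-value series have been reduced to one number that reflects the shared signal rather than the day-to-day noise.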

Some of the problems that we describe below would vanish if we had unlimited amounts of data to use for model development. However, in real cases we often have a strictly limited amount of data we can obtain. Consequently we need methodologies to address the issues that arise when data is limited.

The most common way to work with collected data is to split it into a **training set** and a **test set**. The training set is used in the process of determining the best model parameters, while the test set—which is not used in *any* way in determining those model parameters—is then used to see how effective the model is likely to be on new, unseen data. (The test and training sets need not be equally sized. There are some fitting techniques which need to further subdivide the training set, so that having more training than test data works out best.) This division of data acts to further reduce the effective amount of data used in determining the model parameters.
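A minimal Python sketch of such a split (the 75/25 ratio here is just an arbitrary choice):

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle and split, so the test set stays untouched during fitting."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]  # (train, test)

train, test = train_test_split(range(100))
print(len(train), len(test))  # 75 25
```

Shuffling before splitting matters: if the data is ordered (say, chronologically), taking the last quarter as the test set can give a systematically different distribution from the training set.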

After we’ve made this split we have to be careful how much of the test data we scrutinise in any detail, since once it has been investigated it can’t meaningfully be used for testing again, although it can still be used for future training. (Examining the test data is often informally known as **burning data**.) That only applies to detailed inspection however; one common way to develop a model is to look at some training data and then **train the model** (also known as **fitting the model**) on that training data. It can then be evaluated on the test data to see how well it does. It’s also then okay to purely mechanically train the model on the test data and evaluate it on the training data to see how “stable” the performance is. (If you get dramatically different scores then your model is probably flaky!) However, once we start to look at precisely *why* the model failed on the test data—in order to change the form of the model—the test data has now become training data and can’t be used as test data for future variants of that model. (Remember, the real goal is to accurately predict the outputs for *new, unseen* inputs!)

Suppose we’re modelling a system which has a *true* probability distribution P. We can’t directly observe this, but we have some samples S obtained from observation of the system, and hence coming from P. Clearly there are problems if we generate this sample in a way that biases the area of the distribution we sample from: it wouldn’t be a good idea to get training data featuring heights in the American population by only handing out surveys in the locker rooms of basketball facilities. But if we take care to avoid sampling bias as much as possible, then we can make various kinds of estimates of the distribution that we think S comes from.

Let’s consider the estimate P̂ implied for P by some particular technique. It would be nice if P̂ = P, wouldn’t it? And indeed many good estimators have the property that as the size of S tends to infinity, P̂ will tend to P. However, for finite sizes of S, and especially for *small* sizes, P̂ may have some spurious detail that’s not present in P.

As a simple illustration of this, my computer has a pseudorandom number generator which generates essentially uniformly distributed random numbers between 0 and 32767. I just asked for 8 numbers and got

2928, 6552, 23979, 1672, 23440, 28451, 3937, 18910.

Note that we’ve got one subset of 4 values (2928, 6552, 1672, 3937) within the interval of length 5012 between 1540 and 6552, and another subset of 3 values (23440, 23979 and 28451) in the interval of length 5012 between 23440 and 28452. For this uniform distribution the *expectation* of the number of values falling within a given range of that size is about 1.2. Readers will be familiar with how a random quantity in a *small* sample shows a large amount of variation around its expectation that only reduces as the sample size increases, so this isn’t a surprise. However, it does highlight that even *completely unbiased* sampling from the *true distribution* will typically give rise to extra ‘structure’ within the distribution implied by the samples.
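We can check the quoted expectation, and watch the same ‘spurious clustering’ happen over many draws, with a short Python simulation (this re-creates the setup; it’s not the generator used above):

```python
import random

N, HI, WIN = 8, 32767, 5012
expected = N * WIN / (HI + 1)
print(round(expected, 2))  # 1.22 values per window of length 5012, on average

# Yet across many draws, some window of that length often captures
# 4 or more of the 8 samples, purely by chance:
rng = random.Random(0)
trials, clustered = 1000, 0
for _ in range(trials):
    xs = sorted(rng.randint(0, HI) for _ in range(N))
    # largest number of samples in any window of length WIN
    best = max(sum(1 for y in xs if x <= y <= x + WIN) for x in xs)
    if best >= 4:
        clustered += 1
print(clustered / trials)  # a substantial fraction of the draws
```

So the clumps in the 8 numbers above are exactly what unbiased uniform sampling looks like at small sample sizes.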

For example, here are the results from one way of estimating the probability density from the samples:

The green line is the true density while the red curve shows the probability density obtained from the samples, with clearly spurious extra structure.

Almost all modelling techniques, while not necessarily estimating an explicit probability distribution from the training samples, can be seen as building functions that are related to that probability distribution.

For example, a ‘thresholding classifier’ for dividing input into two output classes will place the threshold at the optimal point for the distribution implied by the samples. As a consequence, one important aim in building machine learning models is to estimate the features that are present in the true probability distribution while not learning such fine details that they are likely to be spurious features due to the small sample size. If you think about this, it’s a bit counter-intuitive: you *deliberately don’t want to perfectly reflect every single pattern in the training data*. Indeed, specialising a model too closely to the training data is given the name **over-fitting**.
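Here’s a small Python sketch of over-fitting on an invented toy problem: a degree-7 polynomial has enough freedom to thread through all 8 noisy training points, while a straight line can only capture the broad trend—and it’s the line that does better on unseen inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: 2 * x + 1                  # the underlying relationship
x_train = np.linspace(0, 1, 8)
y_train = true_fn(x_train) + 0.1 * rng.standard_normal(8)

flexible = np.polyfit(x_train, y_train, 7)     # interpolates the noise exactly
simple = np.polyfit(x_train, y_train, 1)       # captures only the trend

x_new = np.linspace(0, 1, 101)                 # unseen inputs
mse = lambda c: np.mean((np.polyval(c, x_new) - true_fn(x_new)) ** 2)
print(mse(flexible), mse(simple))              # the flexible fit generalises worse
```

The flexible model’s training error is essentially zero, which is precisely the warning sign: it has learned the sample’s spurious detail rather than the true distribution’s structure.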

This brings us to **generalization**. Strictly speaking generalization is the ability of a model to work well upon unseen instances of the problem (which may be difficult for a variety of reasons). In practice however one tries hard to get representative training data so that the main issue in generalization is in preventing overfitting, and the main way to do that is—as discussed above—to split the data into a set for training and a set *only* used for testing.

One factor that’s often related to generalization is **regularization**, which is the general term for adding constraints to the model to prevent it being too flexible. One particularly useful kind of regularization is **sparsity**. Sparsity refers to the degree to which a model has empty elements, typically represented as 0 coefficients. It’s often possible to incorporate a *prior* into the modelling procedure which will encourage the model to be sparse. (Recall that in **Bayesian inference** the **prior** represents our initial ideas of how likely various different parameter values are.) There are some cases where we have various detailed priors about sparsity for problem specific reasons. However the more common case is having a ‘general modelling’ belief, based upon experience in doing modelling, that sparser models have a better generalization performance.

As an example of using sparsity promoting priors, we can look at linear regression. For standard regression with N examples of outputs y_i against d-dimensional vectors x_i, we’re considering the total error

∑_{i=1}^N (y_i − β · x_i)²

while with the prior we’ve got

∑_{i=1}^N (y_i − β · x_i)² + λ ∑_{j=1}^d |β_j|

where the β_j are the coefficients to be fitted and λ is the prior weight. We can see how the prior weight affects the sparsity of the β_j:

On the x-axis is λ while the y-axis is the coefficient value. Each line represents the value of one particular coefficient β_j as λ increases. You can see that for very small λ – corresponding to a very weak prior – all the weights are non-zero, but as it increases – corresponding to the prior becoming stronger – more and more of them have a value of 0.
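In the special case of an orthonormal design this L1-penalised regression (the ‘lasso’) has a closed-form solution, which makes the zeroing-out behaviour easy to see in a few lines of Python (the coefficient values below are invented):

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution for an orthonormal design: each least-squares
    coefficient b_j is shrunk towards 0, and clipped to 0 once the
    penalty weight lam is large enough."""
    return np.sign(b) * np.maximum(np.abs(b) - lam / 2, 0.0)

b = np.array([3.0, -1.5, 0.8, 0.2])   # least-squares coefficients
for lam in [0.0, 0.5, 2.0, 4.0]:
    beta = soft_threshold(b, lam)
    print(lam, int((beta != 0).sum()))  # surviving non-zero coefficients
```

This prints 4, 3, 2 and then 1 surviving coefficients: exactly the pattern in the plot above, with the smallest coefficients hitting zero first as the prior strengthens.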

There are a couple of other reasons for wanting sparse models. The obvious one is speed of model evaluation, although this is much less significant with modern computing power. A less obvious reason is that one can often only *effectively utilise* a sparse model, e.g., if you’re attempting to see how the input factors should be physically modified in order to affect the real system in a particular way. In this case we might want a good sparse model rather than an excellent dense model.

While there are some situations where a model is sought purely to develop knowledge of the universe, in many cases we are interested in models in order to direct actions. For example, having forewarning of El Niño events would enable all sorts of mitigation actions. However, these actions are costly so they shouldn’t be undertaken when there *isn’t* an upcoming El Niño. When presented with an unseen input the model can either match the actual output (i.e., be right) or differ from the actual output (i.e., be wrong). While it’s impossible to know in advance if a single output will be right or wrong – if we could tell that we’d be better off using *that* in our model – from the training data it’s generally possible to estimate the fractions of predictions that will be right and will be wrong in a large number of uses. So we want to link these probabilities with the effects of actions taken in response to model predictions.

We can do this using a **utility function** and a **loss function**. The utility maps each possible output to a numerical value proportional to the benefit from taking actions when that output was correctly anticipated. The loss maps outputs to a number proportional to the loss from the actions when the output was incorrectly predicted by the model. (There is evidence that human beings often have inconsistent utility and loss functions, but that’s a story for another day…)

There are three common ways the utility and loss functions are used:

• Maximising the expected value of the utility (for the fraction where the prediction is correct) minus the expected value of the loss (for the fraction where the prediction is incorrect).

• Minimising the expected loss while ensuring that the expected utility is at least some value.

• Maximising the expected utility while ensuring that the expected loss is at most some value.
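The first criterion is easy to sketch in a few lines of Python (my own illustration, with hypothetical numbers, not something from the post):

```python
def expected_net_benefit(p_correct, utility, loss):
    """Criterion 1: expected utility over the fraction of correct
    predictions, minus expected loss over the incorrect fraction."""
    return p_correct * utility - (1.0 - p_correct) * loss

# Hypothetical numbers: a correct El Nino forecast is worth 100 units of
# mitigation benefit, while acting on a false alarm costs 40 units.
# Acting on the model's predictions is worthwhile when this is positive.
worth_acting = expected_net_benefit(0.7, 100.0, 40.0) > 0.0
```

With those numbers the expected net benefit is about 58 units per use, so acting on the model's predictions pays off; drop the accuracy to 20% and it becomes negative.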

Once we’ve chosen which one we want, it’s often possible to actually tune the fitting of the model to optimize with respect to that criterion.

Of course sometimes when building a model we don’t know enough details of how it will be used to get accurate utility and loss functions (or indeed know how it will be used at all).

It is certainly possible to take a predictive model obtained by machine learning and use it to figure out a physically based model; this is one way of performing **data mining**. However in practice there are a couple of reasons why it’s necessary to take some care when doing this:

• The variables in the training set may be related by some non-observed **latent variables** which may be difficult to reconstruct without knowledge of the physical laws that are in play. (There are machine learning techniques which attempt to reconstruct unknown latent variables, but this is a much more difficult problem than estimating known but unobserved latent variables.)

• Machine learning models have a maddening ability to find variables that are predictive due to the way the data was gathered. For example, in a vision system aimed at finding tanks all the images of tanks were taken during one day on a military base when there was accidentally a speck of grime on the camera lens, while all the images of things that weren’t tanks were taken on other days. A neural net cunningly learned that to decide if it was being shown a tank it should look for the shadow from the grime.

• It’s common to have some groups of *very highly correlated* input variables. In that case a model will generally learn a function which utilises an arbitrary linear combination of the correlated variables and an equally good model would result from using any other linear combination. (This is an example of the statistical problem of ‘identifiability’.) Certain sparsity encouraging priors have the useful property of encouraging the model to select only one representative from a group of correlated variables. However, even in that case it’s still important not to assign too much significance to the particular division of model parameters in groups of correlated variables.

• One can often come up with good machine learning models even when physically important variables haven’t been collected in the training data. A related issue is that if all the training data is collected from a particular subspace, factors that aren’t important there won’t be found. For example, if in a collision system to be modelled all data is collected at low speeds, the machine learning model won’t learn about relativistic effects that only have a big effect at a substantial fraction of the speed of light.

All of the ideas discussed above are really just ways of making sure that work developing statistical/machine learning models for a real problem is producing meaningful results. As Bob Dylan (almost) sang, “to live outside the physical law, you must be honest; I know you always say that you agree”.

]]>

Last time we introduced the concept of stochastic resonance. Briefly, it’s a way that noise can amplify a signal, by giving an extra nudge that helps a system receiving that signal make the jump from one state to another. Today we’ll describe a program that demonstrates this concept. But first, check it out:

No installation required! It runs as a web page which allows you to set the parameters of the model and observe the resulting output signal. It’s responsive because it runs right in your browser, as javascript.

There are sliders for controlling the amounts of sine wave and noise involved in the mix. As explained in the previous article, when we set the wave to a level not quite sufficient to cause the system to oscillate between states, and we add in the right amount of noise, stochastic resonance should kick in:

The program implements a mathematical model that runs in discrete time. It has two stable states, and is driven by a combination of a sine forcing function and a noise source.

The code builds on top of a library called JSXGraph, which supports function plotting, interactive graphics, and data visualization.

If you haven’t already, go try the program. On one plot it shows a sine wave, called the **forcing signal**, and a chaotic time-series, called the **output signal**.

There are four sliders, which we’ll call Amplitude, Frequency, Noise and Sample-Path.

• The Amplitude and Frequency sliders control the sine wave. Try them out.

• The output signal depends, in a complex way, on the sine wave. Vary Amplitude and Frequency to see how they affect the output signal.

• The amount of randomization involved in the process is controlled by the Noise slider. Verify this.

• Change the Sample-Path slider to alter the sequence of random numbers that are fed to the process. This will cause a different instance of the process to be displayed.

Now try to get stochastic resonance to kick in…

Time to look at the blueprints. It’s easy.

• Open the model web page. The code is now running in your browser.

• While there, run your browser’s view-source function. For Firefox on the Mac, click Apple-U. For Firefox on the PC, click Ctrl-U.

• You should see the html file for the web page itself.

• See the “script” directives at the head of this file. Each one refers to a javascript program on the internet. When the browser sees one, the program is fetched and loaded into the browser’s internal javascript interpreter. Here are the directives:

<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>
<script src="http://cdnjs.cloudflare.com/ajax/libs/jsxgraph/0.93/jsxgraphcore.js"></script>
<script src="./StochasticResonanceEuler.js"></script>
<script src="./normals.js"></script>

The first one loads MathJax, which is a formula-rendering engine. Next comes JSXGraph, a library that provides support for plotting and interactive graphics. Next, StochasticResonanceEuler.js is the *main code* for the model, and finally, normals.js provides random numbers.

• In the source window, click on the link for StochasticResonanceEuler.js — and you’ve reached the source!

The program implements a stochastic difference equation, which defines the changes in the output signal as a function of its current value and a random noise value.

It consists of the following components:

1. Interactive controls to set parameters

2. Plot of the forcing signal

3. Plot of the output signal

4. A function that defines a particular SDE

5. A simulation loop, which renders the output signal.

The program contains seven functions. The top-level function is initCharts. It dispatches to initControls, which builds the sliders, and initSrBoard, which builds the curve objects for the forcing function and the output signal (called “position curve” in the program). Each curve object is assigned a function that computes the (t,x) values for the time series, which gets called whenever the input parameters change. The function that is assigned to the forcing curve computes the sine wave, and reads the amplitude and frequency values from the sliders.

The calculation method for the output signal is set to the function mkSrPlot, which performs the simulation. It begins by defining a function for the deterministic part of the derivative:

deriv = Deriv(t,x) = SineCurve(t) + BiStable(x).

Then it constructs a “stepper” function, through the call Euler(deriv, tStep). A stepper function maps the current point (t,x) and a noise sample to the next point (t’,x’). The Euler stepper maps

((t,x), noiseSample)

to

(t + tStep, x + tStep * Deriv(t,x) + noiseSample).

The simulation loop is then performed by the function sdeLoop, which is given:

• The stepper function

• The noise amplitude (“dither”)

• The initial point (t0,x0)

• A randomization offset

• The number of points to generate

The current point is initialized to (t0,x0), and then the stepper is repeatedly applied to the current point and the current noise sample. The output returned is the sequence of (t,x) values.

The noise samples are normally distributed random numbers stored in an array. They get scaled by the noise amplitude when they are used. The array contains more values than are needed. By changing the starting point in the array, different instances of the process are obtained.
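Putting those pieces together, here is a rough Python transcription of the simulation core (my own sketch with hypothetical names; the real StochasticResonanceEuler.js is javascript, and its BiStable polynomial may differ from the textbook double-well term $x - x^3$ used here):

```python
import math

def euler(deriv, t_step):
    """Stepper: ((t, x), noise) -> (t + t_step, x + t_step*deriv(t, x) + noise)."""
    def stepper(t, x, noise_sample):
        return t + t_step, x + t_step * deriv(t, x) + noise_sample
    return stepper

def make_deriv(amplitude, frequency):
    """Deterministic part of the derivative: sine forcing plus a bistable
    term (x - x^3 is the classic double-well choice, stable near x = +-1)."""
    def deriv(t, x):
        return amplitude * math.sin(frequency * t) + (x - x ** 3)
    return deriv

def sde_loop(stepper, dither, t0, x0, noise, n_points):
    """Start at (t0, x0) and repeatedly apply the stepper, scaling each
    precomputed noise sample by the dither amplitude."""
    t, x, path = t0, x0, [(t0, x0)]
    for i in range(n_points):
        t, x = stepper(t, x, dither * noise[i])
        path.append((t, x))
    return path
```

Changing the starting index into the `noise` array is exactly what the Sample-Path slider does: same model, different instance of the random process.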

Now let’s tweak the program to do new things.

First let’s make a local copy of the program on your local machine, and get it to run there. Make a directory, say /Users/macbookpro/stochres. Open the html file in the view source window. Paste it into the file /Users/macbookpro/stochres/stochres.html. Next, in the view source window, click on the link to StochasticResonanceEuler.js. Paste the text into /Users/macbookpro/stochres/StochasticResonanceEuler.js.

Now point your browser to the file, with the URL file:///Users/macbookpro/stochres/stochres.html. To prove that you’re really executing the local copy, make a minor edit to the html text, and check that it shows up when you reload the page. Then make a minor edit to StochasticResonanceEuler.js, say by changing the label text on the slider from “forcing function” to “forcing signal.”

Now let’s get warmed up with some bite-sized programming exercises.

1. Change the color of the sine wave.

2. Change the exponent in the bistable polynomial to values other than 2, to see how this affects the output.

3. Add an integer-valued slider to control this exponent.

4. Modify the program to perform two runs of the process, and show the output signals in different colors.

5. Modify it to perform ten runs, and change the output signal to display the point-wise average of these ten runs.

6. Add an input slider to control the number of runs.

7. Add another plot, which shows the standard deviation of the output signals, at each point in time.

8. Replace the precomputed array of normally distributed random numbers with a run-time computation that uses a random number generator. Use the Sample-Path slider to seed the random number generator.

9. When the sliders are moved, explain the flow of events that causes the recalculation to take place.

What is the impact of the frequency of the forcing signal on its transmission through stochastic resonance?

• Make a hypothesis about the relationship.

• Check your hypothesis by varying the Frequency slider.

• Write a function to measure the strength of the output signal at the forcing frequency. Let sinwave be a discretely sampled sine wave at the forcing frequency, and coswave be a discretely sampled cosine wave. Let sindot = the dot product of sinwave and the output signal, and similarly for cosdot. Then the power measure is sindot^2 + cosdot^2.

• Modify the program to perform N trials at each frequency over some specified range of frequency, and measure the average power over all the N trials. Plot the power as a function of frequency.

• The above plot required you to fix a wave amplitude and noise level. Choose five different noise levels, and plot the five curves in one figure. Choose your noise levels in order to explore the range of qualitative behaviors.

• Produce several versions of this five-curve plot, one for each sine amplitude. Again, choose your amplitudes in order to explore the range of qualitative behaviors.
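The power measure described above can be sketched in Python like this (my own helper names, not from the program):

```python
import math

def pure_tone(freq, t_step, n):
    """A discretely sampled sine wave, handy for experimenting."""
    return [math.sin(freq * i * t_step) for i in range(n)]

def power_at_frequency(signal, t_step, freq):
    """Strength of `signal` at `freq`: dot the signal against sampled
    sine and cosine waves and return sindot^2 + cosdot^2."""
    sindot = sum(x * math.sin(freq * i * t_step) for i, x in enumerate(signal))
    cosdot = sum(x * math.cos(freq * i * t_step) for i, x in enumerate(signal))
    return sindot ** 2 + cosdot ** 2
```

A signal oscillating at the forcing frequency scores orders of magnitude higher there than at an unrelated frequency, which is what makes this a usable measure of signal transmission.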

]]>

Why do ostriches stick their heads under the sand when they’re scared?

They don’t. So why do people say they do? A Roman named Pliny the Elder might be partially to blame. He wrote that ostriches “imagine, when they have thrust their head and neck into a bush, that the whole of their body is concealed.”

That would be silly—birds aren’t that dumb. But people will actually *pay* to avoid learning unpleasant facts. It seems irrational to avoid information that could be useful. But people do it. It’s called **information aversion**.

Here’s a new experiment on information aversion:

In order to gauge how information aversion affects health care, one group of researchers decided to look at how college students react to being tested for a sexually transmitted disease.

That’s a subject a lot of students worry about, according to Josh Tasoff, an economist at Claremont Graduate University who led the study along with Ananda Ganguly, an associate professor of accounting at Claremont McKenna College.

The students were told they could get tested for the herpes simplex virus. It’s a common disease that spreads via contact. And it has two forms: HSV1 and HSV2.

The type 1 herpes virus produces cold sores. It’s unpleasant, but not as unpleasant as type 2, which targets the genitals. Ganguly says the college students were given information — graphic information — that made it clear which kind of HSV was worse.

“There were pictures of male and female genitalia with HSV2, guaranteed to kind of make them really not want to have the disease,” Ganguly says.

Once the students understood what herpes does, they were told a blood test could find out if they had either form of the virus.

Now, in previous studies on information aversion it wasn’t always clear why people declined information. So Tasoff and Ganguly designed the experiment to eliminate every extraneous reason someone might decline to get information.

First, they wanted to make sure that students weren’t declining the test because they didn’t want to have their blood drawn. Ganguly came up with a way to fix that: All of the students would have to get their blood drawn. If a student chose not to get tested, “we would draw 10 cc of their blood and in front of them have them pour it down the sink,” Ganguly says.

The researchers also assured the students that if they elected to get the blood tested for HSV1 and HSV2, they would receive the results confidentially.

And to make triply sure that volunteers who said they didn’t want the test were declining it to avoid the information, the researchers added one final catch. Those who didn’t want to know if they had a sexually transmitted disease had to pay $10 to not have their blood tested.

So what did the students choose? Quite a few declined a test.

And while only 5 percent avoided the HSV1 test, three times as many avoided testing for the nastier form of herpes.

For those who didn’t want to know, the most common explanation was that they felt the results might cause them unnecessary stress or anxiety.

Let’s try extrapolating from this. Global warming is pretty scary. What would people do to avoid learning more about it? You can’t exactly pay scientists to not tell you about it. But you can do lots of other things: not listen to them, pay people to contradict what they’re saying, and so on. And guess what? People do all these things.

So, don’t expect that scaring people about global warming will make them take action. If a problem seems scary and hard to solve, many people will just avoid thinking about it.

Maybe a better approach is to tell people things they can do about global warming. Even if these things aren’t big enough to solve the problem, they can keep people engaged.

There’s a tricky issue here. I don’t want people to think turning off the lights when they leave the room is enough to stop global warming. That’s a dangerous form of complacency. But it’s even worse if they decide global warming is such a big problem that there’s no point in doing anything about it.

There are also lots of subtleties worth exploring in further studies. What, *exactly*, are the situations where people seek to avoid unpleasant information? What are the situations where they will accept it? This is something we need to know.

The quote is from here:

• Shankar Vedantam, Why we think ignorance is bliss, even when it hurts our health, *Morning Edition*, National Public Radio, 28 July 2014.

Here’s the actual study:

• Ananda Ganguly and Joshua Tasoff, Fantasy and dread: the demand for information and the consumption utility of the future.

**Abstract.** Understanding the properties of intrinsic information preference is important for predicting behavior in many domains including finance and health. We present evidence that intrinsic demand for information about the future is increasing in expected future consumption utility. In the first experiment subjects may resolve a lottery now or later. The information is useless for decision making but the larger the reward, the more likely subjects are to pay to resolve the lottery early. In the second experiment subjects may pay to avoid being tested for HSV-1 and the more highly feared HSV-2. Subjects are three times more likely to avoid testing for HSV-2, suggesting that more aversive outcomes lead to more information avoidance. We also find that intrinsic information demand is negatively correlated with positive affect and ambiguity aversion.

Here’s an attempt by economists to explain information aversion:

• Marianne Andries and Valentin Haddad, Information aversion, 27 February 2014.

**Abstract.** We propose a theory of inattention solely based on preferences, absent any cognitive limitations and external costs of acquiring information. Under disappointment aversion, information decisions and risk attitude are intertwined, and agents are intrinsically information averse. We illustrate this link between attitude towards risk and information in a standard portfolio problem, in which agents balance the costs, endogenous in our framework, and benefits of information. We show agents never choose to receive information continuously in a diffusive environment: they optimally acquire information at infrequent intervals only. We highlight a novel channel through which the optimal frequency of information acquisition decreases when risk increases, consistent with empirical evidence. Our framework accommodates a broad range of applications, suggesting our approach can explain many observed features of decision under uncertainty.

The photo, probably fake, is from here.

]]>

Precisely defining a complicated climate phenomenon like El Niño is a tricky business. Lots of different things tend to happen when an El Niño occurs. In 1997-1998, we saw these:

But what if just *some* of these things happen? Do we still have an El Niño or not? Is there a right answer to this question, or is it partially a matter of taste?

A related puzzle: is El Niño a single phenomenon, or several? Could there be several *kinds* of El Niño? Some people say there are.

Sometime I’ll have to talk about this. But today let’s start with the basics: the standard definition of El Niño. Let’s see how this differs from Ludescher *et al*’s definition.

The most standard definitions use the **Oceanic Niño Index** or **ONI**, which is the running 3-month mean of the Niño 3.4 index:

• An **El Niño** occurs when the ONI is over 0.5 °C for at least 5 months in a row.

• A **La Niña** occurs when the ONI is below -0.5 °C for at least 5 months in a row.

Of course I should also say exactly what the ‘Niño 3.4 index’ is, and what the ‘running 3-month mean’ is.

The **Niño 3.4 index** is the area-averaged, time-averaged sea surface temperature anomaly for a given month in the region 5°S-5°N and 170°-120°W:

Here **anomaly** means that we take the area-averaged, time-averaged sea surface temperature for a given month — say February — and subtract off the historical average of this quantity — that is, for Februaries of other years on record.

If you’re clever you can already see room for subtleties and disagreements. For example, you can get sea surface temperatures in the Niño 3.4 region here:

• Niño 3.4 data since 1870 calculated from the HadISST1, NOAA. Discussed in N. A. Rayner *et al*, Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century, *J. Geophys. Res.* **108** (2003), 4407.

However, they don’t actually provide the Niño 3.4 index.

You can get the Niño 3.4 index here:

• TNI (Trans-Niño Index) and N3.4 (Niño 3.4 Index), NCAR.

You can also get it from here:

• Monthly Niño 3.4 index, Climate Prediction Center, National Weather Service.

The actual temperatures in Celsius on the two websites are quite close — but the anomalies are rather different, because the second one ‘subtracts off the historical average’ in a way that takes global warming into account. For example, to compute the Niño 3.4 index in June 1952, instead of taking the average temperature that month and subtracting off the average temperature for *all* Junes on record, they subtract off the average for Junes in the period 1936-1965. Averages for different periods are shown here:

You can see how these curves move up over time: that’s global warming! It’s interesting that they go up fastest during the cold part of the year. It’s also interesting to see how gentle the seasons are in this part of the world. In the old days, the average monthly temperatures ranged from 26.2 °C in the winter to 27.5 °C in the summer — a mere 1.3 °C fluctuation.

Finally, to compute the ONI in a given month, we take the average of the Niño 3.4 index in that month, the month before, and the month after. This definition of **running 3-month mean** has a funny feature: we can’t know the ONI for this month until *next* month!
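The running mean is a one-liner; here is a small Python sketch (my own function name) that makes the "can't know this month's ONI until next month" feature explicit:

```python
def oni(nino34):
    """Running 3-month mean: the ONI for a month averages the Nino 3.4
    index over that month, the month before, and the month after. The
    first and last months are skipped, since computing this month's ONI
    needs next month's data."""
    return [sum(nino34[i - 1:i + 2]) / 3.0 for i in range(1, len(nino34) - 1)]
```

For example, `oni([1.0, 2.0, 3.0, 4.0])` gives `[2.0, 3.0]`: two output months for four input months, because both endpoints are dropped.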

You can get a table of the ONI here:

• Cold and warm episodes by season, Climate Prediction Center, National Weather Service.

It’s not particularly computer-readable.

Now let’s compare Ludescher *et al*. They say there’s an El Niño when the Niño 3.4 index is over 0.5°C for at least 5 months in a row. By not using the ONI — by using the Niño 3.4 index instead of its 3-month running mean — they could be counting some short ‘spikes’ in the Niño 3.4 index as El Niños, that wouldn’t count as El Niños by the usual definition.

I haven’t carefully checked to see how much changing the definition would affect the success rate of their predictions. To be fair, we should also let them change the value of their parameter θ, which is tuned to be good for predicting El Niños in their setup. But we can see that there could be some ‘spike El Niños’ in this graph of theirs, that might go away with a different definition. These are places where the red line goes over the horizontal line for more than 5 months, but no more:

Let’s look at the spike around 1975. See that green arrow at the beginning of 1975? That means Ludescher *et al* are claiming to successfully predict an El Niño sometime in the next calendar year. We can zoom in for a better look:

The tiny blue bumps are where the Niño 3.4 index exceeds 0.5.

Let’s compare the ONI as computed by the National Weather Service, month by month, with El Niños in red and La Niñas in blue:

1975: 0.5, -0.5, -0.6, -0.7, -0.8, -1.0, -1.1, -1.2, -1.4, -1.5, -1.6, -1.7

1976: -1.5, -1.1, -0.7, -0.5, -0.3, -0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.8

1977: 0.6, 0.6, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.7, 0.8, 0.8

1978: 0.7, 0.5, 0.1, -0.2, -0.3, -0.3, -0.3, -0.4, -0.4, -0.3, -0.1, -0.1

So indeed an El Niño started in September 1976. The ONI only stayed above 0.5 for 6 months, but that’s enough. Ludescher and company luck out!
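Checking a table like this by eye is error-prone, so here is a Python sketch that scans for El Niño episodes. It uses at-or-above 0.5 °C as the cutoff (the post says "over 0.5"; NOAA's published tables use ≥ 0.5), and the data is the 1975–1978 ONI table above:

```python
def el_nino_episodes(oni, threshold=0.5, min_run=5):
    """Index ranges with at least min_run consecutive months of
    ONI >= threshold (the standard El Nino criterion)."""
    episodes, start = [], None
    for i, v in enumerate(oni + [float("-inf")]):  # sentinel flushes the last run
        if v >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                episodes.append((start, i - 1))
            start = None
    return episodes

# ONI values from the table above, January 1975 - December 1978.
oni_1975_78 = [
    0.5, -0.5, -0.6, -0.7, -0.8, -1.0, -1.1, -1.2, -1.4, -1.5, -1.6, -1.7,
    -1.5, -1.1, -0.7, -0.5, -0.3, -0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.8,
    0.6, 0.6, 0.3, 0.3, 0.3, 0.4, 0.4, 0.4, 0.5, 0.7, 0.8, 0.8,
    0.7, 0.5, 0.1, -0.2, -0.3, -0.3, -0.3, -0.4, -0.4, -0.3, -0.1, -0.1,
]
```

This finds a 6-month episode starting at month 20 (September 1976), matching the text, plus a second 6-month episode starting September 1977. Note the lone 0.5 in January 1975 is correctly rejected as too short.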

Just for fun, let’s look at the National Weather Service Niño 3.4 index to see what that’s like:

1975: -0.33, -0.48, -0.72, -0.54, -0.68, -1.17, -1.07, -1.19, -1.36, -1.69, -1.45, -1.76

1976: -1.78, -1.10, -0.55, -0.53, -0.33, -0.10, 0.20, 0.39, 0.49, 0.88, 0.85, 0.63

So, this exceeded 0.5 in October 1976. That’s when Ludescher *et al* would say the El Niño starts, if they used the National Weather Service data.

Let’s also compare the NCAR Niño 3.4 index:

1975: -0.698, -0.592, -0.579, -0.801, -1.025, -1.205, -1.435, -1.620, -1.699, -1.855, -2.041, -1.960

1976: -1.708, -1.407, -1.026, -0.477, -0.095, 0.167, 0.465, 0.805, 1.039, 1.137, 1.290, 1.253

It’s pretty different! But it also gives an El Niño in 1976 according to Ludescher *et al*’s definition: the Niño 3.4 index exceeds 0.5 starting in August 1976.

This time we didn’t get into the interesting question of *why* one definition of El Niño is better than another. For that, try:

• Kevin E. Trenberth, The definition of El Niño, *Bulletin of the American Meteorological Society* **78** (1997), 2771–2777.

There could also be fundamentally different *kinds* of El Niño. For example, besides the usual sort where high sea surface temperatures are centered in the Niño 3.4 region, there could be another kind centered farther west near the International Date Line. This is called the **dateline El Niño** or **El Niño Modoki**. For more, try this:

• Nathaniel C. Johnson, How many ENSO flavors can we distinguish?, *Journal of Climate* **26** (2013), 4816-4827.

which has lots of references to earlier work. Here, to whet your appetite, is his picture showing the 9 most common patterns of sea surface temperature anomalies in the Pacific:

At the bottom of each is a percentage showing how frequently that pattern has occurred from 1950 to 2011. To get these pictures Johnson used something called a ‘self-organizing map analysis’ – a fairly new sort of cluster analysis done using neural networks. This is the sort of thing I hope we get into as our project progresses!

Just in case you want to get to old articles, here’s the story so far:

• El Niño project (part 1): basic introduction to El Niño and our project here.

• El Niño project (part 2): introduction to the physics of El Niño.

• El Niño project (part 3): summary of the work of Ludescher *et al*.

• El Niño project (part 4): how Graham Jones replicated the work by Ludescher *et al*, using software written in R.

• El Niño project (part 5): how to download R and use it to get files of climate data.

• El Niño project (part 6): Steve Wenner’s statistical analysis of the work of Ludescher *et al*.

• El Niño project (part 7): the definition of El Niño.

]]>

Emboldened by our experiments in El Niño analysis and prediction, people in the Azimuth Code Project have been starting to analyze weather and climate data. A lot of this work is exploratory, with no big conclusions. But it’s still interesting! So, let’s try some blog articles where we present this work.

This one will be about the air pressure on the island of Tahiti and in a city called Darwin in Australia: how they’re correlated, and how each one varies. This article will also be a quick introduction to some basic statistics, as well as ‘continuous wavelet transforms’.

The El Niño Southern Oscillation is often studied using the air pressure in Darwin, Australia versus the air pressure in Tahiti. When there’s an El Niño, it gets stormy in the eastern Pacific, so the air pressure tends to be lower in Tahiti and higher in Darwin. When there’s a La Niña, it’s the other way around:

The Southern Oscillation Index or **SOI** is a normalized version of the monthly mean air pressure anomaly in Tahiti minus that in Darwin. Here **anomaly** means we subtract off the mean, and **normalized** means that we divide by the standard deviation.
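Here is a minimal Python sketch of that recipe (my own function names; the official SOI uses a particular annual standardisation and an overall rescaling, so this is only schematic):

```python
def normalised_anomaly(series):
    """Subtract the mean, then divide by the standard deviation."""
    n = len(series)
    mean = sum(series) / n
    sd = (sum((v - mean) ** 2 for v in series) / n) ** 0.5
    return [(v - mean) / sd for v in series]

def soi_like(tahiti_pressure, darwin_pressure):
    """A simplified SOI: normalised Tahiti pressure anomaly minus
    normalised Darwin pressure anomaly, month by month."""
    t = normalised_anomaly(tahiti_pressure)
    d = normalised_anomaly(darwin_pressure)
    return [a - b for a, b in zip(t, d)]
```

When Tahiti's pressure runs below its mean while Darwin's runs above, this difference comes out negative, which is the El Niño signature.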

So, the SOI tends to be negative when there’s an El Niño. On the other hand, when there’s an El Niño the Niño 3.4 index tends to be positive—this says it’s hotter than usual in a certain patch of the Pacific.

Here you can see how this works:

When the Niño 3.4 index is positive, the SOI tends to be negative, and vice versa!

It might be fun to explore precisely how well correlated they are. You can get the data to do that by clicking on the links above.

But here’s another question: how similar are the air pressure anomalies in Darwin and in Tahiti? Do we really need to take their difference, or are they so strongly anticorrelated that either one would be enough to detect an El Niño?

You can get the data to answer such questions here:

• Southern Oscillation Index based upon annual standardization, Climate Analysis Section, NCAR/UCAR. This includes links to monthly sea level pressure anomalies in Darwin and Tahiti, in either ASCII format (click the second two links) or netCDF format (click the first one and read the explanation).

In fact this website has some nice graphs already made, which I might as well show you! Here’s the SOI and also the *sum* of the air pressure anomalies in Darwin and Tahiti, normalized in some way:

(Click to enlarge.)

If the sum were zero, the air pressure anomalies in Darwin and Tahiti would contain the same information and life would be simple. But it’s not!

How similar in character are the air pressure anomalies in Darwin and Tahiti? There are many ways to study this question. Dara tackled it by taking the air pressure anomaly data from 1866 to 2012 and computing some ‘continuous wavelet transforms’ of these air pressure anomalies. This is a good excuse for explaining how a continuous wavelet transform works.

It helps to start with some very basic statistics. Suppose you have a list of numbers

$x_1, \dots, x_N$

You probably know how to take their **mean**, or average. People often write this with angle brackets:

$\displaystyle \langle x \rangle = \frac{1}{N} \sum_{i=1}^N x_i$

You can also calculate the mean of their squares:

$\displaystyle \langle x^2 \rangle = \frac{1}{N} \sum_{i=1}^N x_i^2$

If you were naive you might think $\langle x^2 \rangle = \langle x \rangle^2,$ but in fact we have:

$\langle x^2 \rangle \ge \langle x \rangle^2$

and they’re equal only if all the $x_i$ are the same. The point is that if the numbers are spread out, the squares of the big ones (positive or negative) contribute more to the average of the squares than if we had averaged them out before squaring. The difference

$\langle x^2 \rangle - \langle x \rangle^2$

is called the **variance**; it says how spread out our numbers are. The square root of the variance is the **standard deviation**:

$\sigma_x = \sqrt{\langle x^2 \rangle - \langle x \rangle^2}$

and this has the slight advantage that if you multiply all the numbers by some constant $c,$ the standard deviation gets multiplied by $|c|.$ (The variance gets multiplied by $c^2.$)
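These definitions translate directly into a few lines of Python (a sketch for playing along, not code from the post):

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Mean of the squares minus the square of the mean: how spread
    out the numbers are."""
    return mean([x * x for x in xs]) - mean(xs) ** 2

def std_dev(xs):
    return variance(xs) ** 0.5

# A constant list has variance 0; scaling every number by c multiplies
# the variance by c^2 and the standard deviation by |c|.
```

For instance `variance([0.0, 2.0])` is `1.0`, while `variance([7.0, 7.0, 7.0])` is `0.0`.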

We can generalize the variance to a situation where we have two lists of numbers:

$x_1, \dots, x_N$ and $y_1, \dots, y_N$

Namely, we can form the **covariance**

$\langle x y \rangle - \langle x \rangle \langle y \rangle$

This reduces to the variance when $x_i = y_i.$ It measures how much $x_i$ and $y_i$ vary together — ‘hand in hand’, as it were. A bit more precisely: if $x_i$ is greater than its mean value mainly for $i$ such that $y_i$ is greater than *its* mean value, the covariance is positive. On the other hand, if $x_i$ tends to be greater than average when $y_i$ is *smaller* than average — like with the air pressures at Darwin and Tahiti — the covariance will be negative.

For example, if

$x = (1, -1, 1, -1, \dots), \qquad y = (1, -1, 1, -1, \dots)$

then they ‘vary hand in hand’, and the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle$

is positive. But if

$x = (1, -1, 1, -1, \dots), \qquad y = (-1, 1, -1, 1, \dots)$

then one is positive when the other is negative, so the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle$

is negative.

Of course the covariance will get bigger if we multiply both $x_i$ and $y_i$ by some big number. If we don’t want this effect, we can normalize the covariance and get the **correlation**:

$\displaystyle \frac{\langle x y \rangle - \langle x \rangle \langle y \rangle}{\sigma_x \sigma_y}$

which will always be between $-1$ and $1.$
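In code, covariance and correlation are just as short (again my own sketch):

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """<xy> - <x><y>; this equals the variance when xs is ys."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Covariance normalised by the two standard deviations; the
    result always lies between -1 and 1."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)
```

The alternating lists above give covariance $+1$ and $-1$ respectively, and a list correlated with itself gives correlation $1$.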

For example, if we compute the correlation between the air pressure anomalies at Darwin and Tahiti, measured monthly from 1866 to 2012, we get

$-0.253727$

This indicates that when one goes up, the other tends to go down. But since we’re not getting $-1,$ it means they’re not completely locked into a linear relationship where one is some negative number times the other.

Okay, we’re almost ready for continuous wavelet transforms! Here is the main thing we need to know. If the mean of either $x_i$ or $y_i$ is zero, the formula for covariance simplifies a lot, to

$$ \langle x_i y_i \rangle = \frac{1}{n} \sum_{i=1}^n x_i y_i $$

So, this quantity says how much the numbers $x_i$ ‘vary hand in hand’ with the numbers $y_i$ in the special case when one (or both) has mean zero.

We can do something similar if $f(t)$ and $g(t)$ are functions of time defined for all real numbers $t$. The sum becomes an integral, and we have to give up on dividing by $n$. We get:

$$ \int_{-\infty}^\infty f(t) g(t) \, dt $$

This is called the **inner product** of the functions $f$ and $g$, and often it’s written $\langle f, g \rangle$, but it’s a lot like the covariance.
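In practice we can approximate this inner product by a Riemann sum. Here is a quick numerical sketch (my own, just for illustration): over one full period, sine ‘varies hand in hand’ with itself but not with cosine:

```python
import math

# Approximating the inner product <f, g> = integral of f(t) g(t) dt
# by a midpoint Riemann sum.
def inner_product(f, g, a, b, n=100000):
    dt = (b - a) / n
    return sum(f(a + (i + 0.5) * dt) * g(a + (i + 0.5) * dt) for i in range(n)) * dt

# Over one full period, <sin, sin> = pi, while <sin, cos> = 0:
print(inner_product(math.sin, math.sin, 0.0, 2 * math.pi))  # ≈ 3.14159
print(inner_product(math.sin, math.cos, 0.0, 2 * math.pi))  # ≈ 0
```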

What are continuous wavelet transforms, and why should we care?

People have lots of tricks for studying ‘signals’, like series of numbers $x_i$ or functions $f(t)$. One method is to ‘transform’ the signal in a way that reveals useful information. The Fourier transform decomposes a signal into sines and cosines of different frequencies. This lets us see how much power the signal has at different frequencies, but it doesn’t reveal how the power at different frequencies *changes with time*. For that we should use something else, like the Gabor transform explained by Blake Pollard in a previous post.

Sines and cosines are great, but we might want to look for other patterns in a signal. A ‘continuous wavelet transform’ lets us scan a signal for appearances of a given pattern at different *times* and also at different *time scales*: a pattern could go by quickly, or in a stretched out slow way.

To implement the continuous wavelet transform, we need a signal and a pattern to look for. The signal could be a function $f$. The pattern would then be another function $w$, usually called a **wavelet**.

Here’s an example of a wavelet:

If we’re in a relaxed mood, we could call *any* function that looks like a bump with wiggles in it a wavelet. There are lots of famous wavelets, but this particular one is the fourth derivative of a certain Gaussian. Mathematica calls this particular wavelet DGaussianWavelet[4], and you can look up the formula under ‘Details’ on their webpage.

However, the exact formula doesn’t matter at all now! If we call this wavelet $w$, all that matters is that it’s a bump with wiggles on it, and that its mean value is 0, or more precisely:

$$ \int_{-\infty}^\infty w(t) \, dt = 0 $$

As we saw in the last section, this fact lets us take our function $f$ and the wavelet $w$ and see how much they ‘vary hand in hand’ simply by computing their inner product:

$$ \langle f, w \rangle = \int_{-\infty}^\infty f(t) w(t) \, dt $$

Loosely speaking, this measures the ‘amount of $w$-shaped wiggle in the function $f$’. It’s amazing how hard it is to say something in plain English that perfectly captures the meaning of a simple formula like the above one—so take the quoted phrase with a huge grain of salt. But it gives a rough intuition.

Our wavelet $w$ happens to be centered at $t = 0$. However, we might be interested in $w$-shaped wiggles that are centered not at zero but at some other number $s$. We could detect these by shifting the function $w$ before taking its inner product with $f$:

$$ \int_{-\infty}^\infty f(t) \, w(t - s) \, dt $$

We could also be interested in measuring the amount of some stretched-out or squashed version of a $w$-shaped wiggle in the function $f$. Again we could do this by changing $w$ before taking its inner product with $f$:

$$ \int_{-\infty}^\infty f(t) \, w\left(\frac{t}{P}\right) dt $$

When $P$ is big, we get a stretched-out version of $w$. People sometimes call $P$ the **period**, since the period of the wiggles in $w$ will be proportional to this (though usually not equal to it).

Finally, we can combine these ideas, and compute

$$ \int_{-\infty}^\infty f(t) \, w\left(\frac{t - s}{P}\right) dt $$

This is a function of the shift $s$ and period $P$ which says how much of the $s$-shifted, $P$-stretched wavelet $w$ is lurking in the function $f$. It’s a version of the continuous wavelet transform!

Mathematica implements this idea for **time series**, meaning lists of numbers $x_i$ instead of functions $f(t)$. The idea is that we think of the numbers as samples of a function $f$:

$$ f(0) = x_0 , \quad f(\Delta t) = x_1 , \quad f(2 \Delta t) = x_2 $$

and so on, where $\Delta t$ is some time step, and replace the integral above by a suitable sum. Mathematica has a function ContinuousWaveletTransform that does this, giving

$$ T(s, P) = \frac{1}{\sqrt{P}} \sum_i x_i \, w\left(\frac{i \Delta t - s}{P}\right) \Delta t $$

The factor of $\frac{1}{\sqrt{P}}$ in front is a useful extra trick: it’s the right way to compensate for the fact that when you stretch out your wavelet $w$ by a factor of $P$, it gets bigger. So, when we’re doing integrals, we should define the **continuous wavelet transform** of $f$ by:

$$ T(s, P) = \frac{1}{\sqrt{P}} \int_{-\infty}^\infty f(t) \, w\left(\frac{t - s}{P}\right) dt $$
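If you’d rather not rely on Mathematica, the discretized transform is easy to code from scratch. Here is a sketch in Python; for simplicity I use the ‘Mexican hat’ wavelet (a second derivative of a Gaussian) instead of DGaussianWavelet[4] — both are bumps with wiggles whose mean value is 0 — and the test signal is invented:

```python
import math

# A from-scratch sketch of the discretized continuous wavelet transform.
def wavelet(t):
    # 'Mexican hat': a bump with wiggles, mean value 0
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt(xs, s, P, dt=1.0):
    # T(s, P) = (1 / sqrt(P)) * sum_i x_i * w((i*dt - s) / P) * dt
    return sum(x * wavelet((i * dt - s) / P) for i, x in enumerate(xs)) * dt / math.sqrt(P)

# A test signal: a burst of wiggles with period 20, centered at t = 100.
signal = [math.exp(-(((i - 100) / 30.0) ** 2)) * math.cos(2 * math.pi * (i - 100) / 20.0)
          for i in range(200)]

# For this wavelet the dominant wiggle period is about 4.4 * P, so P = 4.5
# roughly matches the signal's period of 20. The transform responds far more
# strongly at the burst (s = 100) than away from it (s = 20):
print(abs(cwt(signal, s=100, P=4.5)))
print(abs(cwt(signal, s=20, P=4.5)))
```

Scanning over a grid of $s$ and $P$ values and plotting the magnitudes is exactly what produces a scalogram.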

Dara Shayda started with the air pressure anomaly at Darwin and Tahiti, measured monthly from 1866 to 2012. Taking DGaussianWavelet[4] as his wavelet, he computed the continuous wavelet transform as above. To show us the answer, he created a **scalogram**:

This is a 2-dimensional color plot showing roughly how big the continuous wavelet transform is for different shifts $s$ and periods $P$. Blue means it’s very small, green means it’s bigger, yellow means even bigger and red means very large.

Tahiti gave this:

You’ll notice that the patterns at Darwin and Tahiti are similar in character, but notably different in detail. For example, the red spots, where our chosen wavelet shows up strongly with period of order ~100 months, occur at different times.

**Puzzle 1.** What is the meaning of the ‘spikes’ in these scalograms? What sort of signal would give a spike of this sort?

**Puzzle 2.** Do a Gabor transform, also known as a ‘windowed Fourier transform’, of the same data. Blake Pollard explained the Gabor transform in his article Milankovitch vs the Ice Ages. This is a way to see how much a signal wiggles at a given frequency at a given time: we multiply the signal by a shifted Gaussian and then takes its Fourier transform.
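In case you want to try Puzzle 2 numerically, here is one way to set up the Gabor transform in Python; the notation, window width and test signal are my own choices, not from Blake’s post:

```python
import cmath, math

# Gabor / windowed Fourier transform: multiply the signal by a Gaussian
# window centered at time s, then take the Fourier component at angular
# frequency omega.
def gabor(xs, s, omega, sigma=10.0, dt=1.0):
    total = 0j
    for i, x in enumerate(xs):
        t = i * dt
        window = math.exp(-((t - s) ** 2) / (2.0 * sigma ** 2))
        total += x * window * cmath.exp(-1j * omega * t) * dt
    return total

# A pure wave of angular frequency 0.5 lights up at omega = 0.5
# much more than at an unrelated frequency:
signal = [math.cos(0.5 * i) for i in range(400)]
print(abs(gabor(signal, s=200, omega=0.5)))
print(abs(gabor(signal, s=200, omega=1.5)))
```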

**Puzzle 3.** Read about continuous wavelet transforms. If we want to reconstruct our signal $f$ from its continuous wavelet transform, why should we use a wavelet $w$ with

$$ \int_{-\infty}^\infty w(t) \, dt = 0 \; ? $$

In fact we want a somewhat stronger condition, which is implied by the above equation when the Fourier transform of $w$ is smooth and integrable:

$$ \int_0^\infty \frac{|\hat{w}(\omega)|^2}{\omega} \, d\omega < \infty $$

• Continuous wavelet transform, Wikipedia.

David Tweed mentioned another approach from signal processing to understanding the quantity

$$ \sum_i x_i y_i $$

If we’ve got two lists of data $x_i$ and $y_i$ that we want to compare to see if they behave similarly, the first thing we ought to do is multiplicatively scale each one so they’re of comparable magnitude. There are various possibilities for assigning a scale, but a reasonable one is to ensure they have equal ‘energy’:

$$ \sum_i x_i^2 = \sum_i y_i^2 $$

(This can be achieved by dividing each list by its standard deviation, which is equivalent to what was done in the main derivation above.) Once we’ve done that, it’s clear that looking at

$$ \sum_i (x_i - y_i)^2 $$

gives small values when they have a very good match and progressively bigger values as they become less similar. Observe that

$$ \sum_i (x_i - y_i)^2 = \sum_i x_i^2 - 2 \sum_i x_i y_i + \sum_i y_i^2 $$

Since we’ve scaled things so that $\sum_i x_i^2$ and $\sum_i y_i^2$ are constants, we can see that when $\sum_i x_i y_i$ becomes bigger,

$$ \sum_i (x_i - y_i)^2 $$

becomes smaller. So,

$$ \sum_i x_i y_i $$

serves as a measure of how close the lists are, under these assumptions.
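This identity is easy to check numerically. Here is a quick Python sketch on made-up lists, normalized to unit energy:

```python
# Checking David Tweed's observation: after equal-energy normalization,
# sum (x_i - y_i)^2 = 2 - 2 * sum x_i y_i.
def energy(xs):
    return sum(x * x for x in xs)

def normalize(xs):
    e = energy(xs) ** 0.5
    return [x / e for x in xs]   # now energy(result) == 1

x = normalize([1.0, 2.0, -1.0, 3.0])
y = normalize([0.5, 1.5, -2.0, 2.5])

sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
cross   = sum(a * b for a, b in zip(x, y))

print(sq_dist, 2.0 - 2.0 * cross)  # the two numbers agree
```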

]]>

Hi, I’m Steve Wenner.

I’m an industrial statistician with over 40 years of experience in a wide range of applications (quality, reliability, product development, consumer research, biostatistics); but, somehow, time series only rarely crossed my path. Currently I’m working for a large consumer products company.

My undergraduate degree is in physics, and I also have a master’s in pure math. I never could reconcile how physicists used math (explain that Dirac delta function to me again in math terms? Heaviside calculus? On the other hand, I thought category theory was abstract nonsense until John showed me otherwise!). Anyway, I had to admit that I lacked the talent to pursue pure math or theoretical physics, so I became a statistician. I never regretted it—statistics has provided a very interesting and intellectually challenging career.

I got interested in Ludescher *et al*’s paper on El Niño prediction by reading Part 3 of this series. I have no expertise in climate science, except for an intense interest in the subject as a concerned citizen. So, I won’t talk about things like how Ludescher *et al* use a nonstandard definition of ‘El Niño’—that’s a topic for another time. Instead, I’ll look at some statistical aspects of their paper:

• Josef Ludescher, Avi Gozolchiani, Mikhail I. Bogachev, Armin Bunde, Shlomo Havlin, and Hans Joachim Schellnhuber, Very early warning of next El Niño, *Proceedings of the National Academy of Sciences*, February 2014. (Click title for free version, journal name for official version.)

I downloaded the NOAA adjusted monthly temperature anomaly data and compared the El Niño periods with the charts in this paper. I found what appear to be two errors (“phantom” El Niños) and noted some interesting situations. Some of these are annotated on the images below. Click to enlarge them:

I also listed for each year whether an El Niño initiation was predicted, or not, and whether one actually happened. I did the predictions five ways: first, I listed the authors’ “arrows” as they appeared on their charts, and then I tried to match their predictions by following in turn four sets of rules. However, I could not come up with any detailed rules that exactly reproduced the authors’ results.

These were the rules I used:

An El Niño initiation is predicted for a calendar year if during the preceding year the average link strength crossed above the 2.82 threshold. However, we could also invoke additional requirements. Two possibilities are:

1. Preemption rule: the prediction of a new El Niño is canceled if the preceding year ends in an El Niño period.

2. End-of-year rule: the link strength must be above 2.82 at year’s end.

I counted the predictions using all four combinations of these two rules and compared the results to the arrows on the charts.

I defined an “El Niño initiation month” to be a month where the monthly average adjusted temperature anomaly rises to at least 0.5 °C and remains at or above 0.5 °C for at least five months. Note that the NOAA El Niño monthly temperature estimates are rounded to hundredths; and, on occasion, the anomaly is reported as exactly 0.5 °C. I found slightly better agreement with the authors’ El Niño periods if I counted an anomaly of exactly 0.5 °C as satisfying the threshold criterion, instead of using the strictly “greater than” condition.
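A sketch of this rule as I read it, in Python (the series of anomalies below is invented, and counting exactly 0.5 °C as meeting the threshold, as discussed above):

```python
# An 'El Niño initiation month' is the first month of a run where the anomaly
# reaches at least 0.5 °C and stays at or above 0.5 °C for >= 5 months.
def initiation_months(anomalies, threshold=0.5, run_length=5):
    months = []
    for i in range(len(anomalies) - run_length + 1):
        starts_run = all(a >= threshold for a in anomalies[i:i + run_length])
        # only count the *first* month of a qualifying run
        fresh = i == 0 or anomalies[i - 1] < threshold
        if starts_run and fresh:
            months.append(i)
    return months

series = [0.1, 0.3, 0.5, 0.6, 0.7, 0.5, 0.5, 0.2, 0.6, 0.6, 0.4]
print(initiation_months(series))  # [2]: five months >= 0.5 starting at index 2
```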

Anyway, I did some formal hypothesis testing and estimation under all five scenarios. The good news is that under most scenarios the prediction method gave better results than merely guessing. (But, I wonder how many things the authors tried before they settled on their final method? Also, did they do all their work on the learning series, and then only at the end check the validation series—or were they checking both as they went about their investigations?)

The bad news is that the predictions varied with the method, and the methods were rather weak. For instance, in the training series there were 9 El Niño periods in 30 years; the authors’ rules (whatever they were, exactly) found five of the nine. At the same time, they had three false alarms in the 21 years that did not have an El Niño initiated.

I used Fisher’s exact test to compute some p-values. Suppose (as our ‘null hypothesis’) that Ludescher *et al*’s method does not improve the odds of a successful prediction of an El Niño initiation. What’s the probability of that method getting at least as many predictions right just by chance? Answer: 0.032 – this is marginally more significant than the conventional 1 in 20 chance that is the usual threshold for rejecting a null hypothesis, but still not terribly convincing. This was, by the way, the most significant of the five p-values for the alternative rule sets applied to the learning series.
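This one-sided Fisher test can be reproduced from the counts quoted above (9 El Niño years out of 30, 5 hits, 3 false alarms) with nothing but the hypergeometric distribution — a sketch in Python, using only the standard library:

```python
from math import comb

# 30 training years, 9 with an El Niño initiation; the method flagged 8
# years in total (5 hits + 3 false alarms).
N, K, n, hits = 30, 9, 8, 5

def hypergeom_pmf(k, N, K, n):
    # probability of k hits if n flagged years were chosen at random
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# One-sided p-value: probability of at least 5 hits by pure chance.
p = sum(hypergeom_pmf(k, N, K, n) for k in range(hits, min(K, n) + 1))
print(round(p, 3))  # 0.032, matching the value quoted above

# Relative risk: P(El Niño | predicted) / P(El Niño | not predicted)
rr = (hits / n) / ((K - hits) / (N - n))
print(round(rr, 2))  # 3.44, consistent with the point estimate of 3.4 below
```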

I also computed the “relative risk” statistics for all scenarios; for instance, we are more than three times as likely to see an El Niño initiation if Ludescher *et al* predict one, than if they predict otherwise (the 90% confidence interval for that ratio is 1.2 to 9.7, with the point estimate 3.4). Here is a screen shot of some statistics for that case:

Here is a screen shot of part of the spreadsheet list I made. In the margin on the right I made comments about special situations of interest.

Again, click to enlarge—but my whole working spreadsheet is available with more details for anyone who wishes to see it. I did the statistical analysis with a program called JMP, a product of the SAS corporation.

My overall impression from all this is that Ludescher *et al* are suggesting a somewhat arbitrary (and not particularly well-defined) method for revealing the relationship between link strength and El Niño initiation, if, indeed, a relationship exists. Slight variations in the interpretation of their criteria and slight variations in the data result in appreciably different predictions. I wonder if there are better ways to analyze these two correlated time series.

]]>

Anita Chowdry is an artist based in London. While many are exploring electronic media and computers, she’s going in the opposite direction, exploring craftsmanship and the hands-on manipulation of matter. I find this exciting, perhaps because I spend most of my days working on my laptop, becoming starved for richer sensations. She writes:

Today, saturated as we are with the ephemeral intangibility of virtual objects and digital functions, there is a resurgence of interest in the ingenious mechanical contraptions of pre-digital eras, and in the processes of handcraftsmanship and engagement with materials. The solid corporality of analogue machines, the perceivable workings of their kinetic energy, and their direct invitation to experience their science through hands-on interaction brings us back in touch with our humanity.

The ‘steampunk’ movement is one way people are expressing this renewed interest, but Anita Chowdry goes a bit deeper than some of that. For starters, she’s studied all sorts of delightful old-fashioned crafts, like silverpoint, a style of drawing used before the invention of graphite pencils. The tool is just a piece of silver wire mounted on a writing implement; a bit of silver rubs off and creates a gray line. The effect is very subtle:

In January she went to Cairo and worked with a master calligrapher, Ahmed Fares, to recreate the title page of a 16th-century copy of Avicenna’s *Canon of Medicine*, or *al-Qanun fi’l Tibb*:

This required making gold ink:

The secret is actually pure hard work; rubbing it by hand with honey for hours on end to break up the particles of gold into the finest powder, and then washing it thoroughly in distilled water to remove all impurities.

The results:

I met her in Oxford this March, and we visited the Museum of the History of Science together. This was a perfect place, because it’s right next to the famous Bodleian, and it’s full of astrolabes, sextants, ancient slide rules and the like…

… and one of Anita Chowdry’s new projects involves another piece of romantic old technology: the harmonograph!

A **harmonograph** is a mechanical apparatus that uses pendulums to draw a geometric image. The simplest so-called ‘lateral’ or ‘rectilinear’ harmonograph uses two pendulums: one moves a pen back and forth along one axis, while the other moves the drawing surface back and forth along a perpendicular axis. By varying their amplitudes, frequencies and the phase difference, we can get quite a number of different patterns. In the linear approximation where the pendulums don’t swing too high, we get Lissajous curves:

$$ x(t) = A \sin(a t + \delta), \qquad y(t) = B \sin(b t) $$

For example, when the amplitudes $A$ and $B$ are both 1, the frequencies are $a = 3$ and $b = 2$, and the phase difference $\delta$ is $\pi/2$, we get this:
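If you want to play with these yourself, generating points on a Lissajous curve takes only a few lines. Here is a sketch in Python; the amplitudes, frequencies and phase are just one example choice (a 3:2 ratio, 90° out of phase):

```python
import math

# Points on a Lissajous curve x = A sin(a t + delta), y = B sin(b t).
A, B = 1.0, 1.0
a, b = 3, 2
delta = math.pi / 2

def lissajous(t):
    return (A * math.sin(a * t + delta), B * math.sin(b * t))

# Sample one full period; with integer frequencies the curve closes up.
points = [lissajous(2 * math.pi * i / 1000) for i in range(1001)]
x0, y0 = points[0]
x1, y1 = points[-1]
print((x0, y0), (x1, y1))  # first and last points coincide
```

Feeding `points` to any plotting tool reproduces the familiar figure-eight-like patterns.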

Harmonographs don’t serve any concrete practical purpose that I know of; they’re a diversion, an educational device, or a form of art for art’s sake. They go back to the mid-1840s.

It’s not clear who invented the harmonograph. People often credit Hugh Blackburn, a professor of mathematics at the University of Glasgow who was a friend of the famous physicist Kelvin. He is indeed known for studying a pendulum hanging on a V-shaped string, in 1844. This is now called the Blackburn pendulum. But it’s not used in any harmonograph I know about.

On the other hand, Anita Chowdry has a book called *The Harmonograph. Illustrated by Designs actually Drawn by the Machine*, written in 1893 by one H. Irwine Whitty. This book says the harmonograph

was first constructed by Mr. Tisley, of the firm Tisley and Spiller, the well-known opticians…

So, it remains mysterious.

The harmonograph peaked in popularity in the 1890s. I have no idea how popular it ever was; it seems a rather cerebral form of entertainment. As the figures from Whitty’s book show, it was sometimes used to illustrate the Pythagorean theory of chords as frequency ratios. Indeed, this explains the name ‘harmonograph’:

At left the frequencies are exactly in the ratio 3:2, just as we’d have in two notes making a major fifth. Three choices of phase difference are shown. In the pictures at right, actually drawn by the machine, the frequencies aren’t perfectly tuned, so we get more complicated Lissajous curves.

How big was the harmonograph craze, and how long did it last? It’s hard for me to tell, but this book published in 1918 gives some clues:

• Archibald Williams, *Things to Make*: Home-made harmonographs (part 1, part 2, part 3), Thomas Nelson and Sons, Ltd., 1918.

It discusses the lateral harmonograph. Then it treats Joseph Goold’s ‘twin elliptic pendulum harmonograph’, which has a pendulum free to swing in all directions connected to a pen, and a second pendulum free to swing in all directions affecting the motion of the paper. It also shows a miniature version of the same thing, and how to build it yourself. It explains the connection with harmony theory. And it explains the *value* of the harmonograph:

## Value of the harmonograph

A small portable harmonograph will be found to be a good means of entertaining friends at home or elsewhere. The gradual growth of the figure, as the card moves to and fro under the pen, will arouse the interest of the least scientifically inclined person; in fact, the trouble is rather to persuade spectators that they have had enough than to attract their attention. The cards on which designs have been drawn are in great request, so that the pleasure of the entertainment does not end with the mere exhibition. An album filled with picked designs, showing different harmonies and executed in inks of various colours, is a formidable rival to the choicest results of the amateur photographer’s skill.

“In great request”—this makes it sound like harmonographs were all the rage! On the other hand, I smell a whiff of desperate salesmanship, and he begins the chapter by saying:

Have you ever heard of the harmonograph? If not, or if at the most you have very hazy ideas as to what it is, let me explain.

So even at its height of popularity, I doubt most people knew what a harmonograph was. And as time passed, more peppy diversions came along and pushed it aside. The phonograph, for example, began to catch on in the 1890s. But the harmonograph never completely disappeared. If you look on YouTube, you’ll find quite a number.

Anita Chowdry got an M.A. from Central Saint Martin’s college of Art and Design. That’s located near St. Pancras Station in London.

She built a harmonograph as part of her course work, and it worked well, but she wanted to make a more elegant, polished version. Influenced by the Victorian engineering of St. Pancras Station, she decided that “steel would be the material of choice.”

So, starting in 2013, she began designing a steel harmonograph with the help of her tutor Eleanor Crook and the engineering metalwork technician Ricky Lee Brawn.

Artist and technician David Stewart helped her make the steel parts. Learning to work with steel was a key part of this art project:

The first stage of making the steel harmonograph was to cut out and prepare all the structural components. In a sense, the process is a bit like tailoring—you measure and cut out all the pieces, and then put them together in a logical order, investing each stage with as much care and craftsmanship as you can muster. For the flat steel components I had medium-density fibreboard forms cut on the college numerical control machine, which David Stewart used as patterns to plasma-cut the shapes out of mild carbon-steel. We had a total of fifteen flat pieces for the basal structure, which were to be welded to a large central cylinder.

My job was to ‘finish’ the plasma-cut pieces: I refined the curves with an angle-grinder, drilled the holes that created the delicate openwork patterns, sanded everything to smooth the edges, then repeatedly heated and quenched each piece at the forge to darken and strengthen them. When Dave first placed the angle-grinder in my hands I was terrified—the sheer speed and power and noise of the monstrous thing connecting with the steel with a shower of sparks had a brutality and violence about it that I had never before experienced. But once I got used to the heightened energy of the process it became utterly enthralling. The grinder began to feel as fluent and expressive as a brush, and the steel felt responsive and alive. Like all metalwork processes, it demands a total, immersive concentration—you can get lost in it for hours!

Ricky Lee Brawn worked with her to make the brass parts:

Below you can see the brass piece he’s making, called a finial, among the steel legs of the partially finished harmonograph:

There are three legs, each with three feet.

The groups of three look right, because I conceived the entire structure on the basis of the three pendulums working at angles of 60 degrees in relation to one another (forming an equilateral triangle)—so the magic number is three and its multiples.

With three pendulums you can generate more complicated generalizations of Lissajous curves. In the language of music, three frequencies gives you a triplet!

Things become still more complex if we leave the linear regime, where motions are described by sines and cosines. I don’t understand Anita Chowdry’s harmonograph well enough to know if nonlinearity plays a crucial role. But it gives patterns like these:

Here is the completed harmonograph, called the ‘Iron Genie’, in action in the crypt of the St. Pancras Church:

And now, I’m happy to say, it’s on display at the Museum of the History of Science, where we met in Oxford. If you’re in the area, give it a look! She’s giving free public talks about it at 3 pm on

• Saturday July 19th

• Saturday August 16th

• Saturday September 20th

in 2014. And if you can’t visit Oxford, you can still visit her blog!

I think the mathematics of harmonographs deserves more thought. The basic math required for this was developed by the Irish mathematician William Rowan Hamilton around 1834. Hamilton was just the sort of character who would have enjoyed the harmonograph. But other crucial ideas were contributed by Jacobi, Poincaré and many others.

In a simple ‘lateral’ device, the position and velocity of the machine takes 4 numbers to describe: the two pendulums’ angles and angular velocities. In the language of classical mechanics, the space of states of the harmonograph is a 4-dimensional symplectic manifold, say $X$. Ignoring friction, its motion is described by Hamilton’s equations. These equations can give behavior ranging from completely integrable (as orderly as possible) to chaotic.

For small displacements of our lateral harmonograph about the state of rest, I believe its behavior will be completely integrable. If so, for any initial conditions, its motion will trace out a spiral on some 2-dimensional torus sitting inside $X$. The position of the pen on paper provides a map

$$ X \to \mathbb{R}^2 $$

and so the spiral is mapped to some curve on the paper!
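To see the torus concretely: in the linear regime each pendulum is a harmonic oscillator, so each mode’s energy is separately conserved, and fixing the two energies picks out a 2-torus in the 4-dimensional state space. A small Python sketch (frequencies and initial conditions made up for illustration):

```python
import math

# Linearized lateral harmonograph: two independent harmonic oscillators.
# E1 = (v1^2 + w1^2 q1^2)/2 and E2 = (v2^2 + w2^2 q2^2)/2 are each conserved,
# so the motion stays on a 2-torus inside the 4-dimensional state space.
w1, w2 = 2.0, 3.0

def state(t, q1_0=1.0, q2_0=0.5):
    q1, v1 = q1_0 * math.cos(w1 * t), -q1_0 * w1 * math.sin(w1 * t)
    q2, v2 = q2_0 * math.cos(w2 * t), -q2_0 * w2 * math.sin(w2 * t)
    return q1, v1, q2, v2

def energies(t):
    q1, v1, q2, v2 = state(t)
    return ((v1 ** 2 + (w1 * q1) ** 2) / 2, (v2 ** 2 + (w2 * q2) ** 2) / 2)

# The pen position is just the map (q1, v1, q2, v2) -> (q1, q2):
pen = [(state(t / 100)[0], state(t / 100)[2]) for t in range(1000)]
print(energies(0.0), energies(7.3))  # the two energies don't change
```

The list `pen` is the resulting Lissajous-type curve on the paper.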

We can ask what sort of curves can arise. Lissajous curves are the simplest, but I don’t know what to say in general. We might be able to understand their qualitative features without actually solving Hamilton’s equations. For example, there are two points where the curves seem to ‘focus’ here:

That’s the kind of thing mathematical physicists can try to understand, a bit like caustics in optics.

If we have a ‘twin elliptic pendulum harmonograph’, the state space becomes 8-dimensional, and the tori become 4-dimensional if the system is completely integrable. I don’t know the dimension of the state space for Anita Chowdry’s harmonograph, because I don’t know if her 3 pendulums can swing in just one direction each, or two!

But the big question is whether a given harmonograph is completely integrable… in which case the story I’m telling goes through… or whether it’s chaotic, in which case we should expect it to make very irregular pictures. A double pendulum—that is, a pendulum hanging on another pendulum—will be chaotic if it starts far enough from its point of rest.

Here is a chaotic ‘double compound pendulum’, meaning that it’s made of two rods:

Almost all the pictures here were taken by Anita Chowdry, and I thank her for letting me use them. The photo of her harmonograph in the Museum of the History of Science was taken by Keiko Ikeuchi, and the copyright for this belongs to the Museum of the History of Science, Oxford. The video was made by Josh Jones. The image of a Lissajous curve was made by Alessio Damato and put on Wikicommons with a Creative Commons Attribution-Share Alike license. The double compound pendulum was made by Catslash and put on Wikicommons in the public domain.

]]>

And now for some comic relief.

Last time I explained how to download some weather data and start analyzing it, using programs written by Graham Jones. When you read that, did you think “Wow, that’s easy!” Or did you think “Huh? Run programs in R? How am I supposed to do that?”

If you’re in the latter group, you’re like me. But I managed to do it. And this is the tale of how. It’s a blow-by-blow account of my first steps, my blunders, my fears.

I hope that if you’re intimidated by programming, my tale will prove that *you too* can do this stuff… provided you have smart friends, or read this article.

More precisely, this article explains how to:

• download and run software that runs the programming language R;

• download temperature data from the National Center for Atmospheric Research;

• use R to create a file of temperature data for a given latitude/longitude rectangle for a given time interval.

I will not attempt to explain how to program in R.

If you want to copy what I’m doing, please remember that a few details depend on the operating system. Since I don’t care about operating systems, I use a Windows PC. If you use something better, some details will differ for you.

Also: at the end of this article there are some very basic programming puzzles.

First, let me explain a bit about my relation to computers.

I first saw a computer at the Lawrence Hall of Science in Berkeley, back when I was visiting my uncle in the summer of 1978. It was really cool! They had some terminals where you could type programs in BASIC and run them.

I got especially excited when he gave me the book *Computer Lib/Dream Machines* by Ted Nelson. It espoused the visionary idea that people could write texts on computers all around the world—“hypertexts” where you could click on a link in one and hop to another!

I did more programming the next year in high school, sitting in a concrete block room with a teletype terminal that was connected to a mainframe somewhere far away. I stored my programs on paper tape. But my excitement gradually dwindled, because I was having more fun doing math and physics using just pencil and paper. My own brain was easier to program than the machine. I did not start a computer company. I did not get rich. I learned quantum mechanics, and relativity, and Gödel’s theorem.

Later I did some programming in APL in college, and still later I did a bit in Mathematica in the early 1990s… but nothing much, and nothing sophisticated. Indeed, none of these languages would be the ones you’d choose to explore sophisticated ideas in computation!

I’ve just never been very interested… until now. I now want to do a lot of data analysis. It will be embarrassing to keep asking other people to do all of it for me. I need to learn how to do it myself.

Maybe you’d like to do this stuff too—or at least watch me make a fool of myself. So here’s my tale, from the start.

To use the programs written by Graham, I need to use R, a language currently popular among statisticians. It is not the language my programmer friends would want me to learn—they’d want me to use something like Python. But tough! I can learn that later.

To download R to my Windows PC, I cleverly type `download R` into Google, and go to the top website it recommends:

• http://cran.r-project.org/bin/windows/base/

I click the big fat button on top saying

`Download R 3.1.0 for Windows`

and get asked to save a file `R-3.1.0-win.exe`. I save it in my Downloads folder; it takes a while to download, since it’s 57 megabytes. When I get it, I click on it and follow the easy default installation instructions. My desktop now has a little icon on it that says `R`.

Clicking this, I get an interface where I can type commands after a red `>` symbol. Following Graham’s advice, I start by trying

`> 2^(1:8)`

which generates a list of powers of 2, from 2^1 to 2^8, like this:

`[1] 2 4 8 16 32 64 128 256`

Then I try

`> mean(2^(1:8))`

which gives the arithmetic mean of this list. Somewhat more fun is

`> plot(rnorm(20))`

which plots a bunch of points, apparently 20 standard normal deviates.

When I hear “20 standard normal deviates” I think of the members of a typical math department… but no, those are *deviants*. Standard normal *deviates* are random numbers chosen from a Gaussian distribution of mean zero and variance 1.

To do something more interesting, I need to input *data*.

The papers by Ludescher *et al* use surface air temperatures in a certain patch of the Pacific, so I want to get ahold of those. They’re here:

• NCEP/NCAR Reanalysis 1: Surface.

**NCEP** is the National Centers for Environmental Prediction, and **NCAR** is the National Center for Atmospheric Research. They have a bunch of files here containing worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid (that’s 144 × 73 grid points), from 1948 to 2010. And if you go here, the website will help you get data from within a chosen rectangle in a grid, for a chosen time interval.

These are NetCDF files. **NetCDF** stands for Network Common Data Form:

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.

According to my student Blake Pollard:

… the method of downloading a bunch of raw data via

ftp(file transfer protocol) is a great one to become familiar with. If you poke around on ftp://ftp.cdc.noaa.gov/Datasets or some other ftp servers maintained by government agencies you will find all the data you could ever want. Examples of things you can download for free: raw multispectral satellite images, processed data products, ‘re-analysis’ data (which is some way of combining analysis/simulation to assimilate data), sea surface temperature anomalies at resolutions much higher than 2.5 degrees (although you pay for that in file size). Also, believe it or not, people actually use NetCDF files quite widely, so once you know how to play around with those you’ll find the world quite literally at your fingertips!

I know about ftp: I’m so old that I know this was around before the web existed. Back then it meant “faster than ponies”. But I need to get R to accept data from these NetCDF files: that’s what scares me!

Graham said that R has a “package” called RNetCDF for using NetCDF files. So, I need to get ahold of this package, download some files in the NetCDF format, and somehow get R to eat those files with the help of this package.

At first I was utterly clueless! However, after a bit of messing around, I notice that right on top of the R interface there’s a menu item called `Packages`. I boldly click on this and choose `Install Package(s)`.

I am rewarded with an enormous alphabetically ordered list of packages… obviously statisticians have lots of stuff they like to do over and over! I find `RNetCDF`, click on that and click something like “OK”.

I’m asked if I want to use a “personal library”. I click “no”, and get an error message. So I click “yes”. The computer barfs out some promising text:

```
> utils:::menuInstallPkgs()
trying URL 'http://cran.stat.nus.edu.sg/bin/windows/contrib/3.1/RNetCDF_1.6.2-3.zip'
Content type 'application/zip' length 548584 bytes (535 Kb)
opened URL
downloaded 535 Kb

package 'RNetCDF' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
        C:\Users\JOHN\AppData\Local\Temp\Rtmp4qJ2h8\downloaded_packages
```

Success!
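In hindsight, all that menu-clicking can be done in two lines at the R prompt instead; `install.packages()` and `library()` are standard R functions, so this should do the same job:

```
# Install the RNetCDF package from CRAN (only needed once)
install.packages("RNetCDF")

# Load the package into the current session (needed each time R starts)
library(RNetCDF)
```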

But now I need to figure out how to download a file and get R to eat it and digest it with the help of `RNetCDF`.

At this point my *deus ex machina*, Graham, descends from the clouds and says:

You can download the files from your browser. It is probably easiest to do that for starters. Put

`ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/`

into the browser, then right-click a file and Save link as… This code will download a bunch of them:

```
for (year in 1950:1979) {
  download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}
```

It will put them into the “working directory”, probably `C:\Users\JOHN\Documents`. You can find the working directory using `getwd()`, and change it with `setwd()`. But you must use / not \ in the filepath.

Compared to UNIX, the Windows operating system has the peculiarity of using \ instead of / in path names, but R uses the UNIX conventions even on Windows.

So, after some mistakes, in the R interface I type

`> setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")`

and then type

`> getwd()`

to see if I’ve succeeded. I’m rewarded with

`[1] "C:/Users/JOHN/Documents/My Backups/azimuth/el nino"`

Good!

Then, following Graham’s advice, I cut-and-paste this into the R interface:

```
for (year in 1950:1979) {
  download.file(url=paste0("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/air.sig995.", year, ".nc"), destfile=paste0("air.sig995.", year, ".nc"), mode="wb")
}
```

It seems to be working! A little bar appears showing how each year’s data is getting downloaded. It chugs away, taking a couple minutes for each year’s worth of data.

Okay, now I’ve got all the worldwide daily average temperatures on a 2.5 degree latitude × 2.5 degree longitude grid from 1950 to 1979.

*The world is MINE!*

But what do I do with it? Graham’s advice is again essential, along with a little R program, or **script**, that he wrote:

The R script `netcdf-convertor.R` from

https://github.com/azimuth-project/el-nino/tree/master/R

will eat the file, digest it, and spit it out again. It contains instructions.

I go to this URL, which is on GitHub, a popular free web-based service for software development. You can store programs here, edit them, and GitHub will help you keep track of the different versions. I know almost nothing about this stuff, but I’ve seen it before, so I’m not intimidated.

I click on the blue thing that says `netcdf-convertor.R` and see something that looks like the right script. Unfortunately I can’t see how to download it! I eventually see a button I’d overlooked, cryptically labelled “Raw”. I realize that since I don’t want a roasted or oven-broiled piece of software, I should click on this. I indeed succeed in downloading `netcdf-convertor.R` this way. Graham later says I could have done something better, but oh well. I’m just happy nothing has actually exploded yet.

Once I’ve downloaded this script, I open it using a text editor and look at it. At top are a bunch of comments written by Graham:

```
######################################################
######################################################

# You should be able to use this by editing this
# section only.

setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

lat.range <- 13:14
lon.range <- 142:143

firstyear <- 1957
lastyear <- 1958

outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")

######################################################
######################################################

# Explanation

# 1. Use setwd() to set the working directory
# to the one containing the .nc files such as
# air.sig995.1951.nc.
# Example:
# setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")

# 2. Supply the latitude and longitude range. The
# NOAA data is every 2.5 degrees. The ranges are
# supplied as the number of steps of this size.
# For latitude, 1 means North Pole, 73 means South
# Pole. For longitude, 1 means 0 degrees East, 37
# is 90E, 73 is 180, 109 is 90W or 270E, 144 is
# 2.5W.

# These roughly cover Scotland.
# lat.range <- 13:14
# lon.range <- 142:143

# These are the area used by Ludescher et al,
# 2013. It is 27x69 points which are then
# subsampled to 9 by 23.
# lat.range <- 24:50
# lon.range <- 48:116

# 3. Supply the years
# firstyear <- 1950
# lastyear <- 1952

# 4. Supply the output name as a text string.
# paste0() concatenates strings which you may find
# handy:
# outputfilename <- paste0("Pacific-", firstyear, "-", lastyear, ".txt")

######################################################
######################################################

# Example of output

# S013E142 S013E143 S014E142 S014E143
# Y1950P001 281.60000272654 281.570002727211 281.60000272654 280.970002740622
# Y1950P002 280.740002745762 280.270002756268 281.070002738386 280.49000275135
# Y1950P003 280.100002760068 278.820002788678 281.120002737269 280.070002760738
# Y1950P004 281.070002738386 279.420002775267 281.620002726093 280.640002747998
# ...
# Y1950P193 285.450002640486 285.290002644062 285.720002634451 285.75000263378
# Y1950P194 285.570002637804 285.640002636239 286.070002626628 286.570002615452
# Y1950P195 285.92000262998 286.220002623275 286.200002623722 286.620002614334
# ...
# Y1950P364 276.100002849475 275.350002866238 276.37000284344 275.200002869591
# Y1950P365 276.990002829581 275.820002855733 276.020002851263 274.72000288032
# Y1951P001 278.220002802089 277.470002818853 276.700002836064 275.870002854615
# Y1951P002 277.750002812594 276.890002831817 276.650002837181 275.520002862439
# ...
# Y1952P365 280.35000275448 280.120002759621 280.370002754033 279.390002775937

# There is one row for each day, and 365 days in
# each year (leap days are omitted). In each row,
# you have temperatures in Kelvin for each grid
# point in a rectangle.

# S13E142 means 13 steps South from the North Pole
# and 142 steps East from Greenwich. The points
# are in reading order, starting at the top-left
# (Northmost, Westmost) and going along the top
# row first.

# Y1950P001 means year 1950, day 1. (P because
# longer periods might be used later.)

######################################################
######################################################
```
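Just to convince myself I understand labels like S013E142, here is a little sketch of my own (not part of Graham’s script) that builds the column labels for a grid rectangle in the same reading order as the example output:

```
# Build column labels like "S013E142" in reading order:
# top-left (northmost, westmost) first, going along each row.
lat.range <- 13:14
lon.range <- 142:143

labels <- character(0)
for (lat in lat.range) {
  for (lon in lon.range) {
    labels <- c(labels, sprintf("S%03dE%03d", lat, lon))
  }
}
print(labels)  # "S013E142" "S013E143" "S014E142" "S014E143"
```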

The instructions are admirably detailed concerning what I should do, but they don't say where the output will appear when I do it. This makes me nervous. I guess I should just try it. After all, the program is not called DestroyTheWorld.

Unfortunately, at this point a lot of things start acting weird.

It's too complicated and boring to explain in detail, but basically, I keep getting a `file missing` error message. I don't understand why this happens under some conditions and not others. I try lots of experiments.

Eventually I discover that one year of temperature data failed to download—the year 1949, right after the first year available! So, I'm getting the error message whenever I try to do anything involving that year of data.

To fix the problem, I simply download the 1949 data by hand from here:

• ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface/

(You can open ftp addresses in a web browser just like http addresses.) I put it in my working directory for R, and everything is fine again. Whew!
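A sanity check I could have run much earlier: ask R which of the expected files are actually in the working directory. `file.exists()` is a standard R function, so something like this sketch should flag any year that failed to download:

```
# List any years whose NetCDF file is missing from the working directory
years <- 1948:1979
files <- paste0("air.sig995.", years, ".nc")
missing <- years[!file.exists(files)]
print(missing)  # e.g. 1949, if that year's download failed
```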

By the time I get this file, I sort of know what to do—after all, I've spent about an hour trying lots of different things.

I decide to create a file listing temperatures near where I live in Riverside from 1948 to 1979. To do this, I open Graham's script `netcdf-convertor.R` in a text editor and change this section:

```
setwd("C:/Users/Work/AAA/Programming/ProgramOutput/Nino")
lat.range <- 13:14
lon.range <- 142:143
firstyear <- 1957
lastyear <- 1958
outputfilename <- paste0("Scotland-", firstyear, "-", lastyear, ".txt")
```

to this:

```
setwd("C:/Users/JOHN/Documents/My Backups/azimuth/el nino")
lat.range <- 23:23
lon.range <- 98:98
firstyear <- 1948
lastyear <- 1979
outputfilename <- paste0("Riverside-", firstyear, "-", lastyear, ".txt")
```

Why? Well, I want it to put the file in my working directory. I want the years from 1948 to 1979. And I want temperature data from where I live!

Googling the info, I see Riverside, California is at 33.9481° N, 117.3961° W. 34° N is about 56 degrees south of the North Pole, which is 22 steps of size 2.5°. And because some idiot decided everyone should count starting at 1 instead of 0 even in contexts like this, the North Pole itself is step 1, not step 0… so Riverside is latitude step 23. That's why I write:

`lat.range <- 23:23`

Similarly, 117.5° W is 242.5° E, which is 97 steps of size 2.5°… which counts as step 98 according to this braindead system. That's why I write:

`lon.range <- 98:98`
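The arithmetic above can be packaged into a tiny helper, so I don't have to redo it by hand for other places. This is my own sketch; the grid conventions are the ones from Graham's comments (latitude step 1 is the North Pole, longitude step 1 is 0° E, steps of 2.5°):

```
# Convert latitude (degrees N) and longitude (degrees E) to the
# 2.5-degree grid steps used by the NCEP files. Step 1 is the
# North Pole (for latitude) and 0 degrees East (for longitude).
lat.step <- function(lat.north) round((90 - lat.north) / 2.5) + 1
lon.step <- function(lon.east)  round(lon.east / 2.5) + 1

lat.step(33.9481)           # 23: the latitude step for Riverside
lon.step(360 - 117.3961)    # 98: the longitude step for Riverside
```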

Having done this, I save the file `netcdf-convertor.R` under another name, `Riverside.R`.

And then I do some stuff that it took some fiddling around to discover.

First, in my R interface I go to the menu item `File`, at far left, and click on `Open script`. It lets me browse around, so I go to my working directory for R and choose `Riverside.R`. A little window called `R editor` opens up in my R interface, containing this script.

I'm probably not doing this optimally, but I can now right-click on the R editor and see a menu with a choice called `Select all`. If I click this, everything in the window turns blue. Then I can right-click again and choose `Run line or selection`. And the script runs!
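In hindsight, there's a one-line alternative to all this clicking: `source()` is the standard R function for running a script file, so this should do the same thing (assuming the working directory is set as above and `Riverside.R` lives there):

```
# Run the whole script from the R prompt instead of the GUI editor
source("Riverside.R")
```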

*Voilà!*

It huffs and puffs, and then stops. I peek in my working directory, and see that a file called `Riverside-1948-1979.txt` has been created. When I open it, it has lots of lines, starting with these:

```
S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97
Y1948P005 284.27
Y1948P006 286.97
```

As Graham promised, each line has a year and day label, followed by a vector… which in my case is just a single number, since I only wanted the temperature in one location. I’m hoping this is the temperature near Riverside, in kelvin.

To see if this is working, I’d like to plot these temperatures and see if they make sense. Unfortunately I have no idea how to get R to take a file containing data of the sort I have and plot it! I need to learn how, but right now I’m exhausted, so I use another method to get the job done— a method that’s too suboptimal and embarrassing to describe here. (Hint: it involves the word “Excel”.)

I do a few things, but here’s the most interesting one—namely, not very interesting. I plot the temperatures for 1963:

I compare it to some publicly available data, not from Riverside, but from nearby Los Angeles:

As you can see, there was a cold day on January 13th, when the temperature dropped to 33°F. That seems to be visible on the graph I made, and looking at the data from which I made the graph, I see the temperature dropped to 251.4 kelvin on the 13th: that’s -7°F, very cold for here. It *does* get colder around Riverside than in Los Angeles in the winter, since it’s a desert, with temperatures not buffered by the ocean. So, this *does* seem compatible with the public records. That’s mildly reassuring.

But other features of the graph don’t match, and I’m not quite sure if they should or not. So, all this is very tentative and unimpressive. However, I’ve managed to get over some of my worst fears, download some temperature data, and graph it! Now I need to learn how to use R to do statistics with this data, and graph it in a better way.

You can help me out by answering these puzzles. Later I might pose puzzles where you can help us write really interesting programs. But for now it’s just about learning R.

**Puzzle 1.** Given a text file with lots of lines of this form:

```
S023E098
Y1948P001 279.95
Y1948P002 280.14
Y1948P003 282.27
Y1948P004 283.97
```

write an R program that creates a huge vector, or list of numbers, like this:

279.95, 280.14, 282.27, 283.97, ...

**Puzzle 2.** Extend the above program so that it plots this list of numbers, or outputs it to a new file.

If you want to test your programs, here’s the actual file:

If those puzzles are too easy, here are two more. I gave these last time, but everyone was too wimpy to tackle them.

**Puzzle 3.** Modify the software so that it uses the same method to predict El Niños from 1980 to 2013. You’ll have to adjust two lines in `netcdf-convertor-ludescher.R`:

```
firstyear <- 1948
lastyear <- 1980
```

should become

```
firstyear <- 1980
lastyear <- 2013
```

or whatever range of years you want. You’ll also have to adjust the names of years in `ludescher-replication.R`. Search the file for the string `19` and make the necessary changes. Ask me if you get stuck.

**Puzzle 4.** Right now we average the link strength over all pairs of nodes where one node lies in the **El Niño basin** defined by Ludescher *et al* and the other node lies outside this basin. The basin consists of the red dots here:

What happens if you change the definition of the El Niño basin? For example, can you drop those annoying two red dots that are south of the rest, without messing things up? *Can you get better results if you change the shape of the basin?*

To study these questions you need to rewrite `ludescher-replication.R` a bit. Here’s where Graham defines the El Niño basin:

```
ludescher.basin <- function() {
  lats <- c( 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6)
  lons <- c(11,12,13,14,15,16,17,18,19,20,21,22,16,22)
  stopifnot(length(lats) == length(lons))
  list(lats=lats, lons=lons)
}
```

These are lists of latitude and longitude coordinates: (5,11), (5,12), (5,13), etc. A coordinate like (5,11) means the little circle that’s 5 down and 11 across in the grid on the above map. So, that’s the leftmost point in Ludescher’s El Niño basin. By changing these lists, you can change the definition of the El Niño basin.
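For instance, the two annoying southern dots are the two entries with latitude 6 at the end of those lists. Here's a sketch of how one might drop them, leaving a single row of 12 points — untested, just to show the shape of the change:

```
# Ludescher et al's basin without the two southern (latitude 6) points:
# a single row of 12 grid points at latitude step 5
ludescher.basin <- function() {
  lats <- rep(5, 12)
  lons <- 11:22
  stopifnot(length(lats) == length(lons))
  list(lats=lats, lons=lons)
}
```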

Next time I’ll discuss some criticisms of Ludescher *et al’*s paper, but later we will return to analyzing temperature data, looking for interesting patterns.

]]>