## Exploring Climate Data (Part 1)

1 August, 2014

joint with Dara O Shayda

Emboldened by our experiments in El Niño analysis and prediction, people in the Azimuth Code Project have been starting to analyze weather and climate data. A lot of this work is exploratory, with no big conclusions. But it’s still interesting! So, let’s try some blog articles where we present this work.

This one will be about the air pressure on the island of Tahiti and in a city called Darwin in Australia: how they’re correlated, and how each one varies. This article will also be a quick introduction to some basic statistics, as well as ‘continuous wavelet transforms’.

### Darwin, Tahiti and El Niños

The El Niño Southern Oscillation is often studied using the air pressure in Darwin, Australia versus the air pressure in Tahiti. When there’s an El Niño, it gets stormy in the eastern Pacific so the air temperatures tend to be lower in Tahiti and higher in Darwin. When there’s a La Niña, it’s the other way around:

The Southern Oscillation Index or SOI is a normalized version of the monthly mean air pressure anomaly in Tahiti minus that in Darwin. Here anomaly means we subtract off the mean, and normalized means that we divide by the standard deviation.

So, the SOI tends to be negative when there’s an El Niño. On the other hand, when there’s an El Niño the Niño 3.4 index tends to be positive—this says it’s hotter than usual in a certain patch of the Pacific.

Here you can see how this works:

When the Niño 3.4 index is positive, the SOI tends to be negative, and vice versa!

It might be fun to explore precisely how well correlated they are. You can get the data to do that by clicking on the links above.

But here’s another question: how similar are the air pressure anomalies in Darwin and in Tahiti? Do we really need to take their difference, or are they so strongly anticorrelated that either one would be enough to detect an El Niño?

You can get the data to answer such questions here:

Southern Oscillation Index based upon annual standardization, Climate Analysis Section, NCAR/UCAR. This includes links to monthly sea level pressure anomalies in Darwin and Tahiti, in either ASCII format (click the second two links) or netCDF format (click the first one and read the explanation).

In fact this website has some nice graphs already made, which I might as well show you! Here’s the SOI and also the sum of the air pressure anomalies in Darwin and Tahiti, normalized in some way:

(Click to enlarge.)

If the sum were zero, the air pressure anomalies in Darwin and Tahiti would contain the same information and life would be simple. But it’s not!

How similar in character are the air pressure anomalies in Darwin and Tahiti? There are many ways to study this question. Dara tackled it by taking the air pressure anomaly data from 1866 to 2012 and computing some ‘continuous wavelet transforms’ of these air pressure anomalies. This is a good excuse for explaining how a continuous wavelet transform works.

### Very basic statistics

It helps to start with some very basic statistics. Suppose you have a list of numbers

$x = (x_1, \dots, x_n)$

You probably know how to take their mean, or average. People often write this with angle brackets:

$\displaystyle{ \langle x \rangle = \frac{1}{n} \sum_{i = 1}^n x_i }$

You can also calculate the mean of their squares:

$\displaystyle{ \langle x^2 \rangle = \frac{1}{n} \sum_{i = 1}^n x_i^2 }$

If you were naive you might think $\langle x^2 \rangle = \langle x \rangle^2,$ but in fact we have:

$\langle x^2 \rangle \ge \langle x \rangle^2$

and they’re equal only if all the $x_i$ are the same. The point is that if the numbers $x_i$ are spread out, the squares of the big ones (positive or negative) contribute more to the average of the squares than if we had averaged them out before squaring. The difference

$\langle x^2 \rangle - \langle x \rangle^2$

is called the variance; it says how spread out our numbers are. The square root of the variance is the standard deviation:

$\sigma_x = \sqrt{\langle x^2 \rangle - \langle x \rangle^2 }$

and this has the slight advantage that if you multiply all the numbers $x_i$ by some constant $c,$ the standard deviation gets multiplied by $|c|.$ (The variance gets multiplied by $c^2.$)

We can generalize the variance to a situation where we have two lists of numbers:

$x = (x_1, \dots, x_n)$

$y = (y_1, \dots, y_n)$

Namely, we can form the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle$

This reduces to the variance when $x = y.$ It measures how much $x$ and $y$ vary together — ‘hand in hand’, as it were. A bit more precisely: if $x_i$ is greater than its mean value mainly for $i$ such that $y_i$ is greater than its mean value, the covariance is positive. On the other hand, if $x_i$ tends to be greater than average when $y_i$ is smaller than average — like with the air pressures at Darwin and Tahiti — the covariance will be negative.

For example, if

$x = (1,-1), \quad y = (1,-1)$

then they ‘vary hand in hand’, and the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = 1 - 0 = 1$

is positive. But if

$x = (1,-1), \quad y = (-1,1)$

then one is positive when the other is negative, so the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = -1 - 0 = -1$

is negative.

Of course the covariance will get bigger if we multiply both $x$ and $y$ by some big number. If we don’t want this effect, we can normalize the covariance and get the correlation:

$\displaystyle{ \frac{ \langle x y \rangle - \langle x \rangle \langle y \rangle }{\sigma_x \sigma_y} }$

which will always be between $-1$ and $1.$

For example, if we compute the correlation between the air pressure anomalies at Darwin and Tahiti, measured monthly from 1866 to 2012, we get
-0.253727. This indicates that when one goes up, the other tends to go down. But since we’re not getting -1, it means they’re not completely locked into a linear relationship where one is some negative number times the other.

Okay, we’re almost ready for continuous wavelet transforms! Here is the main thing we need to know. If the mean of either $x$ or $y$ is zero, the formula for covariance simplifies a lot, to

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

So, this quantity says how much the numbers $x_i$ ‘vary hand in hand’ with the numbers $y_i,$ in the special case when one (or both) has mean zero.

We can do something similar if $x, y : \mathbb{R} \to \mathbb{R}$ are functions of time defined for all real numbers $t.$ The sum becomes an integral, and we have to give up on dividing by $n.$ We get:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t)\; d t }$

This is called the inner product of the functions $x$ and $y,$ and often it’s written $\langle x, y \rangle,$ but it’s a lot like the covariance.

### Continuous wavelet transforms

What are continuous wavelet transforms, and why should we care?

People have lots of tricks for studying ‘signals’, like series of numbers $x_i$ or functions $x : \mathbb{R} \to \mathbb{R}.$ One method is to ‘transform’ the signal in a way that reveals useful information. The Fourier transform decomposes a signal into sines and cosines of different frequencies. This lets us see how much power the signal has at different frequencies, but it doesn’t reveal how the power at different frequencies changes with time. For that we should use something else, like the Gabor transform explained by Blake Pollard in a previous post.

Sines and cosines are great, but we might want to look for other patterns in a signal. A ‘continuous wavelet transform’ lets us scan a signal for appearances of a given pattern at different times and also at different time scales: a pattern could go by quickly, or in a stretched out slow way.

To implement the continuous wavelet transform, we need a signal and a pattern to look for. The signal could be a function $x : \mathbb{R} \to \mathbb{R}.$ The pattern would then be another function $y: \mathbb{R} \to \mathbb{R},$ usually called a wavelet.

Here’s an example of a wavelet:

If we’re in a relaxed mood, we could call any function that looks like a bump with wiggles in it a wavelet. There are lots of famous wavelets, but this particular one is the fourth derivative of a certain Gaussian. Mathematica calls this particular wavelet DGaussianWavelet[4], and you can look up the formula under ‘Details’ on their webpage.

However, the exact formula doesn’t matter at all now! If we call this wavelet $y,$ all that matters is that it’s a bump with wiggles on it, and that its mean value is 0, or more precisely:

$\displaystyle{ \int_{-\infty}^\infty y(t) \; d t = 0 }$

As we saw in the last section, this fact lets us take our function $x$ and the wavelet $y$ and see how much they ‘vary hand it hand’ simply by computing their inner product:

$\displaystyle{ \langle x , y \rangle = \int_{-\infty}^\infty x(t) y(t)\; d t }$

Loosely speaking, this measures the ‘amount of $y$-shaped wiggle in the function $x$’. It’s amazing how hard it is to say something in plain English that perfectly captures the meaning of a simple formula like the above one—so take the quoted phrase with a huge grain of salt. But it gives a rough intuition.

Our wavelet $y$ happens to be centered at $t = 0.$ However, we might be interested in $y$-shaped wiggles that are centered not at zero but at some other number $s.$ We could detect these by shifting the function $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t-s)\; d t }$

We could also be interested in measuring the amount of some stretched-out or squashed version of a $y$-shaped wiggle in the function $x.$ Again we could do this by changing $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t}{P}\right) \; d t }$

When $P$ is big, we get a stretched-out version of $y.$ People sometimes call $P$ the period, since the period of the wiggles in $y$ will be proportional to this (though usually not equal to it).

Finally, we can combine these ideas, and compute

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t- s}{P}\right)\; dt }$

This is a function of the shift $s$ and period $P$ which says how much of the $s$-shifted, $P$-stretched wavelet $y$ is lurking in the function $x.$ It’s a version of the continuous wavelet transform!

Mathematica implements this idea for time series, meaning lists of numbers $x = (x_1,\dots,x_n)$ instead of functions $x : \mathbb{R} \to \mathbb{R}.$ The idea is that we think of the numbers as samples of a function $x$:

$x_1 = x(\Delta t)$

$x_2 = x(2 \Delta t)$

and so on, where $\Delta t$ is some time step, and replace the integral above by a suitable sum. Mathematica has a function ContinuousWaveletTransform that does this, giving

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \sum_{i = 1}^n x_i \; y\left(\frac{i \Delta t - s}{P}\right) }$

The factor of $1/\sqrt{P}$ in front is a useful extra trick: it’s the right way to compensate for the fact that when you stretch out out your wavelet $y$ by a factor of $P,$ it gets bigger. So, when we’re doing integrals, we should define the continuous wavelet transform of $y$ by:

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \int_{-\infty}^\infty x(t) y(\frac{t- s}{P})\; dt }$

### The results

Dara Shayda started with the air pressure anomaly at Darwin and Tahiti, measured monthly from 1866 to 2012. Taking DGaussianWavelet[4] as his wavelet, he computed the continuous wavelet transform $w(s,P)$ as above. To show us the answer, he created a scalogram:

This is a 2-dimensional color plot showing roughly how big the continuous wavelet transform $w(s,P)$ is for different shifts $s$ and periods $P.$ Blue means it’s very small, green means it’s bigger, yellow means even bigger and red means very large.

Tahiti gave this:

You’ll notice that the patterns at Darwin and Tahiti are similar in character, but notably different in detail. For example, the red spots, where our chosen wavelet shows up strongly with period of order ~100 months, occur at different times.

Puzzle 1. What is the meaning of the ‘spikes’ in these scalograms? What sort of signal would give a spike of this sort?

Puzzle 2. Do a Gabor transform, also known as a ‘windowed Fourier transform’, of the same data. Blake Pollard explained the Gabor transform in his article Milankovitch vs the Ice Ages. This is a way to see how much a signal wiggles at a given frequency at a given time: we multiply the signal by a shifted Gaussian and then takes its Fourier transform.

Puzzle 3. Read about continuous wavelet transforms. If we want to reconstruct our signal $x$ from its continuous wavelet transform, why should we use a wavelet $y$ with

$\displaystyle{\int_{-\infty}^\infty y(t) \; d t = 0 ? }$

In fact we want a somewhat stronger condition, which is implied by the above equation when the Fourier transform of $y$ is smooth and integrable:

Continuous wavelet transform, Wikipedia.

### Another way to understand correlations

David Tweed mentioned another approach from signal processing to understanding the quantity

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

If we’ve got two lists of data $x$ and $y$ that we want to compare to see if they behave similarly, the first thing we ought to do is multiplicatively scale each one so they’re of comparable magnitude. There are various possibilities for assigning a scale, but a reasonable one is to ensure they have equal ‘energy’

$\displaystyle{ \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 }$

(This can be achieved by dividing each list by its standard deviation, which is equivalent to what was done in the main derivation above.) Once we’ve done that then it’s clear that looking at

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

gives small values when they have a very good match and progressively bigger values as they become less similar. Observe that

$\begin{array}{ccl} \displaystyle{\sum_{i=1}^n (x_i-y_i)^2 } &=& \displaystyle{ \sum_{i=1}^n (x_i^2 - 2 x_i y_i + y_i^2) }\\ &=& \displaystyle{ \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 } \end{array}$

Since we’ve scaled things so that $\sum_{i=1}^n x_i^2$ and $\sum_{i=1}^n y_i^2$ are constants, we can see that when $\sum_{i=1}^n x_i y_i$ becomes bigger,

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

becomes smaller. So,

$\displaystyle{\sum_{i=1}^n x_i y_i}$

serves as a measure of how close the lists are, under these assumptions.

## The Stochastic Resonance Program (Part 1)

10 May, 2014

guest post by David Tanzer

At the Azimuth Code Project, we are aiming to produce educational software that is relevant to the Earth sciences and the study of climate. Our present software takes the form of interactive web pages, which allow you to experiment with the parameters of models and view their outputs. But to fully understand the meaning of a program, we need to know about the concepts and theories that inform it. So we will be writing articles to explain both the programs themselves and the math and science behind them.

In this two-part series, I’ll explain this program:

Check it out—it runs on your browser! It was created by Allan Erskine and Glyn Adgie. In the Azimuth blog article Increasing the Signal-to-Noise Ratio with More Noise, Glyn Adgie and Tim van Beek give a nice explanation of the idea of stochastic resonance, which includes some clear and exciting graphs.

My goal today is give a compact, developer-oriented introduction to stochastic resonance, which will set the context for the next blog article, where I’ll dissect the program itself. By way of introduction, I am a software developer with research training in computer science. It’s a new area for me, and any clarifications will be welcome!

### The concept of stochastic resonance

Stochastic resonance is a phenomenon, occurring under certain circumstances, in which a noise source may amplify the effect of a weak signal. This concept was used in an early hypothesis about the timing of ice-age cycles, and has since been applied to a wide range of phenomena, including neuronal detection mechanisms and patterns of traffic congestion.

Suppose we have a signal detector whose internal, analog state is driven by an input signal, and suppose the analog states are partitioned into two regions, called “on” and “off” — this is a digital state, abstracted from the analog state. With a light switch, we could take the force as the input signal, the angle as the analog state, and the up/down classification of the angle as the digital state.

Consider the effect of a periodic input signal on the digital state. Suppose the wave amplitude is not large enough to change the digital state, yet large enough to drive the analog state close to the digital state boundary. Then, a bit of random noise, occurring near the peak of an input cycle, may “tap” the system over to the other digital state. So we will see a probability of state-transitions that is synchronized with the input signal. In a complex way, the noise has amplified the input signal.

But it’s a pretty funky amplifier! Here is a picture from the Azimuth library article on stochastic resonance:

Stochastic resonance has been found in the signal detection mechanisms of neurons. There are, for example, cells in the tails of crayfish that are tuned to low-frequency signals in the water caused by predator motions. These signals are too weak to cross the firing threshold for the neurons, but with the right amount of noise, they do trigger the neurons.

See:

Stochastic resonance, Azimuth Library.

Stochastic resonance in neurobiology, David Lyttle.

### Bistable stochastic resonance and Milankovitch theories of ice-age cycles

Stochastic resonance was originally formulated in terms of systems that are bistable — where each digital state is the basin of attraction of a stable equilibrium.

An early application of stochastic resonance was to a hypothesis, within the framework of bistable climate dynamics, about the timing of the ice-age cycles. Although it has not been confirmed, it remains of interest (1) historically, (2) because the timing of ice-age cycles remains an open problem, and (3) because the Milankovitch hypothesis upon which it rests is an active part of the current research.

In the bistable model, the climate states are a cold, “snowball” Earth and a hot, iceless Earth. The snowball Earth is stable because it is white, and hence reflects solar energy, which keeps it frozen. The iceless Earth is stable because it is dark, and hence absorbs solar energy, which keeps it melted.

The Milankovitch hypothesis states that the drivers of climate state change are long-duration cycles in the insolation — the solar energy received in the northern latitudes — caused by periodic changes in the Earth’s orbital parameters. The north is significant because that is where the glaciers are concentrated, and so a sufficient “pulse” in northern temperatures could initiate a state change.

Three relevant astronomical cycles have been identified:

• Changing of the eccentricity of the Earth’s elliptical orbit, with a period of 100 kiloyears

• Changing of the obliquity (tilt) of the Earth’s axis, with a period of 41 kiloyears

• Precession (swiveling) of the Earth’s axis, with a period of 23 kiloyears

In the stochastic resonance hypothesis, the Milankovitch signal is amplified by random events to produce climate state changes. In more recent Milankovitch theories, a deterministic forcing mechanism is used. In a theory by Didier Paillard, the climate is modeled with three states, called interglacial, mild glacial and full glacial, and the state changes depend on the volume of ice as well as the insolation.

See:

Milankovitch cycle, Azimuth Library.

Mathematics of the environment (part 10), John Baez. This gives an exposition of Paillard’s theory.

### Bistable systems defined by a potential function

Any smooth function with two local minima can be used to define a bistable system. For instance, consider the function $V(x) = x^4/4 - x^2/2$:

To define the bistable system, construct a differential equation where the time derivative of x is set to the negative of the derivative of the potential at x:

$dx/dt = -V'(x) = -x^3 + x = x(1 - x^2)$

So, for instance, where the potential graph is sloping upward as $x$ increases, $-V'(x)$ is negative, and this sends $X(t)$ ‘downhill’ towards the minimum.

The roots of $V'(x)$ yield stable equilibria at 1 and -1, and an unstable equilibrium at 0. The latter separates the basins of attraction for the stable equilibria.

### Discrete stochastic resonance

Now let’s look at a discrete-time model which exhibits stochastic resonance. This is the model used in the Azimuth demo program.

We construct the discrete-time derivative, using the potential function, a sampled sine wave, and a normally distributed random number:

$\Delta X_t = -V'(X_t) * \Delta t + \mathrm{Wave}(t) + \mathrm{Noise}(t) =$
$X_t (1 - X_t^2) \Delta t + \alpha * \sin(\omega t) + \beta * \mathrm{GaussianSample}(t)$

where $\Delta t$ is a constant and $t$ is restricted to multiples of $\Delta t.$

This equation is the discrete-time counterpart to a continuous-time stochastic differential equation.

Next time, we will look into the Azimuth demo program itself.

## Noether’s Theorem: Quantum vs Stochastic

3 May, 2014

guest post by Ville Bergholm

In 1915 Emmy Noether discovered an important connection between the symmetries of a system and its conserved quantities. Her result has become a staple of modern physics and is known as Noether’s theorem.

The theorem and its generalizations have found particularly wide use in quantum theory. Those of you following the Network Theory series here on Azimuth might recall Part 11 where John Baez and Brendan Fong proved a version of Noether’s theorem for stochastic systems. Their result is now published here:

• John Baez and Brendan Fong, A Noether theorem for stochastic mechanics, J. Math. Phys. 54:013301 (2013).

One goal of the network theory series here on Azimuth has been to merge ideas appearing in quantum theory with other disciplines. John and Brendan proved their stochastic version of Noether’s theorem by exploiting ‘stochastic mechanics’ which was formulated in the network theory series to mathematically resemble quantum theory. Their result, which we will outline below, was different than what would be expected in quantum theory, so it is interesting to try to figure out why.

Recently Jacob Biamonte, Mauro Faccin and myself have been working to try to get to the bottom of these differences. What we’ve done is prove a version of Noether’s theorem for Dirichlet operators. As you may recall from Parts 16 and 20 of the network theory series, these are the operators that generate both stochastic and quantum processes. In the language of the series, they lie in the intersection of stochastic and quantum mechanics. So, they are a subclass of the infinitesimal stochastic operators considered in John and Brendan’s work.

The extra structure of Dirichlet operators—compared with the wider class of infinitesimal stochastic operators—provided a handle for us to dig a little deeper into understanding the intersection of these two theories. By the end of this article, astute readers will be able to prove that Dirichlet operators generate doubly stochastic processes.

Before we get into the details of our proof, let’s recall first how conservation laws work in quantum mechanics, and then contrast this with what John and Brendan discovered for stochastic systems. (For a more detailed comparison between the stochastic and quantum versions of the theorem, see Part 13 of the network theory series.)

### The quantum case

I’ll assume you’re familiar with quantum theory, but let’s start with a few reminders.

In standard quantum theory, when we have a closed system with $n$ states, the unitary time evolution of a state $|\psi(t)\rangle$ is generated by a self-adjoint $n \times n$ matrix $H$ called the Hamiltonian. In other words, $|\psi(t)\rangle$ satisfies Schrödinger’s equation:

$i \hbar \displaystyle{\frac{d}{d t}} |\psi(t) \rangle = H |\psi(t) \rangle.$

The state of a system starting off at time zero in the state $|\psi_0 \rangle$ and evolving for a time $t$ is then given by

$|\psi(t) \rangle = e^{-i t H}|\psi_0 \rangle.$

The observable properties of a quantum system are associated with self-adjoint operators. In the state $|\psi \rangle,$ the expected value of the observable associated to a self-adjoint operator $O$ is

$\langle O \rangle_{\psi} = \langle \psi | O | \psi \rangle$

This expected value is constant in time for all states if and only if $O$ commutes with the Hamiltonian $H$:

$[O, H] = 0 \quad \iff \quad \displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0 \quad \forall \: |\psi_0 \rangle, \forall t.$

In this case we say $O$ is a ‘conserved quantity’. The fact that we have two equivalent conditions for this is a quantum version of Noether’s theorem!

### The stochastic case

In stochastic mechanics, the story changes a bit. Now a state $|\psi(t)\rangle$ is a probability distribution: a vector with $n$ nonnegative components that sum to 1. Schrödinger’s equation gets replaced by the master equation:

$\displaystyle{\frac{d}{d t}} |\psi(t) \rangle = H |\psi(t) \rangle$

If we start with a probability distribution $|\psi_0 \rangle$ at time zero and evolve it according to this equation, at any later time have

$|\psi(t)\rangle = e^{t H} |\psi_0 \rangle.$

We want this always be a probability distribution. To ensure that this is so, the Hamiltonian $H$ must be infinitesimal stochastic: that is, a real-valued $n \times n$ matrix where the off-diagonal entries are nonnegative and the entries of each column sum to zero. It no longer needs to be self-adjoint!

When $H$ is infinitesimal stochastic, the operators $e^{t H}$ map the set of probability distributions to itself whenever $t \ge 0,$ and we call this family of operators a continuous-time Markov process, or more precisely a Markov semigroup.

In stochastic mechanics, we say an observable $O$ is a real diagonal $n \times n$ matrix, and its expected value is given by

$\langle O\rangle_{\psi} = \langle \hat{O} | \psi \rangle$

where $\hat{O}$ is the vector built from the diagonal entries of $O.$ More concretely,

$\langle O\rangle_{\psi} = \displaystyle{ \sum_i O_{i i} \psi_i }$

where $\psi_i$ is the $i$th component of the vector $|\psi\rangle.$

Here is a version of Noether’s theorem for stochastic mechanics:

Noether’s Theorem for Markov Processes (Baez–Fong). Suppose $H$ is an infinitesimal stochastic operator and $O$ is an observable. Then

$[O,H] =0$

if and only if

$\displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0$

and

$\displaystyle{\frac{d}{d t}}\langle O^2 \rangle_{\psi(t)} = 0$

for all $t \ge 0$ and all $\psi(t)$ obeying the master equation.   █

So, just as in quantum mechanics, whenever $[O,H]=0$ the expected value of $O$ will be conserved:

$\displaystyle{\frac{d}{d t}} \langle O\rangle_{\psi(t)} = 0$

for any $\psi_0$ and all $t \ge 0.$ However, John and Brendan saw that—unlike in quantum mechanics—you need more than just the expectation value of the observable $O$ to be constant to obtain the equation $[O,H]=0.$ You really need both

$\displaystyle{\frac{d}{d t}} \langle O\rangle_{\psi(t)} = 0$

together with

$\displaystyle{\frac{d}{d t}} \langle O^2\rangle_{\psi(t)} = 0$

for all initial data $\psi_0$ to be sure that $[O,H]=0.$

So it’s a bit subtle, but symmetries and conserved quantities have a rather different relationship than they do in quantum theory.

### A Noether theorem for Dirichlet operators

But what if the infinitesimal generator of our Markov semigroup is also self-adjoint? In other words, what if $H$ is both an infinitesimal stochastic matrix but also its own transpose: $H = H^\top$? Then it’s called a Dirichlet operator… and we found that in this case, we get a stochastic version of Noether’s theorem that more closely resembles the usual quantum one:

Noether’s Theorem for Dirichlet Operators. If $H$ is a Dirichlet operator and $O$ is an observable, then

$[O, H] = 0 \quad \iff \quad \displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0 \quad \forall \: |\psi_0 \rangle, \forall t \ge 0$

Proof. The $\Rightarrow$ direction is easy to show, and it follows from John and Brendan’s theorem. The point is to show the $\Leftarrow$ direction. Since $H$ is self-adjoint, we may use a spectral decomposition:

$H = \displaystyle{ \sum_k E_k |\phi_k \rangle \langle \phi_k |}$

where $\phi_k$ are an orthonormal basis of eigenvectors, and $E_k$ are the corresponding eigenvalues. We then have:

$\displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = \langle \hat{O} | H e^{t H} |\psi_0 \rangle = 0 \quad \forall \: |\psi_0 \rangle, \forall t \ge 0$

$\iff \quad \langle \hat{O}| H e^{t H} = 0 \quad \forall t \ge 0$

$\iff \quad \sum_k \langle \hat{O} | \phi_k \rangle E_k e^{t E_k} \langle \phi_k| = 0 \quad \forall t \ge 0$

$\iff \quad \langle \hat{O} | \phi_k \rangle E_k e^{t E_k} = 0 \quad \forall t \ge 0$

$\iff \quad |\hat{O} \rangle \in \mathrm{Span}\{|\phi_k \rangle \, : \; E_k = 0\} = \ker \: H,$

where the third equivalence is due to the vectors $|\phi_k \rangle$ being linearly independent. For any infinitesimal stochastic operator $H$ the corresponding transition graph consists of $m$ connected components iff we can reorder (permute) the states of the system such that $H$ becomes block-diagonal with $m$ blocks. Now it is easy to see that the kernel of $H$ is spanned by $m$ eigenvectors, one for each block. Since $H$ is also symmetric, the elements of each such vector can be chosen to be ones within the block and zeros outside it. Consequently

$|\hat{O} \rangle \in \ker \: H$

implies that we can choose the basis of eigenvectors of $O$ to be the vectors $|\phi_k \rangle,$ which implies

$[O, H] = 0$

Alternatively,

$|\hat{O} \rangle \in \ker \, H$

implies that

$|\hat{O^2} \rangle \in \ker \: H \; \iff \; \cdots \; \iff \; \displaystyle{\frac{d}{d t}} \langle O^2 \rangle_{\psi(t)} = 0 \; \forall \: |\psi_0 \rangle, \forall t \ge 0,$

where we have used the above sequence of equivalences backwards. Now, using John and Brendan’s original proof, we can obtain $[O, H] = 0.$   █

In summary, by restricting ourselves to the intersection of quantum and stochastic generators, we have found a version of Noether’s theorem for stochastic mechanics that looks formally just like the quantum version! However, this simplification comes at a cost. We find that the only observables $O$ whose expected value remains constant with time are those of the very restricted type described above, where the observable has the same value in every state in a connected component.

### Puzzles

Suppose we have a graph whose graph Laplacian matrix $H$ generates a Markov semigroup as follows:

$U(t) = e^{t H}$

Puzzle 1. Suppose that also $H = H^\top,$ so that $H$ is a Dirichlet operator and hence $i H$ generates a 1-parameter unitary group. Show that the indegree and outdegree of any node of our graph must be equal. Graphs with this property are called balanced.

Puzzle 2. Suppose that $U(t) = e^{t H}$ is doubly stochastic Markov semigroup, meaning that for all $t \ge 0$ each row and each column of $U(t)$ sums to 1:

$\displaystyle{ \sum_i U(t)_{i j} = \sum_j U(t)_{i j} = 1 }$

and all the matrix entries are nonnegative. Show that the Hamiltonian $H$ obeys

$\displaystyle{\sum_i H_{i j} = \sum_j H_{i j} = 0 }$

and all the off-diagonal entries of $H$ are nonnegative. Show the converse is also true.

Puzzle 3. Prove that any doubly stochastic Markov semigroup $U(t)$ is of the form $e^{t H}$ where $H$ is the graph Laplacian of a balanced graph.

Puzzle 4. Let $O(t)$ be a possibly time-dependent observable, and write $\langle O(t) \rangle_{\psi(t)}$ for its expected value with respect to some initial state $\psi_0$ evolving according to the master equation. Show that

$\displaystyle{ \frac{d}{d t}\langle O(t)\rangle_{\psi(t)} = \left\langle [O(t), H] \right\rangle_{\psi(t)} + \left\langle \frac{\partial O(t)}{\partial t}\right\rangle_{\psi(t)} }$

This is a stochastic version of the Ehrenfest theorem.

## Network Theory III

16 March, 2014

In the last of my Oxford talks I explain how entropy and relative entropy can be understood using certain categories related to probability theory… and how these categories also let us understand Bayesian networks!

The first two parts are explanations of these papers:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss

• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy.

Somewhere around here the talk was interrupted by a fire drill, waking up the entire audience!

By the way, in my talk I mistakenly said that relative entropy is a continuous functor; in fact it’s just lower semicontinuous. I’ve fixed this in my slides.

The third part of my talk was my own interpretation of Brendan Fong’s master’s thesis:

• Brendan Fong, Causal Theories: a Categorical Perspective on Bayesian Networks.

I took a slightly different approach, by saying that a causal theory $\mathcal{C}_G$ is the free category with products on certain objects and morphisms coming from a directed acyclic graph $G$. In his thesis he said $\mathcal{C}_G$ was the free symmetric monoidal category where each generating object is equipped with a cocommutative comonoid structure. This is close to a category with finite products, though perhaps not quite the same: a symmetric monoidal category where every object is equipped with a cocommutative comonoid structure in a natural way (i.e., making a bunch of squares commute) is a category with finite products. It would be interesting to see if this difference hurts or helps.

By making this slight change, I am claiming that causal theories can be seen as algebraic theories in the sense of Lawvere. This would be a very good thing, since we know a lot about those.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

## Network Theory II

12 March, 2014

Chemists are secretly doing applied category theory! When chemists list a bunch of chemical reactions like

C + O₂ → CO₂

they are secretly describing a ‘category’.

That shouldn’t be surprising. A category is simply a collection of things called objects together with things called morphisms going from one object to another, often written

f: x → y

The rules of a category say:

1) we can compose a morphism f: x → y and another morphism g: y → z to get an arrow gf: x → z,

2) (hg)f = h(gf), so we don’t need to bother with parentheses when composing arrows,

3) every object x has an identity morphism 1ₓ: x → x that obeys 1ₓ f = f and f 1ₓ = f.

Whenever we have a bunch of things (objects) and processes (arrows) that take one thing to another, we’re likely to have a category. In chemistry, the objects are bunches of molecules and the arrows are chemical reactions. But we can ‘add’ bunches of molecules and also add reactions, so we have something more than a mere category: we have something called a symmetric monoidal category.

My talk here, part of a series, is an explanation of this viewpoint and how we can use it to take ideas from elementary particle physics and apply them to chemistry! ﻿For more details try this free book:

• John Baez and Jacob Biamonte, A Course on Quantum Techniques for Stochastic Mechanics.

as well as this paper on the Anderson–Craciun–Kurtz theorem (discussed in my talk):

• John Baez and Brendan Fong, Quantum techniques for studying equilibrium in reaction networks.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

## Markov Models of Social Change (Part 1)

24 February, 2014

guest post by Alastair Jamieson-Lane

The world is complex, and making choices in a complex world is sometimes difficult.

As any leader knows, decisions must often be made with incomplete information. To make matters worse, the experts and scientists who are meant to advise on these important matters are also doing so with incomplete information—usually limited to only one or two specialist fields. When decisions need to be made that are dependent on multiple real-world systems, and your various advisors find it difficult to communicate, this can be problematic!

The generally accepted approach is to listen to whichever advisor tells you the things you want to hear.

When such an approach fails (for whatever mysterious and inexplicable reason) it might be prudent to consider such approaches as Bayesian inference, analysis of competing hypotheses or cross-impact balance analysis.

Because these methods require experts to formalize their opinions in an explicit, discipline neutral manner, we avoid many of the problems mentioned above. Also, if everything goes horribly wrong, you can blame the algorithm, and send the rioting public down to the local university to complain there.

In this blog article I will describe cross-impact balance analysis and a recent extension to this method, explaining its use, as well as some basic mathematical underpinnings. No familiarity with cross-impact balance analysis will be required.

### Wait—who is this guy?

Since this is my first time writing a blog post here, I hear introductions are in order.

Hi. I’m Alastair.

I am currently a Master’s student at the University of British Columbia, studying mathematics. In particular, I’m aiming to use evolutionary game theory to study academic publishing and hiring practices… and from there hopefully move on to studying governments (we’ll see how the PhD goes). I figure that both those systems seem important to solving the problems we’ve built for ourselves, and both may be under increasing pressure in coming years.

But that’s not what I’m here for today! Today I’m here to tell the story of cross-impact balance analysis, a tool I was introduced to at the complex systems summer school in Santa Fe.

### The story

Suppose (for example) that the local oracle has foretold that burning the forests will anger the nature gods

… and that if you do not put restrictions in place, your crops will wither and die.

Well, that doesn’t sound very good.

The merchant’s guild claims that such restrictions will cause all trade to grind to a halt.

Your most trusted generals point out that weakened trade will leave you vulnerable to invasion from all neighboring kingdoms.

The sailors guild adds that the wrath of Poseidon might make nautical trade more difficult.

The alchemists propose alternative sources of heat…

… while the druids propose special crops as a way of resisting the wrath of the gods…

… and so on.

Given this complex web of interaction, it might be a good time to consult the philosophers.

### Overview of CIB

This brings us to the question of what CIB (Cross-Impact Balance) analysis is, and how to use it.

At its heart, CIB analysis demands this: first, you must consider what aspects of the world you are interested in studying. This could be environmental or economic status, military expenditure, or the laws governing genetic modification. These we refer to as “descriptors”. For each “descriptor” we must create a list of possible “states”.

For example, if the descriptor we are interested in were “global temperature change” our states might be “+5 degree”, “+4 degrees” and so on down to “-2 degrees”.

The states of a descriptor are not meant to be all-encompassing, or offer complete detail, and they need not be numerical. For example, the descriptor “Agricultural policy” might have such states as “Permaculture subsidy”, “Genetic engineering”, “Intensive farming” or “No policy”.

For each of these states, we ask our panel of experts whether such a state would increase or decrease the tendency for some other descriptor to be in a particular state.

For example, we might ask: “On a scale from -3 to 3, how much does the agricultural policy of Intensive farming increase the probability that we will see global temperature increases of +2 degrees?”

By combining the opinions of a variety of experts in each field, and weighting based on certainty and expertise, we are able to construct matrices, much like the one below:

The above matrix is a description of my ant farm. The health of my colony is determined by the population, income, and education levels of my ants. For a less ant focused version of the above, please refer to:

• Elisabeth A. Lloyd and Vanessa J. Schweizer, Objectivity and a comparison of methodological scenario approaches for climate change research, Synthese (2013).

For any possible combination of descriptor states (referred to as a scenario) we can calculate the total impact on all possible descriptors. In the current scenario we have low population, high income and medium education (see highlighted rows).

Because the current scenario has high ant income, this strongly influences us to have low population (+3) and prevents a jump to high population (-3). This combined with the non-influence from education (zeros) leads to low population being the most favoured state for our population descriptor. Thus we expect no change. We say this is “consistent”.

Education however sees a different story. Here we have a strong influence towards high education levels (summing the column gives a total of 13). Thus our current state (medium education) is inconsistent, and we would expect the abundance of ant wealth to lead to an improvements in the ant schooling system.

Classical CIB analysis acts as a way to classify which hypothetical situations are consistent, and which are not.

Now, it is all well and good to claim that some scenarios are stable, but the real use of such a tool is in predicting (and influencing) the future.

By applying a deterministic rule that determines how inconsistencies are resolved, we can produce a “succession rule”. The most straight-forward example is to replace all descriptor states with whichever state is most favoured by the current scenario. In the example above we would switch to “Low population, medium income, high education”. A generation later we would switch back to “Low population, High income, medium education”, soon finding ourselves trapped in a loop.

All such rules will always lead to either a loop or a “sink”: a self consistent scenario which is succeeded only by itself.

So, how can we use this? How will this help us deal with the wrath of the gods (or ant farms)?

Firstly: we can identify loops and consistent scenarios which we believe are most favourable. It’s all well and good imagining some future utopia, but if it is inconsistent with itself, and will immediately lead to a slide into less favourable scenarios then we should not aim for it, we should find that most favourable realistic scenario and aim for that one.

Secondly: We can examine all our consistent scenarios, and determine whose “basin of attraction” we find ourselves in: that is, which scenario are we likely to end up in.

Thirdly: Suppose we could change our influence matrix slightly? How would we change it to favour scenarios we most prefer? If you don’t like the rules, change the game—or at the very least find out WHAT we would need to change to have the best effect.

### Concerns and caveats

So… what are the problems we might encounter? What are the drawbacks?

Well, first of all, we note that the real world does not tend to reach any form of eternal static scenario or perfect cycle. The fact that our model does might be regarded as reason for suspicion.

Secondly, although the classical method contains succession analysis, this analysis is not necessarily intended as a completely literal “prediction” of events. It gives a rough idea of the basins of attraction of our cycles and consistent scenarios, but is also somewhat arbitrary. What succession rule is most appropriate? Do all descriptors update simultaneously? Or only the one with the most “pressure”? Are our descriptors given in order of malleability, and only the fastest changing descriptor will change?

Thirdly, in collapsing our description of the world down into a finite number of states we are ignoring many tiny details. Most of these details are not important, but in assuming that our succession rules are deterministic, we imply that these details have no impact whatsoever.

If we instead treat succession as a somewhat random process, the first two of these problems can be solved, and the third somewhat reduced.

### Stochastic succession

In the classical CIB succession analysis, some rule is selected which deterministically decides which scenario follows from the present. Stochastic succession analysis instead tells us the probability that a given scenario will lead to another.

The simplest example of a stochastic succession rule is to simply select a single descriptor at random each time step, and only consider updates that might happen to that descriptor. This we refer to as dice succession. This (in some ways) represents hidden information: two systems that might look identical on the surface from the point of view of our very blockish CIB analysis might be different enough underneath to lead to different outcomes. If we have a shaky agricultural system, but a large amount of up-and-coming research, then which of these two factors becomes important first is down to the luck of the draw. Rather than attempt to model this fine detail, we instead merely accept it and incorporate this uncertainty into our model.

Even this most simplistic change leads to dramatics effects on our system. Most importantly, almost all cycles vanish from our results, as forks in the road allow us to diverge from the path of the cycle.

We can take stochastic succession further and consider more exotic rules for our transitions, ones that allow any transition to take place, not merely those that are most favored. For example:

$P(x,y) = A e^{I_x(y)/T}$

Here $x$ is our current scenario, $y$ is some possible future scenario, and $I_x(y)$ is the total impact score of $y$ from the perspective of $x$. $A$ is a simple normalizing constant, and $T$ is our system’s temperature. High temperature systems are dominated by random noise, while low temperature systems are dominated by the influences described by our experts.

Impact score is calculated by summing the impact of each state of our current scenario, on each state of our target scenario. For example, for the above, suppose we want to find $I_x(y)$ when $x$ is the given scenario “Low population, High income, medium education” and $y$ was the scenario “Medium population, medium income, High education”. We consider all numbers that are in rows which were states of $x$ and in columns that are states of $y$. This would give:

$I_x(y)= (0+0+0) + (-2 +0 +10) +(6+7+0) = 21$

Here each bracket refers to the sum of a particular column.
More generically we can write the formula as:

$\displaystyle{ I_x(y)= \sum_{i \subset x, \;j \subset y} M_{i,j} }$

Here $M_{i,j}$ refers to an entry in our cross-impact balance matrix, $i$ and $j$ are both states, and $i \subset x$ reads as “$i$ is a state of $x$”.

We refer to this function for computing transition probabilities as the Boltzmann succession law, due to its similarity to the Boltzmann distribution found in physics. We use it merely as an example, and by no means wish to imply that we expect the transitions for our true system to act in a precisely Boltzmann-like manner. Alternative functions can, and should, be experimented with. The Boltzmann succession law is however an effective example and has a number of nice properties: $P(x,y)$ is always positive, unchanged by adding a constant to every element of the cross-impact balance matrix, contains adjustable parameters, and unbounded above.

The Boltzmann succession rule is what I will refer to as fully stochastic: it allows transitions even against our experts’ judgement (with low probability). This is in contrast to dice succession which picks a direction at random, but still contains scenarios from which our system can not escape.

### Effects of stochastic succession

‘Partially stochastic’ processes such as the dice rule have very limited effect on the long term behavior of the model. Aside from removing most cycles, they behave almost exactly like our deterministic succession rules. So, let us instead discuss the more interesting fully stochastic succession rules.

In the fully stochastic system we can ask “after a very long time, what is the probability we will be in scenario $x$?”

By asking this question we can get some idea of the relative importance of all our future scenarios and states.

For example, if the scenario “high population, low education, low income” has a 40% probability in the long term, while most other scenarios have a probability of 0.2%, we can see that this scenario is crucial to the understanding of our system. Often scenarios already identified by deterministic succession analysis are the ones with the greatest long term probability—but by looking at long term probability we also gain information about the relative importance of each scenario.

In addition, we can encounter scenarios which are themselves inconsistent, but form cycles and/or clusters of interconnected scenarios. We can also notice scenarios that while technically ‘consistent’ in the deterministic rules are only barely so, and have limited weight due to a limited basin of attraction. We might identify scenarios that seem familiar in the real world, but are apparently highly unlikely in our analysis, indicating either that we should expect change… or perhaps suggesting a missing descriptor or a cross-impact in need of tweaking.

Armed with such a model, we can investigate what we can do to increase the short term and long term likelihood of desirable scenarios, and decrease the likelihood of undesirable scenarios.

As a last note, here are a few freely available resources that may prove useful. For a more formal introduction to CIB, try:

• Wolfgang Weimer-Jehle, Cross-impact balances: a system-theoretical approach to cross-impact analysis, Technological Forecasting & Social Change 73 (2006), 334–361.

• Wolfgang Weimer-Jehle, Properties of cross-impact balance analysis.

You can find free software for doing a classical CIB analysis here:

• ZIRIUS, ScenarioWizard.

ZIRIUS is the Research Center for Interdisciplinary Risk and Innovation Studies of the University of Stuttgart.

Here are some examples of CIB in action:

• Gerhard Fuchs, Ulrich Fahl, Andreas Pyka, Udo Staber, Stefan Voegele and Wolfgang Weimer-Jehle, Generating innovation scenarios using the cross-impact methodology, Department of Economics, University of Bremen, Discussion-Papers Series No. 007-2008.

• Ortwin Renn, Alexander Jager, Jurgen Deuschle and Wolfgang Weimer-Jehle, A normative-functional concept of sustainability and its indicators, International Journal of Global Environmental Issues, 9 (2008), 291–317.

Finally, this page contains a more complete list of articles, both practical and theoretical:

## Logic, Probability and Reflection

26 December, 2013

Last week I attended the Machine Intelligence Research Institute’s sixth Workshop on Logic, Probability, and Reflection. This one was in Berkeley, where the institute has its headquarters.

You may know this institute under their previous name: the Singularity Institute. It seems to be the brainchild of Eliezer Yudkowsky, a well-known advocate of ‘friendly artificial intelligence’, whom I interviewed in week311, week312 and week313 of This Week’s Finds. He takes an approach to artificial intelligence that’s heavily influenced by mathematical logic, and I got invited to the workshop because I blogged about a paper he wrote with Mihaly Barasz, Paul Christiano and Marcello Herresho ff on probability theory and logic.

I only have the energy to lay the groundwork for a good explanation of what happened in the workshop. So, after you read my post, please read this:

• Benja Fallenstein, Results from MIRI’s December workshop, Less Wrong, 28 December 2013.

The workshop had two main themes, so let me tell you what they were.

### Scientific induction in mathematics

The first theme is related to that paper I just mentioned. How should a rational agent assign probabilities to statements in mathematics? Of course an omniscient being could assign

probability 1 to every mathematical statement that’s provable,

probability 0 to every statement whose negation is provable,

and

to every statement that is neither provable nor disprovable.

But a real-world rational agent will never have time to check all proofs, so there will always be lots of statements it’s not sure about. Actual mathematicians always have conjectures, like the Twin Prime Conjecture, that we consider plausible even though nobody has proved them. And whenever we do research, we’re constantly estimating how likely it is for statements to be true, and changing our estimates as new evidence comes in. In other words, we use scientific induction in mathematics.

How could we automate this? Most of us don’t consciously assign numerical probabilities to mathematical statements. But maybe an AI mathematician should. If so, what rules should it follow?

It’s natural to try a version of Solomonoff induction, where our probability estimate, before any evidence comes in, favors statements that are simple. However, this runs up against problems. If you’re interested in learning more about this, try:

• Jeremy Hahn, Scientific induction in probabilistic mathematics.

It’s a summary of ideas people came up with during the workshop. I would like to explain them sometime, but for now I should move on.

### The Löbian obstacle

The second main theme was the ‘Löbian obstacle’. Löb’s theorem is the flip side of Gödel’s first incompleteness theorem, less famous but just as shocking. It seems to put limitations on how much a perfectly rational being can trust itself.

Since it’s the day after Christmas, let’s ease our way into these deep waters with the Santa Claus paradox, also known as Curry’s paradox.

If you have a child who is worried that Santa Claus might not exist, you can reassure them using this sentence:

If this sentence is true, Santa Claus exists.

Call it P, for short.

Assume, for the sake of argument, that P is true. Then what it says is true: “If P is true, Santa Claus exists.” And we’re assuming P is true. So, Santa Claus exists.

So, we’ve proved that if P is true, Santa Claus exists.

But that’s just what P says!

So, P is true.

So, Santa Claus exists!

There must be something wrong about this argument, even if Santa Claus does exist, because if it were valid you could you use it to prove anything at all. The self-reference is obviously suspicious. The sentence in question is a variant of the Liar Paradox:

This sentence is false.

since we can rewrite the Liar Paradox as

If this sentence is true, 0 = 1.

and then replace “0=1″ by any false statement you like.

However, Gödel figured out a way to squeeze solid insights from these dubious self-referential sentences. He did this by creating a statement in the language of arithmetic, referring to nothing but numbers, which nonetheless manages to effectively say

This sentence is unprovable.

If it were provable, you’d get a contradiction! So, either arithmetic is inconsistent or this sentence is unprovable. But if it’s unprovable, it’s true. So, there are true but unprovable statements in arithmetic… unless arithmetic is inconsistent! This discovery shook the world of mathematics.

Here I’m being quite sloppy, just to get the idea across.

For one thing, when I’m saying ‘provable’, I mean provable given some specific axioms for arithmetic, like the Peano axioms. If we change our axioms, different statements will be provable.

For another, the concept of ‘true’ statements in arithmetic is often shunned by logicians. That may sound shocking, but there are many reasons for this: for example, Tarski showed that the truth of statements about arithmetic is undefinable in arithmetic. ‘Provability’ is much easier to deal with.

So, a better way of thinking about Gödel’s result is that he constructed a statement that is neither provable nor disprovable from Peano’s axioms of arithmetic, unless those axioms are inconsistent (in which case we can prove everything, but it’s all worthless).

Furthermore, this result applies not just to Peano’s axioms but to any stronger set of axioms, as long as you can write a computer program to list those axioms.

In 1952, the logician Leon Henkin flipped Gödel’s idea around and asked about a sentence in the language of arithmetic that says:

This sentence is provable.

He asked: is this provable or not? The answer is much less obvious than for Gödel’s sentence. Play around with it and see what I mean.

But in 1954, Martin Hugo Löb showed that Henkin’s sentence is provable!

And Henkin noticed something amazing: Löb’s proof shows much more.

At this point it pays to become a bit more precise. Let us write $\mathrm{PA} \vdash P$ to mean the statement $P$ is provable from the Peano axioms of arithmetic. Gödel figured out how to encode statements in arithmetic as numbers, so let’s write $\# P$ for the Gödel number of any statement $P.$ And Gödel figured out how to write a statement in arithmetic, say

$\mathrm{Provable}(n)$

which says that the statement with Gödel number $n$ is provable using the Peano axioms.

Using this terminology, what Henkin originally did was find a number $n$ such that the sentence

$\mathrm{Provable}(n)$

has Gödel number $n.$ So, this sentence says

This sentence is provable from the Peano axioms of arithmetic.

What Löb did was show

$\mathrm{PA} \vdash \mathrm{Provable}(n)$

In other words, he showed that Henkin sentence really is provable from the Peano axioms!

What Henkin then did is prove that for any sentence $P$ in the language of arithmetic, if

$\mathrm{PA} \vdash \mathrm{Provable}(\# P) \implies P$

then

$\mathrm{PA} \vdash P$

In other words, suppose we can prove that the provability of $P$ implies $P.$ Then we can prove $P$!

At first this merely sounds nightmarishly complicated. But if you think about it long enough, you’ll see it’s downright terrifying! For example, suppose $P$ is some famous open question in arithmetic, like the Twin Prime Conjecture. You might hope to prove

The provability of the Twin Prime Conjecture implies the Twin Prime Conjecture.

Indeed, that seems like a perfectly reasonable thing to want. But it turns out that proving this is as hard as proving the Twin Prime Conjecture! Why? Because if we can prove the boldface sentence above, Löb and Henkin’s work instantly gives us a proof of Twin Prime Conjecture!

What does all this have to do with artificial intelligence?

Well, what I just said is true not only for Peano arithmetic, but any set of axioms including Peano arithmetic that a computer program can list. Suppose your highly logical AI mathematician has some such set of axioms, say $\mathrm{AI}.$ You might want it to trust itself. In other words, you might want

$\mathrm{AI} \vdash \mathrm{Provable}(\# P) \implies P$

for every sentence $P.$ This says, roughly, that whatever the AI can prove it can prove, it can prove.

But then Löb’s theorem would kick in and give

$\mathrm{AI} \vdash P$

for every sentence $P.$ And this would be disastrous: our AI would be inconsistent, because it could prove everything!

This is just the beginning of the problems. It gets more involved when we consider AI’s that spawn new AI’s and want to trust them. For more see:

• Eliezer Yudkowsky and Marcello Herreshoff, Tiling agents for self-modifying AI, and the Löbian obstacle.

At workshop various people made progress on this issue, which is recorded in these summaries:

• Eliezer Yudkowsky, The procrastination paradox.

Abstract. A theorem by Marcello Herresho , Benja Fallenstein, and Stuart Armstrong shows that if there exists an infinite series of theories $T_i$ extending $\mathrm{PA}$ where each $T_i$ proves the soundness of $T_{i+1}$, then all the $T_i$ must have only nonstandard models. We call this the Procrastination Theorem for reasons which will become apparent.

Here Fallenstein constructs a di fferent sequence of theories $T_i$ extending Peano arithmetic such that each $T_i$ proves the consistency of $T_{i+1},$ and all the theories are sound for $\Pi_1$ sentences—that is, sentences with only one $\forall$ quantifier outside the rest of the stuff.

The following summaries would take more work to explain:

• Nate Soares, Fallenstein’s monster.

• Nisan Stiennon, Recursively-defined logical theories are well-defined.

• Benja Fallenstein, The 5-and-10 problem and the tiling agents formalism.