## Exploring Climate Data (Part 1)

1 August, 2014

joint with Dara O Shayda

Emboldened by our experiments in El Niño analysis and prediction, people in the Azimuth Code Project have been starting to analyze weather and climate data. A lot of this work is exploratory, with no big conclusions. But it’s still interesting! So, let’s try some blog articles where we present this work.

This one will be about the air pressure on the island of Tahiti and in a city called Darwin in Australia: how they’re correlated, and how each one varies. This article will also be a quick introduction to some basic statistics, as well as ‘continuous wavelet transforms’.

### Darwin, Tahiti and El Niños

The El Niño Southern Oscillation is often studied using the air pressure in Darwin, Australia versus the air pressure in Tahiti. When there’s an El Niño, it gets stormy in the eastern Pacific so the air temperatures tend to be lower in Tahiti and higher in Darwin. When there’s a La Niña, it’s the other way around:

The Southern Oscillation Index or SOI is a normalized version of the monthly mean air pressure anomaly in Tahiti minus that in Darwin. Here anomaly means we subtract off the mean, and normalized means that we divide by the standard deviation.

So, the SOI tends to be negative when there’s an El Niño. On the other hand, when there’s an El Niño the Niño 3.4 index tends to be positive—this says it’s hotter than usual in a certain patch of the Pacific.

Here you can see how this works:

When the Niño 3.4 index is positive, the SOI tends to be negative, and vice versa!

It might be fun to explore precisely how well correlated they are. You can get the data to do that by clicking on the links above.

But here’s another question: how similar are the air pressure anomalies in Darwin and in Tahiti? Do we really need to take their difference, or are they so strongly anticorrelated that either one would be enough to detect an El Niño?

You can get the data to answer such questions here:

Southern Oscillation Index based upon annual standardization, Climate Analysis Section, NCAR/UCAR. This includes links to monthly sea level pressure anomalies in Darwin and Tahiti, in either ASCII format (click the second two links) or netCDF format (click the first one and read the explanation).

In fact this website has some nice graphs already made, which I might as well show you! Here’s the SOI and also the sum of the air pressure anomalies in Darwin and Tahiti, normalized in some way:

(Click to enlarge.)

If the sum were zero, the air pressure anomalies in Darwin and Tahiti would contain the same information and life would be simple. But it’s not!

How similar in character are the air pressure anomalies in Darwin and Tahiti? There are many ways to study this question. Dara tackled it by taking the air pressure anomaly data from 1866 to 2012 and computing some ‘continuous wavelet transforms’ of these air pressure anomalies. This is a good excuse for explaining how a continuous wavelet transform works.

### Very basic statistics

It helps to start with some very basic statistics. Suppose you have a list of numbers

$x = (x_1, \dots, x_n)$

You probably know how to take their mean, or average. People often write this with angle brackets:

$\displaystyle{ \langle x \rangle = \frac{1}{n} \sum_{i = 1}^n x_i }$

You can also calculate the mean of their squares:

$\displaystyle{ \langle x^2 \rangle = \frac{1}{n} \sum_{i = 1}^n x_i^2 }$

If you were naive you might think $\langle x^2 \rangle = \langle x \rangle^2,$ but in fact we have:

$\langle x^2 \rangle \ge \langle x \rangle^2$

and they’re equal only if all the $x_i$ are the same. The point is that if the numbers $x_i$ are spread out, the squares of the big ones (positive or negative) contribute more to the average of the squares than if we had averaged them out before squaring. The difference

$\langle x^2 \rangle - \langle x \rangle^2$

is called the variance; it says how spread out our numbers are. The square root of the variance is the standard deviation:

$\sigma_x = \sqrt{\langle x^2 \rangle - \langle x \rangle^2 }$

and this has the slight advantage that if you multiply all the numbers $x_i$ by some constant $c,$ the standard deviation gets multiplied by $|c|.$ (The variance gets multiplied by $c^2.$)

We can generalize the variance to a situation where we have two lists of numbers:

$x = (x_1, \dots, x_n)$

$y = (y_1, \dots, y_n)$

Namely, we can form the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle$

This reduces to the variance when $x = y.$ It measures how much $x$ and $y$ vary together — ‘hand in hand’, as it were. A bit more precisely: if $x_i$ is greater than its mean value mainly for $i$ such that $y_i$ is greater than its mean value, the covariance is positive. On the other hand, if $x_i$ tends to be greater than average when $y_i$ is smaller than average — like with the air pressures at Darwin and Tahiti — the covariance will be negative.

For example, if

$x = (1,-1), \quad y = (1,-1)$

then they ‘vary hand in hand’, and the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = 1 - 0 = 1$

is positive. But if

$x = (1,-1), \quad y = (-1,1)$

then one is positive when the other is negative, so the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = -1 - 0 = -1$

is negative.

Of course the covariance will get bigger if we multiply both $x$ and $y$ by some big number. If we don’t want this effect, we can normalize the covariance and get the correlation:

$\displaystyle{ \frac{ \langle x y \rangle - \langle x \rangle \langle y \rangle }{\sigma_x \sigma_y} }$

which will always be between $-1$ and $1.$

For example, if we compute the correlation between the air pressure anomalies at Darwin and Tahiti, measured monthly from 1866 to 2012, we get
-0.253727. This indicates that when one goes up, the other tends to go down. But since we’re not getting -1, it means they’re not completely locked into a linear relationship where one is some negative number times the other.

Okay, we’re almost ready for continuous wavelet transforms! Here is the main thing we need to know. If the mean of either $x$ or $y$ is zero, the formula for covariance simplifies a lot, to

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

So, this quantity says how much the numbers $x_i$ ‘vary hand in hand’ with the numbers $y_i,$ in the special case when one (or both) has mean zero.

We can do something similar if $x, y : \mathbb{R} \to \mathbb{R}$ are functions of time defined for all real numbers $t.$ The sum becomes an integral, and we have to give up on dividing by $n.$ We get:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t)\; d t }$

This is called the inner product of the functions $x$ and $y,$ and often it’s written $\langle x, y \rangle,$ but it’s a lot like the covariance.

### Continuous wavelet transforms

What are continuous wavelet transforms, and why should we care?

People have lots of tricks for studying ‘signals’, like series of numbers $x_i$ or functions $x : \mathbb{R} \to \mathbb{R}.$ One method is to ‘transform’ the signal in a way that reveals useful information. The Fourier transform decomposes a signal into sines and cosines of different frequencies. This lets us see how much power the signal has at different frequencies, but it doesn’t reveal how the power at different frequencies changes with time. For that we should use something else, like the Gabor transform explained by Blake Pollard in a previous post.

Sines and cosines are great, but we might want to look for other patterns in a signal. A ‘continuous wavelet transform’ lets us scan a signal for appearances of a given pattern at different times and also at different time scales: a pattern could go by quickly, or in a stretched out slow way.

To implement the continuous wavelet transform, we need a signal and a pattern to look for. The signal could be a function $x : \mathbb{R} \to \mathbb{R}.$ The pattern would then be another function $y: \mathbb{R} \to \mathbb{R},$ usually called a wavelet.

Here’s an example of a wavelet:

If we’re in a relaxed mood, we could call any function that looks like a bump with wiggles in it a wavelet. There are lots of famous wavelets, but this particular one is the fourth derivative of a certain Gaussian. Mathematica calls this particular wavelet DGaussianWavelet[4], and you can look up the formula under ‘Details’ on their webpage.

However, the exact formula doesn’t matter at all now! If we call this wavelet $y,$ all that matters is that it’s a bump with wiggles on it, and that its mean value is 0, or more precisely:

$\displaystyle{ \int_{-\infty}^\infty y(t) \; d t = 0 }$

As we saw in the last section, this fact lets us take our function $x$ and the wavelet $y$ and see how much they ‘vary hand it hand’ simply by computing their inner product:

$\displaystyle{ \langle x , y \rangle = \int_{-\infty}^\infty x(t) y(t)\; d t }$

Loosely speaking, this measures the ‘amount of $y$-shaped wiggle in the function $x$’. It’s amazing how hard it is to say something in plain English that perfectly captures the meaning of a simple formula like the above one—so take the quoted phrase with a huge grain of salt. But it gives a rough intuition.

Our wavelet $y$ happens to be centered at $t = 0.$ However, we might be interested in $y$-shaped wiggles that are centered not at zero but at some other number $s.$ We could detect these by shifting the function $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t-s)\; d t }$

We could also be interested in measuring the amount of some stretched-out or squashed version of a $y$-shaped wiggle in the function $x.$ Again we could do this by changing $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t}{P}\right) \; d t }$

When $P$ is big, we get a stretched-out version of $y.$ People sometimes call $P$ the period, since the period of the wiggles in $y$ will be proportional to this (though usually not equal to it).

Finally, we can combine these ideas, and compute

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t- s}{P}\right)\; dt }$

This is a function of the shift $s$ and period $P$ which says how much of the $s$-shifted, $P$-stretched wavelet $y$ is lurking in the function $x.$ It’s a version of the continuous wavelet transform!

Mathematica implements this idea for time series, meaning lists of numbers $x = (x_1,\dots,x_n)$ instead of functions $x : \mathbb{R} \to \mathbb{R}.$ The idea is that we think of the numbers as samples of a function $x$:

$x_1 = x(\Delta t)$

$x_2 = x(2 \Delta t)$

and so on, where $\Delta t$ is some time step, and replace the integral above by a suitable sum. Mathematica has a function ContinuousWaveletTransform that does this, giving

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \sum_{i = 1}^n x_i \; y\left(\frac{i \Delta t - s}{P}\right) }$

The factor of $1/\sqrt{P}$ in front is a useful extra trick: it’s the right way to compensate for the fact that when you stretch out out your wavelet $y$ by a factor of $P,$ it gets bigger. So, when we’re doing integrals, we should define the continuous wavelet transform of $y$ by:

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \int_{-\infty}^\infty x(t) y(\frac{t- s}{P})\; dt }$

### The results

Dara Shayda started with the air pressure anomaly at Darwin and Tahiti, measured monthly from 1866 to 2012. Taking DGaussianWavelet[4] as his wavelet, he computed the continuous wavelet transform $w(s,P)$ as above. To show us the answer, he created a scalogram:

This is a 2-dimensional color plot showing roughly how big the continuous wavelet transform $w(s,P)$ is for different shifts $s$ and periods $P.$ Blue means it’s very small, green means it’s bigger, yellow means even bigger and red means very large.

Tahiti gave this:

You’ll notice that the patterns at Darwin and Tahiti are similar in character, but notably different in detail. For example, the red spots, where our chosen wavelet shows up strongly with period of order ~100 months, occur at different times.

Puzzle 1. What is the meaning of the ‘spikes’ in these scalograms? What sort of signal would give a spike of this sort?

Puzzle 2. Do a Gabor transform, also known as a ‘windowed Fourier transform’, of the same data. Blake Pollard explained the Gabor transform in his article Milankovitch vs the Ice Ages. This is a way to see how much a signal wiggles at a given frequency at a given time: we multiply the signal by a shifted Gaussian and then takes its Fourier transform.

Puzzle 3. Read about continuous wavelet transforms. If we want to reconstruct our signal $x$ from its continuous wavelet transform, why should we use a wavelet $y$ with

$\displaystyle{\int_{-\infty}^\infty y(t) \; d t = 0 ? }$

In fact we want a somewhat stronger condition, which is implied by the above equation when the Fourier transform of $y$ is smooth and integrable:

Continuous wavelet transform, Wikipedia.

### Another way to understand correlations

David Tweed mentioned another approach from signal processing to understanding the quantity

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

If we’ve got two lists of data $x$ and $y$ that we want to compare to see if they behave similarly, the first thing we ought to do is multiplicatively scale each one so they’re of comparable magnitude. There are various possibilities for assigning a scale, but a reasonable one is to ensure they have equal ‘energy’

$\displaystyle{ \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 }$

(This can be achieved by dividing each list by its standard deviation, which is equivalent to what was done in the main derivation above.) Once we’ve done that then it’s clear that looking at

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

gives small values when they have a very good match and progressively bigger values as they become less similar. Observe that

$\begin{array}{ccl} \displaystyle{\sum_{i=1}^n (x_i-y_i)^2 } &=& \displaystyle{ \sum_{i=1}^n (x_i^2 - 2 x_i y_i + y_i^2) }\\ &=& \displaystyle{ \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 } \end{array}$

Since we’ve scaled things so that $\sum_{i=1}^n x_i^2$ and $\sum_{i=1}^n y_i^2$ are constants, we can see that when $\sum_{i=1}^n x_i y_i$ becomes bigger,

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

becomes smaller. So,

$\displaystyle{\sum_{i=1}^n x_i y_i}$

serves as a measure of how close the lists are, under these assumptions.

## The Stochastic Resonance Program (Part 1)

10 May, 2014

guest post by David Tanzer

At the Azimuth Code Project, we are aiming to produce educational software that is relevant to the Earth sciences and the study of climate. Our present software takes the form of interactive web pages, which allow you to experiment with the parameters of models and view their outputs. But to fully understand the meaning of a program, we need to know about the concepts and theories that inform it. So we will be writing articles to explain both the programs themselves and the math and science behind them.

In this two-part series, I’ll explain this program:

Check it out—it runs on your browser! It was created by Allan Erskine and Glyn Adgie. In the Azimuth blog article Increasing the Signal-to-Noise Ratio with More Noise, Glyn Adgie and Tim van Beek give a nice explanation of the idea of stochastic resonance, which includes some clear and exciting graphs.

My goal today is give a compact, developer-oriented introduction to stochastic resonance, which will set the context for the next blog article, where I’ll dissect the program itself. By way of introduction, I am a software developer with research training in computer science. It’s a new area for me, and any clarifications will be welcome!

### The concept of stochastic resonance

Stochastic resonance is a phenomenon, occurring under certain circumstances, in which a noise source may amplify the effect of a weak signal. This concept was used in an early hypothesis about the timing of ice-age cycles, and has since been applied to a wide range of phenomena, including neuronal detection mechanisms and patterns of traffic congestion.

Suppose we have a signal detector whose internal, analog state is driven by an input signal, and suppose the analog states are partitioned into two regions, called “on” and “off” — this is a digital state, abstracted from the analog state. With a light switch, we could take the force as the input signal, the angle as the analog state, and the up/down classification of the angle as the digital state.

Consider the effect of a periodic input signal on the digital state. Suppose the wave amplitude is not large enough to change the digital state, yet large enough to drive the analog state close to the digital state boundary. Then, a bit of random noise, occurring near the peak of an input cycle, may “tap” the system over to the other digital state. So we will see a probability of state-transitions that is synchronized with the input signal. In a complex way, the noise has amplified the input signal.

But it’s a pretty funky amplifier! Here is a picture from the Azimuth library article on stochastic resonance:

Stochastic resonance has been found in the signal detection mechanisms of neurons. There are, for example, cells in the tails of crayfish that are tuned to low-frequency signals in the water caused by predator motions. These signals are too weak to cross the firing threshold for the neurons, but with the right amount of noise, they do trigger the neurons.

See:

Stochastic resonance, Azimuth Library.

Stochastic resonance in neurobiology, David Lyttle.

### Bistable stochastic resonance and Milankovitch theories of ice-age cycles

Stochastic resonance was originally formulated in terms of systems that are bistable — where each digital state is the basin of attraction of a stable equilibrium.

An early application of stochastic resonance was to a hypothesis, within the framework of bistable climate dynamics, about the timing of the ice-age cycles. Although it has not been confirmed, it remains of interest (1) historically, (2) because the timing of ice-age cycles remains an open problem, and (3) because the Milankovitch hypothesis upon which it rests is an active part of the current research.

In the bistable model, the climate states are a cold, “snowball” Earth and a hot, iceless Earth. The snowball Earth is stable because it is white, and hence reflects solar energy, which keeps it frozen. The iceless Earth is stable because it is dark, and hence absorbs solar energy, which keeps it melted.

The Milankovitch hypothesis states that the drivers of climate state change are long-duration cycles in the insolation — the solar energy received in the northern latitudes — caused by periodic changes in the Earth’s orbital parameters. The north is significant because that is where the glaciers are concentrated, and so a sufficient “pulse” in northern temperatures could initiate a state change.

Three relevant astronomical cycles have been identified:

• Changing of the eccentricity of the Earth’s elliptical orbit, with a period of 100 kiloyears

• Changing of the obliquity (tilt) of the Earth’s axis, with a period of 41 kiloyears

• Precession (swiveling) of the Earth’s axis, with a period of 23 kiloyears

In the stochastic resonance hypothesis, the Milankovitch signal is amplified by random events to produce climate state changes. In more recent Milankovitch theories, a deterministic forcing mechanism is used. In a theory by Didier Paillard, the climate is modeled with three states, called interglacial, mild glacial and full glacial, and the state changes depend on the volume of ice as well as the insolation.

See:

Milankovitch cycle, Azimuth Library.

Mathematics of the environment (part 10), John Baez. This gives an exposition of Paillard’s theory.

### Bistable systems defined by a potential function

Any smooth function with two local minima can be used to define a bistable system. For instance, consider the function $V(x) = x^4/4 - x^2/2$:

To define the bistable system, construct a differential equation where the time derivative of x is set to the negative of the derivative of the potential at x:

$dx/dt = -V'(x) = -x^3 + x = x(1 - x^2)$

So, for instance, where the potential graph is sloping upward as $x$ increases, $-V'(x)$ is negative, and this sends $X(t)$ ‘downhill’ towards the minimum.

The roots of $V'(x)$ yield stable equilibria at 1 and -1, and an unstable equilibrium at 0. The latter separates the basins of attraction for the stable equilibria.

### Discrete stochastic resonance

Now let’s look at a discrete-time model which exhibits stochastic resonance. This is the model used in the Azimuth demo program.

We construct the discrete-time derivative, using the potential function, a sampled sine wave, and a normally distributed random number:

$\Delta X_t = -V'(X_t) * \Delta t + \mathrm{Wave}(t) + \mathrm{Noise}(t) =$
$X_t (1 - X_t^2) \Delta t + \alpha * \sin(\omega t) + \beta * \mathrm{GaussianSample}(t)$

where $\Delta t$ is a constant and $t$ is restricted to multiples of $\Delta t.$

This equation is the discrete-time counterpart to a continuous-time stochastic differential equation.

Next time, we will look into the Azimuth demo program itself.

## Noether’s Theorem: Quantum vs Stochastic

3 May, 2014

guest post by Ville Bergholm

In 1915 Emmy Noether discovered an important connection between the symmetries of a system and its conserved quantities. Her result has become a staple of modern physics and is known as Noether’s theorem.

The theorem and its generalizations have found particularly wide use in quantum theory. Those of you following the Network Theory series here on Azimuth might recall Part 11 where John Baez and Brendan Fong proved a version of Noether’s theorem for stochastic systems. Their result is now published here:

• John Baez and Brendan Fong, A Noether theorem for stochastic mechanics, J. Math. Phys. 54:013301 (2013).

One goal of the network theory series here on Azimuth has been to merge ideas appearing in quantum theory with other disciplines. John and Brendan proved their stochastic version of Noether’s theorem by exploiting ‘stochastic mechanics’ which was formulated in the network theory series to mathematically resemble quantum theory. Their result, which we will outline below, was different than what would be expected in quantum theory, so it is interesting to try to figure out why.

Recently Jacob Biamonte, Mauro Faccin and myself have been working to try to get to the bottom of these differences. What we’ve done is prove a version of Noether’s theorem for Dirichlet operators. As you may recall from Parts 16 and 20 of the network theory series, these are the operators that generate both stochastic and quantum processes. In the language of the series, they lie in the intersection of stochastic and quantum mechanics. So, they are a subclass of the infinitesimal stochastic operators considered in John and Brendan’s work.

The extra structure of Dirichlet operators—compared with the wider class of infinitesimal stochastic operators—provided a handle for us to dig a little deeper into understanding the intersection of these two theories. By the end of this article, astute readers will be able to prove that Dirichlet operators generate doubly stochastic processes.

Before we get into the details of our proof, let’s recall first how conservation laws work in quantum mechanics, and then contrast this with what John and Brendan discovered for stochastic systems. (For a more detailed comparison between the stochastic and quantum versions of the theorem, see Part 13 of the network theory series.)

### The quantum case

I’ll assume you’re familiar with quantum theory, but let’s start with a few reminders.

In standard quantum theory, when we have a closed system with $n$ states, the unitary time evolution of a state $|\psi(t)\rangle$ is generated by a self-adjoint $n \times n$ matrix $H$ called the Hamiltonian. In other words, $|\psi(t)\rangle$ satisfies Schrödinger’s equation:

$i \hbar \displaystyle{\frac{d}{d t}} |\psi(t) \rangle = H |\psi(t) \rangle.$

The state of a system starting off at time zero in the state $|\psi_0 \rangle$ and evolving for a time $t$ is then given by

$|\psi(t) \rangle = e^{-i t H}|\psi_0 \rangle.$

The observable properties of a quantum system are associated with self-adjoint operators. In the state $|\psi \rangle,$ the expected value of the observable associated to a self-adjoint operator $O$ is

$\langle O \rangle_{\psi} = \langle \psi | O | \psi \rangle$

This expected value is constant in time for all states if and only if $O$ commutes with the Hamiltonian $H$:

$[O, H] = 0 \quad \iff \quad \displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0 \quad \forall \: |\psi_0 \rangle, \forall t.$

In this case we say $O$ is a ‘conserved quantity’. The fact that we have two equivalent conditions for this is a quantum version of Noether’s theorem!

### The stochastic case

In stochastic mechanics, the story changes a bit. Now a state $|\psi(t)\rangle$ is a probability distribution: a vector with $n$ nonnegative components that sum to 1. Schrödinger’s equation gets replaced by the master equation:

$\displaystyle{\frac{d}{d t}} |\psi(t) \rangle = H |\psi(t) \rangle$

If we start with a probability distribution $|\psi_0 \rangle$ at time zero and evolve it according to this equation, at any later time have

$|\psi(t)\rangle = e^{t H} |\psi_0 \rangle.$

We want this always be a probability distribution. To ensure that this is so, the Hamiltonian $H$ must be infinitesimal stochastic: that is, a real-valued $n \times n$ matrix where the off-diagonal entries are nonnegative and the entries of each column sum to zero. It no longer needs to be self-adjoint!

When $H$ is infinitesimal stochastic, the operators $e^{t H}$ map the set of probability distributions to itself whenever $t \ge 0,$ and we call this family of operators a continuous-time Markov process, or more precisely a Markov semigroup.

In stochastic mechanics, we say an observable $O$ is a real diagonal $n \times n$ matrix, and its expected value is given by

$\langle O\rangle_{\psi} = \langle \hat{O} | \psi \rangle$

where $\hat{O}$ is the vector built from the diagonal entries of $O.$ More concretely,

$\langle O\rangle_{\psi} = \displaystyle{ \sum_i O_{i i} \psi_i }$

where $\psi_i$ is the $i$th component of the vector $|\psi\rangle.$

Here is a version of Noether’s theorem for stochastic mechanics:

Noether’s Theorem for Markov Processes (Baez–Fong). Suppose $H$ is an infinitesimal stochastic operator and $O$ is an observable. Then

$[O,H] =0$

if and only if

$\displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0$

and

$\displaystyle{\frac{d}{d t}}\langle O^2 \rangle_{\psi(t)} = 0$

for all $t \ge 0$ and all $\psi(t)$ obeying the master equation.   █

So, just as in quantum mechanics, whenever $[O,H]=0$ the expected value of $O$ will be conserved:

$\displaystyle{\frac{d}{d t}} \langle O\rangle_{\psi(t)} = 0$

for any $\psi_0$ and all $t \ge 0.$ However, John and Brendan saw that—unlike in quantum mechanics—you need more than just the expectation value of the observable $O$ to be constant to obtain the equation $[O,H]=0.$ You really need both

$\displaystyle{\frac{d}{d t}} \langle O\rangle_{\psi(t)} = 0$

together with

$\displaystyle{\frac{d}{d t}} \langle O^2\rangle_{\psi(t)} = 0$

for all initial data $\psi_0$ to be sure that $[O,H]=0.$

So it’s a bit subtle, but symmetries and conserved quantities have a rather different relationship than they do in quantum theory.

### A Noether theorem for Dirichlet operators

But what if the infinitesimal generator of our Markov semigroup is also self-adjoint? In other words, what if $H$ is both an infinitesimal stochastic matrix but also its own transpose: $H = H^\top$? Then it’s called a Dirichlet operator… and we found that in this case, we get a stochastic version of Noether’s theorem that more closely resembles the usual quantum one:

Noether’s Theorem for Dirichlet Operators. If $H$ is a Dirichlet operator and $O$ is an observable, then

$[O, H] = 0 \quad \iff \quad \displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0 \quad \forall \: |\psi_0 \rangle, \forall t \ge 0$

Proof. The $\Rightarrow$ direction is easy to show, and it follows from John and Brendan’s theorem. The point is to show the $\Leftarrow$ direction. Since $H$ is self-adjoint, we may use a spectral decomposition:

$H = \displaystyle{ \sum_k E_k |\phi_k \rangle \langle \phi_k |}$

where $\phi_k$ are an orthonormal basis of eigenvectors, and $E_k$ are the corresponding eigenvalues. We then have:

$\displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = \langle \hat{O} | H e^{t H} |\psi_0 \rangle = 0 \quad \forall \: |\psi_0 \rangle, \forall t \ge 0$

$\iff \quad \langle \hat{O}| H e^{t H} = 0 \quad \forall t \ge 0$

$\iff \quad \sum_k \langle \hat{O} | \phi_k \rangle E_k e^{t E_k} \langle \phi_k| = 0 \quad \forall t \ge 0$

$\iff \quad \langle \hat{O} | \phi_k \rangle E_k e^{t E_k} = 0 \quad \forall t \ge 0$

$\iff \quad |\hat{O} \rangle \in \mathrm{Span}\{|\phi_k \rangle \, : \; E_k = 0\} = \ker \: H,$

where the third equivalence is due to the vectors $|\phi_k \rangle$ being linearly independent. For any infinitesimal stochastic operator $H$ the corresponding transition graph consists of $m$ connected components iff we can reorder (permute) the states of the system such that $H$ becomes block-diagonal with $m$ blocks. Now it is easy to see that the kernel of $H$ is spanned by $m$ eigenvectors, one for each block. Since $H$ is also symmetric, the elements of each such vector can be chosen to be ones within the block and zeros outside it. Consequently

$|\hat{O} \rangle \in \ker \: H$

implies that we can choose the basis of eigenvectors of $O$ to be the vectors $|\phi_k \rangle,$ which implies

$[O, H] = 0$

Alternatively,

$|\hat{O} \rangle \in \ker \, H$

implies that

$|\hat{O^2} \rangle \in \ker \: H \; \iff \; \cdots \; \iff \; \displaystyle{\frac{d}{d t}} \langle O^2 \rangle_{\psi(t)} = 0 \; \forall \: |\psi_0 \rangle, \forall t \ge 0,$

where we have used the above sequence of equivalences backwards. Now, using John and Brendan’s original proof, we can obtain $[O, H] = 0.$   █

In summary, by restricting ourselves to the intersection of quantum and stochastic generators, we have found a version of Noether’s theorem for stochastic mechanics that looks formally just like the quantum version! However, this simplification comes at a cost. We find that the only observables $O$ whose expected value remains constant with time are those of the very restricted type described above, where the observable has the same value in every state in a connected component.

### Puzzles

Suppose we have a graph whose graph Laplacian matrix $H$ generates a Markov semigroup as follows:

$U(t) = e^{t H}$

Puzzle 1. Suppose that also $H = H^\top,$ so that $H$ is a Dirichlet operator and hence $i H$ generates a 1-parameter unitary group. Show that the indegree and outdegree of any node of our graph must be equal. Graphs with this property are called balanced.

Puzzle 2. Suppose that $U(t) = e^{t H}$ is doubly stochastic Markov semigroup, meaning that for all $t \ge 0$ each row and each column of $U(t)$ sums to 1:

$\displaystyle{ \sum_i U(t)_{i j} = \sum_j U(t)_{i j} = 1 }$

and all the matrix entries are nonnegative. Show that the Hamiltonian $H$ obeys

$\displaystyle{\sum_i H_{i j} = \sum_j H_{i j} = 0 }$

and all the off-diagonal entries of $H$ are nonnegative. Show the converse is also true.

Puzzle 3. Prove that any doubly stochastic Markov semigroup $U(t)$ is of the form $e^{t H}$ where $H$ is the graph Laplacian of a balanced graph.

Puzzle 4. Let $O(t)$ be a possibly time-dependent observable, and write $\langle O(t) \rangle_{\psi(t)}$ for its expected value with respect to some initial state $\psi_0$ evolving according to the master equation. Show that

$\displaystyle{ \frac{d}{d t}\langle O(t)\rangle_{\psi(t)} = \left\langle [O(t), H] \right\rangle_{\psi(t)} + \left\langle \frac{\partial O(t)}{\partial t}\right\rangle_{\psi(t)} }$

This is a stochastic version of the Ehrenfest theorem.

## Network Theory III

16 March, 2014

In the last of my Oxford talks I explain how entropy and relative entropy can be understood using certain categories related to probability theory… and how these categories also let us understand Bayesian networks!

The first two parts are explanations of these papers:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss

• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy.

Somewhere around here the talk was interrupted by a fire drill, waking up the entire audience!

By the way, in my talk I mistakenly said that relative entropy is a continuous functor; in fact it’s just lower semicontinuous. I’ve fixed this in my slides.

The third part of my talk was my own interpretation of Brendan Fong’s master’s thesis:

• Brendan Fong, Causal Theories: a Categorical Perspective on Bayesian Networks.

I took a slightly different approach, by saying that a causal theory $\mathcal{C}_G$ is the free category with products on certain objects and morphisms coming from a directed acyclic graph $G$. In his thesis he said $\mathcal{C}_G$ was the free symmetric monoidal category where each generating object is equipped with a cocommutative comonoid structure. This is close to a category with finite products, though perhaps not quite the same: a symmetric monoidal category where every object is equipped with a cocommutative comonoid structure in a natural way (i.e., making a bunch of squares commute) is a category with finite products. It would be interesting to see if this difference hurts or helps.

By making this slight change, I am claiming that causal theories can be seen as algebraic theories in the sense of Lawvere. This would be a very good thing, since we know a lot about those.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

## Network Theory II

12 March, 2014

Chemists are secretly doing applied category theory! When chemists list a bunch of chemical reactions like

C + O₂ → CO₂

they are secretly describing a ‘category’.

That shouldn’t be surprising. A category is simply a collection of things called objects together with things called morphisms going from one object to another, often written

f: x → y

The rules of a category say:

1) we can compose a morphism f: x → y and another morphism g: y → z to get an arrow gf: x → z,

2) (hg)f = h(gf), so we don’t need to bother with parentheses when composing arrows,

3) every object x has an identity morphism 1ₓ: x → x that obeys 1ₓ f = f and f 1ₓ = f.

Whenever we have a bunch of things (objects) and processes (arrows) that take one thing to another, we’re likely to have a category. In chemistry, the objects are bunches of molecules and the arrows are chemical reactions. But we can ‘add’ bunches of molecules and also add reactions, so we have something more than a mere category: we have something called a symmetric monoidal category.

My talk here, part of a series, is an explanation of this viewpoint and how we can use it to take ideas from elementary particle physics and apply them to chemistry! ﻿For more details try this free book:

• John Baez and Jacob Biamonte, A Course on Quantum Techniques for Stochastic Mechanics.

as well as this paper on the Anderson–Craciun–Kurtz theorem (discussed in my talk):

• John Baez and Brendan Fong, Quantum techniques for studying equilibrium in reaction networks.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

## Markov Models of Social Change (Part 1)

24 February, 2014

guest post by Alastair Jamieson-Lane

The world is complex, and making choices in a complex world is sometimes difficult.

As any leader knows, decisions must often be made with incomplete information. To make matters worse, the experts and scientists who are meant to advise on these important matters are also doing so with incomplete information—usually limited to only one or two specialist fields. When decisions need to be made that are dependent on multiple real-world systems, and your various advisors find it difficult to communicate, this can be problematic!

The generally accepted approach is to listen to whichever advisor tells you the things you want to hear.

When such an approach fails (for whatever mysterious and inexplicable reason) it might be prudent to consider such approaches as Bayesian inference, analysis of competing hypotheses or cross-impact balance analysis.

Because these methods require experts to formalize their opinions in an explicit, discipline neutral manner, we avoid many of the problems mentioned above. Also, if everything goes horribly wrong, you can blame the algorithm, and send the rioting public down to the local university to complain there.

In this blog article I will describe cross-impact balance analysis and a recent extension to this method, explaining its use, as well as some basic mathematical underpinnings. No familiarity with cross-impact balance analysis will be required.

### Wait—who is this guy?

Since this is my first time writing a blog post here, I hear introductions are in order.

Hi. I’m Alastair.

I am currently a Master’s student at the University of British Columbia, studying mathematics. In particular, I’m aiming to use evolutionary game theory to study academic publishing and hiring practices… and from there hopefully move on to studying governments (we’ll see how the PhD goes). I figure that both those systems seem important to solving the problems we’ve built for ourselves, and both may be under increasing pressure in coming years.

But that’s not what I’m here for today! Today I’m here to tell the story of cross-impact balance analysis, a tool I was introduced to at the complex systems summer school in Santa Fe.

### The story

Suppose (for example) that the local oracle has foretold that burning the forests will anger the nature gods

… and that if you do not put restrictions in place, your crops will wither and die.

Well, that doesn’t sound very good.

The merchant’s guild claims that such restrictions will cause all trade to grind to a halt.

Your most trusted generals point out that weakened trade will leave you vulnerable to invasion from all neighboring kingdoms.

The sailors guild adds that the wrath of Poseidon might make nautical trade more difficult.

The alchemists propose alternative sources of heat…

… while the druids propose special crops as a way of resisting the wrath of the gods…

… and so on.

Given this complex web of interaction, it might be a good time to consult the philosophers.

### Overview of CIB

This brings us to the question of what CIB (Cross-Impact Balance) analysis is, and how to use it.

At its heart, CIB analysis demands this: first, you must consider what aspects of the world you are interested in studying. This could be environmental or economic status, military expenditure, or the laws governing genetic modification. These we refer to as “descriptors”. For each “descriptor” we must create a list of possible “states”.

For example, if the descriptor we are interested in were “global temperature change” our states might be “+5 degree”, “+4 degrees” and so on down to “-2 degrees”.

The states of a descriptor are not meant to be all-encompassing, or offer complete detail, and they need not be numerical. For example, the descriptor “Agricultural policy” might have such states as “Permaculture subsidy”, “Genetic engineering”, “Intensive farming” or “No policy”.

For each of these states, we ask our panel of experts whether such a state would increase or decrease the tendency for some other descriptor to be in a particular state.

For example, we might ask: “On a scale from -3 to 3, how much does the agricultural policy of Intensive farming increase the probability that we will see global temperature increases of +2 degrees?”

By combining the opinions of a variety of experts in each field, and weighting based on certainty and expertise, we are able to construct matrices, much like the one below:

The above matrix is a description of my ant farm. The health of my colony is determined by the population, income, and education levels of my ants. For a less ant focused version of the above, please refer to:

• Elisabeth A. Lloyd and Vanessa J. Schweizer, Objectivity and a comparison of methodological scenario approaches for climate change research, Synthese (2013).

For any possible combination of descriptor states (referred to as a scenario) we can calculate the total impact on all possible descriptors. In the current scenario we have low population, high income and medium education (see highlighted rows).

Because the current scenario has high ant income, this strongly influences us to have low population (+3) and prevents a jump to high population (-3). This combined with the non-influence from education (zeros) leads to low population being the most favoured state for our population descriptor. Thus we expect no change. We say this is “consistent”.

Education however sees a different story. Here we have a strong influence towards high education levels (summing the column gives a total of 13). Thus our current state (medium education) is inconsistent, and we would expect the abundance of ant wealth to lead to an improvements in the ant schooling system.

Classical CIB analysis acts as a way to classify which hypothetical situations are consistent, and which are not.

Now, it is all well and good to claim that some scenarios are stable, but the real use of such a tool is in predicting (and influencing) the future.

By applying a deterministic rule that determines how inconsistencies are resolved, we can produce a “succession rule”. The most straight-forward example is to replace all descriptor states with whichever state is most favoured by the current scenario. In the example above we would switch to “Low population, medium income, high education”. A generation later we would switch back to “Low population, High income, medium education”, soon finding ourselves trapped in a loop.

All such rules will always lead to either a loop or a “sink”: a self consistent scenario which is succeeded only by itself.

So, how can we use this? How will this help us deal with the wrath of the gods (or ant farms)?

Firstly: we can identify loops and consistent scenarios which we believe are most favourable. It’s all well and good imagining some future utopia, but if it is inconsistent with itself, and will immediately lead to a slide into less favourable scenarios then we should not aim for it, we should find that most favourable realistic scenario and aim for that one.

Secondly: We can examine all our consistent scenarios, and determine whose “basin of attraction” we find ourselves in: that is, which scenario are we likely to end up in.

Thirdly: Suppose we could change our influence matrix slightly? How would we change it to favour scenarios we most prefer? If you don’t like the rules, change the game—or at the very least find out WHAT we would need to change to have the best effect.

### Concerns and caveats

So… what are the problems we might encounter? What are the drawbacks?

Well, first of all, we note that the real world does not tend to reach any form of eternal static scenario or perfect cycle. The fact that our model does might be regarded as reason for suspicion.

Secondly, although the classical method contains succession analysis, this analysis is not necessarily intended as a completely literal “prediction” of events. It gives a rough idea of the basins of attraction of our cycles and consistent scenarios, but is also somewhat arbitrary. What succession rule is most appropriate? Do all descriptors update simultaneously? Or only the one with the most “pressure”? Are our descriptors given in order of malleability, and only the fastest changing descriptor will change?

Thirdly, in collapsing our description of the world down into a finite number of states we are ignoring many tiny details. Most of these details are not important, but in assuming that our succession rules are deterministic, we imply that these details have no impact whatsoever.

If we instead treat succession as a somewhat random process, the first two of these problems can be solved, and the third somewhat reduced.

### Stochastic succession

In the classical CIB succession analysis, some rule is selected which deterministically decides which scenario follows from the present. Stochastic succession analysis instead tells us the probability that a given scenario will lead to another.

The simplest example of a stochastic succession rule is to simply select a single descriptor at random each time step, and only consider updates that might happen to that descriptor. This we refer to as dice succession. This (in some ways) represents hidden information: two systems that might look identical on the surface from the point of view of our very blockish CIB analysis might be different enough underneath to lead to different outcomes. If we have a shaky agricultural system, but a large amount of up-and-coming research, then which of these two factors becomes important first is down to the luck of the draw. Rather than attempt to model this fine detail, we instead merely accept it and incorporate this uncertainty into our model.

Even this most simplistic change leads to dramatics effects on our system. Most importantly, almost all cycles vanish from our results, as forks in the road allow us to diverge from the path of the cycle.

We can take stochastic succession further and consider more exotic rules for our transitions, ones that allow any transition to take place, not merely those that are most favored. For example:

$P(x,y) = A e^{I_x(y)/T}$

Here $x$ is our current scenario, $y$ is some possible future scenario, and $I_x(y)$ is the total impact score of $y$ from the perspective of $x$. $A$ is a simple normalizing constant, and $T$ is our system’s temperature. High temperature systems are dominated by random noise, while low temperature systems are dominated by the influences described by our experts.

Impact score is calculated by summing the impact of each state of our current scenario, on each state of our target scenario. For example, for the above, suppose we want to find $I_x(y)$ when $x$ is the given scenario “Low population, High income, medium education” and $y$ was the scenario “Medium population, medium income, High education”. We consider all numbers that are in rows which were states of $x$ and in columns that are states of $y$. This would give:

$I_x(y)= (0+0+0) + (-2 +0 +10) +(6+7+0) = 21$

Here each bracket refers to the sum of a particular column.
More generically we can write the formula as:

$\displaystyle{ I_x(y)= \sum_{i \subset x, \;j \subset y} M_{i,j} }$

Here $M_{i,j}$ refers to an entry in our cross-impact balance matrix, $i$ and $j$ are both states, and $i \subset x$ reads as “$i$ is a state of $x$”.

We refer to this function for computing transition probabilities as the Boltzmann succession law, due to its similarity to the Boltzmann distribution found in physics. We use it merely as an example, and by no means wish to imply that we expect the transitions for our true system to act in a precisely Boltzmann-like manner. Alternative functions can, and should, be experimented with. The Boltzmann succession law is however an effective example and has a number of nice properties: $P(x,y)$ is always positive, unchanged by adding a constant to every element of the cross-impact balance matrix, contains adjustable parameters, and unbounded above.

The Boltzmann succession rule is what I will refer to as fully stochastic: it allows transitions even against our experts’ judgement (with low probability). This is in contrast to dice succession which picks a direction at random, but still contains scenarios from which our system can not escape.

### Effects of stochastic succession

‘Partially stochastic’ processes such as the dice rule have very limited effect on the long term behavior of the model. Aside from removing most cycles, they behave almost exactly like our deterministic succession rules. So, let us instead discuss the more interesting fully stochastic succession rules.

In the fully stochastic system we can ask “after a very long time, what is the probability we will be in scenario $x$?”

By asking this question we can get some idea of the relative importance of all our future scenarios and states.

For example, if the scenario “high population, low education, low income” has a 40% probability in the long term, while most other scenarios have a probability of 0.2%, we can see that this scenario is crucial to the understanding of our system. Often scenarios already identified by deterministic succession analysis are the ones with the greatest long term probability—but by looking at long term probability we also gain information about the relative importance of each scenario.

In addition, we can encounter scenarios which are themselves inconsistent, but form cycles and/or clusters of interconnected scenarios. We can also notice scenarios that while technically ‘consistent’ in the deterministic rules are only barely so, and have limited weight due to a limited basin of attraction. We might identify scenarios that seem familiar in the real world, but are apparently highly unlikely in our analysis, indicating either that we should expect change… or perhaps suggesting a missing descriptor or a cross-impact in need of tweaking.

Armed with such a model, we can investigate what we can do to increase the short term and long term likelihood of desirable scenarios, and decrease the likelihood of undesirable scenarios.

### Some further reading

As a last note, here are a few freely available resources that may prove useful. For a more formal introduction to CIB, try:

• Wolfgang Weimer-Jehle, Cross-impact balances: a system-theoretical approach to cross-impact analysis, Technological Forecasting & Social Change 73 (2006), 334–361.

• Wolfgang Weimer-Jehle, Properties of cross-impact balance analysis.

You can find free software for doing a classical CIB analysis here:

• ZIRIUS, ScenarioWizard.

ZIRIUS is the Research Center for Interdisciplinary Risk and Innovation Studies of the University of Stuttgart.

Here are some examples of CIB in action:

• Gerhard Fuchs, Ulrich Fahl, Andreas Pyka, Udo Staber, Stefan Voegele and Wolfgang Weimer-Jehle, Generating innovation scenarios using the cross-impact methodology, Department of Economics, University of Bremen, Discussion-Papers Series No. 007-2008.

• Ortwin Renn, Alexander Jager, Jurgen Deuschle and Wolfgang Weimer-Jehle, A normative-functional concept of sustainability and its indicators, International Journal of Global Environmental Issues, 9 (2008), 291–317.

Finally, this page contains a more complete list of articles, both practical and theoretical:

## Logic, Probability and Reflection

26 December, 2013

Last week I attended the Machine Intelligence Research Institute’s sixth Workshop on Logic, Probability, and Reflection. This one was in Berkeley, where the institute has its headquarters.

You may know this institute under their previous name: the Singularity Institute. It seems to be the brainchild of Eliezer Yudkowsky, a well-known advocate of ‘friendly artificial intelligence’, whom I interviewed in week311, week312 and week313 of This Week’s Finds. He takes an approach to artificial intelligence that’s heavily influenced by mathematical logic, and I got invited to the workshop because I blogged about a paper he wrote with Mihaly Barasz, Paul Christiano and Marcello Herresho ff on probability theory and logic.

I only have the energy to lay the groundwork for a good explanation of what happened in the workshop. So, after you read my post, please read this:

• Benja Fallenstein, Results from MIRI’s December workshop, Less Wrong, 28 December 2013.

The workshop had two main themes, so let me tell you what they were.

### Scientific induction in mathematics

The first theme is related to that paper I just mentioned. How should a rational agent assign probabilities to statements in mathematics? Of course an omniscient being could assign

probability 1 to every mathematical statement that’s provable,

probability 0 to every statement whose negation is provable,

and

to every statement that is neither provable nor disprovable.

But a real-world rational agent will never have time to check all proofs, so there will always be lots of statements it’s not sure about. Actual mathematicians always have conjectures, like the Twin Prime Conjecture, that we consider plausible even though nobody has proved them. And whenever we do research, we’re constantly estimating how likely it is for statements to be true, and changing our estimates as new evidence comes in. In other words, we use scientific induction in mathematics.

How could we automate this? Most of us don’t consciously assign numerical probabilities to mathematical statements. But maybe an AI mathematician should. If so, what rules should it follow?

It’s natural to try a version of Solomonoff induction, where our probability estimate, before any evidence comes in, favors statements that are simple. However, this runs up against problems. If you’re interested in learning more about this, try:

• Jeremy Hahn, Scientific induction in probabilistic mathematics.

It’s a summary of ideas people came up with during the workshop. I would like to explain them sometime, but for now I should move on.

### The Löbian obstacle

The second main theme was the ‘Löbian obstacle’. Löb’s theorem is the flip side of Gödel’s first incompleteness theorem, less famous but just as shocking. It seems to put limitations on how much a perfectly rational being can trust itself.

Since it’s the day after Christmas, let’s ease our way into these deep waters with the Santa Claus paradox, also known as Curry’s paradox.

If you have a child who is worried that Santa Claus might not exist, you can reassure them using this sentence:

If this sentence is true, Santa Claus exists.

Call it P, for short.

Assume, for the sake of argument, that P is true. Then what it says is true: “If P is true, Santa Claus exists.” And we’re assuming P is true. So, Santa Claus exists.

So, we’ve proved that if P is true, Santa Claus exists.

But that’s just what P says!

So, P is true.

So, Santa Claus exists!

There must be something wrong about this argument, even if Santa Claus does exist, because if it were valid you could you use it to prove anything at all. The self-reference is obviously suspicious. The sentence in question is a variant of the Liar Paradox:

This sentence is false.

since we can rewrite the Liar Paradox as

If this sentence is true, 0 = 1.

and then replace “0=1″ by any false statement you like.

However, Gödel figured out a way to squeeze solid insights from these dubious self-referential sentences. He did this by creating a statement in the language of arithmetic, referring to nothing but numbers, which nonetheless manages to effectively say

This sentence is unprovable.

If it were provable, you’d get a contradiction! So, either arithmetic is inconsistent or this sentence is unprovable. But if it’s unprovable, it’s true. So, there are true but unprovable statements in arithmetic… unless arithmetic is inconsistent! This discovery shook the world of mathematics.

Here I’m being quite sloppy, just to get the idea across.

For one thing, when I’m saying ‘provable’, I mean provable given some specific axioms for arithmetic, like the Peano axioms. If we change our axioms, different statements will be provable.

For another, the concept of ‘true’ statements in arithmetic is often shunned by logicians. That may sound shocking, but there are many reasons for this: for example, Tarski showed that the truth of statements about arithmetic is undefinable in arithmetic. ‘Provability’ is much easier to deal with.

So, a better way of thinking about Gödel’s result is that he constructed a statement that is neither provable nor disprovable from Peano’s axioms of arithmetic, unless those axioms are inconsistent (in which case we can prove everything, but it’s all worthless).

Furthermore, this result applies not just to Peano’s axioms but to any stronger set of axioms, as long as you can write a computer program to list those axioms.

In 1952, the logician Leon Henkin flipped Gödel’s idea around and asked about a sentence in the language of arithmetic that says:

This sentence is provable.

He asked: is this provable or not? The answer is much less obvious than for Gödel’s sentence. Play around with it and see what I mean.

But in 1954, Martin Hugo Löb showed that Henkin’s sentence is provable!

And Henkin noticed something amazing: Löb’s proof shows much more.

At this point it pays to become a bit more precise. Let us write $\mathrm{PA} \vdash P$ to mean the statement $P$ is provable from the Peano axioms of arithmetic. Gödel figured out how to encode statements in arithmetic as numbers, so let’s write $\# P$ for the Gödel number of any statement $P.$ And Gödel figured out how to write a statement in arithmetic, say

$\mathrm{Provable}(n)$

which says that the statement with Gödel number $n$ is provable using the Peano axioms.

Using this terminology, what Henkin originally did was find a number $n$ such that the sentence

$\mathrm{Provable}(n)$

has Gödel number $n.$ So, this sentence says

This sentence is provable from the Peano axioms of arithmetic.

What Löb did was show

$\mathrm{PA} \vdash \mathrm{Provable}(n)$

In other words, he showed that Henkin sentence really is provable from the Peano axioms!

What Henkin then did is prove that for any sentence $P$ in the language of arithmetic, if

$\mathrm{PA} \vdash \mathrm{Provable}(\# P) \implies P$

then

$\mathrm{PA} \vdash P$

In other words, suppose we can prove that the provability of $P$ implies $P.$ Then we can prove $P$!

At first this merely sounds nightmarishly complicated. But if you think about it long enough, you’ll see it’s downright terrifying! For example, suppose $P$ is some famous open question in arithmetic, like the Twin Prime Conjecture. You might hope to prove

The provability of the Twin Prime Conjecture implies the Twin Prime Conjecture.

Indeed, that seems like a perfectly reasonable thing to want. But it turns out that proving this is as hard as proving the Twin Prime Conjecture! Why? Because if we can prove the boldface sentence above, Löb and Henkin’s work instantly gives us a proof of Twin Prime Conjecture!

What does all this have to do with artificial intelligence?

Well, what I just said is true not only for Peano arithmetic, but any set of axioms including Peano arithmetic that a computer program can list. Suppose your highly logical AI mathematician has some such set of axioms, say $\mathrm{AI}.$ You might want it to trust itself. In other words, you might want

$\mathrm{AI} \vdash \mathrm{Provable}(\# P) \implies P$

for every sentence $P.$ This says, roughly, that whatever the AI can prove it can prove, it can prove.

But then Löb’s theorem would kick in and give

$\mathrm{AI} \vdash P$

for every sentence $P.$ And this would be disastrous: our AI would be inconsistent, because it could prove everything!

This is just the beginning of the problems. It gets more involved when we consider AI’s that spawn new AI’s and want to trust them. For more see:

• Eliezer Yudkowsky and Marcello Herreshoff, Tiling agents for self-modifying AI, and the Löbian obstacle.

At workshop various people made progress on this issue, which is recorded in these summaries:

• Eliezer Yudkowsky, The procrastination paradox.

Abstract. A theorem by Marcello Herresho , Benja Fallenstein, and Stuart Armstrong shows that if there exists an infinite series of theories $T_i$ extending $\mathrm{PA}$ where each $T_i$ proves the soundness of $T_{i+1}$, then all the $T_i$ must have only nonstandard models. We call this the Procrastination Theorem for reasons which will become apparent.

Here Fallenstein constructs a di fferent sequence of theories $T_i$ extending Peano arithmetic such that each $T_i$ proves the consistency of $T_{i+1},$ and all the theories are sound for $\Pi_1$ sentences—that is, sentences with only one $\forall$ quantifier outside the rest of the stuff.

The following summaries would take more work to explain:

• Nate Soares, Fallenstein’s monster.

• Nisan Stiennon, Recursively-defined logical theories are well-defined.

• Benja Fallenstein, The 5-and-10 problem and the tiling agents formalism.

Again: before reading any of these summaries, I urge you to read Benja Fallenstein’s post, which will help you understand them!

## Quantum Network Theory (Part 2)

13 August, 2013

guest post by Tomi Johnson

Last time I told you how a random walk called the ‘uniform escape walk’ could be used to analyze a network. In particular, Google uses it to rank nodes. For the case of an undirected network, the steady state of this random walk tells us the degrees of the nodes—that is, how many edges come out of each node.

Now I’m going to prove this to you. I’ll also exploit the connection between this random walk and a quantum walk, also introduced last time. In particular, I’ll connect the properties of this quantum walk to the degrees of a network by exploiting its relationship with the random walk.

This is pretty useful, considering how tricky these quantum walks can be. As the parts of the world that we model using quantum mechanics get bigger and have more complicated structures, like biological network, we need all the help in understanding quantum walks that we can get. So I’d better start!

### Flashback

Starting with any (simple, connected) graph, we can get an old-fashioned ‘stochastic’ random walk on this graph, but also a quantum walk. The first is the uniform escape stochastic walk, where the walker has an equal probability per time of walking along any edge leaving the node they are standing at. The second is the related quantum walk we’re going to study now. These two walks are generated by two matrices, which we called $S$ and $Q.$ The good thing is that these matrices are similar, in the technical sense.

We studied this last time, and everything we learned is summarized here:

where:

$G$ is a simple graph that specifies

$A$ the adjacency matrix (the generator of a quantum walk) with elements $A_{i j}$ equal to unity if nodes $i$ and $j$ are connected, and zero otherwise ($A_{i i} = 0$), which subtracted from

$D$ the diagonal matrix of degrees $D_{i i} = \sum_j A_{i j}$ gives

$L = D - A$ the symmetric Laplacian (generator of stochastic and quantum walks), which when normalized by $D$ returns both

$S = L D^{-1}$ the generator of the uniform escape stochastic walk and

$Q = D^{-1/2} L D^{-1/2}$ the quantum walk generator to which it is similar!

Now I hope you remember where we are. Next I’ll talk you through the mathematics of the uniform escape stochastic walk $S$ and how it connects to the degrees of the nodes in the large-time limit. Then I’ll show you how this helps us solve aspects of the quantum walk generated by $Q.$

### Stochastic walk

The uniform escape stochastic walk generated by $S$ is popular because it has a really useful stationary state.

To recap from Part 20 of the network theory series, a stationary state of a stochastic walk is one that does not change in time. By the master equation

$\displaystyle{ \frac{d}{d t} \psi(t) = -S \psi(t)}$

the stationary state must be an eigenvector of $S$ with eigenvalue $0.$

A fantastic pair of theorems hold:

• There is always a unique (up to multiplication by a positive number) positive eigenvector $\pi$ of $S$ with eigenvalue $0.$ That is, there is a unique stationary state $\pi.$

• Regardless of the initial state $\psi(0),$ any solution of the master equation approaches this stationary state $\pi$ in the large-time limit:

$\displaystyle{ \lim_{t \rightarrow \infty} \psi(t) = \pi }$

To find this unique stationary state, consider the Laplacian $L,$ which is both infinitesimal stochastic and symmetric. Among other things, this means the rows of $L$ sum to zero:

$\displaystyle{ \sum_j L_{i j} = 0 }$

Thus, the ‘all ones’ vector $\mathbf{1}$ is an eigenvector of $L$ with zero eigenvalue:

$L \mathbf{1} = 0$

Inserting the identity $I = D^{-1} D$ into this equation we then find $D \mathbf{1}$ is a zero eigenvector of $S$:

$L \mathbf{1} = ( L D^{-1} ) ( D \mathbf{1} ) = S ( D \mathbf{1} ) = 0$

Therefore we just need to normalize this to get the large-time stationary state of the walk:

$\displaystyle{ \pi = \frac{D \mathbf{1}}{\sum_i D_{i i}} }$

If we write $i$ for the basis vector that is zero except at the ith node of our graph, and 1 at that node, the inner product $\langle i , \pi \rangle$ is large-time probability of finding a walker at that node. The equation above implies this is proportional to the degree $D_{i i}$ of node $i.$

We can check this for the following graph:

We find that $\pi$ is

$\displaystyle{ \left( \begin{matrix} 1/6 \\ 1/6 \\ 1/4 \\ 1/4 \\ 1/6 \end{matrix} \right) }$

which implies large-time probability $1/6$ for nodes $1,$ $2$ and $5,$ and $1/4$ for nodes $3$ and $4.$ Comparing this to the original graph, this exactly reflects the arrangement of degrees, as we knew it must.

Math works!

### The quantum walk

Next up is the quantum walk generated by $Q.$ Not a lot is known about quantum walks on networks of arbitrary geometry, but below we’ll see some analytical results are obtained by exploiting the similarity of $S$ and $Q.$

Where to start? Well, let’s start at the bottom, what quantum physicists call the ground state. In contrast to stochastic walks, for a quantum walk every eigenvector $\phi_k$ of $Q$ is a stationary state of the quantum walk. (In Puzzle 5, at the bottom of this page, I ask you to prove this). The stationary state $\phi_0$ is of particular interest physically and mathematically. Physically, since eigenvectors of the $Q$ correspond to states of well-defined energy equal to the associated eigenvalue, $\phi_0$ is the state of lowest energy, energy zero, hence the name ‘ground state’. (In Puzzle 3, I ask you to prove that all eigenvalues of $Q$ are non-negative, so zero really does correspond to the ground state.)

Mathematically, the relationship between eigenvectors implied by the similarity of $S$ and $Q$ means

$\phi_0 \propto D^{-1/2} \pi \propto D^{1/2} \mathbf{1}$

So in the ground state, the probability of our quantum walker being found at node $i$ is

$| \langle i , \phi_0 \rangle |^2 \propto | \langle i , D^{1/2} \rangle \mathbf{1} |^2 = D_{i i}$

Amazingly, this probability is proportional to the degree and so is exactly the same as $\langle i , \pi \rangle,$ the probability in the stationary state $\pi$ of the stochastic walk!

In short: a zero energy quantum walk $Q$ leads to exactly the same distribution of the walker over the nodes as in the large-time limit of the uniform escape stochastic walk $S.$ The classically important notion of degree distribution also plays a role in quantum walks!

This is already pretty exciting. What else can we say? If you are someone who feels faint at the sight of quantum mechanics, well done for getting this far, but watch out for what’s coming next.

What if the walker starts in some other initial state? Is there some quantum walk analogue of the unique large-time state of a stochastic walk?

In fact, the quantum walk in general does not converge to a stationary state. But there is a probability distribution that can be thought to characterize the quantum walk in the same way as the large-time state characterizes the stochastic walk. It’s the large-time average probability vector $P.$

If you didn’t know the time that had passed since the beginning of a quantum walk, then the best estimate for the probability of your measuring the walker to be at node $i$ would be the large-time average probability

$\displaystyle{ \langle i , P \rangle = \lim_{T \rightarrow \infty} \frac{1}{T} \int_0^T | \psi_i (t) |^2 d t }$

There’s a bit that we can do to simplify this expression. As usual in quantum mechanics, let’s start with the trick of diagonalizing $Q.$ This amounts to writing

$\displaystyle{ Q= \sum_k \epsilon_k \Phi_k }$

where $\Phi_k$ are projectors onto the eigenvectors $\phi_k$ of $Q,$ and $\epsilon_k$ are the corresponding eigenvalues of $Q.$ If we insert this equation into

$\psi(t) = e^{-Q t} \psi(0)$

we get

$\displaystyle{ \psi(t) = \sum_k e^{-\epsilon_k t} \Phi_k \psi(0) }$

and thus

$\displaystyle{ \langle i , P \rangle = \lim_{T \rightarrow \infty} \frac{1}{T} \int_0^T | \sum_k e^{-i \epsilon_k t} \langle i, \Phi_k \psi (0) \rangle |^2 d t }$

Due to the integral over all time, the interference between terms corresponding to different eigenvalues averages to zero, leaving:

$\displaystyle{ \langle i , P \rangle = \sum_k | \langle i, \Phi_k \psi(0) \rangle |^2 }$

The large-time average probability is then the sum of terms contributed by the projections of the initial state onto each eigenspace.

So we have a distribution that characterizes a quantum walk for a general initial state, but it’s a complicated beast. What can we say about it?

Our best hope of understanding the large-time average probability is through the term $| \langle i, \Phi_0 \psi (0) \rangle |^2$ associated with the zero energy eigenspace, since we know everything about this space.

For example, we know the zero energy eigenspace is one-dimensional and spanned by the eigenvector $\phi_0.$ This means that the projector is just the usual outer product

$\Phi_0 = | \phi_0 \rangle \langle \phi_0 | = \phi_0 \phi_0^\dagger$

where we have normalized $\phi_0$ according to the inner product $\langle \phi_0, \phi_0\rangle = 1.$ (If you’re wondering why I’m using all these angled brackets, well, they’re a notation named after Dirac that is adored by quantum physicists.)

The zero eigenspace contribution to the large-time average probability then breaks nicely into two:

$\begin{array}{ccl} | \langle i, \Phi_0 \psi (0) \rangle |^2 &=& | \langle i, \phi_0\rangle \; \langle \phi_0, \psi (0) \rangle |^2 \\ \\ &=& | \langle i, \phi_0\rangle |^2 \; | \langle \phi_0 , \psi (0) \rangle |^2 \\ \\ &=& \langle i , \pi \rangle \; | \langle \phi_0 , \psi (0) \rangle |^2 \end{array}$

This is just the product of two probabilities:

• first, the probability $\langle i , \pi \rangle$ for a quantum state in the zero energy eigenspace to be at node $i,$ as we found above,

and

• second, the probability $| \langle \phi_0, \psi (0)\rangle |^2$ of being in this eigenspace to begin with. (Remember, in quantum mechanics the probability of measuring the system to have an energy is the modulus squared of the projection of the state onto the associated eigenspace, which for the one-dimensional zero energy eigenspace means just the inner product with the ground state.)

This is all we need to say something interesting about the large-time average probability for all states. We’ve basically shown that we can break the large-time probability vector $P$ into a sum of two normalized probability vectors:

$P = (1- \eta) \pi + \eta \Omega$

the first $\pi$ being the stochastic stationary state associated with the zero energy eigenspace, and the second $\Omega$ associated with the higher energy eigenspaces, with

$\displaystyle{ \langle i , \Omega \rangle = \frac{ \sum_{k\neq 0} | \langle i, \Phi_k \psi (0) \rangle |^2 }{ \eta} }$

The weight of each term is governed by the parameter

$\eta = 1 - | \langle \phi_0, \psi (0)\rangle |^2$

which you could think of as the quantumness of the result. This is one minus the probability of the walker being in the zero energy eigenspace, or equivalently the probability of the walker being outside the zero energy eigenspace.

So even if we don’t know $\Omega,$ we know its importance is controlled by a parameter $\eta$ that governs how close the large-time average distribution $P$ of the quantum walk is to the corresponding stochastic stationary distribution $\pi.$

What do we mean by ‘close’? Find out for yourself:

Puzzle 1. Show, using a triangle inequality, that the trace distance between the two characteristic stochastic and quantum distributions $\{ \langle i , P \rangle \}_i$ and $\{ \langle i , \pi \rangle \}_i$ is upper-bounded by $2 \eta.$

Can we say anything physical about when the quantumness $\eta$ is big or small?

Because the eigenvalues of $Q$ have a physical interpretation in terms of energy, the answer is yes. The quantumness $\eta$ is the probability of being outside the zero energy state. Call the next lowest eigenvalue $\Delta = \min_{k \neq 0} \epsilon_k$ the energy gap. If the quantum walk is not in the zero energy eigenspace then it must be in an eigenspace of energy greater or equal to $\Delta.$ Therefore the expected energy $E$ of the quantum walker must bound the quantumness $E \ge \eta \Delta.$

This tells us that a quantum walk with a low energy is similar to a stochastic walk in the large-time limit. We already knew this was exactly true in the zero energy limit, but this result goes further.

So little is known about quantum walks on networks of arbitrary geometry that we were very pleased to find this result. It says there is a special case in which the walk is characterized by the degree distribution of the network, and a clear physical parameter that bounds how far the walk is from this special case.

Also, in finding it we learned that the difficulties of the initial state dependence, enhanced by the lack of convergence to a stationary state, could be overcome for a quantum walk, and that the relationships between quantum and stochastic walks extend beyond those with shared generators.

### What next?

That’s all for the latest bit of idea sharing at the interface between stochastic and quantum systems.

I hope I’ve piqued your interest about quantum walks. There’s so much still left to work out about this topic, and your help is needed!

Other questions we have include: What holds analytically about the form of the quantum correction? Numerically it is known that the so-called quantum correction $\Omega$ tends to enhance the probability of being found on nodes of low degree compared to $\pi.$ Can someone explain why? What happens if a small amount of stochastic noise is added to a quantum walk? Or a lot of noise?

It’s difficult to know who is best placed to answer these questions: experts in quantum physics, graph theory, complex networks or stochastic processes? I suspect it’ll take a bit of help from everyone.

A couple of textbooks with comprehensive sections on non-negative matrices and continuous-time stochastic processes are:

• Peter Lancaster and Miron Tismenetsky, The Theory of Matrices: with Applications, 2nd edition, Academic Press, San Diego, 1985.

• James R. Norris, Markov Chains, Cambridge University Press, Cambridge, 1997.

There is, of course, the book that arose from the Azimuth network theory series, which considers several relationships between quantum and stochastic processes on networks:

• John Baez and Jacob Biamonte, A Course on Quantum Techniques for Stochastic Mechanics, 2012.

Another couple of books on complex networks are:

• Mark Newman, Networks: An Introduction, Oxford University Press, Oxford, 2010.

• Ernesto Estrada, The Structure of Complex Networks: Theory and Applications, Oxford University Press, Oxford, 2011. Note that the first chapter is available free online.

There are plenty more useful references in our article on this topic:

• Mauro Faccin, Tomi Johnson, Jacob Biamonte, Sabre Kais and Piotr Migdał, Degree distribution in quantum walks on complex networks.

### Puzzles for the enthusiastic

Sadly I didn’t have space to show proofs of all the theorems I used. So here are a few puzzles that guide you to doing the proofs for yourself:

#### Stochastic walks and stationary states

Puzzle 2. (For the hard core.) Prove there is always a unique positive eigenvector for a stochastic walk generated by $S.$ You’ll need the assumption that the graph $G$ is connected. It’s not simple, and you’ll probably need help from a book, perhaps one of those above by Lancaster and Tismenetsky, and Norris.

Puzzle 3. Show that the eigenvalues of $S$ (and therefore $Q$) are non-negative. A good way to start this proof is to apply the Perron-Frobenius theorem to the non-negative matrix $M = - S + I \max_i S_{i i}.$ This implies that $M$ has a positive eigenvalue $r$ equal to its spectral radius

$r = \max_k | \lambda_k |$

where $\lambda_k$ are the eigenvalues of $M,$ and the associated eigenvector $v$ is positive. Since $S = - M + I \max_i S_{i i},$ it follows that $S$ shares the eigenvectors of $M$ and the associated eigenvalues are related by inverted translation:

$\epsilon_k = - \lambda_k + \max_i S_{i i}$

Puzzle 4. Prove that regardless of the initial state $\psi(0),$ the zero eigenvector $\pi$ is obtained in the large-time limit $\lim_{t \rightarrow \infty} \psi(t) = \pi$ of the walk generated by $S.$ This breaks down into two parts:

(a) Using the approach from Puzzle 5, to show that $S v = \epsilon_v v,$ the positivity of $v$ and the infinitesimal stochastic property $\sum_i S_{i j} = 0$ imply that $\epsilon_v = \epsilon_0 = 0$ and thus $v = \pi$ is actually the unique zero eigenvector and stationary state of $S$ (its uniqueness follows from puzzle 4, you don’t need to re-prove it).

(b) By inserting the decomposition $S = \sum_k \epsilon_k \Pi_k$ into $e^{-S t}$ and using the result of puzzle 5, complete the proof.

(Though I ask you to use the diagonalizability of $S,$ the main results still hold if the generator is irreducible but not diagonalizable.)

#### Quantum walks

Here are a couple of extra puzzles for those of you interested in quantum mechanics:

Puzzle 5. In quantum mechanics, probabilities are given by the moduli squared of amplitudes, so multiplying a state by a number of modulus unity has no physical effect. By inserting

$\displaystyle{ Q= \sum_k \epsilon_k \Phi_k }$

into the quantum time evolution matrix $e^{-Q t},$ show that if

$\psi(0) = \phi_k$

then

$\psi(t) = e^{ - i \epsilon_k t} \psi(0)$

hence $\phi_k$ is a stationary state in the quantum sense, as probabilities don’t change in time.

Puzzle 6. By expanding the initial state $\psi(0)$ in terms of the complete orthogonal basis vectors $\phi_k$ show that for a quantum walk $\psi(t)$ never converges to a stationary state unless it began in one.

## Monte Carlo Methods in Climate Science

23 July, 2013

joint with David Tweed

One way the Azimuth Project can help save the planet is to get bright young students interested in ecology, climate science, green technology, and stuff like that. So, we are writing an article for Math Horizons, an American magazine for undergraduate math majors. This blog article is a draft of that. You can also see it in PDF form here.

We’d really like to hear your comments! There are severe limits on including more detail, since the article should be easy to read and short. So please don’t ask us to explain more stuff: we’re most interested to know if you sincerely don’t understand something, or feel that students would have trouble understanding something. For comparison, you can see sample Math Horizons articles here.

### Introduction

They look placid lapping against the beach on a calm day, but the oceans are actually quite dynamic. The ocean currents act as ‘conveyor belts’, transporting heat both vertically between the water’s surface and the depths and laterally from one area of the globe to another. This effect is so significant that the temperature and precipitation patterns can change dramatically when currents do.

For example: shortly after the last ice age, northern Europe experienced a shocking change in climate from 10,800 to 9,500 BC. At the start of this period temperatures plummeted in a matter of decades. It became 7° Celsius colder, and glaciers started forming in England! The cold spell lasted for over a thousand years, but it ended as suddenly as it had begun.

Why? The most popular theory is that that a huge lake in North America formed by melting glaciers burst its bank—and in a massive torrent lasting for years, the water from this lake rushed out to the northern Atlantic ocean. By floating atop the denser salt water, this fresh water blocked a major current: the Atlantic Meridional Overturning Circulation. This current brings warm water north and helps keep northern Europe warm. So, when iit shut down, northern Europe was plunged into a deep freeze.

Right now global warming is causing ice sheets in Greenland to melt and release fresh water into the North Atlantic. Could this shut down the Atlantic Meridional Overturning Circulation and make the climate of Northern Europe much colder? In 2010, Keller and Urban [KU] tackled this question using a simple climate model, historical data, probability theory, and lots of computing power. Their goal was to understand the spectrum of possible futures compatible with what we know today.

Let us look at some of the ideas underlying their work.

### Box models

The earth’s physical behaviour, including the climate is far too complex to simulate from the bottom up using basic physical principles, at least for now. The most detailed models today can take days to run on very powerful computers. So to make reasonable predictions on a laptop in a tractable time-frame, geophysical modellers use some tricks.

First, it is possible to split geophysical phenomena into ‘boxes’ containing strongly related things. For example: atmospheric gases, particulate levels and clouds all affect each other strongly; likewise the heat content, currents and salinity of the oceans all interact strongly. However, the interactions between the atmosphere and the oceans are weaker, and we can approximately describe them using just a few settings, such as the amount of atmospheric CO2 entering or leaving the oceans. Clearly these interactions must be consistent—for example, the amount of CO2 leaving the atmosphere box must equal the amount entering the ocean box—but breaking a complicated system into parts lets different specialists focus on different aspects; then we can combine these parts and get an approximate model of entire planet. The box model used by Keller and Urban is shown in Figure 1.

1. The box model used by Keller and Urban.

Second, it turn out that simple but effective box models can be distilled from the complicated physics in terms of forcings and feedbacks. Essentially a forcing is a measured input to the system, such as solar radiation or CO2 released by burning fossil fuels. As an analogy, consider a child on a swing: the adult’s push every so often is a forcing. Similarly a feedback describes how the current ‘box variables’ influence future ones. In the swing analogy, one feedback is how the velocity will influence the future height. Specifying feedbacks typically uses knowledge of the detailed low-level physics to derive simple, tractable functional relationships between groups of large-scale observables, a bit like how we derive the physics of a gas by thinking about collisions of lots of particles.

However, it is often not feasible to get actual settings for the parameters in our model starting from first principles. In other words, often we can get the general form of the equations in our model, but they contain a lot of constants that we can estimate only by looking at historical data.

### Probability modeling

Suppose we have a box model that depends on some settings $S.$ For example, in Keller and Urban’s model, $S$ is a list of 18 numbers. To keep things simple, suppose the settings are element of some finite set. Suppose we also have huge hard disc full of historical measurements, and we want to use this to find the best estimate of $S.$ Because our data is full of ‘noise’ from other, unmodeled phenomena we generally cannot unambiguously deduce a single set of settings. Instead we have to look at things in terms of probabilities. More precisely, we need to study the probability that $S$ take some value $s$ given that the measurements take some value. Let’s call the measurements $M$, and again let’s keep things simple by saying $M$ takes values in some finite set of possible measurements.

The probability that $S = s$ given that $M$ takes some value $m$ is called the conditional probability $P(S=s | M=m).$ How can we compute this conditional probability? This is a somewhat tricky problem.

One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability that given setting values $S = s,$ the model predicts measurements $M=m.$ This again is a conditional probability, but now it is called $P(M=m|S=s).$

This is not what we want: it’s backwards! But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

$\displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }$

Here $P(S = s)$ is the probability that the settings take a specific value $s,$ and similarly for $P(M = m).$ Bayes’ rule is quite easy to prove, and it is actually a general rule that applies to any random variables, not just the settings and the measurements in our problem [Y]. It underpins most methods of figuring out hidden quantities from observed ones. For this reason, it is widely used in modern statistics and data analysis [K].

How does Bayes’ rule help us here? When we repeatedly run our model with randomly chosen settings, we have control over $P(S = s).$ As mentioned, we can compute $P(M=m| S=s).$ Finally, $P(M = m)$ is independent of our choice of settings. So, we can use Bayes’ rule to compute $P(S = s | M = m)$ up to a constant factor. And since probabilities must sum to 1, we can figure out this constant.

This lets us do many things. It lets us find the most likely values of the settings for our model, given our hard disc full of observed data. It also lets us find the probability that the settings lie within some set. This is important: if we’re facing the possibility of a climate disaster, we don’t just want to know the most likely outcome. We would like to know to know that with 95% probability, the outcome will lie in some range.

### An example

Let us look at an example much simpler than that considered by Keller and Urban. Suppose our measurements are real numbers $m_0,\dots, m_T$ related by

$m_{t+1} = s m_t - m_{t-1} + N_t$

Here $s,$ a real constant, is our ‘setting’, while $N_t$ is some ‘noise': an independent Gaussian random variable for each time $t,$ each with mean zero and some fixed standard deviation. Then the measurements $m_t$ will have roughly sinusoidal behavior but with irregularity added by the noise at each time step, as illustrated in Figure 2.

2. The example system: red are predicted measurements for a given value of the settings, green is another simulation for the same $s$ value and blue is a simulation for a slightly different $s.$

Note how there is no clear signal from either the curves or the differences that the green curve is at the correct setting value while the blue one has the wrong one: the noise makes it nontrivial to estimate $s.$ This is a baby version of the problem faced by Keller and Urban.

### Markov Chain Monte Carlo

Having glibly said that we can compute the conditional probability $P(M=m | S=s),$ how do we actually do this? The simplest way would be to run our model many, many times with the settings set at $S=s$ and determine the fraction of times it predicts measurements equal to $m.$ This gives us an estimate of $P(M=m | S=s).$ Then we can use Bayes’ rule to work out $P(M=m|S=s),$ at least up to a constant factor.

Doing all this by hand would be incredibly time consuming and error prone, so computers are used for this task. In our example, we do this in Figure 3. As we keep running our model over and over, the curve showing $P(M=m |S=s)$ as a function of $s$ settles down to the right answer.

3. The estimates of $P(M=m | S=s)$ as a function of $s$ using uniform sampling, ending up with 480 samples at each point.

However, this is computationally inefficient, as shown in the probability distribution for small numbers of samples. This has quite a few ‘kinks’, which only disappear later. The problem is that there are lots of possible choices of $s$ to try. And this is for a very simple model!

When dealing with the 18 settings involved in the model of Keller and Urban, trying every combination would take far too long. A way to avoid this is Markov Chain Monte Carlo sampling. Monte Carlo is famous for its casinos, so a ‘Monte Carlo’ algorithm is one that uses randomness. A ‘Markov chain’ is a random walk: for example, where you repeatedly flip a coin and take one step right when you get heads, and one step right when you get tails. So, in Markov Chain Monte Carlo, we perform a random walk through the collection of all possible settings, collecting samples.

The key to making this work is that at each step on the walk a proposed modification $s'$ to the current settings $s$ is generated randomly—but it may be rejected if it does not seem to improve the estimates. The essence of the rule is:

The modification $s \mapsto s'$ is randomly accepted with a probability equal to the ratio

$\displaystyle{ \frac{P(M=m | S=s')}{ P(M=m | S=s)} }$

Otherwise the walk stays at the current position.

If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from which can be used to compute $P(M=m | S=s).$ Then we can compute $P(S = s | M = m)$ using Bayes’ rule.

Figure 4 shows the results of using the Markov Chain Monte Carlo procedure to figure out $P(S= s| M= m)$ in our example.

4. The estimates of $P(S = s|M = m)$ curves using Markov Chain Monte Carlo, showing the current distribution estimate at increasing intervals. The red line shows the current position of the random walk. Again the kinks are almost gone in the final distribution.

Note that the final distribution has only peformed about 66 thousand simulations in total, while the full sampling peformed over 1.5 million. The key advantage of Markov Chain Monte Carlo is that it avoids performing many simulations in areas where the probability is low, as we can see from the way the walk path remains under the big peak in the probability density almost all the time. What is more impressive is that it achieves this without any global view of the probability density, just by looking at how $P(M=m | S=s)$ changes when we make small changes in the settings. This becomes even more important as we move to dealing with systems with many more dimensions and settings, where it proves very effective at finding regions of high probability density whatever their shape.

Why is it worth doing so much work to estimate the probability distribution for settings for a climate model? One reason is that we can then estimate probabilities of future events, such as the collapse of the Atlantic Meridional Ocean Current. And what’s the answer? According to Keller and Urban’s calculation, this current will likely weaken by about a fifth in the 21st century, but a complete collapse is unlikely before 2300. This claim needs to be checked in many ways—for example, using more detailed models. But the importance of the issue is clear, and we hope we have made the importance of good mathematical ideas for climate science clear as well.

### Exploring the topic

The Azimuth Project is a group of scientists, engineers and computer programmers interested in questions like this [A]. If you have questions, or want to help out, just email us. Versions of the computer programs we used in this paper will be made available here in a while.

Here are some projects you can try, perhaps with the help of Kruschke’s textbook [K]:

• There are other ways to do setting estimation using time series: compare some to MCMC in terms of accuracy and robustness.

• We’ve seen a 1-dimensional system with one setting. Simulate some multi-dimensional and multi-setting systems. What new issues arise?

Acknowledgements. We thank Nathan Urban and other
members of the Azimuth Project for many helpful discussions.

### References

[A] Azimuth Project, http://www.azimuthproject.org.

[KU] Klaus Keller and Nathan Urban, Probabilistic hindcasts and projections of the coupled climate, carbon cycle and Atlantic meridional overturning circulation system: a Bayesian fusion of century-scale measurements with a simple model, Tellus A 62 (2010), 737–750. Also available free online.

[K] John K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R and BUGS, Academic Press, New York, 2010.

[Y] Eliezer S. Yudkowsky, An intuitive explanation of Bayes’ theorem.

## Negative Probabilities

19 July, 2013

The physicists Dirac and Feynman, both bold when it came to new mathematical ideas, both said we should think about negative probabilities.

These days, Kolmogorov’s axioms for probabilities are used to justify formulating probability theory in terms of measure theory. Mathematically, the theory of measures that take negative or even complex values is well-developed. So, to the extent that probability theory is just measure theory, you can say a lot is known about negative probabilities.

But probability theory is not just measure theory; it adds its own distinctive ideas. To get these into the picture, we really need to ask some basic questions, like: what could it mean to say something had a negative chance of happening?

I really have no idea.

In this paper:

• Paul Dirac, The physical interpretation of quantum mechanics, Proc. Roy. Soc. London A 180 (1942), 1–39.

Dirac wrote:

Negative energies and probabilities should not be considered as nonsense. They are well-defined concepts mathematically, like a negative of money.

In fact, I think negative money could have been the origin of negative numbers. Venetian bankers started writing numbers in red to symbolize debts—hence the phrase ‘in the red’ for being in debt. So, you could say negative numbers were invented to formalize the idea of debt and make accounting easier. Bankers couldn’t really get rich if negative money didn’t exist.

A negative dollar is a dollar you owe someone. But how can you owe someone a probability?﻿ I haven’t figured this out.

Unsurprisingly, the clearest writing about negative probabilities that I’ve found is by Feynman:

• Richard P. Feynman, Negative probability, in Quantum Implications: Essays in Honour of David Bohm, eds. F. David Peat and Basil Hiley, Routledge & Kegan Paul Ltd, London, 1987, pp. 235–248.

He emphasizes that even if the final answer of a calculation must be positive, negative numbers are often allowed to appear in intermediate steps… and that this can happen with probabilities.

Let me quote some:

Some twenty years ago one problem we theoretical physicists had was that if we combined the principles of quantum mechanics and those of relativity plus certain tacit assumptions, we seemed only able to produce theories (the quantum field theories) which gave infinity for the answer to certain questions. These infinities are kept in abeyance (and now possibly eliminated altogether) by the awkward process of renormalization. In an attempt to understand all this better, and perhaps to make a theory which would give only finite answers from the start, I looked into the” tacit assumptions” to see if they could be altered.

One of the assumptions was that the probability for an event must always be a positive number. Trying to think of negative probabilities gave me cultural shock at first, but when I finally got easy with the concept I wrote myself a note so I wouldn’t forget my thoughts. I think that Prof. Bohm has just the combination of imagination and boldness to find them interesting and amusing. I am delighted to have this opportunity to publish them in such an appropriate place. I have taken the opportunity to add some further, more recent, thoughts about applications to two state systems.

Unfortunately I never did find out how to use the freedom of allowing probabilities to be negative to solve the original problem of infinities in quantum field theory!

It is usual to suppose that, since the probabilities of events must be positive, a theory which gives negative numbers for such quantities must be absurd. I should show here how negative probabilities might be interpreted. A negative number, say of apples, seems like an absurdity. A man starting a day with five apples who gives away ten and is given eight during the day has three left. I can calculate this in two steps: 5 -10 = -5 and -5 + 8 + 3. The final answer is satisfactorily positive and correct although in the intermediate steps of calculation negative numbers appear. In the real situation there must be special limitations of the time in which the various apples are received and given since he never really has a negative number, yet the use of negative numbers as an abstract calculation permits us freedom to do our mathematical calculations in any order simplifying the analysis enormously, and permitting us to disregard inessential details. The idea of negative numbers is an exceedingly fruitful mathematical invention. Today a person who balks at making a calculation in this way is considered backward or ignorant, or to have some kind of a mental block. It is the purpose of this paper to point out that we have a similar strong block against negative probabilities. By discussing a number of examples, I hope to show that they are entirely rational of course, and that their use simplifies calculation and thought in a number of applications in physics.

First let us consider a simple probability problem, and how we usually calculate things and then see what would happen if we allowed some of our normal probabilities in the calculations to be negative. Let us imagine a roulette wheel with, for simplicity, just three numbers: 1, 2, 3. Suppose however, the operator by control of a switch under the table can put the wheel into one of two conditions A, B in each of which the probability of 1, 2, 3 are different. If the wheel is in condition A, the probability of 1 is p1A = 0.3 say, of 2 is p2A = 0.6, of 3 is p3A =0.1. But if the wheel is in condition B, these probabilities are

p1B = 0.1, p2B = 0.4, p3B = 0.5

say as in the table.

 Cond. A Cond. B 1 0.3 0.1 2 0.6 0.4 3 0.1 0.5

We, of course, use the table in this way: suppose the operator puts the wheel into condition A 7/10 of the time and into B the other 3/10 of the time at random. (That is the probability of condition A, pA = 0.7, and of B, pB = 0.3.) Then the probability of getting 1 is

Prob. 1 = 0.7 (0.3) + 0.3 (0.1) = 0.24,

etc.

[...]

Now, however, suppose that some of the conditional probabilities are negative, suppose the table reads so that, as we shall say, if the system is in condition B the probability of getting 1 is -0.4. This sounds absurd but we must say it this way if we wish that our way of thought and language be precisely the same whether the actual quantities pi α in our calculations are positive or negative. That is the essence of the mathematical use of negative numbers—to permit an efficiency in reasoning so that various cases can be considered together by the same line of reasoning, being assured that intermediary steps which are not readily interpreted (like -5 apples) will not lead to absurd results. Let us see what p1B = -0.4 “means” by seeing how we calculate with it.

He gives an example showing how meaningful end results can sometimes arise even if the conditional probabilities like p1B are negative or greater than 1.

It is not my intention here to contend that the final probability of a verifiable physical event can be negative. On the other hand, conditional probabilities and probabilities of imagined intermediary states may be negative in a calculation of probabilities of physical events or states. If a physical theory for calculating probabilities yields a negative probability for a given situation under certain assumed conditions, we need not conclude the theory is incorrect. Two other possibilities of interpretation exist. One is that the conditions (for example, initial conditions) may not be capable of being realized in the physical world. The other possibility is that the situation for which the probability appears to be negative is not one that can be verified directly. A combination of these two, limitation of verifiability and freedom in initial conditions, may also be a solution to the apparent difficulty.

The rest of this paper illustrates these points with a number of examples drawn from physics which are less artificial than our roulette wheel. Since the result must ultimately have a positive probability, the question may be asked, why not rearrange the calculation so that the probabilities are positive in all the intermediate states? The same question might be asked of an accountant who subtracts the total disbursements before adding the total receipts. He stands a chance of going through an intermediary negative sum. Why not rearrange the calculation? Why bother? There is nothing mathematically wrong with this method of calculating and it frees the mind to think clearly and simply in a situation otherwise quite complicated. An analysis in terms of various states or conditions may simplify a calculation at the expense of requiring negative probabilities for these states. It is not really much expense.

Our first physical example is one in which one· usually uses negative probabilities without noticing it. It is not a very profound example and is practically the same in content as our previous example. A particle diffusing in one dimension in a rod has a probability of being at $x$ at time $t$ of $P(x,t)$ satisfying

$\partial P(x,t)/\partial t = -\partial^2 P(x,t)/\partial x^2$

Suppose at $x =0$ and $x =\pi$ the rod has absorbers at both ends so that $P(x,t) = 0$ there. Let the probability of being at $x$ at $t = 0$ be given as $P(x,0) =f(x).$ What is $P(x,t)$ thereafter? It is

$\displaystyle{ P(x,t) = \sum_{n=1}^\infty P_n \; \sin x \;\exp(-n^2 t) }$

where $P_n$ is given by

$x \displaystyle{ f(x) = \sum_{n = 1}^\infty P_n \; \sin n x }$

or

$x \displaystyle{ P_n = \frac{2}{\pi} \int_0^\pi f(x) \sin nx \; dx }$

The easiest way of analyzing this (and the way used if $P(x,t)$ is a temperature, for example) is to say that there are certain distributions that behave in an especially simple way. If $f(x)$ starts as $\sin nx$ it will remain that shape, simply decreasing with time, as $e^{-n^2 t}$ Any distribution $f(x)$ can be thought of as a superposition of such sine waves. But $f(x)$ cannot be $\sin nx$ if $f(x)$ is a probability and probabilities must always be positive. Yet the analysis is so simple this way that no one has really objected for long.

He also gives examples from quantum mechanics, but the interesting thing about the examples above is that they’re purely classical—and the second one, at least, is something physicists are quite used to.

Sometimes it’s good to temporarily put aside making sense of ideas and just see if you can develop rules to consistently work with them. For example: the square root of -1. People had to get good at using it before they understood what it really was: a rotation by a quarter turn in the plane.

Along those, lines, here’s an interesting attempt to work with negative probabilities:

• Gábor J. Székely, Half of a coin: negative probabilities, Wilmott Magazine (July 2005), 66–68.

He uses rigorous mathematics to study something that sounds absurd: ‘half a coin’. Suppose you make a bet with an ordinary fair coin, where you get 1 dollar if it comes up heads and 0 dollars if it comes up tails. Next, suppose you want this bet to be the same as making two bets involving two separate ‘half coins’. Then you can do it if a half coin has infinitely many sides numbered 0,1,2,3, etc., and you win $n$ dollars when side number $n$ comes up….

… and if the probability of side $n$ coming up obeys a special formula…

and if this probability can be negative sometimes!

This seems very bizarre, but the math is solid, even if the problem of interpreting it may drive you insane.

Let’s see how it works. Consider a game $G$ where the probability of winning $n = 0, 1, 2, \dots$ dollars is $g(n).$ Then we can summarize this game using a generating function:

$\displaystyle{ G(z) = \sum_{n = 0}^\infty g(n) , z^n }$

Now suppose you play two independent games like this, $G$ and another one, say $H,$ with generating function

$\displaystyle{ H(z) = \sum_{n = 0}^\infty h(n) , z^n }$

Then there’s a new game $GH$ that consists of playing both games. The reason I’m writing it as $GH$ is that its generating function is the product

$\displaystyle{ G(z) H(z) = \sum_{m,n = 0}^\infty g(m) h(n) z^{m+n} }$

See why? With probability $g(m) h(n)$ you win $m$ dollars in game $G$ and $n$ dollars in game $H,$ for a total of $m + n$ dollars.

The game where you flip a fair coin and win 1 dollar if it lands heads up and 0 dollars if lands tails up has generating function

$\displaystyle{ G(z) = \frac{1}{2}(1 + z) }$

The half-coin is an imaginary game $H$ such that playing two copies of this game is the same as playing the game $G.$ If such a game really existed, we would have

$G(z) = H(z)^2$

so

$\displaystyle{ H(z) = \sqrt{\frac{1}{2}(1 + z)} }$

However, if you work out the Taylor series of this function, every even term is negative except for the zeroth term. So, this game can exist only if we allow negative probabilities.

(Experts on generating functions and combinatorics will enjoy how the coefficients of the Taylor series of $H(z)$ involves the Catalan numbers.)

By the way, it’s worth remembering that for a long time mathematicians believed that negative numbers made no sense. As late as 1758 the British mathematician Francis Maseres claimed that negative numbers

… darken the very whole doctrines of the equations and make dark of the things which are in their nature excessively obvious and simple.

So opinions on these things can change. And since I’ve spent a lot of time working on ‘sets with fractional cardinality’, and have made lots of progress on that idea, and other strange ideas, I like to spend a little time now and then investigating other nonsensical-sounding generalizations of familiar concepts.﻿

This paper by Mark Burgin has a nice collection of references on negative probability:

• Mark Burgin, Interpretations of negative probability.

He valiantly tries to provide a frequentist interpretation of negative probabilities. He needs ‘negative events’ to get negative frequencies of events occurring, and he gives this example:

To better understand how negative elementary events appear and how negative probability emerges, consider the following example. Let us consider the situation when an attentive person A with the high knowledge of English writes some text T. We may ask what the probability is for the word “texxt” or “wrod” to appear in his text T. Conventional probability theory gives 0 as the answer. However, we all know that there are usually misprints. So, due to such a misprint this word may appear but then it would be corrected. In terms of extended probability, a negative value (say, -0.1) of the probability for the word “texxt” to appear in his text T means that this word may appear due to a misprint but then it’ll be corrected and will not be present in the text T.

Maybe he’s saying that the misprint occurs with probability 0.1 and then it ‘de-occurs’ with the same probability, giving a total probability of

$0.1 - 0.1 = 0$

I’m not sure.

Here’s another paper on the subject:

• Espen Gaarder Haug, Why so negative to negative probabilities?, Wilmott Magazine.

It certainly gets points for a nice title! However, like Burgin’s paper, I find it a lot less clear than what Feynman wrote.

Notice that like Székely’s paper, Haug’s originally appeared in the Wilmott Magazine. I hadn’t heard of that, but it’s about finance. So it seems that the bankers, having invented negative numbers to get us into debt, are now struggling to invent negative probabilities! In fact Haug’s article tries some applications of negative probabilities to finance.

Scary.

For further discussion, with some nice remarks by the quantum physics experts Matt Leifer and Michael Nielsen, see the comments on my Google+ post on this topic. Matt Leifer casts cold water on the idea of using negative probabilities in quantum theory. On the other hand, Michael Nielsen points out some interesting features of the Wigner quasiprobability distribution, which is the best possible attempt to assign a probability density for a quantum particle to have any given position and momentum. It can be negative! But if you integrate it over all momenta, you get the probability density for the particle having any given position:

$|\psi(x)|^2$

And if you integrate it over all positions, you get the probability density for the particle having any given momentum:

$|\widehat{\psi}(p)|^2$