## Logic, Probability and Reflection

26 December, 2013

Last week I attended the Machine Intelligence Research Institute’s sixth Workshop on Logic, Probability, and Reflection. This one was in Berkeley, where the institute has its headquarters.

You may know this institute under their previous name: the Singularity Institute. It seems to be the brainchild of Eliezer Yudkowsky, a well-known advocate of ‘friendly artificial intelligence’, whom I interviewed in week311, week312 and week313 of This Week’s Finds. He takes an approach to artificial intelligence that’s heavily influenced by mathematical logic, and I got invited to the workshop because I blogged about a paper he wrote with Mihaly Barasz, Paul Christiano and Marcello Herresho ff on probability theory and logic.

I only have the energy to lay the groundwork for a good explanation of what happened in the workshop. So, after you read my post, please read this:

• Benja Fallenstein, Results from MIRI’s December workshop, Less Wrong, 28 December 2013.

The workshop had two main themes, so let me tell you what they were.

### Scientific induction in mathematics

The first theme is related to that paper I just mentioned. How should a rational agent assign probabilities to statements in mathematics? Of course an omniscient being could assign

probability 1 to every mathematical statement that’s provable,

probability 0 to every statement whose negation is provable,

and

to every statement that is neither provable nor disprovable.

But a real-world rational agent will never have time to check all proofs, so there will always be lots of statements it’s not sure about. Actual mathematicians always have conjectures, like the Twin Prime Conjecture, that we consider plausible even though nobody has proved them. And whenever we do research, we’re constantly estimating how likely it is for statements to be true, and changing our estimates as new evidence comes in. In other words, we use scientific induction in mathematics.

How could we automate this? Most of us don’t consciously assign numerical probabilities to mathematical statements. But maybe an AI mathematician should. If so, what rules should it follow?

It’s natural to try a version of Solomonoff induction, where our probability estimate, before any evidence comes in, favors statements that are simple. However, this runs up against problems. If you’re interested in learning more about this, try:

• Jeremy Hahn, Scientific induction in probabilistic mathematics.

It’s a summary of ideas people came up with during the workshop. I would like to explain them sometime, but for now I should move on.

### The Löbian obstacle

The second main theme was the ‘Löbian obstacle’. Löb’s theorem is the flip side of Gödel’s first incompleteness theorem, less famous but just as shocking. It seems to put limitations on how much a perfectly rational being can trust itself.

Since it’s the day after Christmas, let’s ease our way into these deep waters with the Santa Claus paradox, also known as Curry’s paradox.

If you have a child who is worried that Santa Claus might not exist, you can reassure them using this sentence:

If this sentence is true, Santa Claus exists.

Call it P, for short.

Assume, for the sake of argument, that P is true. Then what it says is true: “If P is true, Santa Claus exists.” And we’re assuming P is true. So, Santa Claus exists.

So, we’ve proved that if P is true, Santa Claus exists.

But that’s just what P says!

So, P is true.

So, Santa Claus exists!

There must be something wrong about this argument, even if Santa Claus does exist, because if it were valid you could you use it to prove anything at all. The self-reference is obviously suspicious. The sentence in question is a variant of the Liar Paradox:

This sentence is false.

since we can rewrite the Liar Paradox as

If this sentence is true, 0 = 1.

and then replace “0=1″ by any false statement you like.

However, Gödel figured out a way to squeeze solid insights from these dubious self-referential sentences. He did this by creating a statement in the language of arithmetic, referring to nothing but numbers, which nonetheless manages to effectively say

This sentence is unprovable.

If it were provable, you’d get a contradiction! So, either arithmetic is inconsistent or this sentence is unprovable. But if it’s unprovable, it’s true. So, there are true but unprovable statements in arithmetic… unless arithmetic is inconsistent! This discovery shook the world of mathematics.

Here I’m being quite sloppy, just to get the idea across.

For one thing, when I’m saying ‘provable’, I mean provable given some specific axioms for arithmetic, like the Peano axioms. If we change our axioms, different statements will be provable.

For another, the concept of ‘true’ statements in arithmetic is often shunned by logicians. That may sound shocking, but there are many reasons for this: for example, Tarski showed that the truth of statements about arithmetic is undefinable in arithmetic. ‘Provability’ is much easier to deal with.

So, a better way of thinking about Gödel’s result is that he constructed a statement that is neither provable nor disprovable from Peano’s axioms of arithmetic, unless those axioms are inconsistent (in which case we can prove everything, but it’s all worthless).

Furthermore, this result applies not just to Peano’s axioms but to any stronger set of axioms, as long as you can write a computer program to list those axioms.

In 1952, the logician Leon Henkin flipped Gödel’s idea around and asked about a sentence in the language of arithmetic that says:

This sentence is provable.

He asked: is this provable or not? The answer is much less obvious than for Gödel’s sentence. Play around with it and see what I mean.

But in 1954, Martin Hugo Löb showed that Henkin’s sentence is provable!

And Henkin noticed something amazing: Löb’s proof shows much more.

At this point it pays to become a bit more precise. Let us write $\mathrm{PA} \vdash P$ to mean the statement $P$ is provable from the Peano axioms of arithmetic. Gödel figured out how to encode statements in arithmetic as numbers, so let’s write $\# P$ for the Gödel number of any statement $P.$ And Gödel figured out how to write a statement in arithmetic, say

$\mathrm{Provable}(n)$

which says that the statement with Gödel number $n$ is provable using the Peano axioms.

Using this terminology, what Henkin originally did was find a number $n$ such that the sentence

$\mathrm{Provable}(n)$

has Gödel number $n.$ So, this sentence says

This sentence is provable from the Peano axioms of arithmetic.

What Löb did was show

$\mathrm{PA} \vdash \mathrm{Provable}(n)$

In other words, he showed that Henkin sentence really is provable from the Peano axioms!

What Henkin then did is prove that for any sentence $P$ in the language of arithmetic, if

$\mathrm{PA} \vdash \mathrm{Provable}(\# P) \implies P$

then

$\mathrm{PA} \vdash P$

In other words, suppose we can prove that the provability of $P$ implies $P.$ Then we can prove $P$!

At first this merely sounds nightmarishly complicated. But if you think about it long enough, you’ll see it’s downright terrifying! For example, suppose $P$ is some famous open question in arithmetic, like the Twin Prime Conjecture. You might hope to prove

The provability of the Twin Prime Conjecture implies the Twin Prime Conjecture.

Indeed, that seems like a perfectly reasonable thing to want. But it turns out that proving this is as hard as proving the Twin Prime Conjecture! Why? Because if we can prove the boldface sentence above, Löb and Henkin’s work instantly gives us a proof of Twin Prime Conjecture!

What does all this have to do with artificial intelligence?

Well, what I just said is true not only for Peano arithmetic, but any set of axioms including Peano arithmetic that a computer program can list. Suppose your highly logical AI mathematician has some such set of axioms, say $\mathrm{AI}.$ You might want it to trust itself. In other words, you might want

$\mathrm{AI} \vdash \mathrm{Provable}(\# P) \implies P$

for every sentence $P.$ This says, roughly, that whatever the AI can prove it can prove, it can prove.

But then Löb’s theorem would kick in and give

$\mathrm{AI} \vdash P$

for every sentence $P.$ And this would be disastrous: our AI would be inconsistent, because it could prove everything!

This is just the beginning of the problems. It gets more involved when we consider AI’s that spawn new AI’s and want to trust them. For more see:

• Eliezer Yudkowsky and Marcello Herreshoff, Tiling agents for self-modifying AI, and the Löbian obstacle.

At workshop various people made progress on this issue, which is recorded in these summaries:

• Eliezer Yudkowsky, The procrastination paradox.

Abstract. A theorem by Marcello Herresho , Benja Fallenstein, and Stuart Armstrong shows that if there exists an infinite series of theories $T_i$ extending $\mathrm{PA}$ where each $T_i$ proves the soundness of $T_{i+1}$, then all the $T_i$ must have only nonstandard models. We call this the Procrastination Theorem for reasons which will become apparent.

Here Fallenstein constructs a di fferent sequence of theories $T_i$ extending Peano arithmetic such that each $T_i$ proves the consistency of $T_{i+1},$ and all the theories are sound for $\Pi_1$ sentences—that is, sentences with only one $\forall$ quantifier outside the rest of the stuff.

The following summaries would take more work to explain:

• Nate Soares, Fallenstein’s monster.

• Nisan Stiennon, Recursively-defined logical theories are well-defined.

• Benja Fallenstein, The 5-and-10 problem and the tiling agents formalism.

## Quantum Network Theory (Part 2)

13 August, 2013

guest post by Tomi Johnson

Last time I told you how a random walk called the ‘uniform escape walk’ could be used to analyze a network. In particular, Google uses it to rank nodes. For the case of an undirected network, the steady state of this random walk tells us the degrees of the nodes—that is, how many edges come out of each node.

Now I’m going to prove this to you. I’ll also exploit the connection between this random walk and a quantum walk, also introduced last time. In particular, I’ll connect the properties of this quantum walk to the degrees of a network by exploiting its relationship with the random walk.

This is pretty useful, considering how tricky these quantum walks can be. As the parts of the world that we model using quantum mechanics get bigger and have more complicated structures, like biological network, we need all the help in understanding quantum walks that we can get. So I’d better start!

### Flashback

Starting with any (simple, connected) graph, we can get an old-fashioned ‘stochastic’ random walk on this graph, but also a quantum walk. The first is the uniform escape stochastic walk, where the walker has an equal probability per time of walking along any edge leaving the node they are standing at. The second is the related quantum walk we’re going to study now. These two walks are generated by two matrices, which we called $S$ and $Q.$ The good thing is that these matrices are similar, in the technical sense.

We studied this last time, and everything we learned is summarized here:

where:

$G$ is a simple graph that specifies

$A$ the adjacency matrix (the generator of a quantum walk) with elements $A_{i j}$ equal to unity if nodes $i$ and $j$ are connected, and zero otherwise ($A_{i i} = 0$), which subtracted from

$D$ the diagonal matrix of degrees $D_{i i} = \sum_j A_{i j}$ gives

$L = D - A$ the symmetric Laplacian (generator of stochastic and quantum walks), which when normalized by $D$ returns both

$S = L D^{-1}$ the generator of the uniform escape stochastic walk and

$Q = D^{-1/2} L D^{-1/2}$ the quantum walk generator to which it is similar!

Now I hope you remember where we are. Next I’ll talk you through the mathematics of the uniform escape stochastic walk $S$ and how it connects to the degrees of the nodes in the large-time limit. Then I’ll show you how this helps us solve aspects of the quantum walk generated by $Q.$

### Stochastic walk

The uniform escape stochastic walk generated by $S$ is popular because it has a really useful stationary state.

To recap from Part 20 of the network theory series, a stationary state of a stochastic walk is one that does not change in time. By the master equation

$\displaystyle{ \frac{d}{d t} \psi(t) = -S \psi(t)}$

the stationary state must be an eigenvector of $S$ with eigenvalue $0.$

A fantastic pair of theorems hold:

• There is always a unique (up to multiplication by a positive number) positive eigenvector $\pi$ of $S$ with eigenvalue $0.$ That is, there is a unique stationary state $\pi.$

• Regardless of the initial state $\psi(0),$ any solution of the master equation approaches this stationary state $\pi$ in the large-time limit:

$\displaystyle{ \lim_{t \rightarrow \infty} \psi(t) = \pi }$

To find this unique stationary state, consider the Laplacian $L,$ which is both infinitesimal stochastic and symmetric. Among other things, this means the rows of $L$ sum to zero:

$\displaystyle{ \sum_j L_{i j} = 0 }$

Thus, the ‘all ones’ vector $\mathbf{1}$ is an eigenvector of $L$ with zero eigenvalue:

$L \mathbf{1} = 0$

Inserting the identity $I = D^{-1} D$ into this equation we then find $D \mathbf{1}$ is a zero eigenvector of $S$:

$L \mathbf{1} = ( L D^{-1} ) ( D \mathbf{1} ) = S ( D \mathbf{1} ) = 0$

Therefore we just need to normalize this to get the large-time stationary state of the walk:

$\displaystyle{ \pi = \frac{D \mathbf{1}}{\sum_i D_{i i}} }$

If we write $i$ for the basis vector that is zero except at the ith node of our graph, and 1 at that node, the inner product $\langle i , \pi \rangle$ is large-time probability of finding a walker at that node. The equation above implies this is proportional to the degree $D_{i i}$ of node $i.$

We can check this for the following graph:

We find that $\pi$ is

$\displaystyle{ \left( \begin{matrix} 1/6 \\ 1/6 \\ 1/4 \\ 1/4 \\ 1/6 \end{matrix} \right) }$

which implies large-time probability $1/6$ for nodes $1,$ $2$ and $5,$ and $1/4$ for nodes $3$ and $4.$ Comparing this to the original graph, this exactly reflects the arrangement of degrees, as we knew it must.

Math works!

### The quantum walk

Next up is the quantum walk generated by $Q.$ Not a lot is known about quantum walks on networks of arbitrary geometry, but below we’ll see some analytical results are obtained by exploiting the similarity of $S$ and $Q.$

Where to start? Well, let’s start at the bottom, what quantum physicists call the ground state. In contrast to stochastic walks, for a quantum walk every eigenvector $\phi_k$ of $Q$ is a stationary state of the quantum walk. (In Puzzle 5, at the bottom of this page, I ask you to prove this). The stationary state $\phi_0$ is of particular interest physically and mathematically. Physically, since eigenvectors of the $Q$ correspond to states of well-defined energy equal to the associated eigenvalue, $\phi_0$ is the state of lowest energy, energy zero, hence the name ‘ground state’. (In Puzzle 3, I ask you to prove that all eigenvalues of $Q$ are non-negative, so zero really does correspond to the ground state.)

Mathematically, the relationship between eigenvectors implied by the similarity of $S$ and $Q$ means

$\phi_0 \propto D^{-1/2} \pi \propto D^{1/2} \mathbf{1}$

So in the ground state, the probability of our quantum walker being found at node $i$ is

$| \langle i , \phi_0 \rangle |^2 \propto | \langle i , D^{1/2} \rangle \mathbf{1} |^2 = D_{i i}$

Amazingly, this probability is proportional to the degree and so is exactly the same as $\langle i , \pi \rangle,$ the probability in the stationary state $\pi$ of the stochastic walk!

In short: a zero energy quantum walk $Q$ leads to exactly the same distribution of the walker over the nodes as in the large-time limit of the uniform escape stochastic walk $S.$ The classically important notion of degree distribution also plays a role in quantum walks!

This is already pretty exciting. What else can we say? If you are someone who feels faint at the sight of quantum mechanics, well done for getting this far, but watch out for what’s coming next.

What if the walker starts in some other initial state? Is there some quantum walk analogue of the unique large-time state of a stochastic walk?

In fact, the quantum walk in general does not converge to a stationary state. But there is a probability distribution that can be thought to characterize the quantum walk in the same way as the large-time state characterizes the stochastic walk. It’s the large-time average probability vector $P.$

If you didn’t know the time that had passed since the beginning of a quantum walk, then the best estimate for the probability of your measuring the walker to be at node $i$ would be the large-time average probability

$\displaystyle{ \langle i , P \rangle = \lim_{T \rightarrow \infty} \frac{1}{T} \int_0^T | \psi_i (t) |^2 d t }$

There’s a bit that we can do to simplify this expression. As usual in quantum mechanics, let’s start with the trick of diagonalizing $Q.$ This amounts to writing

$\displaystyle{ Q= \sum_k \epsilon_k \Phi_k }$

where $\Phi_k$ are projectors onto the eigenvectors $\phi_k$ of $Q,$ and $\epsilon_k$ are the corresponding eigenvalues of $Q.$ If we insert this equation into

$\psi(t) = e^{-Q t} \psi(0)$

we get

$\displaystyle{ \psi(t) = \sum_k e^{-\epsilon_k t} \Phi_k \psi(0) }$

and thus

$\displaystyle{ \langle i , P \rangle = \lim_{T \rightarrow \infty} \frac{1}{T} \int_0^T | \sum_k e^{-i \epsilon_k t} \langle i, \Phi_k \psi (0) \rangle |^2 d t }$

Due to the integral over all time, the interference between terms corresponding to different eigenvalues averages to zero, leaving:

$\displaystyle{ \langle i , P \rangle = \sum_k | \langle i, \Phi_k \psi(0) \rangle |^2 }$

The large-time average probability is then the sum of terms contributed by the projections of the initial state onto each eigenspace.

So we have a distribution that characterizes a quantum walk for a general initial state, but it’s a complicated beast. What can we say about it?

Our best hope of understanding the large-time average probability is through the term $| \langle i, \Phi_0 \psi (0) \rangle |^2$ associated with the zero energy eigenspace, since we know everything about this space.

For example, we know the zero energy eigenspace is one-dimensional and spanned by the eigenvector $\phi_0.$ This means that the projector is just the usual outer product

$\Phi_0 = | \phi_0 \rangle \langle \phi_0 | = \phi_0 \phi_0^\dagger$

where we have normalized $\phi_0$ according to the inner product $\langle \phi_0, \phi_0\rangle = 1.$ (If you’re wondering why I’m using all these angled brackets, well, they’re a notation named after Dirac that is adored by quantum physicists.)

The zero eigenspace contribution to the large-time average probability then breaks nicely into two:

$\begin{array}{ccl} | \langle i, \Phi_0 \psi (0) \rangle |^2 &=& | \langle i, \phi_0\rangle \; \langle \phi_0, \psi (0) \rangle |^2 \\ \\ &=& | \langle i, \phi_0\rangle |^2 \; | \langle \phi_0 , \psi (0) \rangle |^2 \\ \\ &=& \langle i , \pi \rangle \; | \langle \phi_0 , \psi (0) \rangle |^2 \end{array}$

This is just the product of two probabilities:

• first, the probability $\langle i , \pi \rangle$ for a quantum state in the zero energy eigenspace to be at node $i,$ as we found above,

and

• second, the probability $| \langle \phi_0, \psi (0)\rangle |^2$ of being in this eigenspace to begin with. (Remember, in quantum mechanics the probability of measuring the system to have an energy is the modulus squared of the projection of the state onto the associated eigenspace, which for the one-dimensional zero energy eigenspace means just the inner product with the ground state.)

This is all we need to say something interesting about the large-time average probability for all states. We’ve basically shown that we can break the large-time probability vector $P$ into a sum of two normalized probability vectors:

$P = (1- \eta) \pi + \eta \Omega$

the first $\pi$ being the stochastic stationary state associated with the zero energy eigenspace, and the second $\Omega$ associated with the higher energy eigenspaces, with

$\displaystyle{ \langle i , \Omega \rangle = \frac{ \sum_{k\neq 0} | \langle i, \Phi_k \psi (0) \rangle |^2 }{ \eta} }$

The weight of each term is governed by the parameter

$\eta = 1 - | \langle \phi_0, \psi (0)\rangle |^2$

which you could think of as the quantumness of the result. This is one minus the probability of the walker being in the zero energy eigenspace, or equivalently the probability of the walker being outside the zero energy eigenspace.

So even if we don’t know $\Omega,$ we know its importance is controlled by a parameter $\eta$ that governs how close the large-time average distribution $P$ of the quantum walk is to the corresponding stochastic stationary distribution $\pi.$

What do we mean by ‘close’? Find out for yourself:

Puzzle 1. Show, using a triangle inequality, that the trace distance between the two characteristic stochastic and quantum distributions $\{ \langle i , P \rangle \}_i$ and $\{ \langle i , \pi \rangle \}_i$ is upper-bounded by $2 \eta.$

Can we say anything physical about when the quantumness $\eta$ is big or small?

Because the eigenvalues of $Q$ have a physical interpretation in terms of energy, the answer is yes. The quantumness $\eta$ is the probability of being outside the zero energy state. Call the next lowest eigenvalue $\Delta = \min_{k \neq 0} \epsilon_k$ the energy gap. If the quantum walk is not in the zero energy eigenspace then it must be in an eigenspace of energy greater or equal to $\Delta.$ Therefore the expected energy $E$ of the quantum walker must bound the quantumness $E \ge \eta \Delta.$

This tells us that a quantum walk with a low energy is similar to a stochastic walk in the large-time limit. We already knew this was exactly true in the zero energy limit, but this result goes further.

So little is known about quantum walks on networks of arbitrary geometry that we were very pleased to find this result. It says there is a special case in which the walk is characterized by the degree distribution of the network, and a clear physical parameter that bounds how far the walk is from this special case.

Also, in finding it we learned that the difficulties of the initial state dependence, enhanced by the lack of convergence to a stationary state, could be overcome for a quantum walk, and that the relationships between quantum and stochastic walks extend beyond those with shared generators.

### What next?

That’s all for the latest bit of idea sharing at the interface between stochastic and quantum systems.

Other questions we have include: What holds analytically about the form of the quantum correction? Numerically it is known that the so-called quantum correction $\Omega$ tends to enhance the probability of being found on nodes of low degree compared to $\pi.$ Can someone explain why? What happens if a small amount of stochastic noise is added to a quantum walk? Or a lot of noise?

It’s difficult to know who is best placed to answer these questions: experts in quantum physics, graph theory, complex networks or stochastic processes? I suspect it’ll take a bit of help from everyone.

A couple of textbooks with comprehensive sections on non-negative matrices and continuous-time stochastic processes are:

• Peter Lancaster and Miron Tismenetsky, The Theory of Matrices: with Applications, 2nd edition, Academic Press, San Diego, 1985.

• James R. Norris, Markov Chains, Cambridge University Press, Cambridge, 1997.

There is, of course, the book that arose from the Azimuth network theory series, which considers several relationships between quantum and stochastic processes on networks:

• John Baez and Jacob Biamonte, A Course on Quantum Techniques for Stochastic Mechanics, 2012.

Another couple of books on complex networks are:

• Mark Newman, Networks: An Introduction, Oxford University Press, Oxford, 2010.

• Ernesto Estrada, The Structure of Complex Networks: Theory and Applications, Oxford University Press, Oxford, 2011. Note that the first chapter is available free online.

There are plenty more useful references in our article on this topic:

• Mauro Faccin, Tomi Johnson, Jacob Biamonte, Sabre Kais and Piotr Migdał, Degree distribution in quantum walks on complex networks.

### Puzzles for the enthusiastic

Sadly I didn’t have space to show proofs of all the theorems I used. So here are a few puzzles that guide you to doing the proofs for yourself:

#### Stochastic walks and stationary states

Puzzle 2. (For the hard core.) Prove there is always a unique positive eigenvector for a stochastic walk generated by $S.$ You’ll need the assumption that the graph $G$ is connected. It’s not simple, and you’ll probably need help from a book, perhaps one of those above by Lancaster and Tismenetsky, and Norris.

Puzzle 3. Show that the eigenvalues of $S$ (and therefore $Q$) are non-negative. A good way to start this proof is to apply the Perron-Frobenius theorem to the non-negative matrix $M = - S + I \max_i S_{i i}.$ This implies that $M$ has a positive eigenvalue $r$ equal to its spectral radius

$r = \max_k | \lambda_k |$

where $\lambda_k$ are the eigenvalues of $M,$ and the associated eigenvector $v$ is positive. Since $S = - M + I \max_i S_{i i},$ it follows that $S$ shares the eigenvectors of $M$ and the associated eigenvalues are related by inverted translation:

$\epsilon_k = - \lambda_k + \max_i S_{i i}$

Puzzle 4. Prove that regardless of the initial state $\psi(0),$ the zero eigenvector $\pi$ is obtained in the large-time limit $\lim_{t \rightarrow \infty} \psi(t) = \pi$ of the walk generated by $S.$ This breaks down into two parts:

(a) Using the approach from Puzzle 5, to show that $S v = \epsilon_v v,$ the positivity of $v$ and the infinitesimal stochastic property $\sum_i S_{i j} = 0$ imply that $\epsilon_v = \epsilon_0 = 0$ and thus $v = \pi$ is actually the unique zero eigenvector and stationary state of $S$ (its uniqueness follows from puzzle 4, you don’t need to re-prove it).

(b) By inserting the decomposition $S = \sum_k \epsilon_k \Pi_k$ into $e^{-S t}$ and using the result of puzzle 5, complete the proof.

(Though I ask you to use the diagonalizability of $S,$ the main results still hold if the generator is irreducible but not diagonalizable.)

#### Quantum walks

Here are a couple of extra puzzles for those of you interested in quantum mechanics:

Puzzle 5. In quantum mechanics, probabilities are given by the moduli squared of amplitudes, so multiplying a state by a number of modulus unity has no physical effect. By inserting

$\displaystyle{ Q= \sum_k \epsilon_k \Phi_k }$

into the quantum time evolution matrix $e^{-Q t},$ show that if

$\psi(0) = \phi_k$

then

$\psi(t) = e^{ - i \epsilon_k t} \psi(0)$

hence $\phi_k$ is a stationary state in the quantum sense, as probabilities don’t change in time.

Puzzle 6. By expanding the initial state $\psi(0)$ in terms of the complete orthogonal basis vectors $\phi_k$ show that for a quantum walk $\psi(t)$ never converges to a stationary state unless it began in one.

## Monte Carlo Methods in Climate Science

23 July, 2013

joint with David Tweed

One way the Azimuth Project can help save the planet is to get bright young students interested in ecology, climate science, green technology, and stuff like that. So, we are writing an article for Math Horizons, an American magazine for undergraduate math majors. This blog article is a draft of that. You can also see it in PDF form here.

We’d really like to hear your comments! There are severe limits on including more detail, since the article should be easy to read and short. So please don’t ask us to explain more stuff: we’re most interested to know if you sincerely don’t understand something, or feel that students would have trouble understanding something. For comparison, you can see sample Math Horizons articles here.

### Introduction

They look placid lapping against the beach on a calm day, but the oceans are actually quite dynamic. The ocean currents act as ‘conveyor belts’, transporting heat both vertically between the water’s surface and the depths and laterally from one area of the globe to another. This effect is so significant that the temperature and precipitation patterns can change dramatically when currents do.

For example: shortly after the last ice age, northern Europe experienced a shocking change in climate from 10,800 to 9,500 BC. At the start of this period temperatures plummeted in a matter of decades. It became 7° Celsius colder, and glaciers started forming in England! The cold spell lasted for over a thousand years, but it ended as suddenly as it had begun.

Why? The most popular theory is that that a huge lake in North America formed by melting glaciers burst its bank—and in a massive torrent lasting for years, the water from this lake rushed out to the northern Atlantic ocean. By floating atop the denser salt water, this fresh water blocked a major current: the Atlantic Meridional Overturning Circulation. This current brings warm water north and helps keep northern Europe warm. So, when iit shut down, northern Europe was plunged into a deep freeze.

Right now global warming is causing ice sheets in Greenland to melt and release fresh water into the North Atlantic. Could this shut down the Atlantic Meridional Overturning Circulation and make the climate of Northern Europe much colder? In 2010, Keller and Urban [KU] tackled this question using a simple climate model, historical data, probability theory, and lots of computing power. Their goal was to understand the spectrum of possible futures compatible with what we know today.

Let us look at some of the ideas underlying their work.

### Box models

The earth’s physical behaviour, including the climate is far too complex to simulate from the bottom up using basic physical principles, at least for now. The most detailed models today can take days to run on very powerful computers. So to make reasonable predictions on a laptop in a tractable time-frame, geophysical modellers use some tricks.

First, it is possible to split geophysical phenomena into ‘boxes’ containing strongly related things. For example: atmospheric gases, particulate levels and clouds all affect each other strongly; likewise the heat content, currents and salinity of the oceans all interact strongly. However, the interactions between the atmosphere and the oceans are weaker, and we can approximately describe them using just a few settings, such as the amount of atmospheric CO2 entering or leaving the oceans. Clearly these interactions must be consistent—for example, the amount of CO2 leaving the atmosphere box must equal the amount entering the ocean box—but breaking a complicated system into parts lets different specialists focus on different aspects; then we can combine these parts and get an approximate model of entire planet. The box model used by Keller and Urban is shown in Figure 1.

1. The box model used by Keller and Urban.

Second, it turn out that simple but effective box models can be distilled from the complicated physics in terms of forcings and feedbacks. Essentially a forcing is a measured input to the system, such as solar radiation or CO2 released by burning fossil fuels. As an analogy, consider a child on a swing: the adult’s push every so often is a forcing. Similarly a feedback describes how the current ‘box variables’ influence future ones. In the swing analogy, one feedback is how the velocity will influence the future height. Specifying feedbacks typically uses knowledge of the detailed low-level physics to derive simple, tractable functional relationships between groups of large-scale observables, a bit like how we derive the physics of a gas by thinking about collisions of lots of particles.

However, it is often not feasible to get actual settings for the parameters in our model starting from first principles. In other words, often we can get the general form of the equations in our model, but they contain a lot of constants that we can estimate only by looking at historical data.

### Probability modeling

Suppose we have a box model that depends on some settings $S.$ For example, in Keller and Urban’s model, $S$ is a list of 18 numbers. To keep things simple, suppose the settings are element of some finite set. Suppose we also have huge hard disc full of historical measurements, and we want to use this to find the best estimate of $S.$ Because our data is full of ‘noise’ from other, unmodeled phenomena we generally cannot unambiguously deduce a single set of settings. Instead we have to look at things in terms of probabilities. More precisely, we need to study the probability that $S$ take some value $s$ given that the measurements take some value. Let’s call the measurements $M$, and again let’s keep things simple by saying $M$ takes values in some finite set of possible measurements.

The probability that $S = s$ given that $M$ takes some value $m$ is called the conditional probability $P(S=s | M=m).$ How can we compute this conditional probability? This is a somewhat tricky problem.

One thing we can more easily do is repeatedly run our model with randomly chosen settings and see what measurements it predicts. By doing this, we can compute the probability that given setting values $S = s,$ the model predicts measurements $M=m.$ This again is a conditional probability, but now it is called $P(M=m|S=s).$

This is not what we want: it’s backwards! But here Bayes’ rule comes to the rescue, relating what we want to what we can more easily compute:

$\displaystyle{ P(S = s | M = m) = P(M = m| S = s) \frac{P(S = s)}{P(M = m)} }$

Here $P(S = s)$ is the probability that the settings take a specific value $s,$ and similarly for $P(M = m).$ Bayes’ rule is quite easy to prove, and it is actually a general rule that applies to any random variables, not just the settings and the measurements in our problem [Y]. It underpins most methods of figuring out hidden quantities from observed ones. For this reason, it is widely used in modern statistics and data analysis [K].

How does Bayes’ rule help us here? When we repeatedly run our model with randomly chosen settings, we have control over $P(S = s).$ As mentioned, we can compute $P(M=m| S=s).$ Finally, $P(M = m)$ is independent of our choice of settings. So, we can use Bayes’ rule to compute $P(S = s | M = m)$ up to a constant factor. And since probabilities must sum to 1, we can figure out this constant.

This lets us do many things. It lets us find the most likely values of the settings for our model, given our hard disc full of observed data. It also lets us find the probability that the settings lie within some set. This is important: if we’re facing the possibility of a climate disaster, we don’t just want to know the most likely outcome. We would like to know to know that with 95% probability, the outcome will lie in some range.

### An example

Let us look at an example much simpler than that considered by Keller and Urban. Suppose our measurements are real numbers $m_0,\dots, m_T$ related by

$m_{t+1} = s m_t - m_{t-1} + N_t$

Here $s,$ a real constant, is our ‘setting’, while $N_t$ is some ‘noise': an independent Gaussian random variable for each time $t,$ each with mean zero and some fixed standard deviation. Then the measurements $m_t$ will have roughly sinusoidal behavior but with irregularity added by the noise at each time step, as illustrated in Figure 2.

2. The example system: red are predicted measurements for a given value of the settings, green is another simulation for the same $s$ value and blue is a simulation for a slightly different $s.$

Note how there is no clear signal from either the curves or the differences that the green curve is at the correct setting value while the blue one has the wrong one: the noise makes it nontrivial to estimate $s.$ This is a baby version of the problem faced by Keller and Urban.

### Markov Chain Monte Carlo

Having glibly said that we can compute the conditional probability $P(M=m | S=s),$ how do we actually do this? The simplest way would be to run our model many, many times with the settings set at $S=s$ and determine the fraction of times it predicts measurements equal to $m.$ This gives us an estimate of $P(M=m | S=s).$ Then we can use Bayes’ rule to work out $P(M=m|S=s),$ at least up to a constant factor.

Doing all this by hand would be incredibly time consuming and error prone, so computers are used for this task. In our example, we do this in Figure 3. As we keep running our model over and over, the curve showing $P(M=m |S=s)$ as a function of $s$ settles down to the right answer.

3. The estimates of $P(M=m | S=s)$ as a function of $s$ using uniform sampling, ending up with 480 samples at each point.

However, this is computationally inefficient, as shown in the probability distribution for small numbers of samples. This has quite a few ‘kinks’, which only disappear later. The problem is that there are lots of possible choices of $s$ to try. And this is for a very simple model!

When dealing with the 18 settings involved in the model of Keller and Urban, trying every combination would take far too long. A way to avoid this is Markov Chain Monte Carlo sampling. Monte Carlo is famous for its casinos, so a ‘Monte Carlo’ algorithm is one that uses randomness. A ‘Markov chain’ is a random walk: for example, where you repeatedly flip a coin and take one step right when you get heads, and one step right when you get tails. So, in Markov Chain Monte Carlo, we perform a random walk through the collection of all possible settings, collecting samples.

The key to making this work is that at each step on the walk a proposed modification $s'$ to the current settings $s$ is generated randomly—but it may be rejected if it does not seem to improve the estimates. The essence of the rule is:

The modification $s \mapsto s'$ is randomly accepted with a probability equal to the ratio

$\displaystyle{ \frac{P(M=m | S=s')}{ P(M=m | S=s)} }$

Otherwise the walk stays at the current position.

If the modification is better, so that the ratio is greater than 1, the new state is always accepted. With some additional tricks—such as discarding the very beginning of the walk—this gives a set of samples from which can be used to compute $P(M=m | S=s).$ Then we can compute $P(S = s | M = m)$ using Bayes’ rule.

Figure 4 shows the results of using the Markov Chain Monte Carlo procedure to figure out $P(S= s| M= m)$ in our example.

4. The estimates of $P(S = s|M = m)$ curves using Markov Chain Monte Carlo, showing the current distribution estimate at increasing intervals. The red line shows the current position of the random walk. Again the kinks are almost gone in the final distribution.

Note that the final distribution has only peformed about 66 thousand simulations in total, while the full sampling peformed over 1.5 million. The key advantage of Markov Chain Monte Carlo is that it avoids performing many simulations in areas where the probability is low, as we can see from the way the walk path remains under the big peak in the probability density almost all the time. What is more impressive is that it achieves this without any global view of the probability density, just by looking at how $P(M=m | S=s)$ changes when we make small changes in the settings. This becomes even more important as we move to dealing with systems with many more dimensions and settings, where it proves very effective at finding regions of high probability density whatever their shape.

Why is it worth doing so much work to estimate the probability distribution for settings for a climate model? One reason is that we can then estimate probabilities of future events, such as the collapse of the Atlantic Meridional Ocean Current. And what’s the answer? According to Keller and Urban’s calculation, this current will likely weaken by about a fifth in the 21st century, but a complete collapse is unlikely before 2300. This claim needs to be checked in many ways—for example, using more detailed models. But the importance of the issue is clear, and we hope we have made the importance of good mathematical ideas for climate science clear as well.

### Exploring the topic

The Azimuth Project is a group of scientists, engineers and computer programmers interested in questions like this [A]. If you have questions, or want to help out, just email us. Versions of the computer programs we used in this paper will be made available here in a while.

Here are some projects you can try, perhaps with the help of Kruschke’s textbook [K]:

• There are other ways to do setting estimation using time series: compare some to MCMC in terms of accuracy and robustness.

• We’ve seen a 1-dimensional system with one setting. Simulate some multi-dimensional and multi-setting systems. What new issues arise?

Acknowledgements. We thank Nathan Urban and other
members of the Azimuth Project for many helpful discussions.

### References

[A] Azimuth Project, http://www.azimuthproject.org.

[KU] Klaus Keller and Nathan Urban, Probabilistic hindcasts and projections of the coupled climate, carbon cycle and Atlantic meridional overturning circulation system: a Bayesian fusion of century-scale measurements with a simple model, Tellus A 62 (2010), 737–750. Also available free online.

[K] John K. Kruschke, Doing Bayesian Data Analysis: A Tutorial with R and BUGS, Academic Press, New York, 2010.

[Y] Eliezer S. Yudkowsky, An intuitive explanation of Bayes’ theorem.

## Negative Probabilities

19 July, 2013

The physicists Dirac and Feynman, both bold when it came to new mathematical ideas, both said we should think about negative probabilities.

These days, Kolmogorov’s axioms for probabilities are used to justify formulating probability theory in terms of measure theory. Mathematically, the theory of measures that take negative or even complex values is well-developed. So, to the extent that probability theory is just measure theory, you can say a lot is known about negative probabilities.

But probability theory is not just measure theory; it adds its own distinctive ideas. To get these into the picture, we really need to ask some basic questions, like: what could it mean to say something had a negative chance of happening?

I really have no idea.

In this paper:

• Paul Dirac, The physical interpretation of quantum mechanics, Proc. Roy. Soc. London A 180 (1942), 1–39.

Dirac wrote:

Negative energies and probabilities should not be considered as nonsense. They are well-defined concepts mathematically, like a negative of money.

In fact, I think negative money could have been the origin of negative numbers. Venetian bankers started writing numbers in red to symbolize debts—hence the phrase ‘in the red’ for being in debt. So, you could say negative numbers were invented to formalize the idea of debt and make accounting easier. Bankers couldn’t really get rich if negative money didn’t exist.

A negative dollar is a dollar you owe someone. But how can you owe someone a probability?﻿ I haven’t figured this out.

Unsurprisingly, the clearest writing about negative probabilities that I’ve found is by Feynman:

• Richard P. Feynman, Negative probability, in Quantum Implications: Essays in Honour of David Bohm, eds. F. David Peat and Basil Hiley, Routledge & Kegan Paul Ltd, London, 1987, pp. 235–248.

He emphasizes that even if the final answer of a calculation must be positive, negative numbers are often allowed to appear in intermediate steps… and that this can happen with probabilities.

Let me quote some:

Some twenty years ago one problem we theoretical physicists had was that if we combined the principles of quantum mechanics and those of relativity plus certain tacit assumptions, we seemed only able to produce theories (the quantum field theories) which gave infinity for the answer to certain questions. These infinities are kept in abeyance (and now possibly eliminated altogether) by the awkward process of renormalization. In an attempt to understand all this better, and perhaps to make a theory which would give only finite answers from the start, I looked into the” tacit assumptions” to see if they could be altered.

One of the assumptions was that the probability for an event must always be a positive number. Trying to think of negative probabilities gave me cultural shock at first, but when I finally got easy with the concept I wrote myself a note so I wouldn’t forget my thoughts. I think that Prof. Bohm has just the combination of imagination and boldness to find them interesting and amusing. I am delighted to have this opportunity to publish them in such an appropriate place. I have taken the opportunity to add some further, more recent, thoughts about applications to two state systems.

Unfortunately I never did find out how to use the freedom of allowing probabilities to be negative to solve the original problem of infinities in quantum field theory!

It is usual to suppose that, since the probabilities of events must be positive, a theory which gives negative numbers for such quantities must be absurd. I should show here how negative probabilities might be interpreted. A negative number, say of apples, seems like an absurdity. A man starting a day with five apples who gives away ten and is given eight during the day has three left. I can calculate this in two steps: 5 -10 = -5 and -5 + 8 + 3. The final answer is satisfactorily positive and correct although in the intermediate steps of calculation negative numbers appear. In the real situation there must be special limitations of the time in which the various apples are received and given since he never really has a negative number, yet the use of negative numbers as an abstract calculation permits us freedom to do our mathematical calculations in any order simplifying the analysis enormously, and permitting us to disregard inessential details. The idea of negative numbers is an exceedingly fruitful mathematical invention. Today a person who balks at making a calculation in this way is considered backward or ignorant, or to have some kind of a mental block. It is the purpose of this paper to point out that we have a similar strong block against negative probabilities. By discussing a number of examples, I hope to show that they are entirely rational of course, and that their use simplifies calculation and thought in a number of applications in physics.

First let us consider a simple probability problem, and how we usually calculate things and then see what would happen if we allowed some of our normal probabilities in the calculations to be negative. Let us imagine a roulette wheel with, for simplicity, just three numbers: 1, 2, 3. Suppose however, the operator by control of a switch under the table can put the wheel into one of two conditions A, B in each of which the probability of 1, 2, 3 are different. If the wheel is in condition A, the probability of 1 is p1A = 0.3 say, of 2 is p2A = 0.6, of 3 is p3A =0.1. But if the wheel is in condition B, these probabilities are

p1B = 0.1, p2B = 0.4, p3B = 0.5

say as in the table.

 Cond. A Cond. B 1 0.3 0.1 2 0.6 0.4 3 0.1 0.5

We, of course, use the table in this way: suppose the operator puts the wheel into condition A 7/10 of the time and into B the other 3/10 of the time at random. (That is the probability of condition A, pA = 0.7, and of B, pB = 0.3.) Then the probability of getting 1 is

Prob. 1 = 0.7 (0.3) + 0.3 (0.1) = 0.24,

etc.

[…]

Now, however, suppose that some of the conditional probabilities are negative, suppose the table reads so that, as we shall say, if the system is in condition B the probability of getting 1 is -0.4. This sounds absurd but we must say it this way if we wish that our way of thought and language be precisely the same whether the actual quantities pi α in our calculations are positive or negative. That is the essence of the mathematical use of negative numbers—to permit an efficiency in reasoning so that various cases can be considered together by the same line of reasoning, being assured that intermediary steps which are not readily interpreted (like -5 apples) will not lead to absurd results. Let us see what p1B = -0.4 “means” by seeing how we calculate with it.

He gives an example showing how meaningful end results can sometimes arise even if the conditional probabilities like p1B are negative or greater than 1.

It is not my intention here to contend that the final probability of a verifiable physical event can be negative. On the other hand, conditional probabilities and probabilities of imagined intermediary states may be negative in a calculation of probabilities of physical events or states. If a physical theory for calculating probabilities yields a negative probability for a given situation under certain assumed conditions, we need not conclude the theory is incorrect. Two other possibilities of interpretation exist. One is that the conditions (for example, initial conditions) may not be capable of being realized in the physical world. The other possibility is that the situation for which the probability appears to be negative is not one that can be verified directly. A combination of these two, limitation of verifiability and freedom in initial conditions, may also be a solution to the apparent difficulty.

The rest of this paper illustrates these points with a number of examples drawn from physics which are less artificial than our roulette wheel. Since the result must ultimately have a positive probability, the question may be asked, why not rearrange the calculation so that the probabilities are positive in all the intermediate states? The same question might be asked of an accountant who subtracts the total disbursements before adding the total receipts. He stands a chance of going through an intermediary negative sum. Why not rearrange the calculation? Why bother? There is nothing mathematically wrong with this method of calculating and it frees the mind to think clearly and simply in a situation otherwise quite complicated. An analysis in terms of various states or conditions may simplify a calculation at the expense of requiring negative probabilities for these states. It is not really much expense.

Our first physical example is one in which one· usually uses negative probabilities without noticing it. It is not a very profound example and is practically the same in content as our previous example. A particle diffusing in one dimension in a rod has a probability of being at $x$ at time $t$ of $P(x,t)$ satisfying

$\partial P(x,t)/\partial t = -\partial^2 P(x,t)/\partial x^2$

Suppose at $x =0$ and $x =\pi$ the rod has absorbers at both ends so that $P(x,t) = 0$ there. Let the probability of being at $x$ at $t = 0$ be given as $P(x,0) =f(x).$ What is $P(x,t)$ thereafter? It is

$\displaystyle{ P(x,t) = \sum_{n=1}^\infty P_n \; \sin x \;\exp(-n^2 t) }$

where $P_n$ is given by

$x \displaystyle{ f(x) = \sum_{n = 1}^\infty P_n \; \sin n x }$

or

$x \displaystyle{ P_n = \frac{2}{\pi} \int_0^\pi f(x) \sin nx \; dx }$

The easiest way of analyzing this (and the way used if $P(x,t)$ is a temperature, for example) is to say that there are certain distributions that behave in an especially simple way. If $f(x)$ starts as $\sin nx$ it will remain that shape, simply decreasing with time, as $e^{-n^2 t}$ Any distribution $f(x)$ can be thought of as a superposition of such sine waves. But $f(x)$ cannot be $\sin nx$ if $f(x)$ is a probability and probabilities must always be positive. Yet the analysis is so simple this way that no one has really objected for long.

He also gives examples from quantum mechanics, but the interesting thing about the examples above is that they’re purely classical—and the second one, at least, is something physicists are quite used to.

Sometimes it’s good to temporarily put aside making sense of ideas and just see if you can develop rules to consistently work with them. For example: the square root of -1. People had to get good at using it before they understood what it really was: a rotation by a quarter turn in the plane.

Along those, lines, here’s an interesting attempt to work with negative probabilities:

• Gábor J. Székely, Half of a coin: negative probabilities, Wilmott Magazine (July 2005), 66–68.

He uses rigorous mathematics to study something that sounds absurd: ‘half a coin’. Suppose you make a bet with an ordinary fair coin, where you get 1 dollar if it comes up heads and 0 dollars if it comes up tails. Next, suppose you want this bet to be the same as making two bets involving two separate ‘half coins’. Then you can do it if a half coin has infinitely many sides numbered 0,1,2,3, etc., and you win $n$ dollars when side number $n$ comes up….

… and if the probability of side $n$ coming up obeys a special formula…

and if this probability can be negative sometimes!

This seems very bizarre, but the math is solid, even if the problem of interpreting it may drive you insane.

Let’s see how it works. Consider a game $G$ where the probability of winning $n = 0, 1, 2, \dots$ dollars is $g(n).$ Then we can summarize this game using a generating function:

$\displaystyle{ G(z) = \sum_{n = 0}^\infty g(n) , z^n }$

Now suppose you play two independent games like this, $G$ and another one, say $H,$ with generating function

$\displaystyle{ H(z) = \sum_{n = 0}^\infty h(n) , z^n }$

Then there’s a new game $GH$ that consists of playing both games. The reason I’m writing it as $GH$ is that its generating function is the product

$\displaystyle{ G(z) H(z) = \sum_{m,n = 0}^\infty g(m) h(n) z^{m+n} }$

See why? With probability $g(m) h(n)$ you win $m$ dollars in game $G$ and $n$ dollars in game $H,$ for a total of $m + n$ dollars.

The game where you flip a fair coin and win 1 dollar if it lands heads up and 0 dollars if lands tails up has generating function

$\displaystyle{ G(z) = \frac{1}{2}(1 + z) }$

The half-coin is an imaginary game $H$ such that playing two copies of this game is the same as playing the game $G.$ If such a game really existed, we would have

$G(z) = H(z)^2$

so

$\displaystyle{ H(z) = \sqrt{\frac{1}{2}(1 + z)} }$

However, if you work out the Taylor series of this function, every even term is negative except for the zeroth term. So, this game can exist only if we allow negative probabilities.

(Experts on generating functions and combinatorics will enjoy how the coefficients of the Taylor series of $H(z)$ involves the Catalan numbers.)

By the way, it’s worth remembering that for a long time mathematicians believed that negative numbers made no sense. As late as 1758 the British mathematician Francis Maseres claimed that negative numbers

… darken the very whole doctrines of the equations and make dark of the things which are in their nature excessively obvious and simple.

So opinions on these things can change. And since I’ve spent a lot of time working on ‘sets with fractional cardinality’, and have made lots of progress on that idea, and other strange ideas, I like to spend a little time now and then investigating other nonsensical-sounding generalizations of familiar concepts.﻿

This paper by Mark Burgin has a nice collection of references on negative probability:

• Mark Burgin, Interpretations of negative probability.

He valiantly tries to provide a frequentist interpretation of negative probabilities. He needs ‘negative events’ to get negative frequencies of events occurring, and he gives this example:

To better understand how negative elementary events appear and how negative probability emerges, consider the following example. Let us consider the situation when an attentive person A with the high knowledge of English writes some text T. We may ask what the probability is for the word “texxt” or “wrod” to appear in his text T. Conventional probability theory gives 0 as the answer. However, we all know that there are usually misprints. So, due to such a misprint this word may appear but then it would be corrected. In terms of extended probability, a negative value (say, -0.1) of the probability for the word “texxt” to appear in his text T means that this word may appear due to a misprint but then it’ll be corrected and will not be present in the text T.

Maybe he’s saying that the misprint occurs with probability 0.1 and then it ‘de-occurs’ with the same probability, giving a total probability of

$0.1 - 0.1 = 0$

I’m not sure.

Here’s another paper on the subject:

• Espen Gaarder Haug, Why so negative to negative probabilities?, Wilmott Magazine.

It certainly gets points for a nice title! However, like Burgin’s paper, I find it a lot less clear than what Feynman wrote.

Notice that like Székely’s paper, Haug’s originally appeared in the Wilmott Magazine. I hadn’t heard of that, but it’s about finance. So it seems that the bankers, having invented negative numbers to get us into debt, are now struggling to invent negative probabilities! In fact Haug’s article tries some applications of negative probabilities to finance.

Scary.

For further discussion, with some nice remarks by the quantum physics experts Matt Leifer and Michael Nielsen, see the comments on my Google+ post on this topic. Matt Leifer casts cold water on the idea of using negative probabilities in quantum theory. On the other hand, Michael Nielsen points out some interesting features of the Wigner quasiprobability distribution, which is the best possible attempt to assign a probability density for a quantum particle to have any given position and momentum. It can be negative! But if you integrate it over all momenta, you get the probability density for the particle having any given position:

$|\psi(x)|^2$

And if you integrate it over all positions, you get the probability density for the particle having any given momentum:

$|\widehat{\psi}(p)|^2$

## Coherence for Solutions of the Master Equation

10 July, 2013

guest post by Arjun Jain

I am a master’s student in the physics department of the Indian Institute of Technology Roorkee. I’m originally from Delhi. Since some time now, I’ve been wanting to go into Mathematical Physics. I hope to do a PhD in that. Apart from maths and physics, I am also quite passionate about art and music.

Right now I am visiting John Baez at the Centre for Quantum Technologies, and we’re working on chemical reaction networks. This post can be considered as an annotation to the last paragraph of John’s paper, Quantum Techniques for Reaction Networks, where he raises the question of when a solution to the master equation that starts as a coherent state will remain coherent for all times. Remember, the ‘master equation’ describes the random evolution of collections of classical particles, and a ‘coherent state’ is one where the probability distribution of particles of each type is a Poisson distribution.

If you’ve been following the network theory series on this blog, you’ll know these concepts, and you’ll know the Anderson-Craciun-Kurtz theorem gives many examples of coherent states that remain coherent. However, all these are equilibrium solutions of the master equation: they don’t change with time. Moreover they are complex balanced equilibria: the rate at which any complex is produced equals the rate at which it is consumed.

There are also non-equilibrium examples where coherent states remain coherent. But they seem rather rare, and I would like to explain why. So, I will give a necessary condition for it to happen. I’ll give the proof first, and then discuss some simple examples. We will see that while the condition is necessary, it is not sufficient.

First, recall the setup. If you’ve been following the network theory series, you can skip the next section.

### Reaction networks

Definition. A reaction network consists of:

• a finite set $S$ of species,

• a finite set $K$ of complexes, where a complex is a finite sum of species, or in other words, an element of $\mathbb{N}^S,$

• a graph with $K$ as its set of vertices and some set $T$ of edges.

You should have in mind something like this:

where our set of species is $S = \{A,B,C,D,E\},$ the complexes are things like $A + E,$ and the arrows are the elements of $T,$ called transitions or reactions. So, we have functions

$s , t : T \to K$

saying the source and target of each transition.

Next:

Definition. A stochastic reaction network is a reaction network together with a function $r: T \to (0,\infty)$ assigning a rate constant to each reaction.

From this we can write down the master equation, which describes how a stochastic state evolves in time:

$\displaystyle{ \frac{d}{dt} \Psi(t) = H \Psi(t) }$

Here $\Psi(t)$ is a vector in the stochastic Fock space, which is the space of formal power series in a bunch of variables, one for each species, and $H$ is an operator on this space, called the Hamiltonian.

From now on I’ll number the species with numbers from $1$ to $k,$ so

$S = \{1, \dots, k\}$

Then the stochastic Fock space consists of real formal power series in variables that I’ll call $z_1, \dots, z_k.$ We can write any of these power series as

$\displaystyle{\Psi = \sum_{\ell \in \mathbb{N}^k} \psi_\ell z^\ell }$

where

$z^\ell = z_1^{\ell_1} \cdots z_k^{\ell_k}$

We have annihilation and creation operators on the stochastic Fock space:

$\displaystyle{ a_i \Psi = \frac{\partial}{\partial z_i} \Psi }$

$\displaystyle{ a_i^\dagger \Psi = z_i \Psi }$

and the Hamiltonian is built from these as follows:

$\displaystyle{ H = \sum_{\tau \in T} r(\tau) \, ({a^\dagger}^{t(\tau)} - {a^\dagger}^{s(\tau)}) \, a^{s(\tau)} }$

John explained this here (using slightly different notation), so I won’t go into much detail now, but I’ll say what all the symbols mean. Remember that the source of a transition $\tau$ is a complex, or list of natural numbers:

$s(\tau) = (s_1(\tau), \dots, s_k(\tau))$

So, the power $a^{s(\tau)}$ is really an abbreviation for a big product of annihilation operators, like this:

$\displaystyle{ a^{s(\tau)} = a_1^{s_1(\tau)} \cdots a_k^{s_k(\tau)} }$

This describes the annihilation of all the inputs to the transition $\tau.$ Similarly, we define

$\displaystyle{ {a^\dagger}^{s(\tau)} = {a_1^\dagger}^{s_1(\tau)} \cdots {a_k^\dagger}^{s_k(\tau)} }$

and

$\displaystyle{ {a^\dagger}^{t(\tau)} = {a_1^\dagger}^{t_1(\tau)} \cdots {a_k^\dagger}^{t_k(\tau)} }$

### The result

Here’s the result:

Theorem. If a solution $\Psi(t)$ of the master equation is a coherent state for all times $t \ge 0,$ then $\Psi(0)$ must be complex balanced except for complexes of degree 0 or 1.

This requires some explanation.

First, saying that $\Psi(t)$ is a coherent state means that it is an eigenvector of all the annihilation operators. Concretely this means

$\Psi (t) = \displaystyle{\frac{e^{c(t) \cdot z}}{e^{c_1(t) + \cdots + c_k(t)}}}$

where

$c(t) = (c_1(t), \dots, c_k(t)) \in [0,\infty)^k$

and

$z = (z_1, \dots, z_k)$

It will be helpful to write

$\mathbf{1}= (1,1,1,...)$

so we can write

$\Psi (t) = \displaystyle{ e^{c(t) \cdot (z - \mathbf{1})} }$

Second, we say that a complex has degree $d$ if it is a sum of exactly $d$ species. For example, in this reaction network:

the complexes $A + C$ and $B + E$ have degree 2, while the rest have degree 1. We use the word ‘degree’ because each complex $\ell$ gives a monomial

$z^\ell = z_1^{\ell_1} \cdots z_k^{\ell_k}$

and the degree of the complex is the degree of this monomial, namely

$\ell_1 + \cdots + \ell_k$

Third and finally, we say a solution $\Psi(t)$ of the master equation is complex balanced for a specific complex $\ell$ if the total rate at which that complex is produced equals the total rate at which it’s destroyed.

Now we are ready to prove the theorem:

Proof. Consider the master equation

$\displaystyle { \frac{d \Psi (t)}{d t} = H \psi (t) }$

Assume that $\Psi(t)$ is a coherent state for all $t \ge 0.$ This means

$\Psi (t) = \displaystyle{ e^{c(t) \cdot (z - \mathbf{1})} }$

For convenience, we write $c(t)$ simply as $c,$ and similarly for the components $c_i$. Then we have

$\displaystyle{ \frac{d\Psi(t)}{dt} = (\dot{c} \cdot (z - \mathbf{1})) \, e^{c \cdot (z - \mathbf{1})} }$

On the other hand, the master equation gives

$\begin{array}{ccl} \displaystyle {\frac{d\Psi(t)}{dt}} &=& \displaystyle{ \sum_{\tau \in T} r(\tau) \, ({a^\dagger}^{t(\tau)} - {a^\dagger}^{s(\tau)}) \, a^{s(\tau)} e^{c \cdot (z - \mathbf{1})} } \\ \\ &=& \displaystyle{\sum_{\tau \in T} c^{t(\tau)} r(\tau) \, ({z}^{t(\tau)} - {z}^{s(\tau)}) e^{c \cdot (z - \mathbf{1})} } \end{array}$

So,

$\displaystyle{ (\dot{c} \cdot (z - \mathbf{1})) \, e^{c \cdot (z - \mathbf{1})} =\sum_{\tau \in T} c^{t(\tau)} r(\tau) \, ({z}^{t(\tau)} - {z}^{s(\tau)}) e^{c \cdot (z - \mathbf{1})} }$

As a result, we get

$\displaystyle{ \dot{c}\cdot z -\dot{c}\cdot\mathbf{1} = \sum_{\tau \in T} c^{s(\tau)} r(\tau) \, ({z}^{t(\tau)} - {z}^{s(\tau)}) }.$

Comparing the coefficients of all $z^\ell,$ we obtain the following. For $\ell = 0,$ which is the only complex of degree zero, we get

$\displaystyle { \sum_{\tau: t(\tau)=0} r(\tau) c^{s(\tau)} - \sum_{\tau\;:\; s(\tau)= 0} r(\tau) c^{s(\tau)} = -\dot{c}\cdot\mathbf{1} }$

For the complexes $\ell$ of degree one, we get these equations:

$\displaystyle { \sum_{\tau\;:\; t(\tau)=(1,0,0,\dots)} r(\tau) c^{s(\tau)} - \sum_{\tau \;:\;s(\tau)=(1,0,0,\dots)} r(\tau) c^{s(\tau)}= \dot{c_1} }$

$\displaystyle { \sum_{\tau\; :\; t(\tau)=(0,1,0,\dots)} r(\tau) c^{s(\tau)} - \sum_{\tau\;:\; s(\tau)=(0,1,0,\dots)} r(\tau) c^{s(\tau)} = \dot{c_2} }$

and so on. For all the remaining complexes $\ell$ we have

$\displaystyle { \sum_{\tau\;:\; t(\tau)=\ell} r(\tau) c^{s(\tau)} = \sum_{\tau \;:\; s(\tau)=\ell} r(\tau) c^{s(\tau)} }.$

This says that the total rate at which this complex is produced equals the total rate at which it’s destroyed. So, our solution of the master equation is complex balanced for all complexes $\ell$ of degree greater than one. This is our necessary condition.                                                                                   █

To illustrate the theorem, I’ll consider three simple examples. The third example shows that the condition in the theorem, though necessary, is not sufficient. Note that our proof also gives a necessary and sufficient condition for a coherent state to remain coherent: namely, that all the equations we listed hold, not just initially but for all times. But this condition seems a bit complicated.

### Introducing amoebae into a Petri dish

Suppose that there is an inexhaustible supply of amoebae, randomly floating around in a huge pond. Each time an amoeba comes into our collection area, we catch it and add it to the population of amoebae in the Petri dish. Suppose that the rate constant for this process is 3.

So, the Hamiltonian is $3(a^\dagger -1).$ If we start with a coherent state, say

$\displaystyle { \Psi(0)=\frac{e^{cz}}{e^c} }$

then

$\displaystyle { \Psi(t) = e^{3(a^\dagger -1)t} \; \frac{e^{cz}}{e^c} = \frac{e^{(c+3t)z}}{e^{c+3t}} }$

which is coherent at all times.

We can see that the condition of the theorem is satisfied, as all the complexes in the reaction network have degree 0 or 1.

### Amoebae reproducing and competing

This example shows a Petri dish with one species, amoebae, and two transitions: fission and competition. We suppose that the rate constant for fission is 2, while that for competition is 1. The Hamiltonian is then

$H= 2({a^\dagger}^2-a^\dagger)a + (a^\dagger-{a^\dagger}^2)a^2$

If we start off with the coherent state

$\displaystyle{\Psi(0) = \frac{e^{2z}}{e^2}}$

we find that

$\displaystyle {\Psi(t)=e^{2(z^2-z)2+(z-z^2)4} \; \Psi(0)}=\Psi(0)$

which is coherent. It should be noted that the chosen initial state

$\displaystyle{ \frac{e^{2z}}{e^2}}$

was a complex balanced equilibrium solution. So, the Anderson–Craciun–Kurtz Theorem applies to this case.

### Amoebae reproducing, competing, and being introduced

This is a combination of the previous two examples, where apart from ongoing reproduction and competition, amoebae are being introduced into the dish with a rate constant 3.

As in the above examples, we might think that coherent states could remain coherent forever here too. Let’s check that.

Assuming that this was true, if

$\displaystyle{\Psi(t) = \frac{e^{c(t)z}}{e^{c(t)}} }$

then $c(t)$ would have to satisfy the following:

$\dot{c}(t) = c(t)^2 + 3 -2c(t)$

and

$c(t)^2=2c(t)$

Using the second equation, we get

$\dot{c}(t) = 3 \Rightarrow c = 3t+ c_0$

But this is certainly not a solution of the second equation. So, here we find that initially coherent states do not remain remain coherent for all times.

However, if we choose

$\displaystyle{\Psi(0) = \frac{e^{2z}}{e^2}}$

then this coherent state is complex balanced except for complexes of degree 1, since it was in the previous example, and the only new feature of this example, at time zero, is that single amoebas are being introduced—and these are complexes of degree 1. So, the condition of the theorem does hold.

So, the condition in the theorem is necessary but not sufficient. However, it is easy to check, and we can use it to show that in many cases, coherent states must cease to be coherent.

## The Large-Number Limit for Reaction Networks (Part 2)

6 July, 2013

I’ve been talking a lot about ‘stochastic mechanics’, which is like quantum mechanics but with probabilities replacing amplitudes. In Part 1 of this mini-series I started telling you about the ‘large-number limit’ in stochastic mechanics. It turns out this is mathematically analogous to the ‘classical limit’ of quantum mechanics, where Planck’s constant $\hbar$ goes to zero.

There’s a lot more I need to say about this, and lots more I need to figure out. But here’s one rather easy thing.

In quantum mechanics, ‘coherent states’ are a special class of quantum states that are very easy to calculate with. In a certain precise sense they are the best quantum approximations to classical states. This makes them good tools for studying the classical limit of quantum mechanics. As $\hbar \to 0,$ they reduce to classical states where, for example, a particle has a definite position and momentum.

We can borrow this strategy to study the large-number limit of stochastic mechanics. We’ve run into coherent states before in our discussions here. Now let’s see how they work in the large-number limit!

### Coherent states

For starters, let’s recall what coherent states are. We’ve got $k$ different kinds of particles, and we call each kind a species. We describe the probability that we have some number of particles of each kind using a ‘stochastic state’. For starters, this is a formal power series in variables $z_1, \dots, z_k.$ We write it as

$\displaystyle{\Psi = \sum_{\ell \in \mathbb{N}^k} \psi_\ell z^\ell }$

where $z^\ell$ is an abbreviation for

$z_1^{\ell_1} \cdots z_k^{\ell_k}$

But for $\Psi$ to be a stochastic state the numbers $\psi_\ell$ need to be probabilities, so we require that

$\psi_\ell \ge 0$

and

$\displaystyle{ \sum_{\ell \in \mathbb{N}^k} \psi_\ell = 1}$

Sums of coefficients like this show up so often that it’s good to have an abbreviation for them:

$\displaystyle{ \langle \Psi \rangle = \sum_{\ell \in \mathbb{N}^k} \psi_\ell}$

Now, a coherent state is a stochastic state where the numbers of particles of each species are independent random variables, and the number of the $i$th species is distributed according to a Poisson distribution.

Since we can pick ithe means of these Poisson distributions to be whatever we want, we get a coherent state $\Psi_c$ for each list of numbers $c \in [0,\infty)^k:$

$\displaystyle{ \Psi_c = \frac{e^{c \cdot z}}{e^c} }$

Here I’m using another abbreviation:

$e^{c} = e^{c_1 + \cdots + c_k}$

If you calculate a bit, you’ll see

$\displaystyle{ \Psi_c = e^{-(c_1 + \cdots + c_k)} \, \sum_{n \in \mathbb{N}^k} \frac{c_1^{n_1} \cdots c_k^{n_k}} {n_1! \, \cdots \, n_k! } \, z_1^{n_1} \cdots z_k^{n_k} }$

Thus, the probability of having $n_i$ things of the $i$th species is equal to

$\displaystyle{ e^{-c_i} \, \frac{c_i^{n_i}}{n_i!} }$

This is precisely the definition of a Poisson distribution with mean equal to $c_i.$

What are the main properties of coherent states? For starters, they are indeed states:

$\langle \Psi_c \rangle = 1$

More interestingly, they are eigenvectors of the annihilation operators

$a_i = \displaystyle{ \frac{\partial}{\partial z_i} }$

since when you differentiate an exponential you get back an exponential:

$\begin{array}{ccl} a_i \Psi_c &=& \displaystyle{ \frac{\partial}{\partial z_i} \frac{e^{c \cdot z}}{e^c} } \\ \\ &=& c_i \Psi_c \end{array}$

We can use this fact to check that in this coherent state, the mean number of particles of the $i$th species really is $c_i.$ For this, we introduce the number operator

$N_i = a_i^\dagger a_i$

where $a_i^\dagger$ is the creation operator:

$(a_i^\dagger \Psi)(z) = z_i \Psi(z)$

The number operator has the property that

$\langle N_i \Psi \rangle$

is the mean number of particles of the $i$th species. If we calculate this for our coherent state $\Psi_c,$ we get

$\begin{array}{ccl} \langle a_i^\dagger a_i \Psi_c \rangle &=& c_i \langle a_i^\dagger \Psi_c \rangle \\ \\ &=& c_i \langle \Psi_c \rangle \\ \\ &=& c_i \end{array}$

Here in the second step we used the general rule

$\langle a_i^\dagger \Phi \rangle = \langle \Phi \rangle$

which is easy to check.

### Rescaling

Now let’s see how coherent states work in the large-numbers limit. For this, let’s use the rescaled annihilation, creation and number operators from Part 1. They look like this:

$A_i = \hbar \, a_i$

$C_i = a_i^\dagger$

$\widetilde{N}_i = C_i A_i$

Since

$\widetilde{N}_i = \hbar N_i$

the point is that the rescaled number operator counts particles not one at a time, but in bunches of size $1/\hbar.$ For example, if $\hbar$ is the reciprocal of Avogadro’s number, we are counting particles in ‘moles’. So, $\hbar \to 0$ corresponds to a large-number limit.

To flesh out this idea some more, let’s define rescaled coherent states:

$\widetilde{\Psi}_c = \Psi_{c/\hbar}$

These are eigenvectors of the rescaled annihilation operators:

$\begin{array}{ccl} A_i \widetilde{\Psi}_c &=& \hbar a_i \Psi_{c/\hbar} \\ \\ &=& c_i \Psi_{c/\hbar} \\ \\ &=& c_i \widetilde{\Psi}_c \end{array}$

This in turn means that

$\begin{array}{ccl} \langle \widetilde{N}_i \widetilde{\Psi}_c \rangle &=& \langle C_i A_i \widetilde{\Psi}_c \rangle \\ \\ &=& c_i \langle C_i \widetilde{\Psi}_c \rangle \\ \\ &=& c_i \langle \widetilde{\Psi}_c \rangle \\ \\ &=& c_i \end{array}$

Here we used the general rule

$\langle C_i \Phi \rangle = \langle \Phi \rangle$

which holds because the ‘rescaled’ creation operator $C_i$ is really just the usual creation operator, which obeys this rule.

What’s the point of all this fiddling around? Simply this. The equation

$\langle \widetilde{N}_i \widetilde{\Psi}_c \rangle = c_i$

says the expected number of particles of the $i$th species in the state $\widetilde{\Psi}_c$ is $c_i,$ if we count these particles not one at a time, but in bunches of size $1/\hbar.$

### A simple test

As a simple test of this idea, let’s check that as $\hbar \to 0,$ the standard deviation of the number of particles in the state $\Psi_c$ goes to zero… where we count particle using the rescaled number operator.

The variance of the rescaled number operator is, by definition,

$\langle \widetilde{N}_i^2 \widetilde{\Psi}_c \rangle - \langle \widetilde{N}_i^2 \widetilde{\Psi}_c \rangle^2$

and the standard deviation is the square root of the variance.

We already know the mean of the rescaled number operator:

$\langle \widetilde{N}_i \widetilde{\Psi}_c \rangle = c_i$

So, the main thing we need to calculate is the mean of its square:

$\langle \widetilde{N}_i^2 \widetilde{\Psi}_c \rangle$

For this we will use the commutation relation derived last time:

$[A_i , C_i] = \hbar$

This implies

$\begin{array}{ccl} \widetilde{N}_i^2 &=& C_i A_i C_i A_i \\ \\ &=& C_i (C_i A_i + \hbar) A_i \\ \\ &=& C_i^2 A_i^2 + \hbar C_i A_i \end{array}$

so

$\begin{array}{ccl} \langle \widetilde{N}_i^2\widetilde{\Psi}_c \rangle &=& \langle (C_i^2 A_i^2 + \hbar C_i A_i) \Psi_c \rangle \\ \\ &=& c_i^2 + \hbar c_i \end{array}$

where we used our friends

$A_i \Psi_c = c_i \Psi_c$

and

$\langle C_i \Phi \rangle = \langle \Phi \rangle$

So, the variance of the rescaled number of particles is

$\begin{array}{ccl} \langle \widetilde{N}_i^2 \widetilde{\Psi}_c \rangle - \langle \widetilde{N}_i \widetilde{\Psi}_c \rangle^2 &=& c_i^2 + \hbar c_i - c_i^2 \\ \\ &=& \hbar c_i \end{array}$

and the standard deviation is

$(\hbar c_i)^{1/2}$

Good, it goes to zero as $\hbar \to 0!$ And the square root is just what you’d expect if you’ve thought about stuff like random walks or the central limit theorem.

### A puzzle

I feel sure that in any coherent state, not only the variance but also all the higher moments of the rescaled number operators go to zero as $\hbar \to 0.$ Can you prove this?

Here I mean the moments after the mean has been subtracted. The $p$th moment is then

$\langle (\widetilde{N}_i - c_i)^p \; \widetilde{\Psi}_c \rangle$

I want this to go to zero as $\hbar \to 0.$

Here’s a clue that should help. First, there’s a textbook formula for the higher moments of Poisson distributions without the mean subtracted. If I understand it correctly, it gives this:

$\displaystyle{ \langle N_i^m \; \Psi_c \rangle = \sum_{j = 1}^m {c_i}^j \; \left\{ \begin{array}{c} m \\ j \end{array} \right\} }$

Here

$\displaystyle{ \left\{ \begin{array}{c} m \\ j \end{array} \right\} }$

is the number of ways to partition an $m$-element set into $j$ nonempty subsets. This is called Stirling’s number of the second kind. This suggests that there’s some fascinating combinatorics involving coherent states. That’s exactly the kind of thing I enjoy, so I would like to understand this formula someday… but not today! I just want something to go to zero!

If I rescale the above formula, I seem to get

$\begin{array}{ccl} \langle \widetilde{N}_i^m \; \widetilde{\Psi}_c \rangle &=& \hbar^m \langle N_i^m \Psi_{c/\hbar} \rangle \\ \\ &=& \hbar^m \; \displaystyle{ \sum_{j = 1}^m \left(\frac{c_i}{\hbar}\right)^j \left\{ \begin{array}{c} m \\ j \end{array} \right\} } \end{array}$

We could plug this formula into

$\langle (\widetilde{N}_i - c_i)^p \; \widetilde{\Psi}_c \rangle = \displaystyle{ \sum_{m = 0}^p \, \binom{m}{p} \; \langle \widetilde{N}_i^m \; \widetilde{\Psi}_c \rangle \, (-c_i)^{p - m} }$

and then try to show the result goes to zero as $\hbar \to 0.$ But I don’t have the energy to do that… not right now, anyway!

Maybe you do. Or maybe you can think of a better approach to solving this problem. The answer must be well-known, since the large-number limit of a Poisson distribution is a very important thing.

## Relative Entropy (Part 2)

2 July, 2013

In the first part of this mini-series, I describe how various ideas important in probability theory arise naturally when you start doing linear algebra using only the nonnegative real numbers.

But after writing it, I got an email from a rather famous physicist saying he got “lost at line two”. So, you’ll be happy to hear that the first part is not a prerequisite for the remaining parts! I wrote it just to intimidate that guy.

Tobias Fritz and I have proved a theorem characterizing the concept of relative entropy, which is also known as ‘relative information’, ‘information gain’ or—most terrifying and least helpful of all—‘Kullback-Leibler divergence’. In this second part I’ll introduce two key players in this theorem. The first, $\mathrm{FinStat},$ is a category where:

• an object consists of a system with finitely many states, and a probability distribution on those states

and

• a morphism consists of a deterministic ‘measurement process’ mapping states of one system to states of another, together with a ‘hypothesis’ that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

The second, $\mathrm{FP},$ is a subcategory of $\mathrm{FinStat}.$ It has all the same objects, but only morphisms where the hypothesis is ‘optimal’. This means that if the observer measures the system many times, and uses the probability distribution of their observations together with their hypothesis to guess the probability distribution of states of the system, they get the correct answer (in the limit of many measurements).

In this part all I will really do is explain precisely what $\mathrm{FinStat}$ and $\mathrm{FP}$ are. But to whet your appetite, let me explain how we can use them to give a new characterization of relative entropy!

Suppose we have any morphism in $\mathrm{FinStat}.$ In other words: suppose we have a deterministic measurement process, together with a hypothesis that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

Then we have two probability distributions on the states of the system being measured! First, the ‘true’ probability distribution. Second, the probability that the observer will guess based on their observations.

Whenever we have two probability distributions on the same set, we can compute the entropy of the first relative to to the second. This describes how surprised you’ll be if you discover the probability distribution is really the first, when you thought it was the second.

So: any morphism in $\mathrm{FinStat}$ will have a relative entropy. It will describe how surprised the observer will be when they discover the true probability distribution, given what they had guessed.

But this amount of surprise will be zero if their hypothesis was ‘optimal’ in the sense I described. So, the relative entropy will vanish on morphisms in $\mathrm{FP}.$

Our theorem says this fact almost characterizes the concept of relative entropy! More precisely, it says that any convex-linear lower semicontinuous functor

$F : \mathrm{FinStat} \to [0,\infty]$

that vanishes on the subcategory $\mathrm{FP}$ must equal some constant times the relative entropy.

Don’t be scared! This should not make sense to you yet, since I haven’t said how I’m thinking of $[0,+\infty]$ as a category, nor what a ‘convex-linear lower semicontinuous functor’ is, nor how relative entropy gives one. I will explain all that later. I just want you to get a vague idea of where I’m going.

Now let me explain the categories $\mathrm{FinStat}$ and $\mathrm{FP}.$ We need to warm up a bit first.

### FinStoch

A stochastic map $f : X \leadsto Y$ is different from an ordinary function, because instead of assigning a unique element of $Y$ to each element of $X,$ it assigns a probability distribution on $Y$ to each element of $X.$ So you should imagine it as being like a function ‘with random noise added’, so that $f(x)$ is not a specific element of $Y,$ but instead has a probability of taking on different values. This is why I’m using a weird wiggly arrow to denote a stochastic map.

More formally:

Definition. Given finite sets $X$ and $Y,$ a stochastic map $f : X \leadsto Y$ assigns a real number $f_{yx}$ to each pair $x \in X, y \in Y,$ such that fixing any element $x,$ the numbers $f_{yx}$ form a probability distribution on $Y.$ We call $f_{yx}$ the probability of $y$ given $x.$

In more detail:

$f_{yx} \ge 0$ for all $x \in X,$ $y \in Y.$

and

$\displaystyle{ \sum_{y \in Y} f_{yx} = 1}$ for all $x \in X.$

Note that we can think of $f : X \leadsto Y$ as a $Y \times X$-shaped matrix of numbers. A matrix obeying the two properties above is called stochastic. This viewpoint is nice because it reduces the problem of composing stochastic maps to matrix multiplication. It’s easy to check that multiplying two stochastic matrices gives a stochastic matrix. So, composing stochastic maps gives a stochastic map.

We thus get a category:

Definition. Let $\mathrm{FinStoch}$ be the category of finite sets and stochastic maps between them.

In case you’re wondering why I’m restricting attention to finite sets, it’s merely because I want to keep things simple. I don’t want to worry about whether sums or integrals converge.

### FinProb

Now take your favorite 1-element set and call it $1.$ A function $p : 1 \to X$ is just a point of $X.$ But a stochastic map $p : 1 \leadsto X$ is something more interesting: it’s a probability distribution on $X.$

Why? Because it gives a probability distribution on $X$ for each element of $1,$ but that set has just one element.

Last time I introduced the rather long-winded phrase finite probability measure space to mean a finite set with a probability distribution on it. But now we’ve seen a very quick way to describe such a thing within $\mathrm{FinStoch}:$

And this gives a quick way to think about a measure-preserving function between finite probability measure spaces! It’s just a commutative triangle like this:

Note that the horizontal arrow $f: X \to Y$ is not wiggly. The straight arrow means it’s an honest function, not a stochastic map. But a function is a special case of a stochastic map! So it makes sense to compose a straight arrow with a wiggly arrow—and the result is, in general, a wiggly arrow. So, it makes sense to demand that this triangle commutes, and this says that the function $f: X \to Y$ is measure-preserving.

Let me work through the details, in case they’re not clear.

First: how is a function a special case of a stochastic map? Here’s how. If we start with a function $f: X \to Y,$ we get a matrix of numbers

$f_{yx} = \delta_{y,f(x)}$

where $\delta$ is the Kronecker delta. So, each element $x \in X$ gives a probability distribution that’s zero except at $f(x).$

Given this, we can work out what this commuting triangle really says:

If use $p_x$ to stand for the probability distribution that $p: 1 \leadsto X$ puts on $X,$ and similarly for $q_y,$ the commuting triangle says

$\displaystyle{ q_y = \sum_{x \in X} \delta_{y,f(x)} p_x}$

or in other words:

$\displaystyle{ q_y = \sum_{x \in X : f(x) = y} p_x }$

or if you like:

$\displaystyle{ q_y = \sum_{x \in f^{-1}(y)} p_x }$

In this situation people say $q$ is $p$ pushed forward along $f$, and they say $f$ is a measure-preserving function.

So, we’ve used $\mathrm{FinStoch}$ to describe another important category:

Definition. Let $\mathrm{FinProb}$ be the category of finite probability measure spaces and measure-preserving functions between them.

I can’t resist mentioning another variation:

A commuting triangle like this is a measure-preserving stochastic map. In other words, $p$ gives a probability measure on $X,$ $q$ gives a probability measure on $Y,$ and $f: X \leadsto Y$ is a stochastic map with

$\displaystyle{ q_y = \sum_{x \in X} f_{yx} p_x }$

### FinStat

The category we really need for relative entropy is a bit more subtle. An object is a finite probability measure space:

but a morphism looks like this:

The whole diagram doesn’t commute, but the two equations I wrote down hold. The first equation says that $f: X \to Y$ is a measure-preserving function. In other words, this triangle, which we’ve seen before, commutes:

The second equation says that $f \circ s$ is the identity, or in math jargon, $s$ is a section for $f.$

But what does that really mean?

The idea is that $X$ is the set of ‘states’ of some system, while $Y$ is a set of possible ‘observations’ you might make. The function $f$ is a ‘measurement process’. You ‘measure’ the system using $f,$ and if the system is in the the state $x$ you get the observation $f(x).$ The probability distribution $p$ says the probability that the system is any given state, while $q$ says the probability that you get any given observation when you do your measurement.

Note: are assuming for now that that there’s no random noise in the observation process! That’s why $f$ is a function instead of a stochastic map.

But what about $s?$ That’s the fun part: $s$ describes your ‘hypothesis’ about the system’s state given a particular measurement! If you measure the system and get a result $y \in Y,$ you guess it’s in the state $x$ with probability $s_{xy}.$

And we don’t want this hypothesis to be really dumb: that’s what

$f \circ s = 1_Y$

says. You see, this equation says that

$\sum_{x \in X} \delta_{y', f(x)} s_{xy} = \delta_{y' y}$

or in other words:

$\sum_{x \in f^{-1}(y')} s_{xy} = \delta_{y' y}$

If you think about it, this implies $s_{xy} = 0$ unless $f(x) = y.$

So, if you make an observation $y,$ you will guess the system is in state $x$ with probability zero unless $f(x) = y.$ In short, you won’t make a really dumb guess about the system’s state.

Here’s how we compose morphisms:

We get a measure-preserving function $g \circ f : X \to Z$ and a stochastic map going back, $s \circ t : Z \to Z.$ You can check that these obey the required equations:

$g \circ f \circ p = r$

$g \circ f \circ s \circ t = 1_Z$

So, we get a category:

Definition. Let $\mathrm{FinStat}$ be the category where an object is a finite probability measure space:

a morphism is a diagram obeying these equations:

and composition is defined as above.

### FP

As we’ve just seen, a morphism in $\mathrm{FinStat}$ consists of a ‘measurement process’ $f$ and a ‘hypothesis’ $s:$

But sometimes we’re lucky and our hypothesis is optimal, in the sense that

$s \circ q = p$

Conceptually, this says that if you take the probability distribution $q$ on our observations and use it to guess a probability distribution for the system’s state using our hypothesis $s,$ you get the correct answer: $p.$

Mathematically, it says that this diagram commutes:

In other words, $s$ is a measure-preserving stochastic map.

There’s a subcategory of $\mathrm{FinStat}$ with all the same objects, but only these ‘optimal’ morphisms. It’s important, but the name we have for it is not very exciting:

Definition. Let $\mathrm{FP}$ be the subcategory of $\mathrm{FinStat}$ where an object is a finite probability measure space

and a morphism is a diagram obeying these equations:

Why do we call this category $\mathrm{FP}$? Because it’s a close relative of $\mathrm{FinProb},$ where a morphism, you’ll remember, looks like this:

The point is that for a morphism in $\mathrm{FP},$ the conditions on $s$ are so strong that they completely determine it unless there are observations that happen with probability zero—that is, unless there are $y \in Y$ with $q_y = 0.$ To see this, note that

$s \circ q = p$

actually says

$\sum_{y \in Y} s_{xy} q_y = p_x$

for any choice of $x \in X.$ But we’ve already seen $s_{xy} = 0$ unless $f(x) = y,$ so the sum has just one term, and the equation says

$s_{x,f(x)} q_{f(x)} = p_x$

We can solve this for $s_{x,f(x)},$ so $s$ is completely determined… unless $q_{f(x)} = 0.$

This covers the case when $y = f(x).$ We also can’t figure out $s_{x,y}$ if $y$ isn’t in the image of $f.$

So, to be utterly precise, $s$ is determined by $p, q$ and $f$ unless there’s an element $y \in Y$ that has $q_y = 0.$ Except for this special case, a morphism in $\mathrm{FP}$ is just a morphism in $\mathrm{FinProb}.$ But in this special case, a morphism in $\mathrm{FP}$ has a little extra information: an arbitrary probability distribution on the inverse image of each point $y$ with this property.

In short, $\mathrm{FP}$ is the same as $\mathrm{FinProb}$ except that our observer’s ‘optimal hypothesis’ must provide a guess about the state of the system given an observation, even in cases of observations that occur with probability zero.

I’m going into these nitpicky details for two reasons. First, we’ll need $\mathrm{FP}$ for our characterization of relative entropy. But second, Tom Leinster already ran into this category in his work on entropy and category theory! He discussed it here:

• Tom Leinster, An operadic introduction to entropy.

Despite the common theme of entropy, he arrived at it from a very different starting-point.

### Conclusion

So, I hope that next time I can show you something like this:

and you’ll say “Oh, that’s a probability distribution on the states of some system!” Intuitively, you should think of the wiggly arrow $p$ as picking out a ‘random element’ of the set $X.$

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, sending a probability distribution on the states of the measured system to a probability distribution on observations!”

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with a hypothesis about the system’s state, given what is observed!”

And I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with an optimal hypothesis about the system’s state, given what is observed!”

I don’t count on it… but I can hope.

### Postscript

And speaking of unrealistic hopes, if I were really optimistic I would hope you noticed that $\mathrm{FinStoch}$ and $\mathrm{FinProb},$ which underlie the more fancy categories I’ve discussed today, were themselves constructed starting from linear algebra over the nonnegative numbers $[0,\infty)$ in Part 1. That ‘foundational’ work is not really needed for what we’re doing now. However, I like the fact that we’re ultimately getting the concept of relative entropy starting from very little: just linear algebra, using only nonnegative numbers!