Network Theory (Part 9)

13 September, 2011

jointly written with Brendan Fong

Last time we reviewed the rate equation and the master equation. Both of them describe processes where things of various kinds can react and turn into other things. But:

• In the rate equation, we assume the number of things varies continuously and is known precisely.

• In the master equation, we assume the number of things varies discretely and is known only probabilistically.

This should remind you of the difference between classical mechanics and quantum mechanics. But the master equation is not quantum, it’s stochastic: it involves probabilities, but there’s no uncertainty principle going on.

Still, a lot of the math is similar.

Now, given an equilibrium solution to the rate equation—one that doesn’t change with time—we’ll try to find a solution to the master equation with the same property. We won’t always succeed—but we often can! The theorem saying how was proved here:

• D. F. Anderson, G. Craciun and T. G. Kurtz, Product-form stationary distributions for deficiency zero chemical reaction networks.

To emphasize the analogy to quantum mechanics, we’ll translate their proof into the language of annihilation and creation operators. In particular, our equilibrium solution of the master equation is just like what people call a ‘coherent state’ in quantum mechanics.

So, if you know about quantum mechanics and coherent states, you should be happy. But if you don’t, fear not!—we’re not assuming you do.

The rate equation

To construct our equilibrium solution of the master equation, we need a special type of solution to our rate equation. We call this type a ‘complex balanced solution’. This means that not only is the net rate of production of each species zero, but the net rate of production of each possible bunch of species is zero.

Before we make this more precise, let’s remind ourselves of the basic setup.

We’ll consider a stochastic Petri net with a finite set S of species and a finite set T of transitions. For convenience let’s take S = \{1,\dots, k\}, so our species are numbered from 1 to k. Then each transition \tau has an input vector m(\tau) \in \mathbb{N}^k and an output vector n(\tau) \in \mathbb{N}^k. These say how many things of each species go in, and how many go out. Each transition also has a rate constant r(\tau) \in [0,\infty), which says how rapidly it happens.

The rate equation concerns a vector x(t) \in [0,\infty)^k whose ith component is the number of things of the ith species at time t. Note: we’re assuming this number of things varies continuously and is known precisely! This should remind you of classical mechanics. So, we’ll call x(t), or indeed any vector in [0,\infty)^k, a classical state.

The rate equation says how the classical state x(t) changes with time:

\displaystyle{  \frac{d x}{d t} = \sum_{\tau \in T} r(\tau)\, (n(\tau)-m(\tau)) \, x^{m(\tau)} }

You may wonder what x^{m(\tau)} means: after all, we’re taking a vector to a vector power! It’s just an abbreviation, which we’ve seen plenty of times before. If x \in \mathbb{R}^k is a list of numbers and m \in \mathbb{N}^k is a list of natural numbers, we define

x^m = x_1^{m_1} \cdots x_k^{m_k}

We’ll also use this notation when x is a list of operators.
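
If you like seeing notation spelled out in code, here’s a minimal sketch in Python of the vector power for lists of numbers (the vectors are arbitrary examples):

```python
import numpy as np

# x^m = x_1^{m_1} * ... * x_k^{m_k} for a vector x and a multi-index m
def vector_power(x, m):
    return np.prod(np.asarray(x, dtype=float) ** np.asarray(m))

print(vector_power([2, 3, 5], [1, 0, 2]))  # 2^1 * 3^0 * 5^2 = 50.0
```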

Complex balance

The vectors m(\tau) and n(\tau) are examples of what chemists call complexes. A complex is a bunch of things of each species. For example, if the set S consists of three species, the complex (1,0,5) is a bunch consisting of one thing of the first species, none of the second species, and five of the third species.

For our Petri net, the set of complexes is the set \mathbb{N}^k, and the complexes of particular interest are the input complex m(\tau) and the output complex n(\tau) of each transition \tau.

We say a classical state c \in [0,\infty)^k is complex balanced if for all complexes \kappa \in \mathbb{N}^k we have

\displaystyle{ \sum_{\{\tau : m(\tau) = \kappa\}} r(\tau) c^{m(\tau)} =\sum_{\{\tau : n(\tau) = \kappa\}} r(\tau) c^{m(\tau)}  }

The left hand side of this equation, which sums over the transitions with input complex \kappa, gives the rate of consumption of the complex \kappa. The right hand side, which sums over the transitions with output complex \kappa, gives the rate of production of \kappa . So, this equation requires that the net rate of production of the complex \kappa is zero in the classical state c .

Puzzle. Show that if a classical state c is complex balanced, and we set x(t) = c for all t, then x(t) is a solution of the rate equation.

Since x(t) doesn’t change with time here, we call it an equilibrium solution of the rate equation. Since x(t) = c is complex balanced, we call it a complex balanced equilibrium solution.
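
If you’d rather check complex balance by computer than by hand, here’s a little Python sketch of the definition. The net here is a one-species example (fission X → 2X and competition 2X → X, like the amoeba net we’ll meet again below), and the rate constants are made-up values:

```python
from collections import defaultdict

# A transition is (rate, input complex, output complex); complexes are tuples.
alpha, beta = 2.0, 0.5                 # made-up rate constants
transitions = [(alpha, (1,), (2,)),    # fission:     X -> 2X
               (beta,  (2,), (1,))]    # competition: 2X -> X

def vector_power(c, m):
    p = 1.0
    for ci, mi in zip(c, m):
        p *= ci ** mi
    return p

def is_complex_balanced(c, transitions, tol=1e-9):
    net = defaultdict(float)           # net rate of production of each complex
    for r, m, n in transitions:
        flow = r * vector_power(c, m)
        net[m] -= flow                 # the input complex m is consumed
        net[n] += flow                 # the output complex n is produced
    return all(abs(v) < tol for v in net.values())

print(is_complex_balanced((alpha / beta,), transitions))  # True
print(is_complex_balanced((1.0,), transitions))           # False
```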

The master equation

We’ve seen that any complex balanced classical state gives an equilibrium solution of the rate equation. The Anderson–Craciun–Kurtz theorem says that it also gives an equilibrium solution of the master equation.

The master equation concerns a formal power series

\displaystyle{ \Psi(t) = \sum_{n \in \mathbb{N}^k} \psi_n(t) z^n }

where

z^n = z_1^{n_1} \cdots z_k^{n_k}

and

\psi_n(t) = \psi_{n_1, \dots,n_k}(t)

is the probability that at time t we have n_1 things of the first species, n_2 of the second species, and so on.

Note: now we’re assuming this number of things varies discretely and is known only probabilistically! So, we’ll call \Psi(t), or indeed any formal power series where the coefficients are probabilities summing to 1, a stochastic state. Earlier we just called it a ‘state’, but that would get confusing now: we’ve got classical states and stochastic states, and we’re trying to relate them.

The master equation says how the stochastic state \Psi(t) changes with time:

\displaystyle{ \frac{d}{d t} \Psi(t) = H \Psi(t) }

where the Hamiltonian H is:

\displaystyle{ H = \sum_{\tau \in T} r(\tau) \, \left({a^\dagger}^{n(\tau)} - {a^\dagger}^{m(\tau)} \right) \, a^{m(\tau)} }

The notation here is designed to neatly summarize some big products of annihilation and creation operators. For any vector n \in \mathbb{N}^k, we have

a^n = a_1^{n_1} \cdots  a_k^{n_k}

and

\displaystyle{  {a^\dagger}^n = {a_1^\dagger }^{n_1} \cdots  {a_k^\dagger}^{n_k} }

Coherent states

Now suppose c \in [0,\infty)^k is a complex balanced equilibrium solution of the rate equation. We want to get an equilibrium solution of the master equation. How do we do it?

For any c \in [0,\infty)^k there is a stochastic state called a coherent state, defined by

\displaystyle{ \Psi_c = \frac{e^{c z}}{e^c} }

Here we are using some very terse abbreviations. Namely, we are defining

e^{c} = e^{c_1} \cdots e^{c_k}

and

e^{c z} = e^{c_1 z_1} \cdots e^{c_k z_k}

Equivalently,

\displaystyle{ e^{c z} = \sum_{n \in \mathbb{N}^k} \frac{c^n}{n!}z^n }

where c^n and z^n are defined as products in our usual way, and

n! = n_1! \, \cdots \, n_k!

Either way, if you unravel the abbreviations, here’s what you get:

\displaystyle{  \Psi_c = e^{-(c_1 + \cdots + c_k)} \, \sum_{n \in \mathbb{N}^k} \frac{c_1^{n_1} \cdots c_k^{n_k}} {n_1! \, \cdots \, n_k! } \, z_1^{n_1} \cdots z_k^{n_k} }

Maybe now you see why we like the abbreviations.

The name ‘coherent state’ comes from quantum mechanics. In quantum mechanics, we think of a coherent state \Psi_c as the ‘quantum state’ that best approximates the classical state c. But we’re not doing quantum mechanics now, we’re doing probability theory. \Psi_c isn’t a ‘quantum state’, it’s a stochastic state.

In probability theory, people like Poisson distributions. In the state \Psi_c, the probability of having n_i things of the ith species is equal to

\displaystyle{  e^{-c_i} \, \frac{c_i^{n_i}}{n_i!} }

This is precisely the definition of a Poisson distribution with mean equal to c_i. We can multiply a bunch of factors like this, one for each species, to get

\displaystyle{ e^{-c} \frac{c^n}{n!} }

This is the probability of having n_1 things of the first species, n_2 things of the second, and so on, in the state \Psi_c. So, the state \Psi_c is a product of independent Poisson distributions. In particular, knowing how many things there are of one species says nothing at all about how many things there are of any other species!
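
Here’s a quick numerical check of this claim: a few lines of Python comparing a coefficient of \Psi_c with the corresponding product of Poisson probabilities (the values of c and n are arbitrary):

```python
import math

# Coefficient of z^n in Psi_c: e^{-(c_1+...+c_k)} c^n / n!
def coherent_coefficient(c, n):
    p = math.exp(-sum(c))
    for ci, ni in zip(c, n):
        p *= ci ** ni / math.factorial(ni)
    return p

def poisson(mean, n):
    return math.exp(-mean) * mean ** n / math.factorial(n)

c, n = (2.0, 0.5), (3, 1)
print(coherent_coefficient(c, n))         # 0.0547...
print(poisson(2.0, 3) * poisson(0.5, 1))  # the same number
```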

It is remarkable that such a simple state can give an equilibrium solution of the master equation, even for very complicated stochastic Petri nets. But it’s true—at least if c is complex balanced.

The Anderson–Craciun–Kurtz theorem

Now we’re ready to state and prove the big result:

Theorem (Anderson–Craciun–Kurtz). Suppose c \in [0,\infty)^k is a complex balanced equilibrium solution of the rate equation. Then H \Psi_c = 0.

It follows that \Psi_c is an equilibrium solution of the master equation. In other words, if we take \Psi(t) = \Psi_c for all times t, the master equation holds:

\displaystyle{ \frac{d}{d t} \Psi(t) = H \Psi(t) }

since both sides are zero.

Proof. To prove the Anderson–Craciun–Kurtz theorem, we just need to show that H \Psi_c = 0. Since \Psi_c is a constant times e^{c z}, it suffices to show H e^{c z} = 0. Remember that

\displaystyle{ H e^{c z} =  \sum_{\tau \in T} r(\tau) \left( {a^\dagger}^{n(\tau)} -{a^\dagger}^{m(\tau)} \right) \, a^{m(\tau)} \, e^{c z} }

Since the annihilation operator a_i is given by differentiation with respect to z_i, while the creation operator a^\dagger_i is just multiplying by z_i, we have:

\displaystyle{ H e^{c z} = \sum_{\tau \in T} r(\tau) \, c^{m(\tau)} \left( z^{n(\tau)} - z^{m(\tau)} \right) e^{c z} }

Expanding out e^{c z} we get:

\displaystyle{ H e^{c z} = \sum_{i \in \mathbb{N}^k} \sum_{\tau \in T} r(\tau)c^{m(\tau)}\left(z^{n(\tau)}\frac{c^i}{i!}z^i - z^{m(\tau)}\frac{c^i}{i!}z^i\right) }

Shifting indices and defining negative powers to be zero:

\displaystyle{ H e^{c z}  = \sum_{i \in \mathbb{N}^k} \sum_{\tau \in T} r(\tau)c^{m(\tau)}\left(\frac{c^{i-n(\tau)}}{(i-n(\tau))!}z^i - \frac{c^{i-m(\tau)}}{(i-m(\tau))!}z^i\right) }

So, to show H e^{c z} = 0, we need to show this:

\displaystyle{ \sum_{i \in \mathbb{N}^k} \sum_{\tau \in T} r(\tau)c^{m(\tau)}\frac{c^{i-n(\tau)}}{(i-n(\tau))!}z^i =\sum_{i \in \mathbb{N}^k} \sum_{\tau \in T} r(\tau)c^{m(\tau)}\frac{c^{i-m(\tau)}}{(i-m(\tau))!}z^i }

We do this by splitting the sum over T according to output and then input complexes, making use of the complex balanced condition:

\displaystyle{ \sum_{i \in \mathbb{N}^k} \sum_{\kappa \in \mathbb{N}^k} \sum_{\{\tau : n(\tau)=\kappa\}}  r(\tau)c^{m(\tau)}\frac{c^{i-n(\tau)}}{(i-n(\tau))!} \, z^i  = }

\displaystyle{ \sum_{i \in \mathbb{N}^k} \sum_{\kappa \in \mathbb{N}^k} \frac{c^{i-\kappa}}{(i-\kappa)!}\, z^i \sum_{\{\tau : n(\tau) = \kappa\}}  r(\tau)c^{m(\tau)} = }

\displaystyle{\sum_{i \in \mathbb{N}^k} \sum_{\kappa \in \mathbb{N}^k} \frac{c^{i-\kappa}}{(i-\kappa)!}\, z^i \sum_{\{\tau : m(\tau) = \kappa\}}  r(\tau)c^{m(\tau)}  = }

\displaystyle{ \sum_{i \in \mathbb{N}^k} \sum_{\kappa \in \mathbb{N}^k} \sum_{\{\tau : m(\tau) = \kappa\}}  r(\tau)c^{m(\tau)}\frac{c^{i-m(\tau)}}{(i-m(\tau))!}\, z^i }

This completes the proof! It’s just algebra, but it seems a bit magical, so we’re trying to understand it better.

I hope you see how amazing this result is. If you know quantum mechanics and coherent states you’ll understand what I mean. A coherent state is the "best quantum approximation" to a classical state, but we don’t expect this quantum state to be exactly time-independent when the corresponding classical state is, except in very special cases, like when the Hamiltonian is quadratic in the creation and annihilation operators. Here we are getting a result like that much more generally… but only given the "complex balanced" condition.

An example

We’ve already seen one example of the theorem, back in Part 7. We had this stochastic Petri net:

We saw that the rate equation is just the logistic equation, familiar from population biology. The equilibrium solution is complex balanced, because pairs of amoebas are getting created at the same rate as they’re getting destroyed, and single amoebas are getting created at the same rate as they’re getting destroyed.

So, the Anderson–Craciun–Kurtz theorem guarantees that there’s an equilibrium solution of the master equation where the number of amoebas is distributed according to a Poisson distribution. And, we actually checked that this was true!
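
You can also check it numerically. Here’s a sketch in Python that builds the Hamiltonian for a one-species net with fission X → 2X and competition 2X → X (the rate constants are made up), truncates the space of power series at degree N, and verifies that H \Psi_c vanishes up to rounding and truncation error:

```python
import numpy as np
from math import exp, factorial

N = 40                                 # truncation: at most N - 1 amoebas
a = np.diag(np.arange(1.0, N), 1)      # annihilation: a z^n = n z^{n-1}
ad = np.diag(np.ones(N - 1), -1)       # creation:     a† z^n = z^{n+1}

alpha, beta = 2.0, 0.5                 # made-up rate constants
# H = alpha (a†² - a†) a + beta (a† - a†²) a²
H = alpha * (ad @ ad - ad) @ a + beta * (ad - ad @ ad) @ (a @ a)

c = alpha / beta                       # complex balanced equilibrium
psi = np.array([exp(-c) * c**n / factorial(n) for n in range(N)])

residual = H @ psi
print(np.max(np.abs(residual[:N - 5])))  # ~1e-15: zero up to rounding error
```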

Next time we’ll look at another example.


Network Theory (Part 8)

9 September, 2011

Summer vacation is over. Time to get back to work!

This month, before he goes to Oxford to begin a master’s program in Mathematics and the Foundations of Computer Science, Brendan Fong is visiting the Centre for Quantum Technologies and working with me on stochastic Petri nets. He’s proved two interesting results, which he wants to explain.

To understand what he’s done, you need to know how to get the rate equation and the master equation from a stochastic Petri net. We’ve almost seen how. But it’s been a long time since the last article in this series, so today I’ll start with some review. And at the end, just for fun, I’ll say a bit more about how Feynman diagrams show up in this theory.

Since I’m an experienced teacher, I’ll assume you’ve forgotten everything I ever said.

(This has some advantages. I can change some of my earlier terminology—improve it a bit here and there—and you won’t even notice.)

Stochastic Petri nets

Definition. A Petri net consists of a set S of species and a set T of transitions, together with a function

i : S \times T \to \mathbb{N}

saying how many things of each species appear in the input for each transition, and a function

o: S \times T \to \mathbb{N}

saying how many things of each species appear in the output.

We can draw pictures of Petri nets. For example, here’s a Petri net with two species and three transitions:

It should be clear that the transition ‘predation’ has one wolf and one rabbit as input, and two wolves as output.

A ‘stochastic’ Petri net goes further: it also says the rate at which each transition occurs.

Definition. A stochastic Petri net is a Petri net together with a function

r: T \to [0,\infty)

giving a rate constant for each transition.

Master equation versus rate equation

Starting from any stochastic Petri net, we can get two things. First:

• The master equation. This says how the probability that we have a given number of things of each species changes with time.

• The rate equation. This says how the expected number of things of each species changes with time.

The master equation is stochastic: it describes how probabilities change with time. The rate equation is deterministic.

The master equation is more fundamental. It’s like the equations of quantum electrodynamics, which describe the amplitudes for creating and annihilating individual photons. The rate equation is less fundamental. It’s like the classical Maxwell equations, which describe changes in the electromagnetic field in a deterministic way. The classical Maxwell equations are an approximation to quantum electrodynamics. This approximation gets good in the limit where there are lots of photons all piling on top of each other to form nice waves.

Similarly, the rate equation can be derived from the master equation in the limit where the numbers of things of each species become large, and the fluctuations in these numbers become negligible.

But I won’t do this derivation today! Nor will I probe more deeply into the analogy with quantum field theory, even though that’s my ultimate goal. Today I’m content to remind you what the master equation and rate equation are.

The rate equation is simpler, so let’s do that first.

The rate equation

Suppose we have a stochastic Petri net with k different species. Let x_i be the number of things of the ith species. Then the rate equation looks like this:

\displaystyle{ \frac{d x_i}{d t} = ??? }

It’s really a bunch of equations, one for each 1 \le i \le k. But what is the right-hand side?

The right-hand side is a sum of terms, one for each transition in our Petri net. So, let’s start by assuming our Petri net has just one transition.

Suppose the ith species appears as input to this transition m_i times, and as output n_i times. Then the rate equation is

\displaystyle{ \frac{d x_i}{d t} = r (n_i - m_i) x_1^{m_1} \cdots x_k^{m_k} }

where r is the rate constant for this transition.

That’s really all there is to it! But we can make it look nicer. Let’s make up a vector

x = (x_1, \dots , x_k) \in [0,\infty)^k

that says how many things there are of each species. Similarly let’s make up an input vector

m = (m_1, \dots, m_k) \in \mathbb{N}^k

and an output vector

n = (n_1, \dots, n_k) \in \mathbb{N}^k

for our transition. To be cute, let’s also define

\displaystyle{ x^m = x_1^{m_1} \cdots x_k^{m_k} }

Then we can write the rate equation for a single transition like this:

\displaystyle{ \frac{d x}{d t} = r (n-m) x^m }

Next let’s do a general stochastic Petri net, with lots of transitions. Let’s write T for the set of transitions and r(\tau) for the rate constant of the transition \tau \in T. Let m(\tau) and n(\tau) be the input and output vectors of the transition \tau. Then the rate equation is:

\displaystyle{ \frac{d x}{d t} = \sum_{\tau \in T} r(\tau) (n(\tau) - m(\tau)) x^{m(\tau)} }

For example, consider our rabbits and wolves:

Suppose

• the rate constant for ‘birth’ is \beta,

• the rate constant for ‘predation’ is \gamma,

• the rate constant for ‘death’ is \delta.

Let x_1(t) be the number of rabbits and x_2(t) the number of wolves at time t. Then the rate equation looks like this:

\displaystyle{ \frac{d x_1}{d t} = \beta x_1 - \gamma x_1 x_2 }

\displaystyle{ \frac{d x_2}{d t} = \gamma x_1 x_2 - \delta x_2 }

If you stare at this, and think about it, it should make perfect sense. If it doesn’t, go back and read Part 2.
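
If you’d rather let a computer do the staring, here’s a sketch that integrates this rate equation numerically; the rate constants and initial populations are made-up values:

```python
from scipy.integrate import solve_ivp

beta, gamma, delta = 1.0, 0.2, 0.5   # made-up rate constants

def rate_eq(t, x):
    rabbits, wolves = x
    return [beta * rabbits - gamma * rabbits * wolves,
            gamma * rabbits * wolves - delta * wolves]

sol = solve_ivp(rate_eq, (0.0, 20.0), [10.0, 5.0])
print(sol.y[:, -1])  # rabbit and wolf populations at t = 20
```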

The master equation

Now let’s do something new. In Part 6 I explained how to write down the master equation for a stochastic Petri net with just one species. Now let’s generalize that. Luckily, the ideas are exactly the same.

So, suppose we have a stochastic Petri net with k different species. Let \psi_{n_1, \dots, n_k} be the probability that we have n_1 things of the first species, n_2 of the second species, and so on. The master equation will say how all these probabilities change with time.

To keep the notation clean, let’s introduce a vector

n = (n_1, \dots, n_k) \in \mathbb{N}^k

and let

\psi_n = \psi_{n_1, \dots, n_k}

Then, let’s take all these probabilities and cook up a formal power series that has them as coefficients: as we’ve seen, this is a powerful trick. To do this, we’ll bring in some variables z_1, \dots, z_k and write

\displaystyle{ z^n = z_1^{n_1} \cdots z_k^{n_k} }

as a convenient abbreviation. Then any formal power series in these variables looks like this:

\displaystyle{ \Psi = \sum_{n \in \mathbb{N}^k} \psi_n z^n }

We call \Psi a state if the probabilities sum to 1 as they should:

\displaystyle{ \sum_n \psi_n = 1 }

The simplest example of a state is a monomial:

\displaystyle{  z^n = z_1^{n_1} \cdots z_k^{n_k} }

This is a state where we are 100% sure that there are n_1 things of the first species, n_2 of the second species, and so on. We call such a state a pure state, since physicists use this term to describe a state where we know for sure exactly what’s going on. Sometimes a general state, one that might not be pure, is called mixed.

The master equation says how a state evolves in time. It looks like this:

\displaystyle{ \frac{d}{d t} \Psi(t) = H \Psi(t) }

So, I just need to tell you what H is!

It’s called the Hamiltonian. It’s a linear operator built from special operators that annihilate and create things of various species. Namely, for each species 1 \le i \le k we have an annihilation operator:

\displaystyle{ a_i \Psi = \frac{d}{d z_i} \Psi }

and a creation operator:

\displaystyle{ a_i^\dagger \Psi = z_i \Psi }

How do we build H from these? Suppose we’ve got a stochastic Petri net whose set of transitions is T. As before, write r(\tau) for the rate constant of the transition \tau \in T, and let m(\tau) and n(\tau) be the input and output vectors of this transition. Then:

\displaystyle{ H = \sum_{\tau \in T} r(\tau) \, ({a^\dagger}^{n(\tau)} - {a^\dagger}^{m(\tau)}) \, a^{m(\tau)}  }

where as usual we’ve introduced some shorthand notations to keep from going insane. For example:

a^{m(\tau)} = a_1^{m_1(\tau)} \cdots  a_k^{m_k(\tau)}

and

{a^\dagger}^{m(\tau)} = {a_1^\dagger }^{m_1(\tau)} \cdots  {a_k^\dagger}^{m_k(\tau)}

Now, it’s not surprising that each transition \tau contributes a term to H. It’s also not surprising that this term is proportional to the rate constant r(\tau). The only tricky thing is the expression

\displaystyle{ ({a^\dagger}^{n(\tau)} - {a^\dagger}^{m(\tau)})\, a^{m(\tau)} }

How can we understand it? The basic idea is this. We’ve got two terms here. The first term:

\displaystyle{  {a^\dagger}^{n(\tau)} a^{m(\tau)} }

describes how m_i(\tau) things of the ith species get annihilated, and n_i(\tau) things of the ith species get created. Of course this happens thanks to our transition \tau. The second term:

\displaystyle{  - {a^\dagger}^{m(\tau)} a^{m(\tau)} }

is a bit harder to understand, but it says how the probability that nothing happens—that we remain in the same pure state—decreases as time passes. Again this happens due to our transition \tau.

In fact, the second term must take precisely the form it does to ensure ‘conservation of total probability’. In other words: if the probabilities \psi_n sum to 1 at time zero, we want these probabilities to still sum to 1 at any later time. And for this, we need that second term to be what it is! In Part 6 we saw this in the special case where there’s only one species. The general case works the same way.

Let’s look at an example. Consider our rabbits and wolves yet again:

and again suppose the rate constants for birth, predation and death are \beta, \gamma and \delta, respectively. We have

\displaystyle{ \Psi = \sum_n \psi_n z^n }

where

\displaystyle{ z^n = z_1^{n_1} z_2^{n_2} }

and \psi_n = \psi_{n_1, n_2} is the probability of having n_1 rabbits and n_2 wolves. These probabilities evolve according to the equation

\displaystyle{ \frac{d}{d t} \Psi(t) = H \Psi(t) }

where the Hamiltonian is

H = \beta B + \gamma C + \delta D

and B, C and D are operators describing birth, predation and death, respectively. (B stands for birth, D stands for death… and you can call predation ‘consumption’ if you want something that starts with C. Besides, ‘consumer’ is a nice euphemism for ‘predator’.) What are these operators? Just follow the rules I described:

\displaystyle{ B = {a_1^\dagger}^2 a_1 - a_1^\dagger a_1 }

\displaystyle{ C = {a_2^\dagger}^2 a_1 a_2 - a_1^\dagger a_2^\dagger a_1 a_2 }

\displaystyle{ D = a_2 -  a_2^\dagger a_2 }

In each case, the first term is easy to understand:

• Birth annihilates one rabbit and creates two rabbits.

• Predation annihilates one rabbit and one wolf and creates two wolves.

• Death annihilates one wolf.

The second term is trickier, but I told you how it works.
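
To make these operators concrete, here’s a sketch in Python that represents them as matrices acting on a truncated space, allowing at most N − 1 things of each species (the truncation size is an arbitrary choice):

```python
import numpy as np

N = 15
a = np.diag(np.arange(1.0, N), 1)   # one-species annihilation operator
ad = np.diag(np.ones(N - 1), -1)    # one-species creation operator
I = np.eye(N)

# Two-species operators via tensor (Kronecker) products:
a1, a1d = np.kron(a, I), np.kron(ad, I)   # rabbits
a2, a2d = np.kron(I, a), np.kron(I, ad)   # wolves

B = a1d @ a1d @ a1 - a1d @ a1                    # birth
C = a2d @ a2d @ a1 @ a2 - a1d @ a2d @ a1 @ a2    # predation
D = a2 - a2d @ a2                                # death

# Check: B applied to the pure state z_1 (one rabbit, no wolves)
# gives z_1^2 - z_1.
psi = np.zeros(N * N)
psi[1 * N + 0] = 1.0
out = B @ psi
print(out[2 * N + 0], out[1 * N + 0])  # 1.0 -1.0
```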

Feynman diagrams

How do we solve the master equation? If we don’t worry about mathematical rigor too much, it’s easy. The solution of

\displaystyle{ \frac{d}{d t} \Psi(t) = H \Psi(t) }

should be

\displaystyle{ \Psi(t) = e^{t H} \Psi(0) }

and we can hope that

\displaystyle{ e^{t H} = 1 + t H + \frac{(t H)^2}{2!} + \cdots }

so that

\displaystyle{ \Psi(t) = \Psi(0) + t H \Psi(0) + \frac{t^2}{2!} H^2 \Psi(0) + \cdots }

Of course there’s always the question of whether this power series converges. In many contexts it doesn’t, but that’s not necessarily a disaster: the series can still be asymptotic to the right answer, or even better, Borel summable to the right answer.
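
On a truncated state space we can sidestep these worries and just exponentiate the Hamiltonian matrix numerically. Here’s a sketch with made-up rate constants; it also checks that e^{t H} conserves total probability, up to a little leakage at the truncation boundary:

```python
import numpy as np
from scipy.linalg import expm

N = 12                                # at most N - 1 rabbits and N - 1 wolves
a = np.diag(np.arange(1.0, N), 1)
ad = np.diag(np.ones(N - 1), -1)
I = np.eye(N)
a1, a1d = np.kron(a, I), np.kron(ad, I)
a2, a2d = np.kron(I, a), np.kron(I, ad)

beta, gamma, delta = 1.0, 0.2, 0.5    # made-up rate constants
B = a1d @ a1d @ a1 - a1d @ a1
C = a2d @ a2d @ a1 @ a2 - a1d @ a2d @ a1 @ a2
D = a2 - a2d @ a2
H = beta * B + gamma * C + delta * D

psi0 = np.zeros(N * N)
psi0[1 * N + 1] = 1.0                 # 100% sure: one rabbit, one wolf

psi_t = expm(0.5 * H) @ psi0          # evolve to t = 0.5
print(psi_t.sum())                    # ~1.0: probabilities still sum to 1
```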

But let’s not worry about these subtleties yet! Let’s just imagine our rabbits and wolves, with Hamiltonian

H = \beta B + \gamma C + \delta D

Now, imagine working out

\Psi(t) = \displaystyle{ \Psi(0) + t H \Psi(0) + \frac{t^2}{2!} H^2 \Psi(0) + \frac{t^3}{3!} H^3 \Psi(0) + \cdots }

We’ll get lots of terms involving products of B, C and D hitting our original state \Psi(0). And we can draw these as diagrams! For example, suppose we start with one rabbit and one wolf. Then

\Psi(0) = z_1 z_2

And suppose we want to compute

\displaystyle{ H^3 \Psi(0) = (\beta B + \gamma C + \delta D)^3 \Psi(0) }

as part of the task of computing \Psi(t). Then we’ll get lots of terms: 27, in fact, though many will turn out to be zero. Let’s take one of these terms, for example the one proportional to:

D C B \Psi(0)

We can draw this as a sum of Feynman diagrams, including this:

In this diagram, we start with one rabbit and one wolf at top. As we read the diagram from top to bottom, first a rabbit is born (B), then predation occurs (C), and finally a wolf dies (D). The end result is again a rabbit and a wolf.

This is just one of four Feynman diagrams we should draw in our sum for D C B \Psi(0), since either of the two rabbits could have been eaten, and either wolf could have died. So, the end result of computing

\displaystyle{ H^3 \Psi(0) }

will involve a lot of Feynman diagrams… and of course computing

\displaystyle{ \Psi(t) = \Psi(0) + t H \Psi(0) + \frac{t^2}{2!} H^2 \Psi(0) + \frac{t^3}{3!} H^3 \Psi(0) + \cdots }

will involve even more, even if we get tired and give up after the first few terms. So, this Feynman diagram business may seem quite tedious… and it may not be obvious how it helps.

But it does, sometimes!

Now is not the time for me to describe ‘practical’ benefits of Feynman diagrams. Instead, I’ll just point out one conceptual benefit. We started with what seemed like a purely computational chore, namely computing

\displaystyle{ \Psi(t) = \Psi(0) + t H \Psi(0) + \frac{t^2}{2!} H^2 \Psi(0) + \cdots }

But then we saw—at least roughly—how this series has a clear meaning! It can be written as a sum over diagrams, each of which represents a possible history of rabbits and wolves. So, it’s what physicists call a ‘sum over histories’.

Feynman invented the idea of a sum over histories in the context of quantum field theory. At the time this idea seemed quite mind-blowing, for various reasons. First, it involved elementary particles instead of everyday things like rabbits and wolves. Second, it involved complex ‘amplitudes’ instead of real probabilities. Third, it actually involved integrals instead of sums. And fourth, a lot of these integrals diverged, giving infinite answers that needed to be ‘cured’ somehow!

Now we’re seeing a sum over histories in a more down-to-earth context without all these complications. A lot of the underlying math is analogous… but now there’s nothing mind-blowing about it: it’s quite easy to understand. So, we can use this analogy to demystify quantum field theory a bit. On the other hand, thanks to this analogy, all sorts of clever ideas invented by quantum field theorists will turn out to have applications to biology and chemistry! So it’s a double win.


Bayesian Computations of Expected Utility

19 August, 2011

GiveWell is an organization that rates charities. They’ve met people who argue that

charities working on reducing the risk of sudden human extinction must be the best ones to support, since the value of saving the human race is so high that “any imaginable probability of success” would lead to a higher expected value for these charities than for others.

For example, say I have a dollar to spend on charity. One charity says that with this dollar they can save the life of one child in Somalia. Another says that with this dollar they can increase by .000001% our chance of saving 1 billion people from the effects of a massive asteroid colliding with the Earth.

Naively, in terms of the expected number of lives saved, the latter course of action seems 10 times better, since

.000001% × 1 billion = 10

But is it really better?

It’s a subtle question, with all sorts of complicating factors, like why should I trust these guys?

I’m not ready to present a thorough analysis of this sort of question today. But I would like to hear what you think about it. And I’d like you to read what the founder of Givewell has to say about it:

• Holden Karnofsky, Why we can’t take expected value estimates literally (even when they’re unbiased), 18 August 2011.

He argues against what he calls an ‘explicit expected value’ or ‘EEV’ approach:

The mistake (we believe) is estimating the “expected value” of a donation (or other action) based solely on a fully explicit, quantified formula, many of whose inputs are guesses or very rough estimates. We believe that any estimate along these lines needs to be adjusted using a “Bayesian prior”; that this adjustment can rarely be made (reasonably) using an explicit, formal calculation; and that most attempts to do the latter, even when they seem to be making very conservative downward adjustments to the expected value of an opportunity, are not making nearly large enough downward adjustments to be consistent with the proper Bayesian approach.

His focus, in short, is on the fact that anyone saying “this money can increase by .000001% our chance of saving 1 billion people from an asteroid impact” is likely to be pulling those numbers from thin air. If they can’t really back up their numbers with a lot of hard evidence, then our lack of confidence in their estimate should be taken into account somehow.

His article spends a lot of time analyzing a less complex but still very interesting example:

It seems fairly clear that a restaurant with 200 Yelp reviews, averaging 4.75 stars, ought to outrank a restaurant with 3 Yelp reviews, averaging 5 stars. Yet this ranking can’t be justified in an explicit expected utility framework, in which options are ranked by their estimated average/expected value.

This is the only question I really want to talk about today. Actually I’ll focus on a similar question that Tim van Beek posed on this blog:

You have two kinds of fertilizer, A and B. You know that of 4 trees that got A, three thrived and one died. Of 36 trees that got B, 24 thrived and 12 died. Which fertilizer would you buy?

So, 3/4 of the trees getting fertilizer A thrived, while only 2/3 of those getting fertilizer B thrived. That makes fertilizer A seem better. However, the sample size is considerably larger for fertilizer B, so we may feel more confident about the results in this case. Which should we choose?

Nathan Urban tackled the problem in an interesting way. Let me sketch what he did before showing you his detailed work.

Suppose that before doing any experiments at all, we assume the probability \pi that a fertilizer will make a tree thrive is a number uniformly distributed between 0 and 1. This assumption is our “Bayesian prior”.

Note: I’m not saying this prior is “correct”. You are allowed to choose a different prior! Choosing a different prior will change your answer to this puzzle. That can’t be helped. We need to make some assumption to answer this kind of puzzle; we are simply making it explicit here.

Starting from this prior, Nathan works out the probability that \pi has some value given that when we apply the fertilizer to 4 trees, 3 thrive. That’s the black curve below. He also works out the probability that \pi has some value given that when we apply the fertilizer to 36 trees, 24 thrive. That’s the red curve:

The red curve, corresponding to the experiment with 36 trees, is much more sharply peaked. That makes sense. It means that when we do more experiments, we become more confident that we know what’s going on.

We still have to choose a criterion to decide which fertilizer is best! This is where ‘decision theory’ comes in. For example, suppose we want to maximize the expected number of the trees that thrive. Then Nathan shows that fertilizer A is slightly better, despite the smaller sample size.

However, he also shows that if fertilizer A succeeded 4 out of 5 times, while fertilizer B succeeded 7 out of 9 times, the same evaluation procedure would declare fertilizer B better! Its percentage success rate is less: about 78% instead of 80%. However, the sample size is larger. And in this particular case, given our particular Bayesian prior and given what we are trying to maximize, that’s enough to make fertilizer B win.

So if someone is trying to get you to contribute to a charity, there are many interesting issues involved in deciding if their arguments are valid or just a bunch of… fertilizer.

Here is Nathan’s detailed calculation:

It’s fun to work out an official ‘correct’ answer mathematically, as John suggested. Of course, this ends up being a long way of confirming the obvious—and the answer is only as good as the assumptions—but I think it’s interesting anyway. In this case, I’ll work it out by maximizing expected utility in Bayesian decision theory, for one choice of utility function. This dodges the whole risk aversion point, but it opens discussion for how the assumptions might be modified to account for more real-world considerations. Hopefully others can spot whether I’ve made mistakes in the derivations.

In Bayesian decision theory, the first thing you do is write down the data-generating process and then compute a posterior distribution for what is unknown.

In this case, we may assume the data-generating process (likelihood function) is a binomial distribution \mathrm{Bin}(s,n|\pi) for s successes in n trials, given a probability of success \pi. Fertilizer A corresponds to s=3, n=4 and fertilizer B corresponds to s=24, n=36.

The probability of success \pi is unknown, and we want to infer its posterior conditional on the data, p(\pi|s,n). To compute a posterior we need to assume a prior on \pi.

It turns out that the Beta distribution is conjugate to a binomial likelihood, meaning that if we assume a Beta distributed prior, then the posterior is also Beta distributed. If the prior is \pi \sim \mathrm{Beta}(\alpha_0,\beta_0) then the posterior is

\pi \sim \mathrm{Beta}(\alpha=\alpha_0+s,\beta=\beta_0+n-s).

One choice for a prior is a uniform prior on [0,1], which corresponds to a \mathrm{Beta}(1,1) distribution. There are of course other prior choices which will lead to different conclusions. For this prior, the posterior is \mathrm{Beta}(\pi; s+1, n-s+1). The posterior mode is

(\alpha-1)/(\alpha+\beta-2) = s/n

and the posterior mean is

\alpha/(\alpha+\beta) = (s+1)/(n+2).

So, what is the inference for fertilizers A and B? I made a graph of the posterior distributions. You can see that the inference for fertilizer B is sharper, as expected, since there is more data. But the inference for fertilizer A tends towards higher success rates, which can be quantified.

Fertilizer A has a posterior mode of 3/4 = 0.75 and B has a mode of 2/3 = 0.667, corresponding to the sample proportions. The mode isn’t the only measure of central tendency we could use. The means are 0.667 for A and 0.658 for B; the medians are 0.686 for A and 0.661 for B. No matter which of the three statistics we choose, fertilizer A looks better than fertilizer B.
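
These numbers are easy to reproduce. Here’s a sketch using SciPy’s Beta distribution:

```python
from scipy.stats import beta

# Posterior Beta(s+1, n-s+1) under a uniform prior, as in the text.
for name, s, n in [("A", 3, 4), ("B", 24, 36)]:
    post = beta(s + 1, n - s + 1)
    print(name, "mode:", s / n, "mean:", post.mean(), "median:", post.median())
# A  mode: 0.75   mean: 0.667  median: 0.686
# B  mode: 0.667  mean: 0.658  median: 0.661
```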

But we haven’t really done “decision theory” yet. We’ve just compared point estimators. Actually, we have done a little decision theory, implicitly. It turns out that picking the mean corresponds to the estimator which minimizes the expected squared error in \pi, where “squared error” can be thought of formally as a loss function in decision theory. Picking the median corresponds to minimizing the expected absolute loss, and picking the mode corresponds to minimizing the 0-1 loss (where you lose nothing if you guess \pi exactly and lose 1 otherwise).

Still, these don’t really correspond to a decision theoretic view of the problem. We don’t care about the quantity \pi at all, let alone some point estimator of it. We only care about \pi indirectly, insofar as it helps us predict something about what the fertilizer will do to new trees. For that, we have to move from the posterior distribution p(\pi|s,n) to the predictive distribution

p(y|s,n) = \int p(y|\pi,n)\,p(\pi|s,n)\,d\pi ,

where y is a random variable indicating whether a new tree will thrive under treatment. Here I assume that the success of new trees follows the same binomial distribution as in the experimental group.

For a Beta posterior, the predictive distribution is beta-binomial, and the expected number of successes for a new tree is equal to the mean of the Beta distribution for \pi – i.e. the posterior mean we computed before, (s+1)/(n+2). If we introduce a utility function such that we are rewarded 1 util for a thriving tree and 0 utils for a non-thriving tree, then the expected utility is equal to the expected success rate. Therefore, under these assumptions, we should choose the fertilizer that maximizes the quantity (s+1)/(n+2), which, as we’ve seen, favors fertilizer A (0.667) over fertilizer B (0.658).

An interesting mathematical question is, does this ever work out to a “non-obvious” conclusion? That is, if fertilizer A has a sample success rate which is greater than fertilizer B’s sample success rate, but expected utility maximization prefers fertilizer B? Mathematically, we’re looking for a set {s,s',n,n'} such that s/n>s'/n' but (s+1)/(n+2) < (s'+1)/(n'+2). (Also there are obvious constraints on s and s'.) The answer is yes. For example, if fertilizer A has 4 of 5 successes while fertilizer B has 7 of 9 successes.
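
A brute-force search turns up this example, along with others; here’s a sketch (the search bounds are arbitrary):

```python
# Look for s/n > s'/n' but (s+1)/(n+2) < (s'+1)/(n'+2).
hits = []
for n in range(1, 10):
    for s in range(n + 1):
        for n2 in range(1, 10):
            for s2 in range(n2 + 1):
                if s / n > s2 / n2 and (s + 1) / (n + 2) < (s2 + 1) / (n2 + 2):
                    hits.append((s, n, s2, n2))

print((4, 5, 7, 9) in hits)  # True: 4/5 > 7/9, but 5/7 < 8/11
```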

By the way, on a quite different note: NASA currently rates the chances of the asteroid Apophis colliding with the Earth in 2036 at 4.3 × 10⁻⁶. It estimates that the energy of such a collision would be comparable with a 510-megatonne thermonuclear bomb. This is ten times larger than the largest bomb actually exploded, the Tsar Bomba. The Tsar Bomba, in turn, was ten times larger than all the explosives used in World War II.

There’s an interesting Chinese plan to deflect Apophis if that should prove necessary. It is, however, quite a sketchy plan. I expect people will make more detailed plans shortly before Apophis comes close to the Earth in 2029.


A Characterization of Entropy

2 June, 2011

Over at the n-Category Café some of us have been trying an experiment: writing a math paper in full public view, both on that blog and on its associated wiki: the nLab. One great thing about doing things this way is that people can easily chip in with helpful suggestions. It’s also more fun! Both of these tend to speed up the process.

Like Frankenstein’s monster, our paper’s main result was initially jolted into life by huge blasts of power: in this case, not lightning but category theory. It was awesome to behold, but too scary for this blog.

First Tom Leinster realized that the concept of entropy fell out — unexpectedly, but very naturally — from considerations involving ‘operads’, which are collections of abstract operations. He was looking at a particular operad where the operations are ‘convex linear combinations’, and he discovered that this operad has entropy lurking in its heart. Then Tobias Fritz figured out a nice way to state Tom’s result without mentioning operads. By now we’ve taught the monster table manners, found it shoes that fit, and it’s ready for polite society:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.

The idea goes like this. Say you’ve got a finite set X with a probability measure p on it, meaning a number 0 \le p_i \le 1 for each point i \in X, obeying

\sum_{i \in X} p_i = 1

Then the Shannon entropy of p is defined by

S(p) = - \sum_{i \in X} p_i \, \ln(p_i)

This funny-looking formula can be justified in many ways. Our new way involves focusing not on entropy itself, but on changes in entropy. This makes sense for lots of reasons. For example, in physics we don’t usually measure entropy directly. Instead, we measure changes in entropy, using the fact that a system at temperature T absorbing a tiny amount of heat \Delta Q in a reversible way will experience an entropy change of \Delta Q / T. But our real reason for focusing on changes in entropy is that it gives a really slick theorem.

Suppose we have two finite sets with probability measures, say (X,p) and (Y,q). Then we define a morphism f: (X,p) \to (Y,q) to be a measure-preserving function: in other words, one for which the probability q_j of any point in Y is the sum of the probabilities p_i of the points in X with f(i) = j.

A morphism of this sort is a deterministic process that carries one random situation to another. For example, if I have a random integer between -10 and 10, chosen according to some probability distribution, and I square it, I get a random integer between 0 and 100. A process of this sort can never increase the entropy: given any morphism f: (X,p) \to (Y,q), we have

S(q) \le S(p)

Since the second law of thermodynamics says that entropy always increases, this may seem counterintuitive or even paradoxical! But there’s no paradox here. It makes more intuitive sense if you think of entropy as information, and the function f as some kind of data processing that doesn’t introduce any additional randomness. Such a process can only decrease the amount of information. For example, squaring the number -5 gives the same answer as squaring 5, so if I tell you “this number squared is 25”, I’m giving you less information than if I said “this number is -5”.
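
Here’s a small Python sketch of this example: we push a distribution forward along the squaring map and watch the entropy drop (the distribution itself is a made-up example):

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                 # convention: 0 ln 0 = 0
    return -np.sum(p * np.log(p))

xs = [-2, -1, 0, 1, 2]
p = [0.1, 0.2, 0.4, 0.2, 0.1]    # a made-up distribution on X

q = {}                           # pushforward along f(x) = x^2
for x, px in zip(xs, p):
    q[x * x] = q.get(x * x, 0.0) + px

print(shannon_entropy(p))                  # 1.47...
print(shannon_entropy(list(q.values())))   # 1.05..., never larger
```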

For this reason, we call the difference S(p) - S(q) the information loss of the morphism f : (X,p) \to (Y,q). And here’s our characterization of Shannon entropy in terms of information loss:

First, let’s write a morphism f: (X,p) \to (Y,q) as f : p \to q for short. Suppose F is a function that assigns to any such morphism a number F(f) \in [0,\infty), which we think of as its information loss. And suppose that F obeys three axioms:

1. Functoriality. Whenever we can compose morphisms f and g, we demand that

F(f \circ g) = F(f) + F(g)

In other words: when we do a process consisting of two stages, the amount of information lost in the whole process is the sum of the amounts lost in each stage!

2. Convex linearity. Suppose we have two finite sets equipped with probability measures, say p and q, and a real number \lambda \in [0, 1]. Then there is a probability measure \lambda p \oplus (1 - \lambda) q on the disjoint union of the two sets, obtained by weighting the two measures by \lambda and 1 - \lambda, respectively. Similarly, given morphisms f: p \to p' and g: q \to q' there is an obvious morphism from \lambda p \oplus (1 - \lambda) q to \lambda p' \oplus (1 - \lambda) q'. Let’s call this morphism \lambda f \oplus (1 - \lambda) g. We demand that

F(\lambda f \oplus (1 - \lambda) g) = \lambda F(f) + (1 - \lambda) F(g)

In other words: if we flip a probability-λ coin to decide whether to do one process or another, the information lost is λ times the information lost by the first process plus (1 – λ) times the information lost by the second!

3. Continuity. The same function between finite sets can be thought of as a measure-preserving map in different ways, by changing the measures on these sets. In this situation the quantity F(f) should depend continuously on the measures in question.

In other words: if we slightly change the measures involved in a process, the information it loses changes only slightly.

Then we conclude that there exists a constant c \ge 0 such that for any morphism f: (X,p) \to (Y,q), we have

F(f) = c(S(p) - S(q))

In other words: the information loss is some multiple of the change in Shannon entropy!

What’s pleasing about this theorem is that the three axioms are pretty natural, and it’s hard to see the formula

S(p) = - \sum_{i \in X} p_i \, \ln(p_i)

hiding in them… but it’s actually there.

(We also prove a version of this theorem for Tsallis entropy, in case you care. This obeys a mutant version of axiom 2, namely:

F(\lambda f \oplus (1 - \lambda) g) = \lambda^\alpha F(f) + (1 - \lambda)^\alpha F(g)

where \alpha is a parameter with 0 < \alpha< \infty. Tsallis entropy is a close relative of Rényi entropy, which I discussed here earlier. Just as Rényi entropy is a kind of q-derivative of the free energy, the Tsallis entropy is a q-derivative of the partition function. I’m not sure either of them are really important, but when you’re trying to uniquely characterize Shannon entropy, it’s nice for it to have some competitors to fight against, and these are certainly the main two. Both of them depend on a parameter and reduce to the Shannon entropy at a certain value of that parameter.)


Network Theory (Part 4)

6 April, 2011

Last time I explained the rate equation of a stochastic Petri net. But now let’s get serious: let’s see what’s really stochastic— that is, random—about a stochastic Petri net. For this we need to forget the rate equation (temporarily) and learn about the ‘master equation’. This is where ideas from quantum field theory start showing up!

A Petri net has a bunch of states and a bunch of transitions. Here’s an example we’ve already seen, from chemistry:

The states are in yellow, the transitions in blue. A labelling of our Petri net is a way of putting some number of things in each state. We can draw these things as little black dots:

In this example there are only 0 or 1 things in each state: we’ve got one atom of carbon, one molecule of oxygen, one molecule of sodium hydroxide, one molecule of hydrochloric acid, and nothing else. But in general, we can have any natural number of things in each state.

In a stochastic Petri net, the transitions occur randomly as time passes. For example, as time passes we could see a sequence of transitions like this:

Each time a transition occurs, the number of things in each state changes in an obvious way.

The Master Equation

Now, I said the transitions occur ‘randomly’, but that doesn’t mean there’s no rhyme or reason to them! The miracle of probability theory is that it lets us state precise laws about random events. The law governing the random behavior of a stochastic Petri net is called the ‘master equation’.

In a stochastic Petri net, each transition has a rate constant, a nonnegative real number. Roughly speaking, this determines the probability of that transition.

A bit more precisely: suppose we have a Petri net that is labelled in some way at some moment. Then the probability that a given transition occurs in a short time \Delta t is approximately:

• the rate constant for that transition, times

• the time \Delta t, times

• the number of ways the transition can occur.

More precisely still: this formula is correct up to terms of order (\Delta t)^2. So, taking the limit as \Delta t \to 0, we get a differential equation describing precisely how the probability of the Petri net having a given labelling changes with time! And this is the master equation.

Now, you might be impatient to actually see the master equation, but that would be rash. The true master doesn’t need to see the master equation. It sounds like a Zen proverb, but it’s true. The raw beginner in mathematics wants to see the solutions of an equation. The more advanced student is content to prove that the solution exists. But the master is content to prove that the equation exists.

A bit more seriously: what matters is understanding the rules that inevitably lead to some equation: actually writing it down is then straightforward.

And you see, there’s something I haven’t explained yet: “the number of ways the transition can occur”. This involves a bit of counting. Consider, for example, this Petri net:

Suppose there are 10 rabbits and 5 wolves.

• How many ways can the birth transition occur? Since birth takes one rabbit as input, it can occur in 10 ways.

• How many ways can predation occur? Since predation takes one rabbit and one wolf as inputs, it can occur in 10 × 5 = 50 ways.

• How many ways can death occur? Since death takes one wolf as input, it can occur in 5 ways.

Or consider this one:

Suppose there are 10 hydrogen atoms and 5 oxygen atoms. How many ways can they form a water molecule? There are 10 ways to pick the first hydrogen, 9 ways to pick the second hydrogen, and 5 ways to pick the oxygen. So, there are

10 \times 9 \times 5 = 450

ways.

Note that we’re treating the hydrogen atoms as distinguishable, so there are 10 \times 9 ways to pick them, not \frac{10 \times 9}{2} = \binom{10}{2}. In general, the number of ways to choose M distinguishable things from a collection of L is the falling power

L^{\underline{M}} = L \cdot (L - 1) \cdots (L - M + 1)

where there are M factors in the product, but each is 1 less than the preceding one—hence the term ‘falling’.
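
In code, this counting rule is a one-liner; here’s a minimal sketch using the hydrogen-and-oxygen numbers from above:

```python
# Falling power: ways to choose M distinguishable things from L, in order.
def falling_power(L, M):
    result = 1
    for i in range(M):
        result *= L - i
    return result

# 10 hydrogens taken 2 at a time, times 5 oxygens taken 1 at a time:
print(falling_power(10, 2) * falling_power(5, 1))  # 450
```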

Okay, now I’ve given you all the raw ingredients to work out the master equation for any stochastic Petri net. The previous paragraph was a big fat hint. One more nudge and you’re on your own:

Puzzle. Suppose we have a stochastic Petri net with k states and just one transition, whose rate constant is r. Suppose the ith state appears m_i times as the input of this transition and n_i times as the output. A labelling of this stochastic Petri net is a k-tuple of natural numbers \ell = (\ell_1, \dots, \ell_k) saying how many things are in each state. Let \psi_\ell(t) be the probability that the labelling is \ell at time t. Then the master equation looks like this:

\frac{d}{d t} \psi_{\ell}(t)  = \sum_{\ell'} H_{\ell \ell'} \psi_{\ell'}(t)

for some matrix of real numbers H_{\ell \ell'}. What is this matrix?

You can write down a formula for this matrix using what I’ve told you. And then, if you have a stochastic Petri net with more transitions, you can just compute the matrix for each transition using this formula, and add them all up.

Someday I’ll tell you the answer to this puzzle, but I want to get there by a strange route: I want to guess the master equation using ideas from quantum field theory!

Some clues

Why?

Well, if we think about a stochastic Petri net whose labelling undergoes random transitions as I’ve described, you’ll see that any possible ‘history’ for the labelling can be drawn in a way that looks like a Feynman diagram. In quantum field theory, Feynman diagrams show how things interact and turn into other things. But that’s what stochastic Petri nets do, too!

For example, if our Petri net looks like this:

then a typical history can be drawn like this:

Some rabbits and wolves come in on top. They undergo some transitions as time passes, and go out on the bottom. The vertical coordinate is time, while the horizontal coordinate doesn’t really mean anything: it just makes the diagram easier to draw.

If we ignore all the artistry that makes it cute, this Feynman diagram is just a graph with states as edges and transitions as vertices. Each transition occurs at a specific time.

We can use these Feynman diagrams to compute the probability that if we start it off with some labelling at time t_1, our stochastic Petri net will wind up with some other labelling at time t_2. To do this, we just take a sum over Feynman diagrams that start and end with the given labellings. For each Feynman diagram, we integrate over all possible times at which the transitions occur. And what do we integrate? Just the product of the rate constants for those transitions!

That was a bit of a mouthful, and it doesn’t really matter if you followed it in detail. What matters is that it sounds a lot like stuff you learn when you study quantum field theory!

That’s one clue that something cool is going on here. Another is the master equation itself:

\frac{d}{dt} \psi_{\ell}(t) = \sum_{\ell'} H_{\ell \ell'} \psi_{\ell'}(t)

This looks a lot like Schrödinger’s equation, the basic equation describing how a quantum system changes with the passage of time.

We can make it look even more like Schrödinger’s equation if we create a vector space with the labellings \ell as a basis. The numbers \psi_\ell(t) will be the components of some vector \psi(t) in this vector space. The numbers H_{\ell \ell'} will be the matrix entries of some operator H on that vector space. And the master equation becomes:

\frac{d}{d t} \psi(t) = H \psi(t)

Compare Schrödinger’s equation:

i \frac{d}{d t} \psi(t) = H \psi(t)

The only visible difference is that factor of i!

But of course this is linked to another big difference: in the master equation \psi describes probabilities, so it’s a vector in a real vector space. In quantum theory \psi describes probability amplitudes, so it’s a vector in a complex Hilbert space.

Apart from this huge difference, everything is a lot like quantum field theory. In particular, our vector space is a lot like the Fock space one sees in quantum field theory. Suppose we have a quantum particle that can be in k different states. Then its Fock space is the Hilbert space we use to describe an arbitrary collection of such particles. It has an orthonormal basis denoted

| \ell_1 \cdots \ell_k \rangle

where \ell_1, \dots, \ell_k are natural numbers saying how many particles there are in each state. So, any vector in Fock space looks like this:

\psi = \sum_{\ell_1, \dots, \ell_k} \; \psi_{\ell_1 , \dots, \ell_k} \,  | \ell_1 \cdots \ell_k \rangle

But if we write the whole list \ell_1, \dots, \ell_k simply as \ell, this becomes

\psi = \sum_{\ell} \psi_{\ell}   | \ell \rangle

This is almost like what we’ve been doing with Petri nets!—except I hadn’t gotten around to giving names to the basis vectors.

In quantum field theory class, I learned lots of interesting operators on Fock space: annihilation and creation operators, number operators, and so on. So, when I bumped into this master equation

\frac{d}{d t} \psi(t) = H \psi(t)

it seemed natural to take the operator H and write it in terms of these. There was an obvious first guess, which didn’t quite work… but thinking a bit harder eventually led to the right answer. Later, it turned out people had already thought about similar things. So, I want to explain this.

When I first started working on this stuff, I was focused on the difference between collections of indistinguishable things, like bosons or fermions, and collections of distinguishable things, like rabbits or wolves.

But with the benefit of hindsight, it’s even more important to think about the difference between ordinary quantum theory, which is all about probability amplitudes, and the game we’re playing now, which is all about probabilities. So, next time I’ll explain how we need to modify quantum theory so that it’s about probabilities. This will make it easier to guess a nice formula for H.


This Week’s Finds (Week 309)

17 February, 2011

In the next issues of This Week’s Finds, I’ll return to interviewing people who are trying to help humanity deal with some of the risks we face.

First I’ll talk to the science fiction author and astrophysicist Gregory Benford. I’ll ask him about his ideas on “geoengineering” — proposed ways of deliberately manipulating the Earth’s climate to counteract the effects of global warming.

After that, I’ll spend a few weeks asking Eliezer Yudkowsky about his ideas on rationality and “friendly artificial intelligence”. Yudkowsky believes that the possibility of dramatic increases in intelligence, perhaps leading to a technological singularity, should command more of our attention than it does.

Needless to say, all these ideas are controversial. They’re exciting to some people — and infuriating, terrifying or laughable to others. But I want to study lots of scenarios and lots of options in a calm, level-headed way without rushing to judgement. I hope you enjoy it.

This week, I want to say a bit more about the Hopf bifurcation!

Last week I talked about applications of this mathematical concept to climate cycles like the El Niño – Southern Oscillation. But over on the Azimuth Project, Graham Jones has explained an application of the same math to a very different subject:

Quantitative ecology, Azimuth Project.

That’s one thing that’s cool about math: the same patterns show up in different places. So, I’d like to take advantage of his hard work and show you how a Hopf bifurcation shows up in a simple model of predator-prey interactions.

Suppose we have some rabbits that reproduce endlessly, with their numbers growing at a rate proportional to their population. Let x(t) be the number of animals at time t. Then we have:

\frac{d x}{d t} = r x

where r is the growth rate. This gives exponential growth: it has solutions like

x(t) = x_0 e^{r t}

To get a slightly more realistic model, we can add ‘limits to growth’. Instead of a constant growth rate, let’s try a growth rate that decreases as the population increases. Let’s say it decreases in a linear way, and drops to zero when the population hits some value K. Then we have

\frac{d x}{d t} = r (1-x/K) x

This is called the “logistic equation”, and K is known as the “carrying capacity”. The idea is that the environment only has enough resources to support a population of size K. If the population is smaller than that, it’ll grow; if it’s larger, it’ll shrink.

If you know some calculus you can solve the logistic equation by hand by separating the variables and integrating both sides; it’s a textbook exercise. The solutions are called “logistic functions”, and they look sort of like this:

[Figure: graph of a logistic function, rising steeply at first and then leveling off.]

The above graph shows the simplest solution:

x = \frac{e^t}{e^t + 1}

of the simplest logistic equation in the world:

\frac{ d x}{d t} = (1 - x)x

Here the carrying capacity is 1. Populations less than 1 sound a bit silly, so think of it as 1 million rabbits. You can see how the solution starts out growing almost exponentially and then levels off. There’s a very different-looking solution where the population starts off above the carrying capacity and decreases. There’s also a silly solution involving negative populations. But whenever the population starts out positive, it approaches the carrying capacity.

The solution where the population just stays at the carrying capacity:

x = 1

is called a “stable equilibrium”, because it’s constant in time and nearby solutions approach it.
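If you’d like to experiment, here’s a minimal Python sketch that plots logistic solutions for a few initial populations. It uses the closed-form solution you get from the separation-of-variables computation mentioned above; the parameter values are just illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

r, K = 1.0, 1.0               # growth rate and carrying capacity (illustrative)
t = np.linspace(0, 10, 400)

for x0 in [0.05, 0.5, 1.5]:   # initial populations below and above K
    # Separating variables in dx/dt = r(1 - x/K)x and integrating gives:
    x = K / (1 + (K / x0 - 1) * np.exp(-r * t))
    plt.plot(t, x, label=f'x(0) = {x0}')

plt.axhline(K, linestyle=':')  # the carrying capacity
plt.xlabel('t'); plt.ylabel('x(t)'); plt.legend()
plt.show()
```

Whatever positive population you start with, the curve flattens out at the carrying capacity, just as described above.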

But now let’s introduce another species: some wolves, which eat the rabbits! So, let x be the number of rabbits, and y the number of wolves. Before the rabbits meet the wolves, let’s assume they obey the logistic equation:

\frac{ d x}{d t} = x(1-x/K)

And before the wolves meet the rabbits, let’s assume they obey this equation:

\frac{ d y}{d t} = -y

so that their numbers would decay exponentially to zero if there were nothing to eat.

So far, not very interesting. But now let’s include a term that describes how predators eat prey. Let’s say that on top of the above effect, the predators grow in numbers, and the prey decrease, at a rate proportional to:

x y/(1+x).

For small numbers of prey and predators, this means that predation increases nearly linearly with both x and y. But if you have one wolf surrounded by a million rabbits in a small area, the rate at which it eats rabbits won’t double if you double the number of rabbits! So, this formula includes a limit on predation as the number of prey increases.

Okay, so let’s try these equations:

\frac{ d x}{d t} = x(1-x/K) - 4x y/(x+1)

and

\frac{ d y}{d t} = -y + 2x y/(x+1)

The constants 4 and 2 here have been chosen for simplicity rather than realism.

Before we plunge ahead and get a computer to solve these equations, let’s see what we can do by hand. Setting d x/d t = 0 gives the interesting parabola

y = \frac{1}{4}(1-x/K)(x+1)

together with the boring line x = 0. (If you start with no prey, that’s how it will stay. It takes bunnies to make bunnies.)

Setting d y/d t = 0 gives the interesting line

x=1

together with the boring line y = 0.

The interesting parabola and the interesting line separate the x y plane into four parts, so these curves are called separatrices. (In the dynamical systems literature they’re often called nullclines, since d x/d t or d y/d t vanishes along them.) They meet at the point

x = 1, \quad y = \frac{1}{2} (1 - 1/K)

which of course is an equilibrium, since d x / d t = d y / d t = 0 there. But when K < 1 this equilibrium occurs at a negative value of y, and negative populations make no sense.

So, if K < 1 there is no equilibrium with both species present, and with a bit more work one can see the problem: the wolves die out. For larger values of K there is such an equilibrium. But the nature of this equilibrium depends on K: that’s the interesting part.

We could figure this out analytically, but let’s look at two of Graham’s plots. Here’s a solution when K = 2.5:

[Figure: Graham’s plot for K = 2.5.]

The gray curves are the separatrices. The red curve shows a solution of the equations, with the numbers showing the passage of time. So, you can see that the solution spirals in towards the equilibrium. That’s what you expect of a stable equilibrium.
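If you’d like to reproduce a picture along these lines, here’s a rough Python sketch. The integrator, time span and initial conditions are my own choices, so the result won’t match Graham’s plot exactly.

```python
import numpy as np
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

K = 2.5   # try K = 3.5 to see the limit cycle instead

def vector_field(t, state):
    x, y = state
    dx = x * (1 - x / K) - 4 * x * y / (x + 1)
    dy = -y + 2 * x * y / (x + 1)
    return [dx, dy]

sol = solve_ivp(vector_field, [0, 80], [1.0, 1.0], dense_output=True)
t = np.linspace(0, 80, 4000)
x, y = sol.sol(t)

plt.plot(x, y)   # the trajectory in the x-y plane
plt.xlabel('rabbits x'); plt.ylabel('wolves y')
plt.show()
```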

Here’s a picture when K = 3.5:

[Figure: Graham’s plot for K = 3.5.]

The red and blue curves are two solutions, again numbered to show how time passes. The red curve spirals in towards the dotted gray curve. The blue one spirals out towards it. The gray curve is also a solution. It’s called a “stable limit cycle” because it’s periodic, and nearby solutions move closer and closer to it.

With a bit more work, we could show analytically that whenever 1 < K < 3 there is a stable equilibrium. As we increase K, when K passes 3 this stable equilibrium suddenly becomes a tiny stable limit cycle. This is a Hopf bifurcation!
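If you’d like to check that claim, here’s a sketch of the computation in Python with sympy. The idea: compute the Jacobian of the vector field at the interior equilibrium we found above, and watch its trace change sign at K = 3 while its determinant stays positive. A complex-conjugate pair of eigenvalues crossing the imaginary axis is exactly the signature of a Hopf bifurcation.

```python
import sympy as sp

x, y, K = sp.symbols('x y K', positive=True)

f = x * (1 - x / K) - 4 * x * y / (x + 1)   # dx/dt
g = -y + 2 * x * y / (x + 1)                # dy/dt

J = sp.Matrix([f, g]).jacobian([x, y])

# The interior equilibrium found above: x = 1, y = (1/2)(1 - 1/K)
Jeq = J.subs({x: 1, y: sp.Rational(1, 2) * (1 - 1 / K)})

print(sp.simplify(Jeq.trace()))   # (K - 3)/(2*K): changes sign at K = 3
print(sp.simplify(Jeq.det()))     # (K - 1)/(2*K): positive for K > 1
```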

Now, what if we add noise? We saw the answer last week: where we before had a stable equilibrium, we now can get irregular cycles — because the noise keeps pushing the solution away from the equilibrium!
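In code, one simple way to add the noise is an Euler–Maruyama scheme: take small deterministic steps and add independent Gaussian kicks scaled by the square root of the step size. This is only a sketch; the noise strength sigma, and the choice to perturb both populations additively, are assumptions of mine rather than details of Graham’s simulation.

```python
import numpy as np

K, sigma, dt, steps = 2.5, 0.05, 0.01, 50_000
x, y = 1.0, 1.0
xs, ys = [x], [y]
rng = np.random.default_rng(42)

for _ in range(steps):
    dx = x * (1 - x / K) - 4 * x * y / (x + 1)
    dy = -y + 2 * x * y / (x + 1)
    # deterministic drift plus white-noise kicks
    x += dx * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    y += dy * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    x, y = max(x, 0.0), max(y, 0.0)   # populations can't go negative
    xs.append(x); ys.append(y)
```

Plotting xs against ys, or both against time, gives irregular cycles like the ones below.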

Here’s how it looks for K=2.5 with white noise added:

[Figure: Graham’s plot for K = 2.5 with white noise added.]

The following graph shows a longer run in the noisy K=2.5 case, with rabbits (x) in black and wolves (y) in gray:

[Figure: a longer run of the noisy simulation, showing the rabbit and wolf populations over time.]

There is irregular periodicity — and as you’d expect, the predators tend to lag behind the prey. A burst in the rabbit population causes a rise in the wolf population; a lot of wolves eat a lot of rabbits; a crash in rabbits causes a crash in wolves.

This sort of phenomenon is actually seen in nature sometimes. The most famous case involves the snowshoe hare and the lynx in Canada. It was first noted by MacLulich:

• D. A. MacLulich, Fluctuations in the Numbers of the Varying Hare (Lepus americanus), University of Toronto Studies Biological Series 43, University of Toronto Press, Toronto, 1937.

The snowshoe hare is also known as the “varying hare”, because its coat varies in color quite dramatically. In the summer it looks like this:

[Photo: a snowshoe hare in its brown summer coat.]

In the winter it looks like this:

[Photo: a snowshoe hare in its white winter coat.]

The Canada lynx is an impressive creature:

[Photo: a Canada lynx.]

But don’t be too scared: it only weighs 8-11 kilograms, nothing like a tiger or lion.

Down in the United States, the same species of lynx went extinct in Colorado around 1973 — but now it’s back!

• Colorado Division of Wildlife, Success of the Lynx Reintroduction Program, 27 September, 2010.

In Canada, at least, the lynx rely on the snowshoe hare for 60% to 97% of their diet. I suppose this is one reason the hare has evolved such magnificent protective coloration. This is also why the hare and lynx populations are tightly coupled. They rise and crash in irregular cycles that look a bit like what we saw in our simplified model:

[Figure: historical data on hare and lynx populations, showing coupled irregular cycles.]

This cycle looks a bit more strongly periodic than Graham’s graph, so to fit this data, we might want to choose parameters that give a limit cycle rather than a stable equilibrium.

But I should warn you, in case it’s not obvious: everything about population biology is infinitely more complicated than the models I’ve shown you so far! Some obvious complications: snowshoe hare breed in the spring, their diet varies dramatically over the course of the year, and the lynx also eat rodents and birds, carrion when it’s available, and sometimes even deer. Some less obvious ones: the hare will eat dead mice and even dead hare when they’re available, and the lynx can control the size of their litter depending on the abundance of food. And I’m sure all these facts are just the tip of the iceberg. So, it’s best to think of models here as crude caricatures designed to illustrate a few features of a very complex system.

I hope someday to say a bit more and go a bit deeper. Do any of you know good books or papers to read, or fascinating tidbits of information? Graham Jones recommends this book for some mathematical aspects of ecology:

• Michael R. Rose, Quantitative Ecological Theory, Johns Hopkins University Press, Maryland, 1987.

Alas, I haven’t read it yet.

Also: you can get Graham’s R code for predator-prey simulations at the Azimuth Project.


Under carefully controlled experimental circumstances, the organism will behave as it damned well pleases. – the Harvard Law of Animal Behavior


Algorithmic Thermodynamics (Part 1)

12 October, 2010

My grad student Mike Stay and I put a new paper on the arXiv:

• John Baez and Mike Stay, Algorithmic thermodynamics.

Mike has a master’s degree in computer science, and he’s working for Google on a project called Caja. This is a system for letting people write programs in JavaScript while protecting the end users from dirty tricks the programmers might have tried. With me, he’s mainly working on the intersection of computer science and category theory, trying to bring 2-categories into the game. But Mike also knows a lot about algorithmic information theory, a subject with fascinating connections to thermodynamics. So, it was natural for us to work on that too.

Let me just tell you a little about what we did.



Around 1948, the electrical engineer Claude Shannon came up with a mathematical theory of information. Here is a quote that gives a flavor of his thoughts:

The whole point of a message is that it should contain something new.

Say you have a source of information, for example a mysterious radio transmission from an extraterrestrial civilization. Suppose every day you get a signal sort of like this:

00101101011111101010101011011101110

How much information are you getting? If the message always looks like this, presumably not much:

00000000000000000000000000000000000

Shannon came up with a precise formula for the information. But beware: it’s not really a formula for the information of a particular message. It’s a formula for the average information of a message chosen from some probability distribution. It’s this:

- \sum_i p_i \;\mathrm{log}(p_i)

where we sum over all possible messages, and p_i is the probability of the ith message.
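In code, this formula is short. Here’s a sketch; using a base-2 logarithm gives the answer in bits, a point we’ll come back to in a moment.

```python
import numpy as np

def shannon_information(p, base=2):
    """Average information of a message drawn from the distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                   # convention: terms with p_i = 0 count as 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(shannon_information([1.0, 0.0]))   # 0.0: the same message every time
print(shannon_information([0.5, 0.5]))   # 1.0: one bit per fair coin flip
```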

So, for example, suppose you keep getting the same message. Then every message has probability 0 except for one message, which has probability 1. In each term of the sum, either p_i is zero or \mathrm{log}(p_i) is zero (using the convention that 0 \, \mathrm{log}(0) = 0), so the information is zero.

That seems vaguely plausible in the example where every day we get this message:

000000000000000000000000000000000000000

It may seem less plausible if every day we get this message:

011010100010100010100010000010100000100

It looks like the aliens are trying to tell us which numbers are prime! 1 is not prime, 2 is, 3 is, 4 is not, 5 is, and so on. Aren’t we getting some information?

Maybe so: this could be considered a defect of Shannon information. On the other hand, you might be willing to admit that if we keep getting the same message every day, the second time we get it we’re not getting any new information. Or, you might not be willing to admit this — it’s actually a subtle issue, and I don’t feel like arguing.

But at the very least, you’ll probably admit that the second time you get the same message, you get less new information than the first time. The third time you get even less, and so on. So it’s quite believable that in the long run, the average amount of new information per message approaches 0 in this case. For Shannon, information means new information.

On the other hand, suppose we are absolutely unable to predict each new bit we get from the aliens. Suppose our ability to predict the next bit is no better than our ability to predict whether a fair coin comes up heads or tails. Then Shannon’s formula says we are getting the same amount of new information with every bit: namely, \mathrm{log}(2).

If we take the logarithm using base 2 here, we get 1 — so we say we’re getting one bit of information. If we take it using base e, as physicists prefer, we get \ln(2) — and we say we’re getting \ln(2) nats of information. One bit equals \ln(2) nats.

There’s a long road from these reflections to a full justification of the specific formula for Shannon information! To begin marching down that road you can read his original paper, A mathematical theory of communication.

Anyway: it soon became clear that Shannon’s formula was almost the same as the usual formula for “entropy”, which goes back to Josiah Willard Gibbs. Gibbs was actually the first engineer to get a Ph.D. in the United States, back in 1863… but he’s mainly famous as a mathematician, physicist and chemist.



Entropy is a measure of disorder. Suppose we have a box of stuff — solid, liquid, gas, whatever. There are many possible states this stuff can be in: for example, the atoms can be in different positions, and have different velocities. Suppose we only know the probability that the stuff is in any one of the allowed states. If the ith state is occupied with probability p_i, Gibbs said the entropy of our box of stuff is

- k \sum_i p_i \; \mathrm{log}(p_i)

Here k is a constant called Boltzmann’s constant.

There’s a wonderful story here, which I don’t have time to tell in detail. The way I wrote Shannon’s formula for information and Gibbs’ formula for entropy, you’d think only a moron would fail to instantly grasp that they’re basically the same. But historically, it took some work.

The appearance of Boltzmann’s constant hints at why. It shows up because people had ideas about entropy, and the closely related concept of temperature, long before they realized the full mathematical meaning of these concepts! So entropy traditionally came in units of “joules/kelvin”, and physicists had a visceral understanding of it. But dividing by Boltzmann’s constant, we can translate that notion of entropy into the modern, more abstract way of thinking of entropy as information!

Henceforth I’ll work in units where k = 1, as modern mathematical physicists do, and treat entropy and information as the same concept.

Closely related to information and entropy is a third concept, which I will call Kolmogorov complexity, though it was developed by many people — including Martin-Löf, Solomonoff, Kolmogorov, Levin, and Chaitin — and it has many names, including descriptive complexity, Kolmogorov-Chaitin complexity, algorithmic entropy, and program-size complexity. You may be intimidated by all these names, but you shouldn’t be: when lots of people keep rediscovering something and giving it new names, it’s usually because this thing is pathetically simple.

So, what is Kolmogorov complexity? It’s a way of measuring the information in a single message rather than a probability distribution of messages. And it’s just the length of the shortest computer program that prints out this message.

I suppose some clarification may be needed here:

1) I’m only talking about programs without any input, which either compute for a while, print out a message, and halt… or compute forever and never halt.

2) Of course the length of the shortest program that prints out the desired message depends on the programming language. But there are theorems saying it doesn’t depend “too much” on the language. So don’t worry about it: just pick your favorite language and stick with that.

If you think about it, Kolmogorov complexity is a really nice concept. The Kolmogorov complexity of a string of a million 0’s is a lot less than a million: you can write a short program that says “print a million 0’s”. But the Kolmogorov complexity of a highly unpredictable string of a million 0’s and 1’s is about a million: you basically need to include that string in your program and then say “print this string”. Those are the two extremes, but in general the complexity will be somewhere in between.
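Kolmogorov complexity itself turns out to be uncomputable, but a general-purpose compressor gives a crude upper bound, and that’s enough to see the two extremes just described. A toy sketch:

```python
import os
import zlib

zeros = bytes(1_000_000)        # a million zero bytes
noise = os.urandom(1_000_000)   # a million random bytes

# A compressed file plus a decompressor is, in effect,
# a short program saying "print this message":
print(len(zlib.compress(zeros)))   # tiny: essentially "print a million 0's"
print(len(zlib.compress(noise)))   # about a million: random bytes don't compress
```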

Various people — the bigshots listed above, and others too — soon realized that Kolmogorov complexity is deeply connected to Shannon information. They have similar properties, but they’re also directly related. It’s another great story, and I urge you to learn it. For that, I recommend:

• Ming Li and Paul Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, Berlin, 2008.

To understand the relation a bit better, Mike and I started thinking about probability measures on the set of programs. People had already thought about this — but we thought about it a bit more the way physicists do.

Physicists like to talk about something called a Gibbs ensemble. Suppose we have a set X and a function

F: X \to \mathbb{R}

Then the Gibbs ensemble is the probability distribution on X that maximizes entropy subject to the condition that F has some specified average, or “expected value”.

So, to find the Gibbs ensemble, we need to find a probability distribution p : X \to \mathbb{R} that maximizes

- \sum_{i \in X} p_i \; \mathrm{log}(p_i)

subject to the constraint that

\sum_{i \in X} p_i \; F(i) = \langle F \rangle

where \langle F \rangle is some number, the expected value of F. Finding the probability distribution that does the job is an exercise in Lagrange multipliers. I won’t do it. There’s a nice formula for the answer, but we won’t need it here.
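For the impatient: the Lagrange multiplier computation gives the exponential form p_i \propto e^{-\beta F(i)}, with \beta adjusted so the constraint holds. Here’s a numerical sketch on a small finite set of my own choosing:

```python
import numpy as np
from scipy.optimize import brentq

F = np.arange(10.0)   # an observable on the finite set X = {0, ..., 9}
target = 3.0          # the prescribed expected value <F>

def mean_F(beta):
    w = np.exp(-beta * F)      # Gibbs weights e^(-beta F(i))
    return (w / w.sum()) @ F   # expected value of F in this ensemble

# tune the Lagrange multiplier beta until the constraint is satisfied
beta = brentq(lambda b: mean_F(b) - target, -5.0, 5.0)

p = np.exp(-beta * F)
p /= p.sum()
print(beta, p @ F)   # the entropy-maximizing distribution, with <F> = 3
```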

What you really need to know is something more important: why Gibbs invented the Gibbs ensemble! He invented it to solve some puzzles that sound completely impossible at first.

For example, suppose you have a box of stuff and you don’t know which state it’s in. Suppose you only know the expected value of its energy, say \langle E \rangle. What’s the probability that this stuff is in its ith state?

This may sound like an insoluble puzzle: how can we possibly know? But Gibbs proposed an answer! He said, basically: find the probability distribution p that maximizes entropy subject to the constraint that the mean value of energy is \langle E \rangle. Then the answer is p_i.

In other words: use the Gibbs ensemble.

Now let’s come back to Kolmogorov complexity.

Imagine randomly picking a program out of a hat. What’s the probability that you pick a certain program? Again, this sounds like an impossible puzzle. But you can answer it if you know the expected value of the length of this program! Then you just use the Gibbs ensemble.

What does this have to do with Kolmogorov complexity? Here I’ll be a bit fuzzy, because the details are in our paper, and I want you to read that.

Suppose we start out with the Gibbs ensemble I just mentioned. In other words, there’s a program in a hat, and all you know is the expected value of its length.

But now suppose I tell you the message that this program prints out. Now you know more. How much more information do you have now? The Kolmogorov complexity of the message — that’s how much!

(Well, at least this is correct up to some error bounded by a constant.)

The main fuzzy thing in what I just said is “how much more information do you have?” You see, I’ve explained information, but I haven’t explained “information gain”. Information is a quantity you compute from one probability distribution. Information gain is a quantity you compute from two. More precisely,

- \sum_i p_i \; \mathrm{log}(p_i/q_i)

is the information you gain when you thought the probability distribution was q, but then someone comes along and tells you it’s p.
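In code, information gain is as short as information itself. A sketch, using the natural logarithm, so the answer comes out in nats:

```python
import numpy as np

def information_gain(p, q):
    """Information gained on learning the distribution is p, not q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                  # convention: terms with p_i = 0 count as 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# prior: a fair coin; updated belief: a heavily biased coin
print(information_gain([0.9, 0.1], [0.5, 0.5]))   # about 0.37 nats
```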

In fact, we argue that information gain is more fundamental than information. This is a Bayesian idea: q is your “prior”, the probability distribution you thought was true, and the information you gain upon learning that the distribution is really p should be defined relative to this prior. When people think they’re talking about information without any prior, they are really using a prior that’s so bland that they don’t notice it’s there: a so-called “uninformative prior”.

But I digress. To help you remember the story so far, let me repeat myself. Up to a bounded error, the Kolmogorov complexity of a message is the information gained when you start out only knowing the expected length of a program, and then learn which message the program prints out.

But this is just the beginning of the story. We’ve seen how Kolmogorov complexity is related to Gibbs ensembles. Now that we have Gibbs ensembles running around, we can go ahead and do thermodynamics! We can talk about quantities analogous to temperature, pressure, and so on… and all the usual thermodynamic relations hold! We can even take the ideas about steam engines and apply them to programs!

But for that, please read our paper. Here’s the abstract, which says what we really do:

Algorithmic entropy can be seen as a special case of entropy as studied in statistical mechanics. This viewpoint allows us to apply many techniques developed for use in thermodynamics to the subject of algorithmic information theory. In particular, suppose we fix a universal prefix-free Turing machine and let X be the set of programs that halt for this machine. Then we can regard X as a set of ‘microstates’, and treat any function on X as an ‘observable’. For any collection of observables, we can study the Gibbs ensemble that maximizes entropy subject to constraints on expected values of these observables. We illustrate this by taking the log runtime, length, and output of a program as observables analogous to the energy E, volume V and number of molecules N in a container of gas. The conjugate variables of these observables allow us to define quantities which we call the ‘algorithmic temperature’ T, ‘algorithmic pressure’ P and ‘algorithmic potential’ μ, since they are analogous to the temperature, pressure and chemical potential. We derive an analogue of the fundamental thermodynamic relation dE = T dS − P dV + μ dN, and use it to study thermodynamic cycles analogous to those for heat engines. We also investigate the values of T, P and μ for which the partition function converges. At some points on the boundary of this domain of convergence, the partition function becomes uncomputable. Indeed, at these points the partition function itself has nontrivial algorithmic entropy.

