## The Game of Googol

20 July, 2015

Here’s a puzzle from a recent issue of Quanta, an online science magazine:

Puzzle 1: I write down two different numbers that are completely unknown to you, and hold one in my left hand and one in my right. You have absolutely no idea how I generated these two numbers. Which is larger?

You can point to one of my hands, and I will show you the number in it. Then you can decide to either select the number you have seen or switch to the number you have not seen, held in the other hand. Is there a strategy that will give you a greater than 50% chance of choosing the larger number, no matter which two numbers I write down?

At first it seems the answer is no. Whatever number you see, the other number could be larger or smaller. There’s no way to tell. So obviously you can’t get a better than 50% chance of picking the hand with the largest number—even if you’ve seen one of those numbers!

But “obviously” is not a proof. Sometimes “obvious” things are wrong!

It turns out that, amazingly, the answer to the puzzle is yes! You can find a strategy to do better than 50%. But the strategy uses randomness. So, this puzzle is a great illustration of the power of randomness.

If you want to solve it yourself, stop now or read Quanta magazine for some clues—they offered a small prize for the best answer:

• Pradeep Mutalik, Can information rise from randomness?, Quanta, 7 July 2015.

Greg Egan gave a nice solution in the comments to this magazine article, and I’ll reprint it below along with two followup puzzles. So don’t look down there unless you want a spoiler.

I should add: the most common mistake among educated readers seems to be assuming that the first player, the one who chooses the two numbers, chooses them according to some probability distribution. Don’t assume that. They are simply arbitrary numbers.

### The history of this puzzle

I’d seen this puzzle before—do you know who invented it? On G+, Hans Havermann wrote:

I believe the origin of this puzzle goes back to (at least) John Fox and Gerald Marnie’s 1958 betting game ‘Googol’. Martin Gardner mentioned it in his February 1960 column in Scientific American. Wikipedia mentions it under the heading ‘Secretary problem’. Gardner suggested that a variant of the game was proposed by Arthur Cayley in 1875.﻿

Actually the game of Googol is a generalization of the puzzle that we’ve been discussing. Martin Gardner explained it thus:

Ask someone to take as many slips of paper as he pleases, and on each slip write a different positive number. The numbers may range from small fractions of 1 to a number the size of a googol (1 followed by a hundred 0s) or even larger. These slips are turned face down and shuffled over the top of a table. One at a time you turn the slips face up. The aim is to stop turning when you come to the number that you guess to be the largest of the series. You cannot go back and pick a previously turned slip. If you turn over all the slips, then of course you must pick the last one turned.

So, the puzzle I just showed you is the special case when there are just 2 slips of paper. I seem to recall that Gardner incorrectly dismissed this case as trivial!

There’s been a lot of work on Googol. Julien Berestycki writes:

I heard about this puzzle a few years ago from Sasha Gnedin. He has a very nice paper about this

• Alexander V. Gnedin, A solution to the game of Googol, Annals of Probability (1994), 1588–1595.

One of the many beautiful ideas in this paper is that it asks what is the best strategy for the guy who writes the numbers! It also cites a paper by Gnedin and Berezowskyi (of oligarchic fame). ﻿

### Egan’s solution

Okay, here is Greg Egan’s solution, paraphrased a bit:

Pick some function $f : \mathbb{R} \to \mathbb{R}$ such that:

$\displaystyle{ \lim_{x \to -\infty} f(x) = 0 }$

$\displaystyle{ \lim_{x \to +\infty} f(x) = 1 }$

$f$ is monotonically increasing: if $x > y$ then $f(x) > f(y)$

There are lots of functions like this, for example

$\displaystyle{f(x) = \frac{e^x}{e^x + 1} }$

Next, pick one of the first player’s hands at random. If the number you are shown is $x,$ compute $f(x).$ Then generate a uniformly distributed random number $z$ between 0 and 1. If $z$ is less than or equal to $f(x)$ guess that $x$ is the larger number, but if $z$ is greater than $f(x)$ guess that the larger number is in the other hand.

The probability of guessing correctly can be calculated as the probability of seeing the larger number initially and then, correctly, sticking with it, plus the probability of seeing the smaller number initially and then, correctly, choosing the other hand.

This is

$\frac{1}{2} f(x) + \frac{1}{2} (1 - f(y)) = \frac{1}{2} + \frac{1}{2} (f(x) - f(y))$

This is strictly greater than $\frac{1}{2}$ since $x > y$ so $f(x) - f(y) > 0$.

So, you have a more than 50% chance of winning! But as you play the game, there’s no way to tell how much more than 50%. If the numbers on the other players hands are very large, or very small, your chance will be just slightly more than 50%.

### Followup puzzles

Here are two more puzzles:

Puzzle 2: Prove that no deterministic strategy can guarantee you have a more than 50% chance of choosing the larger number.

Puzzle 3: There are perfectly specific but ‘algorithmically random’ sequences of bits, which can’t predicted well by any program. If we use these to generate a uniform algorithmically random number between 0 and 1, and use the strategy Egan describes, will our chance of choosing the larger number be more than 50%, or not?﻿

But watch out—here come Egan’s solutions to those!

### Solutions

Egan writes:

Here are my answers to your two puzzles on G+.

Puzzle 2: Prove that no deterministic strategy can guarantee you have a more than 50% chance of choosing the larger number.

Answer: If we adopt a deterministic strategy, that means there is a function $S: \mathbb{R} \to \{0,1\}$ that tells us whether on not we stick with the number x when we see it. If $S(x)=1$ we stick with it, if $S(x)=0$ we swap it for the other number.

If the two numbers are $x$ and $y,$ with $x > y,$ then the probability of success will be:

$P = 0.5 + 0.5(S(x)-S(y))$

This is exactly the same as the formula we obtained when we stuck with $x$ with probability $f(x),$ but we have specialised to functions $S$ valued in $\{0,1\}.$

We can only guarantee a more than 50% chance of choosing the larger number if $S$ is monotonically increasing everywhere, i.e. $S(x) > S(y)$ whenever $x > y.$ But this is impossible for a function valued in $\{0,1\}.$ To prove this, define $x_0$ to be any number in $[1,2]$ such that $S(x_0)=0;$ such an $x_0$ must exist, otherwise $S$ would be constant on $[1,2]$ and hence not monotonically increasing. Similarly define $x_1$ to be any number in $[-2,-1]$ such that $S(x_1) = 1.$ We then have $x_0 > x_1$ but $S(x_0) < S(x_1).$

Puzzle 3: There are perfectly specific but ‘algorithmically random’ sequences of bits, which can’t predicted well by any program. If we use these to generate a uniform algorithmically random number between 0 and 1, and use the strategy Egan describes, will our chance of choosing the larger number be more than 50%, or not?﻿

Answer: As Philip Gibbs noted, a deterministic pseudo-random number generator is still deterministic. Using a specific sequence of algorithmically random bits

$(b_1, b_2, \dots )$

to construct a number $z$ between $0$ and $1$ means $z$ takes on the specific value:

$z_0 = \sum_i b_i 2^{-i}$

So rather than sticking with $x$ with probability $f(x)$ for our monotonically increasing function $f,$ we end up always sticking with $x$ if $z_0 \le f(x),$ and always swapping if $z_0 > f(x).$ This is just using a function $S:\mathbb{R} \to \{0,1\}$ as in Puzzle 2, with:

$S(x) = 0$ if $x < f^{-1}(z_0)$

$S(x) = 1$ if $x \ge f^{-1}(z_0)$

So all the same consequences as in Puzzle 2 apply, and we cannot guarantee a more than 50% chance of choosing the larger number.

Puzzle 3 emphasizes the huge gulf between ‘true randomness’, where we only have a probability distribution of numbers $z,$ and the situation where we have a specific number $z_0,$ generated by any means whatsoever.

We could generate $z_0$ using a pseudorandom number generator, radioactive decay of atoms, an oracle whose randomness is certified by all the Greek gods, or whatever. No matter how randomly $z_0$ is generated, once we have it, we know there exist choices for the first player that will guarantee our defeat!

This may seem weird at first, but if you think about simple games of luck you’ll see it’s completely ordinary. We can have a more than 50% chance of winning such a game even if for any particular play we make the other player has a move that ensures our defeat. That’s just how randomness works.

## The Large-Number Limit for Reaction Networks (Part 3)

8 October, 2014

joint with Arjun Jain

We used to talk about reaction networks quite a lot here. When Arjun Jain was visiting the CQT, we made a lot of progress understanding how the master equation reduces to the rate equation in the limit where there are very large numbers of things of each kind. But we never told you the end of the story, and by now it’s been such a long time that you may need a reminder of some basic facts!

So…

The rate equation treats the number of things of each kind as continuous—a nonnegative real number—and says how it changes in a deterministic way.

The master equation treats the number of things of each kind as discrete—a nonnegative integer—and says how it changes in a probabilistic way.

You can think of the master equation as the ‘true’ description, and the rate equation as an approximation that’s good in some limit where there are large numbers of molecules — or more precisely, where the probability distribution of having some number of molecules of each kind is sharply peaked near some large value.

You may remember that in the master equation, the state of a chemical system is described by a vector $\psi$ in a kind of ‘Fock space’, while time evolution is described with the help of an operator on this space, called the ‘Hamiltonian’ $H$:

$\displaystyle{ \frac{d}{dt} \psi(t) = H \psi(t) }$

The Hamiltonian is built from annihilation and creation operators, so all of this looks very much like quantum field theory. The details are here, and we won’t try to review them all:

• John Baez and Jacob Biamonte, Quantum Techniques for Stochastic Mechanics.

The point is this: the ‘large-number limit’ where the master equation reduces to the rate equation smells a lot like the ‘classical limit’ of quantum field theory, where the description of light in terms of photons reduces to the good old Maxwell equations. So, maybe we can understand the large-number limit by borrowing techniques from quantum field theory!

How do we take the classical limit of quantum electromagnetism and get the classical Maxwell equations? For simplicity let’s ignore charged particles and consider the ‘free electromagnetic field': just photons, described by the quantum version of Maxwell’s equations. When we take the classical limit we let Planck’s constant $\hbar$ go to zero: that much is obvious. However, that’s not all! The energy of each photon is proportional to $\hbar,$ so to take the classical limit and get a solution of the classical Maxwell’s equations with nonzero energy we also need to increase the number of photons. We cleverly do this in such a way that the total energy remains constant as $\hbar \to 0.$

So, in quantum electromagnetism the classical limit is also a large-number limit!

That’s a good sign. It suggests the same math will also apply to our reaction network problem.

But then we hit an apparent roadblock. What’s the analogue of Planck’s constant in chemical reaction networks? What should go to zero?

We told you the answer to that puzzle a while ago: it’s the reciprocal of Avogadro’s number!

You see, chemists measure numbers of molecules in ‘moles’. There’s a large number of molecules in each mole: Avogadro’s number. If we let the reciprocal of Avogadro’s number go to zero, we are taking a limit where chemistry becomes ‘continuous’ and the discreteness of molecules goes away. Of course this is just a mathematical trick, but it’s a very useful one.

So, we got around that roadblock. And then something nice happened.

When taking the classical limit of quantum electromagnetism, we focus attention on certain quantum states that are the ‘best approximations’ to classical states. These are called ‘coherent states’, and it’s very easy to study how the behave as we simultaneously let $\hbar \to 0$ and let the expected number of photons go to infinity.

And the nice thing is that these coherent states are also important in chemistry! But because chemistry involves probabilities rather than amplitudes, they have a different name: ‘Poisson distributions’. On this blog, Brendan Fong used them to give a new proof of a great result in mathematical chemistry, the Anderson–Craciun–Kurtz theorem.

So, we have most of the concepts and tools in place, and we can tackle the large-number limit using quantum techniques.

You can review the details here:

So, after a quick refresher on the notation, we’ll plunge right in.

As you’ll see, we solve the problem except for one important technical detail: passing a derivative through a limit! This means our main result is not a theorem. Rather, it’s an idea for how to prove a theorem. Or if we act like physicists, we can call it a theorem.

### Review of notation

The rate equation says

$\displaystyle{\frac{d}{dt}{x}(t) = \sum_{\tau \in T} r(\tau) (t(\tau)-s(\tau)) {x}(t)^{s(\tau)} }$

where:

$x(t) \in \mathbb{R}^k$ is a vector describing concentrations of $k$ different species at time $t.$ In chemistry these species could be different kinds of molecules.

• Each $\tau \in T$ is a transition, or in chemistry, a reaction.

$s(\tau) \in \mathbb{N}^k$ is a vector of natural numbers saying how many items of each species appear as inputs to the reaction $\tau.$ This is called the source of the reaction.

$t(\tau) \in \mathbb{N}^k$ is a vector of natural numbers saying how many items of each species appear as outputs of the reaction $\tau.$ This is called the target of the reaction. So, $t(\tau) - s(\tau)$ says the net change of the number of items of each species in the reaction $\tau.$

• The rate at which the reaction $\tau$ occurs is proportional to the rate constant $r(\tau)$ times the number

$x(t)^{s(\tau)}$

Here we are raising a vector to a vector power and getting a number, using this rule:

$x^r = x_1^{r_1} \cdots x_k^{r_k}$

where $r$ is any vector of natural numbers $(r_1, \dots, r_k),$ and $x$ is any vector of nonnegative real numbers. From now on we’ll call a vector of natural numbers a multi-index.

In this paper:

• John Baez, Quantum techniques for reaction networks.

it was shown that the master equation implies

$\displaystyle{\frac{d}{dt}\langle N_i \psi(t)\rangle = \sum_{\tau \in T} r(\tau) (t_i(\tau)-s_i(\tau)) \; \langle N^{\, \underline{s(\tau)}} \, \psi(t)\rangle }$

Here:

$\psi(t)$ is the stochastic state saying the probability of having any particular number of items of each species at each time $t.$ We won’t review the precise details of how this work; for that reread the relevant bit of Part 8.

$N_i$ is the $i$th number operator, defined using annihilation and creation operators as in quantum mechanics:

$N_i = a_i^\dagger a_i$

For the annihilation and creation operators, see Part 8.

$\langle N_i \psi(t) \rangle$ is the expected number of items of the $i$th species at time $t.$

• Similarly, $\langle N^{\underline{s(\tau)}} \psi(t)\rangle$ is the expected value of a certain product of operators. For any multi-index $r$ we define the falling power

$N_i^{\underline{r}_i} = N_i (N_i - 1) (N_i - 2) \cdots (N_i - r_i +1)$

and then we define

$N^{\underline{r}} = N_1^{\underline{r_1}} \cdots N_k^{\underline{r_k}}$

### The large-number limit

Okay. Even if you don’t understand any of what we just said, you’ll see the master and rate equation look similar. The master equation implies this:

$\displaystyle{\frac{d}{dt}\langle N_i \psi(t)\rangle = \sum_{\tau \in T} r(\tau) (t_i(\tau)-s_i(\tau)) \; \langle N^{\, \underline{s(\tau)}} \, \psi(t)\rangle }$

while the rate equation says this:

$\displaystyle{\frac{d}{dt}{x}(t) = \sum_{\tau \in T} r(\tau) (t(\tau)-s(\tau)) \; {x}(t)^{s(\tau)} }$

So, we will try to get from the first to the second second with the help of a ‘large-number limit’.

We start with a few definitions. We introduce an adjustable dimensionless quantity which we call $\hbar.$ This is just a positive number, which has nothing to do with quantum theory except that we’re using a mathematical analogy to quantum mechanics to motivate everything we’re doing.

Definition. The rescaled number operators are defined as $\widetilde{N}_i = \hbar N_i.$ This can be thought of as a rescaling of the number of objects, so that instead of counting objects individually, we count them in bunches of size $1/\hbar.$

Definition. For any multi-index $r$ we define the rescaled falling power of the number operator $N_i$ by:

$\widetilde{N}_i^{\underline{r_i}} = \widetilde{N}_i (\widetilde{N}_i - \hbar)(\widetilde{N}_i-2\hbar)\cdots (\widetilde{N}_i-r_i\hbar+\hbar)$

and also define

$\widetilde{N}^{\underline{r}} = \widetilde{N}_1^{\underline{r_1}} \; \cdots \;\widetilde{N}_k^{\underline{r_k}}$

for any multi-index $r.$

Using these, we get the following equation:

$\displaystyle{\frac{1}{\hbar}\frac{d}{dt} \langle \widetilde{N}_i \psi(t)\rangle = \sum_{\tau \in T} r(\tau) (t_i(\tau)-s_i(\tau)) \; \langle \widetilde{N}^{\underline{s(\tau)}} \psi(t)\rangle \; \frac{1}{\hbar^{|s(\tau)|}} }$

where for any multi-index $r$ we set

$|r| = r_1 + \cdots + r_k$

This suggests a way to rescale the rate constants in the master equation:

Definition. The rescaled rate constants are

$\displaystyle{\widetilde{r}(\tau) = \frac{r(\tau)}{\hbar^{|s(\tau)|-1}}}$

From here onwards, we change our viewpoint. We consider the rescaled rate constants $\widetilde{r}(\tau)$ to be fixed, instead of the original rate constants $r(\tau).$ So, as we decrease $\hbar,$ we are studying situations where the original rate constants change to ensure that the rescaled rate constants stays fixed!

So, we switch to working with a rescaled master equation:

Definition. The rescaled master equation is:

$\displaystyle{\frac{d}{dt} \langle \widetilde{N}_i\widetilde{\psi}(t)\rangle = \sum_{\tau \in T} \widetilde{r}(\tau) (t_i(\tau)-s_i(\tau)) \; \langle \widetilde{N}^{\underline{s(\tau)}} \widetilde{\psi}(t)\rangle }$

This is really a one-parameter family of equations, depending on $\hbar\in (0,\infty).$ We write a solution of the rescaled master equation as $\widetilde{\psi}(t),$ but it is really one solution for each value of $\hbar.$

Following the same procedure as above, we can rescale the rate equation, using the same definition of the rescaled rate constants:

Definition. The rescaled number of objects of the $i\mathrm{th}$ species is defined as $\widetilde{x_i}=\hbar x_i,$ where $x_i$ is the original number of objects of the $i\mathrm{th}$ species. Here again, we are counting in bunches of $1/\hbar.$

Using this to rescale the rate equation, we get

Definition. The rescaled rate equation is

$\displaystyle{\frac{d}{dt}\widetilde{x}(t) = \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau))\; \widetilde{x}(t)^{s(\tau)} }$

where

$\widetilde{x}(t)=(\widetilde{x_1}(t),\widetilde{x_2}(t),\dots, \widetilde{x_k}(t))$

Therefore, to go from the rescaled master equation to the rescaled rate equation, we require that

$\langle\widetilde{N}^{\underline{r}}\, \widetilde{\psi}(t)\rangle \to \langle\widetilde{N}\widetilde{\psi}(t)\rangle^r$

as $\hbar \to 0.$ If this holds, we can identify $\langle\widetilde{N}\widetilde{\psi}(t)\rangle$ with $\widetilde{x}(t)$ and get the rate equation from the master equation!

To this end, we introduce the following crucial idea:

Definition. A semiclassical family of states, $\widetilde{\psi},$ is defined as a one-parameter family of states depending on $\hbar \in (0,\infty)$ such that for some $\widetilde{c} \in [0,\infty)^k$ we have

$\langle\widetilde{N}^r\widetilde{\psi}\rangle \to \widetilde{c}^{\, r}$

for every $r\in \mathbb{N}^k$ as $\hbar \to 0.$

In particular, this implies

$\langle\widetilde{N}_i\widetilde{\psi}\rangle \to \widetilde{c}_i$

for every index $i.$

Intuitively, a semiclassical family is a family of probability distributions that becomes more and more sharply peaked with a larger and larger mean as $\hbar$ decreases. We would like to show that in this limit, the rescaled master equation gives the rescaled rate equation.

We make this precise in the following propositions.

Proposition 1. If $\widetilde{\psi}$ is a semiclassical family as defined above, then in the $\hbar \to 0$ limit, we have $\langle\widetilde{N}^{\underline{r}}\widetilde{\psi}\rangle \to \widetilde{c}^{\; r}$ as well.

Proof. For each index $i,$

$\displaystyle{ \langle\widetilde{N}_i^{\; \underline{r_i}}\, \widetilde{\psi}\rangle = \displaystyle{ \langle \widetilde{N}_i (\widetilde{N}_i - \hbar)(\widetilde{N}_i-2\hbar)\cdots(\widetilde{N}_i-r_i\hbar+\hbar)\,\widetilde{\psi}\rangle} }$

$\displaystyle{ = \Big\langle\Big(\widetilde{N}_i^r + \hbar\frac{r_i(r_i-1)}{2}\widetilde{N}_i^{r_i-1}+\cdots + \hbar^{r_i-1}(r_i-1)!\Big)\,\widetilde{\psi}\Big\rangle }$

By the definition of a semiclassical family,

$\displaystyle{ \lim_{\hbar\to 0} \langle\Big(\widetilde{N}_i^{r_i} + \hbar\frac{r_i(r_i-1)}{2}\widetilde{N}_i^{r_i-1}+ \cdots + \hbar^{r_i-1}(r_i-1)!\Big)\;\widetilde{\psi}\rangle} = \widetilde{c}_i^{\; r_i}$

since every term but the first approaches zero. Thus, we have

$\displaystyle{ \lim_{\hbar \to 0} \langle\widetilde{N}_i^{\; \underline{r_i}}\, \widetilde{\psi}\rangle = \widetilde{c}_i^{\; r_i} }$

A similar but more elaborate calculation shows that

$\displaystyle{ \lim_{\hbar \to 0} \langle\widetilde{N}_1^{\, \underline{r_1}} \cdots\widetilde{N}_k^{\, \underline{r_k}} \, \widetilde{\psi}\rangle = \lim_{\hbar \to 0} \langle\widetilde{N}_1^{\, r_1}\cdots \widetilde{N}_k^{\, r_k} \widetilde{\psi}\rangle= \lim_{\hbar \to 0}\langle\widetilde{N}^{\, r} \, \widetilde{\psi}\rangle }$

or in other words

$\langle\widetilde{N}^{\, \underline{r}}\,\widetilde{\psi}\rangle \to \widetilde{c}^{\, r}$

as $\hbar \to 0.$   █

Proposition 2. If $\widetilde{\psi}$ is a semiclassical family of states, then

$\displaystyle{ \langle (\widetilde{N}-\widetilde{c})^{r}\, \widetilde{\psi}\rangle \to 0 }$

for any multi-index $r.$

Proof. Consider the $r_i\mathrm{th}$ centered moment of the $i\mathrm{th}$ number operator:

$\displaystyle{\langle(\widetilde{N}_i-\widetilde{c}_i)^{r_i}\widetilde{\psi}\rangle = \sum_{p =0}^{r_i} {r_i \choose p}\langle\widetilde{N}_i^p\widetilde{\psi}\rangle(-\widetilde{c}_i)^{r_i-p} }$

Taking the limit as $\hbar$ goes to zero, this becomes

$\begin{array}{ccl} \displaystyle{ \lim_{\hbar \to 0}\sum_{p =0}^{r_i} {r_i \choose p}\langle\widetilde{N}_i^p\widetilde{\psi}\rangle(-\widetilde{c}_i)^{r_i-p} } &=& \displaystyle{ \sum_{p =0}^{r_i} {r_i \choose p}(\widetilde{c}_i)^p(-\widetilde{c}_i)^{r_i-p} } \\ \\ &=& (\widetilde{c}_i-\widetilde{c}_i)^{r_i} \\ \\ &=& 0 \end{array}$

For a general multi-index $r$ we can prove the same sort of thing with a more elaborate calculation. First note that

$\langle (\widetilde{N}-\widetilde{c})^{r}\widetilde{\psi}\rangle=\langle(\widetilde{N_1}-\widetilde{c_1})^{r_1} \cdots (\widetilde{N_k}-\widetilde{c_k})^{r_k})\widetilde{\psi}\rangle$

The right-hand side can be expanded as

$\displaystyle{ \langle(\sum_{p_1 =0}^{r_1} {r_1 \choose p_1}\widetilde{N}_1^{p_1}(-\widetilde{c}_1)^{r_1-p_1} ) \cdots (\sum_{p_k =0}^{r_k} {r_k \choose p_k}\widetilde{N}_k^{p_k}(-\widetilde{c}_k)^{r_k-p_k} )\widetilde{\psi} }\rangle$

We can write this more tersely as

$\displaystyle{ \sum_{p=0}^r} {r\choose p} \langle \widetilde{N}^p\widetilde{\psi}\rangle (-\widetilde{c})^{r-p}$

where for any multi-index $r$ we define

$\displaystyle{{\sum_{p=0}^{r}}= \sum_{p_1 =0}^{r_1} \cdots \sum_{p_k =0}^{r_k} }$

and for any multi-indices $r, p$ we define

$\displaystyle{ {r \choose p}={r_1 \choose p_1}{r_2 \choose p_2}\cdots {r_k \choose p_k}}$

Now using the definition of a semiclassical state, we see

$\displaystyle{ \lim_{\hbar \to 0} \sum_{p=0}^r} {r\choose p} \langle \widetilde{N}^p\widetilde{\psi}\rangle (-\widetilde{c})^{r-p}= \displaystyle{ \sum_{p=0}^r} {r\choose p} (\widetilde{c})^{p} (-\widetilde{c})^{r-p}$

But this equals zero, as the last expression, expanded, is

$\displaystyle{ (\widetilde{c})^r \left( \sum_{p_1=0}^{r_1} {r_1\choose p_1} (-1)^{r_1-p_1}\right) \cdots \left( \sum_{p_k=0}^{r_k} {r_k\choose p_k} (-1)^{r_k-p_k} \right) }$

where each individual sum is zero.   █

Here is the theorem that would finish the job if we could give a fully rigorous proof:

“Theorem.” If $\widetilde{\psi}(t)$ is a solution of the rescaled master equation and also a semiclassical family for the time interval $[t_0,t_1],$ then $\widetilde{x}(t) = \langle \widetilde{N} \widetilde{\psi}(t) \rangle$ is a solution of the rescaled rate equation for $t \in [t_0,t_1].$

Proof sketch. We sketch a proof that relies on the assumption that we can pass the $\hbar \to 0$ limit through a time derivative. Of course, to make this rigorous, we would need to justify this. Perhaps it is true only in certain cases.

Assuming that we can pass the limit through the derivative:

$\displaystyle{\lim_{\hbar \to 0}\frac{d}{dt} \langle \widetilde{N}\widetilde{\psi}(t)\rangle = \lim_{\hbar \to 0} \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau))\langle \widetilde{N}^{\, \underline{s(\tau)}} \, \widetilde{\psi}(t)\rangle }$

and thus

$\displaystyle{\frac{d}{dt}\lim_{\hbar \to 0} \langle \widetilde{N}\widetilde{\psi}(t)\rangle = \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau)) \lim_{\hbar \to 0}\langle \widetilde{N}^{\, \underline{s(\tau)}} \, \widetilde{\psi}(t)\rangle }$

and thus

$\displaystyle{\frac{d}{dt}\widetilde{x}(t) = \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau))\widetilde{x}^{\, s(\tau)} }.$

As expected, we obtain the rescaled rate equation.   █

Another question is this: if we start with a semiclassical family of states as our initial data, does it remain semiclassical as we evolve it in time? This will probably be true only in certain cases.

### An example: rescaled coherent states

The best-behaved semiclassical states are the coherent states.
Consider the family of coherent states

$\displaystyle{\widetilde{\psi}_{\widetilde{c}} = \frac{e^{(\widetilde{c}/\hbar) z}}{e^{\widetilde{c}/\hbar}}}$

using the notation developed in the earlier mentioned paper. In that paper it was shown that for any multi-index $m$ and any coherent state $\Psi$ we have

$\langle N^{\underline{m}}\Psi\rangle = \langle N\Psi \rangle^m$

Using this result for $\widetilde{\psi}_{\widetilde{c}}$ we get

$\displaystyle{\langle \widetilde{N}^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle~=~\hbar^{|m|}\langle N^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle~=~\hbar^{|m|}\langle N\widetilde{\psi}_{\widetilde{c}}\rangle^m~=~\hbar^{|m|}\frac{\widetilde{c}^m}{\hbar^{|m|}}~=~\widetilde{c}^m}$

Since $\langle\widetilde{N}^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle$ equals $\langle \widetilde{N}^{m} \widetilde{\psi}_{\widetilde{c}}\rangle$ plus terms of order $\hbar,$ as $\hbar \to 0$ we have

$\langle\widetilde{N}^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle~\to~\langle\widetilde{N}^{m}\widetilde{\psi}_{\widetilde{c}}\rangle=\widetilde{c}^{m}$

showing that our chosen $\widetilde{\psi}_{\widetilde{c}}$ is indeed a semiclassical family.

## Exploring Climate Data (Part 1)

1 August, 2014

joint with Dara O Shayda

Emboldened by our experiments in El Niño analysis and prediction, people in the Azimuth Code Project have been starting to analyze weather and climate data. A lot of this work is exploratory, with no big conclusions. But it’s still interesting! So, let’s try some blog articles where we present this work.

This one will be about the air pressure on the island of Tahiti and in a city called Darwin in Australia: how they’re correlated, and how each one varies. This article will also be a quick introduction to some basic statistics, as well as ‘continuous wavelet transforms’.

### Darwin, Tahiti and El Niños

The El Niño Southern Oscillation is often studied using the air pressure in Darwin, Australia versus the air pressure in Tahiti. When there’s an El Niño, it gets stormy in the eastern Pacific so the air temperatures tend to be lower in Tahiti and higher in Darwin. When there’s a La Niña, it’s the other way around:

The Southern Oscillation Index or SOI is a normalized version of the monthly mean air pressure anomaly in Tahiti minus that in Darwin. Here anomaly means we subtract off the mean, and normalized means that we divide by the standard deviation.

So, the SOI tends to be negative when there’s an El Niño. On the other hand, when there’s an El Niño the Niño 3.4 index tends to be positive—this says it’s hotter than usual in a certain patch of the Pacific.

Here you can see how this works:

When the Niño 3.4 index is positive, the SOI tends to be negative, and vice versa!

It might be fun to explore precisely how well correlated they are. You can get the data to do that by clicking on the links above.

But here’s another question: how similar are the air pressure anomalies in Darwin and in Tahiti? Do we really need to take their difference, or are they so strongly anticorrelated that either one would be enough to detect an El Niño?

You can get the data to answer such questions here:

Southern Oscillation Index based upon annual standardization, Climate Analysis Section, NCAR/UCAR. This includes links to monthly sea level pressure anomalies in Darwin and Tahiti, in either ASCII format (click the second two links) or netCDF format (click the first one and read the explanation).

In fact this website has some nice graphs already made, which I might as well show you! Here’s the SOI and also the sum of the air pressure anomalies in Darwin and Tahiti, normalized in some way:

(Click to enlarge.)

If the sum were zero, the air pressure anomalies in Darwin and Tahiti would contain the same information and life would be simple. But it’s not!

How similar in character are the air pressure anomalies in Darwin and Tahiti? There are many ways to study this question. Dara tackled it by taking the air pressure anomaly data from 1866 to 2012 and computing some ‘continuous wavelet transforms’ of these air pressure anomalies. This is a good excuse for explaining how a continuous wavelet transform works.

### Very basic statistics

It helps to start with some very basic statistics. Suppose you have a list of numbers

$x = (x_1, \dots, x_n)$

You probably know how to take their mean, or average. People often write this with angle brackets:

$\displaystyle{ \langle x \rangle = \frac{1}{n} \sum_{i = 1}^n x_i }$

You can also calculate the mean of their squares:

$\displaystyle{ \langle x^2 \rangle = \frac{1}{n} \sum_{i = 1}^n x_i^2 }$

If you were naive you might think $\langle x^2 \rangle = \langle x \rangle^2,$ but in fact we have:

$\langle x^2 \rangle \ge \langle x \rangle^2$

and they’re equal only if all the $x_i$ are the same. The point is that if the numbers $x_i$ are spread out, the squares of the big ones (positive or negative) contribute more to the average of the squares than if we had averaged them out before squaring. The difference

$\langle x^2 \rangle - \langle x \rangle^2$

is called the variance; it says how spread out our numbers are. The square root of the variance is the standard deviation:

$\sigma_x = \sqrt{\langle x^2 \rangle - \langle x \rangle^2 }$

and this has the slight advantage that if you multiply all the numbers $x_i$ by some constant $c,$ the standard deviation gets multiplied by $|c|.$ (The variance gets multiplied by $c^2.$)

We can generalize the variance to a situation where we have two lists of numbers:

$x = (x_1, \dots, x_n)$

$y = (y_1, \dots, y_n)$

Namely, we can form the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle$

This reduces to the variance when $x = y.$ It measures how much $x$ and $y$ vary together — ‘hand in hand’, as it were. A bit more precisely: if $x_i$ is greater than its mean value mainly for $i$ such that $y_i$ is greater than its mean value, the covariance is positive. On the other hand, if $x_i$ tends to be greater than average when $y_i$ is smaller than average — like with the air pressures at Darwin and Tahiti — the covariance will be negative.

For example, if

$x = (1,-1), \quad y = (1,-1)$

then they ‘vary hand in hand’, and the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = 1 - 0 = 1$

is positive. But if

$x = (1,-1), \quad y = (-1,1)$

then one is positive when the other is negative, so the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = -1 - 0 = -1$

is negative.

Of course the covariance will get bigger if we multiply both $x$ and $y$ by some big number. If we don’t want this effect, we can normalize the covariance and get the correlation:

$\displaystyle{ \frac{ \langle x y \rangle - \langle x \rangle \langle y \rangle }{\sigma_x \sigma_y} }$

which will always be between $-1$ and $1.$

For example, if we compute the correlation between the air pressure anomalies at Darwin and Tahiti, measured monthly from 1866 to 2012, we get
-0.253727. This indicates that when one goes up, the other tends to go down. But since we’re not getting -1, it means they’re not completely locked into a linear relationship where one is some negative number times the other.

Okay, we’re almost ready for continuous wavelet transforms! Here is the main thing we need to know. If the mean of either $x$ or $y$ is zero, the formula for covariance simplifies a lot, to

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

So, this quantity says how much the numbers $x_i$ ‘vary hand in hand’ with the numbers $y_i,$ in the special case when one (or both) has mean zero.

We can do something similar if $x, y : \mathbb{R} \to \mathbb{R}$ are functions of time defined for all real numbers $t.$ The sum becomes an integral, and we have to give up on dividing by $n.$ We get:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t)\; d t }$

This is called the inner product of the functions $x$ and $y,$ and often it’s written $\langle x, y \rangle,$ but it’s a lot like the covariance.

### Continuous wavelet transforms

What are continuous wavelet transforms, and why should we care?

People have lots of tricks for studying ‘signals’, like series of numbers $x_i$ or functions $x : \mathbb{R} \to \mathbb{R}.$ One method is to ‘transform’ the signal in a way that reveals useful information. The Fourier transform decomposes a signal into sines and cosines of different frequencies. This lets us see how much power the signal has at different frequencies, but it doesn’t reveal how the power at different frequencies changes with time. For that we should use something else, like the Gabor transform explained by Blake Pollard in a previous post.

Sines and cosines are great, but we might want to look for other patterns in a signal. A ‘continuous wavelet transform’ lets us scan a signal for appearances of a given pattern at different times and also at different time scales: a pattern could go by quickly, or in a stretched out slow way.

To implement the continuous wavelet transform, we need a signal and a pattern to look for. The signal could be a function $x : \mathbb{R} \to \mathbb{R}.$ The pattern would then be another function $y: \mathbb{R} \to \mathbb{R},$ usually called a wavelet.

Here’s an example of a wavelet:

If we’re in a relaxed mood, we could call any function that looks like a bump with wiggles in it a wavelet. There are lots of famous wavelets, but this particular one is the fourth derivative of a certain Gaussian. Mathematica calls this particular wavelet DGaussianWavelet[4], and you can look up the formula under ‘Details’ on their webpage.

However, the exact formula doesn’t matter at all now! If we call this wavelet $y,$ all that matters is that it’s a bump with wiggles on it, and that its mean value is 0, or more precisely:

$\displaystyle{ \int_{-\infty}^\infty y(t) \; d t = 0 }$

As we saw in the last section, this fact lets us take our function $x$ and the wavelet $y$ and see how much they ‘vary hand it hand’ simply by computing their inner product:

$\displaystyle{ \langle x , y \rangle = \int_{-\infty}^\infty x(t) y(t)\; d t }$

Loosely speaking, this measures the ‘amount of $y$-shaped wiggle in the function $x$’. It’s amazing how hard it is to say something in plain English that perfectly captures the meaning of a simple formula like the above one—so take the quoted phrase with a huge grain of salt. But it gives a rough intuition.

Our wavelet $y$ happens to be centered at $t = 0.$ However, we might be interested in $y$-shaped wiggles that are centered not at zero but at some other number $s.$ We could detect these by shifting the function $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t-s)\; d t }$

We could also be interested in measuring the amount of some stretched-out or squashed version of a $y$-shaped wiggle in the function $x.$ Again we could do this by changing $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t}{P}\right) \; d t }$

When $P$ is big, we get a stretched-out version of $y.$ People sometimes call $P$ the period, since the period of the wiggles in $y$ will be proportional to this (though usually not equal to it).

Finally, we can combine these ideas, and compute

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t- s}{P}\right)\; dt }$

This is a function of the shift $s$ and period $P$ which says how much of the $s$-shifted, $P$-stretched wavelet $y$ is lurking in the function $x.$ It’s a version of the continuous wavelet transform!

Mathematica implements this idea for time series, meaning lists of numbers $x = (x_1,\dots,x_n)$ instead of functions $x : \mathbb{R} \to \mathbb{R}.$ The idea is that we think of the numbers as samples of a function $x$:

$x_1 = x(\Delta t)$

$x_2 = x(2 \Delta t)$

and so on, where $\Delta t$ is some time step, and replace the integral above by a suitable sum. Mathematica has a function ContinuousWaveletTransform that does this, giving

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \sum_{i = 1}^n x_i \; y\left(\frac{i \Delta t - s}{P}\right) }$

The factor of $1/\sqrt{P}$ in front is a useful extra trick: it’s the right way to compensate for the fact that when you stretch out out your wavelet $y$ by a factor of $P,$ it gets bigger. So, when we’re doing integrals, we should define the continuous wavelet transform of $y$ by:

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \int_{-\infty}^\infty x(t) y(\frac{t- s}{P})\; dt }$

### The results

Dara Shayda started with the air pressure anomaly at Darwin and Tahiti, measured monthly from 1866 to 2012. Taking DGaussianWavelet[4] as his wavelet, he computed the continuous wavelet transform $w(s,P)$ as above. To show us the answer, he created a scalogram:

This is a 2-dimensional color plot showing roughly how big the continuous wavelet transform $w(s,P)$ is for different shifts $s$ and periods $P.$ Blue means it’s very small, green means it’s bigger, yellow means even bigger and red means very large.

Tahiti gave this:

You’ll notice that the patterns at Darwin and Tahiti are similar in character, but notably different in detail. For example, the red spots, where our chosen wavelet shows up strongly with period of order ~100 months, occur at different times.

Puzzle 1. What is the meaning of the ‘spikes’ in these scalograms? What sort of signal would give a spike of this sort?

Puzzle 2. Do a Gabor transform, also known as a ‘windowed Fourier transform’, of the same data. Blake Pollard explained the Gabor transform in his article Milankovitch vs the Ice Ages. This is a way to see how much a signal wiggles at a given frequency at a given time: we multiply the signal by a shifted Gaussian and then takes its Fourier transform.

Puzzle 3. Read about continuous wavelet transforms. If we want to reconstruct our signal $x$ from its continuous wavelet transform, why should we use a wavelet $y$ with

$\displaystyle{\int_{-\infty}^\infty y(t) \; d t = 0 ? }$

In fact we want a somewhat stronger condition, which is implied by the above equation when the Fourier transform of $y$ is smooth and integrable:

Continuous wavelet transform, Wikipedia.

### Another way to understand correlations

David Tweed mentioned another approach from signal processing to understanding the quantity

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

If we’ve got two lists of data $x$ and $y$ that we want to compare to see if they behave similarly, the first thing we ought to do is multiplicatively scale each one so they’re of comparable magnitude. There are various possibilities for assigning a scale, but a reasonable one is to ensure they have equal ‘energy’

$\displaystyle{ \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 }$

(This can be achieved by dividing each list by its standard deviation, which is equivalent to what was done in the main derivation above.) Once we’ve done that then it’s clear that looking at

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

gives small values when they have a very good match and progressively bigger values as they become less similar. Observe that

$\begin{array}{ccl} \displaystyle{\sum_{i=1}^n (x_i-y_i)^2 } &=& \displaystyle{ \sum_{i=1}^n (x_i^2 - 2 x_i y_i + y_i^2) }\\ &=& \displaystyle{ \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 } \end{array}$

Since we’ve scaled things so that $\sum_{i=1}^n x_i^2$ and $\sum_{i=1}^n y_i^2$ are constants, we can see that when $\sum_{i=1}^n x_i y_i$ becomes bigger,

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

becomes smaller. So,

$\displaystyle{\sum_{i=1}^n x_i y_i}$

serves as a measure of how close the lists are, under these assumptions.

## The Stochastic Resonance Program (Part 1)

10 May, 2014

guest post by David Tanzer

At the Azimuth Code Project, we are aiming to produce educational software that is relevant to the Earth sciences and the study of climate. Our present software takes the form of interactive web pages, which allow you to experiment with the parameters of models and view their outputs. But to fully understand the meaning of a program, we need to know about the concepts and theories that inform it. So we will be writing articles to explain both the programs themselves and the math and science behind them.

In this two-part series, I’ll explain this program:

Check it out—it runs on your browser! It was created by Allan Erskine and Glyn Adgie. In the Azimuth blog article Increasing the Signal-to-Noise Ratio with More Noise, Glyn Adgie and Tim van Beek give a nice explanation of the idea of stochastic resonance, which includes some clear and exciting graphs.

My goal today is give a compact, developer-oriented introduction to stochastic resonance, which will set the context for the next blog article, where I’ll dissect the program itself. By way of introduction, I am a software developer with research training in computer science. It’s a new area for me, and any clarifications will be welcome!

### The concept of stochastic resonance

Stochastic resonance is a phenomenon, occurring under certain circumstances, in which a noise source may amplify the effect of a weak signal. This concept was used in an early hypothesis about the timing of ice-age cycles, and has since been applied to a wide range of phenomena, including neuronal detection mechanisms and patterns of traffic congestion.

Suppose we have a signal detector whose internal, analog state is driven by an input signal, and suppose the analog states are partitioned into two regions, called “on” and “off” — this is a digital state, abstracted from the analog state. With a light switch, we could take the force as the input signal, the angle as the analog state, and the up/down classification of the angle as the digital state.

Consider the effect of a periodic input signal on the digital state. Suppose the wave amplitude is not large enough to change the digital state, yet large enough to drive the analog state close to the digital state boundary. Then, a bit of random noise, occurring near the peak of an input cycle, may “tap” the system over to the other digital state. So we will see a probability of state-transitions that is synchronized with the input signal. In a complex way, the noise has amplified the input signal.

But it’s a pretty funky amplifier! Here is a picture from the Azimuth library article on stochastic resonance:

Stochastic resonance has been found in the signal detection mechanisms of neurons. There are, for example, cells in the tails of crayfish that are tuned to low-frequency signals in the water caused by predator motions. These signals are too weak to cross the firing threshold for the neurons, but with the right amount of noise, they do trigger the neurons.

See:

Stochastic resonance, Azimuth Library.

Stochastic resonance in neurobiology, David Lyttle.

### Bistable stochastic resonance and Milankovitch theories of ice-age cycles

Stochastic resonance was originally formulated in terms of systems that are bistable — where each digital state is the basin of attraction of a stable equilibrium.

An early application of stochastic resonance was to a hypothesis, within the framework of bistable climate dynamics, about the timing of the ice-age cycles. Although it has not been confirmed, it remains of interest (1) historically, (2) because the timing of ice-age cycles remains an open problem, and (3) because the Milankovitch hypothesis upon which it rests is an active part of the current research.

In the bistable model, the climate states are a cold, “snowball” Earth and a hot, iceless Earth. The snowball Earth is stable because it is white, and hence reflects solar energy, which keeps it frozen. The iceless Earth is stable because it is dark, and hence absorbs solar energy, which keeps it melted.

The Milankovitch hypothesis states that the drivers of climate state change are long-duration cycles in the insolation — the solar energy received in the northern latitudes — caused by periodic changes in the Earth’s orbital parameters. The north is significant because that is where the glaciers are concentrated, and so a sufficient “pulse” in northern temperatures could initiate a state change.

Three relevant astronomical cycles have been identified:

• Changing of the eccentricity of the Earth’s elliptical orbit, with a period of 100 kiloyears

• Changing of the obliquity (tilt) of the Earth’s axis, with a period of 41 kiloyears

• Precession (swiveling) of the Earth’s axis, with a period of 23 kiloyears

In the stochastic resonance hypothesis, the Milankovitch signal is amplified by random events to produce climate state changes. In more recent Milankovitch theories, a deterministic forcing mechanism is used. In a theory by Didier Paillard, the climate is modeled with three states, called interglacial, mild glacial and full glacial, and the state changes depend on the volume of ice as well as the insolation.

See:

Milankovitch cycle, Azimuth Library.

Mathematics of the environment (part 10), John Baez. This gives an exposition of Paillard’s theory.

### Bistable systems defined by a potential function

Any smooth function with two local minima can be used to define a bistable system. For instance, consider the function $V(x) = x^4/4 - x^2/2$:

To define the bistable system, construct a differential equation where the time derivative of x is set to the negative of the derivative of the potential at x:

$dx/dt = -V'(x) = -x^3 + x = x(1 - x^2)$

So, for instance, where the potential graph is sloping upward as $x$ increases, $-V'(x)$ is negative, and this sends $X(t)$ ‘downhill’ towards the minimum.

The roots of $V'(x)$ yield stable equilibria at 1 and -1, and an unstable equilibrium at 0. The latter separates the basins of attraction for the stable equilibria.

### Discrete stochastic resonance

Now let’s look at a discrete-time model which exhibits stochastic resonance. This is the model used in the Azimuth demo program.

We construct the discrete-time derivative, using the potential function, a sampled sine wave, and a normally distributed random number:

$\Delta X_t = -V'(X_t) * \Delta t + \mathrm{Wave}(t) + \mathrm{Noise}(t) =$
$X_t (1 - X_t^2) \Delta t + \alpha * \sin(\omega t) + \beta * \mathrm{GaussianSample}(t)$

where $\Delta t$ is a constant and $t$ is restricted to multiples of $\Delta t.$

This equation is the discrete-time counterpart to a continuous-time stochastic differential equation.

Next time, we will look into the Azimuth demo program itself.

## Noether’s Theorem: Quantum vs Stochastic

3 May, 2014

guest post by Ville Bergholm

In 1915 Emmy Noether discovered an important connection between the symmetries of a system and its conserved quantities. Her result has become a staple of modern physics and is known as Noether’s theorem.

The theorem and its generalizations have found particularly wide use in quantum theory. Those of you following the Network Theory series here on Azimuth might recall Part 11 where John Baez and Brendan Fong proved a version of Noether’s theorem for stochastic systems. Their result is now published here:

• John Baez and Brendan Fong, A Noether theorem for stochastic mechanics, J. Math. Phys. 54:013301 (2013).

One goal of the network theory series here on Azimuth has been to merge ideas appearing in quantum theory with other disciplines. John and Brendan proved their stochastic version of Noether’s theorem by exploiting ‘stochastic mechanics’ which was formulated in the network theory series to mathematically resemble quantum theory. Their result, which we will outline below, was different than what would be expected in quantum theory, so it is interesting to try to figure out why.

Recently Jacob Biamonte, Mauro Faccin and myself have been working to try to get to the bottom of these differences. What we’ve done is prove a version of Noether’s theorem for Dirichlet operators. As you may recall from Parts 16 and 20 of the network theory series, these are the operators that generate both stochastic and quantum processes. In the language of the series, they lie in the intersection of stochastic and quantum mechanics. So, they are a subclass of the infinitesimal stochastic operators considered in John and Brendan’s work.

The extra structure of Dirichlet operators—compared with the wider class of infinitesimal stochastic operators—provided a handle for us to dig a little deeper into understanding the intersection of these two theories. By the end of this article, astute readers will be able to prove that Dirichlet operators generate doubly stochastic processes.

Before we get into the details of our proof, let’s recall first how conservation laws work in quantum mechanics, and then contrast this with what John and Brendan discovered for stochastic systems. (For a more detailed comparison between the stochastic and quantum versions of the theorem, see Part 13 of the network theory series.)

### The quantum case

I’ll assume you’re familiar with quantum theory, but let’s start with a few reminders.

In standard quantum theory, when we have a closed system with $n$ states, the unitary time evolution of a state $|\psi(t)\rangle$ is generated by a self-adjoint $n \times n$ matrix $H$ called the Hamiltonian. In other words, $|\psi(t)\rangle$ satisfies Schrödinger’s equation:

$i \hbar \displaystyle{\frac{d}{d t}} |\psi(t) \rangle = H |\psi(t) \rangle.$

The state of a system starting off at time zero in the state $|\psi_0 \rangle$ and evolving for a time $t$ is then given by

$|\psi(t) \rangle = e^{-i t H}|\psi_0 \rangle.$

The observable properties of a quantum system are associated with self-adjoint operators. In the state $|\psi \rangle,$ the expected value of the observable associated to a self-adjoint operator $O$ is

$\langle O \rangle_{\psi} = \langle \psi | O | \psi \rangle$

This expected value is constant in time for all states if and only if $O$ commutes with the Hamiltonian $H$:

$[O, H] = 0 \quad \iff \quad \displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0 \quad \forall \: |\psi_0 \rangle, \forall t.$

In this case we say $O$ is a ‘conserved quantity’. The fact that we have two equivalent conditions for this is a quantum version of Noether’s theorem!

### The stochastic case

In stochastic mechanics, the story changes a bit. Now a state $|\psi(t)\rangle$ is a probability distribution: a vector with $n$ nonnegative components that sum to 1. Schrödinger’s equation gets replaced by the master equation:

$\displaystyle{\frac{d}{d t}} |\psi(t) \rangle = H |\psi(t) \rangle$

If we start with a probability distribution $|\psi_0 \rangle$ at time zero and evolve it according to this equation, at any later time have

$|\psi(t)\rangle = e^{t H} |\psi_0 \rangle.$

We want this always be a probability distribution. To ensure that this is so, the Hamiltonian $H$ must be infinitesimal stochastic: that is, a real-valued $n \times n$ matrix where the off-diagonal entries are nonnegative and the entries of each column sum to zero. It no longer needs to be self-adjoint!

When $H$ is infinitesimal stochastic, the operators $e^{t H}$ map the set of probability distributions to itself whenever $t \ge 0,$ and we call this family of operators a continuous-time Markov process, or more precisely a Markov semigroup.

In stochastic mechanics, we say an observable $O$ is a real diagonal $n \times n$ matrix, and its expected value is given by

$\langle O\rangle_{\psi} = \langle \hat{O} | \psi \rangle$

where $\hat{O}$ is the vector built from the diagonal entries of $O.$ More concretely,

$\langle O\rangle_{\psi} = \displaystyle{ \sum_i O_{i i} \psi_i }$

where $\psi_i$ is the $i$th component of the vector $|\psi\rangle.$

Here is a version of Noether’s theorem for stochastic mechanics:

Noether’s Theorem for Markov Processes (Baez–Fong). Suppose $H$ is an infinitesimal stochastic operator and $O$ is an observable. Then

$[O,H] =0$

if and only if

$\displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0$

and

$\displaystyle{\frac{d}{d t}}\langle O^2 \rangle_{\psi(t)} = 0$

for all $t \ge 0$ and all $\psi(t)$ obeying the master equation.   █

So, just as in quantum mechanics, whenever $[O,H]=0$ the expected value of $O$ will be conserved:

$\displaystyle{\frac{d}{d t}} \langle O\rangle_{\psi(t)} = 0$

for any $\psi_0$ and all $t \ge 0.$ However, John and Brendan saw that—unlike in quantum mechanics—you need more than just the expectation value of the observable $O$ to be constant to obtain the equation $[O,H]=0.$ You really need both

$\displaystyle{\frac{d}{d t}} \langle O\rangle_{\psi(t)} = 0$

together with

$\displaystyle{\frac{d}{d t}} \langle O^2\rangle_{\psi(t)} = 0$

for all initial data $\psi_0$ to be sure that $[O,H]=0.$

So it’s a bit subtle, but symmetries and conserved quantities have a rather different relationship than they do in quantum theory.

### A Noether theorem for Dirichlet operators

But what if the infinitesimal generator of our Markov semigroup is also self-adjoint? In other words, what if $H$ is both an infinitesimal stochastic matrix but also its own transpose: $H = H^\top$? Then it’s called a Dirichlet operator… and we found that in this case, we get a stochastic version of Noether’s theorem that more closely resembles the usual quantum one:

Noether’s Theorem for Dirichlet Operators. If $H$ is a Dirichlet operator and $O$ is an observable, then

$[O, H] = 0 \quad \iff \quad \displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = 0 \quad \forall \: |\psi_0 \rangle, \forall t \ge 0$

Proof. The $\Rightarrow$ direction is easy to show, and it follows from John and Brendan’s theorem. The point is to show the $\Leftarrow$ direction. Since $H$ is self-adjoint, we may use a spectral decomposition:

$H = \displaystyle{ \sum_k E_k |\phi_k \rangle \langle \phi_k |}$

where $\phi_k$ are an orthonormal basis of eigenvectors, and $E_k$ are the corresponding eigenvalues. We then have:

$\displaystyle{\frac{d}{d t}} \langle O \rangle_{\psi(t)} = \langle \hat{O} | H e^{t H} |\psi_0 \rangle = 0 \quad \forall \: |\psi_0 \rangle, \forall t \ge 0$

$\iff \quad \langle \hat{O}| H e^{t H} = 0 \quad \forall t \ge 0$

$\iff \quad \sum_k \langle \hat{O} | \phi_k \rangle E_k e^{t E_k} \langle \phi_k| = 0 \quad \forall t \ge 0$

$\iff \quad \langle \hat{O} | \phi_k \rangle E_k e^{t E_k} = 0 \quad \forall t \ge 0$

$\iff \quad |\hat{O} \rangle \in \mathrm{Span}\{|\phi_k \rangle \, : \; E_k = 0\} = \ker \: H,$

where the third equivalence is due to the vectors $|\phi_k \rangle$ being linearly independent. For any infinitesimal stochastic operator $H$ the corresponding transition graph consists of $m$ connected components iff we can reorder (permute) the states of the system such that $H$ becomes block-diagonal with $m$ blocks. Now it is easy to see that the kernel of $H$ is spanned by $m$ eigenvectors, one for each block. Since $H$ is also symmetric, the elements of each such vector can be chosen to be ones within the block and zeros outside it. Consequently

$|\hat{O} \rangle \in \ker \: H$

implies that we can choose the basis of eigenvectors of $O$ to be the vectors $|\phi_k \rangle,$ which implies

$[O, H] = 0$

Alternatively,

$|\hat{O} \rangle \in \ker \, H$

implies that

$|\hat{O^2} \rangle \in \ker \: H \; \iff \; \cdots \; \iff \; \displaystyle{\frac{d}{d t}} \langle O^2 \rangle_{\psi(t)} = 0 \; \forall \: |\psi_0 \rangle, \forall t \ge 0,$

where we have used the above sequence of equivalences backwards. Now, using John and Brendan’s original proof, we can obtain $[O, H] = 0.$   █

In summary, by restricting ourselves to the intersection of quantum and stochastic generators, we have found a version of Noether’s theorem for stochastic mechanics that looks formally just like the quantum version! However, this simplification comes at a cost. We find that the only observables $O$ whose expected value remains constant with time are those of the very restricted type described above, where the observable has the same value in every state in a connected component.

### Puzzles

Suppose we have a graph whose graph Laplacian matrix $H$ generates a Markov semigroup as follows:

$U(t) = e^{t H}$

Puzzle 1. Suppose that also $H = H^\top,$ so that $H$ is a Dirichlet operator and hence $i H$ generates a 1-parameter unitary group. Show that the indegree and outdegree of any node of our graph must be equal. Graphs with this property are called balanced.

Puzzle 2. Suppose that $U(t) = e^{t H}$ is doubly stochastic Markov semigroup, meaning that for all $t \ge 0$ each row and each column of $U(t)$ sums to 1:

$\displaystyle{ \sum_i U(t)_{i j} = \sum_j U(t)_{i j} = 1 }$

and all the matrix entries are nonnegative. Show that the Hamiltonian $H$ obeys

$\displaystyle{\sum_i H_{i j} = \sum_j H_{i j} = 0 }$

and all the off-diagonal entries of $H$ are nonnegative. Show the converse is also true.

Puzzle 3. Prove that any doubly stochastic Markov semigroup $U(t)$ is of the form $e^{t H}$ where $H$ is the graph Laplacian of a balanced graph.

Puzzle 4. Let $O(t)$ be a possibly time-dependent observable, and write $\langle O(t) \rangle_{\psi(t)}$ for its expected value with respect to some initial state $\psi_0$ evolving according to the master equation. Show that

$\displaystyle{ \frac{d}{d t}\langle O(t)\rangle_{\psi(t)} = \left\langle [O(t), H] \right\rangle_{\psi(t)} + \left\langle \frac{\partial O(t)}{\partial t}\right\rangle_{\psi(t)} }$

This is a stochastic version of the Ehrenfest theorem.

## Network Theory III

16 March, 2014

In the last of my Oxford talks I explain how entropy and relative entropy can be understood using certain categories related to probability theory… and how these categories also let us understand Bayesian networks!

The first two parts are explanations of these papers:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss

• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy.

Somewhere around here the talk was interrupted by a fire drill, waking up the entire audience!

By the way, in my talk I mistakenly said that relative entropy is a continuous functor; in fact it’s just lower semicontinuous. I’ve fixed this in my slides.

The third part of my talk was my own interpretation of Brendan Fong’s master’s thesis:

• Brendan Fong, Causal Theories: a Categorical Perspective on Bayesian Networks.

I took a slightly different approach, by saying that a causal theory $\mathcal{C}_G$ is the free category with products on certain objects and morphisms coming from a directed acyclic graph $G$. In his thesis he said $\mathcal{C}_G$ was the free symmetric monoidal category where each generating object is equipped with a cocommutative comonoid structure. This is close to a category with finite products, though perhaps not quite the same: a symmetric monoidal category where every object is equipped with a cocommutative comonoid structure in a natural way (i.e., making a bunch of squares commute) is a category with finite products. It would be interesting to see if this difference hurts or helps.

By making this slight change, I am claiming that causal theories can be seen as algebraic theories in the sense of Lawvere. This would be a very good thing, since we know a lot about those.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

## Network Theory II

12 March, 2014

Chemists are secretly doing applied category theory! When chemists list a bunch of chemical reactions like

C + O₂ → CO₂

they are secretly describing a ‘category’.

That shouldn’t be surprising. A category is simply a collection of things called objects together with things called morphisms going from one object to another, often written

f: x → y

The rules of a category say:

1) we can compose a morphism f: x → y and another morphism g: y → z to get an arrow gf: x → z,

2) (hg)f = h(gf), so we don’t need to bother with parentheses when composing arrows,

3) every object x has an identity morphism 1ₓ: x → x that obeys 1ₓ f = f and f 1ₓ = f.

Whenever we have a bunch of things (objects) and processes (arrows) that take one thing to another, we’re likely to have a category. In chemistry, the objects are bunches of molecules and the arrows are chemical reactions. But we can ‘add’ bunches of molecules and also add reactions, so we have something more than a mere category: we have something called a symmetric monoidal category.

My talk here, part of a series, is an explanation of this viewpoint and how we can use it to take ideas from elementary particle physics and apply them to chemistry! ﻿For more details try this free book:

• John Baez and Jacob Biamonte, A Course on Quantum Techniques for Stochastic Mechanics.

as well as this paper on the Anderson–Craciun–Kurtz theorem (discussed in my talk):

• John Baez and Brendan Fong, Quantum techniques for studying equilibrium in reaction networks.

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!