## Information Geometry (Part 16)

14 January, 2016

joint with Blake Pollard

Lately we’ve been thinking about open Markov processes. These are random processes where something can hop randomly from one state to another (that’s the ‘Markov process’ part) but also enter or leave the system (that’s the ‘open’ part).

The ultimate goal is to understand the nonequilibrium thermodynamics of open systems—systems where energy and maybe matter flows in and out. If we could understand this well enough, we could understand in detail how life works. That’s a difficult job! But one has to start somewhere, and this is one place to start.

We have a few papers on this subject:

• Blake Pollard, A Second Law for open Markov processes. (Blog article here.)

• John Baez, Brendan Fong and Blake Pollard, A compositional framework for Markov processes. (Blog article here.)

• Blake Pollard, Open Markov processes: A compositional perspective on non-equilibrium steady states in biology. (Blog article here.)

However, right now we just want to show you three closely connected results about how relative entropy changes in open Markov processes.

### Definitions

An open Markov process consists of a finite set $X$ of states, a subset $B \subseteq X$ of boundary states, and an infinitesimal stochastic operator $H: \mathbb{R}^X \to \mathbb{R}^X,$ meaning a linear operator with

$H_{ij} \geq 0 \ \ \text{for all} \ \ i \neq j$

and

$\sum_i H_{ij} = 0 \ \ \text{for all} \ \ j$

For each state $i \in X$ we introduce a population $p_i \in [0,\infty).$ We call the resulting function $p : X \to [0,\infty)$ the population distribution.

Populations evolve in time according to the open master equation:

$\displaystyle{ \frac{dp_i}{dt} = \sum_j H_{ij}p_j} \ \ \text{for all} \ \ i \in X-B$

$p_i(t) = b_i(t) \ \ \text{for all} \ \ i \in B$

So, the populations $p_i$ obey a linear differential equation at states $i$ that are not in the boundary, but they are specified ‘by the user’ to be chosen functions $b_i$ at the boundary states. The off-diagonal entry $H_{ij}$ for $i \neq j$ describe the rate at which population transitions from the $j$th to the $i$th state.

A closed Markov process, or continuous-time discrete-state Markov chain, is an open Markov process whose boundary is empty. For a closed Markov process, the open master equation becomes the usual master equation:

$\displaystyle{ \frac{dp}{dt} = Hp }$

In a closed Markov process the total population is conserved:

$\displaystyle{ \frac{d}{dt} \sum_{i \in X} p_i = \sum_{i,j} H_{ij}p_j = 0 }$

This lets us normalize the initial total population to 1 and have it stay equal to 1. If we do this, we can talk about probabilities instead of populations. In an open Markov process, population can flow in and out at the boundary states.

For any pair of distinct states $i,j,$ $H_{ij}p_j$ is the flow of population from $j$ to $i.$ The net flux of population from the $j$th state to the $i$th state is the flow from $j$ to $i$ minus the flow from $i$ to $j$:

$J_{ij} = H_{ij}p_j - H_{ji}p_i$

A steady state is a solution of the open master equation that does not change with time. A steady state for a closed Markov process is typically called an equilibrium. So, an equilibrium obeys the master equation at all states, while for a steady state this may not be true at the boundary states. The idea is that population can flow in or out at the boundary states.

We say an equilibrium $p : X \to [0,\infty)$ of a Markov process is detailed balanced if all the net fluxes vanish:

$J_{ij} = 0 \ \ \text{for all} \ \ i,j \in X$

or in other words:

$H_{ij}p_j = H_{ji}p_i \ \ \text{for all} \ \ i,j \in X$

Given two population distributions $p, q : X \to [0,\infty)$ we can define the relative entropy

$\displaystyle{ I(p,q) = \sum_i p_i \ln \left( \frac{p_i}{q_i} \right)}$

When $q$ is a detailed balanced equilibrium solution of the master equation, the relative entropy can be seen as the ‘free energy’ of $p.$ For a precise statement, see Section 4 of Relative entropy in biological systems.

The Second Law of Thermodynamics implies that the free energy of a closed system tends to decrease with time, so for closed Markov processes we expect $I(p,q)$ to be nonincreasing. And this is true! But for open Markov processes, free energy can flow in from outside. This is just one of several nice results about how relative entropy changes with time.

### Results

Theorem 1. Consider an open Markov process with $X$ as its set of states and $B$ as the set of boundary states. Suppose $p(t)$ and $q(t)$ obey the open master equation, and let the quantities

$\displaystyle{ \frac{Dp_i}{Dt} = \frac{dp_i}{dt} - \sum_{j \in X} H_{ij}p_j }$

$\displaystyle{ \frac{Dq_i}{Dt} = \frac{dq_i}{dt} - \sum_{j \in X} H_{ij}q_j }$

measure how much the time derivatives of $p_i$ and $q_i$ fail to obey the master equation. Then we have

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i, j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right)} \\ \\ && \; + \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt} } \end{array}$

This result separates the change in relative entropy change into two parts: an ‘internal’ part and a ‘boundary’ part.

It turns out the ‘internal’ part is always less than or equal to zero. So, from Theorem 1 we can deduce a version of the Second Law of Thermodynamics for open Markov processes:

Theorem 2. Given the conditions of Theorem 1, we have

$\displaystyle{ \frac{d}{dt} I(p(t),q(t)) \; \le \; \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt} }$

Intuitively, this says that free energy can only increase if it comes in from the boundary!

There is another nice result that holds when $q$ is an equilibrium solution of the master equation. This idea seems to go back to Schnakenberg:

Theorem 3. Given the conditions of Theorem 1, suppose also that $q$ is an equilibrium solution of the master equation. Then we have

$\displaystyle{ \frac{d}{dt} I(p(t),q) = -\frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} \; + \; \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} }$

where

$J_{ij} = H_{ij}p_j - H_{ji}p_i$

is the net flux from $j$ to $i,$ while

$\displaystyle{ A_{ij} = \ln \left(\frac{p_j q_i}{p_i q_j} \right) }$

is the conjugate thermodynamic force.

The flux $J_{ij}$ has a nice meaning: it’s the net flow of population from $j$ to $i.$ The thermodynamic force is a bit subtler, but this theorem reveals its meaning: it says how much the population wants to flow from $j$ to $i.$

More precisely, up to that factor of $1/2,$ the thermodynamic force $A_{ij}$ says how much free energy loss is caused by net flux from $j$ to $i.$ There’s a nice analogy here to water losing potential energy as it flows downhill due to the force of gravity.

### Proofs

Proof of Theorem 1. We begin by taking the time derivative of the relative information:

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i \in X} \frac{\partial I}{\partial p_i} \frac{dp_i}{dt} + \frac{\partial I}{\partial q_i} \frac{dq_i}{dt} } \end{array}$

We can separate this into a sum over states $i \in X - B,$ for which the time derivatives of $p_i$ and $q_i$ are given by the master equation, and boundary states $i \in B,$ for which they are not:

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i \in X-B, \; j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j }\\ \\ && + \; \; \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{dp_i}{dt} + \frac{\partial I}{\partial q_i} \frac{dq_i}{dt}} \end{array}$

For boundary states we have

$\displaystyle{ \frac{dp_i}{dt} = \frac{Dp_i}{Dt} + \sum_{j \in X} H_{ij}p_j }$

and similarly for the time derivative of $q_i.$ We thus obtain

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i,j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j }\\ \\ && + \; \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt}} \end{array}$

To evaluate the first sum, recall that

$\displaystyle{ I(p,q) = \sum_{i \in X} p_i \ln (\frac{p_i}{q_i})}$

so

$\displaystyle{\frac{\partial I}{\partial p_i}} =\displaystyle{1 + \ln (\frac{p_i}{q_i})} , \qquad \displaystyle{ \frac{\partial I}{\partial q_i}}= \displaystyle{- \frac{p_i}{q_i} }$

Thus, we have

$\displaystyle{ \sum_{i,j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j = \sum_{i,j\in X} (1 + \ln (\frac{p_i}{q_i})) H_{ij} p_j - \frac{p_i}{q_i} H_{ij} q_j }$

We can rewrite this as

$\displaystyle{ \sum_{i,j \in X} H_{ij} p_j \left( 1 + \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) }$

Since $H_{ij}$ is infinitesimal stochastic we have $\sum_{i} H_{ij} = 0,$ so the first term drops out, and we are left with

$\displaystyle{ \sum_{i,j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) }$

as desired.   █

Proof of Theorem 2. Thanks to Theorem 1, to prove

$\displaystyle{ \frac{d}{dt} I(p(t),q(t)) \; \le \; \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt} }$

it suffices to show that

$\displaystyle{ \sum_{i,j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) \le 0 }$

or equivalently (recalling the proof of Theorem 1):

$\displaystyle{ \sum_{i,j} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) \le 0 }$

The last two terms on the left hand side cancel when $i = j.$ Thus, if we break the sum into an $i \ne j$ part and an $i = j$ part, the left side becomes

$\displaystyle{ \sum_{i \ne j} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) \; + \; \sum_j H_{jj} p_j \ln(\frac{p_j}{q_j}) }$

Next we can use the infinitesimal stochastic property of $H$ to write $H_{jj}$ as the sum of $-H_{ij}$ over $i$ not equal to $j,$ obtaining

$\displaystyle{ \sum_{i \ne j} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) - \sum_{i \ne j} H_{ij} p_j \ln(\frac{p_j}{q_j}) } =$

$\displaystyle{ \sum_{i \ne j} H_{ij} p_j \left( \ln(\frac{p_iq_j}{p_j q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) }$

Since $H_{ij} \ge 0$ when $i \ne j$ and $\ln(s) + 1 - s \le 0$ for all $s > 0,$ we conclude that this quantity is $\le 0.$   █

Proof of Theorem 3. Now suppose also that $q$ is an equilibrium solution of the master equation. Then $Dq_i/Dt = dq_i/dt = 0$ for all states $i,$ so by Theorem 1 we need to show

$\displaystyle{ \sum_{i, j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) \; = \; -\frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} }$

We also have $\sum_{j \in X} H_{ij} q_j = 0,$ so the second
term in the sum at left vanishes, and it suffices to show

$\displaystyle{ \sum_{i, j \in X} H_{ij} p_j \ln(\frac{p_i}{q_i}) \; = \; - \frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} }$

By definition we have

$\displaystyle{ \frac{1}{2} \sum_{i,j} J_{ij} A_{ij}} = \displaystyle{ \frac{1}{2} \sum_{i,j} \left( H_{ij} p_j - H_{ji}p_i \right) \ln \left( \frac{p_j q_i}{p_i q_j} \right) }$

This in turn equals

$\displaystyle{ \frac{1}{2} \sum_{i,j} H_{ij}p_j \ln \left( \frac{p_j q_i}{p_i q_j} \right) - \frac{1}{2} \sum_{i,j} H_{ji}p_i \ln \left( \frac{p_j q_i}{p_i q_j} \right) }$

and we can switch the dummy indices $i,j$ in the second sum, obtaining

$\displaystyle{ \frac{1}{2} \sum_{i,j} H_{ij}p_j \ln \left( \frac{p_j q_i}{p_i q_j} \right) - \frac{1}{2} \sum_{i,j} H_{ij}p_j \ln \left( \frac{p_i q_j}{p_j q_i} \right) }$

or simply

$\displaystyle{ \sum_{i,j} H_{ij} p_j \ln \left( \frac{p_j q_i}{p_i q_j} \right) }$

But this is

$\displaystyle{ \sum_{i,j} H_{ij} p_j \left(\ln ( \frac{p_j}{q_j}) + \ln (\frac{q_i}{p_i}) \right) }$

and the first term vanishes because $H$ is infinitesimal stochastic: $\sum_i H_{ij} = 0.$ We thus have

$\displaystyle{ \frac{1}{2} \sum_{i,j} J_{ij} A_{ij}} = \sum_{i,j} H_{ij} p_j \ln (\frac{q_i}{p_i} )$

as desired.   █

21 December, 2015

A few years ago, after hearing Susan Holmes speak about the mathematics of phylogenetic trees, I became interested in their connection to algebraic topology. I wrote an article about this here:

• John Baez, Operads and the tree of life, 6 July 2011.

In trying to the make the ideas precise I recruited the help of Nina Otter, who was then a graduate student at ETH Zürich. She came to Riverside and we started to work together.

Now Nina is a grad student at Oxford working on mathematical biology with Heather Harrington. I visited her last summer and we made more progress… but then she realized that our paper needed another big theorem, a result relating our topology on the space of phylogenetic trees to the topology described by Susan Holmes and her coauthors here:

• Louis J. Billera, Susan P. Holmes and Karen Vogtmann, Geometry of the space of phylogenetic trees, Advances in Applied Mathematics 27 (2001), 733–767.

It took another half year to finish things up. I could never have done this myself.

But now we’re done! Here’s our paper:

• John Baez and Nina Otter, Operads and phylogenetic trees.

### The basic idea

Trees are important, not only in mathematics, but also biology. The most important is the ‘tree of life’ relating all organisms that have ever lived on Earth. Darwin drew this sketch of it in 1837:

He wrote about it in On the Origin of Species

The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green and budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species. At each period of growth all the growing twigs have tried to branch out on all sides, and to overtop and kill the surrounding twigs and branches, in the same manner as species and groups of species have at all times overmastered other species in the great battle for life.

Now we know that the tree of life is not really a tree in the mathematical sense. One reason is ‘endosymbiosis’: the incorporation of one organism together with its genetic material into another, as probably happened with the mitochondria in our cells and also the plastids that hold chlorophyll in plants. Another is ‘horizontal gene transfer’: the passing of genetic material from one organism to another, which happens frequently with bacteria. So, the tree of life is really a thicket:

But a tree is often a good approximation, especially for animals and plants in the last few hundred million years. Biologists who try to infer phylogenetic trees from present-day genetic data often use simple models where:

• the genotype of each species follows a random walk, but

• species branch in two at various times.

These are called ‘Markov models’. The simplest Markov model for DNA evolution is the Jukes–Cantor model. Consider one or more pieces of DNA having a total of n base pairs. We can think of this as a string of letters chosen from the set {A,T,C,G}:

…ATCGATTGAGCTCTAGCG…

As time passes, the Jukes–Cantor model says the DNA changes randomly, with each base pair having the same constant rate of randomly flipping to any other. So, we get a Markov process on the set

$X = \{\textrm{A,T,C,G}\}{}^N$

However, a species can also split in two. So, given current-day genetic data from various species, biologists try to infer the most probable tree where, starting from a common ancestor, the DNA in question undergoes a random walk most of the time but branches in two at certain times.

To formalize this, we can define a concept of ‘phylogenetic tree’. Our work is based on the definition of Billera, Holmes and Vogtmann, though we use a slightly different definition, for reasons that will soon become clear. For us, a phylogenetic tree is a rooted tree with leaves labelled by numbers $1,2, \dots, n$ and edges labelled by ‘times’ or, geometrically speaking, ‘lengths’ in $[0, \infty).$ We require that:

• the length of every edge is positive, except perhaps for ‘external edges’: that is, edges incident to the leaves or root;

• there are no 1-ary vertices.

For example, here is a phylogenetic tree with 5 leaves:

where $\ell_1, \dots, \ell_6 \ge 0$ but we demand that $\ell_7 > 0.$ We draw the vertices as dots. We do not count the leaves and the root as vertices, and we label the root with the number $0.$ We cannot collapse edges of length zero that end at leaves, since doing so would eliminate those leaves. Also note that the embedding of the tree in the plane is irrelevant, so this counts as the same phylogenetic tree:

In applications to biology, we are often interested in trees where the total distance from the root to the leaf is the same for every leaf, since all species have evolved for the same time from their common ancestor. These are mathematically interesting as well, because then the distance between any two leaves defines an ultrametric on the set of leaves. However, more general phylogenetic trees are also interesting—and they become essential when we construct an operad whose operations are phylogenetic trees.

Let $\textrm{Phyl}_n$ be the set of phylogenetic trees with n leaves. This has a natural topology, which we explain in Section 6 of our paper. For example, here is a continuous path in $\textrm{Phyl}_4$ where we only change the length of one internal edge, reducing it until it becomes zero and we can collapse it:

Phylogenetic trees reconstructed by biologists are typically binary. When a phylogenetic tree appears to have higher arity, sometimes we merely lack sufficient data to resolve a higher-arity branching into a number of binary ones. With the topology we are using on $\textrm{Phyl}_n,$ binary trees form an open dense set of $\textrm{Phyl}_n,$ except for $\textrm{Phyl}_1.$ However, trees of higher arity are still important, because paths, paths of paths, etc. in $\textrm{Phyl}_n$ are often forced to pass through trees of higher arity.

Billera, Holmes and Vogtmann focused their attention on the set $\mathcal{T}_n$ of phylogenetic trees where lengths of the external edges—edges incident to the root and leaves—are fixed to a constant value. They endow $\mathcal{T}_n$ with a metric, which induces a topology on $\mathcal{T}_n,$ and there is a homeomorphism

$\textrm{Phyl}_n \cong \mathcal{T}_n \times [0,\infty)^{n+1}$

where the data in $[0,\infty)^{n+1}$ describe the lengths of the external edges in a general phylogenetic tree.

In algebraic topology, trees are often used to describe the composition of n-ary operations. This is formalized in the theory of operads. An ‘operad’ is an algebraic stucture where for each natural number $n=0,1,2,\dots$ we have a set $O_n$ whose elements are considered as abstract n-ary operations, not necessarily operating on anything yet. An element $f \in O_n$ can be depicted as a planar tree with one vertex and n labelled leaves:

We can compose these operations in a tree-like way to get new operations:

and an associative law holds, making this sort of composite unambiguous:

There are various kinds of operads, but in this paper our operads will always be ‘unital’, having an operation $1 \in O_1$ that acts as an identity for composition. They will also be ‘symmetric’, meaning there is an action of the symmetric group $S_n$ on each set $O_n,$ compatible with composition. Further, they will be ‘topological’, meaning that each set $O_n$ is a topological space, with composition and permutations acting as continuous maps.

In Section 5 we prove that there is an operad $\textrm{Phyl},$ the ‘phylogenetic operad’, whose space of n-ary operations is $\textrm{Phyl}_n.$ This raises a number of questions:

• What is the mathematical nature of this operad?

• How is it related to ‘Markov processes with branching’?

• How is it related to known operads in topology?

Briefly, the answer is that $\textrm{Phyl}$ is the coproduct of $\textrm{Com},$ the operad for commutative topological semigroups, and $[0,\infty),$ the operad having only unary operations, one for each $t \in [0,\infty).$ The first describes branching, the second describes Markov processes. Moreover, $\textrm{Phyl}$ is closely related to the Boardmann–Vogt $W$ construction applied to $\textrm{Com}.$ This is a construction that Boardmann and Vogt applied to another operad in order to obtain an operad whose algebras are loop spaces.

To understand all this in more detail, first recall that the raison d’être of operads is to have ‘algebras’. The most traditional sort of algebra of an operad $O$ is a topological space $X$ on which each operation $f \in O_n$ acts as a
continuous map

$\alpha(f)\colon X^n \to X$

obeying some conditions: composition, the identity, and the permutation group actions are preserved, and $\alpha(f)$ depends continuously on $f.$ The idea is that the abstract operations in $O$ are realized as actual operations on the space $X.$

In this paper we instead need algebras of a linear sort. Such an algebra of $O$ is a finite-dimensional real vector space $V$ on which each operation $f \in O_n$ acts as a multilinear map

$\alpha(f)\colon V^n \to X$

obeying the same list of conditions. We can also think of $\alpha(f)$ as a linear map

$\alpha(f)\colon V^{\otimes n} \to V$

where $V^{\otimes n}$ is the nth tensor power of $V.$

We also need ‘coalgebras’ of operads. The point is that while ordinarily one thinks of an operation $f \in O_n$ as having n inputs and one output, a phylogenetic tree is better thought of as having one input and n outputs. A coalgebra of $O$ is a finite-dimensional real vector space $V$ on which operation $f \in O_n$ gives a linear map

$\alpha(f) \colon V \to V^{\otimes n}$

obeying the same conditions as an algebra, but ‘turned around’.

The main point of this paper is that the phylogenetic operad has interesting coalgebras, which correspond to how phylogenetic trees are actually used to describe branching Markov processes in biology. But to understand this, we need to start by looking at coalgebras of two operads from which the phylogenetic operad is built.

By abuse of notation, we will use $[0,\infty)$ as the name for the operad having only unary operations, one for each $t \in [0,\infty),$ with composition of operations given by addition. A coalgebra of $[0,\infty)$ is a finite-dimensional real vector space $V$ together with for each $t \in [0,\infty)$ a linear map

$\alpha(t) \colon V \to V$

such that:

$\alpha(s+t) = \alpha(s) \alpha(t)$ for all $s,t \in [0,\infty)$

$\alpha(0) = 1_V$

$\alpha(t)$ depends continuously on $t.$

Analysts usually call such a thing a ‘continuous one-parameter semigroup’ of operators on $V.$

Given a finite set $X,$ a ‘Markov process’ or ‘continuous-time Markov chain’ on $X$ is a continuous one-parameter semigroup of operators on $\mathbb{R}^X$ such that if $f \in \mathbb{R}^X$ is a probability distribution on $X,$ so is $\alpha(t) f$ for all $t \in [0,\infty).$ Equivalently, if we think of $\alpha(t)$ as a $X \times X$ matrix of real numbers, we demand that its entries be nonnegative and each column sum to 1. Such a matrix is called ‘stochastic’. If $X$ is a set of possible sequences of base pairs, a Markov process on $X$ describes the random changes of DNA with the passage of time. We shall see that any Markov process on $X$ makes $\mathbb{R}^X$ into a coalgebra of $[0,\infty).$

This handles the Markov process aspect of DNA evolution; what about the branching? For this we use $\textrm{Com},$ the unique operad with one n-ary operation for each n > 0. Algebras of $\textrm{Com}$ are not-necessarily-unital commutative algebras: there is only one way to multiply n elements for n > 0.

For us what matters most is that coalgebras of $\textrm{Com}$ are finite-dimensional cocommutative coalgebras, not necessarily with counit. If $X$ is a finite set, there is a cocommutative coalgebra whose underlying vector space is $\mathbb{R}^X.$ The unique n-ary operation of $\textrm{Com}$ acts as the linear map

$\displaystyle{ \Delta_n \colon \mathbb{R}^X \to \mathbb{R}^X \otimes \cdots \otimes \mathbb{R}^X \; \cong \;\mathbb{R}^{X^n} }$

where
$\Delta_n (f)(x_1, \dots, x_n) = \left\{ \begin{array}{cl} f(x) & \textrm{if } x_1 = \cdots = x_n = x \\ \\ 0 & \textrm{otherwise} \end{array} \right.$

This map describes the ‘n-fold duplication’ of a probability distribution $f$ on the set $X$ of possible genes. We use this map to say what takes place when a species branches:

Next, we wish to describe how to combine the operads $[0,\infty)$ and $\textrm{Com}$ to obtain the phylogenetic operad. Any pair of operads $O$ and $O'$ has a coproduct $O + O'.$ The definition of coproduct gives an easy way to understand the algebras of $O + O'.$ Such an algebra is simply an object that is both an algebra of $O$ and an algebra of $O',$ with no compatibility conditions imposed.

One can also give an explicit construction of $O + O'.$ Briefly, the n-ary operations of $O + O'$ are certain equivalence classes of trees with leaves labelled $\{1,\dots, n\}$ and all nodes, except for leaves, labelled by operations in $O$ or $O'.$ While this fact is surely known to experts on operads, it seems hard to find in the literature, so we prove this in Theorem 24.

Given this, it should come as no surprise that the operad $\textrm{Phyl}$ is the coproduct $\textrm{Com} + [0,\infty).$ In fact, we shall take this as a definition. Starting from this definition, we work backwards to show that the operations of $\textrm{Phyl}$ correspond to phylogenetic trees. We prove this in Theorem 28. The definition of coproduct determines a topology on the spaces $\textrm{Phyl}_n,$ and it is a nontrivial fact that with this topology we have $\textrm{Phyl}_n \cong \mathcal{T}_n \times [0,\infty)^{n+1}$ for n > 1, where $\mathcal{T}_n$ has the topology defined by Billera, Holmes and Vogtmann. We prove this in Theorem 34.

Using the definition of the phylogenetic operad as a coproduct, it is clear that given any Markov process on a finite set $X,$ the vector space $\mathbb{R}^X$ naturally becomes a coalgebra of this operad. The reason is that, as we have seen, $\mathbb{R}^X$ is automatically a coalgebra of $\textrm{Com},$ and the Markov process makes it into a coalgebra of $[0,\infty).$ Thus, by the universal property of a coproduct, it becomes a coalgebra of $\textrm{Phyl} \cong \textrm{Com} + [0,\infty).$ We prove this in Theorem 36.

Since operads arose in algebraic topology, it is interesting to consider how the phylogenetic operad connects to ideas from that subject. Boardmann and Vogt defined a construction on operads, the ‘$W$ construction’, which when applied to the operad for spaces with an associative multiplication gives an operad for loop spaces. The operad $\textrm{Phyl}$ has an interesting relation to $W(\textrm{Com}).$ To see this, define addition on $[0,\infty]$ in the obvious way, where

$\infty + t = t + \infty = \infty$

Then $[0,\infty]$ becomes a commutative topological monoid, so we obtain an operad with only unary operations, one for each $t \in [0,\infty],$ where composition is addition. By abuse of notation, let us call this operad $[0,\infty].$

Boardmann and Vogt’s $W$ construction involves trees with edges having lengths in $[0,1],$ but we can equivalently use $[0,\infty].$ Leinster has observed that for any nonsymmetric topological operad $O,$ Boardmann and Vogt’s operad $W(O)$ is closely related to $O + [0,\infty].$ Here we make this observation precise in the symmetric case. Operations in $\textrm{Com} + [0,\infty]$ are just like phylogenetic trees except that edges may have length $\infty.$ Moreover, for any operad $O,$ the operad $W(O)$ is a non-unital suboperad of $O + [0,\infty].$ An operation of $O + [0,\infty]$ lies in $W(O)$ if and only if all the external edges of the corresponding tree have length $\infty.$ We prove this in Theorem 40.

Berger and Moerdijk showed that if $S_n$ acts freely on $O_n$ and $O_1$ is well-pointed, then $W(O)$ is a cofibrant replacement for $O.$ This is true for $O = \textrm{Assoc},$ the operad whose algebras are topological semigroups. This cofibrancy is why Boardmann and Vogt could use $W(\textrm{Assoc})$ as an operad for loop spaces. But $S_n$ does not act freely on $\textrm{Com}_n,$ and $W(\textrm{Com})$ is not a cofibrant replacement for $\textrm{Com}.$ So, it is not an operad for infinite loop spaces.

Nonetheless, the larger operad $\textrm{Com} + [0,\infty],$ a compactification of $\textrm{Phyl} = \textrm{Com} + [0,\infty),$ is somewhat interesting. The reason is that any Markov process

$\alpha \colon [0,\infty) \to \mathrm{End}(\mathbb{R}^X)$

approaches a limit as $t \to \infty.$ Indeed, $\alpha$ extends uniquely to a homomorphism from the topological monoid $[0,\infty]$ to $\mathrm{End}(\mathbb{R}^X).$ Thus, given a Markov process on a finite set $X,$ the vector space $\mathbb{R}^X$ naturally becomes a coalgebra of $\textrm{Com} + [0,\infty].$ We prove this in Theorem 38.

## Relative Entropy in Biological Systems

27 November, 2015

Here’s a paper for the proceedings of a workshop on Information and Entropy in Biological System this spring:

• John Baez and Blake Pollard, Relative entropy in biological systems, with Blake S. Pollard, Entropy 18 (2016), 46.

We’d love any comments or questions you might have. I’m not happy with the title. In the paper we advocate using the term ‘relative information’ instead of ‘relative entropy’—yet the latter is much more widely used, so I feel we need it in the title to let people know what the paper is about!

Here’s the basic idea.

Life relies on nonequilibrium thermodynamics, since in thermal equilibrium there are no flows of free energy. Biological systems are also open systems, in the sense that both matter and energy flow in and out of them. Nonetheless, it is important in biology that systems can sometimes be treated as approximately closed, and sometimes approach equilibrium before being disrupted in one way or another. This can occur on a wide range of scales, from large ecosystems to within a single cell or organelle. Examples include:

• A population approaching an evolutionarily stable state.

• Random processes such as mutation, genetic drift, the diffusion of organisms in an environment or the diffusion of molecules in a liquid.

• A chemical reaction approaching equilibrium.

An interesting common feature of these processes is that as they occur, quantities mathematically akin to entropy tend to increase. Closely related quantities such as free energy tend to decrease. In this review, we explain some mathematical results that make this idea precise.

Most of these results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We’ll use the first term. Given two probability distributions $p$ and $q$ on a finite set $X$, their relative information, or more precisely the information of $p$ relative to $q$, is

$\displaystyle{ I(p\|q) = \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right) }$

We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that $I(p\|q)$ decreases with time under various conditions. The reason is that the Shannon entropy

$\displaystyle{ S(p) = -\sum_{i \in X} p_i \ln p_i }$

contains a minus sign that is missing from the definition of relative information.

Intuitively, $I(p\|q)$ is the amount of information gained when we start with a hypothesis given by some probability distribution $q$ and then learn the ‘true’ probability distribution $p$. For example, if we start with the hypothesis that a coin is fair and then are told that it landed heads up, the relative information is $\ln 2$, so we have gained 1 bit of information. If however we started with the hypothesis that the coin always lands heads up, we would have gained no information.

We put the word ‘true’ in quotes here, because the notion of a ‘true’ probability distribution, which subjective Bayesians reject, is not required to use relative information. A more cautious description of relative information is that it is a divergence: a way of measuring the difference between probability distributions that obeys

$I(p \| q) \ge 0$

and

$I(p \| q) = 0 \iff p = q$

but not necessarily the other axioms for a distance function, symmetry and the triangle inequality, which indeed fail for relative information.

There are many other divergences besides relative information, some of which we discuss in Section 6. However, relative information can be singled out by a number of characterizations, including one based on ideas from Bayesian inference. The relative information is also close to the expected number of extra bits required to code messages distributed according to the probability measure $p$ using a code optimized for messages distributed according to $q$.

In this review, we describe various ways in which a population or probability distribution evolves continuously according to some differential equation. For all these differential equations, I describe conditions under which relative information decreases. Briefly, the results are as follows. We hasten to reassure the reader that our paper explains all the jargon involved, and the proofs of the claims are given in full:

• In Section 2, we consider a very general form of the Lotka–Volterra equations, which are a commonly used model of population dynamics. Starting from the population $P_i$ of each type of replicating entity, we can define a probability distribution

$p_i = \displaystyle{\frac{P_i}{\sum_{i \in X} P_i }}$

which evolves according to a nonlinear equation called the replicator equation. We describe a necessary and sufficient condition under which $I(q\|p(t))$ is nonincreasing when $p(t)$ evolves according to the replicator equation while $q$ is held fixed.

• In Section 3, we consider a special case of the replicator equation that is widely studied in evolutionary game theory. In this case we can think of probability distributions as mixed strategies in a two-player game. When $q$ is a dominant strategy, $I(q\|p(t))$ can never increase when $p(t)$ evolves according to the replicator equation. We can think of $I(q\|p(t))$ as the information that the population has left to learn. Thus, evolution is analogous to a learning process—an analogy that in the field of artificial intelligence is exploited by evolutionary algorithms!

• In Section 4 we consider continuous-time, finite-state Markov processes. Here we have probability distributions on a finite set $X$ evolving according to a linear equation called the master equation. In this case $I(p(t)\|q(t))$ can never increase. Thus, if $q$ is a steady state solution of the master equation, both $I(p(t)\|q)$ and $I(q\|p(t))$ are nonincreasing. We can always write $q$ as the Boltzmann distribution for some energy function $E : X \to \mathbb{R}$, meaning that

$\displaystyle{ q_i = \frac{\exp(-E_i / k T)}{\displaystyle{\sum_{j \in X} \exp(-E_j / k T)}} }$

where $T$ is temperature and $k$ is Boltzmann’s constant. In this case, $I(p(t)\|q)$ is proportional to a difference of free energies:

$\displaystyle{ I(p(t)\|q) = \frac{F(p) - F(q)}{T} }$

Thus, the nonincreasing nature of $I(p(t)\|q)$ is a version of the Second Law of Thermodynamics.

• In Section 5, we consider chemical reactions and other processes described by reaction networks. In this context we have populations $P_i$ of entities of various kinds $i \in X$, and these populations evolve according to a nonlinear equation called the rate equation. We can generalize relative information from probability distributions to populations by setting

$\displaystyle{ I(P\|Q) = \sum_{i \in X} P_i \ln\left(\frac{P_i}{Q_i}\right) - \left(P_i - Q_i\right) }$

If $Q$ is a special sort of steady state solution of the rate equation, called a complex balanced equilibrium, $I(P(t)\|Q)$ can never increase when $P(t)$ evolves according to the rate equation.

• Finally, in Section 6, we consider a class of functions called $f$-divergences which include relative information as a special case. For any convex function $f : [0,\infty) \to [0,\infty)$, the f-divergence of two probability distributions $p, q : X \to [0,1]$ is given by

$\displaystyle{ I_f(p\|q) = \sum_{i \in X} q_i f\left(\frac{p_i}{q_i}\right)}$

Whenever $p(t)$ and $q(t)$ are probability distributions evolving according to the master equation of some Markov process, $I_f(p(t)\|q(t))$ is nonincreasing. The $f$-divergence is also well-defined for populations, and nonincreasing for two populations that both evolve according to the master equation.

## A Compositional Framework for Markov Processes

4 September, 2015

This summer my students Brendan Fong and Blake Pollard visited me at the Centre for Quantum Technologies, and we figured out how to understand open continuous-time Markov chains! I think this is a nice step towards understanding the math of living systems.

Admittedly, it’s just a small first step. But I’m excited by this step, since Blake and I have been trying to get this stuff to work for a couple years, and it finally fell into place. And we think we know what to do next.

Here’s our paper:

• John C. Baez, Brendan Fong and Blake S. Pollard, A compositional framework for open Markov processes.

And here’s the basic idea…

### Open detailed balanced Markov processes

A continuous-time Markov chain is a way to specify the dynamics of a population which is spread across some finite set of states. Population can flow between the states. The larger the population of a state, the more rapidly population flows out of the state. Because of this property, under certain conditions the populations of the states tend toward an equilibrium where at any state the inflow of population is balanced by its outflow.

In applications to statistical mechanics, we are often interested in equilibria such that for any two states connected by an edge, say $i$ and $j,$ the flow from $i$ to $j$ equals the flow from $j$ to $i.$ A continuous-time Markov chain with a chosen equilibrium having this property is called ‘detailed balanced‘.

I’m getting tired of saying ‘continuous-time Markov chain’, so from now on I’ll just say ‘Markov process’, just because it’s shorter. Okay? That will let me say the next sentence without running out of breath:

Our paper is about open detailed balanced Markov processes.

Here’s an example:

The detailed balanced Markov process itself consists of a finite set of states together with a finite set of edges between them, with each state $i$ labelled by an equilibrium population $q_i >0,$ and each edge $e$ labelled by a rate constant $r_e > 0.$

These populations and rate constants are required to obey an equation called the ‘detailed balance condition’. This equation means that in equilibrium, the flow from $i$ to $j$ equal the flow from $j$ to $i.$ Do you see how it works in this example?

To get an ‘open’ detailed balanced Markov process, some states are designated as inputs or outputs. In general each state may be specified as both an input and an output, or as inputs and outputs multiple times. See how that’s happening in this example? It may seem weird, but it makes things work better.

People usually say Markov processes are all about how probabilities flow from one state to another. But we work with un-normalized probabilities, which we call ‘populations’, rather than probabilities that must sum to 1. The reason is that in an open Markov process, probability is not conserved: it can flow in or out at the inputs and outputs. We allow it to flow both in and out at both the input states and the output states.

Our most fundamental result is that there’s a category $\mathrm{DetBalMark}$ where a morphism is an open detailed balanced Markov process. We think of it as a morphism from its inputs to its outputs.

We compose morphisms in $\mathrm{DetBalMark}$ by identifying the output states of one open detailed balanced Markov process with the input states of another. The populations of identified states must match. For example, we may compose this morphism $N$:

with the previously shown morphism $M$ to get this morphism $M \circ N$:

And here’s our second most fundamental result: the category $\mathrm{DetBalMark}$ is actually a dagger compact category. This lets us do other stuff with open Markov processes. An important one is ‘tensoring’, which lets us take two open Markov processes like $M$ and $N$ above and set them side by side, giving $M \otimes N$:

The so-called compactness is also important. This means we can take some inputs of an open Markov process and turn them into outputs, or vice versa. For example, using the compactness of $\mathrm{DetBalMark}$ we can get this open Markov process from $M$:

In fact all the categories in our paper are dagger compact categories, and all our functors preserve this structure. Dagger compact categories are a well-known framework for describing systems with inputs and outputs, so this is good.

### The analogy to electrical circuits

In a detailed balanced Markov process, population can flow along edges. In the detailed balanced equilibrium, without any flow of population from outside, the flow along from state $i$ to state $j$ will be matched by the flow back from $j$ to $i.$ The populations need to take specific values for this to occur.

In an electrical circuit made of linear resistors, charge can flow along wires. In equilibrium, without any driving voltage from outside, the current along each wire will be zero. The potentials will be equal at every node.

This sets up an analogy between detailed balanced continuous-time Markov chains and electrical circuits made of linear resistors! I love analogy charts, so this makes me very happy:

 Circuits Detailed balanced Markov processes potential population current flow conductance rate constant power dissipation

This analogy is already well known. Schnakenberg used it in his book Thermodynamic Network Analysis of Biological Systems. So, our main goal is to formalize and exploit it. This analogy extends from systems in equilibrium to the more interesting case of nonequilibrium steady states, which are the main topic of our paper.

Earlier, Brendan and I introduced a way to ‘black box’ a circuit and define the relation it determines between potential-current pairs at the input and output terminals. This relation describes the circuit’s external behavior as seen by an observer who can only perform measurements at the terminals.

An important fact is that black boxing is ‘compositional’: if one builds a circuit from smaller pieces, the external behavior of the whole circuit can be determined from the external behaviors of the pieces. For category theorists, this means that black boxing is a functor!

Our new paper with Blake develops a similar ‘black box functor’ for detailed balanced Markov processes, and relates it to the earlier one for circuits.

When you black box a detailed balanced Markov process, you get the relation between population–flow pairs at the terminals. (By the ‘flow at a terminal’, we more precisely mean the net population outflow.) This relation holds not only in equilibrium, but also in any nonequilibrium steady state. Thus, black boxing an open detailed balanced Markov process gives its steady state dynamics as seen by an observer who can only measure populations and flows at the terminals.

### The principle of minimum dissipation

At least since the work of Prigogine, it’s been widely accepted that a large class of systems minimize entropy production in a nonequilibrium steady state. But people still fight about the the precise boundary of this class of systems, and even the meaning of this ‘principle of minimum entropy production’.

For detailed balanced open Markov processes, we show that a quantity we call the ‘dissipation’ is minimized in any steady state. This is a quadratic function of the populations and flows, analogous to the power dissipation of a circuit made of resistors. We make no claim that this quadratic function actually deserves to be called ‘entropy production’. Indeed, Schnakenberg has convincingly argued that they are only approximately equal.

But still, the ‘dissipation’ function is very natural and useful—and Prigogine’s so-called ‘entropy production’ is also a quadratic function.

### Black boxing

I’ve already mentioned the category $\mathrm{DetBalMark},$ where a morphism is an open detailed balanced Markov process. But our paper needs two more categories to tell its story! There’s the category of circuits, and the category of linear relations.

A morphism in the category $\mathrm{Circ}$ is an open electrical circuit made of resistors: that is, a graph with each edge labelled by a ‘conductance’ $c_e > 0,$ together with specified input and output nodes:

A morphism in the category $\mathrm{LinRel}$ is a linear relation $L : U \leadsto V$ between finite-dimensional real vector spaces $U$ and $V.$ This is nothing but a linear subspace $L \subseteq U \oplus V.$ Just as relations generalize functions, linear relations generalize linear functions!

In our previous paper, Brendan and I introduced these two categories and a functor between them, the ‘black box functor’:

$\blacksquare : \mathrm{Circ} \to \mathrm{LinRel}$

The idea is that any circuit determines a linear relation between the potentials and net current flows at the inputs and outputs. This relation describes the behavior of a circuit of resistors as seen from outside.

Our new paper introduces a black box functor for detailed balanced Markov processes:

$\square : \mathrm{DetBalMark} \to \mathrm{LinRel}$

We draw this functor as a white box merely to distinguish it from the other black box functor. The functor $\square$ maps any detailed balanced Markov process to the linear relation obeyed by populations and flows at the inputs and outputs in a steady state. In short, it describes the steady state behavior of the Markov process ‘as seen from outside’.

How do we manage to black box detailed balanced Markov processes? We do it using the analogy with circuits!

### The analogy becomes a functor

Every analogy wants to be a functor. So, we make the analogy between detailed balanced Markov processes and circuits precise by turning it into a functor:

$K : \mathrm{DetBalMark} \to \mathrm{Circ}$

This functor converts any open detailed balanced Markov process into an open electrical circuit made of resistors. This circuit is carefully chosen to reflect the steady-state behavior of the Markov process. Its underlying graph is the same as that of the Markov process. So, the ‘states’ of the Markov process are the same as the ‘nodes’ of the circuit.

Both the equilibrium populations at states of the Markov process and the rate constants labelling edges of the Markov process are used to compute the conductances of edges of this circuit. In the simple case where the Markov process has exactly one edge from any state $i$ to any state $j,$ the rule is this:

$C_{i j} = H_{i j} q_j$

where:

$q_j$ is the equilibrium population of the $j$th state of the Markov process,

$H_{i j}$ is the rate constant for the edge from the $j$th state to the $i$th state of the Markov process, and

$C_{i j}$ is the conductance (that is, the reciprocal of the resistance) of the wire from the $j$th node to the $i$th node of the resulting circuit.

The detailed balance condition for Markov processes says precisely that the matrix $C_{i j}$ is symmetric! This is just right for an electrical circuit made of resistors, since it means that the resistance of the wire from node $i$ to node $j$ equals the resistance of the same wire in the reverse direction, from node $j$ to node $i.$

### A triangle of functors

If you paid careful attention, you’ll have noticed that I’ve described a triangle of functors:

And if you know anything about how category theorists think, you’ll be wondering if this diagram commutes.

In fact, this triangle of functors does not commute! However, a general lesson of category theory is that we should only expect diagrams of functors to commute up to natural isomorphism, and this is what happens here:

The natural transformation $\alpha$ ‘corrects’ the black box functor for resistors to give the one for detailed balanced Markov processes.

The functors $\square$ and $\blacksquare \circ K$ are actually equal on objects. An object in $\mathrm{DetBalMark}$ is a finite set $X$ with each element $i \in X$ labelled a positive populations $q_i.$ Both functors map this object to the vector space $\mathbb{R}^X \oplus \mathbb{R}^X.$ For the functor $\square,$ we think of this as a space of population-flow pairs. For the functor $\blacksquare \circ K,$ we think of it as a space of potential-current pairs. The natural transformation $\alpha$ then gives a linear relation

$\alpha_{X,q} : \mathbb{R}^X \oplus \mathbb{R}^X \leadsto \mathbb{R}^X \oplus \mathbb{R}^X$

in fact an isomorphism of vector spaces, which converts potential-current pairs into population-flow pairs in a manner that depends on the $q_i.$ I’ll skip the formula; it’s in the paper.

But here’s the key point. The naturality of $\alpha$ actually allows us to reduce the problem of computing the functor $\square$ to the problem of computing $\blacksquare.$ Suppose

$M: (X,q) \to (Y,r)$

is any morphism in $\mathrm{DetBalMark}.$ The object $(X,q)$ is some finite set $X$ labelled by populations $q,$ and $(Y,r)$ is some finite set $Y$ labelled by populations $r.$ Then the naturality of $\alpha$ means that this square commutes:

Since $\alpha_{X,q}$ and $\alpha_{Y,r}$ are isomorphisms, we can solve for the functor $\square$ as follows:

$\square(M) = \alpha_Y \circ \blacksquare K(M) \circ \alpha_X^{-1}$

This equation has a clear intuitive meaning! It says that to compute the behavior of a detailed balanced Markov process, namely $\square(f),$ we convert it into a circuit made of resistors and compute the behavior of that, namely $\blacksquare K(f).$ This is not equal to the behavior of the Markov process, but we can compute that behavior by converting the input populations and flows into potentials and currents, feeding them into our circuit, and then converting the outputs back into populations and flows.

### What we really do

So that’s a sketch of what we do, and I hope you ask questions if it’s not clear. But I also hope you read our paper! Here’s what we actually do in there. After an introduction and summary of results:

• Section 3 defines open Markov processes and the open master equation.

• Section 4 introduces detailed balance for open Markov
processes.

• Section 5 recalls the principle of minimum power
for open circuits made of linear resistors, and explains how to black box them.

• Section 6 introduces the principle of minimum dissipation for open detailed balanced Markov processes, and describes how to black box these.

• Section 7 states the analogy between circuits and detailed balanced Markov processes in a formal way.

• Section 8 describes how to compose open Markov processes, making them into the morphisms of a category.

• Section 9 does the same for detailed balanced Markov processes.

• Section 10 describes the ‘black box functor’ that sends any open detailed balanced Markov process to the linear relation describing its external behavior, and recalls the similar functor for circuits.

• Section 11 makes the analogy between between open detailed balanced Markov processes and open circuits even more formal, by making it into a functor. We prove that together with the two black box functors, this forms a triangle that commutes up to natural isomorphism.

• Section 12 is about geometric aspects of this theory. We show that the linear relations in the image of these black box functors are Lagrangian relations between symplectic vector spaces. We also show that the master equation can be seen as a gradient flow equation.

• Section 13 is a summary of what we have learned.

Finally, Appendix A is a quick tutorial on decorated cospans. This is a key mathematical tool in our work, developed by Brendan in an earlier paper.

## The Game of Googol

20 July, 2015

Here’s a puzzle from a recent issue of Quanta, an online science magazine:

Puzzle 1: I write down two different numbers that are completely unknown to you, and hold one in my left hand and one in my right. You have absolutely no idea how I generated these two numbers. Which is larger?

You can point to one of my hands, and I will show you the number in it. Then you can decide to either select the number you have seen or switch to the number you have not seen, held in the other hand. Is there a strategy that will give you a greater than 50% chance of choosing the larger number, no matter which two numbers I write down?

At first it seems the answer is no. Whatever number you see, the other number could be larger or smaller. There’s no way to tell. So obviously you can’t get a better than 50% chance of picking the hand with the largest number—even if you’ve seen one of those numbers!

But “obviously” is not a proof. Sometimes “obvious” things are wrong!

It turns out that, amazingly, the answer to the puzzle is yes! You can find a strategy to do better than 50%. But the strategy uses randomness. So, this puzzle is a great illustration of the power of randomness.

If you want to solve it yourself, stop now or read Quanta magazine for some clues—they offered a small prize for the best answer:

• Pradeep Mutalik, Can information rise from randomness?, Quanta, 7 July 2015.

Greg Egan gave a nice solution in the comments to this magazine article, and I’ll reprint it below along with two followup puzzles. So don’t look down there unless you want a spoiler.

I should add: the most common mistake among educated readers seems to be assuming that the first player, the one who chooses the two numbers, chooses them according to some probability distribution. Don’t assume that. They are simply arbitrary numbers.

### The history of this puzzle

I’d seen this puzzle before—do you know who invented it? On G+, Hans Havermann wrote:

I believe the origin of this puzzle goes back to (at least) John Fox and Gerald Marnie’s 1958 betting game ‘Googol’. Martin Gardner mentioned it in his February 1960 column in Scientific American. Wikipedia mentions it under the heading ‘Secretary problem’. Gardner suggested that a variant of the game was proposed by Arthur Cayley in 1875.﻿

Actually the game of Googol is a generalization of the puzzle that we’ve been discussing. Martin Gardner explained it thus:

Ask someone to take as many slips of paper as he pleases, and on each slip write a different positive number. The numbers may range from small fractions of 1 to a number the size of a googol (1 followed by a hundred 0s) or even larger. These slips are turned face down and shuffled over the top of a table. One at a time you turn the slips face up. The aim is to stop turning when you come to the number that you guess to be the largest of the series. You cannot go back and pick a previously turned slip. If you turn over all the slips, then of course you must pick the last one turned.

So, the puzzle I just showed you is the special case when there are just 2 slips of paper. I seem to recall that Gardner incorrectly dismissed this case as trivial!

There’s been a lot of work on Googol. Julien Berestycki writes:

• Alexander V. Gnedin, A solution to the game of Googol, Annals of Probability (1994), 1588–1595.

One of the many beautiful ideas in this paper is that it asks what is the best strategy for the guy who writes the numbers! It also cites a paper by Gnedin and Berezowskyi (of oligarchic fame). ﻿

### Egan’s solution

Okay, here is Greg Egan’s solution, paraphrased a bit:

Pick some function $f : \mathbb{R} \to \mathbb{R}$ such that:

$\displaystyle{ \lim_{x \to -\infty} f(x) = 0 }$

$\displaystyle{ \lim_{x \to +\infty} f(x) = 1 }$

$f$ is strictly increasing: if $x > y$ then $f(x) > f(y)$

There are lots of functions like this, for example

$\displaystyle{f(x) = \frac{e^x}{e^x + 1} }$

Next, pick one of the first player’s hands at random. If the number you are shown is $a,$ compute $f(a).$ Then generate a uniformly distributed random number $z$ between 0 and 1. If $z$ is less than or equal to $f(a)$ guess that $x$ is the larger number, but if $z$ is greater than $f(a)$ guess that the larger number is in the other hand.

The probability of guessing correctly can be calculated as the probability of seeing the larger number initially and then, correctly, sticking with it, plus the probability of seeing the smaller number initially and then, correctly, choosing the other hand.

Say the larger number is $x$ and the smaller one is $y.$ Then the probability of guessing correctly is

$\frac{1}{2} f(x) + \frac{1}{2} (1 - f(y)) = \frac{1}{2} + \frac{1}{2} (f(x) - f(y))$

This is strictly greater than $\frac{1}{2}$ since $x > y$ so $f(x) - f(y) > 0$.

So, you have a more than 50% chance of winning! But as you play the game, there’s no way to tell how much more than 50%. If the numbers on the other players hands are very large, or very small, your chance will be just slightly more than 50%.

### Followup puzzles

Here are two more puzzles:

Puzzle 2: Prove that no deterministic strategy can guarantee you have a more than 50% chance of choosing the larger number.

Puzzle 3: There are perfectly specific but ‘algorithmically random’ sequences of bits, which can’t predicted well by any program. If we use these to generate a uniform algorithmically random number between 0 and 1, and use the strategy Egan describes, will our chance of choosing the larger number be more than 50%, or not?﻿

But watch out—here come Egan’s solutions to those!

### Solutions

Egan writes:

Puzzle 2: Prove that no deterministic strategy can guarantee you have a more than 50% chance of choosing the larger number.

Answer: If we adopt a deterministic strategy, that means there is a function $S: \mathbb{R} \to \{0,1\}$ that tells us whether on not we stick with the number x when we see it. If $S(x)=1$ we stick with it, if $S(x)=0$ we swap it for the other number.

If the two numbers are $x$ and $y,$ with $x > y,$ then the probability of success will be:

$P = 0.5 + 0.5(S(x)-S(y))$

This is exactly the same as the formula we obtained when we stuck with $x$ with probability $f(x),$ but we have specialised to functions $S$ valued in $\{0,1\}.$

We can only guarantee a more than 50% chance of choosing the larger number if $S$ is monotonically increasing everywhere, i.e. $S(x) > S(y)$ whenever $x > y.$ But this is impossible for a function valued in $\{0,1\}.$ To prove this, define $x_0$ to be any number in $[1,2]$ such that $S(x_0)=0;$ such an $x_0$ must exist, otherwise $S$ would be constant on $[1,2]$ and hence not monotonically increasing. Similarly define $x_1$ to be any number in $[-2,-1]$ such that $S(x_1) = 1.$ We then have $x_0 > x_1$ but $S(x_0) < S(x_1).$

Puzzle 3: There are perfectly specific but ‘algorithmically random’ sequences of bits, which can’t predicted well by any program. If we use these to generate a uniform algorithmically random number between 0 and 1, and use the strategy Egan describes, will our chance of choosing the larger number be more than 50%, or not?﻿

Answer: As Philip Gibbs noted, a deterministic pseudo-random number generator is still deterministic. Using a specific sequence of algorithmically random bits

$(b_1, b_2, \dots )$

to construct a number $z$ between $0$ and $1$ means $z$ takes on the specific value:

$z_0 = \sum_i b_i 2^{-i}$

So rather than sticking with $x$ with probability $f(x)$ for our monotonically increasing function $f,$ we end up always sticking with $x$ if $z_0 \le f(x),$ and always swapping if $z_0 > f(x).$ This is just using a function $S:\mathbb{R} \to \{0,1\}$ as in Puzzle 2, with:

$S(x) = 0$ if $x < f^{-1}(z_0)$

$S(x) = 1$ if $x \ge f^{-1}(z_0)$

So all the same consequences as in Puzzle 2 apply, and we cannot guarantee a more than 50% chance of choosing the larger number.

Puzzle 3 emphasizes the huge gulf between ‘true randomness’, where we only have a probability distribution of numbers $z,$ and the situation where we have a specific number $z_0,$ generated by any means whatsoever.

We could generate $z_0$ using a pseudorandom number generator, radioactive decay of atoms, an oracle whose randomness is certified by all the Greek gods, or whatever. No matter how randomly $z_0$ is generated, once we have it, we know there exist choices for the first player that will guarantee our defeat!

This may seem weird at first, but if you think about simple games of luck you’ll see it’s completely ordinary. We can have a more than 50% chance of winning such a game even if for any particular play we make the other player has a move that ensures our defeat. That’s just how randomness works.

## The Large-Number Limit for Reaction Networks (Part 3)

8 October, 2014

joint with Arjun Jain

We used to talk about reaction networks quite a lot here. When Arjun Jain was visiting the CQT, we made a lot of progress understanding how the master equation reduces to the rate equation in the limit where there are very large numbers of things of each kind. But we never told you the end of the story, and by now it’s been such a long time that you may need a reminder of some basic facts!

So…

The rate equation treats the number of things of each kind as continuous—a nonnegative real number—and says how it changes in a deterministic way.

The master equation treats the number of things of each kind as discrete—a nonnegative integer—and says how it changes in a probabilistic way.

You can think of the master equation as the ‘true’ description, and the rate equation as an approximation that’s good in some limit where there are large numbers of molecules — or more precisely, where the probability distribution of having some number of molecules of each kind is sharply peaked near some large value.

You may remember that in the master equation, the state of a chemical system is described by a vector $\psi$ in a kind of ‘Fock space’, while time evolution is described with the help of an operator on this space, called the ‘Hamiltonian’ $H$:

$\displaystyle{ \frac{d}{dt} \psi(t) = H \psi(t) }$

The Hamiltonian is built from annihilation and creation operators, so all of this looks very much like quantum field theory. The details are here, and we won’t try to review them all:

• John Baez and Jacob Biamonte, Quantum Techniques for Stochastic Mechanics.

The point is this: the ‘large-number limit’ where the master equation reduces to the rate equation smells a lot like the ‘classical limit’ of quantum field theory, where the description of light in terms of photons reduces to the good old Maxwell equations. So, maybe we can understand the large-number limit by borrowing techniques from quantum field theory!

How do we take the classical limit of quantum electromagnetism and get the classical Maxwell equations? For simplicity let’s ignore charged particles and consider the ‘free electromagnetic field’: just photons, described by the quantum version of Maxwell’s equations. When we take the classical limit we let Planck’s constant $\hbar$ go to zero: that much is obvious. However, that’s not all! The energy of each photon is proportional to $\hbar,$ so to take the classical limit and get a solution of the classical Maxwell’s equations with nonzero energy we also need to increase the number of photons. We cleverly do this in such a way that the total energy remains constant as $\hbar \to 0.$

So, in quantum electromagnetism the classical limit is also a large-number limit!

That’s a good sign. It suggests the same math will also apply to our reaction network problem.

But then we hit an apparent roadblock. What’s the analogue of Planck’s constant in chemical reaction networks? What should go to zero?

We told you the answer to that puzzle a while ago: it’s the reciprocal of Avogadro’s number!

You see, chemists measure numbers of molecules in ‘moles’. There’s a large number of molecules in each mole: Avogadro’s number. If we let the reciprocal of Avogadro’s number go to zero, we are taking a limit where chemistry becomes ‘continuous’ and the discreteness of molecules goes away. Of course this is just a mathematical trick, but it’s a very useful one.

So, we got around that roadblock. And then something nice happened.

When taking the classical limit of quantum electromagnetism, we focus attention on certain quantum states that are the ‘best approximations’ to classical states. These are called ‘coherent states’, and it’s very easy to study how the behave as we simultaneously let $\hbar \to 0$ and let the expected number of photons go to infinity.

And the nice thing is that these coherent states are also important in chemistry! But because chemistry involves probabilities rather than amplitudes, they have a different name: ‘Poisson distributions’. On this blog, Brendan Fong used them to give a new proof of a great result in mathematical chemistry, the Anderson–Craciun–Kurtz theorem.

So, we have most of the concepts and tools in place, and we can tackle the large-number limit using quantum techniques.

You can review the details here:

So, after a quick refresher on the notation, we’ll plunge right in.

As you’ll see, we solve the problem except for one important technical detail: passing a derivative through a limit! This means our main result is not a theorem. Rather, it’s an idea for how to prove a theorem. Or if we act like physicists, we can call it a theorem.

### Review of notation

The rate equation says

$\displaystyle{\frac{d}{dt}{x}(t) = \sum_{\tau \in T} r(\tau) (t(\tau)-s(\tau)) {x}(t)^{s(\tau)} }$

where:

$x(t) \in \mathbb{R}^k$ is a vector describing concentrations of $k$ different species at time $t.$ In chemistry these species could be different kinds of molecules.

• Each $\tau \in T$ is a transition, or in chemistry, a reaction.

$s(\tau) \in \mathbb{N}^k$ is a vector of natural numbers saying how many items of each species appear as inputs to the reaction $\tau.$ This is called the source of the reaction.

$t(\tau) \in \mathbb{N}^k$ is a vector of natural numbers saying how many items of each species appear as outputs of the reaction $\tau.$ This is called the target of the reaction. So, $t(\tau) - s(\tau)$ says the net change of the number of items of each species in the reaction $\tau.$

• The rate at which the reaction $\tau$ occurs is proportional to the rate constant $r(\tau)$ times the number

$x(t)^{s(\tau)}$

Here we are raising a vector to a vector power and getting a number, using this rule:

$x^r = x_1^{r_1} \cdots x_k^{r_k}$

where $r$ is any vector of natural numbers $(r_1, \dots, r_k),$ and $x$ is any vector of nonnegative real numbers. From now on we’ll call a vector of natural numbers a multi-index.

In this paper:

• John Baez, Quantum techniques for reaction networks.

it was shown that the master equation implies

$\displaystyle{\frac{d}{dt}\langle N_i \psi(t)\rangle = \sum_{\tau \in T} r(\tau) (t_i(\tau)-s_i(\tau)) \; \langle N^{\, \underline{s(\tau)}} \, \psi(t)\rangle }$

Here:

$\psi(t)$ is the stochastic state saying the probability of having any particular number of items of each species at each time $t.$ We won’t review the precise details of how this work; for that reread the relevant bit of Part 8.

$N_i$ is the $i$th number operator, defined using annihilation and creation operators as in quantum mechanics:

$N_i = a_i^\dagger a_i$

For the annihilation and creation operators, see Part 8.

$\langle N_i \psi(t) \rangle$ is the expected number of items of the $i$th species at time $t.$

• Similarly, $\langle N^{\underline{s(\tau)}} \psi(t)\rangle$ is the expected value of a certain product of operators. For any multi-index $r$ we define the falling power

$N_i^{\underline{r}_i} = N_i (N_i - 1) (N_i - 2) \cdots (N_i - r_i +1)$

and then we define

$N^{\underline{r}} = N_1^{\underline{r_1}} \cdots N_k^{\underline{r_k}}$

### The large-number limit

Okay. Even if you don’t understand any of what we just said, you’ll see the master and rate equation look similar. The master equation implies this:

$\displaystyle{\frac{d}{dt}\langle N_i \psi(t)\rangle = \sum_{\tau \in T} r(\tau) (t_i(\tau)-s_i(\tau)) \; \langle N^{\, \underline{s(\tau)}} \, \psi(t)\rangle }$

while the rate equation says this:

$\displaystyle{\frac{d}{dt}{x}(t) = \sum_{\tau \in T} r(\tau) (t(\tau)-s(\tau)) \; {x}(t)^{s(\tau)} }$

So, we will try to get from the first to the second second with the help of a ‘large-number limit’.

We start with a few definitions. We introduce an adjustable dimensionless quantity which we call $\hbar.$ This is just a positive number, which has nothing to do with quantum theory except that we’re using a mathematical analogy to quantum mechanics to motivate everything we’re doing.

Definition. The rescaled number operators are defined as $\widetilde{N}_i = \hbar N_i.$ This can be thought of as a rescaling of the number of objects, so that instead of counting objects individually, we count them in bunches of size $1/\hbar.$

Definition. For any multi-index $r$ we define the rescaled falling power of the number operator $N_i$ by:

$\widetilde{N}_i^{\underline{r_i}} = \widetilde{N}_i (\widetilde{N}_i - \hbar)(\widetilde{N}_i-2\hbar)\cdots (\widetilde{N}_i-r_i\hbar+\hbar)$

and also define

$\widetilde{N}^{\underline{r}} = \widetilde{N}_1^{\underline{r_1}} \; \cdots \;\widetilde{N}_k^{\underline{r_k}}$

for any multi-index $r.$

Using these, we get the following equation:

$\displaystyle{\frac{1}{\hbar}\frac{d}{dt} \langle \widetilde{N}_i \psi(t)\rangle = \sum_{\tau \in T} r(\tau) (t_i(\tau)-s_i(\tau)) \; \langle \widetilde{N}^{\underline{s(\tau)}} \psi(t)\rangle \; \frac{1}{\hbar^{|s(\tau)|}} }$

where for any multi-index $r$ we set

$|r| = r_1 + \cdots + r_k$

This suggests a way to rescale the rate constants in the master equation:

Definition. The rescaled rate constants are

$\displaystyle{\widetilde{r}(\tau) = \frac{r(\tau)}{\hbar^{|s(\tau)|-1}}}$

From here onwards, we change our viewpoint. We consider the rescaled rate constants $\widetilde{r}(\tau)$ to be fixed, instead of the original rate constants $r(\tau).$ So, as we decrease $\hbar,$ we are studying situations where the original rate constants change to ensure that the rescaled rate constants stays fixed!

So, we switch to working with a rescaled master equation:

Definition. The rescaled master equation is:

$\displaystyle{\frac{d}{dt} \langle \widetilde{N}_i\widetilde{\psi}(t)\rangle = \sum_{\tau \in T} \widetilde{r}(\tau) (t_i(\tau)-s_i(\tau)) \; \langle \widetilde{N}^{\underline{s(\tau)}} \widetilde{\psi}(t)\rangle }$

This is really a one-parameter family of equations, depending on $\hbar\in (0,\infty).$ We write a solution of the rescaled master equation as $\widetilde{\psi}(t),$ but it is really one solution for each value of $\hbar.$

Following the same procedure as above, we can rescale the rate equation, using the same definition of the rescaled rate constants:

Definition. The rescaled number of objects of the $i\mathrm{th}$ species is defined as $\widetilde{x_i}=\hbar x_i,$ where $x_i$ is the original number of objects of the $i\mathrm{th}$ species. Here again, we are counting in bunches of $1/\hbar.$

Using this to rescale the rate equation, we get

Definition. The rescaled rate equation is

$\displaystyle{\frac{d}{dt}\widetilde{x}(t) = \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau))\; \widetilde{x}(t)^{s(\tau)} }$

where

$\widetilde{x}(t)=(\widetilde{x_1}(t),\widetilde{x_2}(t),\dots, \widetilde{x_k}(t))$

Therefore, to go from the rescaled master equation to the rescaled rate equation, we require that

$\langle\widetilde{N}^{\underline{r}}\, \widetilde{\psi}(t)\rangle \to \langle\widetilde{N}\widetilde{\psi}(t)\rangle^r$

as $\hbar \to 0.$ If this holds, we can identify $\langle\widetilde{N}\widetilde{\psi}(t)\rangle$ with $\widetilde{x}(t)$ and get the rate equation from the master equation!

To this end, we introduce the following crucial idea:

Definition. A semiclassical family of states, $\widetilde{\psi},$ is defined as a one-parameter family of states depending on $\hbar \in (0,\infty)$ such that for some $\widetilde{c} \in [0,\infty)^k$ we have

$\langle\widetilde{N}^r\widetilde{\psi}\rangle \to \widetilde{c}^{\, r}$

for every $r\in \mathbb{N}^k$ as $\hbar \to 0.$

In particular, this implies

$\langle\widetilde{N}_i\widetilde{\psi}\rangle \to \widetilde{c}_i$

for every index $i.$

Intuitively, a semiclassical family is a family of probability distributions that becomes more and more sharply peaked with a larger and larger mean as $\hbar$ decreases. We would like to show that in this limit, the rescaled master equation gives the rescaled rate equation.

We make this precise in the following propositions.

Proposition 1. If $\widetilde{\psi}$ is a semiclassical family as defined above, then in the $\hbar \to 0$ limit, we have $\langle\widetilde{N}^{\underline{r}}\widetilde{\psi}\rangle \to \widetilde{c}^{\; r}$ as well.

Proof. For each index $i,$

$\displaystyle{ \langle\widetilde{N}_i^{\; \underline{r_i}}\, \widetilde{\psi}\rangle = \displaystyle{ \langle \widetilde{N}_i (\widetilde{N}_i - \hbar)(\widetilde{N}_i-2\hbar)\cdots(\widetilde{N}_i-r_i\hbar+\hbar)\,\widetilde{\psi}\rangle} }$

$\displaystyle{ = \Big\langle\Big(\widetilde{N}_i^r + \hbar\frac{r_i(r_i-1)}{2}\widetilde{N}_i^{r_i-1}+\cdots + \hbar^{r_i-1}(r_i-1)!\Big)\,\widetilde{\psi}\Big\rangle }$

By the definition of a semiclassical family,

$\displaystyle{ \lim_{\hbar\to 0} \langle\Big(\widetilde{N}_i^{r_i} + \hbar\frac{r_i(r_i-1)}{2}\widetilde{N}_i^{r_i-1}+ \cdots + \hbar^{r_i-1}(r_i-1)!\Big)\;\widetilde{\psi}\rangle} = \widetilde{c}_i^{\; r_i}$

since every term but the first approaches zero. Thus, we have

$\displaystyle{ \lim_{\hbar \to 0} \langle\widetilde{N}_i^{\; \underline{r_i}}\, \widetilde{\psi}\rangle = \widetilde{c}_i^{\; r_i} }$

A similar but more elaborate calculation shows that

$\displaystyle{ \lim_{\hbar \to 0} \langle\widetilde{N}_1^{\, \underline{r_1}} \cdots\widetilde{N}_k^{\, \underline{r_k}} \, \widetilde{\psi}\rangle = \lim_{\hbar \to 0} \langle\widetilde{N}_1^{\, r_1}\cdots \widetilde{N}_k^{\, r_k} \widetilde{\psi}\rangle= \lim_{\hbar \to 0}\langle\widetilde{N}^{\, r} \, \widetilde{\psi}\rangle }$

or in other words

$\langle\widetilde{N}^{\, \underline{r}}\,\widetilde{\psi}\rangle \to \widetilde{c}^{\, r}$

as $\hbar \to 0.$   █

Proposition 2. If $\widetilde{\psi}$ is a semiclassical family of states, then

$\displaystyle{ \langle (\widetilde{N}-\widetilde{c})^{r}\, \widetilde{\psi}\rangle \to 0 }$

for any multi-index $r.$

Proof. Consider the $r_i\mathrm{th}$ centered moment of the $i\mathrm{th}$ number operator:

$\displaystyle{\langle(\widetilde{N}_i-\widetilde{c}_i)^{r_i}\widetilde{\psi}\rangle = \sum_{p =0}^{r_i} {r_i \choose p}\langle\widetilde{N}_i^p\widetilde{\psi}\rangle(-\widetilde{c}_i)^{r_i-p} }$

Taking the limit as $\hbar$ goes to zero, this becomes

$\begin{array}{ccl} \displaystyle{ \lim_{\hbar \to 0}\sum_{p =0}^{r_i} {r_i \choose p}\langle\widetilde{N}_i^p\widetilde{\psi}\rangle(-\widetilde{c}_i)^{r_i-p} } &=& \displaystyle{ \sum_{p =0}^{r_i} {r_i \choose p}(\widetilde{c}_i)^p(-\widetilde{c}_i)^{r_i-p} } \\ \\ &=& (\widetilde{c}_i-\widetilde{c}_i)^{r_i} \\ \\ &=& 0 \end{array}$

For a general multi-index $r$ we can prove the same sort of thing with a more elaborate calculation. First note that

$\langle (\widetilde{N}-\widetilde{c})^{r}\widetilde{\psi}\rangle=\langle(\widetilde{N_1}-\widetilde{c_1})^{r_1} \cdots (\widetilde{N_k}-\widetilde{c_k})^{r_k})\widetilde{\psi}\rangle$

The right-hand side can be expanded as

$\displaystyle{ \langle(\sum_{p_1 =0}^{r_1} {r_1 \choose p_1}\widetilde{N}_1^{p_1}(-\widetilde{c}_1)^{r_1-p_1} ) \cdots (\sum_{p_k =0}^{r_k} {r_k \choose p_k}\widetilde{N}_k^{p_k}(-\widetilde{c}_k)^{r_k-p_k} )\widetilde{\psi} }\rangle$

We can write this more tersely as

$\displaystyle{ \sum_{p=0}^r} {r\choose p} \langle \widetilde{N}^p\widetilde{\psi}\rangle (-\widetilde{c})^{r-p}$

where for any multi-index $r$ we define

$\displaystyle{{\sum_{p=0}^{r}}= \sum_{p_1 =0}^{r_1} \cdots \sum_{p_k =0}^{r_k} }$

and for any multi-indices $r, p$ we define

$\displaystyle{ {r \choose p}={r_1 \choose p_1}{r_2 \choose p_2}\cdots {r_k \choose p_k}}$

Now using the definition of a semiclassical state, we see

$\displaystyle{ \lim_{\hbar \to 0} \sum_{p=0}^r} {r\choose p} \langle \widetilde{N}^p\widetilde{\psi}\rangle (-\widetilde{c})^{r-p}= \displaystyle{ \sum_{p=0}^r} {r\choose p} (\widetilde{c})^{p} (-\widetilde{c})^{r-p}$

But this equals zero, as the last expression, expanded, is

$\displaystyle{ (\widetilde{c})^r \left( \sum_{p_1=0}^{r_1} {r_1\choose p_1} (-1)^{r_1-p_1}\right) \cdots \left( \sum_{p_k=0}^{r_k} {r_k\choose p_k} (-1)^{r_k-p_k} \right) }$

where each individual sum is zero.   █

Here is the theorem that would finish the job if we could give a fully rigorous proof:

“Theorem.” If $\widetilde{\psi}(t)$ is a solution of the rescaled master equation and also a semiclassical family for the time interval $[t_0,t_1],$ then $\widetilde{x}(t) = \langle \widetilde{N} \widetilde{\psi}(t) \rangle$ is a solution of the rescaled rate equation for $t \in [t_0,t_1].$

Proof sketch. We sketch a proof that relies on the assumption that we can pass the $\hbar \to 0$ limit through a time derivative. Of course, to make this rigorous, we would need to justify this. Perhaps it is true only in certain cases.

Assuming that we can pass the limit through the derivative:

$\displaystyle{\lim_{\hbar \to 0}\frac{d}{dt} \langle \widetilde{N}\widetilde{\psi}(t)\rangle = \lim_{\hbar \to 0} \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau))\langle \widetilde{N}^{\, \underline{s(\tau)}} \, \widetilde{\psi}(t)\rangle }$

and thus

$\displaystyle{\frac{d}{dt}\lim_{\hbar \to 0} \langle \widetilde{N}\widetilde{\psi}(t)\rangle = \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau)) \lim_{\hbar \to 0}\langle \widetilde{N}^{\, \underline{s(\tau)}} \, \widetilde{\psi}(t)\rangle }$

and thus

$\displaystyle{\frac{d}{dt}\widetilde{x}(t) = \sum_{\tau \in T} \widetilde{r}(\tau) (t(\tau)-s(\tau))\widetilde{x}^{\, s(\tau)} }.$

As expected, we obtain the rescaled rate equation.   █

Another question is this: if we start with a semiclassical family of states as our initial data, does it remain semiclassical as we evolve it in time? This will probably be true only in certain cases.

### An example: rescaled coherent states

The best-behaved semiclassical states are the coherent states.
Consider the family of coherent states

$\displaystyle{\widetilde{\psi}_{\widetilde{c}} = \frac{e^{(\widetilde{c}/\hbar) z}}{e^{\widetilde{c}/\hbar}}}$

using the notation developed in the earlier mentioned paper. In that paper it was shown that for any multi-index $m$ and any coherent state $\Psi$ we have

$\langle N^{\underline{m}}\Psi\rangle = \langle N\Psi \rangle^m$

Using this result for $\widetilde{\psi}_{\widetilde{c}}$ we get

$\displaystyle{\langle \widetilde{N}^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle~=~\hbar^{|m|}\langle N^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle~=~\hbar^{|m|}\langle N\widetilde{\psi}_{\widetilde{c}}\rangle^m~=~\hbar^{|m|}\frac{\widetilde{c}^m}{\hbar^{|m|}}~=~\widetilde{c}^m}$

Since $\langle\widetilde{N}^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle$ equals $\langle \widetilde{N}^{m} \widetilde{\psi}_{\widetilde{c}}\rangle$ plus terms of order $\hbar,$ as $\hbar \to 0$ we have

$\langle\widetilde{N}^{\underline{m}}\widetilde{\psi}_{\widetilde{c}}\rangle~\to~\langle\widetilde{N}^{m}\widetilde{\psi}_{\widetilde{c}}\rangle=\widetilde{c}^{m}$

showing that our chosen $\widetilde{\psi}_{\widetilde{c}}$ is indeed a semiclassical family.

## Exploring Climate Data (Part 1)

1 August, 2014

joint with Dara O Shayda

Emboldened by our experiments in El Niño analysis and prediction, people in the Azimuth Code Project have been starting to analyze weather and climate data. A lot of this work is exploratory, with no big conclusions. But it’s still interesting! So, let’s try some blog articles where we present this work.

This one will be about the air pressure on the island of Tahiti and in a city called Darwin in Australia: how they’re correlated, and how each one varies. This article will also be a quick introduction to some basic statistics, as well as ‘continuous wavelet transforms’.

### Darwin, Tahiti and El Niños

The El Niño Southern Oscillation is often studied using the air pressure in Darwin, Australia versus the air pressure in Tahiti. When there’s an El Niño, it gets stormy in the eastern Pacific so the air temperatures tend to be lower in Tahiti and higher in Darwin. When there’s a La Niña, it’s the other way around:

The Southern Oscillation Index or SOI is a normalized version of the monthly mean air pressure anomaly in Tahiti minus that in Darwin. Here anomaly means we subtract off the mean, and normalized means that we divide by the standard deviation.

So, the SOI tends to be negative when there’s an El Niño. On the other hand, when there’s an El Niño the Niño 3.4 index tends to be positive—this says it’s hotter than usual in a certain patch of the Pacific.

Here you can see how this works:

When the Niño 3.4 index is positive, the SOI tends to be negative, and vice versa!

It might be fun to explore precisely how well correlated they are. You can get the data to do that by clicking on the links above.

But here’s another question: how similar are the air pressure anomalies in Darwin and in Tahiti? Do we really need to take their difference, or are they so strongly anticorrelated that either one would be enough to detect an El Niño?

You can get the data to answer such questions here:

Southern Oscillation Index based upon annual standardization, Climate Analysis Section, NCAR/UCAR. This includes links to monthly sea level pressure anomalies in Darwin and Tahiti, in either ASCII format (click the second two links) or netCDF format (click the first one and read the explanation).

In fact this website has some nice graphs already made, which I might as well show you! Here’s the SOI and also the sum of the air pressure anomalies in Darwin and Tahiti, normalized in some way:

(Click to enlarge.)

If the sum were zero, the air pressure anomalies in Darwin and Tahiti would contain the same information and life would be simple. But it’s not!

How similar in character are the air pressure anomalies in Darwin and Tahiti? There are many ways to study this question. Dara tackled it by taking the air pressure anomaly data from 1866 to 2012 and computing some ‘continuous wavelet transforms’ of these air pressure anomalies. This is a good excuse for explaining how a continuous wavelet transform works.

### Very basic statistics

It helps to start with some very basic statistics. Suppose you have a list of numbers

$x = (x_1, \dots, x_n)$

You probably know how to take their mean, or average. People often write this with angle brackets:

$\displaystyle{ \langle x \rangle = \frac{1}{n} \sum_{i = 1}^n x_i }$

You can also calculate the mean of their squares:

$\displaystyle{ \langle x^2 \rangle = \frac{1}{n} \sum_{i = 1}^n x_i^2 }$

If you were naive you might think $\langle x^2 \rangle = \langle x \rangle^2,$ but in fact we have:

$\langle x^2 \rangle \ge \langle x \rangle^2$

and they’re equal only if all the $x_i$ are the same. The point is that if the numbers $x_i$ are spread out, the squares of the big ones (positive or negative) contribute more to the average of the squares than if we had averaged them out before squaring. The difference

$\langle x^2 \rangle - \langle x \rangle^2$

is called the variance; it says how spread out our numbers are. The square root of the variance is the standard deviation:

$\sigma_x = \sqrt{\langle x^2 \rangle - \langle x \rangle^2 }$

and this has the slight advantage that if you multiply all the numbers $x_i$ by some constant $c,$ the standard deviation gets multiplied by $|c|.$ (The variance gets multiplied by $c^2.$)

We can generalize the variance to a situation where we have two lists of numbers:

$x = (x_1, \dots, x_n)$

$y = (y_1, \dots, y_n)$

Namely, we can form the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle$

This reduces to the variance when $x = y.$ It measures how much $x$ and $y$ vary together — ‘hand in hand’, as it were. A bit more precisely: if $x_i$ is greater than its mean value mainly for $i$ such that $y_i$ is greater than its mean value, the covariance is positive. On the other hand, if $x_i$ tends to be greater than average when $y_i$ is smaller than average — like with the air pressures at Darwin and Tahiti — the covariance will be negative.

For example, if

$x = (1,-1), \quad y = (1,-1)$

then they ‘vary hand in hand’, and the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = 1 - 0 = 1$

is positive. But if

$x = (1,-1), \quad y = (-1,1)$

then one is positive when the other is negative, so the covariance

$\langle x y \rangle - \langle x \rangle \langle y \rangle = -1 - 0 = -1$

is negative.

Of course the covariance will get bigger if we multiply both $x$ and $y$ by some big number. If we don’t want this effect, we can normalize the covariance and get the correlation:

$\displaystyle{ \frac{ \langle x y \rangle - \langle x \rangle \langle y \rangle }{\sigma_x \sigma_y} }$

which will always be between $-1$ and $1.$

For example, if we compute the correlation between the air pressure anomalies at Darwin and Tahiti, measured monthly from 1866 to 2012, we get
-0.253727. This indicates that when one goes up, the other tends to go down. But since we’re not getting -1, it means they’re not completely locked into a linear relationship where one is some negative number times the other.

Okay, we’re almost ready for continuous wavelet transforms! Here is the main thing we need to know. If the mean of either $x$ or $y$ is zero, the formula for covariance simplifies a lot, to

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

So, this quantity says how much the numbers $x_i$ ‘vary hand in hand’ with the numbers $y_i,$ in the special case when one (or both) has mean zero.

We can do something similar if $x, y : \mathbb{R} \to \mathbb{R}$ are functions of time defined for all real numbers $t.$ The sum becomes an integral, and we have to give up on dividing by $n.$ We get:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t)\; d t }$

This is called the inner product of the functions $x$ and $y,$ and often it’s written $\langle x, y \rangle,$ but it’s a lot like the covariance.

### Continuous wavelet transforms

What are continuous wavelet transforms, and why should we care?

People have lots of tricks for studying ‘signals’, like series of numbers $x_i$ or functions $x : \mathbb{R} \to \mathbb{R}.$ One method is to ‘transform’ the signal in a way that reveals useful information. The Fourier transform decomposes a signal into sines and cosines of different frequencies. This lets us see how much power the signal has at different frequencies, but it doesn’t reveal how the power at different frequencies changes with time. For that we should use something else, like the Gabor transform explained by Blake Pollard in a previous post.

Sines and cosines are great, but we might want to look for other patterns in a signal. A ‘continuous wavelet transform’ lets us scan a signal for appearances of a given pattern at different times and also at different time scales: a pattern could go by quickly, or in a stretched out slow way.

To implement the continuous wavelet transform, we need a signal and a pattern to look for. The signal could be a function $x : \mathbb{R} \to \mathbb{R}.$ The pattern would then be another function $y: \mathbb{R} \to \mathbb{R},$ usually called a wavelet.

Here’s an example of a wavelet:

If we’re in a relaxed mood, we could call any function that looks like a bump with wiggles in it a wavelet. There are lots of famous wavelets, but this particular one is the fourth derivative of a certain Gaussian. Mathematica calls this particular wavelet DGaussianWavelet[4], and you can look up the formula under ‘Details’ on their webpage.

However, the exact formula doesn’t matter at all now! If we call this wavelet $y,$ all that matters is that it’s a bump with wiggles on it, and that its mean value is 0, or more precisely:

$\displaystyle{ \int_{-\infty}^\infty y(t) \; d t = 0 }$

As we saw in the last section, this fact lets us take our function $x$ and the wavelet $y$ and see how much they ‘vary hand it hand’ simply by computing their inner product:

$\displaystyle{ \langle x , y \rangle = \int_{-\infty}^\infty x(t) y(t)\; d t }$

Loosely speaking, this measures the ‘amount of $y$-shaped wiggle in the function $x$’. It’s amazing how hard it is to say something in plain English that perfectly captures the meaning of a simple formula like the above one—so take the quoted phrase with a huge grain of salt. But it gives a rough intuition.

Our wavelet $y$ happens to be centered at $t = 0.$ However, we might be interested in $y$-shaped wiggles that are centered not at zero but at some other number $s.$ We could detect these by shifting the function $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) y(t-s)\; d t }$

We could also be interested in measuring the amount of some stretched-out or squashed version of a $y$-shaped wiggle in the function $x.$ Again we could do this by changing $y$ before taking its inner product with $x$:

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t}{P}\right) \; d t }$

When $P$ is big, we get a stretched-out version of $y.$ People sometimes call $P$ the period, since the period of the wiggles in $y$ will be proportional to this (though usually not equal to it).

Finally, we can combine these ideas, and compute

$\displaystyle{ \int_{-\infty}^\infty x(t) \; y\left(\frac{t- s}{P}\right)\; dt }$

This is a function of the shift $s$ and period $P$ which says how much of the $s$-shifted, $P$-stretched wavelet $y$ is lurking in the function $x.$ It’s a version of the continuous wavelet transform!

Mathematica implements this idea for time series, meaning lists of numbers $x = (x_1,\dots,x_n)$ instead of functions $x : \mathbb{R} \to \mathbb{R}.$ The idea is that we think of the numbers as samples of a function $x$:

$x_1 = x(\Delta t)$

$x_2 = x(2 \Delta t)$

and so on, where $\Delta t$ is some time step, and replace the integral above by a suitable sum. Mathematica has a function ContinuousWaveletTransform that does this, giving

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \sum_{i = 1}^n x_i \; y\left(\frac{i \Delta t - s}{P}\right) }$

The factor of $1/\sqrt{P}$ in front is a useful extra trick: it’s the right way to compensate for the fact that when you stretch out out your wavelet $y$ by a factor of $P,$ it gets bigger. So, when we’re doing integrals, we should define the continuous wavelet transform of $y$ by:

$\displaystyle{ w(s,P) = \frac{1}{\sqrt{P}} \int_{-\infty}^\infty x(t) y(\frac{t- s}{P})\; dt }$

### The results

Dara Shayda started with the air pressure anomaly at Darwin and Tahiti, measured monthly from 1866 to 2012. Taking DGaussianWavelet[4] as his wavelet, he computed the continuous wavelet transform $w(s,P)$ as above. To show us the answer, he created a scalogram:

This is a 2-dimensional color plot showing roughly how big the continuous wavelet transform $w(s,P)$ is for different shifts $s$ and periods $P.$ Blue means it’s very small, green means it’s bigger, yellow means even bigger and red means very large.

Tahiti gave this:

You’ll notice that the patterns at Darwin and Tahiti are similar in character, but notably different in detail. For example, the red spots, where our chosen wavelet shows up strongly with period of order ~100 months, occur at different times.

Puzzle 1. What is the meaning of the ‘spikes’ in these scalograms? What sort of signal would give a spike of this sort?

Puzzle 2. Do a Gabor transform, also known as a ‘windowed Fourier transform’, of the same data. Blake Pollard explained the Gabor transform in his article Milankovitch vs the Ice Ages. This is a way to see how much a signal wiggles at a given frequency at a given time: we multiply the signal by a shifted Gaussian and then takes its Fourier transform.

Puzzle 3. Read about continuous wavelet transforms. If we want to reconstruct our signal $x$ from its continuous wavelet transform, why should we use a wavelet $y$ with

$\displaystyle{\int_{-\infty}^\infty y(t) \; d t = 0 ? }$

In fact we want a somewhat stronger condition, which is implied by the above equation when the Fourier transform of $y$ is smooth and integrable:

Continuous wavelet transform, Wikipedia.

### Another way to understand correlations

David Tweed mentioned another approach from signal processing to understanding the quantity

$\displaystyle{ \langle x y \rangle = \frac{1}{n} \sum_{i = 1}^n x_i y_i }$

If we’ve got two lists of data $x$ and $y$ that we want to compare to see if they behave similarly, the first thing we ought to do is multiplicatively scale each one so they’re of comparable magnitude. There are various possibilities for assigning a scale, but a reasonable one is to ensure they have equal ‘energy’

$\displaystyle{ \sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2 }$

(This can be achieved by dividing each list by its standard deviation, which is equivalent to what was done in the main derivation above.) Once we’ve done that then it’s clear that looking at

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

gives small values when they have a very good match and progressively bigger values as they become less similar. Observe that

$\begin{array}{ccl} \displaystyle{\sum_{i=1}^n (x_i-y_i)^2 } &=& \displaystyle{ \sum_{i=1}^n (x_i^2 - 2 x_i y_i + y_i^2) }\\ &=& \displaystyle{ \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i y_i + \sum_{i=1}^n y_i^2 } \end{array}$

Since we’ve scaled things so that $\sum_{i=1}^n x_i^2$ and $\sum_{i=1}^n y_i^2$ are constants, we can see that when $\sum_{i=1}^n x_i y_i$ becomes bigger,

$\displaystyle{ \sum_{i=1}^n (x_i-y_i)^2 }$

becomes smaller. So,

$\displaystyle{\sum_{i=1}^n x_i y_i}$

serves as a measure of how close the lists are, under these assumptions.