## Azimuth News (Part 5)

11 June, 2016

I’ve been rather quiet about Azimuth projects lately, because I’ve been too busy actually working on them. Here’s some of what’s happening:

Jason Erbele is finishing his thesis, entitled Categories in Control: Applied PROPs. He successfully gave his thesis defense on Wednesday June 8th, but he needs to polish it up some more. Building on the material in our paper “Categories in control”, he’s defined a category where the morphisms are signal flow diagrams. But interestingly, not all the diagrams you can draw are actually considered useful in control theory! So he’s also found a subcategory where the morphisms are the ‘good’ signal flow diagrams, the ones control theorists like. For these he studies familiar concepts like controllability and observability. When his thesis is done I’ll announce it here.

Brendan Fong is also finishing his thesis, called The Algebra of Open and Interconnected Systems. Brendan has already created a powerful formalism for studying open systems: the decorated cospan formalism. We’ve applied it to two examples: electrical circuits and Markov processes. Lately he’s been developing the formalism further, and this will appear in his thesis. Again, I’ll talk about it when he’s done!

Blake Pollard and I are writing a paper called “A compositional framework for open chemical reaction networks”. Here we take our work on Markov processes and throw in two new ingredients: dynamics and nonlinearity. Of course Markov processes have a dynamics, but in our previous paper when we ‘black-boxed’ them to study their external behaviour, we got a relation between flows and populations in equilibrium. Now we explain how to handle nonequilibrium situations as well.

Brandon Coya, Franciscus Rebro and I are writing a paper that might be called “The algebra of networks”. I’m not completely sure of the title, nor who the authors will be: Brendan Fong may also be a coauthor. But the paper explores the technology of PROPs as a tool for describing networks. As an application, we’ll give a new shorter proof of the functoriality of black-boxing for electrical circuits. This new proof also applies to nonlinear circuits. I’m really excited about how the theory of PROPs, first introduced in algebraic topology, is catching fire with all the new applications to network theory.

I expect all these projects to be done by the end of the summer. Near the end of June I’ll go to the Centre for Quantum Technologies, in Singapore. This will be my last summer there. My main job will be to finish up the two papers that I’m supposed to be writing.

There’s another paper that’s already done:

Kenny Courser has written a paper “A bicategory of decorated cospans”, pushing Brendan’s framework from categories to bicategories. I’ll explain this very soon here on this blog! One goal is to understand things like the coarse-graining of open systems: that is, the process of replacing a detailed description by a less detailed description. Since we treat open systems as morphisms, coarse-graining is something that goes from one morphism to another, so it’s naturally treated as a 2-morphism in a bicategory.

So, I’ve got a lot of new ideas to explain here, and I’ll start soon! I also want to get deeper into systems biology.

In the fall I’ve got a couple of short trips lined up:

• Monday November 14 – Friday November 18, 2016 – I’ve been invited by Yoav Kallus to visit the Santa Fe Institute. From the 16th to 18th I’ll attend a workshop on Statistical Physics, Information Processing and Biology.

• Monday December 5 – Friday December 9 – I’ve been invited to Berkeley for a workshop on Compositionality at the Simons Institute for the Theory of Computing, organized by Samson Abramsky, Lucien Hardy, and Michael Mislove. ‘Compositionality’ is a name for how you describe the behavior of a big complicated system in terms of the behaviors of its parts, so this is closely connected to my dream of studying open systems by treating them as morphisms that can be composed to form bigger open systems.

Here’s the announcement:

The compositional description of complex objects is a fundamental feature of the logical structure of computation. The use of logical languages in database theory and in algorithmic and finite model theory provides a basic level of compositionality, but establishing systematic relationships between compositional descriptions and complexity remains elusive. Compositional models of probabilistic systems and languages have been developed, but inferring probabilistic properties of systems in a compositional fashion is an important challenge. In quantum computation, the phenomenon of entanglement poses a challenge at a fundamental level to the scope of compositional descriptions. At the same time, compositionality has been proposed as a fundamental principle for the development of physical theories. This workshop will focus on the common structures and methods centered on compositionality that run through all these areas.

I’ll say more about both these workshops when they take place.

## Statistical Laws of Darwinian Evolution

18 April, 2016

guest post by Matteo Smerlak

Biologists like Stephen J. Gould like to emphasize that evolution is unpredictable. They have a point: there is absolutely no way an alien visiting the Earth 400 million years ago could have said:

Hey, I know what’s gonna happen here. Some descendants of those ugly fish will grow wings and start flying in the air. Others will walk the surface of the Earth for a few million years, but they’ll get bored and they’ll eventually go back to the oceans; when they do, they’ll be able to chat across thousands of kilometers using ultrasound. Yet others will grow arms, legs, fur, they’ll climb trees and invent BBQ, and, sooner or later, they’ll start wondering “why all this?”.

Nor can we tell if, a week from now, the flu virus will mutate, become highly pathogenic and forever remove the furry creatures from the surface of the Earth.

Evolution isn’t gravity—we can’t tell in which directions things will fall down.

One reason we can’t predict the outcomes of evolution is that genomes evolve in a super-high dimensional combinatorial space, with a ginormous number of possible turns at every step. Another is that living organisms interact with one another in a massively non-linear way, with feedback loops, tipping points and all that jazz.

Life’s a mess, if you want my physicist’s opinion.

But that doesn’t mean that nothing can be predicted. Think of statistics. Nobody can predict who I’ll vote for in the next election, but it’s easy to tell what the distribution of votes in the country will be like. For continuous variables which arise as sums of large numbers of independent components, the central limit theorem tells us that the distribution will be approximately normal. Or take extreme events: the suitably rescaled maximum of $N$ independent, identically distributed random variables is, for large $N$, distributed according to a member of a one-parameter family of so-called “extreme value distributions”: this is the content of the famous Fisher–Tippett–Gnedenko theorem.

So this is the problem I want to think about in this blog post: is evolution ruled by statistical laws? Or, in physics terms: does it exhibit some form of universality?

### Fitness distributions are the thing

One lesson from statistical physics is that, to uncover universality, you need to focus on relevant variables. In the case of evolution, it was Darwin’s main contribution to figure out the main relevant variable: the average number of viable offspring, aka fitness, of an organism. Other features—physical strength, metabolic efficiency, you name it—matter only insofar as they are correlated with fitness. If we further assume that fitness is (approximately) heritable, meaning that descendants have the same fitness as their ancestors, we get a simple yet powerful dynamical principle called natural selection: in a given population, the lineage with the highest fitness eventually dominates, i.e. its fraction goes to one over time. This principle is very general: it applies to genes and species, but also to non-living entities such as algorithms, firms or language. The general relevance of natural selection as an evolutionary force is sometimes referred to as “Universal Darwinism”.

The general idea of natural selection is pictured below (reproduced from this paper):

It’s not hard to write down an equation which expresses natural selection in general terms. Consider an infinite population in which each lineage grows with some rate $x$. (This rate is called the log-fitness or Malthusian fitness to contrast it with the number of viable offspring $w=e^{x\Delta t}$ with $\Delta t$ the lifetime of a generation. It’s more convenient to use $x$ than $w$ in what follows, so we’ll just call $x$ “fitness”.) Then the distribution of fitness at time $t$ satisfies the equation

$\displaystyle{ \frac{\partial p_t(x)}{\partial t} =\left(x-\int d y\, y\, p_t(y)\right)p_t(x) }$

whose explicit solution in terms of the initial fitness distribution $p_0(x):$

$\displaystyle{ p_t(x)=\frac{e^{x t}p_0(x)}{\int d y\, e^{y t}p_0(y)} }$

is called the Cramér transform of $p_0(x)$ in large deviations theory. That is, viewed as a flow in the space of probability distributions, natural selection is nothing but a time-dependent exponential tilt. (These equations and the results below can be generalized to include the effect of mutations, which are critical to maintain variation in the population, but we’ll skip this here to focus on pure natural selection. See my paper referenced below for more information.)

An immediate consequence of these equations is that the mean fitness $\mu_t=\int dx\, x\, p_t(x)$ grows monotonically in time, with a rate of growth given by the variance $\sigma_t^2=\int dx\, (x-\mu_t)^2\, p_t(x)$:

$\displaystyle{ \frac{d\mu_t}{dt}=\sigma_t^2\geq 0 }$
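This identity is easy to check numerically. The sketch below is only an illustration: it assumes a discretized fitness grid and an arbitrary initial distribution $p_0(x)\propto e^{-x^4}$ (neither is taken from the paper), applies the exponential tilt, and compares a finite-difference estimate of $d\mu_t/dt$ with $\sigma_t^2$. The helper names `tilt` and `mean_var` are made up for this example.

```python
import numpy as np

def tilt(p0, x, t):
    """Exponentially tilted distribution p_t(x) proportional to exp(x t) p0(x)."""
    logw = x * t + np.log(p0)
    logw -= logw.max()              # subtract the max to avoid overflow
    w = np.exp(logw)
    return w / w.sum()

def mean_var(p, x):
    """Mean and variance of a discrete distribution p on the grid x."""
    mu = np.sum(p * x)
    return mu, np.sum(p * (x - mu) ** 2)

# Assumed example: fitness grid on [-3, 3] and p0 proportional to exp(-x^4).
x = np.linspace(-3.0, 3.0, 601)
p0 = np.exp(-x**4)
p0 /= p0.sum()

t, h = 1.5, 1e-5
mu_plus, _ = mean_var(tilt(p0, x, t + h), x)
mu_minus, _ = mean_var(tilt(p0, x, t - h), x)
dmu_dt = (mu_plus - mu_minus) / (2 * h)   # central finite difference
_, var_t = mean_var(tilt(p0, x, t), x)

print(dmu_dt, var_t)   # the two numbers should agree closely
```

Changing `t` or the initial distribution should leave the agreement intact, since the identity holds for any tilted distribution.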

The great geneticist Ronald Fisher (yes, the one in the extreme value theorem!) was very impressed with this relationship. He thought it amounted to a biological version of the second law of thermodynamics, writing in his 1930 monograph

Professor Eddington has recently remarked that “The law that entropy always increases—the second law of thermodynamics—holds, I think, the supreme position among the laws of nature”. It is not a little instructive that so similar a law should hold the supreme position among the biological sciences.

Unfortunately, this excitement hasn’t been shared by the biological community, notably because Fisher’s “fundamental theorem of natural selection” isn’t predictive: the mean fitness $\mu_t$ grows according to the fitness variance $\sigma_t^2$, but what determines the evolution of $\sigma_t^2$? I can’t use the identity above to predict the speed of evolution in any sense. Geneticists say it’s “dynamically insufficient”.

### Two limit theorems

But the situation isn’t as bad as it looks. The evolution of $p_t(x)$ may be decomposed into the evolution of its mean $\mu_t$, of its variance $\sigma_t^2$, and of its shape or type

$\overline{p}_t(x)=\sigma_t p_t(\sigma_t x+\mu_t)$.

(We also call $\overline{p}_t(x)$ the “standardized fitness distribution”.) With Ahmed Youssef we showed that:

• If $p_0(x)$ is supported on the whole real line and decays at infinity as

$-\ln\int_x^{\infty}p_0(y)d y\underset{x\to\infty}{\sim} x^{\alpha}$

for some $\alpha > 1$, then $\mu_t\sim t^{\overline{\alpha}-1}$, $\sigma_t^2\sim t^{\overline{\alpha}-2}$ and $\overline{p}_t(x)$ converges to the standard normal distribution as $t\to\infty$. Here $\overline{\alpha}$ is the conjugate exponent to $\alpha$, i.e. $1/\overline{\alpha}+1/\alpha=1$.

• If $p_0(x)$ has a finite right-end point $x_+$ with

$p_0(x)\underset{x\to x_+}{\sim} (x_+-x)^\beta$

for some $\beta\geq0$, then $x_+-\mu_t\sim t^{-1}$, $\sigma_t^2\sim t^{-2}$ and $\overline{p}_t(x)$ converges to the flipped gamma distribution

$\displaystyle{ p^*_\beta(x)= \frac{(1+\beta)^{(1+\beta)/2}}{\Gamma(1+\beta)} \, \Theta[(1+\beta)^{1/2}-x] \, e^{-(1+\beta)^{1/2}[(1+\beta)^{1/2}-x]}\Big[(1+\beta)^{1/2}-x\Big]^\beta }$

Here and below the symbol $\sim$ means “asymptotically equivalent up to a positive multiplicative constant”; $\Theta(x)$ is the Heaviside step function. Note that $p^*_\beta(x)$ becomes Gaussian in the limit $\beta\to\infty$, i.e. the attractors of cases 1 and 2 form a continuous line in the space of probability distributions; the other extreme case, $\beta\to0$, corresponds to a flipped exponential distribution.
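To get a feel for the bounded case, here is a small numerical sketch. It assumes a specific example that is not from the paper: $p_0$ uniform on $[0,1]$, so that $x_+=1$ and $\beta=0$. It tilts this distribution on a grid and prints $t\,(x_+-\mu_t)$ and $t^2\sigma_t^2$, which should settle down to constants if the scalings $x_+-\mu_t\sim t^{-1}$ and $\sigma_t^2\sim t^{-2}$ hold.

```python
import numpy as np

# Assumed example: p_0 uniform on [0, 1], so x_+ = 1 and beta = 0.
x = np.linspace(0.0, 1.0, 200_001)

def moments(t):
    """Mean and variance of p_t(x) proportional to exp(x t) p_0(x) on the grid."""
    w = np.exp((x - 1.0) * t)       # shift by x_+ = 1 to avoid overflow
    p = w / w.sum()
    mu = np.sum(p * x)
    return mu, np.sum(p * (x - mu) ** 2)

for t in (10.0, 50.0, 100.0):
    mu, var = moments(t)
    # Rescaled distance to the right end-point and rescaled variance.
    print(t, t * (1.0 - mu), t**2 * var)
```

Other values of $\beta$ can be explored by replacing the uniform weights with, say, $(1-x)^\beta$ on the same grid.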

The one-parameter family of attractors $p_\beta^*(x)$ is plotted below:

These results achieve two things. First, they resolve the dynamical insufficiency of Fisher’s fundamental theorem by giving estimates of the speed of evolution in terms of the tail behavior of the initial fitness distribution. Second, they show that natural selection is indeed subject to a form of universality, whereby the relevant statistical structure turns out to be finite dimensional, with only a handful of “conserved quantities” (the $\alpha$ and $\beta$ exponents) controlling the late-time behavior of natural selection. This amounts to a large reduction in complexity and, concomitantly, an enhancement of predictive power.

(For the mathematically-oriented reader, the proof of the theorems above involves two steps: first, translate the selection equation into an equation for (cumulant) generating functions; second, use a suitable Tauberian theorem—the Kasahara theorem—to relate the behavior of generating functions at large values of their arguments to the tail behavior of $p_0(x)$. Details in our paper.)

It’s useful to consider the convergence of fitness distributions to the attractors $p_\beta^*(x)$ for $0\leq\beta\leq \infty$ in the skewness-kurtosis plane, i.e. in terms of the third and fourth cumulants of $p_t(x)$.

The red curve is the family of attractors, with the normal at the bottom right and the flipped exponential at the top left, and the dots correspond to numerical simulations performed with the classical Wright–Fisher model and with a simple genetic algorithm solving a linear programming problem. The attractors attract!

### Conclusion and a question

Statistics is useful because limit theorems (the central limit theorem, the extreme value theorem) exist. Without them, we wouldn’t be able to make any population-level prediction. Same with statistical physics: it is only because matter consists of large numbers of atoms, and limit theorems hold (the H-theorem, the second law), that macroscopic physics is possible in the first place. I believe the same perspective is useful in evolutionary dynamics: it’s true that we can’t predict how many wings birds will have in ten million years, but we can tell what shape fitness distributions should have if natural selection is true.

I’ll close with an open question for you, the reader. In the central limit theorem as well as in the second law of thermodynamics, convergence is driven by a Lyapunov function, namely entropy. (In the case of the central limit theorem, it’s a relatively recent result by Artstein et al.: the entropy of the normalized sum of $n$ i.i.d. random variables, when it’s finite, is a monotonically increasing function of $n$.) In the case of natural selection for unbounded fitness, it’s clear that entropy will also be eventually monotonically increasing—the normal is the distribution with largest entropy at fixed variance and mean.

Yet it turns out that, in our case, entropy isn’t monotonic at all times; in fact, the closer the initial distribution $p_0(x)$ is to the normal distribution, the later the entropy of the standardized fitness distribution starts to increase. Or, equivalently, the closer the initial distribution $p_0(x)$ is to the normal, the later its relative entropy with respect to the normal starts to decrease. Why is this? And what’s the actual Lyapunov function for this process (i.e., what functional of the standardized fitness distribution is monotonic at all times under natural selection)?

In the plots above the blue, orange and green lines correspond respectively to

$\displaystyle{ p_0(x)\propto e^{-x^2/2-x^4}, \quad p_0(x)\propto e^{-x^2/2-.01x^4}, \quad p_0(x)\propto e^{-x^2/2-.001x^4} }$

### References

• S. J. Gould, Wonderful Life: The Burgess Shale and the Nature of History, W. W. Norton & Co., New York, 1989.

• M. Smerlak and A. Youssef, Limiting fitness distributions in evolutionary dynamics, 2015.

• R. A. Fisher, The Genetical Theory of Natural Selection, Oxford University Press, Oxford, 1930.

• S. Artstein, K. Ball, F. Barthe and A. Naor, Solution of Shannon’s problem on the monotonicity of entropy, J. Am. Math. Soc. 17 (2004), 975–982.

## Probability Puzzles (Part 3)

26 March, 2016

Here’s a puzzle based on something interesting that I learned from Greg Egan. I’ve dramatized it a bit.

Traditional Tom and Liberal Lisa are dating. They discuss their plans for having children:

Tom: I plan to keep having kids until I get two sons in a row.

Lisa: What?! That’s absurd. Why?

Tom: I want two to run my store when I get old.

Lisa: Even ignoring your insulting assumption that only boys can manage your shop, why in the world do you need two in a row?

Tom: From my own childhood, I’ve learned there’s a special bond between sons who are next to each other in age. They play together, they grow up together… they can run my shop together.

Lisa: Hmm. Well, then maybe I should have children until I have a girl followed directly by a boy!

Tom: What?!

Lisa: Well, I’ve observed that something special happens when a boy has an older sister, with no intervening siblings. They play together, they grow up together… and maybe he learns not to be such a sexist pig!

They decide they are incompatible, so they split up and each one separately tries to find a mate who will go along with their reproductive plan.

Now for some puzzles. In these puzzles, assume that each time someone has a child, they have a 50% chance of having either a daughter or a son. Also assume each event is independent: that is, the gender of any children so far has no effect on that of later ones. Also ignore twins and other tricky issues.

Puzzle 1. If Tom carries out his plan of having children until he has two consecutive sons, and then stops, what is the expected number of children he will have?

Puzzle 2. If Lisa carries out her plan of having children until she has a daughter followed directly by a son, and then stops, what is the expected number of children she will have?

Puzzle 3. Which is greater, Tom’s expected number of children or Lisa’s? Or are they equal?

For maximum benefit, try to answer Puzzle 3 before doing the calculations required to answer Puzzles 1 or 2.
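Once you’ve committed to an answer, one way to check it is a quick Monte Carlo simulation. The sketch below is just an illustration under the assumptions stated in the puzzles (independent births, 50% chance each way); the function names `children_until` and `estimate`, the number of trials, and the seed are all made up here. It deliberately doesn’t state the answers.

```python
import random

def children_until(pattern, rng):
    """Number of children born until the most recent births match `pattern`.

    `pattern` is a string over 'B' (boy) and 'G' (girl), e.g. "BB" or "GB".
    """
    recent = ""
    count = 0
    while not recent.endswith(pattern):
        # Each child is a boy or a girl with probability 1/2, independently.
        recent = (recent + rng.choice("BG"))[-len(pattern):]
        count += 1
    return count

def estimate(pattern, trials=100_000, seed=0):
    """Monte Carlo estimate of the expected number of children."""
    rng = random.Random(seed)
    return sum(children_until(pattern, rng) for _ in range(trials)) / trials

tom_estimate = estimate("BB")    # Tom: two sons in a row
lisa_estimate = estimate("GB")   # Lisa: a daughter followed directly by a son
print(tom_estimate, lisa_estimate)
```

A simulation only gives estimates, of course, so it’s still worth doing the calculation to see why the numbers come out as they do.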

## Information Geometry (Part 15)

14 January, 2016

joint with Blake Pollard

Lately we’ve been thinking about open Markov processes. These are random processes where something can hop randomly from one state to another (that’s the ‘Markov process’ part) but also enter or leave the system (that’s the ‘open’ part).

The ultimate goal is to understand the nonequilibrium thermodynamics of open systems—systems where energy and maybe matter flows in and out. If we could understand this well enough, we could understand in detail how life works. That’s a difficult job! But one has to start somewhere, and this is one place to start.

We have a few papers on this subject:

• Blake Pollard, A Second Law for open Markov processes. (Blog article here.)

• John Baez, Brendan Fong and Blake Pollard, A compositional framework for Markov processes. (Blog article here.)

• Blake Pollard, Open Markov processes: A compositional perspective on non-equilibrium steady states in biology. (Blog article here.)

However, right now we just want to show you three closely connected results about how relative entropy changes in open Markov processes.

### Definitions

An open Markov process consists of a finite set $X$ of states, a subset $B \subseteq X$ of boundary states, and an infinitesimal stochastic operator $H: \mathbb{R}^X \to \mathbb{R}^X,$ meaning a linear operator with

$H_{ij} \geq 0 \ \ \text{for all} \ \ i \neq j$

and

$\sum_i H_{ij} = 0 \ \ \text{for all} \ \ j$

For each state $i \in X$ we introduce a population $p_i \in [0,\infty).$ We call the resulting function $p : X \to [0,\infty)$ the population distribution.

Populations evolve in time according to the open master equation:

$\displaystyle{ \frac{dp_i}{dt} = \sum_j H_{ij}p_j} \ \ \text{for all} \ \ i \in X-B$

$p_i(t) = b_i(t) \ \ \text{for all} \ \ i \in B$

So, the populations $p_i$ obey a linear differential equation at states $i$ that are not in the boundary, but they are specified ‘by the user’ to be chosen functions $b_i$ at the boundary states. The off-diagonal entries $H_{ij}$ for $i \neq j$ describe the rate at which population transitions from the $j$th to the $i$th state.

A closed Markov process, or continuous-time discrete-state Markov chain, is an open Markov process whose boundary is empty. For a closed Markov process, the open master equation becomes the usual master equation:

$\displaystyle{ \frac{dp}{dt} = Hp }$

In a closed Markov process the total population is conserved:

$\displaystyle{ \frac{d}{dt} \sum_{i \in X} p_i = \sum_{i,j} H_{ij}p_j = 0 }$

This lets us normalize the initial total population to 1 and have it stay equal to 1. If we do this, we can talk about probabilities instead of populations. In an open Markov process, population can flow in and out at the boundary states.
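Here is a minimal sketch of the closed case, to make the conservation law concrete. Everything specific in it is an assumption for illustration: the four states, the randomly chosen rates, and the use of forward Euler steps to integrate the master equation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                     # assumed number of states

# Build an infinitesimal stochastic operator: nonnegative off-diagonal
# entries, with the diagonal chosen so that every column sums to zero.
H = rng.uniform(0.1, 1.0, size=(n, n))
np.fill_diagonal(H, 0.0)
H -= np.diag(H.sum(axis=0))

p = rng.uniform(0.5, 2.0, size=n)
p /= p.sum()                              # normalize total population to 1

dt = 1e-3
for _ in range(5000):
    p = p + dt * (H @ p)                  # forward Euler step of dp/dt = Hp

total = p.sum()
print(total)                              # stays at 1 up to rounding error
```

Because the columns of $H$ sum to zero, each Euler step leaves the total population unchanged, mirroring the calculation above.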

For any pair of distinct states $i,j,$ $H_{ij}p_j$ is the flow of population from $j$ to $i.$ The net flux of population from the $j$th state to the $i$th state is the flow from $j$ to $i$ minus the flow from $i$ to $j$:

$J_{ij} = H_{ij}p_j - H_{ji}p_i$

A steady state is a solution of the open master equation that does not change with time. A steady state for a closed Markov process is typically called an equilibrium. So, an equilibrium obeys the master equation at all states, while for a steady state this may not be true at the boundary states. The idea is that population can flow in or out at the boundary states.

We say an equilibrium $p : X \to [0,\infty)$ of a Markov process is detailed balanced if all the net fluxes vanish:

$J_{ij} = 0 \ \ \text{for all} \ \ i,j \in X$

or in other words:

$H_{ij}p_j = H_{ji}p_i \ \ \text{for all} \ \ i,j \in X$

Given two population distributions $p, q : X \to [0,\infty)$ we can define the relative entropy

$\displaystyle{ I(p,q) = \sum_i p_i \ln \left( \frac{p_i}{q_i} \right)}$

When $q$ is a detailed balanced equilibrium solution of the master equation, the relative entropy can be seen as the ‘free energy’ of $p.$ For a precise statement, see Section 4 of Relative entropy in biological systems.

The Second Law of Thermodynamics implies that the free energy of a closed system tends to decrease with time, so for closed Markov processes we expect $I(p,q)$ to be nonincreasing. And this is true! But for open Markov processes, free energy can flow in from outside. This is just one of several nice results about how relative entropy changes with time.

### Results

Theorem 1. Consider an open Markov process with $X$ as its set of states and $B$ as the set of boundary states. Suppose $p(t)$ and $q(t)$ obey the open master equation, and let the quantities

$\displaystyle{ \frac{Dp_i}{Dt} = \frac{dp_i}{dt} - \sum_{j \in X} H_{ij}p_j }$

$\displaystyle{ \frac{Dq_i}{Dt} = \frac{dq_i}{dt} - \sum_{j \in X} H_{ij}q_j }$

measure how much the time derivatives of $p_i$ and $q_i$ fail to obey the master equation. Then we have

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i, j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right)} \\ \\ && \; + \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt} } \end{array}$

This result separates the change in relative entropy into two parts: an ‘internal’ part and a ‘boundary’ part.
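Theorem 1 can be sanity-checked numerically at a single instant of time. The sketch below makes several assumptions purely for illustration: four states with boundary $B=\{0,3\}$, random rates, random positive populations, and random time derivatives at the boundary states. It compares the chain-rule expression for $\frac{d}{dt} I(p,q)$ with the right-hand side of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, boundary = 4, [0, 3]                   # assumed: states 0..3, B = {0, 3}

H = rng.uniform(0.1, 1.0, size=(n, n))
np.fill_diagonal(H, 0.0)
H -= np.diag(H.sum(axis=0))               # infinitesimal stochastic

p = rng.uniform(0.5, 2.0, size=n)
q = rng.uniform(0.5, 2.0, size=n)

# Time derivatives: master equation in the interior, arbitrary at the boundary.
dp, dq = H @ p, H @ q
dp[boundary] = rng.normal(size=len(boundary))
dq[boundary] = rng.normal(size=len(boundary))
Dp, Dq = dp - H @ p, dq - H @ q           # vanish at non-boundary states

dI_dp = 1.0 + np.log(p / q)               # partial derivatives of I(p, q)
dI_dq = -p / q

lhs = np.sum(dI_dp * dp + dI_dq * dq)     # chain rule for dI/dt

r = p / q
internal = np.sum(H * p[None, :] * (np.log(r)[:, None] - r[:, None] / r[None, :]))
boundary_part = np.sum(dI_dp[boundary] * Dp[boundary] + dI_dq[boundary] * Dq[boundary])
rhs = internal + boundary_part

print(lhs, rhs)                           # the two sides should agree
```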

It turns out the ‘internal’ part is always less than or equal to zero. So, from Theorem 1 we can deduce a version of the Second Law of Thermodynamics for open Markov processes:

Theorem 2. Given the conditions of Theorem 1, we have

$\displaystyle{ \frac{d}{dt} I(p(t),q(t)) \; \le \; \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt} }$

Intuitively, this says that free energy can only increase if it comes in from the boundary!
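The claim that the ‘internal’ part is never positive can also be probed by random sampling. This is only a sketch, not a proof: the sizes, the ranges of the random rates and populations, and the helper name `internal_part` are all assumptions made for the example.

```python
import numpy as np

def internal_part(H, p, q):
    """Sum over i, j of H_ij p_j (ln(p_i/q_i) - p_i q_j / (p_j q_i))."""
    r = p / q
    return np.sum(H * p[None, :] * (np.log(r)[:, None] - r[:, None] / r[None, :]))

rng = np.random.default_rng(2)
worst = -np.inf
for _ in range(1000):
    n = int(rng.integers(2, 6))            # between 2 and 5 states
    H = rng.uniform(0.0, 1.0, size=(n, n))
    np.fill_diagonal(H, 0.0)
    H -= np.diag(H.sum(axis=0))            # infinitesimal stochastic
    p = rng.uniform(0.1, 2.0, size=n)
    q = rng.uniform(0.1, 2.0, size=n)
    worst = max(worst, internal_part(H, p, q))

print(worst)   # largest value seen; Theorem 2 says it cannot exceed 0
```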

There is another nice result that holds when $q$ is an equilibrium solution of the master equation. This idea seems to go back to Schnakenberg:

Theorem 3. Given the conditions of Theorem 1, suppose also that $q$ is an equilibrium solution of the master equation. Then we have

$\displaystyle{ \frac{d}{dt} I(p(t),q) = -\frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} \; + \; \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} }$

where

$J_{ij} = H_{ij}p_j - H_{ji}p_i$

is the net flux from $j$ to $i,$ while

$\displaystyle{ A_{ij} = \ln \left(\frac{p_j q_i}{p_i q_j} \right) }$

is the conjugate thermodynamic force.

The flux $J_{ij}$ has a nice meaning: it’s the net flow of population from $j$ to $i.$ The thermodynamic force is a bit subtler, but this theorem reveals its meaning: it says how much the population wants to flow from $j$ to $i.$

More precisely, up to that factor of $1/2,$ the thermodynamic force $A_{ij}$ says how much free energy loss is caused by net flux from $j$ to $i.$ There’s a nice analogy here to water losing potential energy as it flows downhill due to the force of gravity.
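The ‘internal’ part of Theorem 3 can be checked numerically as well. In this sketch the four states and random rates are assumptions, and the equilibrium $q$ is computed as a null vector of $H$ using a singular value decomposition, which is just one convenient way to find it. The code compares the internal term from Theorem 1 with $-\frac{1}{2}\sum_{i,j} J_{ij} A_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4                                      # assumed number of states

H = rng.uniform(0.1, 1.0, size=(n, n))
np.fill_diagonal(H, 0.0)
H -= np.diag(H.sum(axis=0))                # infinitesimal stochastic

# Equilibrium q: a null vector of H, normalized so its entries sum to 1.
_, _, vt = np.linalg.svd(H)
q = vt[-1] / vt[-1].sum()

p = rng.uniform(0.5, 2.0, size=n)

flow = H * p[None, :]                      # flow[i, j] = H_ij p_j
J = flow - flow.T                          # net flux from j to i
L = np.log(p / q)
A = L[None, :] - L[:, None]                # A_ij = ln(p_j q_i / (p_i q_j))

r = p / q
internal = np.sum(flow * (np.log(r)[:, None] - r[:, None] / r[None, :]))
dissipation = -0.5 * np.sum(J * A)

print(internal, dissipation)               # should agree, and be <= 0
```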

### Proofs

Proof of Theorem 1. We begin by taking the time derivative of the relative entropy:

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i \in X} \frac{\partial I}{\partial p_i} \frac{dp_i}{dt} + \frac{\partial I}{\partial q_i} \frac{dq_i}{dt} } \end{array}$

We can separate this into a sum over states $i \in X - B,$ for which the time derivatives of $p_i$ and $q_i$ are given by the master equation, and boundary states $i \in B,$ for which they are not:

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i \in X-B, \; j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j }\\ \\ && + \; \; \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{dp_i}{dt} + \frac{\partial I}{\partial q_i} \frac{dq_i}{dt}} \end{array}$

For boundary states we have

$\displaystyle{ \frac{dp_i}{dt} = \frac{Dp_i}{Dt} + \sum_{j \in X} H_{ij}p_j }$

and similarly for the time derivative of $q_i.$ We thus obtain

$\begin{array}{ccl} \displaystyle{ \frac{d}{dt} I(p(t),q(t)) } &=& \displaystyle{ \sum_{i,j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j }\\ \\ && + \; \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt}} \end{array}$

To evaluate the first sum, recall that

$\displaystyle{ I(p,q) = \sum_{i \in X} p_i \ln (\frac{p_i}{q_i})}$

so

$\displaystyle{\frac{\partial I}{\partial p_i}} =\displaystyle{1 + \ln (\frac{p_i}{q_i})} , \qquad \displaystyle{ \frac{\partial I}{\partial q_i}}= \displaystyle{- \frac{p_i}{q_i} }$

Thus, we have

$\displaystyle{ \sum_{i,j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j = \sum_{i,j\in X} (1 + \ln (\frac{p_i}{q_i})) H_{ij} p_j - \frac{p_i}{q_i} H_{ij} q_j }$

We can rewrite this as

$\displaystyle{ \sum_{i,j \in X} H_{ij} p_j \left( 1 + \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) }$

Since $H_{ij}$ is infinitesimal stochastic we have $\sum_{i} H_{ij} = 0,$ so the first term drops out, and we are left with

$\displaystyle{ \sum_{i,j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) }$

as desired.   █

Proof of Theorem 2. Thanks to Theorem 1, to prove

$\displaystyle{ \frac{d}{dt} I(p(t),q(t)) \; \le \; \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} + \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt} }$

it suffices to show that

$\displaystyle{ \sum_{i,j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) \le 0 }$

or equivalently (recalling the proof of Theorem 1):

$\displaystyle{ \sum_{i,j} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) \le 0 }$

The last two terms on the left hand side cancel when $i = j.$ Thus, if we break the sum into an $i \ne j$ part and an $i = j$ part, the left side becomes

$\displaystyle{ \sum_{i \ne j} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) \; + \; \sum_j H_{jj} p_j \ln(\frac{p_j}{q_j}) }$

Next we can use the infinitesimal stochastic property of $H$ to write $H_{jj}$ as the sum of $-H_{ij}$ over $i$ not equal to $j,$ obtaining

$\displaystyle{ \sum_{i \ne j} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) - \sum_{i \ne j} H_{ij} p_j \ln(\frac{p_j}{q_j}) } =$

$\displaystyle{ \sum_{i \ne j} H_{ij} p_j \left( \ln(\frac{p_iq_j}{p_j q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) }$

Since $H_{ij} \ge 0$ when $i \ne j$ and $\ln(s) + 1 - s \le 0$ for all $s > 0,$ we conclude that this quantity is $\le 0.$   █

Proof of Theorem 3. Now suppose also that $q$ is an equilibrium solution of the master equation. Then $Dq_i/Dt = dq_i/dt = 0$ for all states $i,$ so by Theorem 1 we need to show

$\displaystyle{ \sum_{i, j \in X} H_{ij} p_j \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) \; = \; -\frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} }$

We also have $\sum_{j \in X} H_{ij} q_j = 0,$ so the second term in the sum at left vanishes, and it suffices to show

$\displaystyle{ \sum_{i, j \in X} H_{ij} p_j \ln(\frac{p_i}{q_i}) \; = \; - \frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} }$

By definition we have

$\displaystyle{ \frac{1}{2} \sum_{i,j} J_{ij} A_{ij}} = \displaystyle{ \frac{1}{2} \sum_{i,j} \left( H_{ij} p_j - H_{ji}p_i \right) \ln \left( \frac{p_j q_i}{p_i q_j} \right) }$

This in turn equals

$\displaystyle{ \frac{1}{2} \sum_{i,j} H_{ij}p_j \ln \left( \frac{p_j q_i}{p_i q_j} \right) - \frac{1}{2} \sum_{i,j} H_{ji}p_i \ln \left( \frac{p_j q_i}{p_i q_j} \right) }$

and we can switch the dummy indices $i,j$ in the second sum, obtaining

$\displaystyle{ \frac{1}{2} \sum_{i,j} H_{ij}p_j \ln \left( \frac{p_j q_i}{p_i q_j} \right) - \frac{1}{2} \sum_{i,j} H_{ij}p_j \ln \left( \frac{p_i q_j}{p_j q_i} \right) }$

or simply

$\displaystyle{ \sum_{i,j} H_{ij} p_j \ln \left( \frac{p_j q_i}{p_i q_j} \right) }$

But this is

$\displaystyle{ \sum_{i,j} H_{ij} p_j \left(\ln ( \frac{p_j}{q_j}) + \ln (\frac{q_i}{p_i}) \right) }$

and the first term vanishes because $H$ is infinitesimal stochastic: $\sum_i H_{ij} = 0.$ We thus have

$\displaystyle{ \frac{1}{2} \sum_{i,j} J_{ij} A_{ij}} = \sum_{i,j} H_{ij} p_j \ln (\frac{q_i}{p_i} )$

as desired.   █

21 December, 2015

A few years ago, after hearing Susan Holmes speak about the mathematics of phylogenetic trees, I became interested in their connection to algebraic topology. I wrote an article about this here:

• John Baez, Operads and the tree of life, 6 July 2011.

In trying to make the ideas precise I recruited the help of Nina Otter, who was then a graduate student at ETH Zürich. She came to Riverside and we started to work together.

Now Nina is a grad student at Oxford working on mathematical biology with Heather Harrington. I visited her last summer and we made more progress… but then she realized that our paper needed another big theorem, a result relating our topology on the space of phylogenetic trees to the topology described by Susan Holmes and her coauthors here:

• Louis J. Billera, Susan P. Holmes and Karen Vogtmann, Geometry of the space of phylogenetic trees, Advances in Applied Mathematics 27 (2001), 733–767.

It took another half year to finish things up. I could never have done this myself.

But now we’re done! Here’s our paper:

• John Baez and Nina Otter, Operads and phylogenetic trees.

### The basic idea

Trees are important, not only in mathematics, but also biology. The most important is the ‘tree of life’ relating all organisms that have ever lived on Earth. Darwin drew this sketch of it in 1837:

He wrote about it in On the Origin of Species:

The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. The green and budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species. At each period of growth all the growing twigs have tried to branch out on all sides, and to overtop and kill the surrounding twigs and branches, in the same manner as species and groups of species have at all times overmastered other species in the great battle for life.

Now we know that the tree of life is not really a tree in the mathematical sense. One reason is ‘endosymbiosis’: the incorporation of one organism together with its genetic material into another, as probably happened with the mitochondria in our cells and also the plastids that hold chlorophyll in plants. Another is ‘horizontal gene transfer’: the passing of genetic material from one organism to another, which happens frequently with bacteria. So, the tree of life is really a thicket:

But a tree is often a good approximation, especially for animals and plants in the last few hundred million years. Biologists who try to infer phylogenetic trees from present-day genetic data often use simple models where:

• the genotype of each species follows a random walk, but

• species branch in two at various times.

These are called ‘Markov models’. The simplest Markov model for DNA evolution is the Jukes–Cantor model. Consider one or more pieces of DNA having a total of $N$ base pairs. We can think of this as a string of letters chosen from the set {A,T,C,G}:

…ATCGATTGAGCTCTAGCG…

As time passes, the Jukes–Cantor model says the DNA changes randomly, with each base pair having the same constant rate of randomly flipping to any other. So, we get a Markov process on the set

$X = \{\textrm{A,T,C,G}\}{}^N$

However, a species can also split in two. So, given current-day genetic data from various species, biologists try to infer the most probable tree where, starting from a common ancestor, the DNA in question undergoes a random walk most of the time but branches in two at certain times.
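To make the Jukes–Cantor model concrete, here is a rough sketch for a single site, so $N = 1$. It assumes each base flips to each other base at a constant rate `rate`, a parameter name I am making up. It exponentiates the resulting $4 \times 4$ generator with a hand-rolled Taylor series and compares the result with the standard closed form: the probability of seeing the same base after time $t$ is $\frac{1}{4} + \frac{3}{4} e^{-4 \cdot \mathrm{rate} \cdot t}$. For $N$ independent sites the transition probabilities would be products of these.

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(M, squarings=10, terms=20):
    """exp(M) via a Taylor series for M / 2**squarings, followed by repeated squaring."""
    n = len(M)
    scaled = [[x / 2 ** squarings for x in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jukes_cantor_generator(rate):
    """Every off-diagonal entry is `rate`; the diagonal makes each column sum to zero."""
    return [[(-3.0 * rate if i == j else rate) for j in range(4)] for i in range(4)]

rate, t = 0.5, 1.7  # arbitrary illustrative values
P = mat_exp([[t * h for h in row] for row in jukes_cantor_generator(rate)])
decay = math.exp(-4.0 * rate * t)
print(P[0][0], 0.25 + 0.75 * decay)  # probability of ending at the same base
print(P[1][0], 0.25 - 0.25 * decay)  # probability of ending at one particular other base
```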

To formalize this, we can define a concept of ‘phylogenetic tree’. Our work is based on the definition of Billera, Holmes and Vogtmann, though we use a slightly different definition, for reasons that will soon become clear. For us, a phylogenetic tree is a rooted tree with leaves labelled by numbers $1,2, \dots, n$ and edges labelled by ‘times’ or, geometrically speaking, ‘lengths’ in $[0, \infty).$ We require that:

• the length of every edge is positive, except perhaps for ‘external edges’: that is, edges incident to the leaves or root;

• there are no 1-ary vertices.

For example, here is a phylogenetic tree with 5 leaves:

where $\ell_1, \dots, \ell_6 \ge 0$ but we demand that $\ell_7 > 0.$ We draw the vertices as dots. We do not count the leaves and the root as vertices, and we label the root with the number $0.$ We cannot collapse edges of length zero that end at leaves, since doing so would eliminate those leaves. Also note that the embedding of the tree in the plane is irrelevant, so this counts as the same phylogenetic tree:

In applications to biology, we are often interested in trees where the total distance from the root to the leaf is the same for every leaf, since all species have evolved for the same time from their common ancestor. These are mathematically interesting as well, because then the distance between any two leaves defines an ultrametric on the set of leaves. However, more general phylogenetic trees are also interesting—and they become essential when we construct an operad whose operations are phylogenetic trees.
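The ultrametric claim can be checked on small examples. In the sketch below a tree is a nested pair `(length, rest)`, where `rest` is either an integer leaf label or a list of subtrees. This encoding is just an assumption for illustration, not notation from our paper. The distance between two leaves is the length of the path joining them.

```python
from itertools import permutations

def leaf_paths(tree, depth=0.0, path=()):
    """Map each leaf label to (leaf depth, [(vertex id, vertex depth), ...] from the top down)."""
    length, rest = tree
    depth += length
    if isinstance(rest, int):  # a leaf
        return {rest: (depth, [])}
    out = {}
    for k, child in enumerate(rest):
        for label, (d, ancestors) in leaf_paths(child, depth, path + (k,)).items():
            out[label] = (d, [(path, depth)] + ancestors)
    return out

def leaf_distance(paths, a, b):
    """Path length between leaves a and b, computed via their last common vertex."""
    depth_a, ancestors_a = paths[a]
    depth_b, ancestors_b = paths[b]
    common = 0.0
    for (va, d), (vb, _) in zip(ancestors_a, ancestors_b):
        if va != vb:
            break
        common = d
    return depth_a + depth_b - 2 * common

def is_ultrametric(tree, tol=1e-12):
    paths = leaf_paths(tree)
    return all(
        leaf_distance(paths, x, z) <= max(leaf_distance(paths, x, y), leaf_distance(paths, y, z)) + tol
        for x, y, z in permutations(paths, 3))

# Every leaf is at distance 4 from the root.
equidistant = (1.0, [(2.0, [(1.0, 1), (1.0, 2)]), (3.0, 3)])
# Leaves at distances 2, 4 and 1 from the root.
lopsided = (0.0, [(1.0, [(1.0, 1), (3.0, 2)]), (1.0, 3)])
print(is_ultrametric(equidistant), is_ultrametric(lopsided))
```

In the second tree the three leaf distances are 4, 3 and 5, so the strong triangle inequality fails.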

Let $\textrm{Phyl}_n$ be the set of phylogenetic trees with n leaves. This has a natural topology, which we explain in Section 6 of our paper. For example, here is a continuous path in $\textrm{Phyl}_4$ where we only change the length of one internal edge, reducing it until it becomes zero and we can collapse it:

Phylogenetic trees reconstructed by biologists are typically binary. When a phylogenetic tree appears to have higher arity, sometimes we merely lack sufficient data to resolve a higher-arity branching into a number of binary ones. With the topology we are using on $\textrm{Phyl}_n,$ binary trees form an open dense set of $\textrm{Phyl}_n,$ except for $\textrm{Phyl}_1.$ However, trees of higher arity are still important, because paths, paths of paths, etc. in $\textrm{Phyl}_n$ are often forced to pass through trees of higher arity.

Billera, Holmes and Vogtmann focused their attention on the set $\mathcal{T}_n$ of phylogenetic trees where lengths of the external edges—edges incident to the root and leaves—are fixed to a constant value. They endow $\mathcal{T}_n$ with a metric, which induces a topology on $\mathcal{T}_n,$ and there is a homeomorphism

$\textrm{Phyl}_n \cong \mathcal{T}_n \times [0,\infty)^{n+1}$

where the data in $[0,\infty)^{n+1}$ describe the lengths of the external edges in a general phylogenetic tree.

In algebraic topology, trees are often used to describe the composition of n-ary operations. This is formalized in the theory of operads. An ‘operad’ is an algebraic structure where for each natural number $n=0,1,2,\dots$ we have a set $O_n$ whose elements are considered as abstract n-ary operations, not necessarily operating on anything yet. An element $f \in O_n$ can be depicted as a planar tree with one vertex and n labelled leaves:

We can compose these operations in a tree-like way to get new operations:

and an associative law holds, making this sort of composite unambiguous:

There are various kinds of operads, but in this paper our operads will always be ‘unital’, having an operation $1 \in O_1$ that acts as an identity for composition. They will also be ‘symmetric’, meaning there is an action of the symmetric group $S_n$ on each set $O_n,$ compatible with composition. Further, they will be ‘topological’, meaning that each set $O_n$ is a topological space, with composition and permutations acting as continuous maps.

In Section 5 we prove that there is an operad $\textrm{Phyl},$ the ‘phylogenetic operad’, whose space of n-ary operations is $\textrm{Phyl}_n.$ This raises a number of questions:

• What is the mathematical nature of this operad?

• How is it related to ‘Markov processes with branching’?

• How is it related to known operads in topology?

Briefly, the answer is that $\textrm{Phyl}$ is the coproduct of $\textrm{Com},$ the operad for commutative topological semigroups, and $[0,\infty),$ the operad having only unary operations, one for each $t \in [0,\infty).$ The first describes branching, the second describes Markov processes. Moreover, $\textrm{Phyl}$ is closely related to the Boardman–Vogt $W$ construction applied to $\textrm{Com}.$ This is a construction that Boardman and Vogt applied to another operad in order to obtain an operad whose algebras are loop spaces.

To understand all this in more detail, first recall that the raison d’être of operads is to have ‘algebras’. The most traditional sort of algebra of an operad $O$ is a topological space $X$ on which each operation $f \in O_n$ acts as a continuous map

$\alpha(f)\colon X^n \to X$

obeying some conditions: composition, the identity, and the permutation group actions are preserved, and $\alpha(f)$ depends continuously on $f.$ The idea is that the abstract operations in $O$ are realized as actual operations on the space $X.$

In this paper we instead need algebras of a linear sort. Such an algebra of $O$ is a finite-dimensional real vector space $V$ on which each operation $f \in O_n$ acts as a multilinear map

$\alpha(f)\colon V^n \to V$

obeying the same list of conditions. We can also think of $\alpha(f)$ as a linear map

$\alpha(f)\colon V^{\otimes n} \to V$

where $V^{\otimes n}$ is the nth tensor power of $V.$

We also need ‘coalgebras’ of operads. The point is that while ordinarily one thinks of an operation $f \in O_n$ as having n inputs and one output, a phylogenetic tree is better thought of as having one input and n outputs. A coalgebra of $O$ is a finite-dimensional real vector space $V$ on which each operation $f \in O_n$ gives a linear map

$\alpha(f) \colon V \to V^{\otimes n}$

obeying the same conditions as an algebra, but ‘turned around’.

The main point of this paper is that the phylogenetic operad has interesting coalgebras, which correspond to how phylogenetic trees are actually used to describe branching Markov processes in biology. But to understand this, we need to start by looking at coalgebras of two operads from which the phylogenetic operad is built.

By abuse of notation, we will use $[0,\infty)$ as the name for the operad having only unary operations, one for each $t \in [0,\infty),$ with composition of operations given by addition. A coalgebra of $[0,\infty)$ is a finite-dimensional real vector space $V$ together with for each $t \in [0,\infty)$ a linear map

$\alpha(t) \colon V \to V$

such that:

• $\alpha(s+t) = \alpha(s) \alpha(t)$ for all $s,t \in [0,\infty)$

• $\alpha(0) = 1_V$

• $\alpha(t)$ depends continuously on $t.$

Analysts usually call such a thing a ‘continuous one-parameter semigroup’ of operators on $V.$

Given a finite set $X,$ a ‘Markov process’ or ‘continuous-time Markov chain’ on $X$ is a continuous one-parameter semigroup of operators on $\mathbb{R}^X$ such that if $f \in \mathbb{R}^X$ is a probability distribution on $X,$ so is $\alpha(t) f$ for all $t \in [0,\infty).$ Equivalently, if we think of $\alpha(t)$ as an $X \times X$ matrix of real numbers, we demand that its entries be nonnegative and each column sum to 1. Such a matrix is called ‘stochastic’. If $X$ is a set of possible sequences of base pairs, a Markov process on $X$ describes the random changes of DNA with the passage of time. We shall see that any Markov process on $X$ makes $\mathbb{R}^X$ into a coalgebra of $[0,\infty).$
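For a two-element set $X$ everything can be written down in closed form, which gives a quick sanity check on the semigroup laws. The sketch below assumes the generator $H = \left(\begin{smallmatrix} -a & b \\ a & -b \end{smallmatrix}\right)$ with made-up rates $a, b > 0.$ The function name `alpha` just mirrors the notation above.

```python
import math

def alpha(t, a=2.0, b=1.0):
    """exp(tH) for H = [[-a, b], [a, -b]]; column j gives the probabilities of leaving state j."""
    s = a + b
    e = math.exp(-s * t)
    return [[(b + a * e) / s, (b - b * e) / s],
            [(a - a * e) / s, (a + b * e) / s]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def is_stochastic(M, tol=1e-12):
    """Nonnegative entries, each column summing to 1."""
    n = len(M)
    nonnegative = all(x >= -tol for row in M for x in row)
    columns_ok = all(abs(sum(M[i][j] for i in range(n)) - 1.0) <= tol for j in range(n))
    return nonnegative and columns_ok

def close(A, B, tol=1e-12):
    return all(abs(x - y) <= tol for ra, rb in zip(A, B) for x, y in zip(ra, rb))

s, t = 0.3, 0.5
print(close(alpha(s + t), mat_mul(alpha(s), alpha(t))))  # semigroup law
print(alpha(0.0))                                        # identity matrix
print(is_stochastic(alpha(t)))
```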

This handles the Markov process aspect of DNA evolution; what about the branching? For this we use $\textrm{Com},$ the unique operad with one n-ary operation for each n > 0. Algebras of $\textrm{Com}$ are not-necessarily-unital commutative algebras: there is only one way to multiply n elements for n > 0.

For us what matters most is that coalgebras of $\textrm{Com}$ are finite-dimensional cocommutative coalgebras, not necessarily with counit. If $X$ is a finite set, there is a cocommutative coalgebra whose underlying vector space is $\mathbb{R}^X.$ The unique n-ary operation of $\textrm{Com}$ acts as the linear map

$\displaystyle{ \Delta_n \colon \mathbb{R}^X \to \mathbb{R}^X \otimes \cdots \otimes \mathbb{R}^X \; \cong \;\mathbb{R}^{X^n} }$

where

$\Delta_n (f)(x_1, \dots, x_n) = \left\{ \begin{array}{cl} f(x) & \textrm{if } x_1 = \cdots = x_n = x \\ \\ 0 & \textrm{otherwise} \end{array} \right.$

This map describes the ‘n-fold duplication’ of a probability distribution $f$ on the set $X$ of possible genes. We use this map to say what takes place when a species branches:
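As a rough illustration of the duplication map $\Delta_n,$ here is a sketch that stores an element of $\mathbb{R}^X$ as a Python dictionary keyed by elements of $X.$ The dictionary encoding and the sample numbers are assumptions made only for this example.

```python
from itertools import product

def duplicate(f, n):
    """Delta_n: send f in R^X to the function on X^n that vanishes off the diagonal."""
    return {xs: (f[xs[0]] if len(set(xs)) == 1 else 0.0)
            for xs in product(f, repeat=n)}

f = {"A": 0.1, "T": 0.2, "C": 0.3, "G": 0.4}
joint = duplicate(f, 2)
print(joint[("A", "A")], joint[("A", "T")])
```

If $f$ is a probability distribution, so is $\Delta_n(f)$: all the probability sits on tuples whose entries agree, which says the descendant species start out genetically identical.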

Next, we wish to describe how to combine the operads $[0,\infty)$ and $\textrm{Com}$ to obtain the phylogenetic operad. Any pair of operads $O$ and $O'$ has a coproduct $O + O'.$ The definition of coproduct gives an easy way to understand the algebras of $O + O'.$ Such an algebra is simply an object that is both an algebra of $O$ and an algebra of $O',$ with no compatibility conditions imposed.

One can also give an explicit construction of $O + O'.$ Briefly, the n-ary operations of $O + O'$ are certain equivalence classes of trees with leaves labelled $\{1,\dots, n\}$ and all nodes, except for leaves, labelled by operations in $O$ or $O'.$ While this fact is surely known to experts on operads, it seems hard to find in the literature, so we prove this in Theorem 24.

Given this, it should come as no surprise that the operad $\textrm{Phyl}$ is the coproduct $\textrm{Com} + [0,\infty).$ In fact, we shall take this as a definition. Starting from this definition, we work backwards to show that the operations of $\textrm{Phyl}$ correspond to phylogenetic trees. We prove this in Theorem 28. The definition of coproduct determines a topology on the spaces $\textrm{Phyl}_n,$ and it is a nontrivial fact that with this topology we have $\textrm{Phyl}_n \cong \mathcal{T}_n \times [0,\infty)^{n+1}$ for n > 1, where $\mathcal{T}_n$ has the topology defined by Billera, Holmes and Vogtmann. We prove this in Theorem 34.

Using the definition of the phylogenetic operad as a coproduct, it is clear that given any Markov process on a finite set $X,$ the vector space $\mathbb{R}^X$ naturally becomes a coalgebra of this operad. The reason is that, as we have seen, $\mathbb{R}^X$ is automatically a coalgebra of $\textrm{Com},$ and the Markov process makes it into a coalgebra of $[0,\infty).$ Thus, by the universal property of a coproduct, it becomes a coalgebra of $\textrm{Phyl} \cong \textrm{Com} + [0,\infty).$ We prove this in Theorem 36.
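Here is a sketch of how a single phylogenetic tree might act on $\mathbb{R}^X$ in this coalgebra, taking $X = \{\textrm{A,T,C,G}\}$ and the one-site Jukes–Cantor process with rate 1. The tree encoding, a nested pair `(length, rest)` with `rest` either a leaf label or a list of subtrees, and all function names are assumptions for illustration. The idea is to run the Markov process along each edge and duplicate at each vertex.

```python
import math

BASES = "ATCG"

def jukes_cantor(t, rate=1.0):
    """Closed-form transition probabilities, keyed as (new base, old base)."""
    decay = math.exp(-4.0 * rate * t)
    same, different = 0.25 + 0.75 * decay, 0.25 - 0.25 * decay
    return {(y, x): (same if x == y else different) for x in BASES for y in BASES}

def leaf_law(tree, x):
    """Joint law of the leaf bases, given base x at the top of the tree's first edge.
    Keys are frozensets of (leaf label, base) pairs."""
    length, rest = tree
    transition = jukes_cantor(length)
    law = {}
    for y in BASES:  # base reached at the bottom of this edge
        if isinstance(rest, int):
            below = {frozenset([(rest, y)]): 1.0}
        else:
            # Duplication: every child starts from the same base y, then evolves independently.
            below = {frozenset(): 1.0}
            for child in rest:
                child_law = leaf_law(child, y)
                below = {a | b: pa * pb
                         for a, pa in below.items()
                         for b, pb in child_law.items()}
        for assignment, prob in below.items():
            law[assignment] = law.get(assignment, 0.0) + transition[(y, x)] * prob
    return law

def apply_tree(tree, f):
    """Linear map from R^X to R^(X^n) determined by the tree."""
    joint = {}
    for x, weight in f.items():
        for assignment, prob in leaf_law(tree, x).items():
            joint[assignment] = joint.get(assignment, 0.0) + weight * prob
    return joint

tree = (0.5, [(0.2, [(0.1, 1), (0.1, 2)]), (0.3, 3)])
uniform = {x: 0.25 for x in BASES}
joint = apply_tree(tree, uniform)
print(len(joint), sum(joint.values()))
```

When all edge lengths are zero this reduces to pure duplication, and a probability distribution at the root always yields a probability distribution on the leaves.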

Since operads arose in algebraic topology, it is interesting to consider how the phylogenetic operad connects to ideas from that subject. Boardman and Vogt defined a construction on operads, the ‘$W$ construction’, which when applied to the operad for spaces with an associative multiplication gives an operad for loop spaces. The operad $\textrm{Phyl}$ has an interesting relation to $W(\textrm{Com}).$ To see this, define addition on $[0,\infty]$ in the obvious way, where

$\infty + t = t + \infty = \infty$

Then $[0,\infty]$ becomes a commutative topological monoid, so we obtain an operad with only unary operations, one for each $t \in [0,\infty],$ where composition is addition. By abuse of notation, let us call this operad $[0,\infty].$

Boardman and Vogt’s $W$ construction involves trees with edges having lengths in $[0,1],$ but we can equivalently use $[0,\infty].$ Leinster has observed that for any nonsymmetric topological operad $O,$ Boardman and Vogt’s operad $W(O)$ is closely related to $O + [0,\infty].$ Here we make this observation precise in the symmetric case. Operations in $\textrm{Com} + [0,\infty]$ are just like phylogenetic trees except that edges may have length $\infty.$ Moreover, for any operad $O,$ the operad $W(O)$ is a non-unital suboperad of $O + [0,\infty].$ An operation of $O + [0,\infty]$ lies in $W(O)$ if and only if all the external edges of the corresponding tree have length $\infty.$ We prove this in Theorem 40.

Berger and Moerdijk showed that if $S_n$ acts freely on $O_n$ and $O_1$ is well-pointed, then $W(O)$ is a cofibrant replacement for $O.$ This is true for $O = \textrm{Assoc},$ the operad whose algebras are topological semigroups. This cofibrancy is why Boardman and Vogt could use $W(\textrm{Assoc})$ as an operad for loop spaces. But $S_n$ does not act freely on $\textrm{Com}_n,$ and $W(\textrm{Com})$ is not a cofibrant replacement for $\textrm{Com}.$ So, it is not an operad for infinite loop spaces.

Nonetheless, the larger operad $\textrm{Com} + [0,\infty],$ a compactification of $\textrm{Phyl} = \textrm{Com} + [0,\infty),$ is somewhat interesting. The reason is that any Markov process

$\alpha \colon [0,\infty) \to \mathrm{End}(\mathbb{R}^X)$

approaches a limit as $t \to \infty.$ Indeed, $\alpha$ extends uniquely to a homomorphism from the topological monoid $[0,\infty]$ to $\mathrm{End}(\mathbb{R}^X).$ Thus, given a Markov process on a finite set $X,$ the vector space $\mathbb{R}^X$ naturally becomes a coalgebra of $\textrm{Com} + [0,\infty].$ We prove this in Theorem 38.

## Relative Entropy in Biological Systems

27 November, 2015

Here’s a paper for the proceedings of a workshop on Information and Entropy in Biological Systems this spring:

• John Baez and Blake Pollard, Relative entropy in biological systems, Entropy 18 (2016), 46.

We’d love any comments or questions you might have. I’m not happy with the title. In the paper we advocate using the term ‘relative information’ instead of ‘relative entropy’—yet the latter is much more widely used, so I feel we need it in the title to let people know what the paper is about!

Here’s the basic idea.

Life relies on nonequilibrium thermodynamics, since in thermal equilibrium there are no flows of free energy. Biological systems are also open systems, in the sense that both matter and energy flow in and out of them. Nonetheless, it is important in biology that systems can sometimes be treated as approximately closed, and sometimes approach equilibrium before being disrupted in one way or another. This can occur on a wide range of scales, from large ecosystems to within a single cell or organelle. Examples include:

• A population approaching an evolutionarily stable state.

• Random processes such as mutation, genetic drift, the diffusion of organisms in an environment or the diffusion of molecules in a liquid.

• A chemical reaction approaching equilibrium.

An interesting common feature of these processes is that as they occur, quantities mathematically akin to entropy tend to increase. Closely related quantities such as free energy tend to decrease. In this review, we explain some mathematical results that make this idea precise.

Most of these results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We’ll use the first term. Given two probability distributions $p$ and $q$ on a finite set $X$, their relative information, or more precisely the information of $p$ relative to $q$, is

$\displaystyle{ I(p\|q) = \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right) }$

We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that $I(p\|q)$ decreases with time under various conditions. The reason is that the Shannon entropy

$\displaystyle{ S(p) = -\sum_{i \in X} p_i \ln p_i }$

contains a minus sign that is missing from the definition of relative information.

Intuitively, $I(p\|q)$ is the amount of information gained when we start with a hypothesis given by some probability distribution $q$ and then learn the ‘true’ probability distribution $p$. For example, if we start with the hypothesis that a coin is fair and then are told that it landed heads up, the relative information is $\ln 2$, so we have gained 1 bit of information. If however we started with the hypothesis that the coin always lands heads up, we would have gained no information.

We put the word ‘true’ in quotes here, because the notion of a ‘true’ probability distribution, which subjective Bayesians reject, is not required to use relative information. A more cautious description of relative information is that it is a divergence: a way of measuring the difference between probability distributions that obeys

$I(p \| q) \ge 0$

and

$I(p \| q) = 0 \iff p = q$

but not necessarily the other axioms for a distance function, symmetry and the triangle inequality, which indeed fail for relative information.
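The coin example and the failure of symmetry are both easy to check by hand or by machine. Here is a small sketch. It uses the usual conventions that $0 \ln 0 = 0$ and that $I(p\|q) = \infty$ when $p$ gives positive probability to something $q$ rules out. These conventions are assumptions spelled out here rather than taken from the text above.

```python
import math

def relative_information(p, q):
    """I(p||q) in nats, with 0 ln 0 = 0 and the value +inf if some p_i > 0 has q_i = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

heads = [1.0, 0.0]  # we are told the coin landed heads up
fair = [0.5, 0.5]   # hypothesis: the coin is fair

gained_nats = relative_information(heads, fair)
gained_bits = gained_nats / math.log(2)
print(gained_nats, gained_bits)
print(relative_information(heads, heads))  # hypothesis "always heads": nothing learned
print(relative_information(fair, heads))   # swapping the arguments gives a different answer
```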

There are many other divergences besides relative information, some of which we discuss in Section 6. However, relative information can be singled out by a number of characterizations, including one based on ideas from Bayesian inference. The relative information is also close to the expected number of extra bits required to code messages distributed according to the probability measure $p$ using a code optimized for messages distributed according to $q$.

In this review, we describe various ways in which a population or probability distribution evolves continuously according to some differential equation. For all these differential equations, we describe conditions under which relative information decreases. Briefly, the results are as follows. We hasten to reassure the reader that our paper explains all the jargon involved, and the proofs of the claims are given in full:

• In Section 2, we consider a very general form of the Lotka–Volterra equations, which are a commonly used model of population dynamics. Starting from the population $P_i$ of each type of replicating entity, we can define a probability distribution

$p_i = \displaystyle{\frac{P_i}{\sum_{j \in X} P_j }}$

which evolves according to a nonlinear equation called the replicator equation. We describe a necessary and sufficient condition under which $I(q\|p(t))$ is nonincreasing when $p(t)$ evolves according to the replicator equation while $q$ is held fixed.

• In Section 3, we consider a special case of the replicator equation that is widely studied in evolutionary game theory. In this case we can think of probability distributions as mixed strategies in a two-player game. When $q$ is a dominant strategy, $I(q\|p(t))$ can never increase when $p(t)$ evolves according to the replicator equation. We can think of $I(q\|p(t))$ as the information that the population has left to learn. Thus, evolution is analogous to a learning process—an analogy that in the field of artificial intelligence is exploited by evolutionary algorithms!

• In Section 4 we consider continuous-time, finite-state Markov processes. Here we have probability distributions on a finite set $X$ evolving according to a linear equation called the master equation. In this case $I(p(t)\|q(t))$ can never increase. Thus, if $q$ is a steady state solution of the master equation, both $I(p(t)\|q)$ and $I(q\|p(t))$ are nonincreasing. We can always write $q$ as the Boltzmann distribution for some energy function $E : X \to \mathbb{R}$, meaning that

$\displaystyle{ q_i = \frac{\exp(-E_i / k T)}{\displaystyle{\sum_{j \in X} \exp(-E_j / k T)}} }$

where $T$ is temperature and $k$ is Boltzmann’s constant. In this case, $I(p(t)\|q)$ is proportional to a difference of free energies:

$\displaystyle{ I(p(t)\|q) = \frac{F(p(t)) - F(q)}{k T} }$

Thus, the nonincreasing nature of $I(p(t)\|q)$ is a version of the Second Law of Thermodynamics.

• In Section 5, we consider chemical reactions and other processes described by reaction networks. In this context we have populations $P_i$ of entities of various kinds $i \in X$, and these populations evolve according to a nonlinear equation called the rate equation. We can generalize relative information from probability distributions to populations by setting

$\displaystyle{ I(P\|Q) = \sum_{i \in X} \left[ P_i \ln\left(\frac{P_i}{Q_i}\right) - \left(P_i - Q_i\right) \right] }$

If $Q$ is a special sort of steady state solution of the rate equation, called a complex balanced equilibrium, $I(P(t)\|Q)$ can never increase when $P(t)$ evolves according to the rate equation.

• Finally, in Section 6, we consider a class of functions called $f$-divergences which include relative information as a special case. For any convex function $f : [0,\infty) \to [0,\infty)$, the $f$-divergence of two probability distributions $p, q : X \to [0,1]$ is given by

$\displaystyle{ I_f(p\|q) = \sum_{i \in X} q_i f\left(\frac{p_i}{q_i}\right)}$

Whenever $p(t)$ and $q(t)$ are probability distributions evolving according to the master equation of some Markov process, $I_f(p(t)\|q(t))$ is nonincreasing. The $f$-divergence is also well-defined for populations, and nonincreasing for two populations that both evolve according to the master equation.
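The Markov process results in this list can be watched numerically. The sketch below is only an illustration under stated assumptions. The $3 \times 3$ matrix $H,$ the starting distributions and the step size are all made up, and the master equation $\dot{p} = H p$ is approximated by forward Euler steps. With this step size each step applies the stochastic matrix $1 + \Delta t \, H,$ so both divergences should already be nonincreasing step by step. For the $f$-divergence I use $f(x) = (x-1)^2.$

```python
import math

def relative_information(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def f_divergence(p, q, f):
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def euler_step(H, p, dt):
    """One forward Euler step of dp/dt = H p."""
    n = len(p)
    return [p[i] + dt * sum(H[i][j] * p[j] for j in range(n)) for i in range(n)]

# Infinitesimal stochastic: off-diagonal entries >= 0, every column sums to zero.
H = [[-3.0, 1.0, 0.5],
     [2.0, -1.0, 1.5],
     [1.0, 0.0, -2.0]]
dt = 0.01
p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]

history = []
for _ in range(500):
    history.append((relative_information(p, q),
                    f_divergence(p, q, lambda x: (x - 1.0) ** 2)))
    p, q = euler_step(H, p, dt), euler_step(H, q, dt)

print(history[0], history[-1])
```

Since $p(t)$ and $q(t)$ approach the same steady state here, both quantities end up close to zero.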

## A Compositional Framework for Markov Processes

4 September, 2015

This summer my students Brendan Fong and Blake Pollard visited me at the Centre for Quantum Technologies, and we figured out how to understand open continuous-time Markov chains! I think this is a nice step towards understanding the math of living systems.

Admittedly, it’s just a small first step. But I’m excited by this step, since Blake and I have been trying to get this stuff to work for a couple years, and it finally fell into place. And we think we know what to do next.

Here’s our paper:

• John C. Baez, Brendan Fong and Blake S. Pollard, A compositional framework for open Markov processes.

And here’s the basic idea…

### Open detailed balanced Markov processes

A continuous-time Markov chain is a way to specify the dynamics of a population which is spread across some finite set of states. Population can flow between the states. The larger the population of a state, the more rapidly population flows out of the state. Because of this property, under certain conditions the populations of the states tend toward an equilibrium where at any state the inflow of population is balanced by its outflow.

In applications to statistical mechanics, we are often interested in equilibria such that for any two states connected by an edge, say $i$ and $j,$ the flow from $i$ to $j$ equals the flow from $j$ to $i.$ A continuous-time Markov chain with a chosen equilibrium having this property is called ‘detailed balanced’.

I’m getting tired of saying ‘continuous-time Markov chain’, so from now on I’ll just say ‘Markov process’, because it’s shorter. Okay? That will let me say the next sentence without running out of breath:

Our paper is about open detailed balanced Markov processes.

Here’s an example:

The detailed balanced Markov process itself consists of a finite set of states together with a finite set of edges between them, with each state $i$ labelled by an equilibrium population $q_i >0,$ and each edge $e$ labelled by a rate constant $r_e > 0.$

These populations and rate constants are required to obey an equation called the ‘detailed balance condition’. This equation means that in equilibrium, the flow from $i$ to $j$ equals the flow from $j$ to $i.$ Do you see how it works in this example?
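In case it helps, here is a small sketch of how one might test the detailed balance condition on a computer. It assumes the condition reads: for every pair of states, the sum of $r_e q_i$ over edges $e$ from $i$ to $j$ equals the sum of $r_e q_j$ over edges from $j$ to $i.$ The edge list format and the toy numbers are made up for illustration.

```python
import math
from collections import defaultdict

def equilibrium_flows(edges, q):
    """Total flow for each ordered pair of states when the populations are q."""
    flows = defaultdict(float)
    for source, target, rate in edges:
        flows[(source, target)] += rate * q[source]
    return flows

def is_detailed_balanced(edges, q):
    """Check that the flow from i to j equals the flow from j to i for every pair."""
    flows = equilibrium_flows(edges, q)
    return all(math.isclose(flow, flows.get((j, i), 0.0), rel_tol=1e-9, abs_tol=1e-12)
               for (i, j), flow in flows.items())

q = {"a": 2.0, "b": 1.0}
balanced = [("a", "b", 1.0), ("b", "a", 2.0)]    # flows: 1 * 2 = 2 and 2 * 1 = 2
unbalanced = [("a", "b", 1.0), ("b", "a", 3.0)]  # flows: 2 and 3
print(is_detailed_balanced(balanced, q), is_detailed_balanced(unbalanced, q))
```

Note that with positive rates and populations, an edge with no edge going back the other way can never satisfy the condition.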

To get an ‘open’ detailed balanced Markov process, some states are designated as inputs or outputs. In general each state may be specified as both an input and an output, or as inputs and outputs multiple times. See how that’s happening in this example? It may seem weird, but it makes things work better.

People usually say Markov processes are all about how probabilities flow from one state to another. But we work with un-normalized probabilities, which we call ‘populations’, rather than probabilities that must sum to 1. The reason is that in an open Markov process, probability is not conserved: it can flow in or out at the inputs and outputs. We allow it to flow both in and out at both the input states and the output states.

Our most fundamental result is that there’s a category $\mathrm{DetBalMark}$ where a morphism is an open detailed balanced Markov process. We think of it as a morphism from its inputs to its outputs.

We compose morphisms in $\mathrm{DetBalMark}$ by identifying the output states of one open detailed balanced Markov process with the input states of another. The populations of identified states must match. For example, we may compose this morphism $N$:

with the previously shown morphism $M$ to get this morphism $M \circ N$:

And here’s our second most fundamental result: the category $\mathrm{DetBalMark}$ is actually a dagger compact category. This lets us do other stuff with open Markov processes. An important one is ‘tensoring’, which lets us take two open Markov processes like $M$ and $N$ above and set them side by side, giving $M \otimes N$:

The so-called compactness is also important. This means we can take some inputs of an open Markov process and turn them into outputs, or vice versa. For example, using the compactness of $\mathrm{DetBalMark}$ we can get this open Markov process from $M$:

In fact all the categories in our paper are dagger compact categories, and all our functors preserve this structure. Dagger compact categories are a well-known framework for describing systems with inputs and outputs, so this is good.

### The analogy to electrical circuits

In a detailed balanced Markov process, population can flow along edges. In the detailed balanced equilibrium, without any flow of population from outside, the flow from state $i$ to state $j$ will be matched by the flow back from $j$ to $i.$ The populations need to take specific values for this to occur.

In an electrical circuit made of linear resistors, charge can flow along wires. In equilibrium, without any driving voltage from outside, the current along each wire will be zero. The potentials will be equal at every node.

This sets up an analogy between detailed balanced continuous-time Markov chains and electrical circuits made of linear resistors! I love analogy charts, so this makes me very happy:

| Circuits | Detailed balanced Markov processes |
|---|---|
| potential | population |
| current | flow |
| conductance | rate constant |
| power | dissipation |

This analogy is already well known. Schnakenberg used it in his book Thermodynamic Network Analysis of Biological Systems. So, our main goal is to formalize and exploit it. This analogy extends from systems in equilibrium to the more interesting case of nonequilibrium steady states, which are the main topic of our paper.

Earlier, Brendan and I introduced a way to ‘black box’ a circuit and define the relation it determines between potential-current pairs at the input and output terminals. This relation describes the circuit’s external behavior as seen by an observer who can only perform measurements at the terminals.

An important fact is that black boxing is ‘compositional’: if one builds a circuit from smaller pieces, the external behavior of the whole circuit can be determined from the external behaviors of the pieces. For category theorists, this means that black boxing is a functor!

Our new paper with Blake develops a similar ‘black box functor’ for detailed balanced Markov processes, and relates it to the earlier one for circuits.

When you black box a detailed balanced Markov process, you get the relation between population–flow pairs at the terminals. (By the ‘flow at a terminal’, we more precisely mean the net population outflow.) This relation holds not only in equilibrium, but also in any nonequilibrium steady state. Thus, black boxing an open detailed balanced Markov process gives its steady state dynamics as seen by an observer who can only measure populations and flows at the terminals.

### The principle of minimum dissipation

At least since the work of Prigogine, it’s been widely accepted that a large class of systems minimize entropy production in a nonequilibrium steady state. But people still fight about the precise boundary of this class of systems, and even the meaning of this ‘principle of minimum entropy production’.

For detailed balanced open Markov processes, we show that a quantity we call the ‘dissipation’ is minimized in any steady state. This is a quadratic function of the populations and flows, analogous to the power dissipation of a circuit made of resistors. We make no claim that this quadratic function actually deserves to be called ‘entropy production’. Indeed, Schnakenberg has convincingly argued that they are only approximately equal.

But still, the ‘dissipation’ function is very natural and useful—and Prigogine’s so-called ‘entropy production’ is also a quadratic function.

### Black boxing

I’ve already mentioned the category $\mathrm{DetBalMark},$ where a morphism is an open detailed balanced Markov process. But our paper needs two more categories to tell its story! There’s the category of circuits, and the category of linear relations.

A morphism in the category $\mathrm{Circ}$ is an open electrical circuit made of resistors: that is, a graph with each edge labelled by a ‘conductance’ $c_e > 0,$ together with specified input and output nodes:

A morphism in the category $\mathrm{LinRel}$ is a linear relation $L : U \leadsto V$ between finite-dimensional real vector spaces $U$ and $V.$ This is nothing but a linear subspace $L \subseteq U \oplus V.$ Just as relations generalize functions, linear relations generalize linear functions!

In our previous paper, Brendan and I introduced these two categories and a functor between them, the ‘black box functor’:

$\blacksquare : \mathrm{Circ} \to \mathrm{LinRel}$

The idea is that any circuit determines a linear relation between the potentials and net current flows at the inputs and outputs. This relation describes the behavior of a circuit of resistors as seen from outside.
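To make this concrete, here is a minimal Python sketch of black boxing a small resistor circuit. It relies on the standard fact that the terminal potentials and net terminal currents of a resistor network are related by the Schur complement of the graph Laplacian (Kron reduction). The function names, the node numbering and the sign convention (current counted as flowing into the circuit at each terminal) are illustrative choices, not notation from the paper.

```python
def laplacian(n, conductances):
    """Weighted graph Laplacian of a circuit with nodes 0..n-1.

    `conductances` maps an edge (i, j) to its conductance c_e > 0.
    """
    L = [[0.0] * n for _ in range(n)]
    for (i, j), c in conductances.items():
        L[i][i] += c
        L[j][j] += c
        L[i][j] -= c
        L[j][i] -= c
    return L


def black_box(L, terminals):
    """Eliminate non-terminal nodes one at a time (Kron reduction).

    Returns the matrix sending terminal potentials to net terminal currents,
    assuming no current enters or leaves at the eliminated nodes.
    """
    L = [row[:] for row in L]
    active = list(range(len(L)))
    interior = [k for k in active if k not in terminals]
    for k in interior:
        active.remove(k)
        for i in active:
            for j in active:
                L[i][j] -= L[i][k] * L[k][j] / L[k][k]
    return [[L[i][j] for j in terminals] for i in terminals]


# Two resistors in series: input node 0, interior node 1, output node 2.
circuit = laplacian(3, {(0, 1): 2.0, (1, 2): 2.0})
reduced = black_box(circuit, terminals=[0, 2])

# Net currents at the terminals when the potentials there are 3 and 1.
potentials = [3.0, 1.0]
currents = [sum(row[j] * potentials[j] for j in range(2)) for row in reduced]
print(reduced)
print(currents)
```

In this example the relation is the graph of a linear map from terminal potentials to terminal currents, and the two series conductances of 2 combine into an effective conductance of 1. In general the black box is only a linear relation, which is why the target category is $\mathrm{LinRel}.$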

Our new paper introduces a black box functor for detailed balanced Markov processes:

$\square : \mathrm{DetBalMark} \to \mathrm{LinRel}$

We draw this functor as a white box merely to distinguish it from the other black box functor. The functor $\square$ maps any detailed balanced Markov process to the linear relation obeyed by populations and flows at the inputs and outputs in a steady state. In short, it describes the steady state behavior of the Markov process ‘as seen from outside’.

How do we manage to black box detailed balanced Markov processes? We do it using the analogy with circuits!

### The analogy becomes a functor

Every analogy wants to be a functor. So, we make the analogy between detailed balanced Markov processes and circuits precise by turning it into a functor:

$K : \mathrm{DetBalMark} \to \mathrm{Circ}$

This functor converts any open detailed balanced Markov process into an open electrical circuit made of resistors. This circuit is carefully chosen to reflect the steady-state behavior of the Markov process. Its underlying graph is the same as that of the Markov process. So, the ‘states’ of the Markov process are the same as the ‘nodes’ of the circuit.

Both the equilibrium populations at states of the Markov process and the rate constants labelling edges of the Markov process are used to compute the conductances of edges of this circuit. In the simple case where the Markov process has exactly one edge from any state $i$ to any state $j,$ the rule is this:

$C_{i j} = H_{i j} q_j$

where:

$q_j$ is the equilibrium population of the $j$th state of the Markov process,

$H_{i j}$ is the rate constant for the edge from the $j$th state to the $i$th state of the Markov process, and

$C_{i j}$ is the conductance (that is, the reciprocal of the resistance) of the wire from the $j$th node to the $i$th node of the resulting circuit.

The detailed balance condition for Markov processes says precisely that the matrix $C_{i j}$ is symmetric! This is just right for an electrical circuit made of resistors, since it means that the resistance of the wire from node $i$ to node $j$ equals the resistance of the same wire in the reverse direction, from node $j$ to node $i.$
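Here is a short Python sketch of this rule, just to check the symmetry claim on a toy example. The matrix layout (`H[i][j]` is the rate constant from state `j` to state `i`, with `0.0` where there is no edge), the function names and the numbers are all assumptions made for illustration.

```python
def conductance_matrix(H, q):
    """Apply the rule C[i][j] = H[i][j] * q[j].

    H[i][j] is the rate constant for the edge from state j to state i
    (0.0 where there is no edge); q[j] is the equilibrium population of j.
    """
    n = len(q)
    return [[H[i][j] * q[j] for j in range(n)] for i in range(n)]


def is_symmetric(C, tol=1e-12):
    """Check C[i][j] == C[j][i] for all i, j, up to a small tolerance."""
    n = len(C)
    return all(abs(C[i][j] - C[j][i]) <= tol
               for i in range(n) for j in range(n))


q = [1.0, 2.0, 4.0]
# Rates chosen so that H[i][j] * q[j] == H[j][i] * q[i] (detailed balance).
H = [
    [0.0, 1.0, 0.0],
    [2.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
C = conductance_matrix(H, q)
print(C)
print(is_symmetric(C))

# Changing one rate breaks detailed balance, and the symmetry with it.
H_bad = [row[:] for row in H]
H_bad[0][1] = 3.0
print(is_symmetric(conductance_matrix(H_bad, q)))
```

The first matrix comes out symmetric, with conductance 2 on both edges of the chain. The modified rates give $C_{0 1} = 6$ but $C_{1 0} = 2,$ so they do not describe a circuit made of resistors.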

### A triangle of functors

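If you paid careful attention, you’ll have noticed that I’ve described a triangle of functors: $K : \mathrm{DetBalMark} \to \mathrm{Circ},$ $\blacksquare : \mathrm{Circ} \to \mathrm{LinRel}$ and $\square : \mathrm{DetBalMark} \to \mathrm{LinRel}.$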

And if you know anything about how category theorists think, you’ll be wondering if this diagram commutes.

In fact, this triangle of functors does not commute! However, a general lesson of category theory is that we should only expect diagrams of functors to commute up to natural isomorphism, and this is what happens here: there is a natural isomorphism $\alpha$ from $\blacksquare \circ K$ to $\square.$

The natural transformation $\alpha$ ‘corrects’ the black box functor for resistors to give the one for detailed balanced Markov processes.

The functors $\square$ and $\blacksquare \circ K$ are actually equal on objects. An object in $\mathrm{DetBalMark}$ is a finite set $X$ with each element $i \in X$ labelled by a positive population $q_i.$ Both functors map this object to the vector space $\mathbb{R}^X \oplus \mathbb{R}^X.$ For the functor $\square,$ we think of this as a space of population-flow pairs. For the functor $\blacksquare \circ K,$ we think of it as a space of potential-current pairs. The natural transformation $\alpha$ then gives a linear relation

$\alpha_{X,q} : \mathbb{R}^X \oplus \mathbb{R}^X \leadsto \mathbb{R}^X \oplus \mathbb{R}^X$

in fact an isomorphism of vector spaces, which converts potential-current pairs into population-flow pairs in a manner that depends on the $q_i.$ I’ll skip the formula; it’s in the paper.

But here’s the key point. The naturality of $\alpha$ actually allows us to reduce the problem of computing the functor $\square$ to the problem of computing $\blacksquare.$ Suppose

$M: (X,q) \to (Y,r)$

is any morphism in $\mathrm{DetBalMark}.$ The object $(X,q)$ is some finite set $X$ labelled by populations $q,$ and $(Y,r)$ is some finite set $Y$ labelled by populations $r.$ Then the naturality of $\alpha$ means that this square commutes:
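Spelled out as an equation between linear relations, and reading $\alpha$ as going from $\blacksquare \circ K$ to $\square$ (the direction in which it was described above, from potential-current pairs to population-flow pairs), commutativity of the square should amount to:

$\square(M) \circ \alpha_{X,q} = \alpha_{Y,r} \circ \blacksquare K(M)$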

Since $\alpha_{X,q}$ and $\alpha_{Y,r}$ are isomorphisms, we can solve for the functor $\square$ as follows:

$\square(M) = \alpha_{Y,r} \circ \blacksquare K(M) \circ \alpha_{X,q}^{-1}$

This equation has a clear intuitive meaning! It says that to compute the behavior of a detailed balanced Markov process, namely $\square(M),$ we convert it into a circuit made of resistors and compute the behavior of that, namely $\blacksquare K(M).$ This is not equal to the behavior of the Markov process, but we can compute that behavior by converting the input populations and flows into potentials and currents, feeding them into our circuit, and then converting the outputs back into populations and flows.
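Here is a rough Python sketch of this recipe on a three-state chain. The actual formula for $\alpha$ is in the paper and is not reproduced here. The code simply assumes that potentials are populations divided by equilibrium populations, $\phi_i = p_i / q_i,$ and that currents are identified with flows, which is what substituting $C_{i j} = H_{i j} q_j$ into the master equation suggests. The sign conventions, the names and the numbers are all hypothetical.

```python
# Hypothetical chain: state 0 is the input, 1 is interior, 2 is the output.
q = [1.0, 2.0, 4.0]                      # equilibrium populations
H = {(1, 0): 2.0, (0, 1): 1.0,           # H[(i, j)]: rate constant from j to i
     (2, 1): 1.0, (1, 2): 0.5}

# Step 1: apply K, using C_ij = H_ij * q_j on each edge.
c01 = H[(0, 1)] * q[1]
c12 = H[(1, 2)] * q[2]

# Step 2: black box the circuit. Two resistors in series combine into one.
c_eff = c01 * c12 / (c01 + c12)


def markov_flows_via_circuit(p0, p2):
    """Terminal flows of the Markov process, computed through the circuit."""
    phi0, phi2 = p0 / q[0], p2 / q[2]    # assumed conversion: potential = p / q
    current = c_eff * (phi0 - phi2)      # Ohm's law for the black-boxed circuit
    return current, -current             # flow into state 0, flow into state 2


def markov_flows_directly(p0, p2):
    """Same flows, from the steady-state condition at the interior state."""
    # Zero net flow at state 1 fixes its population p1.
    p1 = (H[(1, 0)] * p0 + H[(1, 2)] * p2) / (H[(0, 1)] + H[(2, 1)])
    into_0 = H[(1, 0)] * p0 - H[(0, 1)] * p1   # what must be fed in at state 0
    into_2 = H[(1, 2)] * p2 - H[(2, 1)] * p1   # what must be fed in at state 2
    return into_0, into_2


print(markov_flows_via_circuit(3.0, 4.0))
print(markov_flows_directly(3.0, 4.0))
```

Both routes give the same terminal flows for these numbers, which is the content of the equation above in this tiny case: convert populations to potentials, use the circuit, and read the currents back as flows.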

### What we really do

So that’s a sketch of what we do, and I hope you ask questions if it’s not clear. But I also hope you read our paper! Here’s what we actually do in there. After an introduction and summary of results:

• Section 3 defines open Markov processes and the open master equation.

• Section 4 introduces detailed balance for open Markov processes.

• Section 5 recalls the principle of minimum power for open circuits made of linear resistors, and explains how to black box them.

• Section 6 introduces the principle of minimum dissipation for open detailed balanced Markov processes, and describes how to black box these.

• Section 7 states the analogy between circuits and detailed balanced Markov processes in a formal way.

• Section 8 describes how to compose open Markov processes, making them into the morphisms of a category.

• Section 9 does the same for detailed balanced Markov processes.

• Section 10 describes the ‘black box functor’ that sends any open detailed balanced Markov process to the linear relation describing its external behavior, and recalls the similar functor for circuits.

• Section 11 makes the analogy between open detailed balanced Markov processes and open circuits even more formal, by making it into a functor. We prove that together with the two black box functors, this forms a triangle that commutes up to natural isomorphism.

• Section 12 is about geometric aspects of this theory. We show that the linear relations in the image of these black box functors are Lagrangian relations between symplectic vector spaces. We also show that the master equation can be seen as a gradient flow equation.

• Section 13 is a summary of what we have learned.

Finally, Appendix A is a quick tutorial on decorated cospans. This is a key mathematical tool in our work, developed by Brendan in an earlier paper.