## Beyond Classical Bayesian Networks

7 July, 2018

In the final installment of the Applied Category Theory seminar, two students explain a category-theoretic approach to Bayesian networks and their generalizations. Check it out:

• Pablo Andres-Martinez and Sophie Raynor, Beyond Classical Bayesian networks, The n-Category Café, 7 July 2018.

Pablo Andres-Martinez is a postdoc at the University of Edinburgh, working in the cool-sounding Centre for Doctoral Training in Pervasive Parallelism. Sophie Raynor works at Hoppinger. Their blog article discusses this paper:

• Joe Henson, Raymond Lal and Matthew F. Pusey, Theory-independent limits on correlations from generalized Bayesian networks, New Journal of Physics 16 (2014), 113043.

## Compositionality

6 May, 2018

A new journal! We’ve been working on it for a long time, but we finished sorting out some details at ACT2018, and now we’re ready to tell the world!

It’s free to read, free to publish in, and it’s about building big things from smaller parts. Here’s the top of the journal’s home page right now:

Here’s the official announcement:

We are pleased to announce the launch of Compositionality, a new diamond open-access journal for research using compositional ideas, most notably of a category-theoretic origin, in any discipline. Topics may concern foundational structures, an organizing principle, or a powerful tool. Example areas include but are not limited to: computation, logic, physics, chemistry, engineering, linguistics, and cognition. To learn more about the scope and editorial policies of the journal, please visit our website at http://www.compositionality-journal.org.

Compositionality is the culmination of a long-running discussion by many members of the extended category theory community, and the editorial policies, look, and mission of the journal have yet to be finalized. We would love to get your feedback about our ideas on the forum we have established for this purpose:

http://reddit.com/r/compositionality

Lastly, the journal is currently receiving applications to serve on the editorial board; submissions are due May 31 and will be evaluated by the members of our steering board: John Baez, Bob Coecke, Kathryn Hess, Steve Lack, and Valeria de Paiva.

https://tinyurl.com/call-for-editors

We will announce a call for submissions in mid-June.

We’re looking forward to your ideas and submissions!

Best regards,

Brendan Fong, Nina Otter, and Joshua Tan

## Coarse-Graining Open Markov Processes

4 March, 2018

Kenny Courser and I have been working hard on this paper for months:

• John Baez and Kenny Courser, Coarse-graining open Markov processes.

It may be almost done. So, it would be great if people here could take a look and comment on it! It’s a cool mix of probability theory and double categories. I’ve posted a similar but non-isomorphic article on the n-Category Café, where people know a lot about double categories. But maybe some of you here know more about Markov processes!

‘Coarse-graining’ is a standard method of extracting a simple Markov process from a more complicated one by identifying states. We extend coarse-graining to open Markov processes. An ‘open’ Markov process is one where probability can flow in or out of certain states called ‘inputs’ and ‘outputs’. One can build up an ordinary Markov process from smaller open pieces in two basic ways:

• composition, where we identify the outputs of one open Markov process with the inputs of another,

and

• tensoring, where we set two open Markov processes side by side.

A while back, Brendan Fong, Blake Pollard and I showed that these constructions make open Markov processes into the morphisms of a symmetric monoidal category:

A compositional framework for Markov processes, Azimuth, January 12, 2016.

Here Kenny and I go further by constructing a symmetric monoidal double category where the 2-morphisms include ways of coarse-graining open Markov processes. We also extend the previously defined ‘black-boxing’ functor from the category of open Markov processes to this double category.

But before you dive into the paper, let me explain all this stuff a bit more….

Very roughly speaking, a ‘Markov process’ is a stochastic model describing a sequence of transitions between states in which the probability of a transition depends only on the current state. But the only Markov processes talk about are continuous-time Markov processes with a finite set of states. These can be drawn as labeled graphs:

where the number labeling each edge describes the probability per time of making a transition from one state to another.

An ‘open’ Markov process is a generalization in which probability can also flow in or out of certain states designated as ‘inputs’ and outputs’:

Open Markov processes can be seen as morphisms in a category, since we can compose two open Markov processes by identifying the outputs of the first with the inputs of the second. Composition lets us build a Markov process from smaller open parts—or conversely, analyze the behavior of a Markov process in terms of its parts.

In this paper, Kenny extend the study of open Markov processes to include coarse-graining. ‘Coarse-graining’ is a widely studied method of simplifying a Markov process by mapping its set of states $X$ onto some smaller set $X'$ in a manner that respects the dynamics. Here we introduce coarse-graining for open Markov processes. And we show how to extend this notion to the case of maps $p: X \to X'$ that are not surjective, obtaining a general concept of morphism between open Markov processes.

Since open Markov processes are already morphisms in a category, it is natural to treat morphisms between them as morphisms between morphisms, or ‘2-morphisms’. We can do this using double categories!

Double categories were first introduced by Ehresmann around 1963. Since then, they’ve used in topology and other branches of pure math—but more recently they’ve been used to study open dynamical systems and open discrete-time Markov chains. So, it should not be surprising that they are also useful for open Markov processes..

A 2-morphism in a double category looks like this:

While a mere category has only objects and morphisms, here we have a few more types of things. We call $A, B, C$ and $D$ ‘objects’, $f$ and $g$ ‘vertical 1-morphisms’, $M$ and $N$ ‘horizontal 1-cells’, and $\alpha$ a ‘2-morphism’. We can compose vertical 1-morphisms to get new vertical 1-morphisms and compose horizontal 1-cells to get new horizontal 1-cells. We can compose the 2-morphisms in two ways: horizontally by setting squares side by side, and vertically by setting one on top of the other. The ‘interchange law’ relates vertical and horizontal composition of 2-morphisms.

In a ‘strict’ double category all these forms of composition are associative. In a ‘pseudo’ double category, horizontal 1-cells compose in a weakly associative manner: that is, the associative law holds only up to an invertible 2-morphism, the ‘associator’, which obeys a coherence law. All this is just a sketch; for details on strict and pseudo double categories try the paper by Grandis and Paré.

Kenny and I construct a double category $\mathbb{M}\mathbf{ark}$ with:

1. finite sets as objects,
2. maps between finite sets as vertical 1-morphisms,
3. open Markov processes as horizontal 1-cells,
4. morphisms between open Markov processes as 2-morphisms.

I won’t give the definition of item 4 here; you gotta read our paper for that! Composition of open Markov processes is only weakly associative, so $\mathbb{M}\mathbf{ark}$ is a pseudo double category.

This is how our paper goes. In Section 2 we define open Markov processes and steady state solutions of the open master equation. In Section 3 we introduce coarse-graining first for Markov processes and then open Markov processes. In Section 4 we construct the double category $\mathbb{M}\mathbf{ark}$ described above. We prove this is a symmetric monoidal double category in the sense defined by Mike Shulman. This captures the fact that we can not only compose open Markov processes but also ‘tensor’ them by setting them side by side.

For example, if we compose this open Markov process:

with the one I showed you before:

we get this open Markov process:

But if we tensor them, we get this:

As compared with an ordinary Markov process, the key new feature of an open Markov process is that probability can flow in or out. To describe this we need a generalization of the usual master equation for Markov processes, called the ‘open master equation’.

This is something that Brendan, Blake and I came up with earlier. In this equation, the probabilities at input and output states are arbitrary specified functions of time, while the probabilities at other states obey the usual master equation. As a result, the probabilities are not necessarily normalized. We interpret this by saying probability can flow either in or out at both the input and the output states.

If we fix constant probabilities at the inputs and outputs, there typically exist solutions of the open master equation with these boundary conditions that are constant as a function of time. These are called ‘steady states’. Often these are nonequilibrium steady states, meaning that there is a nonzero net flow of probabilities at the inputs and outputs. For example, probability can flow through an open Markov process at a constant rate in a nonequilibrium steady state. It’s like a bathtub where water is flowing in from the faucet, and flowing out of the drain, but the level of the water isn’t changing.

Brendan, Blake and I studied the relation between probabilities and flows at the inputs and outputs that holds in steady state. We called the process of extracting this relation from an open Markov process ‘black-boxing’, since it gives a way to forget the internal workings of an open system and remember only its externally observable behavior. We showed that black-boxing is compatible with composition and tensoring. In other words, we showed that black-boxing is a symmetric monoidal functor.

In Section 5 of our new paper, Kenny and I show that black-boxing is compatible with morphisms between open Markov processes. To make this idea precise, we prove that black-boxing gives a map from the double category $\mathbb{M}\mathbf{ark}$ to another double category, called $\mathbb{L}\mathbf{inRel}$, which has:

1. finite-dimensional real vector spaces $U,V,W,\dots$ as objects,
2. linear maps $f : V \to W$ as vertical 1-morphisms from $V$ to $W$,
3. linear relations $R \subseteq V \oplus W$ as horizontal 1-cells from $V$ to $W$,
4. squares

obeying $(f \oplus g)R \subseteq S$ as 2-morphisms.

Here a ‘linear relation’ from a vector space $V$ to a vector space $W$ is a linear subspace $R \subseteq V \oplus W$. Linear relations can be composed in the usual way we compose relations. The double category $\mathbb{L}\mathbf{inRel}$ becomes symmetric monoidal using direct sum as the tensor product, but unlike $\mathbb{M}\mathbf{ark}$ it is a strict double category: that is, composition of linear relations is associative.

Our main result, Theorem 5.5, says that black-boxing gives a symmetric monoidal double functor

$\blacksquare : \mathbb{M}\mathbf{ark} \to \mathbb{L}\mathbf{inRel}$

As you’ll see if you check out our paper, there’s a lot of nontrivial content hidden in this short statement! The proof requires a lot of linear algebra and also a reasonable amount of category theory. For example, we needed this fact: if you’ve got a commutative cube in the category of finite sets:

and the top and bottom faces are pushouts, and the two left-most faces are pullbacks, and the two left-most arrows on the bottom face are monic, then the two right-most faces are pullbacks. I think it’s cool that this is relevant to Markov processes!

Finally, in Section 6 we state a conjecture. First we use a technique invented by Mike Shulman to construct symmetric monoidal bicategories $\mathbf{Mark}$ and $\mathbf{LinRel}$ from the symmetric monoidal double categories $\mathbb{M}\mathbf{ark}$ and $\mathbb{L}\mathbf{inRel}$. We conjecture that our black-boxing double functor determines a functor between these symmetric monoidal bicategories. This has got to be true. However, double categories seem to be a simpler framework for coarse-graining open Markov processes.

Finally, let me talk a bit about some related work. As I already mentioned, Brendan, Blake and I constructed a symmetric monoidal category where the morphisms are open Markov processes. However, we formalized such Markov processes in a slightly different way than Kenny and I do now. We defined a Markov process to be one of the pictures I’ve been showing you: a directed multigraph where each edge is assigned a positive number called its ‘rate constant’. In other words, we defined it to be a diagram

where $X$ is a finite set of vertices or ‘states’, $E$ is a finite set of edges or ‘transitions’ between states, the functions $s,t : E \to X$ give the source and target of each edge, and $r : E \to (0,\infty)$ gives the rate constant for each transition. We explained how from this data one can extract a matrix of real numbers $(H_{i j})_{i,j \in X}$ called the ‘Hamiltonian’ of the Markov process, with two properties that are familiar in this game:

$H_{i j} \geq 0$ if $i \neq j$,

$\sum_{i \in X} H_{i j} = 0$ for all $j \in X$.

A matrix with these properties is called ‘infinitesimal stochastic’, since these conditions are equivalent to $\exp(t H)$ being stochastic for all $t \ge 0$.

In our new paper, Kenny and I skip the directed multigraphs and work directly with the Hamiltonians! In other words, we define a Markov process to be a finite set $X$ together with an infinitesimal stochastic matrix $(H_{ij})_{i,j \in X}$. This allows us to work more directly with the Hamiltonian and the all-important ‘master equation’

$\displaystyle{ \frac{d p(t)}{d t} = H p(t) }$

which describes the evolution of a time-dependent probability distribution

$p(t) : X \to \mathbb{R}$

Clerc, Humphrey and Panangaden have constructed a bicategory with finite sets as objects, ‘open discrete labeled Markov processes’ as morphisms, and ‘simulations’ as 2-morphisms. The use the word ‘open’ in a pretty similar way to me. But their open discrete labeled Markov processes are also equipped with a set of ‘actions’ which represent interactions between the Markov process and the environment, such as an outside entity acting on a stochastic system. A ‘simulation’ is then a function between the state spaces that map the inputs, outputs and set of actions of one open discrete labeled Markov process to the inputs, outputs and set of actions of another.

Another compositional framework for Markov processes was discussed by de Francesco Albasini, Sabadini and Walters. They constructed an algebra of ‘Markov automata’. A Markov automaton is a family of matrices with non-negative real coefficients that is indexed by elements of a binary product of sets, where one set represents a set of ‘signals on the left interface’ of the Markov automata and the other set analogously for the right interface.

So, double categories are gradually invading the theory of Markov processes… as part of the bigger trend toward applied category theory. They’re natural things; scientists should use them.

## Biology as Information Dynamics (Part 2)

27 April, 2017

Here’s a video of the talk I gave at the Stanford Complexity Group:

You can see slides here:

Abstract. If biology is the study of self-replicating entities, and we want to understand the role of information, it makes sense to see how information theory is connected to the ‘replicator equation’ — a simple model of population dynamics for self-replicating entities. The relevant concept of information turns out to be the information of one probability distribution relative to another, also known as the Kullback–Liebler divergence. Using this we can get a new outlook on free energy, see evolution as a learning process, and give a clearer, more general formulation of Fisher’s fundamental theorem of natural selection.

I’d given a version of this talk earlier this year at a workshop on Quantifying biological complexity, but I’m glad this second try got videotaped and not the first, because I was a lot happier about my talk this time. And as you’ll see at the end, there were a lot of interesting questions.

## Information Geometry (Part 16)

1 February, 2017

This week I’m giving a talk on biology and information:

• John Baez, Biology as information dynamics, talk for Biological Complexity: Can it be Quantified?, a workshop at the Beyond Center, 2 February 2017.

While preparing this talk, I discovered a cool fact. I doubt it’s new, but I haven’t exactly seen it elsewhere. I came up with it while trying to give a precise and general statement of ‘Fisher’s fundamental theorem of natural selection’. I won’t start by explaining that theorem, since my version looks rather different than Fisher’s, and I came up with mine precisely because I had trouble understanding his. I’ll say a bit more about this at the end.

Here’s my version:

The square of the rate at which a population learns information is the variance of its fitness.

This is a nice advertisement for the virtues of diversity: more variance means faster learning. But it requires some explanation!

### The setup

Let’s start by assuming we have $n$ different kinds of self-replicating entities with populations $P_1, \dots, P_n.$ As usual, these could be all sorts of things:

• molecules of different chemicals
• organisms belonging to different species
• genes of different alleles
• restaurants belonging to different chains
• people with different beliefs
• game-players with different strategies
• etc.

I’ll call them replicators of different species.

Let’s suppose each population $P_i$ is a function of time that grows at a rate equal to this population times its ‘fitness’. I explained the resulting equation back in Part 9, but it’s pretty simple:

$\displaystyle{ \frac{d}{d t} P_i(t) = f_i(P_1(t), \dots, P_n(t)) \, P_i(t) }$

Here $f_i$ is a completely arbitrary smooth function of all the populations! We call it the fitness of the ith species.

This equation is important, so we want a short way to write it. I’ll often write $f_i(P_1(t), \dots, P_n(t))$ simply as $f_i,$ and $P_i(t)$ simply as $P_i.$ With these abbreviations, which any red-blooded physicist would take for granted, our equation becomes simply this:

$\displaystyle{ \frac{dP_i}{d t} = f_i \, P_i }$

Next, let $p_i(t)$ be the probability that a randomly chosen organism is of the ith species:

$\displaystyle{ p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)} }$

Starting from our equation describing how the populations evolve, we can figure out how these probabilities evolve. The answer is called the replicator equation:

$\displaystyle{ \frac{d}{d t} p_i(t) = ( f_i - \langle f \rangle ) \, p_i(t) }$

Here $\langle f \rangle$ is the average fitness of all the replicators, or mean fitness:

$\displaystyle{ \langle f \rangle = \sum_j f_j(P_1(t), \dots, P_n(t)) \, p_j(t) }$

In what follows I’ll abbreviate the replicator equation as follows:

$\displaystyle{ \frac{dp_i}{d t} = ( f_i - \langle f \rangle ) \, p_i }$

### The result

Okay, now let’s figure out how fast the probability distribution

$p(t) = (p_1(t), \dots, p_n(t))$

changes with time. For this we need to choose a way to measure the length of the vector

$\displaystyle{ \frac{dp}{dt} = (\frac{d}{dt} p_1(t), \dots, \frac{d}{dt} p_n(t)) }$

And here information geometry comes to the rescue! We can use the Fisher information metric, which is a Riemannian metric on the space of probability distributions.

I’ve talked about the Fisher information metric in many ways in this series. The most important fact is that as a probability distribution $p(t)$ changes with time, its speed

$\displaystyle{ \left\| \frac{dp}{dt} \right\|}$

as measured using the Fisher information metric can be seen as the rate at which information is learned. I’ll explain that later. Right now I just want a simple formula for the Fisher information metric. Suppose $v$ and $w$ are two tangent vectors to the point $p$ in the space of probability distributions. Then the Fisher information metric is given as follows:

$\displaystyle{ \langle v, w \rangle = \sum_i \frac{1}{p_i} \, v_i w_i }$

Using this we can calculate the speed at which $p(t)$ moves when it obeys the replicator equation. Actually the square of the speed is simpler:

$\begin{array}{ccl} \displaystyle{ \left\| \frac{dp}{dt} \right\|^2 } &=& \displaystyle{ \sum_i \frac{1}{p_i} \left( \frac{dp_i}{dt} \right)^2 } \\ \\ &=& \displaystyle{ \sum_i \frac{1}{p_i} \left( ( f_i - \langle f \rangle ) \, p_i \right)^2 } \\ \\ &=& \displaystyle{ \sum_i ( f_i - \langle f \rangle )^2 p_i } \end{array}$

The answer has a nice meaning, too! It’s just the variance of the fitness: that is, the square of its standard deviation.

So, if you’re willing to buy my claim that the speed $\|dp/dt\|$ is the rate at which our population learns new information, then we’ve seen that the square of the rate at which a population learns information is the variance of its fitness!

### Fisher’s fundamental theorem

Now, how is this related to Fisher’s fundamental theorem of natural selection? First of all, what is Fisher’s fundamental theorem? Here’s what Wikipedia says about it:

It uses some mathematical notation but is not a theorem in the mathematical sense.

It states:

“The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.”

Or in more modern terminology:

“The rate of increase in the mean fitness of any organism at any time ascribable to natural selection acting through changes in gene frequencies is exactly equal to its genetic variance in fitness at that time”.

Largely as a result of Fisher’s feud with the American geneticist Sewall Wright about adaptive landscapes, the theorem was widely misunderstood to mean that the average fitness of a population would always increase, even though models showed this not to be the case. In 1972, George R. Price showed that Fisher’s theorem was indeed correct (and that Fisher’s proof was also correct, given a typo or two), but did not find it to be of great significance. The sophistication that Price pointed out, and that had made understanding difficult, is that the theorem gives a formula for part of the change in gene frequency, and not for all of it. This is a part that can be said to be due to natural selection

Price’s paper is here:

• George R. Price, Fisher’s ‘fundamental theorem’ made clear, Annals of Human Genetics 36 (1972), 129–140.

I don’t find it very clear, perhaps because I didn’t spend enough time on it. But I think I get the idea.

My result is a theorem in the mathematical sense, though quite an easy one. I assume a population distribution evolves according to the replicator equation and derive an equation whose right-hand side matches that of Fisher’s original equation: the variance of the fitness.

But my left-hand side is different: it’s the square of the speed of the corresponding probability distribution, where speed is measured using the ‘Fisher information metric’. This metric was discovered by the same guy, Ronald Fisher, but I don’t think he used it in his work on the fundamental theorem!

Something a bit similar to my statement appears as Theorem 2 of this paper:

• Marc Harper, Information geometry and evolutionary game theory.

and for that theorem he cites:

• Josef Hofbauer and Karl Sigmund, Evolutionary Games and Population Dynamics, Cambridge University Press, Cambridge, 1998.

However, his Theorem 2 really concerns the rate of increase of fitness, like Fisher’s fundamental theorem. Moreover, he assumes that the probability distribution $p(t)$ flows along the gradient of a function, and I’m not assuming that. Indeed, my version applies to situations where the probability distribution moves round and round in periodic orbits!

### Relative information and the Fisher information metric

The key to generalizing Fisher’s fundamental theorem is thus to focus on the speed at which $p(t)$ moves, rather than the increase in fitness. Why do I call this speed the ‘rate at which the population learns information’? It’s because we’re measuring this speed using the Fisher information metric, which is closely connected to relative information, also known as relative entropy or the Kullback–Leibler divergence.

I explained this back in Part 7, but that explanation seems hopelessly technical to me now, so here’s a faster one, which I created while preparing my talk.

The information of a probability distribution $q$ relative to a probability distribution $p$ is

$\displaystyle{ I(q,p) = \sum_{i =1}^n q_i \log\left(\frac{q_i}{p_i}\right) }$

It says how much information you learn if you start with a hypothesis $p$ saying that the probability of the ith situation was $p_i,$ and then update this to a new hypothesis $q.$

Now suppose you have a hypothesis that’s changing with time in a smooth way, given by a time-dependent probability $p(t).$ Then a calculation shows that

$\displaystyle{ \left.\frac{d}{dt} I(p(t),p(t_0)) \right|_{t = t_0} = 0 }$

for all times $t_0$. This seems paradoxical at first. I like to jokingly put it this way:

To first order, you’re never learning anything.

However, as long as the velocity $\frac{d}{dt}p(t_0)$ is nonzero, we have

$\displaystyle{ \left.\frac{d^2}{dt^2} I(p(t),p(t_0)) \right|_{t = t_0} > 0 }$

so we can say

To second order, you’re always learning something… unless your opinions are fixed.

This lets us define a ‘rate of learning’—that is, a ‘speed’ at which the probability distribution $p(t)$ moves. And this is precisely the speed given by the Fisher information metric!

In other words:

$\displaystyle{ \left\|\frac{dp}{dt}(t_0)\right\|^2 = \left.\frac{d^2}{dt^2} I(p(t),p(t_0)) \right|_{t = t_0} }$

where the length is given by Fisher information metric. Indeed, this formula can be used to define the Fisher information metric. From this definition we can easily work out the concrete formula I gave earlier.

In summary: as a probability distribution moves around, the relative information between the new probability distribution and the original one grows approximately as the square of time, not linearly. So, to talk about a ‘rate at which information is learned’, we need to use the above formula, involving a second time derivative. This rate is just the speed at which the probability distribution moves, measured using the Fisher information metric. And when we have a probability distribution describing how many replicators are of different species, and it’s evolving according to the replicator equation, this speed is also just the variance of the fitness!

## Biology as Information Dynamics (Part 1)

31 January, 2017

This is my talk for the workshop Biological Complexity: Can It Be Quantified?

• John Baez, Biology as information dynamics, 2 February 2017.

Abstract. If biology is the study of self-replicating entities, and we want to understand the role of information, it makes sense to see how information theory is connected to the ‘replicator equation’—a simple model of population dynamics for self-replicating entities. The relevant concept of information turns out to be the information of one probability distribution relative to another, also known as the Kullback–Leibler divergence. Using this we can get a new outlook on free energy, see evolution as a learning process, and give a clean general formulation of Fisher’s fundamental theorem of natural selection.

• Marc Harper, The replicator equation as an inference dynamic.

• Marc Harper, Information geometry and evolutionary game theory.

• Barry Sinervo and Curt M. Lively, The rock-paper-scissors game and the evolution of alternative male strategies, Nature 380 (1996), 240–243.

• John Baez, Diversity, entropy and thermodynamics.

• John Baez, Information geometry.

The last reference contains proofs of the equations shown in red in my slides.
In particular, Part 16 contains a proof of my updated version of Fisher’s fundamental theorem.

## Compositional Frameworks for Open Systems

27 November, 2016

Here are the slides of Blake Pollard’s talk at the Santa Fe Institute workshop on Statistical Physics, Information Processing and Biology:

• Blake Pollard, Compositional frameworks for open systems, 17 November 2016.

He gave a really nice introduction to how we can use categories to study open systems, with his main example being ‘open Markov processes’, where probability can flow in and out of the set of states. People liked it a lot!