Here’s a paper for the proceedings of a workshop on Information and Entropy in Biological Systems this spring:

• John Baez and Blake S. Pollard, Relative entropy in biological systems, *Entropy* **18** (2016), 46.

We’d love any comments or questions you might have. I’m not happy with the title. In the paper we advocate using the term ‘relative information’ instead of ‘relative entropy’—yet the latter is much more widely used, so I feel we need it in the title to let people know what the paper is about!

Here’s the basic idea.

Life relies on nonequilibrium thermodynamics, since in thermal equilibrium there are no flows of free energy. Biological systems are also open systems, in the sense that both matter and energy flow in and out of them. Nonetheless, it is important in biology that systems can sometimes be treated as approximately closed, and sometimes approach equilibrium before being disrupted in one way or another. This can occur on a wide range of scales, from large ecosystems to within a single cell or organelle. Examples include:

• A population approaching an evolutionarily stable state.

• Random processes such as mutation, genetic drift, the diffusion of organisms in an environment or the diffusion of molecules in a liquid.

• A chemical reaction approaching equilibrium.

An interesting common feature of these processes is that as they occur, quantities mathematically akin to entropy tend to increase. Closely related quantities such as free energy tend to decrease. In this review, we explain some mathematical results that make this idea precise.

Most of these results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We’ll use the first term. Given two probability distributions $q$ and $p$ on a finite set $X$, their **relative information**, or more precisely the **information of $q$ relative to $p$**, is

$$ I(q,p) = \sum_{i \in X} q_i \ln\left(\frac{q_i}{p_i}\right). $$
We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that $I(q,p)$ decreases with time under various conditions. The reason is that the Shannon entropy

$$ S(q) = -\sum_{i \in X} q_i \ln q_i $$

contains a minus sign that is missing from the definition of relative information.

Intuitively, $I(q,p)$ is the amount of information gained when we start with a hypothesis given by some probability distribution $p$ and then learn the ‘true’ probability distribution $q$. For example, if we start with the hypothesis that a coin is fair and then are told that it landed heads up, the relative information is $\ln 2$, so we have gained 1 bit of information. If however we started with the hypothesis that the coin always lands heads up, we would have gained no information.
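The coin example is easy to check numerically. Here is a minimal sketch (the function name `relative_information` is ours, and we use base-2 logarithms so that the answer comes out in bits rather than nats):

```python
from math import log2

def relative_information(q, p):
    """Relative information I(q,p) = sum_i q_i log2(q_i / p_i), in bits.
    Terms with q_i = 0 contribute nothing, by the convention 0 log 0 = 0."""
    return sum(qi * log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

fair = [0.5, 0.5]    # hypothesis: the coin is fair
heads = [1.0, 0.0]   # what we learn: the coin landed heads up

print(relative_information(heads, fair))   # 1.0: one bit gained
print(relative_information(heads, heads))  # 0.0: nothing left to learn
```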

We put the word ‘true’ in quotes here, because the notion of a ‘true’ probability distribution, which subjective Bayesians reject, is not required to use relative information. A more cautious description of relative information is that it is a **divergence**: a way of measuring the difference between probability distributions that obeys

$$ I(q,p) \ge 0 $$

and

$$ I(q,p) = 0 \iff q = p, $$

but not necessarily the other axioms for a distance function, symmetry and the triangle inequality, which indeed fail for relative information.

There are many other divergences besides relative information, some of which we discuss in Section 6. However, relative information can be singled out by a number of characterizations, including one based on ideas from Bayesian inference. The relative information is also close to the expected number of extra bits required to code messages distributed according to the probability measure $q$ using a code optimized for messages distributed according to $p$.

In this review, we describe various ways in which a population or probability distribution evolves continuously according to some differential equation. For all these differential equations, we describe conditions under which relative information decreases. Briefly, the results are as follows. We hasten to reassure the reader that our paper explains all the jargon involved, and the proofs of the claims are given in full:

• In Section 2, we consider a very general form of the Lotka–Volterra equations, which are a commonly used model of population dynamics. Starting from the population $P_i$ of each type of replicating entity, we can define a probability distribution

$$ p_i = \frac{P_i}{\sum_j P_j} $$

which evolves according to a nonlinear equation called the replicator equation. We describe a necessary and sufficient condition under which $I(q,p)$ is nonincreasing when $p$ evolves according to the replicator equation while $q$ is held fixed.

• In Section 3, we consider a special case of the replicator equation that is widely studied in evolutionary game theory. In this case we can think of probability distributions as mixed strategies in a two-player game. When $q$ is a dominant strategy, $I(q,p)$ can never increase when $p$ evolves according to the replicator equation. We can think of $I(q,p)$ as the information that the population has left to learn. Thus, evolution is analogous to a learning process—an analogy that in the field of artificial intelligence is exploited by evolutionary algorithms!

• In Section 4, we consider continuous-time, finite-state Markov processes. Here we have probability distributions on a finite set $X$ evolving according to a linear equation called the master equation. In this case $I(q,p)$ can never increase. Thus, if $q$ is a steady state solution of the master equation, both $I(q,p)$ and $I(p,q)$ are nonincreasing. We can always write $q$ as the Boltzmann distribution for some energy function $E : X \to \mathbb{R}$, meaning that

$$ q_i = \frac{e^{-E_i/kT}}{\sum_{j \in X} e^{-E_j/kT}} $$

where $T$ is temperature and $k$ is Boltzmann’s constant. In this case, $I(p,q)$ is proportional to a difference of free energies:

$$ I(p,q) = \frac{F(p) - F(q)}{kT}. $$

Thus, the nonincreasing nature of $I(p,q)$ is a version of the Second Law of Thermodynamics.
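For a two-state Markov process, the claim that $I(p,q)$ decreases toward its steady-state value can be checked directly. The transition rates below are arbitrary illustrative choices, not anything from the paper:

```python
from math import log

a, b = 3.0, 1.0                  # transition rates 0 -> 1 and 1 -> 0
q = [b / (a + b), a / (a + b)]   # steady state of the master equation

def master_step(p, dt=0.001):
    """One Euler step of the master equation dp/dt = H p for this two-state chain."""
    return [p[0] + dt * (-a * p[0] + b * p[1]),
            p[1] + dt * ( a * p[0] - b * p[1])]

def I(p, q):
    """Relative information I(p, q) in nats."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
vals = []
for _ in range(2000):
    vals.append(I(p, q))
    p = master_step(p)

assert all(x >= y for x, y in zip(vals, vals[1:]))   # I(p, q) never increases
```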

• In Section 5, we consider chemical reactions and other processes described by reaction networks. In this context we have populations $P_i$ of entities of various kinds $i \in X$, and these populations evolve according to a nonlinear equation called the rate equation. We can generalize relative information from probability distributions to populations by setting

$$ I(Q,P) = \sum_{i \in X} \left[ Q_i \ln\left(\frac{Q_i}{P_i}\right) - \left(Q_i - P_i\right) \right]. $$

If $Q$ is a special sort of steady state solution of the rate equation, called a complex balanced equilibrium, $I(Q,P)$ can never increase when $P$ evolves according to the rate equation.
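A quick numerical check that this population version of relative information still behaves like a divergence: it is nonnegative, and zero exactly when the populations agree. The example populations below are made up for illustration:

```python
from math import log

def pop_relative_information(Q, P):
    """I(Q, P) = sum_i [ Q_i ln(Q_i / P_i) - (Q_i - P_i) ] for populations,
    which, unlike probability distributions, need not sum to 1."""
    return sum(Qi * log(Qi / Pi) - (Qi - Pi) for Qi, Pi in zip(Q, P))

Q = [30.0, 5.0, 12.0]
P = [20.0, 8.0, 12.0]

assert pop_relative_information(Q, P) > 0.0   # distinct populations
assert pop_relative_information(Q, Q) == 0.0  # identical populations
```

The subtracted term $Q_i - P_i$ is what makes nonnegativity work for unnormalized populations; for probability distributions it sums to zero and the usual formula is recovered.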

• Finally, in Section 6, we consider a class of divergences called **$f$-divergences**, which include relative information as a special case. For any convex function $f : [0,\infty) \to [0,\infty)$, the **$f$-divergence** of two probability distributions is given by

$$ I_f(q,p) = \sum_{i \in X} p_i \, f\!\left(\frac{q_i}{p_i}\right). $$

Whenever $q$ and $p$ are probability distributions evolving according to the master equation of some Markov process, $I_f(q,p)$ is nonincreasing. The $f$-divergence is also well-defined for populations, and nonincreasing for two populations that both evolve according to the master equation.
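Taking $f(x) = x \ln x$ recovers ordinary relative information, which the following sketch verifies numerically (the function names are ours, and the sample distributions are arbitrary):

```python
from math import log

def f_divergence(q, p, f):
    """I_f(q, p) = sum_i p_i f(q_i / p_i)."""
    return sum(pi * f(qi / pi) for qi, pi in zip(q, p))

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i), in nats."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.7, 0.2, 0.1]
p = [0.4, 0.4, 0.2]

# p_i * (q_i/p_i) ln(q_i/p_i) = q_i ln(q_i/p_i), term by term:
kl = f_divergence(q, p, lambda x: x * log(x))
assert abs(kl - relative_information(q, p)) < 1e-12
```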

Hi John, I’ve been meaning to ask you for some time about this: have you looked at the recent work on so-called “causal entropic forces”? It also seems to attempt to make rigorous some ideas connecting life (or “intelligence”) with increasing entropy.

I don’t know about ‘causal entropic forces’, but you might like this blog article:

• John Baez, Entropic forces,

Azimuth, 1 February 2012.

I understand your unease concerning the usage of ‘entropy’ and ‘information’, and that you prefer to use the latter term while most people use the former.

A long time ago I read the paper “Energy and Information” by Myron Tribus and Edward C. McIrvine, *Scientific American* **224** (1971), 179. This paper is very clear in distinguishing between entropy and information—very useful stuff, and I recommend it to everybody.

The part I read sounds good (the basic idea and the first two examples). I have no suggestion for the title.

If 1-2 sentences could do it, you might add whether or not any of the “other divergences” are more like metrics, and if not, what that means, and if so, why they’re not “better”. (If some divergence is symmetric, maybe nonlinearly rescaling it could also fix the triangle inequality — if not, it would be interesting to understand why.)

I say a lot about other divergences in Section 6, and I don’t think the introduction (quoted here) is the place to dive into that. But I’ll mention one thing from that section here:

The **Jensen–Shannon divergence** of two probability distributions $q$ and $p$ is defined in terms of relative information by

$$ \mathrm{JSD}(q,p) = \tfrac{1}{2} I(q,m) + \tfrac{1}{2} I(p,m) $$

where $m$ is the arithmetic mean of the probability distributions $q$ and $p$:

$$ m = \tfrac{1}{2}(q + p). $$
The Jensen–Shannon divergence is obviously symmetric in its arguments. More interesting is that its square root is actually a metric on the space of probability distributions! In particular, it obeys the triangle inequality. Even better, it is nonincreasing whenever $q$ and $p$ evolve in time via a Markov process. Without some property like this, a metric on probability distributions is not very interesting to me.
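The metric claim is easy to spot-check numerically: for sample distributions, the square root of the Jensen–Shannon divergence is symmetric and satisfies the triangle inequality. The three distributions below are arbitrary choices for illustration:

```python
from math import log, sqrt

def I(q, p):
    """Relative information I(q, p) in nats."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def jsd(q, p):
    """Jensen-Shannon divergence: average relative information to the midpoint m."""
    m = [(qi + pi) / 2 for qi, pi in zip(q, p)]
    return 0.5 * I(q, m) + 0.5 * I(p, m)

d = lambda q, p: sqrt(jsd(q, p))   # candidate metric

a = [0.7, 0.2, 0.1]
b = [0.1, 0.6, 0.3]
c = [0.3, 0.3, 0.4]

assert abs(d(a, b) - d(b, a)) < 1e-12          # symmetry
assert d(a, c) <= d(a, b) + d(b, c) + 1e-12    # triangle inequality
```

Of course a few examples prove nothing; the point of the theorem is that these properties hold for all probability distributions.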

Markov processes are linear. For nonlinear systems like the replicator equation or the rate equation of a chemical reaction network, it seems hard to find any metric where distances between populations decrease as time passes. There are however nice theorems about how relative information decreases, as summarized in this introduction.

Typo on p. 5: “At this, point Marc Harper”

And a couple more on p. 9: “any two probability distribution”

“alwasy infinitesimal stochastic”

p. 11: “This is a version of the Second Law of Thermodynamics: it says that To prove this, note that”

It is very interesting.

I am thinking that Landauer’s principle could be used to prove the entropy increase using the heat flow (graph) from a chemical reaction, so that each chemical reaction can give ordered phases, and life = consciousness, for a high reduction of the inner energy (with self-replication as the usual manner of obtaining order over long times).

I am thinking that if the information is a bit string, and if each program is a number in a memory unit, then it could be possible to generate computer programs using the Helmholtz free energy as a reservoir of bit changes, and the environment as another reservoir of bit changes: the graphs coding the flow of energy would act like an adaptive program (thermodynamic computer science).

If there exists an upper limit on the entropy (by the third law of thermodynamics), then Landauer’s principle cannot always hold, and there is a quantum restriction: some information is not stored in the thermodynamic system, because every increase of the quantum levels has an energy greater than the Landauer limit (so that a principle of thermodynamics could be a quantum principle).

We fixed up the paper and put it on the arXiv, though there are at least a few typos in the current arXiv version which I’ll fix later. We also submitted the paper to *Entropy*.

[…] There’s a lot more to say here, but I just want to add that free energy can also be interpreted as ‘relative information’, a purely information-theoretic concept. For an explanation, see Section 4 of this paper:

• John Baez and Blake Pollard, Relative entropy in biological systems. (Blog article here.)

Since I like abstract generalities, this information-theoretic way of understanding free energy appeals to me.

And of course free energy is *useful*, so an organism should care about it—and we should be able to track what an organism actually does with it. This is one of my main goals: understanding better what it means for a system to ‘do something with free energy’.

In glycolysis, some of the free energy of glucose gets transferred to ATP. ATP is a bit like ‘money’: it carries free energy in a way that the cell can easily ‘spend’ to do interesting things. So, at some point I want to look at an example of how the cell actually spends this money. But for now I want to think about glycolysis—which may be more like ‘cashing a check and getting money’. […]

It’s been a long time since you’ve seen an installment of the information geometry series on this blog! Before I took a long break, I was explaining relative entropy and how it changes in evolutionary games. Much of what I said is summarized and carried further here:

• John Baez and Blake Pollard, Relative entropy in biological systems. (Blog article here.)

But now Blake Pollard has a new paper, and I want to talk about that:

• Blake Pollard, Open Markov processes: A compositional perspective on non-equilibrium steady states in biology.

I’ll focus on just one aspect: the principle of minimum entropy production.