Relative Entropy in Biological Systems

Here’s a paper for the proceedings of a workshop on Information and Entropy in Biological Systems this spring:

• John Baez and Blake Pollard, Relative entropy in biological systems, Entropy 18 (2016), 46.

We’d love any comments or questions you might have. I’m not happy with the title. In the paper we advocate using the term ‘relative information’ instead of ‘relative entropy’—yet the latter is much more widely used, so I feel we need it in the title to let people know what the paper is about!

Here’s the basic idea.

Life relies on nonequilibrium thermodynamics, since in thermal equilibrium there are no flows of free energy. Biological systems are also open systems, in the sense that both matter and energy flow in and out of them. Nonetheless, it is important in biology that systems can sometimes be treated as approximately closed, and sometimes approach equilibrium before being disrupted in one way or another. This can occur on a wide range of scales, from large ecosystems to within a single cell or organelle. Examples include:

• A population approaching an evolutionarily stable state.

• Random processes such as mutation, genetic drift, the diffusion of organisms in an environment or the diffusion of molecules in a liquid.

• A chemical reaction approaching equilibrium.

An interesting common feature of these processes is that as they occur, quantities mathematically akin to entropy tend to increase. Closely related quantities such as free energy tend to decrease. In this review, we explain some mathematical results that make this idea precise.

Most of these results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We’ll use the first term. Given two probability distributions p and q on a finite set X, their relative information, or more precisely the information of p relative to q, is

\displaystyle{ I(p\|q) = \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right) }

We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that I(p\|q) decreases with time under various conditions. The reason is that the Shannon entropy

\displaystyle{ S(p) = -\sum_{i \in X} p_i \ln p_i }

contains a minus sign that is missing from the definition of relative information.

Intuitively, I(p\|q) is the amount of information gained when we start with a hypothesis given by some probability distribution q and then learn the ‘true’ probability distribution p. For example, if we start with the hypothesis that a coin is fair and then are told that it landed heads up, the relative information is \ln 2, so we have gained 1 bit of information. If however we started with the hypothesis that the coin always lands heads up, we would have gained no information.
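The coin example can be checked numerically. Here is a minimal Python sketch, assuming distributions are given as lists of probabilities on the same finite set. The function name `relative_information` is ours, not the paper's, and the code adopts the usual convention that terms with p_i = 0 contribute nothing.

```python
import math

def relative_information(p, q):
    """I(p||q) = sum_i p_i ln(p_i / q_i), measured in nats.

    Convention (an assumption of this sketch): terms with p_i == 0
    contribute 0, and the result is infinite if p_i > 0 while q_i == 0.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

fair = [0.5, 0.5]          # hypothesis q: the coin is fair
always_heads = [1.0, 0.0]  # hypothesis q: the coin always lands heads up
heads = [1.0, 0.0]         # what we learn: the coin landed heads up

nats = relative_information(heads, fair)
bits = nats / math.log(2)  # convert nats to bits

print(bits)                                       # information gained from the fair-coin hypothesis
print(relative_information(heads, always_heads))  # nothing gained in this case
```

Dividing by ln 2 converts from nats to bits, which is how the value ln 2 becomes 1 bit.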

We put the word ‘true’ in quotes here, because the notion of a ‘true’ probability distribution, which subjective Bayesians reject, is not required to use relative information. A more cautious description of relative information is that it is a divergence: a way of measuring the difference between probability distributions that obeys

I(p \| q) \ge 0


I(p \| q) = 0 \iff p = q

but not necessarily the other axioms for a distance function, symmetry and the triangle inequality, which indeed fail for relative information.
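A quick numerical check makes these failures concrete. This is only a sketch: the three distributions below are arbitrary illustrative choices, not taken from the paper, and the helper assumes every q_i is positive.

```python
import math

def relative_information(p, q):
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 are skipped.
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Symmetry fails: I(p||q) and I(q||p) differ.
p = [0.9, 0.1]
q = [0.5, 0.5]
i_pq = relative_information(p, q)
i_qp = relative_information(q, p)
print(i_pq, i_qp)

# The triangle inequality fails: I(a||c) exceeds I(a||b) + I(b||c).
a = [0.5, 0.5]
b = [0.25, 0.75]
c = [0.1, 0.9]
direct = relative_information(a, c)
via_b = relative_information(a, b) + relative_information(b, c)
print(direct, via_b)
```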

There are many other divergences besides relative information, some of which we discuss in Section 6. However, relative information can be singled out by a number of characterizations, including one based on ideas from Bayesian inference. The relative information is also close to the expected number of extra bits required to code messages distributed according to the probability measure p using a code optimized for messages distributed according to q.
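The coding interpretation can be illustrated with idealized code lengths. The sketch below assumes that a code optimized for q assigns a word of length -log2(q_i) bits to outcome i. Real codes need integer lengths, which is why the match is only approximate in general. The example distributions are our own choice, picked so that all lengths are whole numbers.

```python
import math

def entropy_bits(p):
    """Expected length, in bits, of an ideal code optimized for p itself."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

def cross_entropy_bits(p, q):
    """Expected length when messages follow p but the code is optimized for q."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def relative_information_bits(p, q):
    """I(p||q) with logarithms taken base 2, so the answer is in bits."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.25, 0.125, 0.125]  # actual message distribution
q = [0.25, 0.25, 0.25, 0.25]   # distribution the code was designed for

extra_bits = cross_entropy_bits(p, q) - entropy_bits(p)
print(extra_bits, relative_information_bits(p, q))
```

Here the code for q spends 2 bits on every message, while a code tuned to p would average 1.75 bits, so both printed numbers come out to 0.25 bits.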

In this review, we describe various ways in which a population or probability distribution evolves continuously according to some differential equation. For all these differential equations, we describe conditions under which relative information decreases. Briefly, the results are as follows. We hasten to reassure the reader that our paper explains all the jargon involved, and the proofs of the claims are given in full:

• In Section 2, we consider a very general form of the Lotka–Volterra equations, which are a commonly used model of population dynamics. Starting from the population P_i of each type of replicating entity, we can define a probability distribution

p_i = \displaystyle{\frac{P_i}{\sum_{j \in X} P_j }}

which evolves according to a nonlinear equation called the replicator equation. We describe a necessary and sufficient condition under which I(q\|p(t)) is nonincreasing when p(t) evolves according to the replicator equation while q is held fixed.

• In Section 3, we consider a special case of the replicator equation that is widely studied in evolutionary game theory. In this case we can think of probability distributions as mixed strategies in a two-player game. When q is a dominant strategy, I(q\|p(t)) can never increase when p(t) evolves according to the replicator equation. We can think of I(q\|p(t)) as the information that the population has left to learn. Thus, evolution is analogous to a learning process—an analogy that in the field of artificial intelligence is exploited by evolutionary algorithms!

• In Section 4 we consider continuous-time, finite-state Markov processes. Here we have probability distributions on a finite set X evolving according to a linear equation called the master equation. In this case I(p(t)\|q(t)) can never increase. Thus, if q is a steady state solution of the master equation, both I(p(t)\|q) and I(q\|p(t)) are nonincreasing. We can always write q as the Boltzmann distribution for some energy function E : X \to \mathbb{R}, meaning that

\displaystyle{ q_i = \frac{\exp(-E_i / k T)}{\displaystyle{\sum_{j \in X} \exp(-E_j / k T)}} }

where T is temperature and k is Boltzmann’s constant. In this case, I(p(t)\|q) is proportional to a difference of free energies:

\displaystyle{ I(p(t)\|q) = \frac{F(p) - F(q)}{k T} }

Thus, the nonincreasing nature of I(p(t)\|q) is a version of the Second Law of Thermodynamics.

• In Section 5, we consider chemical reactions and other processes described by reaction networks. In this context we have populations P_i of entities of various kinds i \in X, and these populations evolve according to a nonlinear equation called the rate equation. We can generalize relative information from probability distributions to populations by setting

\displaystyle{ I(P\|Q) = \sum_{i \in X} \left[ P_i \ln\left(\frac{P_i}{Q_i}\right) - \left(P_i - Q_i\right) \right] }

If Q is a special sort of steady state solution of the rate equation, called a complex balanced equilibrium, I(P(t)\|Q) can never increase when P(t) evolves according to the rate equation.

• Finally, in Section 6, we consider a class of functions called f-divergences which include relative information as a special case. For any convex function f : [0,\infty) \to [0,\infty), the f-divergence of two probability distributions p, q : X \to [0,1] is given by

\displaystyle{ I_f(p\|q) = \sum_{i \in X} q_i f\left(\frac{p_i}{q_i}\right)}

Whenever p(t) and q(t) are probability distributions evolving according to the master equation of some Markov process, I_f(p(t)\|q(t)) is nonincreasing. The f-divergence is also well-defined for populations, and nonincreasing for two populations that both evolve according to the master equation.
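To make the Markov process result from Section 4 concrete, here is a small numerical sketch. Everything specific in it is an assumption for illustration: the 3-state rate matrix `H`, the initial distributions, and the use of forward-Euler steps to approximate the master equation dp/dt = Hp. With a step this small, each Euler step is itself a transition of a discrete-time Markov chain, so the monotone decrease should hold step by step up to rounding error.

```python
import math

def relative_information(p, q):
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i == 0 are skipped.
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Hypothetical rate matrix: H[i][j] is the transition rate from state j
# to state i for i != j, and each column sums to zero so that total
# probability is conserved.
H = [
    [-3.0,  1.0,  0.5],
    [ 2.0, -1.5,  1.5],
    [ 1.0,  0.5, -2.0],
]

def euler_step(p, dt):
    """One forward-Euler step of the master equation dp/dt = H p."""
    n = len(p)
    return [p[i] + dt * sum(H[i][j] * p[j] for j in range(n)) for i in range(n)]

def simulate(p, q, dt=0.01, steps=500):
    """Evolve p and q side by side, recording I(p(t)||q(t)) after each step."""
    history = [relative_information(p, q)]
    for _ in range(steps):
        p, q = euler_step(p, dt), euler_step(q, dt)
        history.append(relative_information(p, q))
    return p, q, history

p0 = [0.8, 0.1, 0.1]
q0 = [0.2, 0.3, 0.5]
p_final, q_final, history = simulate(p0, q0)

nonincreasing = all(later <= earlier + 1e-12
                    for earlier, later in zip(history, history[1:]))
print(history[0], history[-1], nonincreasing)
```

Both distributions drift toward the same steady state, so the relative information between them shrinks toward zero.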

12 Responses to Relative Entropy in Biological Systems

  1. Simon Burton says:

    Hi John, I’ve been meaning to ask you for some time about this: have you looked at the recent work on so-called “causal entropic forces” ? It seems to also attempt to make rigorous some ideas connecting life (or “intelligence”) with increasing entropy.

  2. I understand your embarrassment concerning the usage of ‘entropy’ and ‘information’, and that you prefer to use the latter term, while people are using the former one.

    A long time ago I read the paper “Energy and Information” by Myron Tribus and Edward C. McIrvine, Scientific American 225 (1971) 179. This paper is very clear in distinguishing between entropy and information. It is very useful stuff, and I recommend it to everybody.

  3. Bruce Smith says:

    The part I read sounds good (basic idea and first two examples.) I have no suggestion for title.

    If 1-2 sentences could do it, you might add whether or not any of the “other divergences” are more like metrics, and if not, what that means, and if so, why they’re not “better”. (If some divergence is symmetric, maybe nonlinearly rescaling it could also fix the triangle inequality — if not, it would be interesting to understand why.)

    • John Baez says:

      I say a lot about other divergences in Section 6, and I don’t think the introduction (quoted here) is the place to dive into that. But I’ll just tell you something in there:

      The Jensen–Shannon divergence of two probability distributions is defined in terms of relative information by

      \displaystyle{  JS(p\|q) = \frac{I(p\|m) + I(q\|m)}{2}  }

      where m is the arithmetic mean of the probability distributions p and q:

      \displaystyle{   m_i = \frac{p_i + q_i}{2} }

      The Jensen–Shannon divergence is obviously symmetric in its arguments. More interesting is that its square root is actually a metric on the space of probability distributions! In particular, it obeys the triangle inequality. Even better, it is nonincreasing whenever p and q evolve in time via a Markov process. Without some property like this, a metric on probability distributions is not very interesting to me.

      Markov processes are linear. For nonlinear systems like the replicator equation or the rate equation of a chemical reaction network, it seems hard to find any metric where distances between populations decrease as time passes. There are however nice theorems about how relative information decreases, as summarized in this introduction.
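The symmetry and triangle inequality can be spot-checked numerically. The sketch below uses the standard definition JS(p‖q) = (I(p‖m) + I(q‖m))/2 with m the mean of p and q. The three test distributions are arbitrary illustrative choices, and a check on a handful of points is of course no substitute for the proof.

```python
import itertools
import math

def relative_information(p, q):
    # Terms with p_i == 0 are skipped; q_i > 0 is assumed wherever p_i > 0.
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def jensen_shannon(p, q):
    """Average relative information of p and q with respect to their mean m."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return (relative_information(p, m) + relative_information(q, m)) / 2

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (clamped at 0 for rounding)."""
    return math.sqrt(max(jensen_shannon(p, q), 0.0))

dists = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [1 / 3, 1 / 3, 1 / 3],
]

symmetric = all(
    abs(jensen_shannon(a, b) - jensen_shannon(b, a)) < 1e-12
    for a, b in itertools.permutations(dists, 2)
)
triangle = all(
    js_distance(a, c) <= js_distance(a, b) + js_distance(b, c) + 1e-12
    for a, b, c in itertools.permutations(dists, 3)
)
print(symmetric, triangle)
```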

  4. Blake Stacey says:

    Typo on p. 5: “At this, point Marc Harper”

  5. domenico says:

    It is very interesting.
    I am thinking that Landauer’s principle can be used to prove the entropy increase using the heat flow (graph) from a chemical reaction, so that each chemical reaction can give ordered phases, and life=consciousness, for a high reduction of the internal energy (with self-replication as a usual way to obtain order over long times).
    I am thinking that if the information is a bit string, and if each program is a number of memory units, then it could be possible to generate computer programs using the Helmholtz free energy as a reservoir of bit changes, and the environment as a reservoir of bit changes: the graphs coding the flow of energy would be like an adaptive program (thermodynamic computer science).
    If there exists an upper limit on the entropy (by the third law of thermodynamics), then Landauer’s principle is not always true, and there is a quantum restriction: there is information that is not stored in the thermodynamic system because all the increases of the quantum levels have an energy greater than the Landauer limit (so that a principle in thermodynamics could be a quantum principle).

  6. John Baez says:

    We fixed up the paper and put it on the arXiv, though there are at least a few typos in the current arXiv version which I’ll fix later. We also submitted the paper to Entropy.

  7. […] There’s a lot more to say here, but I just want to add that free energy can also be interpreted as ‘relative information’, a purely information-theoretic concept. For an explanation, see Section 4 of this paper:

    • John Baez and Blake Pollard, Relative entropy in biological systems. (Blog article here.)

    Since I like abstract generalities, this information-theoretic way of understanding free energy appeals to me.

    And of course free energy is useful, so an organism should care about it—and we should be able to track what an organism actually does with it. This is one of my main goals: understanding better what it means for a system to ‘do something with free energy’.

    In glycolysis, some of the free energy of glucose gets transferred to ATP. ATP is a bit like ‘money’: it carries free energy in a way that the cell can easily ‘spend’ to do interesting things. So, at some point I want to look at an example of how the cell actually spends this money. But for now I want to think about glycolysis—which may be more like ‘cashing a check and getting money’. […]

  8. It’s been a long time since you’ve seen an installment of the information geometry series on this blog! Before I took a long break, I was explaining relative entropy and how it changes in evolutionary games. Much of what I said is summarized and carried further here:

    • John Baez and Blake Pollard, Relative entropy in biological systems. (Blog article here.)

    But now Blake Pollard has a new paper, and I want to talk about that:

    • Blake Pollard, Open Markov processes: A compositional perspective on non-equilibrium steady states in biology.

    I’ll focus on just one aspect: the principle of minimum entropy production.
