Here’s a draft of a paper for the proceedings of a workshop on Information and Entropy in Biological Systems this spring:
• John Baez and Blake Pollard, Relative Entropy in Biological Systems.
We’d love any comments or questions you might have. I’m not happy with the title. In the paper we advocate using the term ‘relative information’ instead of ‘relative entropy’—yet the latter is much more widely used, so I feel we need it in the title to let people know what the paper is about!
Here’s the basic idea.
Life relies on nonequilibrium thermodynamics, since in thermal equilibrium there are no flows of free energy. Biological systems are also open systems, in the sense that both matter and energy flow in and out of them. Nonetheless, it is important in biology that systems can sometimes be treated as approximately closed, and sometimes approach equilibrium before being disrupted in one way or another. This can occur on a wide range of scales, from large ecosystems to within a single cell or organelle. Examples include:
• A population approaching an evolutionarily stable state.
• Random processes such as mutation, genetic drift, the diffusion of organisms in an environment or the diffusion of molecules in a liquid.
• A chemical reaction approaching equilibrium.
An interesting common feature of these processes is that as they occur, quantities mathematically akin to entropy tend to increase. Closely related quantities such as free energy tend to decrease. In this review, we explain some mathematical results that make this idea precise.
Most of these results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We’ll use the first term. Given two probability distributions $p$ and $q$ on a finite set $X$, their relative information, or more precisely the information of $p$ relative to $q$, is
$$I(p|q) = \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right)$$
We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that $I(p|q)$ decreases with time under various conditions. The reason is that the Shannon entropy
$$S(p) = -\sum_{i \in X} p_i \ln p_i$$
contains a minus sign that is missing from the definition of relative information.
Intuitively, $I(p|q)$ is the amount of information gained when we start with a hypothesis given by some probability distribution $q$ and then learn the ‘true’ probability distribution $p$. For example, if we start with the hypothesis that a coin is fair and then are told that it landed heads up, the relative information is $\ln 2$, so we have gained 1 bit of information. If however we started with the hypothesis that the coin always lands heads up, we would have gained no information.
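The coin example can be checked numerically. Here is a minimal sketch in Python, assuming distributions are given as lists of probabilities; the function name `relative_information` and the convention that terms with $p_i = 0$ contribute nothing are choices made for this illustration, not something taken from the paper.

```python
import math

def relative_information(p, q):
    """Information of p relative to q, in nats: sum_i p_i ln(p_i / q_i).

    Terms with p_i == 0 are skipped (the usual 0 ln 0 = 0 convention).
    If p_i > 0 where q_i == 0, the result is infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total

fair = [0.5, 0.5]    # hypothesis: the coin is fair
heads = [1.0, 0.0]   # what we learn: it landed heads up

gain_nats = relative_information(heads, fair)   # ln 2 nats
gain_bits = gain_nats / math.log(2)             # the same quantity in bits
no_gain = relative_information(heads, heads)    # hypothesis was already "always heads"
```

Dividing by $\ln 2$ converts nats to bits, which is why a relative information of $\ln 2$ counts as 1 bit.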
We put the word ‘true’ in quotes here, because the notion of a ‘true’ probability distribution, which subjective Bayesians reject, is not required to use relative information. A more cautious description of relative information is that it is a divergence: a way of measuring the difference between probability distributions that obeys
$$I(p|q) \ge 0 \quad \text{and} \quad I(p|q) = 0 \iff p = q$$
but not necessarily the other axioms for a distance function, symmetry and the triangle inequality, which indeed fail for relative information.
There are many other divergences besides relative information, some of which we discuss in Section 6. However, relative information can be singled out by a number of characterizations, including one based on ideas from Bayesian inference. The relative information $I(p|q)$ is also close to the expected number of extra bits required to code messages distributed according to the probability measure $p$ using a code optimized for messages distributed according to $q$.
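The coding interpretation and the failure of symmetry can both be seen in a small example. The sketch below is an idealization: it assigns outcome $i$ the ideal code length $-\log_2 q_i$, which need not be a whole number, so real codes only approximate it (hence ‘close to’). The function names and the two distributions are made up for illustration, and the code assumes $q_i > 0$ wherever $p_i > 0$.

```python
import math

def expected_code_length_bits(p, q):
    """Expected bits per message when messages follow p but the code is
    tuned to q, using ideal (possibly fractional) lengths -log2(q_i)."""
    return sum(p_i * -math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def relative_information_bits(p, q):
    """I(p|q) measured in bits: sum_i p_i log2(p_i / q_i)."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.5, 0.5]     # how messages are actually distributed
q = [0.25, 0.75]   # what the code was optimized for

# Extra cost of using the wrong code, compared with a code tuned to p itself.
extra_bits = expected_code_length_bits(p, q) - expected_code_length_bits(p, p)

forward = relative_information_bits(p, q)
backward = relative_information_bits(q, p)
```

With ideal code lengths the extra cost equals $I(p|q)$ in bits exactly, and swapping the roles of $p$ and $q$ gives a different number, so relative information is not symmetric.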
In this review, we describe various ways in which a population or probability distribution evolves continuously according to some differential equation. For all these differential equations, we describe conditions under which relative information decreases. Briefly, the results are as follows. We hasten to reassure the reader that our paper explains all the jargon involved, and the proofs of the claims are given in full:
• In Section 2, we consider a very general form of the Lotka–Volterra equations, which are a commonly used model of population dynamics. Starting from the population $P_i$ of each type of replicating entity, we can define a probability distribution
$$p_i = \frac{P_i}{\sum_{j \in X} P_j}$$
which evolves according to a nonlinear equation called the replicator equation. We describe a necessary and sufficient condition under which $I(q|p(t))$ is nonincreasing when $p(t)$ evolves according to the replicator equation while $q$ is held fixed.
• In Section 3, we consider a special case of the replicator equation that is widely studied in evolutionary game theory. In this case we can think of probability distributions as mixed strategies in a two-player game. When $q$ is a dominant strategy, $I(q|p(t))$ can never increase when $p(t)$ evolves according to the replicator equation. We can think of $I(q|p(t))$ as the information that the population has left to learn.
Thus, evolution is analogous to a learning process—an analogy that in the field of artificial intelligence is exploited by evolutionary algorithms.
• In Section 4 we consider continuous-time, finite-state Markov processes. Here we have probability distributions on a finite set $X$ evolving according to a linear equation called the master equation. In this case $I(p(t)|q(t))$ can never increase. Thus, if $q$ is a steady state solution of the master equation, both $I(p(t)|q)$ and $I(q|p(t))$ are nonincreasing. We can always write $q$ as the Boltzmann distribution for some energy function $E \colon X \to \mathbb{R}$, meaning that
$$q_i = \frac{\exp(-E_i/kT)}{\sum_{j \in X} \exp(-E_j/kT)}$$
where $T$ is temperature and $k$ is Boltzmann’s constant. In this case, $I(p(t)|q)$ is proportional to a difference of free energies:
$$I(p(t)|q) = \frac{F(p(t)) - F(q)}{kT}$$
Thus, the nonincreasing nature of $I(p(t)|q)$ is a version of the Second Law of Thermodynamics.
• In Section 5, we consider chemical reactions and other processes described by reaction networks. In this context we have populations $P_1, \dots, P_n$ of entities of various kinds $X_1, \dots, X_n$, and these populations evolve according to a nonlinear equation called the rate equation. We can generalize relative information from probability distributions to populations by setting
$$I(P|Q) = \sum_i \left[ P_i \ln\left(\frac{P_i}{Q_i}\right) - (P_i - Q_i) \right]$$
If $Q$ is a special sort of steady state solution of the rate equation, called a complex balanced equilibrium, $I(P(t)|Q)$ can never increase when $P(t)$ evolves according to the rate equation.
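• Finally, in Section 6, we consider a class of functions called $f$-divergences which include relative information as a special case. For any convex function $f \colon [0,\infty) \to [0,\infty)$, the $f$-divergence of two probability distributions $p$ and $q$ is given by
$$I_f(p|q) = \sum_{i \in X} q_i f\left(\frac{p_i}{q_i}\right)$$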
Whenever $p(t)$ and $q(t)$ are probability distributions evolving according to the master equation of some Markov process, $I_f(p(t)|q(t))$ is nonincreasing. The $f$-divergence is also well-defined for populations, and nonincreasing for two populations that both evolve according to the master equation.
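Two of the Markov process claims above are easy to test numerically. The following is a rough sketch under stated assumptions, not the method used in the paper. The master equation is written as $dp/dt = Hp$, where the off-diagonal entries of $H$ are nonnegative rates and each column sums to zero; the rates, energies, step size and helper names are all invented for the illustration. Free energy is taken to be $F(p) = \langle E \rangle_p - kT\,S(p)$ with $S$ the Shannon entropy in nats, which is an assumed convention. The time stepping is plain forward Euler; with a step this small each Euler step is itself a stochastic map, so the monotonicity check is meaningful rather than an artifact of discretization.

```python
import math
import random

def relative_information(p, q):
    """I(p|q) = sum_i p_i ln(p_i / q_i), assuming q_i > 0 wherever p_i > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def random_rate_matrix(n, rng):
    """Made-up rates: H[i][j] >= 0 for i != j, and every column sums to zero."""
    H = [[rng.random() if i != j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        H[j][j] = -sum(H[i][j] for i in range(n) if i != j)
    return H

def random_distribution(n, rng):
    """A strictly positive probability distribution on n states."""
    weights = [rng.random() + 0.1 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def euler_step(H, p, dt):
    """One forward-Euler step of the master equation dp/dt = H p."""
    n = len(p)
    return [p[i] + dt * sum(H[i][j] * p[j] for j in range(n)) for i in range(n)]

def boltzmann(E, kT):
    """Boltzmann distribution q_i proportional to exp(-E_i / kT)."""
    weights = [math.exp(-E_i / kT) for E_i in E]
    Z = sum(weights)
    return [w / Z for w in weights]

def free_energy(p, E, kT):
    """F(p) = <E>_p - kT * S(p), with S in nats (assumed convention)."""
    mean_energy = sum(p_i * E_i for p_i, E_i in zip(p, E))
    entropy = -sum(p_i * math.log(p_i) for p_i in p if p_i > 0)
    return mean_energy - kT * entropy

rng = random.Random(0)
n, dt, steps = 4, 0.01, 500
H = random_rate_matrix(n, rng)
p = random_distribution(n, rng)
q = random_distribution(n, rng)

# Claim 1: I(p(t)|q(t)) never increases when both evolve by the master equation.
history = []
for _ in range(steps):
    history.append(relative_information(p, q))
    p, q = euler_step(H, p, dt), euler_step(H, q, dt)
never_increases = all(b <= a + 1e-12 for a, b in zip(history, history[1:]))

# Claim 2: for a Boltzmann distribution q_eq, I(p|q_eq) = (F(p) - F(q_eq)) / kT.
E, kT = [0.0, 1.0, 2.5, 0.7], 1.3
q_eq = boltzmann(E, kT)
p0 = random_distribution(n, rng)
lhs = relative_information(p0, q_eq)
rhs = (free_energy(p0, E, kT) - free_energy(q_eq, E, kT)) / kT
```

The second check does not involve any dynamics: it only confirms the identity relating relative information to the free energy difference, under the free energy convention stated above.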