Information Geometry (Part 16)

1 February, 2017

This week I’m giving a talk on biology and information:

• John Baez, Biology as information dynamics, talk for Biological Complexity: Can it be Quantified?, a workshop at the Beyond Center, 2 February 2017.

While preparing this talk, I discovered a cool fact. I doubt it’s new, but I haven’t exactly seen it elsewhere. I came up with it while trying to give a precise and general statement of ‘Fisher’s fundamental theorem of natural selection’. I won’t start by explaining that theorem, since my version looks rather different than Fisher’s, and I came up with mine precisely because I had trouble understanding his. I’ll say a bit more about this at the end.

Here’s my version:

The square of the rate at which a population learns information is the variance of its fitness.

This is a nice advertisement for the virtues of diversity: more variance means faster learning. But it requires some explanation!

The setup

Let’s start by assuming we have n different kinds of self-replicating entities with populations P_1, \dots, P_n. As usual, these could be all sorts of things:

• molecules of different chemicals
• organisms belonging to different species
• different alleles of a gene
• restaurants belonging to different chains
• people with different beliefs
• game-players with different strategies
• etc.

I’ll call them replicators of different species.

Let’s suppose each population P_i is a function of time that grows at a rate equal to this population times its ‘fitness’. I explained the resulting equation back in Part 9, but it’s pretty simple:

\displaystyle{ \frac{d}{d t} P_i(t) = f_i(P_1(t), \dots, P_n(t)) \, P_i(t)   }

Here f_i is a completely arbitrary smooth function of all the populations! We call it the fitness of the ith species.

This equation is important, so we want a short way to write it. I’ll often write f_i(P_1(t), \dots, P_n(t)) simply as f_i, and P_i(t) simply as P_i. With these abbreviations, which any red-blooded physicist would take for granted, our equation becomes simply this:

\displaystyle{ \frac{dP_i}{d t}  = f_i \, P_i   }

Next, let p_i(t) be the probability that a randomly chosen replicator is of the ith species:

\displaystyle{ p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)} }

Starting from our equation describing how the populations evolve, we can figure out how these probabilities evolve. The answer is called the replicator equation:

\displaystyle{ \frac{d}{d t} p_i(t)  = ( f_i - \langle f \rangle ) \, p_i(t) }

Here \langle f \rangle is the average fitness of all the replicators, or mean fitness:

\displaystyle{ \langle f \rangle = \sum_j f_j(P_1(t), \dots, P_n(t)) \, p_j(t)  }

In what follows I’ll abbreviate the replicator equation as follows:

\displaystyle{ \frac{dp_i}{d t}  = ( f_i - \langle f \rangle ) \, p_i }
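By the way, it's worth seeing where the replicator equation comes from: it's a one-line computation with the quotient rule. Writing \dot{P}_i for dP_i/dt = f_i P_i, we get

\displaystyle{ \frac{dp_i}{dt} = \frac{\dot{P}_i}{\sum_j P_j} - \frac{P_i \sum_j \dot{P}_j}{\left(\sum_j P_j\right)^2} = f_i \, p_i - p_i \sum_j f_j \, p_j = ( f_i - \langle f \rangle ) \, p_i }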

The result

Okay, now let’s figure out how fast the probability distribution

p(t) = (p_1(t), \dots, p_n(t))

changes with time. For this we need to choose a way to measure the length of the vector

\displaystyle{  \frac{dp}{dt} = (\frac{d}{dt} p_1(t), \dots, \frac{d}{dt} p_n(t)) }

And here information geometry comes to the rescue! We can use the Fisher information metric, which is a Riemannian metric on the space of probability distributions.

I’ve talked about the Fisher information metric in many ways in this series. The most important fact is that as a probability distribution p(t) changes with time, its speed

\displaystyle{  \left\| \frac{dp}{dt} \right\|}

as measured using the Fisher information metric can be seen as the rate at which information is learned. I’ll explain that later. Right now I just want a simple formula for the Fisher information metric. Suppose v and w are two tangent vectors to the point p in the space of probability distributions. Then the Fisher information metric is given as follows:

\displaystyle{ \langle v, w \rangle = \sum_i \frac{1}{p_i} \, v_i w_i }

Using this we can calculate the speed at which p(t) moves when it obeys the replicator equation. Actually the square of the speed is simpler:

\begin{array}{ccl}  \displaystyle{ \left\| \frac{dp}{dt}  \right\|^2 } &=& \displaystyle{ \sum_i \frac{1}{p_i} \left( \frac{dp_i}{dt} \right)^2 } \\ \\  &=& \displaystyle{ \sum_i \frac{1}{p_i} \left( ( f_i - \langle f \rangle ) \, p_i \right)^2 } \\ \\  &=& \displaystyle{ \sum_i  ( f_i - \langle f \rangle )^2 p_i }   \end{array}

The answer has a nice meaning, too! It’s just the variance of the fitness: that is, the square of its standard deviation.

So, if you’re willing to buy my claim that the speed \|dp/dt\| is the rate at which our population learns new information, then we’ve seen that the square of the rate at which a population learns information is the variance of its fitness!
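If you want to check this numerically, here's a minimal sketch in Python. The fitness functions below are a hypothetical Lotka–Volterra-style choice, picked purely for illustration; any smooth functions f_i would do. It integrates the population equation, then compares the squared Fisher speed of p with the variance of the fitness:

import numpy as np

# Hypothetical interaction matrix giving fitnesses f_i(P) = sum_j A_ij P_j.
A = np.array([[ 0.0, -1.0,  0.5],
              [ 1.0,  0.0, -1.0],
              [-0.5,  1.0,  0.0]])

def fitness(P):
    return A @ P

# Crude Euler integration of dP_i/dt = f_i P_i.
P = np.array([1.0, 2.0, 3.0])
dt = 1e-4
for _ in range(10000):
    P += dt * fitness(P) * P

f = fitness(P)
p = P / P.sum()                  # the probability distribution p_i
mean_f = p @ f                   # the mean fitness <f>
dp_dt = (f - mean_f) * p         # the replicator equation
speed_sq = np.sum(dp_dt**2 / p)  # squared Fisher speed ||dp/dt||^2
var_f = p @ (f - mean_f)**2      # variance of the fitness

print(speed_sq, var_f)           # equal, up to rounding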

Fisher’s fundamental theorem

Now, how is this related to Fisher’s fundamental theorem of natural selection? First of all, what is Fisher’s fundamental theorem? Here’s what Wikipedia says about it:

It uses some mathematical notation but is not a theorem in the mathematical sense.

It states:

“The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.”

Or in more modern terminology:

“The rate of increase in the mean fitness of any organism at any time ascribable to natural selection acting through changes in gene frequencies is exactly equal to its genetic variance in fitness at that time”.

Largely as a result of Fisher’s feud with the American geneticist Sewall Wright about adaptive landscapes, the theorem was widely misunderstood to mean that the average fitness of a population would always increase, even though models showed this not to be the case. In 1972, George R. Price showed that Fisher’s theorem was indeed correct (and that Fisher’s proof was also correct, given a typo or two), but did not find it to be of great significance. The sophistication that Price pointed out, and that had made understanding difficult, is that the theorem gives a formula for part of the change in gene frequency, and not for all of it. This is a part that can be said to be due to natural selection.

Price’s paper is here:

• George R. Price, Fisher’s ‘fundamental theorem’ made clear, Annals of Human Genetics 36 (1972), 129–140.

I don’t find it very clear, perhaps because I didn’t spend enough time on it. But I think I get the idea.

My result is a theorem in the mathematical sense, though quite an easy one. I assume a population distribution evolves according to the replicator equation and derive an equation whose right-hand side matches that of Fisher’s original equation: the variance of the fitness.

But my left-hand side is different: it’s the square of the speed of the corresponding probability distribution, where speed is measured using the ‘Fisher information metric’. This metric was discovered by the same guy, Ronald Fisher, but I don’t think he used it in his work on the fundamental theorem!

Something a bit similar to my statement appears as Theorem 2 of this paper:

• Marc Harper, Information geometry and evolutionary game theory.

and for that theorem he cites:

• Josef Hofbauer and Karl Sigmund, Evolutionary Games and Population Dynamics, Cambridge University Press, Cambridge, 1998.

However, his Theorem 2 really concerns the rate of increase of fitness, like Fisher’s fundamental theorem. Moreover, he assumes that the probability distribution p(t) flows along the gradient of a function, and I’m not assuming that. Indeed, my version applies to situations where the probability distribution moves round and round in periodic orbits!

Relative information and the Fisher information metric

The key to generalizing Fisher’s fundamental theorem is thus to focus on the speed at which p(t) moves, rather than the increase in fitness. Why do I call this speed the ‘rate at which the population learns information’? It’s because we’re measuring this speed using the Fisher information metric, which is closely connected to relative information, also known as relative entropy or the Kullback–Leibler divergence.

I explained this back in Part 7, but that explanation seems hopelessly technical to me now, so here’s a faster one, which I created while preparing my talk.

The information of a probability distribution q relative to a probability distribution p is

\displaystyle{     I(q,p) = \sum_{i =1}^n q_i \log\left(\frac{q_i}{p_i}\right) }

It says how much information you learn if you start with a hypothesis p saying that the probability of the ith situation is p_i, and then update this to a new hypothesis q.

Now suppose you have a hypothesis that’s changing with time in a smooth way, given by a time-dependent probability p(t). Then a calculation shows that

\displaystyle{ \left.\frac{d}{dt} I(p(t),p(t_0)) \right|_{t = t_0} = 0 }

for all times t_0. This seems paradoxical at first. I like to jokingly put it this way:

To first order, you’re never learning anything.

However, as long as the velocity \frac{d}{dt}p(t_0) is nonzero, we have

\displaystyle{ \left.\frac{d^2}{dt^2} I(p(t),p(t_0)) \right|_{t = t_0} > 0 }

so we can say

To second order, you’re always learning something… unless your opinions are fixed.

This lets us define a ‘rate of learning’—that is, a ‘speed’ at which the probability distribution p(t) moves. And this is precisely the speed given by the Fisher information metric!

In other words:

\displaystyle{ \left\|\frac{dp}{dt}(t_0)\right\|^2 =  \left.\frac{d^2}{dt^2} I(p(t),p(t_0)) \right|_{t = t_0} }

where the length is given by the Fisher information metric. Indeed, this formula can be used to define the Fisher information metric. From this definition we can easily work out the concrete formula I gave earlier.
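Here's a quick numerical sanity check of these facts, again a sketch in Python; the curve p(t) below is an arbitrary hypothetical path in the space of probability distributions, invented just for the test:

import numpy as np

def I(q, p):  # information of q relative to p
    return np.sum(q * np.log(q / p))

def path(t):  # an arbitrary smooth curve of probability distributions
    w = np.array([1.0 + 0.3*np.sin(t), 1.0 + 0.2*t, 1.0 + 0.1*t**2])
    return w / w.sum()

t0, h = 0.5, 1e-4
p0 = path(t0)

# First derivative of I(p(t), p(t0)) at t = t0: vanishes.
dI = (I(path(t0 + h), p0) - I(path(t0 - h), p0)) / (2*h)

# Second derivative at t = t0, by central differences.
d2I = (I(path(t0 + h), p0) - 2*I(p0, p0) + I(path(t0 - h), p0)) / h**2

# Squared Fisher norm of the velocity dp/dt at t0.
v = (path(t0 + h) - path(t0 - h)) / (2*h)
fisher_sq = np.sum(v**2 / p0)

print(dI, d2I, fisher_sq)  # dI is ~0; d2I matches fisher_sq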

In summary: as a probability distribution moves around, the relative information between the new probability distribution and the original one grows approximately as the square of time, not linearly. So, to talk about a ‘rate at which information is learned’, we need to use the above formula, involving a second time derivative. This rate is just the speed at which the probability distribution moves, measured using the Fisher information metric. And when we have a probability distribution describing how many replicators are of different species, and it’s evolving according to the replicator equation, this speed is also just the variance of the fitness!


Biology as Information Dynamics

31 January, 2017

This is my talk for the workshop Biological Complexity: Can It Be Quantified?

• John Baez, Biology as information dynamics, 2 February 2017.

Abstract. If biology is the study of self-replicating entities, and we want to understand the role of information, it makes sense to see how information theory is connected to the ‘replicator equation’—a simple model of population dynamics for self-replicating entities. The relevant concept of information turns out to be the information of one probability distribution relative to another, also known as the Kullback–Leibler divergence. Using this we can get a new outlook on free energy, see evolution as a learning process, and give a clean general formulation of Fisher’s fundamental theorem of natural selection.

For more, read:

• Marc Harper, The replicator equation as an inference dynamic.

• Marc Harper, Information geometry and evolutionary game theory.

• Barry Sinervo and Curt M. Lively, The rock-paper-scissors game and the evolution of alternative male strategies, Nature 380 (1996), 240–243.

• John Baez, Diversity, entropy and thermodynamics.

• John Baez, Information geometry.

The last reference contains proofs of the equations shown in red in my slides.
In particular, Part 16 contains a proof of my updated version of Fisher’s fundamental theorem.


Quantifying Biological Complexity

23 January, 2017

Next week I’m going to this workshop:

Biological Complexity: Can It Be Quantified?, 1–3 February 2017, Beyond Center for Fundamental Concepts in Science, Arizona State University, Tempe, Arizona. Organized by Paul Davies.

I haven’t heard that any of it will be made publicly available, but I’ll see if there’s something I can show you. Here’s the schedule:

Wednesday February 1st

9:00 – 9:30 am Paul Davies

Brief welcome address, outline of the subject and aims of the meeting

Session 1. Life: do we know it when we see it?

9:30 – 10:15 am: Chris McKay, “Mission to Enceladus”

10:15 – 10:45 am: Discussion

10:45 – 11:15 am: Tea/coffee break

11:15 – 12:00 pm: Kate Adamala, “Alive but not life”

12:00 – 12:30 pm: Discussion

12:30 – 2:00 pm: Lunch

Session 2. Quantifying life

2:00 – 2:45 pm: Lee Cronin, “The living and the dead: molecular signatures of life”

2:45 – 3:30 pm: Sara Walker, “Can we build a life meter?”

3:30 – 4:00 pm: Discussion

4:00 – 4:30 pm: Tea/coffee break

4:30 – 5:15 pm: Manfred Laubichler, “Complexity is smaller than you think”

5:15 – 5:30 pm: Discussion

The Beyond Annual Lecture

7:00 – 8:30 pm: Sean Carroll, “Our place in the universe”

Thursday February 2nd

Session 3: Life, information and the second law of thermodynamics

9:00 – 9:45 am: James Crutchfield, “Vital bits: the fuel of life”

9:45 – 10:00 am: Discussion

10:00 – 10:45 am: John Baez, “Information and entropy in biology”

10:45 – 11:00 am: Discussion

11:00 – 11:30 am: Tea/coffee break

11:30 – 12:15 pm: Chris Adami, “What is biological information?”

12:15 – 12:30 pm: Discussion

12:30 – 2:00 pm: Lunch

Session 4: The emergence of agency

2:00 – 2:45 pm: Olaf Khang Witkowski, “When do autonomous agents act collectively?”

2:45 – 3:00 pm: Discussion

3:00 – 3:45 pm: William Marshall, “When macro beats micro”

3:45 – 4:00 pm: Discussion

4:00 – 4:30 pm: Tea/coffee break

4:30 – 5:15pm: Alexander Boyd, “Biology’s demons”

5:15 – 5:30 pm: Discussion

Friday February 3rd

Session 5: New physics?

9:00 – 9:45 am: Sean Carroll, “Laws of complexity, laws of life?”

9:45 – 10:00 am: Discussion

10:00 – 10:45 am: Andreas Wagner, “The arrival of the fittest”

10:45 – 11:00 am: Discussion

11:00 – 11:30 am: Tea/coffee break

11:30 – 12:30 pm: George Ellis, “Top-down causation demands new laws”

12:30 – 2:00 pm: Lunch


Algorithmic Thermodynamics (Part 3)

15 November, 2016


This is my talk for the Santa Fe Institute workshop on Statistical Mechanics, Information Processing and Biology:

Computation and thermodynamics.

It’s about the link between computation and entropy. I take the idea of a Turing machine for granted, but starting with that I explain recursive functions, the Church–Turing thesis, Kolmogorov complexity, the relation between Kolmogorov complexity and Shannon entropy, the uncomputability of Kolmogorov complexity, the ‘complexity barrier’, Levin’s computable version of complexity, and finally my work with Mike Stay on algorithmic thermodynamics.

In my talk slides I mention the ‘complexity barrier’, and state this theorem:

Theorem. Choose your favorite set of axioms for math. If it’s finite and consistent, there exists C ≥ 0, the complexity barrier, such that for no natural number n can you prove the Kolmogorov complexity of n exceeds C.

For a sketch of the proof of this result, go here:

Chaitin’s incompleteness theorem.

In my talk I showed a movie related to this: an animated video created in 2009 using a program less than 4 kilobytes long that runs on a Windows XP machine.

For more

For more details, read our paper:

• John Baez and Mike Stay, Algorithmic thermodynamics, Math. Struct. Comp. Sci. 22 (2012), 771–787.

or these blog articles:

Algorithmic thermodynamics (part 1).

Algorithmic thermodynamics (part 2).

They all emphasize slightly different aspects!


Information Processing and Biology

7 November, 2016

The Santa Fe Institute, in New Mexico, is a place for studying complex systems. I’ve never been there! Next week I’ll go there to give a colloquium on network theory, and also to participate in this workshop:

Statistical Mechanics, Information Processing and Biology, November 16–18, Santa Fe Institute. Organized by David Krakauer, Michael Lachmann, Manfred Laubichler, Peter Stadler, and David Wolpert.

Abstract. This workshop will address a fundamental question in theoretical biology: Does the relationship between statistical physics and the need of biological systems to process information underpin some of their deepest features? It recognizes that a core feature of biological systems is that they acquire, store and process information (i.e., perform computation). However to manipulate information in this way they require a steady flux of free energy from their environments. These two, inter-related attributes of biological systems are often taken for granted; they are not part of standard analyses of either the homeostasis or the evolution of biological systems. In this workshop we aim to fill in this major gap in our understanding of biological systems, by gaining deeper insight in the relation between the need for biological systems to process information and the free energy they need to pay for that processing.

The goal of this workshop is to address these issues by focusing on a set of three specific questions: 1) How has the fraction of free energy flux on earth that is used by biological computation changed with time? 2) What is the free energy cost of biological computation or functioning? 3) What is the free energy cost of the evolution of biological computation or functioning? In all of these cases we are interested in the fundamental limits that the laws of physics impose on various aspects of living systems as expressed by these three questions.

I think it’s not open to the public, but I will try to blog about it. The speakers include a lot of experts on information theory, statistical mechanics, and biology. Here they are:

Wednesday November 16: Chris Jarzynski, Seth Lloyd, Artemy Kolchinsky, John Baez, Manfred Laubichler, Harold de Vladar, Sonja Prohaska, Chris Kempes.

Thursday November 17: Phil Ball, Matina C. Donaldson-Matasci, Sebastian Deffner, David Wolpert, Daniel Polani, Christoph Flamm, Massimiliano Esposito, Hildegard Meyer-Ortmanns, Blake Pollard, Mikhail Prokopenko, Peter Stadler, Ben Machta.

Friday November 18: Jim Crutchfield, Sara Walker, Hyunju Kim, Takahiro Sagawa, Michael Lachmann, Wojciech Zurek, Christian Van den Broeck, Susanne Still, Chris Stephens.


Information Geometry (Part 15)

14 January, 2016

joint with Blake Pollard

Lately we’ve been thinking about open Markov processes. These are random processes where something can hop randomly from one state to another (that’s the ‘Markov process’ part) but also enter or leave the system (that’s the ‘open’ part).

The ultimate goal is to understand the nonequilibrium thermodynamics of open systems—systems where energy and maybe matter flows in and out. If we could understand this well enough, we could understand in detail how life works. That’s a difficult job! But one has to start somewhere, and this is one place to start.

We have a few papers on this subject:

• Blake Pollard, A Second Law for open Markov processes. (Blog article here.)

• John Baez, Brendan Fong and Blake Pollard, A compositional framework for Markov processes. (Blog article here.)

• Blake Pollard, Open Markov processes: A compositional perspective on non-equilibrium steady states in biology. (Blog article here.)

However, right now we just want to show you three closely connected results about how relative entropy changes in open Markov processes.

Definitions

An open Markov process consists of a finite set X of states, a subset B \subseteq X of boundary states, and an infinitesimal stochastic operator H: \mathbb{R}^X \to \mathbb{R}^X, meaning a linear operator with

H_{ij} \geq 0 \ \  \text{for all} \ \ i \neq j

and

\sum_i H_{ij} = 0 \ \  \text{for all} \ \ j

For each state i \in X we introduce a population p_i  \in [0,\infty). We call the resulting function p : X \to [0,\infty) the population distribution.

Populations evolve in time according to the open master equation:

\displaystyle{ \frac{dp_i}{dt} = \sum_j H_{ij}p_j} \ \  \text{for all} \ \  i \in X-B

p_i(t) = b_i(t) \ \  \text{for all} \ \  i \in B

So, the populations p_i obey a linear differential equation at states i that are not in the boundary, but they are specified ‘by the user’ to be chosen functions b_i at the boundary states. The off-diagonal entry H_{ij} for i \neq j describes the rate at which population flows from the jth state to the ith state.
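Here's a minimal simulation sketch in Python, just to make the definitions concrete. The operator H and the boundary function b_0(t) are hypothetical choices; state 0 is the lone boundary state, clamped by the user, while states 1 and 2 obey the master equation:

import numpy as np

# A hypothetical infinitesimal stochastic operator on 3 states:
# nonnegative off-diagonal entries, columns summing to zero.
H = np.array([[-1.0,  0.5,  0.2],
              [ 0.7, -0.5,  0.3],
              [ 0.3,  0.0, -0.5]])

boundary = [0]      # B: populations here are set by the user
interior = [1, 2]   # X - B: populations here obey the master equation

def b(t):           # hypothetical boundary population b_0(t)
    return 2.0 + np.sin(t)

p = np.array([b(0.0), 1.0, 1.0])
dt, steps = 1e-3, 10000
for n in range(steps):
    dp = H @ p                         # master equation rates
    p[interior] += dt * dp[interior]   # evolve the interior states
    p[boundary] = b((n + 1) * dt)      # clamp the boundary states

print(p)  # note: total population is not conserved here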

A closed Markov process, or continuous-time discrete-state Markov chain, is an open Markov process whose boundary is empty. For a closed Markov process, the open master equation becomes the usual master equation:

\displaystyle{  \frac{dp}{dt} = Hp }

In a closed Markov process the total population is conserved:

\displaystyle{ \frac{d}{dt} \sum_{i \in X} p_i = \sum_{i,j} H_{ij}p_j = 0 }

This lets us normalize the initial total population to 1 and have it stay equal to 1. If we do this, we can talk about probabilities instead of populations. In an open Markov process, population can flow in and out at the boundary states.

For any pair of distinct states i,j, H_{ij}p_j is the flow of population from j to i. The net flux of population from the jth state to the ith state is the flow from j to i minus the flow from i to j:

J_{ij} = H_{ij}p_j - H_{ji}p_i

A steady state is a solution of the open master equation that does not change with time. A steady state for a closed Markov process is typically called an equilibrium. So, an equilibrium obeys the master equation at all states, while for a steady state this may not be true at the boundary states. The idea is that population can flow in or out at the boundary states.

We say an equilibrium p : X \to [0,\infty) of a Markov process is detailed balanced if all the net fluxes vanish:

J_{ij} = 0 \ \  \text{for all} \ \ i,j \in X

or in other words:

H_{ij}p_j = H_{ji}p_i \ \  \text{for all} \ \ i,j \in X

Given two population distributions p, q : X \to [0,\infty) we can define the relative entropy

\displaystyle{  I(p,q) = \sum_i p_i \ln \left( \frac{p_i}{q_i} \right)}

When q is a detailed balanced equilibrium solution of the master equation, the relative entropy can be seen as the ‘free energy’ of p. For a precise statement, see Section 4 of Relative entropy in biological systems.

The Second Law of Thermodynamics implies that the free energy of a closed system tends to decrease with time, so for closed Markov processes we expect I(p,q) to be nonincreasing. And this is true! But for open Markov processes, free energy can flow in from outside. This is just one of several nice results about how relative entropy changes with time.
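Before stating these results, here's a numerical sketch of the closed case in Python. A convenient hypothetical way to build a detailed balanced example: pick q and symmetric ‘conductances’ g_{ij} = g_{ji}, and set H_{ij} = g_{ij} q_i for i \neq j, so that H_{ij} q_j = H_{ji} q_i automatically. Then I(p(t), q) should decrease along the master equation:

import numpy as np

q = np.array([0.2, 0.3, 0.5])          # desired detailed balanced equilibrium
g = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 0.5],
              [2.0, 0.5, 0.0]])        # symmetric rates, zero diagonal
H = g * q[:, None]                     # H_ij = g_ij q_i for i != j
np.fill_diagonal(H, -H.sum(axis=0))    # make columns sum to zero

def relative_entropy(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
dt = 1e-3
values = []
for _ in range(5000):
    p += dt * (H @ p)                  # master equation dp/dt = Hp
    values.append(relative_entropy(p, q))

print(values[0], values[-1])           # decays toward 0
assert all(a >= b for a, b in zip(values, values[1:]))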

Results

Theorem 1. Consider an open Markov process with X as its set of states and B as the set of boundary states. Suppose p(t) and q(t) obey the open master equation, and let the quantities

\displaystyle{ \frac{Dp_i}{Dt} = \frac{dp_i}{dt} - \sum_{j \in X} H_{ij}p_j }

\displaystyle{  \frac{Dq_i}{Dt} = \frac{dq_i}{dt} - \sum_{j \in X} H_{ij}q_j }

measure how much the time derivatives of p_i and q_i fail to obey the master equation. Then we have

\begin{array}{ccl}   \displaystyle{  \frac{d}{dt}  I(p(t),q(t)) } &=& \displaystyle{ \sum_{i, j \in X} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right)} \\ \\  && \; + \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} +  \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt}  }  \end{array}

This result separates the rate of change of relative entropy into two parts: an ‘internal’ part and a ‘boundary’ part.

It turns out the ‘internal’ part is always less than or equal to zero. So, from Theorem 1 we can deduce a version of the Second Law of Thermodynamics for open Markov processes:

Theorem 2. Given the conditions of Theorem 1, we have

\displaystyle{  \frac{d}{dt}  I(p(t),q(t)) \; \le \;   \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} +  \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt}   }

Intuitively, this says that free energy can only increase if it comes in from the boundary!

There is another nice result that holds when q is an equilibrium solution of the master equation. This idea seems to go back to Schnakenberg:

Theorem 3. Given the conditions of Theorem 1, suppose also that q is an equilibrium solution of the master equation. Then we have

\displaystyle{  \frac{d}{dt}  I(p(t),q) =  -\frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} \; + \; \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt}  }

where

J_{ij} = H_{ij}p_j - H_{ji}p_i

is the net flux from j to i, while

\displaystyle{ A_{ij} = \ln \left(\frac{p_j q_i}{p_i q_j} \right) }

is the conjugate thermodynamic force.

The flux J_{ij} has a nice meaning: it’s the net flow of population from j to i. The thermodynamic force is a bit subtler, but this theorem reveals its meaning: it says how much the population wants to flow from j to i.

More precisely, up to that factor of 1/2, the thermodynamic force A_{ij} says how much free energy loss is caused by net flux from j to i. There’s a nice analogy here to water losing potential energy as it flows downhill due to the force of gravity.
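All three theorems are easy to test numerically before proving them. Here's a sketch in Python, using the same hypothetical detailed balanced construction as in the closed example above (H_{ij} = g_{ij} q_i with g symmetric). It checks that the ‘internal’ term is nonpositive, and that it equals -\frac{1}{2} \sum_{i,j} J_{ij} A_{ij} when q is an equilibrium:

import numpy as np

q = np.array([0.2, 0.3, 0.5])
g = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 0.5],
              [2.0, 0.5, 0.0]])
H = g * q[:, None]
np.fill_diagonal(H, -H.sum(axis=0))    # q is a detailed balanced equilibrium

p = np.array([0.5, 0.4, 0.1])          # an arbitrary population distribution
n = len(p)

# The 'internal' term from Theorem 1.
internal = sum(H[i, j] * p[j] * (np.log(p[i]/q[i]) - p[i]*q[j]/(p[j]*q[i]))
               for i in range(n) for j in range(n))

# Schnakenberg's fluxes and conjugate forces from Theorem 3.
J = np.array([[H[i, j]*p[j] - H[j, i]*p[i] for j in range(n)] for i in range(n)])
A = np.array([[np.log(p[j]*q[i] / (p[i]*q[j])) for j in range(n)] for i in range(n)])

print(internal)               # nonpositive, as Theorem 2 demands
print(-0.5 * np.sum(J * A))   # equals the internal term, as in Theorem 3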

Proofs

Proof of Theorem 1. We begin by taking the time derivative of the relative information:

\begin{array}{ccl} \displaystyle{ \frac{d}{dt}  I(p(t),q(t)) } &=&   \displaystyle{  \sum_{i \in X} \frac{\partial I}{\partial p_i} \frac{dp_i}{dt} +  \frac{\partial I}{\partial q_i} \frac{dq_i}{dt} }  \end{array}

We can separate this into a sum over states i \in X - B, for which the time derivatives of p_i and q_i are given by the master equation, and boundary states i \in B, for which they are not:

\begin{array}{ccl} \displaystyle{ \frac{d}{dt}  I(p(t),q(t)) } &=&   \displaystyle{  \sum_{i \in X-B, \; j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j +                                                \frac{\partial I}{\partial q_i} H_{ij} q_j }\\  \\    && + \; \; \; \displaystyle{  \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{dp_i}{dt} +  \frac{\partial I}{\partial q_i} \frac{dq_i}{dt}}    \end{array}

For boundary states we have

\displaystyle{ \frac{dp_i}{dt} = \frac{Dp_i}{Dt} + \sum_{j \in X} H_{ij}p_j }

and similarly for the time derivative of q_i. We thus obtain

\begin{array}{ccl}   \displaystyle{ \frac{d}{dt}  I(p(t),q(t)) } &=&   \displaystyle{  \sum_{i,j \in X} \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j }\\  \\  && + \; \; \displaystyle{  \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} +  \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt}}    \end{array}

To evaluate the first sum, recall that

\displaystyle{   I(p,q) = \sum_{i \in X} p_i \ln (\frac{p_i}{q_i})}

so

\displaystyle{\frac{\partial I}{\partial p_i}} =\displaystyle{1 +  \ln (\frac{p_i}{q_i})} ,  \qquad  \displaystyle{ \frac{\partial I}{\partial q_i}}=  \displaystyle{- \frac{p_i}{q_i}   }

Thus, we have

\displaystyle{ \sum_{i,j \in X}  \frac{\partial I}{\partial p_i} H_{ij} p_j + \frac{\partial I}{\partial q_i} H_{ij} q_j  =    \sum_{i,j\in X} (1 +  \ln (\frac{p_i}{q_i})) H_{ij} p_j - \frac{p_i}{q_i} H_{ij} q_j }

We can rewrite this as

\displaystyle{   \sum_{i,j \in X} H_{ij} p_j  \left( 1 + \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) }

Since H is infinitesimal stochastic we have \sum_{i} H_{ij} = 0, so the first term drops out, and we are left with

\displaystyle{   \sum_{i,j \in X} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) }

as desired.   █

Proof of Theorem 2. Thanks to Theorem 1, to prove

\displaystyle{  \frac{d}{dt}  I(p(t),q(t)) \; \le \;   \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt} +  \frac{\partial I}{\partial q_i} \frac{Dq_i}{Dt}   }

it suffices to show that

\displaystyle{   \sum_{i,j \in X} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) \le 0  }

or equivalently (recalling the proof of Theorem 1):

\displaystyle{ \sum_{i,j} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) \le 0 }

The last two terms on the left hand side cancel when i = j. Thus, if we break the sum into an i \ne j part and an i = j part, the left side becomes

\displaystyle{   \sum_{i \ne j} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) \; + \; \sum_j H_{jj} p_j \ln(\frac{p_j}{q_j}) }

Next we can use the infinitesimal stochastic property of H to write H_{jj} as the sum of -H_{ij} over i not equal to j, obtaining

\displaystyle{ \sum_{i \ne j} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) - \sum_{i \ne j} H_{ij} p_j \ln(\frac{p_j}{q_j}) } =

\displaystyle{ \sum_{i \ne j} H_{ij} p_j  \left( \ln(\frac{p_iq_j}{p_j q_i}) + 1 - \frac{p_i q_j}{p_j q_i} \right) }

Since H_{ij} \ge 0 when i \ne j and \ln(s) + 1 - s \le 0 for all s > 0, we conclude that this quantity is \le 0.   █

Proof of Theorem 3. Now suppose also that q is an equilibrium solution of the master equation. Then Dq_i/Dt = dq_i/dt = 0 for all states i, so by Theorem 1 we need to show

\displaystyle{ \sum_{i, j \in X} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right)  \; = \;  -\frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} }

We also have \sum_{j \in X} H_{ij} q_j = 0, so the second
term in the sum at left vanishes, and it suffices to show

\displaystyle{  \sum_{i, j \in X} H_{ij} p_j  \ln(\frac{p_i}{q_i}) \; = \;  - \frac{1}{2} \sum_{i,j \in X} J_{ij} A_{ij} }

By definition we have

\displaystyle{  \frac{1}{2} \sum_{i,j} J_{ij} A_{ij}}  =  \displaystyle{  \frac{1}{2} \sum_{i,j}  \left( H_{ij} p_j - H_{ji}p_i \right)    \ln \left( \frac{p_j q_i}{p_i q_j} \right) }

This in turn equals

\displaystyle{  \frac{1}{2} \sum_{i,j}  H_{ij}p_j     \ln \left( \frac{p_j q_i}{p_i q_j} \right) -   \frac{1}{2} \sum_{i,j}  H_{ji}p_i  \ln \left( \frac{p_j q_i}{p_i q_j} \right) }

and we can switch the dummy indices i,j in the second sum, obtaining

\displaystyle{  \frac{1}{2} \sum_{i,j}  H_{ij}p_j     \ln \left( \frac{p_j q_i}{p_i q_j} \right) -   \frac{1}{2} \sum_{i,j}  H_{ij}p_j     \ln \left( \frac{p_i q_j}{p_j q_i} \right) }

or simply

\displaystyle{ \sum_{i,j} H_{ij} p_j \ln \left( \frac{p_j q_i}{p_i q_j} \right) }

But this is

\displaystyle{  \sum_{i,j} H_{ij} p_j \left(\ln ( \frac{p_j}{q_j}) + \ln (\frac{q_i}{p_i}) \right) }

and the first term vanishes because H is infinitesimal stochastic: \sum_i H_{ij} = 0. We thus have

\displaystyle{  \frac{1}{2} \sum_{i,j} J_{ij} A_{ij}} = \sum_{i,j} H_{ij} p_j  \ln (\frac{q_i}{p_i} )

as desired.   █


Information Geometry (Part 14)

11 January, 2016

joint with Blake Pollard

It’s been a long time since you’ve seen an installment of the information geometry series on this blog! Before I took a long break, I was explaining relative entropy and how it changes in evolutionary games. Much of what I said is summarized and carried further here:

• John Baez and Blake Pollard, Relative entropy in biological systems, Entropy 18 (2016), 46. (Blog article here.)

But now Blake has a new paper, and I want to talk about that:

• Blake Pollard, Open Markov processes: a compositional perspective on non-equilibrium steady states in biology, Entropy 18 (2016), 140.

I’ll focus on just one aspect: the principle of minimum entropy production. This is an exciting yet controversial principle in non-equilibrium thermodynamics. Blake examines it in a situation where we can tell exactly what’s happening.

Non-equilibrium steady states

Life exists away from equilibrium. Left isolated, systems will tend toward thermodynamic equilibrium. However, biology is about open systems: physical systems that exchange matter or energy with their surroundings. Open systems can be maintained away from equilibrium by this exchange. This leads to the idea of a non-equilibrium steady state—a state of an open system that doesn’t change, but is not in equilibrium.

A simple example is a pan of water sitting on a stove. Heat passes from the flame to the water and then to the air above. If the flame is very low, the water doesn’t boil and nothing moves. So, we have a steady state, at least approximately. But this is not an equilibrium, because there is a constant flow of energy through the water.

Of course in reality the water will be slowly evaporating, so we don’t really have a steady state. As always, models are approximations. If the water is evaporating slowly enough, it can be useful to approximate the situation with a non-equilibrium steady state.

There is much more to biology than steady states. However, to dip our toe into the chilly waters of non-equilibrium thermodynamics, it is nice to start with steady states. And already here there are puzzles left to solve.

Minimum entropy production

Ilya Prigogine won the Nobel prize for his work on non-equilibrium thermodynamics. One reason is that he had an interesting idea about steady states. He claimed that under certain conditions, a non-equilibrium steady state will minimize entropy production!

There has been a lot of work trying to make the ‘principle of minimum entropy production’ precise and turn it into a theorem. In this book:

• G. Lebon and D. Jou, Understanding Non-equilibrium Thermodynamics, Springer, Berlin, 2008.

the authors give an argument for the principle of minimum entropy production based on four conditions:

• time-independent boundary conditions: the surroundings of the system don’t change with time.

• linear phenomenological laws: the laws governing the macroscopic behavior of the system are linear.

• constant phenomenological coefficients: the laws governing the macroscopic behavior of the system don’t change with time.

• symmetry of the phenomenological coefficients: since they are linear, the laws governing the macroscopic behavior of the system can be described by a linear operator, and we demand that in a suitable basis the matrix for this operator is symmetric: T_{ij} = T_{ji}.

The last condition is obviously the subtlest one; it’s sometimes called Onsager reciprocity, and people have spent a lot of time trying to derive it from other conditions.

However, Blake goes in a different direction. He considers a concrete class of open systems, a very large class called ‘open Markov processes’. These systems obey the first three conditions listed above, and the ‘detailed balanced’ open Markov processes also obey the last one. But Blake shows that minimum entropy production holds only approximately—with the approximation being good for steady states that are near equilibrium!

However, he shows that another minimum principle holds exactly, even for steady states that are far from equilibrium. He calls this the ‘principle of minimum dissipation’.

We actually discussed the principle of minimum dissipation in an earlier paper:

• John Baez, Brendan Fong and Blake Pollard, A compositional framework for Markov processes. (Blog article here.)

But one advantage of Blake’s new paper is that it presents the results with a minimum of category theory. Of course I love category theory, and I think it’s the right way to formalize open systems, but it can be intimidating.

Another good thing about Blake’s new paper is that it explicitly compares the principle of minimum entropy production to the principle of minimum dissipation. He shows they agree in a certain limit—namely, the limit where the system is close to equilibrium.

Let me explain this. I won’t include the nice example from biology that Blake discusses: a very simple model of membrane transport. For that, read his paper! I’ll just give the general results.

The principle of minimum dissipation

An open Markov process consists of a finite set X of states, a subset B \subseteq X of boundary states, and an infinitesimal stochastic operator H: \mathbb{R}^X \to \mathbb{R}^X, meaning a linear operator with

H_{ij} \geq 0 \ \  \text{for all} \ \ i \neq j

and

\sum_i H_{ij} = 0 \ \  \text{for all} \ \ j

I’ll explain these two conditions in a minute.

For each i \in X we introduce a population p_i  \in [0,\infty). We call the resulting function p : X \to [0,\infty) the population distribution. Populations evolve in time according to the open master equation:

\displaystyle{ \frac{dp_i}{dt} = \sum_j H_{ij}p_j} \ \  \text{for all} \ \ i \in X-B

p_i(t) = b_i(t) \ \  \text{for all} \ \ i \in B

So, the populations p_i obey a linear differential equation at states i that are not in the boundary, but they are specified ‘by the user’ to be chosen functions b_i at the boundary states.

The off-diagonal entries H_{ij} with i \neq j are the rates at which population hops from the jth to the ith state. This lets us understand the definition of an infinitesimal stochastic operator. The first condition:

H_{ij} \geq 0 \ \  \text{for all} \ \ i \neq j

says that the rate for population to transition from one state to another is non-negative. The second:

\sum_i H_{ij} = 0 \ \  \text{for all} \ \ j

says that population is conserved, at least if there are no boundary states. Population can flow in or out at boundary states, since the master equation doesn’t hold there.

A steady state is a solution of the open master equation that does not change with time. A steady state for a closed Markov process is typically called an equilibrium. So, an equilibrium obeys the master equation at all states, while for a steady state this may not be true at the boundary states. Again, the reason is that population can flow in or out at the boundary.

We say an equilibrium q : X \to [0,\infty) of a Markov process is detailed balanced if the rate at which population flows from the ith state to the jth state is equal to the rate at which it flows from the jth state to the ith:

H_{ji}q_i = H_{ij}q_j \ \  \text{for all} \ \ i,j \in X

Suppose we’ve got an open Markov process that has a detailed balanced equilibrium q. Then a non-equilibrium steady state p will minimize a function called the ‘dissipation’, subject to constraints on its boundary populations. There’s a nice formula for the dissipation in terms of p and q.

Definition. Given an open Markov process with detailed balanced equilibrium q we define the dissipation for a population distribution p to be

\displaystyle{ D(p) = \frac{1}{2}\sum_{i,j} H_{ij}q_j \left( \frac{p_j}{q_j} - \frac{p_i}{q_i} \right)^2 }

This formula is a bit tricky, but you’ll notice it’s quadratic in p and it vanishes when p = q. So, it’s pretty nice.

Using this concept we can formulate a principle of minimum dissipation, and prove that non-equilibrium steady states obey this principle:

Definition. We say a population distribution p: X \to \mathbb{R} obeys the principle of minimum dissipation with boundary population b: X \to \mathbb{R} if p minimizes D(p) subject to the constraint that

p_i = b_i \ \  \text{for all} \ \ i \in B

Theorem 1. A population distribution p is a steady state with p_i = b_i for all boundary states i if and only if p obeys the principle of minimum dissipation with boundary population b.

Proof. This follows from Theorem 28 in A compositional framework for Markov processes.
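Here's a numerical sketch of this principle in Python. To get a detailed balanced example, pick q and symmetric rates g_{ij} = g_{ji} and set H_{ij} = g_{ij} q_i for i \neq j; this is a hypothetical construction chosen just for convenience. We clamp two boundary populations at incompatible ratios p_i/q_i, solve for the steady state at the one interior state, and check that moving the interior population away from the steady state only increases the dissipation:

import numpy as np

q = np.array([0.2, 0.3, 0.5])
g = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 0.5],
              [2.0, 0.5, 0.0]])
H = g * q[:, None]                     # detailed balanced: H_ij q_j = H_ji q_i
np.fill_diagonal(H, -H.sum(axis=0))

def D(p):                              # the dissipation
    r = p / q
    return 0.5 * np.sum(H * q[None, :] * (r[None, :] - r[:, None])**2)

# Boundary B = {0, 1}: clamp p_0 and p_1 at different ratios p_i/q_i,
# so the steady state is genuinely non-equilibrium. Interior: state 2.
b = np.array([0.7, 0.3])
p2 = -(H[2, 0]*b[0] + H[2, 1]*b[1]) / H[2, 2]   # solve dp_2/dt = 0
p_star = np.array([b[0], b[1], p2])

# Perturbing the interior population can only increase the dissipation.
for eps in [-0.1, -0.01, 0.01, 0.1]:
    p = p_star.copy()
    p[2] += eps
    assert D(p) > D(p_star)

print(p_star, D(p_star))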

Minimum entropy production versus minimum dissipation

How does dissipation compare with entropy production? To answer this, first we must ask: what really is entropy production? And: how does the equilibrium state q show up in the concept of entropy production?

The relative entropy of two population distributions p,q is given by

\displaystyle{ I(p,q) = \sum_i p_i \ln \left( \frac{p_i}{q_i} \right) }

It is well known that for a closed Markov process with q as a detailed balanced equilibrium, the relative entropy is monotonically decreasing with time. This is due to an annoying sign convention in the definition of relative entropy: while entropy is typically increasing, relative entropy typically decreases. We could fix this by putting a minus sign in the above formula or giving this quantity I(p,q) some other name. A lot of people call it the Kullback–Leibler divergence, but I have taken to calling it relative information. For more, see:

• John Baez and Blake Pollard, Relative entropy in biological systems. (Blog article here.)

We say ‘relative entropy’ in the title, but then we explain why ‘relative information’ is a better name, and use that. More importantly, we explain why I(p,q) has the physical meaning of free energy. Free energy tends to decrease, so everything is okay. For details, see Section 4.

Blake has a nice formula for how fast I(p,q) decreases:

Theorem 2. Consider an open Markov process with X as its set of states and B as the set of boundary states. Suppose p(t) obeys the open master equation and q is a detailed balanced equilibrium. For any boundary state i \in B, let

\displaystyle{ \frac{Dp_i}{Dt} = \frac{dp_i}{dt} - \sum_{j \in X} H_{ij}p_j }

measure how much p_i fails to obey the master equation. Then we have

\begin{array}{ccl}   \displaystyle{  \frac{d}{dt}  I(p(t),q) } &=& \displaystyle{ \sum_{i, j \in X} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right)} \\ \\  && \; + \; \displaystyle{ \sum_{i \in B} \frac{\partial I}{\partial p_i} \frac{Dp_i}{Dt}  }  \end{array}

Moreover, the first term is less than or equal to zero.

Proof. For a self-contained proof, see Information geometry (part 15), which is coming up soon. It will be a special case of the theorems there.   █

Blake compares this result to previous work by Schnakenberg:

• J. Schnakenberg, Network theory of microscopic and macroscopic behavior of master equation systems, Rev. Mod. Phys. 48 (1976), 571–585.

The negative of Blake’s first term is this:

\displaystyle{ K(p) = - \sum_{i, j \in X} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right) }

Under certain circumstances, this equals what Schnakenberg calls the entropy production. But a better name for this quantity might be free energy loss, since for a closed Markov process that’s exactly what it is! In this case there are no boundary states, so the theorem above says K(p) is the rate at which relative entropy—or in other words, free energy—decreases.

For an open Markov process, things are more complicated. The theorem above shows that free energy can also flow in or out at the boundary, thanks to the second term in the formula.

Anyway, the sensible thing is to compare a principle of ‘minimum free energy loss’ to the principle of minimum dissipation. The principle of minimum dissipation is true. How about the principle of minimum free energy loss? It turns out to be approximately true near equilibrium.

For this, consider the situation in which p is near to the equilibrium distribution q in the sense that

\displaystyle{ \frac{p_i}{q_i} = 1 + \epsilon_i }

for some small numbers \epsilon_i. We collect these numbers in a vector called \epsilon.

Theorem 3. Consider an open Markov process with X as its set of states and B as the set of boundary states. Suppose q is a detailed balanced equilibrium and let p be arbitrary. Then

K(p) = D(p) + O(\epsilon^2)

where K(p) is the free energy loss, D(p) is the dissipation, \epsilon_i is defined as above, and by O(\epsilon^2) we mean a sum of terms of order \epsilon_i^2.

Proof. First take the free energy loss:

\displaystyle{ K(p) = -\sum_{i, j \in X} H_{ij} p_j  \left( \ln(\frac{p_i}{q_i}) - \frac{p_i q_j}{p_j q_i} \right)}

Expanding the logarithm to first order in \epsilon, we get

\displaystyle{ K(p) =  -\sum_{i, j \in X} H_{ij} p_j  \left( \frac{p_i}{q_i} - 1 - \frac{p_i q_j}{p_j q_i} \right) + O(\epsilon^2) }

Since H is infinitesimal stochastic, \sum_i H_{ij} = 0, so the second term in the sum vanishes, leaving

\displaystyle{ K(p) =  -\sum_{i, j \in X} H_{ij} p_j  \left( \frac{p_i}{q_i} - \frac{p_i q_j}{p_j q_i} \right) \; + O(\epsilon^2) }

or

\displaystyle{ K(p) =  -\sum_{i, j \in X} \left( H_{ij} p_j  \frac{p_i}{q_i} - H_{ij} q_j \frac{p_i}{q_i} \right) \; + O(\epsilon^2) }

Since q is an equilibrium we have \sum_j H_{ij} q_j = 0, so now the last term in the sum vanishes, leaving

\displaystyle{ K(p) =  -\sum_{i, j \in X} H_{ij} \frac{p_i p_j}{q_i} \; + O(\epsilon^2) }

Next, take the dissipation

\displaystyle{ D(p) = \frac{1}{2}\sum_{i,j} H_{ij}q_j \left( \frac{p_j}{q_j} - \frac{p_i}{q_i} \right)^2 }

and expand the square, getting

\displaystyle{ D(p) = \frac{1}{2}\sum_{i,j} H_{ij}q_j \left( \frac{p_j^2}{q_j^2} - 2\frac{p_i p_j}{q_i q_j} +  \frac{p_i^2}{q_i^2} \right) }

Since H is infinitesimal stochastic, \sum_i H_{ij} = 0. The first term is just this times a function of j, summed over j, so it vanishes, leaving

\displaystyle{ D(p) = \frac{1}{2}\sum_{i,j} H_{ij}q_j \left(- 2\frac{p_i p_j}{q_i q_j} +  \frac{p_i^2}{q_i^2} \right) }

Since q is an equilibrium, \sum_j H_{ij} q_j = 0. The last term above is this times a function of i, summed over i, so it vanishes, leaving

\displaystyle{ D(p) = - \sum_{i,j} H_{ij}q_j  \frac{p_i p_j}{q_i q_j} = - \sum_{i,j} H_{ij} \frac{p_i p_j}{q_i}  }

This matches what we got for K(p), up to terms of order O(\epsilon^2).   █
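And here's a numerical check of this, in Python, using the same hypothetical detailed balanced construction as in the sketch above. As p approaches the equilibrium q, the free energy loss K(p) and the dissipation D(p) both shrink, and their ratio tends to 1:

import numpy as np

q = np.array([0.2, 0.3, 0.5])
g = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 0.5],
              [2.0, 0.5, 0.0]])
H = g * q[:, None]
np.fill_diagonal(H, -H.sum(axis=0))
n = len(q)

def K(p):   # free energy loss
    return -sum(H[i, j]*p[j]*(np.log(p[i]/q[i]) - p[i]*q[j]/(p[j]*q[i]))
                for i in range(n) for j in range(n))

def D(p):   # dissipation
    r = p / q
    return 0.5 * np.sum(H * q[None, :] * (r[None, :] - r[:, None])**2)

epsilon = np.array([1.0, -2.0, 1.0])   # an arbitrary direction of perturbation
for s in [0.1, 0.01, 0.001]:
    p = q * (1 + s * epsilon)          # p_i/q_i = 1 + s*epsilon_i
    print(s, K(p), D(p), K(p) / D(p))  # the ratio tends to 1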

In short: detailed balanced open Markov processes are governed by the principle of minimum dissipation, not minimum entropy production. Minimum dissipation agrees with minimum entropy production only near equilibrium.