Information Geometry (Part 21)

17 August, 2021

Last time I ended with a formula for the ‘Gibbs distribution’: the probability distribution that maximizes entropy subject to constraints on the expected values of some observables.

This formula is well-known, but I’d like to derive it here. My argument won’t be up to the highest standards of rigor: I’ll do a bunch of computations, and it would take more work to state conditions under which these computations are justified. But even a nonrigorous approach is worthwhile, since the computations will give us more than the mere formula for the Gibbs distribution.

I’ll start by reminding you of what I claimed last time. I’ll state it in a way that removes all unnecessary distractions, so go back to Part 20 if you want more explanation.

The Gibbs distribution

Take a measure space \Omega with measure \mu. Suppose there is a probability distribution \pi on \Omega that maximizes the entropy

\displaystyle{ -\int_\Omega \pi(x) \ln \pi(x) \, d\mu(x) }

subject to the requirement that some integrable functions A^1, \dots, A^n on \Omega have expected values equal to some chosen list of numbers q^1, \dots, q^n.

(Unlike last time, now I’m writing A^i and q^i with superscripts rather than subscripts, because I’ll be using the Einstein summation convention: I’ll sum over any repeated index that appears once as a a superscript and once as a subscript.)

Furthermore, suppose \pi depends smoothly on q \in \mathbb{R}^n. I’ll call it \pi_q to indicate its dependence on q. Then, I claim \pi_q is the so-called Gibbs distribution

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int_\Omega e^{-p_i A^i(x)} \, d\mu(x)}   }


\displaystyle{ p_i = \frac{\partial f(q)}{\partial q^i} }


\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

is the entropy of \pi_q.

Let’s show this is true!

Finding the Gibbs distribution

So, we are trying to find a probability distribution \pi that maximizes entropy subject to these constraints:

\displaystyle{ \int_\Omega \pi(x) A^i(x) \, d\mu(x) = q^i }

We can solve this problem using Lagrange multipliers. We need one Lagrange multiplier, say \beta_i, for each of the above constraints. But it’s easiest if we start by letting \pi range over all of L^1(\Omega), that is, the space of all integrable functions on \Omega. Then, because we want \pi to be a probability distribution, we need to impose one extra constraint

\displaystyle{ \int_\Omega \pi(x) \, d\mu(x) = 1 }

To do this we need an extra Lagrange multiplier, say \gamma.

So, that’s what we’ll do! We’ll look for critical points of this function on L^1(\Omega):

\displaystyle{ - \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi\,  d\mu  }

Here I’m using some tricks to keep things short. First, I’m dropping the dummy variable x which appeared in all of the integrals we had: I’m leaving it implicit. Second, all my integrals are over \Omega so I won’t say that. And third, I’m using the Einstein summation convention, so there’s a sum over i implicit here.

Okay, now let’s do the variational derivative required to find a critical point of this function. When I was a math major taking physics classes, the way physicists did variational derivatives seemed like black magic to me. Then I spent months reading how mathematicians rigorously justified these techniques. I don’t feel like a massive digression into this right now, so I’ll just do the calculations—and if they seem like black magic, I’m sorry!

We need to find \pi obeying

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(- \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi \, d\mu \right) = 0 }

or in other words

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(\int \pi \ln \pi \, d\mu + \beta_i  \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 0 }

First we need to simplify this expression. The only part that takes any work, if you know how to do variational derivatives, is the first term. Since the derivative of z \ln z is 1 + \ln z, we have

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

The second and third terms are easy, so we get

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left( \int \pi \ln \pi \, d\mu + \beta_i \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) =}

        1 + \ln \pi(x) + \beta_i A^i(x) + \gamma

Thus, we need to solve this equation:

\displaystyle{ 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma  = 0}

That’s easy to do:

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

Good! It’s starting to look like the Gibbs distribution!

We now need to choose the Lagrange multipliers \beta_i and \gamma to make the constraints hold. To satisfy this constraint

\displaystyle{ \int \pi \, d\mu = 1 }

we must choose \gamma so that

\displaystyle{ \int  e^{-1 - \gamma - \beta_i A^i } \, d\mu = 1 }

or in other words

\displaystyle{ e^{1 + \gamma} = \int e^{- \beta_i A^i} \, d\mu }

Plugging this into our earlier formula

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

we get this:

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Great! Even more like the Gibbs distribution!

By the way, you must have noticed the “1” that showed up here:

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

It buzzed around like an annoying fly in the otherwise beautiful calculation, but eventually went away. This is the same irksome “1” that showed up in Part 19. Someday I’d like to say a bit more about it.

Now, where were we? We were trying to show that

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int e^{-p_i A^i} \, d\mu}   }

minimizes entropy subject to our constraints. So far we’ve shown

\displaystyle{ \pi(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

is a critical point. It’s clear that

\pi(x) \ge 0

so \pi really is a probability distribution. We should show it actually maximizes entropy subject to our constraints, but I will skip that. Given that, \pi will be our claimed Gibbs distribution \pi_q if we can show

p_i = \beta_i

This is interesting! It’s saying our Lagrange multipliers \beta_i actually equal the so-called conjugate variables p_i given by

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

where f(q) is the entropy of \pi_q:

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

There are two ways to show this: the easy way and the hard way. The easy way is to reflect on the meaning of Lagrange multipliers, and I’ll sketch that way first. The hard way is to use brute force: just compute p_i and show it equals \beta_i. This is a good test of our computational muscle—but more importantly, it will help us discover some interesting facts about the Gibbs distribution.

The easy way

Consider a simple Lagrange multiplier problem where you’re trying to find a critical point of a smooth function

f \colon \mathbb{R}^2 \to \mathbb{R}

subject to the constraint

g = c

for some smooth function

g \colon \mathbb{R}^2 \to \mathbb{R}

and constant c. (The function f here has nothing to do with the f in the previous sections.) To answer this we introduce a Lagrange multiplier \lambda and seek points where

\nabla ( f - \lambda g) = 0

This works because the above equation says

\nabla f = \lambda \nabla g

Geometrically this means we’re at a point where the gradient of f points at right angles to the level surface of g:

Thus, to first order we can’t change f by moving along the level surface of g.

But also, if we start at a point where

\nabla f = \lambda \nabla g

and we begin moving in any direction, the function f will change at a rate equal to \lambda times the rate of change of g. That’s just what the equation says! And this fact gives a conceptual meaning to the Lagrange multiplier \lambda.

Our situation is more complicated, since our functions are defined on the infinite-dimensional space L^1(\Omega), and we have an n-tuple of constraints with an n-tuple of Lagrange multipliers. But the same principle holds.

So, when we are at a solution \pi_q of our constrained entropy-maximization problem, and we start moving the point \pi_q by changing the value of the ith constraint, namely q^i, the rate at which the entropy changes will be \beta_i times the rate of change of q^i. So, we have

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

But this is just what we needed to show!

The hard way

Here’s another way to show

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

We start by solving our constrained entropy-maximization problem using Lagrange multipliers. As already shown, we get

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Then we’ll compute the entropy

f(q) = - \int \pi_q \ln \pi_q \, d\mu

Then we’ll differentiate this with respect to q_i and show we get \beta_i.

Let’s try it! The calculation is a bit heavy, so let’s write Z(q) for the so-called partition function

\displaystyle{ Z(q) = \int e^{- \beta_i A^i} \, d\mu }

so that

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{Z(q)} }

and the entropy is

\begin{array}{ccl}  f(q) &=& - \displaystyle{ \int  \pi_q  \ln \left( \frac{e^{- \beta_k A^k}}{Z(q)} \right)  \, d\mu }  \\ \\  &=& \displaystyle{ \int \pi_q \left(\beta_k A^k + \ln Z(q) \right)  \, d\mu } \\ \\  \end{array}

This is the sum of two terms. The first term

\displaystyle{ \int \pi_q \beta_k A^k   \, d\mu =  \beta_k \int \pi_q A^k   \, d\mu}

is \beta_k times the expected value of A^k with respect to the probability distribution \pi_q, all summed over k. But the expected value of A^k is q^k, so we get

\displaystyle{ \int  \pi_q \beta_k A^k \, d\mu } =  \beta_k q^k

The second term is easier:

\displaystyle{ \int_\Omega  \pi_q \ln Z(q) \, d\mu = \ln Z(q) }

since \pi_q(x) integrates to 1 and the partition function Z(q) doesn’t depend on x \in \Omega.

Putting together these two terms we get an interesting formula for the entropy:

f(q) = \beta_k q^k + \ln Z(q)

This formula is one reason this brute-force approach is actually worthwhile! I’ll say more about it later.

But for now, let’s use this formula to show what we’re trying to show, namely

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

For starters,

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=& \displaystyle{\frac{\partial}{\partial q^i} \left(\beta_k q^k + \ln Z(q) \right) } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k   \frac{\partial q^k}{\partial q^i} + \frac{\partial}{\partial q^i} \ln Z(q)   }  \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k  \delta^k_i + \frac{\partial}{\partial q^i} \ln Z(q)   } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  }  \end{array}

where we played a little Kronecker delta game with the second term.

Now we just need to compute the third term:

\begin{array}{ccl}  \displaystyle{ \frac{\partial}{\partial q^i} \ln Z(q) } &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i} Z(q) } \\  \\  &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i}  \int e^{- \beta_j A^j}  \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i} \left(e^{- \beta_j A^j}\right) \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i}\left( - \beta_k A^k \right)  e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ -\frac{1}{Z(q)} \int \frac{\partial \beta_k}{\partial q^i}  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} \frac{1}{Z(q)} \int  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} q^k }  \end{array}

Ah, you don’t know how good it feels, after years of category theory, to be doing calculations like this again!

Now we can finish the job we started:

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  } \\ \\  &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i - \frac{\partial \beta_k}{\partial q^i} q^k  } \\ \\  &=& \beta_i  \end{array}



We’ve learned the formula for the probability distribution that maximizes entropy subject to some constraints on the expected values of observables. But more importantly, we’ve seen that the anonymous Lagrange multipliers \beta_i that show up in this problem are actually the partial derivatives of entropy! They equal

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

Thus, they are rich in meaning. From what we’ve seen earlier, they are ‘surprisals’. They are analogous to momentum in classical mechanics and have the meaning of intensive variables in thermodynamics:

 Classical Mechanics   Thermodynamics   Probability Theory 
 q   position extensive variables probabilities
 p  momentum intensive variables surprisals
 S   action entropy Shannon entropy

Furthermore, by showing \beta_i = p_i the hard way we discovered an interesting fact. There’s a relation between the entropy and the logarithm of the partition function:

f(q) = p_i q^i + \ln Z(q)

(We proved this formula with \beta_i replacing p_i, but now we know those are equal.)

This formula suggests that the logarithm of the partition function is important—and it is! It’s closely related to the concept of free energy—even though ‘energy’, free or otherwise, doesn’t show up at the level of generality we’re working at now.

This formula should also remind you of the tautological 1-form on the cotangent bundle T^\ast Q, namely

\theta = p_i dq^i

It should remind you even more of the contact 1-form on the contact manifold T^\ast Q \times \mathbb{R}, namely

\alpha = -dS + p_i dq^i

Here S is a coordinate on the contact manifold that’s a kind of abstract stand-in for our entropy function f.

So, it’s clear there’s a lot more to say: we’re seeing hints of things here and there, but not yet the full picture.

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 20)

14 August, 2021

Last time we worked out an analogy between classical mechanics, thermodynamics and probability theory. The latter two look suspiciously similar:

 Classical Mechanics   Thermodynamics   Probability Theory 
 q   position extensive variables probabilities
 p  momentum intensive variables surprisals
 S   action entropy Shannon entropy

This is no coincidence. After all, in the subject of statistical mechanics we explain classical thermodynamics using probability theory—and entropy is revealed to be Shannon entropy (or its quantum analogue).

Now I want to make this precise.

To connect classical thermodynamics to probability theory, I’ll start by discussing ‘statistical manifolds’. I introduced the idea of a statistical manifold in Part 7: it’s a manifold Q equipped with a map sending each point q \in Q to a probability distribution \pi_q on some measure space \Omega. Now I’ll say how these fit into the second column of the above chart.

Then I’ll talk about statistical manifolds of a special sort used in thermodynamics, which I’ll call ‘Gibbsian’, since they really go back to Josiah Willard Gibbs.

In a Gibbsian statistical manifold, for each q \in Q the probability distribution \pi_q is a ‘Gibbs distribution’. Physically, these Gibbs distributions describe thermodynamic equilibria. For example, if you specify the volume, energy and number of particles in a box of gas, there will be a Gibbs distribution describing what the particles do in thermodynamic equilibrium under these conditions. Mathematically, Gibbs distributions maximize entropy subject to some constraints specified by the point q \in Q.

More precisely: in a Gibbsian statistical manifold we have a list of observables A_1, \dots , A_n whose expected values serve as coordinates q_1, \dots, q_n for points q \in Q, and \pi_q is the probability distribution that maximizes entropy subject to the constraint that the expected value of A_i is q_i. We can derive most of the interesting formulas of thermodynamics starting from this!

Statistical manifolds

Let’s fix a measure space \Omega with measure \mu. A statistical manifold is then a manifold Q equipped with a smooth map \pi assigning to each point q \in Q a probability distribution on \Omega, which I’ll call \pi_q. So, \pi_q is a function on \Omega with

\displaystyle{ \int_\Omega \pi_q \, d\mu = 1 }


\pi_q(x) \ge 0

for all x \in \Omega.

The idea here is that the space of all probability distributions on \Omega may be too huge to understand in as much detail as we’d like, so instead we describe some of these probability distributions—a family parametrized by points of some manifold Q—using the map \pi. This is the basic idea behind parametric statistics.

Information geometry is the geometry of statistical manifolds. Any statistical manifold comes with a bunch of interesting geometrical structures. One is the ‘Fisher information metric’, a Riemannian metric I explained in Part 7. Another is a 1-parameter family of connections on the tangent bundle T Q, which is important in Amari’s approach to information geometry. You can read about this here:

• Hiroshi Matsuzoe, Statistical manifolds and affine differential geometry, in Advanced Studies in Pure Mathematics 57, pp. 303–321.

I don’t want to talk about it now—I just wanted to reassure you that I’m not completely ignorant of it!

I want to focus on the story I’ve been telling, which is about entropy. Our statistical manifold Q comes with a smooth entropy function

f \colon Q \to \mathbb{R}


\displaystyle{  f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x)    }

We can use this entropy function to do many of the things we usually do in thermodynamics! For example, at any point q \in Q where this function is differentiable, its differential gives a cotangent vector

p = (df)_q

which has an important physical meaning. In coordinates we have

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and we call p_i the intensive variable conjugate to q_i. For example if q_i is energy, p_i will be ‘coolness’: the reciprocal of temperature.

Defining p this way gives a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q : \; x \in M, \; p =  (df)_x \}

of the cotangent bundle T^\ast Q. We can also get contact geometry into the game by defining a contact manifold T^\ast Q \times \mathbb{R} and a Legendrian submanifold

\Sigma = \{ (q,p,S) \in T^\ast Q \times \mathbb{R} : \; x \in M, \; p =  (df)_q , S = f(q) \}

But I’ve been talking about these ideas for the last three episodes, so I won’t say more just now! Instead, I want to throw a new idea into the pot.

Gibbsian statistical manifolds

Thermodynamics, and statistical mechanics, spend a lot of time dealing with statistical manifold of a special sort I’ll call ‘Gibbsian’. In these, each probability distribution \pi_q is a ‘Gibbs distribution’, meaning that it maximizes entropy subject to certain constraints specified by the point q \in Q.

How does this work? For starters, an integrable function

A \colon \Omega \to \mathbb{R}

is called a random variable, or in physics an observable. The expected value of an observable is a smooth real-valued function on our statistical manifold

\langle A \rangle \colon Q \to \mathbb{R}

given by

\displaystyle{ \langle A \rangle(q) = \int_\Omega A(x) \pi_q(x) \, d\mu(x) }

In other words, \langle A \rangle is a function whose value at at any point q \in Q is the expected value of A with respect to the probability distribution \pi_q.

Now, suppose our statistical manifold is n-dimensional and we have n observables A_1, \dots, A_n. Their expected values will be smooth functions on our manifold—and sometimes these functions will be a coordinate system!

This may sound rather unlikely, but it’s really not so outlandish. Indeed, if there’s a point q such that the differentials of the functions \langle A_i \rangle are linearly independent at this point, these functions will be a coordinate system in some neighborhood of this point, by the inverse function theorem. So, we can take this neighborhood, use it as our statistical manifold, and the functions \langle A_i \rangle will be coordinates.

So, let’s assume the expected values of our observables give a coordinate system on Q. Let’s call these coordinates q_1, \dots, q_n, so that

\langle A_i \rangle(q) = q_i

Now for the kicker: we say our statistical manifold is Gibbsian if for each point q \in Q, \pi_q is the probability distribution that maximizes entropy subject to the above condition!

Which condition? The condition saying that

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

for all i. This is just the previous equation spelled out so that you can see it’s a condition on \pi_q.

This assumption of the entropy-maximizing nature of \pi_q is a very powerful, because it implies a useful and nontrivial formula for \pi_q. It’s called the Gibbs distribution:

\displaystyle{  \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }

for all x \in \Omega.

Here p_i is the intensive variable conjugate to q_i, while Z(q) is the partition function: the thing we must divide by to make sure \pi_q integrates to 1. In other words:

\displaystyle{ Z(q) = \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

By the way, this formula may look confusing at first, since the left side depends on the point q in our statistical manifold, while there’s no q visible in the right side! Do you see what’s going on?

I’ll tell you: the conjugate variable p_i, sitting on the right side of the above formula, depends on q. Remember, we got it by taking the partial derivative of the entropy in the q_i direction

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and the evaluating this derivative at the point q.

But wait a minute! f here is the entropy—but the entropy of what?

The entropy of \pi_q, of course!

So there’s something circular about our formula for \pi_q. To know \pi_q, you need to know the conjugate variables p_i, but to compute these you need to know the entropy of \pi_q.

This is actually okay. While circular, the formula for \pi_q is still true. It’s harder to work with than you might hope. But it’s still extremely useful.

Next time I’ll prove that this formula for \pi_q is true, and do a few things with it. All this material was discovered by Gibbs in the late 1800’s, and it’s lurking any good book on statistical mechanics—but not phrased in the language of statistical manifolds. The physics textbooks usually consider special cases, like a box of gas where:

q_1 is energy, p_1 is 1/temperature.

q_2 is volume, p_2 is –pressure/temperature.

q_3 is the number of particles, p_3 is chemical potential / pressure.

While these special cases are important and interesting, I’d rather be general!

Technical comments

I said “Any statistical manifold comes with a bunch of interesting geometrical structures”, but in fact some conditions are required. For example, the Fisher information metric is only well-defined and nondegenerate under some conditions on the map \pi. For example, if \pi maps every point of Q to the same probability distribution, the Fisher information metric will vanish.

Similarly, the entropy function f is only smooth under some conditions on \pi.

Furthermore, the integral

\displaystyle{ \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

may not converge for all values of the numbers p_1, \dots, p_n. But in my discussion of Gibbsian statistical manifolds, I was assuming that an entropy-maximizing probability distribution \pi_q with

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

actually exists. In this case the probability distribution is also unique (almost everywhere).

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 19)

8 August, 2021

Last time I figured out the analogue of momentum in probability theory, but I didn’t say what it’s called. Now I will tell you—thanks to some help from Abel Jansma and Toby Bartels.


This is a well-known concept in information theory. It’s also called ‘information content‘.

Let’s see why. First, let’s remember the setup. We have a manifold

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

whose points q are nowhere vanishing probability distributions on the set \{1, \dots, n\}. We have a function

f \colon Q \to \mathbb{R}

called the Shannon entropy, defined by

\displaystyle{ f(q) = - \sum_{j = 1}^n q_j \ln q_j }

For each point q \in Q we define a cotangent vector p \in T^\ast_q Q by

p = (df)_q

As mentioned last time, this is the analogue of momentum in probability theory. In the second half of this post I’ll say more about exactly why. But first let’s compute it and see what it actually equals!

Let’s start with a naive calculation, acting as if the probabilities q_1, \dots, q_n were a coordinate system on the manifold Q. We get

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

so using the definition of the Shannon entropy we have

\begin{array}{ccl}  p_i &=& \displaystyle{ -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j  }\\ \\  &=& \displaystyle{ -\frac{\partial}{\partial q_i} \left( q_i \ln q_i \right) } \\ \\  &=& -\ln(q_i) - 1  \end{array}

Now, the quantity -\ln q_i is called the surprisal of the probability distribution at i. Intuitively, it’s a measure of how surprised you should be if an event of probability q_i occurs. For example, if you flip a fair coin and it lands heads up, your surprisal is ln 2. If you flip 100 fair coins and they all land heads up, your surprisal is 100 times ln 2.

Of course ‘surprise’ is a psychological term, not a term from math or physics, so we shouldn’t take it too seriously here. We can derive the concept of surprisal from three axioms:

  1. The surprisal of an event of probability q is some function of q, say F(q).
  2. The less probable an event is, the larger its surprisal is: q_1 \le q_2 \implies  F(q_1) \ge F(q_2).
  3. The surprisal of two independent events is the sum of their surprisals: F(q_1 q_2) = F(q_1) + F(q_2).

It follows from work on Cauchy’s functional equation that F must be of this form:

F(q) = - \log_b q

for some constant b > 1. We shall choose b, the base of our logarithms, to be e. We had a similar freedom of choice in defining the Shannon entropy, and we will use base e for both to be consistent. If we chose something else, it would change the surprisal and the Shannon entropy by the same constant factor.

So far, so good. But what about the irksome “-1” in our formula?

p_i = -\ln(q_i) - 1

Luckily it turns out we can just get rid of this! The reason is that the probabilities q_i are not really coordinates on the manifold Q. They’re not independent: they must sum to 1. So, when we change them a little, the sum of their changes must vanish. Putting it more technically, the tangent space T_q Q is not all of \mathbb{R}^n, but just the subspace consisting of vectors whose components sum to zero:

\displaystyle{ T_q Q = \{ v \in \mathbb{R}^n : \; \sum_{j = 1}^n v_j = 0 \} }

The cotangent space is the dual of the tangent space. The dual of a subspace

S \subseteq V

is the quotient space

V^\ast/\{ \ell \colon V \to \mathbb{R} : \; \forall v \in S \; \, \ell(v) = 0 \}

The cotangent space T_q^\ast Q thus consists of linear functionals \ell \colon \mathbb{R}^n \to \mathbb{R} modulo those that vanish on vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

Of course, we can identify the dual of \mathbb{R}^n with \mathbb{R}^n in the usual way, using the Euclidean inner product: a vector u \in \mathbb{R}^n corresponds to the linear functional

\displaystyle{ \ell(v) = \sum_{j = 1}^n u_j v_j }

From this, you can see that a linear functional \ell vanishes on all vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

if and only if its corresponding vector u has

u_1 = \cdots = u_n

So, we get

T^\ast_q Q \cong \mathbb{R}^n/\{ u \in \mathbb{R}^n : \; u_1 = \cdots = u_n \}

In words: we can describe cotangent vectors to Q as lists of n numbers if we want, but we have to remember that adding the same constant to each number in the list doesn’t change the cotangent vector!

This suggests that our naive formula

p_i = \ln(q_i) - 1

is on the right track, but we’re free to get rid of the constant 1 if we want! And that’s true.

To check this rigorously, we need to show

\displaystyle{ p(v) = -\sum_{j=1}^n \ln(q_i) v_i}

for all v \in T_q Q. We compute:

\begin{array}{ccl}  p(v) &=& df(v) \\ \\  &=& v(f) \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j \, \frac{\partial f}{\partial q_j} } \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j (-\ln(q_i) - 1) } \\ \\  &=& \displaystyle{ -\sum_{j=1}^n \ln(q_i) v_i }  \end{array}

where in the second to last step we used our earlier calculation:

\displaystyle{ \frac{\partial f}{\partial q_i} = -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j = -\ln(q_i) - 1 }

and in the last step we used

\displaystyle{ \sum_j v_j = 0 }

Back to the big picture

Now let’s take stock of where we are. We can fill in the question marks in the charts from last time, and combine those charts while we’re at it.

 Classical Mechanics   Thermodynamics   Probability Theory 
 q   position extensive variables probabilities
 p  momentum intensive variables surprisals
 S   action entropy Shannon entropy

What’s going on here? In classical mechanics, action is minimized (or at least the system finds a critical point of the action). In thermodynamics, entropy is maximized. In the maximum entropy approach to probability, Shannon entropy is maximized. This leads to a mathematical analogy that’s quite precise. For classical mechanics and thermodynamics, I explained it here:

Classical mechanics versus thermodynamics (part 1).

Classical mechanics versus thermodynamics (part 2).

These posts may give a more approachable introduction to what I’m doing now: now I’m bringing probability theory into the analogy, with a big emphasis on symplectic and contact geometry.

Let me spell out a bit of the analogy more carefully:

Classical Mechanics. In classical mechanics, we have a manifold Q whose points are positions of a particle. There’s an important function on this manifold: Hamilton’s principal function

f \colon Q \to \mathbb{R}

What’s this? It’s basically action: f(q) is the action of the least-action path from the position q_0 at some earlier time t_0 to the position q at time 0. The Hamilton–Jacobi equations say the particle’s momentum p at time 0 is given by

p = (df)_q

Thermodynamics. In thermodynamics, we have a manifold Q whose points are equilibrium states of a system. The coordinates of a point q \in Q are called extensive variables. There’s an important function on this manifold: the entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the intensive variables corresponding to the extensive variables.

Probability Theory. In probability theory, we have a manifold Q whose points are nowhere vanishing probability distributions on a finite set. The coordinates of a point q \in Q are probabilities. There’s an important function on this manifold: the Shannon entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the surprisals corresponding to the probabilities.

In all three cases, T^\ast Q is a symplectic manifold and imposing the constraint p = (df)_q picks out a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q: \; p = (df)_q \}

There is also a contact manifold T^\ast Q \times \mathbb{R} where the extra dimension comes with an extra coordinate S that means

• action in classical mechanics,
• entropy in thermodynamics, and
• Shannon entropy in probability theory.

We can then decree that S = f(q) along with p = (df)_q, and these constraints pick out a Legendrian submanifold

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

There’s a lot more to do with these ideas, and I’ll continue next time.

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 18)

5 August, 2021

Last time I sketched how two related forms of geometry, symplectic and contact geometry, show up in thermodynamics. Today I want to explain how they show up in probability theory.

For some reason I haven’t seen much discussion of this! But people should have looked into this. After all, statistical mechanics explains thermodynamics in terms of probability theory, so if some mathematical structure shows up in thermodynamics it should appear in statistical mechanics… and thus ultimately in probability theory.

I just figured out how this works for symplectic and contact geometry.

Suppose a system has n possible states. We’ll call these microstates, following the tradition in statistical mechanics. If you don’t know what ‘microstate’ means, don’t worry about it! But the rough idea is that if you have a macroscopic system like a rock, the precise details of what its atoms are doing are described by a microstate, and many different microstates could be indistinguishable unless you look very carefully.

We’ll call the microstates 1, 2, \dots, n. So, if you don’t want to think about physics, when I say microstate I’ll just mean an integer from 1 to n.

Next, a probability distribution q assigns a real number q_i to each microstate, and these numbers must sum to 1 and be nonnegative. So, we have q \in \mathbb{R}^n, though not every vector in \mathbb{R}^n is a probability distribution.

I’m sure you’re wondering why I’m using q rather than p to stand for an observable instead of a probability distribution. Am I just trying to confuse you?

No: I’m trying to set up an analogy to physics!

Last time I introduced symplectic geometry using classical mechanics. The most important example of a symplectic manifold is the cotangent bundle T^\ast Q of a manifold Q. A point of T^\ast Q is a pair (q,p) consisting of a point q \in Q and a cotangent vector p \in T^\ast_q Q. In classical mechanics the point q describes the position of some physical system, while p describes its momentum.

So, I’m going to set up an analogy like this:

 Classical Mechanics  Probability Theory
  q   position   probability distribution  
  p   momentum ???

But what is to momentum as probability is to position?

A big clue is the appearance of symplectic geometry in thermodynamics, which I also outlined last time. We can use this to get some intuition about the analogue of momentum in probability theory.

In thermodynamics, a system has a manifold Q of states. (These are not the ‘microstates’ I mentioned before: we’ll see the relation later.) There is a function

f \colon Q \to \mathbb{R}

describing the entropy of the system as a function of its state. There is a law of thermodynamics saying that

p = (df)_q

This equation picks out a submanifold of T^\ast Q, namely

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

Moreover this submanifold is Lagrangian: the symplectic structure \omega vanishes when restricted to it:

\displaystyle{ \omega |_\Lambda = 0 }

This is very beautiful, but it goes by so fast you might almost miss it! So let’s clutter it up a bit with coordinates. We often use local coordinates on Q and describe a point q \in Q using these coordinates, getting a point

(q_1, \dots, q_n) \in \mathbb{R}^n

They give rise to local coordinates q_1, \dots, q_n, p_1, \dots, p_n on the cotangent bundle T^\ast Q. The q_i are called extensive variables, because they are typically things that you can measure only by totalling up something over the whole system, like the energy or volume of a cylinder of gas. The p_i are called intensive variables, because they are typically things that you can measure locally at any point, like temperature or pressure.

In these local coordinates, the symplectic structure on T^\ast Q is the 2-form given by

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

The equation

p = (df)_q

serves as a law of physics that determines the intensive variables given the extensive ones when our system is in thermodynamic equilibrium. Written out using coordinates, this law says

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

It looks pretty bland here, but in fact it gives formulas for the temperature and pressure of a gas, and many other useful formulas in thermodynamics.

Now we are ready to see how all this plays out in probability theory! We’ll get an analogy like this, which goes hand-in-hand with our earlier one:

 Thermodynamics   Probability Theory 
  q   extensive variables   probability distribution  
  p   intensive variables ???

This analogy is clearer than the last, because statistical mechanics reveals that the extensive variables in thermodynamics are really just summaries of probability distributions on microstates. Furthermore, both thermodynamics and probability theory have a concept of entropy.

So, let’s take our manifold Q to consist of probability distributions on the set of microstates I was talking about before: the set \{1, \dots, n\}. Actually, let’s use nowhere vanishing probability distributions:

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

I’m requiring q_i > 0 to ensure Q is a manifold, and also to make sure f is differentiable: it ceases to be differentiable when one of the probabilities q_i hits zero.

Since Q is a manifold, its cotangent bundle is a symplectic manifold T^\ast Q. And here’s the good news: we have a god-given entropy function

f \colon Q \to \mathbb{R}

namely the Shannon entropy

\displaystyle{ f(q) = - \sum_{i = 1}^n q_i \ln q_i }

So, everything I just described about thermodynamics works in the setting of plain old probability theory! Starting from our manifold Q and the entropy function, we get all the rest, leading up to the Lagrangian submanifold

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

that describes the relation between extensive and intensive variables.

For computations it helps to pick coordinates on Q. Since the probabilities q_1, \dots, q_n sum to 1, they aren’t independent coordinates on Q. So, we can either pick all but one of them as coordinates, or learn how to deal with non-independent coordinates, which are already completely standard in projective geometry. Let’s do the former, just to keep things simple.

These coordinates on Q give rise in the usual way to coordinates q_i and p_i on the cotangent bundle T^\ast Q. These play the role of extensive and intensive variables, respectively, and it should be very interesting to impose the equation

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

where f is the Shannon entropy. This picks out a Lagrangian submanifold \Lambda \subseteq T^\ast Q.

So, the question becomes: what does this mean? If this formula gives the analogue of momentum for probability theory, what does this analogue of momentum mean?

Here’s a preliminary answer: p_i says how fast entropy increases as we increase the probability q_i that our system is in the ith microstate. So if we think of nature as ‘wanting’ to maximize entropy, the quantity p_i says how eager it is to increase the probability q_i.

Indeed, you can think of p_i as a bit like pressure—one of the most famous intensive quantities in thermodynamics. A gas ‘wants’ to expand, and its pressure says precisely how eager it is to expand. Similarly, a probability distribution ‘wants’ to flatten out, to maximize entropy, and p_i says how eager it is to increase the probability q_i in order to do this.

But what can we do with this concept? And what does symplectic geometry do for probability theory?

I will start tackling these questions next time.

One thing I’ll show is that when we reduce thermodynamics to probability theory using the ideas of statistical mechanics, the appearance of symplectic geometry in thermodynamics follows from its appearance in probability theory.

Another thing I want to investigate is how other geometrical structures on the space of probability distributions, like the Fisher information metric, interact with the symplectic structure on its cotangent bundle. This will integrate symplectic geometry and information geometry.

I also want to bring contact geometry into the picture. It’s already easy to see from our work last time how this should go. We treat the entropy S as an independent variable, and replace T^\ast Q with a larger manifold T^\ast Q \times \mathbb{R} having S as an extra coordinate. This is a contact manifold with contact form

\alpha = -dS + p_1 dq_i + \cdots + p_n dq_n

This contact manifold has a submanifold \Sigma where we remember that entropy is a function of the probability distribution q, and define p in terms of q as usual:

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

And as we saw last time, \Sigma is a Legendrian submanifold, meaning

\displaystyle{ \alpha|_{\Sigma} = 0 }

But again, we want to understand what these ideas from contact geometry really do for probability theory!

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 17)

27 July, 2021

I’m getting back into information geometry, which is the geometry of the space of probability distributions, studied using tools from information theory. I’ve written a bunch about it already, which you can see here:

Information geometry.

Now I’m fascinated by something new: how symplectic geometry and contact geometry show up in information geometry. But before I say anything about this, let me say a bit about how they show up in thermodynamics. This is more widely discussed, and it’s a good starting point.

Symplectic geometry was born as the geometry of phase space in classical mechanics: that is, the space of possible positions and momenta of a classical system. The simplest example of a symplectic manifold is the vector space \mathbb{R}^{2n}, with n position coordinates q_i and n momentum coordinates p_i.

It turns out that symplectic manifolds are always even-dimensional, because we can always cover them with coordinate charts that look like \mathbb{R}^{2n}. When we change coordinates, it turns out that the splitting of coordinates into positions and momenta is somewhat arbitrary. For example, the position of a rock on a spring now may determine its momentum a while later, and vice versa. What’s not arbitrary? It’s the so-called ‘symplectic structure’:

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

While far from obvious at first, we know by now that the symplectic structure is exactly what needs to be preserved under valid changes of coordinates in classical mechanics! In fact, we can develop the whole formalism of classical mechanics starting from a manifold with a symplectic structure.

Symplectic geometry also shows up in thermodynamics. In thermodynamics we can start with a system in equilibrium whose state is described by some variables q_1, \dots, q_n. Its entropy will be a function of these variables, say

S = f(q_1, \dots, q_n)

We can then take the partial derivatives of entropy and call them something:

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

These new variables p_i are said to be ‘conjugate’ to the q_i, and they turn out to be very interesting. For example, if q_i is energy then p_i is ‘coolness’: the reciprocal of temperature. The coolness of a system is its change in entropy per change in energy.

Often the variables q_i are ‘extensive’: that is, you can measure them only by looking at your whole system and totaling up some quantity. Examples are energy and volume. Then the new variables p_i are ‘intensive’: that is, you can measure them at any one location in your system. Examples are coolness and pressure.

Now for a twist: sometimes we do not know the function f ahead of time. Then we cannot define the p_i as above. We’re forced into a different approach where we treat them as independent quantities, at least until someone tells us what f is.

In this approach, we start with a space \mathbb{R}^{2n} having n coordinates called q_i and n coordinates called p_i. This is a symplectic manifold, with the symplectic struture \omega described earlier!

But what about the entropy? We don’t yet know what it is as a function of the q_i, but we may still want to talk about it. So, we build a space \mathbb{R}^{2n+1} having one extra coordinate S in addition to the q_i and p_i. This new coordinate stands for entropy. And this new space has an important 1-form on it:

\alpha = -dS + p_1 dq_i + \cdots + p_n dq_n

This is called the ‘contact 1-form’.

This makes \mathbb{R}^{2n+1} into an example of a ‘contact manifold’. Contact geometry is the odd-dimensional partner of symplectic geometry. Just as symplectic manifolds are always even-dimensional, contact manifolds are always odd-dimensional.

What is the point of the contact 1-form? Well, suppose someone tells us the function f relating entropy to the coordinates q_i. Now we know that we want

S = f

and also

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

So, we can impose these equations, which pick out a subset of \mathbb{R}^{2n+1}. You can check that this subset, say \Sigma, is an n-dimensional submanifold. But even better, the contact 1-form vanishes when restricted to this submanifold:

\left.\alpha\right|_\Sigma = 0

Let’s see why! Suppose x \in \Sigma and suppose v \in T_x \Sigma is a vector tangent to \Sigma at this point x. It suffices to show

\alpha(v) = 0

Using the definition of \alpha this equation says

\displaystyle{ -dS(v) + \sum_i p_i dq_i(v) = 0 }

But on the surface \Sigma we have

S = f, \qquad  \displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

So, the equation we’re trying to show can be written as

\displaystyle{ -df(v) + \sum_i \frac{\partial f}{\partial q_i} dq_i(v) = 0 }

But this follows from

\displaystyle{ df = \sum_i \frac{\partial f}{\partial q_i} dq_i }

which holds because f is a function only of the coordinates q_i.

So, any formula for entropy S = f(q_1, \dots, q_n) picks out a so-called ‘Legendrian submanifold’ of \mathbb{R}^{2n+1}: that is, an n-dimensional submanifold such that the contact 1-form vanishes when restricted to this submanifold. And the idea is that this submanifold tells you everything you need to know about a thermodynamic system.

Indeed, V. I. Arnol’d says this was implicitly known to the great founder of statistical mechanics, Josiah Willard Gibbs. Arnol’d calls \mathbb{R}^5 with coordinates energy, entropy, temperature, pressure and volume the ‘Gibbs manifold’, and he proclaims:

Gibbs’ thesis: substances are Legendrian submanifolds of the Gibbs manifold.

This is from here:

• V. I. Arnol’d, Contact geometry: the geometrical method of Gibbs’ thermodynamics, Proceedings of the Gibbs Symposium (New Haven, CT, 1989), AMS, Providence, Rhode Island, 1990.

A bit more detail

Now I want to say everything again, with a bit of extra detail, assuming more familiarity with manifolds. Above I was using \mathbb{R}^n with coordinates q_1, \dots, q_n to describe the ‘extensive’ variables of a thermodynamic system. But let’s be a bit more general and use any smooth n-dimensional manifold Q. Even if Q is a vector space, this viewpoint is nice because it’s manifestly coordinate-independent!

So: starting from Q we build the cotangent bundle T^\ast Q. A point in cotangent describes both extensive variables, namely q \in Q, and ‘intensive’ variables, namely a cotangent vector p \in T^\ast_q Q.

The manifold T^\ast Q has a 1-form \theta on it called the tautological 1-form. We can describe it as follows. Given a tangent vector v \in T_{(q,p)} T^\ast Q we have to say what \theta(v) is. Using the projection

\pi \colon T^\ast Q \to Q

we can project v down to a tangent vector d\pi(v) at the point q. But the 1-form p eats tangent vectors at q and spits out numbers! So, we set

\theta(v) = p(d\pi(v))

This is sort of mind-boggling at first, but it’s worth pondering until it makes sense. It helps to work out what \theta looks like in local coordinates. Starting with any local coordinates q_i on an open set of Q, we get local coordinates q_i, p_i on the cotangent bundle of this open set in the usual way. On this open set you then get

\theta = p_1 dq_1 + \cdots + p_n dq_n

This is a standard calculation, which is really worth doing!

It follows that we can define a symplectic structure \omega by

\omega = d \theta

and get this formula in local coordinates:

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

Now, suppose we choose a smooth function

f \colon Q \to \mathbb{R}

which describes the entropy. We get a 1-form df, which we can think of as a map

df \colon Q \to T^\ast Q

assigning to each choice q of extensive variables the pair (q,p) of extensive and intensive variables where

p = df_q

The image of the map df is a ‘Lagrangian submanifold‘ of T^\ast Q: that is, an n-dimensional submanifold \Lambda such that

\left.\omega\right|_{\Lambda} = 0

Lagrangian submanifolds are to symplectic geometry as Legendrian submanifolds are to contact geometry! What we’re seeing here is that if Gibbs had preferred symplectic geometry, he could have described substances as Lagrangian submanifolds rather than Legendrian submanifolds. But this approach would only keep track of the derivatives of entropy, df, not the actual value of the entropy function f.

If we prefer to keep track of the actual value of f using contact geometry, we can do that. For this we add an extra dimension to T^\ast Q and form the manifold T^\ast Q \times \mathbb{R}. The extra dimension represents entropy, so we’ll use S as our name for the coordinate on \mathbb{R}.

We can make T^\ast Q \times \mathbb{R} into a contact manifold with contact 1-form

\alpha = -d S + \theta

In local coordinates we get

\alpha = -dS + p_1 dq_i + \cdots + p_n dq_n

just as we had earlier. And just as before, if we choose a smooth function f \colon Q \to \mathbb{R} describing entropy, the subset

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = df_q \}

is a Legendrian submanifold of T^\ast Q \times \mathbb{R}.

Okay, this concludes my lightning review of symplectic and contact geometry in thermodynamics! Next time I’ll talk about something a bit less well understood: how they show up in statistical mechanics.

Fisher’s Fundamental Theorem (Part 4)

13 July, 2021

I wrote a paper that summarizes my work connecting natural selection to information theory:

• John Baez, The fundamental theorem of natural selection.

Check it out! If you have any questions or see any mistakes, please let me know.

Just for fun, here’s the abstract and introduction.

Abstract. Suppose we have n different types of self-replicating entity, with the population P_i of the ith type changing at a rate equal to P_i times the fitness f_i of that type. Suppose the fitness f_i is any continuous function of all the populations P_1, \dots, P_n. Let p_i be the fraction of replicators that are of the ith type. Then p = (p_1, \dots, p_n) is a time-dependent probability distribution, and we prove that its speed as measured by the Fisher information metric equals the variance in fitness. In rough terms, this says that the speed at which information is updated through natural selection equals the variance in fitness. This result can be seen as a modified version of Fisher’s fundamental theorem of natural selection. We compare it to Fisher’s original result as interpreted by Price, Ewens and Edwards.


In 1930, Fisher stated his “fundamental theorem of natural selection” as follows:

The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time

Some tried to make this statement precise as follows:

The time derivative of the mean fitness of a population equals the variance of its fitness.

But this is only true under very restrictive conditions, so a controversy was ignited.

An interesting resolution was proposed by Price, and later amplified by Ewens and Edwards. We can formalize their idea as follows. Suppose we have n types of self-replicating entity, and idealize the population of the ith type as a real-valued function P_i(t). Suppose

\displaystyle{ \frac{d}{dt} P_i(t) = f_i(P_1(t), \dots, P_n(t)) \, P_i(t) }

where the fitness f_i is a differentiable function of the populations of every type of replicator. The mean fitness at time t is

\displaystyle{ \overline{f}(t) = \sum_{i=1}^n p_i(t) \, f_i(P_1(t), \dots, P_n(t)) }

where p_i(t) is the fraction of replicators of the ith type:

\displaystyle{ p_i(t) = \frac{P_i(t)}{\phantom{\Big|} \sum_{j = 1}^n P_j(t) } }

By the product rule, the rate of change of the mean fitness is the sum of two terms:

\displaystyle{ \frac{d}{dt} \overline{f}(t) = \sum_{i=1}^n \dot{p}_i(t) \, f_i(P_1(t), \dots, P_n(t)) \; + \; }

\displaystyle{ \sum_{i=1}^n p_i(t) \,\frac{d}{dt} f_i(P_1(t), \dots, P_n(t)) }

The first of these two terms equals the variance of the fitness at time t. We give the easy proof in Theorem 1. Unfortunately, the conceptual significance of this first term is much less clear than that of the total rate of change of mean fitness. Ewens concluded that “the theorem does not provide the substantial biological statement that Fisher claimed”.

But there is another way out, based on an idea Fisher himself introduced in 1922: Fisher information. Fisher information gives rise to a Riemannian metric on the space of probability distributions on a finite set, called the ‘Fisher information metric’—or in the context of evolutionary game theory, the ‘Shahshahani metric’. Using this metric we can define the speed at which a time-dependent probability distribution changes with time. We call this its ‘Fisher speed’. Under just the assumptions already stated, we prove in Theorem 2 that the Fisher speed of the probability distribution

p(t) = (p_1(t), \dots, p_n(t))

is the variance of the fitness at time t.

As explained by Harper, natural selection can be thought of as a learning process, and studied using ideas from information geometry—that is, the geometry of the space of probability distributions. As p(t) changes with time, the rate at which information is updated is closely connected to its Fisher speed. Thus, our revised version of the fundamental theorem of natural selection can be loosely stated as follows:

As a population changes with time, the rate at which information is updated equals the variance of fitness.

The precise statement, with all the hypotheses, is in Theorem 2. But one lesson is this: variance in fitness may not cause ‘progress’ in the sense of increased mean fitness, but it does cause change!

For more details in a user-friendly blog format, read the whole series:

Part 1: the obscurity of Fisher’s original paper.

Part 2: a precise statement of Fisher’s fundamental theorem of natural selection, and conditions under which it holds.

Part 3: a modified version of the fundamental theorem of natural selection, which holds much more generally.

Part 4: my paper on the fundamental theorem of natural selection.

Nonequilibrium Thermodynamics in Biology (Part 2)

16 June, 2021

Larry Li, Bill Cannon and I ran a session on non-equilibrium thermodynamics in biology at SMB2021, the annual meeting of the Society for Mathematical Biology. You can see talk slides here!

Here’s the basic idea:

Since Lotka, physical scientists have argued that living things belong to a class of complex and orderly systems that exist not despite the second law of thermodynamics, but because of it. Life and evolution, through natural selection of dissipative structures, are based on non-equilibrium thermodynamics. The challenge is to develop an understanding of what the respective physical laws can tell us about flows of energy and matter in living systems, and about growth, death and selection. This session addresses current challenges including understanding emergence, regulation and control across scales, and entropy production, from metabolism in microbes to evolving ecosystems.

Click on the links to see slides for most of the talks:

Persistence, permanence, and global stability in reaction network models: some results inspired by thermodynamic principles
Gheorghe Craciun, University of Wisconsin–Madison

The standard mathematical model for the dynamics of concentrations in biochemical networks is called mass-action kinetics. We describe mass-action kinetics and discuss the connection between special classes of mass-action systems (such as detailed balanced and complex balanced systems) and the Boltzmann equation. We also discuss the connection between the ‘global attractor conjecture’ for complex balanced mass-action systems and Boltzmann’s H-theorem. We also describe some implications for biochemical mechanisms that implement noise filtering and cellular homeostasis.

The principle of maximum caliber of nonequilibria
Ken Dill, Stony Brook University

Maximum Caliber is a principle for inferring pathways and rate distributions of kinetic processes. The structure and foundations of MaxCal are much like those of Maximum Entropy for static distributions. We have explored how MaxCal may serve as a general variational principle for nonequilibrium statistical physics—giving well-known results, such as the Green-Kubo relations, Onsager’s reciprocal relations and Prigogine’s Minimum Entropy Production principle near equilibrium, but is also applicable far from equilibrium. I will also discuss some applications, such as finding reaction coordinates in molecular simulations non-linear dynamics in gene circuits, power-law-tail distributions in ‘social-physics’ networks, and others.

Nonequilibrium biomolecular information processes
Pierre Gaspard, Université libre de Bruxelles

Nearly 70 years have passed since the discovery of DNA structure and its role in coding genetic information. Yet, the kinetics and thermodynamics of genetic information processing in DNA replication, transcription, and translation remain poorly understood. These template-directed copolymerization processes are running away from equilibrium, being powered by extracellular energy sources. Recent advances show that their kinetic equations can be exactly solved in terms of so-called iterated function systems. Remarkably, iterated function systems can determine the effects of genome sequence on replication errors, up to a million times faster than kinetic Monte Carlo algorithms. With these new methods, fundamental links can be established between molecular information processing and the second law of thermodynamics, shedding a new light on genetic drift, mutations, and evolution.

Nonequilibrium dynamics of disturbed ecosystems
John Harte, University of California, Berkeley

The Maximum Entropy Theory of Ecology (METE) predicts the shapes of macroecological metrics in relatively static ecosystems, across spatial scales, taxonomic categories, and habitats, using constraints imposed by static state variables. In disturbed ecosystems, however, with time-varying state variables, its predictions often fail. We extend macroecological theory from static to dynamic, by combining the MaxEnt inference procedure with explicit mechanisms governing disturbance. In the static limit, the resulting theory, DynaMETE, reduces to METE but also predicts a new scaling relationship among static state variables. Under disturbances, expressed as shifts in demographic, ontogenic growth, or migration rates, DynaMETE predicts the time trajectories of the state variables as well as the time-varying shapes of macroecological metrics such as the species abundance distribution and the distribution of metabolic rates over
individuals. An iterative procedure for solving the dynamic theory is presented. Characteristic signatures of the deviation from static predictions of macroecological patterns are shown to result from different kinds of disturbance. By combining MaxEnt inference with explicit dynamical mechanisms of disturbance, DynaMETE is a candidate theory of macroecology for ecosystems responding to anthropogenic or natural disturbances.

Stochastic chemical reaction networks
Supriya Krishnamurthy, Stockholm University

The study of chemical reaction networks (CRN’s) is a very active field. Earlier well-known results (Feinberg Chem. Enc. Sci. 42 2229 (1987), Anderson et al Bull. Math. Biol. 72 1947 (2010)) identify a topological quantity called deficiency, easy to compute for CRNs of any size, which, when exactly equal to zero, leads to a unique factorized (non-equilibrium) steady-state for these networks. No general results exist however for the steady states of non-zero-deficiency networks. In recent work, we show how to write the full moment-hierarchy for any non-zero-deficiency CRN obeying mass-action kinetics, in terms of equations for the factorial moments. Using these, we can recursively predict values for lower moments from higher moments, reversing the procedure usually used to solve moment hierarchies. We show, for non-trivial examples, that in this manner we can predict any moment of interest, for CRN’s with non-zero deficiency and non-factorizable steady states. It is however an open question how scalable these techniques are for large networks.

Heat flows adjust local ion concentrations in favor of prebiotic chemistry
Christof Mast, Ludwig-Maximilians-Universität München

Prebiotic reactions often require certain initial concentrations of ions. For example, the activity of RNA enzymes requires a lot of divalent magnesium salt, whereas too much monovalent sodium salt leads to a reduction in enzyme function. However, it is known from leaching experiments that prebiotically relevant geomaterial such as basalt releases mainly a lot of sodium and only little magnesium. A natural solution to this problem is heat fluxes through thin rock fractures, through which magnesium is actively enriched and sodium is depleted by thermogravitational convection and thermophoresis. This process establishes suitable conditions for ribozyme function from a basaltic leach. It can take place in a spatially distributed system of rock cracks and is therefore particularly stable to natural fluctuations and disturbances.

Deficiency of chemical reaction networks and thermodynamics
Matteo Polettini, University of Luxembourg

Deficiency is a topological property of a Chemical Reaction Network linked to important dynamical features, in particular of deterministic fixed points and of stochastic stationary states. Here we link it to thermodynamics: in particular we discuss the validity of a strong vs. weak zeroth law, the existence of time-reversed mass-action kinetics, and the possibility to formulate marginal fluctuation relations. Finally we illustrate some subtleties of the Python module we created for MCMC stochastic simulation of CRNs, soon to be made public.

Large deviations theory and emergent landscapes in biological dynamics
Hong Qian, University of Washington

The mathematical theory of large deviations provides a nonequilibrium thermodynamic description of complex biological systems that consist of heterogeneous individuals. In terms of the notions of stochastic elementary reactions and pure kinetic species, the continuous-time, integer-valued Markov process dictates a thermodynamic structure that generalizes (i) Gibbs’ microscopic chemical thermodynamics of equilibrium matters to nonequilibrium small systems such as living cells and tissues; and (ii) Gibbs’ potential function to the landscapes for biological dynamics, such as that of C. H. Waddington and S. Wright.

Using the maximum entropy production principle to understand and predict microbial biogeochemistry
Joseph Vallino, Marine Biological Laboratory, Woods Hole

Natural microbial communities contain billions of individuals per liter and can exceed a trillion cells per liter in sediments, as well as harbor thousands of species in the same volume. The high species diversity contributes to extensive metabolic functional capabilities to extract chemical energy from the environment, such as methanogenesis, sulfate reduction, anaerobic photosynthesis, chemoautotrophy, and many others, most of which are only expressed by bacteria and archaea. Reductionist modeling of natural communities is problematic, as we lack knowledge on growth kinetics for most organisms and have even less understanding on the mechanisms governing predation, viral lysis, and predator avoidance in these systems. As a result, existing models that describe microbial communities contain dozens to hundreds of parameters, and state variables are extensively aggregated. Overall, the models are little more than non-linear parameter fitting exercises that have limited, to no, extrapolation potential, as there are few principles governing organization and function of complex self-assembling systems. Over the last decade, we have been developing a systems approach that models microbial communities as a distributed metabolic network that focuses on metabolic function rather than describing individuals or species. We use an optimization approach to determine which metabolic functions in the network should be up regulated versus those that should be down regulated based on the non-equilibrium thermodynamics principle of maximum entropy production (MEP). Derived from statistical mechanics, MEP proposes that steady state systems will likely organize to maximize free energy dissipation rate. We have extended this conjecture to apply to non-steady state systems and have proposed that living systems maximize entropy production integrated over time and space, while non-living systems maximize instantaneous entropy production. Our presentation will provide a brief overview of the theory and approach, as well as present several examples of applying MEP to describe the biogeochemistry of microbial systems in laboratory experiments and natural ecosystems.

Reduction and the quasi-steady state approximation
Carsten Wiuf, University of Copenhagen

Chemical reactions often occur at different time-scales. In applications of chemical reaction network theory it is often desirable to reduce a reaction network to a smaller reaction network by elimination of fast species or fast reactions. There exist various techniques for doing so, e.g. the Quasi-Steady-State Approximation or the Rapid Equilibrium Approximation. However, these methods are not always mathematically justifiable. Here, a method is presented for which (so-called) non-interacting species are eliminated by means of QSSA. It is argued that this method is mathematically sound. Various examples are given (Michaelis-Menten mechanism, two-substrate mechanism, …) and older related techniques from the 50s and 60s are briefly discussed.

Fisher’s Fundamental Theorem (Part 3)

8 October, 2020

Last time we stated and proved a simple version of Fisher’s fundamental theorem of natural selection, which says that under some conditions, the rate of increase of the mean fitness equals the variance of the fitness. But the conditions we gave were very restrictive: namely, that the fitness of each species of replicator is constant, not depending on how many of these replicators there are, or any other replicators.

To broaden the scope of Fisher’s fundamental theorem we need to do one of two things:

1) change the left side of the equation: talk about some other quantity other than rate of change of mean fitness.

2) change the right side of the question: talk about some other quantity than the variance in fitness.

Or we could do both! People have spent a lot of time generalizing Fisher’s fundamental theorem. I don’t think there are, or should be, any hard rules on what counts as a generalization.

But today we’ll take alternative 1). We’ll show the square of something called the ‘Fisher speed’ always equals the variance in fitness. One nice thing about this result is that we can drop the restrictive condition I mentioned. Another nice thing is that the Fisher speed is a concept from information theory! It’s defined using the Fisher metric on the space of probability distributions.

And yes—that metric is named after the same guy who proved Fisher’s fundamental theorem! So, arguably, Fisher should have proved this generalization of Fisher’s fundamental theorem. But in fact it seems that I was the first to prove it, around February 1st, 2017. Some similar results were already known, and I will discuss those someday. But they’re a bit different.

A good way to think about the Fisher speed is that it’s ‘the rate at which information is being updated’. A population of replicators of different species gives a probability distribution. Like any probability distribution, this has information in it. As the populations of our replicators change, the Fisher speed says the rate at which this information is being updated. So, in simple terms, we’ll show

The square of the rate at which information is updated is equal to the variance in fitness.

This is quite a change from Fisher’s original idea, namely:

The rate of increase of mean fitness is equal to the variance in fitness.

But it has the advantage of always being true… as long the population dynamics are described by the general framework we introduced last time. So let me remind you of the general setup, and then prove the result!

The setup

We start out with population functions P_i \colon \mathbb{R} \to (0,\infty), one for each species of replicator i = 1,\dots,n, obeying the Lotka–Volterra equation

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) P_i }

for some differentiable functions f_i \colon (0,\infty) \to \mathbb{R} called fitness functions. The probability of a replicator being in the ith species is

\displaystyle{  p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)} }

Using the Lotka–Volterra equation we showed last time that these probabilities obey the replicator equation

\displaystyle{ \dot{p}_i = \left( f_i(P) - \overline f(P) \right)  p_i }

Here P is short for the whole list of populations (P_1(t), \dots, P_n(t)), and

\displaystyle{ \overline f(P) = \sum_j f_j(P) p_j  }

is the mean fitness.

The Fisher metric

The space of probability distributions on the set \{1, \dots, n\} is called the (n-1)-simplex

\Delta^{n-1} = \{ (x_1, \dots, x_n) : \; x_i \ge 0, \; \displaystyle{ \sum_{i=1}^n x_i = 1 } \}

It’s called \Delta^{n-1} because it’s (n-1)-dimensional. When n = 3 it looks like the letter \Delta:

The Fisher metric is a Riemannian metric on the interior of the (n-1)-simplex. That is, given a point p in the interior of \Delta^{n-1} and two tangent vectors v,w at this point the Fisher metric gives a number

g(v,w) = \displaystyle{ \sum_{i=1}^n \frac{v_i w_i}{p_i}  }

Here we are describing the tangent vectors v,w as vectors in \mathbb{R}^n with the property that the sum of their components is zero: that’s what makes them tangent to the (n-1)-simplex. And we’re demanding that x be in the interior of the simplex to avoid dividing by zero, since on the boundary of the simplex we have p_i = 0 for at least one choice of $i.$

If we have a probability distribution p(t) moving around in the interior of the (n-1)-simplex as a function of time, its Fisher speed is

\displaystyle{ \sqrt{g(\dot{p}(t), \dot{p}(t))} = \sqrt{\sum_{i=1}^n \frac{\dot{p}_i(t)^2}{p_i(t)}} }

if the derivative \dot{p}(t) exists. This is the usual formula for the speed of a curve moving in a Riemannian manifold, specialized to the case at hand.

Now we’ve got all the formulas we’ll need to prove the result we want. But for those who don’t already know and love it, it’s worthwhile saying a bit more about the Fisher metric.

The factor of 1/x_i in the Fisher metric changes the geometry of the simplex so that it becomes round, like a portion of a sphere:

But the reason the Fisher metric is important, I think, is its connection to relative information. Given two probability distributions p, q \in \Delta^{n-1}, the information of q relative to p is

\displaystyle{ I(q,p) = \sum_{i = 1}^n q_i \ln\left(\frac{q_i}{p_i}\right)   }

You can show this is the expected amount of information gained if p was your prior distribution and you receive information that causes you to update your prior to q. So, sometimes it’s called the information gain. It’s also called relative entropy or—my least favorite, since it sounds so mysterious—the Kullback–Leibler divergence.

Suppose p(t) is a smooth curve in the interior of the (n-1)-simplex. We can ask the rate at which information is gained as time passes. Perhaps surprisingly, a calculation gives

\displaystyle{ \frac{d}{dt} I(p(t), p(t_0))\Big|_{t = t_0} = 0 }

That is, in some sense ‘to first order’ no information is being gained at any moment t_0 \in \mathbb{R}. However, we have

\displaystyle{  \frac{d^2}{dt^2} I(p(t), p(t_0))\Big|_{t = t_0} =  g(\dot{p}(t_0), \dot{p}(t_0))}

So, the square of the Fisher speed has a nice interpretation in terms of relative entropy!

For a derivation of these last two equations, see Part 7 of my posts on information geometry. For more on the meaning of relative entropy, see Part 6.

The result

It’s now extremely easy to show what we want, but let me state it formally so all the assumptions are crystal clear.

Theorem. Suppose the functions P_i \colon \mathbb{R} \to (0,\infty) obey the Lotka–Volterra equations:

\displaystyle{ \dot P_i = f_i(P) P_i}

for some differentiable functions f_i \colon (0,\infty)^n \to \mathbb{R} called fitness functions. Define probabilities and the mean fitness as above, and define the variance of the fitness by

\displaystyle{ \mathrm{Var}(f(P)) =  \sum_j ( f_j(P) - \overline f(P))^2 \, p_j }

Then if none of the populations P_i are zero, the square of the Fisher speed of the probability distribution p(t) = (p_1(t), \dots , p_n(t)) is the variance of the fitness:

g(\dot{p}, \dot{p})  = \mathrm{Var}(f(P))

Proof. The proof is near-instantaneous. We take the square of the Fisher speed:

\displaystyle{ g(\dot{p}, \dot{p}) = \sum_{i=1}^n \frac{\dot{p}_i(t)^2}{p_i(t)} }

and plug in the replicator equation:

\displaystyle{ \dot{p}_i = (f_i(P) - \overline f(P)) p_i }

We obtain:

\begin{array}{ccl} \displaystyle{ g(\dot{p}, \dot{p})} &=&  \displaystyle{ \sum_{i=1}^n \left( f_i(P) - \overline f(P) \right)^2 p_i } \\ \\  &=& \mathrm{Var}(f(P))  \end{array}

as desired.   █

It’s hard to imagine anything simpler than this. We see that given the Lotka–Volterra equation, what causes information to be updated is nothing more and nothing less than variance in fitness!

The whole series:

Part 1: the obscurity of Fisher’s original paper.

Part 2: a precise statement of Fisher’s fundamental theorem of natural selection, and conditions under which it holds.

Part 3: a modified version of the fundamental theorem of natural selection, which holds much more generally.

Part 4: my paper on the fundamental theorem of natural selection.

Ascendancy vs. Reserve

22 September, 2020

Why is biodiversity ‘good’? To what extent is this sort of goodness even relevant to ecosystems—as opposed to us humans? I’d like to study this mathematically.

To do this, we’d need to extract some answerable questions out of the morass of subtlety and complexity. For example: what role does biodiversity play in the ability of ecosystems to be robust under sudden changes of external conditions? This is already plenty hard to study mathematically, since it requires understanding ‘biodiversity’ and ‘robustness’.

Luckily there has already been a lot of work on the mathematics of biodiversity and its connection to entropy. For example:

• Tom Leinster, Measuring biodiversity, Azimuth, 7 November 2011.

But how does biodiversity help robustness?

There’s been a lot of work on this. This paper has some inspiring passages:

• Robert E. Ulanowicz,, Sally J. Goerner, Bernard Lietaer and Rocio Gomez, Quantifying sustainability: Resilience, efficiency and the return of information theory, Ecological Complexity 6 (2009), 27–36.

I’m not sure the math lives up to their claims, but I like these lines:

In other words, (14) says that the capacity for a system to undergo evolutionary change or self-organization consists of two aspects: It must be capable of exercising sufficient directed power (ascendancy) to maintain its integrity over time. Simultaneously, it must possess a reserve of flexible actions that can be used to meet the exigencies of novel disturbances. According to (14) these two aspects are literally complementary.

The two aspects are ‘ascendancy’, which is something like efficiency or being optimized, and ‘reserve capacity’, which is something like random junk that might come in handy if something unexpected comes up.

You know those gadgets you kept in the back of your kitchen drawer and never needed… until you did? If you’re aiming for ‘ascendancy’ you’d clear out those drawers. But if you keep that stuff, you’ve got more ‘reserve capacity’. They both have their good points. Ideally you want to strike a wise balance. You’ve probably sensed this every time you clean out your house: should I keep this thing because I might need it, or should I get rid of it?

I think it would be great to make these concepts precise. The paper at hand attempts this by taking a matrix of nonnegative numbers T_{i j} to describe flows in an ecological network. They define a kind of entropy for this matrix, very similar in look to Shannon entropy. Then they write this as a sum of two parts: a part closely analogous to mutual information, and a part closely analogous to conditional entropy. This decomposition is standard in information theory. This is their equation (14).

If you want to learn more about the underlying math, click on this picture:

The new idea of these authors is that in the context of an ecological network, the mutual information can be understood as ‘ascendancy’, while the conditional entropy can be understood as ‘reserve capacity’.

I don’t know if I believe this! But I like the general idea of a balance between ascendancy and reserve capacity.

They write:

While the dynamics of this dialectic interaction can be quite subtle and highly complex, one thing is boldly clear—systems with either vanishingly small ascendancy or insignificant reserves are destined to perish before long. A system lacking ascendancy has neither the extent of activity nor the internal organization needed to survive. By contrast, systems that are so tightly constrained and honed to a particular environment appear ‘‘brittle’’ in the sense of Holling (1986) or ‘‘senescent’’ in the sense of Salthe (1993) and are prone to collapse in the face of even minor novel disturbances. Systems that endure—that is, are sustainable—lie somewhere between these extremes. But, where?

Entropy in the Universe

25 January, 2020

If you click on this picture, you’ll see a zoomable image of the Milky Way with 84 million stars:

But stars contribute only a tiny fraction of the total entropy in the observable Universe. If it’s random information you want, look elsewhere!

First: what’s the ‘observable Universe’, exactly?

The further you look out into the Universe, the further you look back in time. You can’t see through the hot gas from 380,000 years after the Big Bang. That ‘wall of fire’ marks the limits of the observable Universe.

But as the Universe expands, the distant ancient stars and gas we see have moved even farther away, so they’re no longer observable. Thus, the so-called ‘observable Universe’ is really the ‘formerly observable Universe’. Its edge is 46.5 billion light years away now!

This is true even though the Universe is only 13.8 billion years old. A standard challenge in understanding general relativity is to figure out how this is possible, given that nothing can move faster than light.

What’s the total number of stars in the observable Universe? Estimates go up as telescopes improve. Right now people think there are between 100 and 400 billion stars in the Milky Way. They think there are between 170 billion and 2 trillion galaxies in the Universe.

In 2009, Chas Egan and Charles Lineweaver estimated the total entropy of all the stars in the observable Universe at 1081 bits. You should think of these as qubits: it’s the amount of information to describe the quantum state of everything in all these stars.

But the entropy of interstellar and intergalactic gas and dust is about ten times more the entropy of stars! It’s about 1082 bits.

The entropy in all the photons in the Universe is even more! The Universe is full of radiation left over from the Big Bang. The photons in the observable Universe left over from the Big Bang have a total entropy of about 1090 bits. It’s called the ‘cosmic microwave background radiation’.

The neutrinos from the Big Bang also carry about 1090 bits—a bit less than the photons. The gravitons carry much less, about 1088 bits. That’s because they decoupled from other matter and radiation very early, and have been cooling ever since. On the other hand, photons in the cosmic microwave background radiation were formed by annihilating
electron-positron pairs until about 10 seconds after the Big Bang. Thus the graviton radiation is expected to be cooler than the microwave background radiation: about 0.6 kelvin as compared to 2.7 kelvin.

Black holes have immensely more entropy than anything listed so far. Egan and Lineweaver estimate the entropy of stellar-mass black holes in the observable Universe at 1098 bits. This is connected to why black holes are so stable: the Second Law says entropy likes to increase.

But the entropy of black holes grows quadratically with mass! So black holes tend to merge and form bigger black holes — ultimately forming the ‘supermassive’ black holes at the centers of most galaxies. These dominate the entropy of the observable Universe: about 10104 bits.

Hawking predicted that black holes slowly radiate away their mass when they’re in a cold enough environment. But the Universe is much too hot for supermassive black holes to be losing mass now. Instead, they very slowly grow by eating the cosmic microwave background, even when they’re not eating stars, gas and dust.

So, only in the far future will the Universe cool down enough for large black holes to start slowly decaying via Hawking radiation. Entropy will continue to increase… going mainly into photons and gravitons! This process will take a very long time. Assuming nothing is falling into it and no unknown effects intervene, a solar-mass black hole takes about 1067 years to evaporate due to Hawking radiation — while a really big one, comparable to the mass of a galaxy, should take about 1099 years.

If our current most popular ideas on dark energy are correct, the Universe will continue to expand exponentially. Thanks to this, there will be a cosmological event horizon surrounding each observer, which will radiate Hawking radiation at a temperature of roughly 10-30 kelvin.

In this scenario the Universe in the very far future will mainly consist of massless particles produced as Hawking radiation at this temperature: photons and gravitons. The entropy within the exponentially expanding ball of space that is today our ‘observable Universe’ will continue to increase exponentially… but more to the point, the entropy density will approach that of a gas of photons and gravitons in thermal equilibrium at 10-30 kelvin.

Of course, it’s quite likely that some new physics will turn up, between now and then, that changes the story! I hope so: this would be a rather dull ending to the Universe.

For more details, go here:

• Chas A. Egan and Charles H. Lineweaver, A larger estimate of the entropy of the universe, The Astrophysical Journal 710 (2010), 1825.

Also read my page on information.