Stirling’s Formula in Words

8 October, 2021


I recently sketched a quick proof of Stirling’s asymptotic formula for the factorial. But what does this formula really mean?

Above is my favorite explanation. You can’t see the number 2\pi in the words here, but it’s hiding in the formula for a Gaussian probability distribution. But what about the factorial, and the rest?

My description in words was informal. I’m really talking about a Poisson distribution. If raindrops land at an average rate r, this says that after time t the probability of k having landed is

\displaystyle{  \frac{(rt)^k e^{-rt}}{k!} }

This is where the factorial comes from, and also the number e.

Since the average rate of rainfall is r, at time t the expected number of drops that have landed will be rt. Since I said “wait until the expected number of drops that have landed is n”, we want rt = n. Then the probability of k having landed is

\displaystyle{  \frac{n^k e^{-n}}{k!} }

Next, what’s the formula for a Gaussian with mean n and standard deviation \sqrt{n}? Written as a function of k it’s

\displaystyle{    \frac{e^{-(k-n)^2/2n}}{\sqrt{2 \pi n}} }

If this matches the Poisson distribution above in the limit of large n, the two functions must match at the point k = n, at least asymptotically, so

\displaystyle{  \frac{n^n e^{-n}}{n!}  \sim  \frac{1}{\sqrt{2 \pi n}} }

And this becomes Stirling’s formula after a tiny bit of algebra!
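Rearranged, the relation above reads n! \sim \sqrt{2 \pi n} \, (n/e)^n. Here is a small numerical sanity check, offered as a sketch rather than as part of the argument. It compares the Poisson probability at k = n with the peak of the Gaussian, and n! with Stirling's approximation. The function names are made up for illustration, and the logarithm of the factorial comes from Python's math.lgamma so that large n does not overflow.

```python
import math

def poisson_at_mean(n):
    """Poisson probability of exactly n events when the expected number is n."""
    # log(n^n e^{-n} / n!), using lgamma(n + 1) = log(n!)
    return math.exp(n * math.log(n) - n - math.lgamma(n + 1))

def gaussian_peak(n):
    """Value at k = n of a Gaussian with mean n and standard deviation sqrt(n)."""
    return 1.0 / math.sqrt(2.0 * math.pi * n)

def stirling_ratio(n):
    """n! divided by Stirling's approximation sqrt(2 pi n) (n/e)^n."""
    log_stirling = 0.5 * math.log(2.0 * math.pi * n) + n * (math.log(n) - 1.0)
    return math.exp(math.lgamma(n + 1) - log_stirling)

if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, poisson_at_mean(n) / gaussian_peak(n), stirling_ratio(n))
```

Both ratios should drift toward 1 as n grows, which is all that the symbol \sim is claiming.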

I learned about this on Twitter: Ilya Razenshtyn showed how to prove Stirling’s formula starting from probability theory this way. But it’s much easier to use his ideas to check that my paragraph in words implies Stirling’s formula, as I’ve just done.

Information Geometry (Part 21)

17 August, 2021

Last time I ended with a formula for the ‘Gibbs distribution’: the probability distribution that maximizes entropy subject to constraints on the expected values of some observables.

This formula is well-known, but I’d like to derive it here. My argument won’t be up to the highest standards of rigor: I’ll do a bunch of computations, and it would take more work to state conditions under which these computations are justified. But even a nonrigorous approach is worthwhile, since the computations will give us more than the mere formula for the Gibbs distribution.

I’ll start by reminding you of what I claimed last time. I’ll state it in a way that removes all unnecessary distractions, so go back to Part 20 if you want more explanation.

The Gibbs distribution

Take a measure space \Omega with measure \mu. Suppose there is a probability distribution \pi on \Omega that maximizes the entropy

\displaystyle{ -\int_\Omega \pi(x) \ln \pi(x) \, d\mu(x) }

subject to the requirement that some integrable functions A^1, \dots, A^n on \Omega have expected values equal to some chosen list of numbers q^1, \dots, q^n.

(Unlike last time, now I’m writing A^i and q^i with superscripts rather than subscripts, because I’ll be using the Einstein summation convention: I’ll sum over any repeated index that appears once as a superscript and once as a subscript.)

Furthermore, suppose \pi depends smoothly on q \in \mathbb{R}^n. I’ll call it \pi_q to indicate its dependence on q. Then, I claim \pi_q is the so-called Gibbs distribution

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int_\Omega e^{-p_i A^i(x)} \, d\mu(x)}   }

where

\displaystyle{ p_i = \frac{\partial f(q)}{\partial q^i} }

and

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

is the entropy of \pi_q.

Let’s show this is true!

Finding the Gibbs distribution

So, we are trying to find a probability distribution \pi that maximizes entropy subject to these constraints:

\displaystyle{ \int_\Omega \pi(x) A^i(x) \, d\mu(x) = q^i }

We can solve this problem using Lagrange multipliers. We need one Lagrange multiplier, say \beta_i, for each of the above constraints. But it’s easiest if we start by letting \pi range over all of L^1(\Omega), that is, the space of all integrable functions on \Omega. Then, because we want \pi to be a probability distribution, we need to impose one extra constraint

\displaystyle{ \int_\Omega \pi(x) \, d\mu(x) = 1 }

To do this we need an extra Lagrange multiplier, say \gamma.

So, that’s what we’ll do! We’ll look for critical points of this function on L^1(\Omega):

\displaystyle{ - \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi\,  d\mu  }

Here I’m using some tricks to keep things short. First, I’m dropping the dummy variable x which appeared in all of the integrals we had: I’m leaving it implicit. Second, all my integrals are over \Omega so I won’t say that. And third, I’m using the Einstein summation convention, so there’s a sum over i implicit here.

Okay, now let’s do the variational derivative required to find a critical point of this function. When I was a math major taking physics classes, the way physicists did variational derivatives seemed like black magic to me. Then I spent months reading how mathematicians rigorously justified these techniques. I don’t feel like a massive digression into this right now, so I’ll just do the calculations—and if they seem like black magic, I’m sorry!

We need to find \pi obeying

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(- \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi \, d\mu \right) = 0 }

or in other words

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(\int \pi \ln \pi \, d\mu + \beta_i  \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 0 }

First we need to simplify this expression. The only part that takes any work, if you know how to do variational derivatives, is the first term. Since the derivative of z \ln z is 1 + \ln z, we have

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

The second and third terms are easy, so we get

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left( \int \pi \ln \pi \, d\mu + \beta_i \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) =}

        1 + \ln \pi(x) + \beta_i A^i(x) + \gamma

Thus, we need to solve this equation:

\displaystyle{ 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma  = 0}

That’s easy to do:

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

Good! It’s starting to look like the Gibbs distribution!

We now need to choose the Lagrange multipliers \beta_i and \gamma to make the constraints hold. To satisfy this constraint

\displaystyle{ \int \pi \, d\mu = 1 }

we must choose \gamma so that

\displaystyle{ \int  e^{-1 - \gamma - \beta_i A^i } \, d\mu = 1 }

or in other words

\displaystyle{ e^{1 + \gamma} = \int e^{- \beta_i A^i} \, d\mu }

Plugging this into our earlier formula

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

we get this:

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Great! Even more like the Gibbs distribution!
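To make this formula concrete, here is a minimal sketch for the simplest case: a finite set \Omega with counting measure, so that every integral becomes a finite sum. The function names and the data layout (a list of lists for the observables) are assumptions made for illustration only.

```python
import math

def gibbs(beta, observables):
    """Gibbs distribution on a finite set with counting measure.

    beta        -- list of Lagrange multipliers beta_i
    observables -- list of lists: observables[i][x] is A^i(x)
    """
    n_points = len(observables[0])
    weights = [
        math.exp(-sum(b * A[x] for b, A in zip(beta, observables)))
        for x in range(n_points)
    ]
    Z = sum(weights)  # the normalizing integral, here a finite sum
    return [w / Z for w in weights]

def expected_value(pi, A):
    """Expected value of the observable A with respect to pi."""
    return sum(p * a for p, a in zip(pi, A))
```

For example, gibbs([0.5], [[0.0, 1.0, 2.0, 3.0]]) gives a distribution on four points whose probabilities decrease as the observable increases, since the multiplier is positive.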

By the way, you must have noticed the “1” that showed up here:

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

It buzzed around like an annoying fly in the otherwise beautiful calculation, but eventually went away. This is the same irksome “1” that showed up in Part 19. Someday I’d like to say a bit more about it.

Now, where were we? We were trying to show that

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int e^{-p_i A^i} \, d\mu}   }

maximizes entropy subject to our constraints. So far we’ve shown

\displaystyle{ \pi(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

is a critical point. It’s clear that

\pi(x) \ge 0

so \pi really is a probability distribution. We should show it actually maximizes entropy subject to our constraints, but I will skip that. Given that, \pi will be our claimed Gibbs distribution \pi_q if we can show

p_i = \beta_i

This is interesting! It’s saying our Lagrange multipliers \beta_i actually equal the so-called conjugate variables p_i given by

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

where f(q) is the entropy of \pi_q:

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

There are two ways to show this: the easy way and the hard way. The easy way is to reflect on the meaning of Lagrange multipliers, and I’ll sketch that way first. The hard way is to use brute force: just compute p_i and show it equals \beta_i. This is a good test of our computational muscle—but more importantly, it will help us discover some interesting facts about the Gibbs distribution.

The easy way

Consider a simple Lagrange multiplier problem where you’re trying to find a critical point of a smooth function

f \colon \mathbb{R}^2 \to \mathbb{R}

subject to the constraint

g = c

for some smooth function

g \colon \mathbb{R}^2 \to \mathbb{R}

and constant c. (The function f here has nothing to do with the f in the previous sections.) To solve this problem we introduce a Lagrange multiplier \lambda and seek points where

\nabla ( f - \lambda g) = 0

This works because the above equation says

\nabla f = \lambda \nabla g

Geometrically this means we’re at a point where the gradient of f points at right angles to the level surface of g.

Thus, to first order we can’t change f by moving along the level surface of g.

But also, if we start at a point where

\nabla f = \lambda \nabla g

and we begin moving in any direction, the function f will change at a rate equal to \lambda times the rate of change of g. That’s just what the equation says! And this fact gives a conceptual meaning to the Lagrange multiplier \lambda.
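Here is a toy check of this interpretation, under assumptions chosen purely for convenience: f(x, y) = x + y and g(x, y) = x^2 + y^2, with c > 0. The constrained maximum sits at x = y = \sqrt{c/2}, and the code compares the Lagrange multiplier there with a finite-difference estimate of how fast the maximum value of f changes as c changes.

```python
import math

def constrained_max(c):
    """Maximize f(x, y) = x + y on the circle g(x, y) = x^2 + y^2 = c.

    Returns the maximizer (x, y), the value of f there, and the
    Lagrange multiplier lam solving grad f = lam * grad g.
    """
    x = y = math.sqrt(c / 2.0)
    lam = 1.0 / (2.0 * x)  # from (1, 1) = lam * (2x, 2y)
    return (x, y), x + y, lam

def rate_of_change(c, h=1e-6):
    """Central finite-difference estimate of d(max f)/dc."""
    _, f_plus, _ = constrained_max(c + h)
    _, f_minus, _ = constrained_max(c - h)
    return (f_plus - f_minus) / (2.0 * h)
```

At c = 2 the maximizer is (1, 1) and the multiplier is 1/2, and the finite-difference rate should agree with that to within numerical error.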

Our situation is more complicated, since our functions are defined on the infinite-dimensional space L^1(\Omega), and we have an n-tuple of constraints with an n-tuple of Lagrange multipliers. But the same principle holds.

So, when we are at a solution \pi_q of our constrained entropy-maximization problem, and we start moving the point \pi_q by changing the value of the ith constraint, namely q^i, the rate at which the entropy changes will be \beta_i times the rate of change of q^i. So, we have

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

But this is just what we needed to show!

The hard way

Here’s another way to show

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

We start by solving our constrained entropy-maximization problem using Lagrange multipliers. As already shown, we get

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Then we’ll compute the entropy

f(q) = - \int \pi_q \ln \pi_q \, d\mu

Then we’ll differentiate this with respect to q^i and show we get \beta_i.

Let’s try it! The calculation is a bit heavy, so let’s write Z(q) for the so-called partition function

\displaystyle{ Z(q) = \int e^{- \beta_i A^i} \, d\mu }

so that

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{Z(q)} }

and the entropy is

\begin{array}{ccl}  f(q) &=& - \displaystyle{ \int  \pi_q  \ln \left( \frac{e^{- \beta_k A^k}}{Z(q)} \right)  \, d\mu }  \\ \\  &=& \displaystyle{ \int \pi_q \left(\beta_k A^k + \ln Z(q) \right)  \, d\mu } \\ \\  \end{array}

This is the sum of two terms. The first term

\displaystyle{ \int \pi_q \beta_k A^k   \, d\mu =  \beta_k \int \pi_q A^k   \, d\mu}

is \beta_k times the expected value of A^k with respect to the probability distribution \pi_q, all summed over k. But the expected value of A^k is q^k, so we get

\displaystyle{ \int  \pi_q \beta_k A^k \, d\mu =  \beta_k q^k }

The second term is easier:

\displaystyle{ \int_\Omega  \pi_q \ln Z(q) \, d\mu = \ln Z(q) }

since \pi_q(x) integrates to 1 and the partition function Z(q) doesn’t depend on x \in \Omega.

Putting together these two terms we get an interesting formula for the entropy:

f(q) = \beta_k q^k + \ln Z(q)

This formula is one reason this brute-force approach is actually worthwhile! I’ll say more about it later.
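The formula f(q) = \beta_k q^k + \ln Z(q) is easy to test numerically. The sketch below does so on a finite set with counting measure, which is an assumption made to keep things simple. It computes the entropy directly from its definition and then again from the formula. The function name and the sample numbers are hypothetical.

```python
import math

def entropy_two_ways(beta, observables):
    """Return (entropy from the definition, entropy from beta_k q^k + ln Z).

    beta        -- list of multipliers beta_k
    observables -- list of lists: observables[k][x] is A^k(x)
    """
    n_points = len(observables[0])
    exponents = [
        sum(b * A[x] for b, A in zip(beta, observables))
        for x in range(n_points)
    ]
    Z = sum(math.exp(-e) for e in exponents)       # partition function
    pi = [math.exp(-e) / Z for e in exponents]     # Gibbs distribution
    direct = -sum(p * math.log(p) for p in pi)     # -sum pi ln pi
    q = [sum(p * a for p, a in zip(pi, A)) for A in observables]
    via_formula = sum(b * qk for b, qk in zip(beta, q)) + math.log(Z)
    return direct, via_formula
```

The two returned numbers should agree up to rounding for any choice of multipliers, for instance entropy_two_ways([0.3, -0.2], [[0, 1, 2, 3], [1, 0, 1, 4]]).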

But for now, let’s use this formula to show what we’re trying to show, namely

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

For starters,

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=& \displaystyle{\frac{\partial}{\partial q^i} \left(\beta_k q^k + \ln Z(q) \right) } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k   \frac{\partial q^k}{\partial q^i} + \frac{\partial}{\partial q^i} \ln Z(q)   }  \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k  \delta^k_i + \frac{\partial}{\partial q^i} \ln Z(q)   } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  }  \end{array}

where we played a little Kronecker delta game with the second term.

Now we just need to compute the third term:

\begin{array}{ccl}  \displaystyle{ \frac{\partial}{\partial q^i} \ln Z(q) } &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i} Z(q) } \\  \\  &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i}  \int e^{- \beta_j A^j}  \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i} \left(e^{- \beta_j A^j}\right) \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i}\left( - \beta_k A^k \right)  e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ -\frac{1}{Z(q)} \int \frac{\partial \beta_k}{\partial q^i}  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} \frac{1}{Z(q)} \int  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} q^k }  \end{array}

Ah, you don’t know how good it feels, after years of category theory, to be doing calculations like this again!

Now we can finish the job we started:

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  } \\ \\  &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i - \frac{\partial \beta_k}{\partial q^i} q^k  } \\ \\  &=& \beta_i  \end{array}
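If you’d like to see the result \partial f / \partial q^i = \beta_i numerically, here is a rough sketch with a single observable on a four-point set with counting measure. Both the set and the values of the observable are made up. Since q and f are easiest to compute as functions of \beta, the code estimates df/dq as the ratio of small changes in f and q produced by a small change in \beta.

```python
import math

A = [0.0, 1.0, 2.0, 3.0]  # a hypothetical observable on a 4-point set

def mean_and_entropy(beta):
    """Expected value q of A and entropy f of the Gibbs distribution."""
    weights = [math.exp(-beta * a) for a in A]
    Z = sum(weights)
    pi = [w / Z for w in weights]
    q = sum(p * a for p, a in zip(pi, A))
    f = -sum(p * math.log(p) for p in pi)
    return q, f

def entropy_slope(beta, h=1e-5):
    """Estimate df/dq at the point q(beta) as a ratio of central differences."""
    q_plus, f_plus = mean_and_entropy(beta + h)
    q_minus, f_minus = mean_and_entropy(beta - h)
    return (f_plus - f_minus) / (q_plus - q_minus)
```

For any \beta the value of entropy_slope(beta) should come out close to \beta itself. The ratio is well defined because q strictly decreases as \beta increases, as long as A is not constant.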



We’ve learned the formula for the probability distribution that maximizes entropy subject to some constraints on the expected values of observables. But more importantly, we’ve seen that the anonymous Lagrange multipliers \beta_i that show up in this problem are actually the partial derivatives of entropy! They equal

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

Thus, they are rich in meaning. From what we’ve seen earlier, they are ‘surprisals’. They are analogous to momentum in classical mechanics and have the meaning of intensive variables in thermodynamics:

      Classical Mechanics   Thermodynamics        Probability Theory
 q    position              extensive variables   probabilities
 p    momentum              intensive variables   surprisals
 S    action                entropy               Shannon entropy

Furthermore, by showing \beta_i = p_i the hard way we discovered an interesting fact. There’s a relation between the entropy and the logarithm of the partition function:

f(q) = p_i q^i + \ln Z(q)

(We proved this formula with \beta_i replacing p_i, but now we know those are equal.)

This formula suggests that the logarithm of the partition function is important—and it is! It’s closely related to the concept of free energy—even though ‘energy’, free or otherwise, doesn’t show up at the level of generality we’re working at now.

This formula should also remind you of the tautological 1-form on the cotangent bundle T^\ast Q, namely

\theta = p_i dq^i

It should remind you even more of the contact 1-form on the contact manifold T^\ast Q \times \mathbb{R}, namely

\alpha = -dS + p_i dq^i

Here S is a coordinate on the contact manifold that’s a kind of abstract stand-in for our entropy function f.

So, it’s clear there’s a lot more to say: we’re seeing hints of things here and there, but not yet the full picture.

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 20)

14 August, 2021

Last time we worked out an analogy between classical mechanics, thermodynamics and probability theory. The latter two look suspiciously similar:

      Classical Mechanics   Thermodynamics        Probability Theory
 q    position              extensive variables   probabilities
 p    momentum              intensive variables   surprisals
 S    action                entropy               Shannon entropy

This is no coincidence. After all, in the subject of statistical mechanics we explain classical thermodynamics using probability theory—and entropy is revealed to be Shannon entropy (or its quantum analogue).

Now I want to make this precise.

To connect classical thermodynamics to probability theory, I’ll start by discussing ‘statistical manifolds’. I introduced the idea of a statistical manifold in Part 7: it’s a manifold Q equipped with a map sending each point q \in Q to a probability distribution \pi_q on some measure space \Omega. Now I’ll say how these fit into the second column of the above chart.

Then I’ll talk about statistical manifolds of a special sort used in thermodynamics, which I’ll call ‘Gibbsian’, since they really go back to Josiah Willard Gibbs.

In a Gibbsian statistical manifold, for each q \in Q the probability distribution \pi_q is a ‘Gibbs distribution’. Physically, these Gibbs distributions describe thermodynamic equilibria. For example, if you specify the volume, energy and number of particles in a box of gas, there will be a Gibbs distribution describing what the particles do in thermodynamic equilibrium under these conditions. Mathematically, Gibbs distributions maximize entropy subject to some constraints specified by the point q \in Q.

More precisely: in a Gibbsian statistical manifold we have a list of observables A_1, \dots , A_n whose expected values serve as coordinates q_1, \dots, q_n for points q \in Q, and \pi_q is the probability distribution that maximizes entropy subject to the constraint that the expected value of A_i is q_i. We can derive most of the interesting formulas of thermodynamics starting from this!

Statistical manifolds

Let’s fix a measure space \Omega with measure \mu. A statistical manifold is then a manifold Q equipped with a smooth map \pi assigning to each point q \in Q a probability distribution on \Omega, which I’ll call \pi_q. So, \pi_q is a function on \Omega with

\displaystyle{ \int_\Omega \pi_q \, d\mu = 1 }

and

\pi_q(x) \ge 0

for all x \in \Omega.

The idea here is that the space of all probability distributions on \Omega may be too huge to understand in as much detail as we’d like, so instead we describe some of these probability distributions—a family parametrized by points of some manifold Q—using the map \pi. This is the basic idea behind parametric statistics.

Information geometry is the geometry of statistical manifolds. Any statistical manifold comes with a bunch of interesting geometrical structures. One is the ‘Fisher information metric’, a Riemannian metric I explained in Part 7. Another is a 1-parameter family of connections on the tangent bundle T Q, which is important in Amari’s approach to information geometry. You can read about this here:

• Hiroshi Matsuzoe, Statistical manifolds and affine differential geometry, in Advanced Studies in Pure Mathematics 57, pp. 303–321.

I don’t want to talk about it now—I just wanted to reassure you that I’m not completely ignorant of it!

I want to focus on the story I’ve been telling, which is about entropy. Our statistical manifold Q comes with a smooth entropy function

f \colon Q \to \mathbb{R}

given by

\displaystyle{  f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x)    }

We can use this entropy function to do many of the things we usually do in thermodynamics! For example, at any point q \in Q where this function is differentiable, its differential gives a cotangent vector

p = (df)_q

which has an important physical meaning. In coordinates we have

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and we call p_i the intensive variable conjugate to q_i. For example if q_i is energy, p_i will be ‘coolness’: the reciprocal of temperature.

Defining p this way gives a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q : \; q \in Q, \; p =  (df)_q \}

of the cotangent bundle T^\ast Q. We can also get contact geometry into the game by defining a contact manifold T^\ast Q \times \mathbb{R} and a Legendrian submanifold

\Sigma = \{ (q,p,S) \in T^\ast Q \times \mathbb{R} : \; q \in Q, \; p =  (df)_q , \; S = f(q) \}

But I’ve been talking about these ideas for the last three episodes, so I won’t say more just now! Instead, I want to throw a new idea into the pot.

Gibbsian statistical manifolds

Thermodynamics, and statistical mechanics, spend a lot of time dealing with statistical manifolds of a special sort I’ll call ‘Gibbsian’. In these, each probability distribution \pi_q is a ‘Gibbs distribution’, meaning that it maximizes entropy subject to certain constraints specified by the point q \in Q.

How does this work? For starters, an integrable function

A \colon \Omega \to \mathbb{R}

is called a random variable, or in physics an observable. The expected value of an observable is a smooth real-valued function on our statistical manifold

\langle A \rangle \colon Q \to \mathbb{R}

given by

\displaystyle{ \langle A \rangle(q) = \int_\Omega A(x) \pi_q(x) \, d\mu(x) }

In other words, \langle A \rangle is a function whose value at any point q \in Q is the expected value of A with respect to the probability distribution \pi_q.

Now, suppose our statistical manifold is n-dimensional and we have n observables A_1, \dots, A_n. Their expected values will be smooth functions on our manifold—and sometimes these functions will be a coordinate system!

This may sound rather unlikely, but it’s really not so outlandish. Indeed, if there’s a point q such that the differentials of the functions \langle A_i \rangle are linearly independent at this point, these functions will be a coordinate system in some neighborhood of this point, by the inverse function theorem. So, we can take this neighborhood, use it as our statistical manifold, and the functions \langle A_i \rangle will be coordinates.

So, let’s assume the expected values of our observables give a coordinate system on Q. Let’s call these coordinates q_1, \dots, q_n, so that

\langle A_i \rangle(q) = q_i

Now for the kicker: we say our statistical manifold is Gibbsian if for each point q \in Q, \pi_q is the probability distribution that maximizes entropy subject to the above condition!

Which condition? The condition saying that

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

for all i. This is just the previous equation spelled out so that you can see it’s a condition on \pi_q.

This assumption of the entropy-maximizing nature of \pi_q is very powerful, because it implies a useful and nontrivial formula for \pi_q. It’s called the Gibbs distribution:

\displaystyle{  \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }

for all x \in \Omega.

Here p_i is the intensive variable conjugate to q_i, while Z(q) is the partition function: the thing we must divide by to make sure \pi_q integrates to 1. In other words:

\displaystyle{ Z(q) = \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

By the way, this formula may look confusing at first, since the left side depends on the point q in our statistical manifold, while there’s no q visible in the right side! Do you see what’s going on?

I’ll tell you: the conjugate variable p_i, sitting on the right side of the above formula, depends on q. Remember, we got it by taking the partial derivative of the entropy in the q_i direction

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and then evaluating this derivative at the point q.

But wait a minute! f here is the entropy—but the entropy of what?

The entropy of \pi_q, of course!

So there’s something circular about our formula for \pi_q. To know \pi_q, you need to know the conjugate variables p_i, but to compute these you need to know the entropy of \pi_q.

This is actually okay. While circular, the formula for \pi_q is still true. It’s harder to work with than you might hope. But it’s still extremely useful.
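In practice one common way around the circularity is to skip the entropy entirely and solve for p_i directly from the constraint that the expected value of A_i is q_i. The following is only a sketch of that idea, under strong simplifying assumptions: a finite set with counting measure, a single observable A, and a target q strictly between the smallest and largest values of A. It uses bisection, relying on the fact that the expected value of A under the Gibbs distribution is a decreasing function of p. The function names and the search interval are hypothetical choices.

```python
import math

def mean_of_A(p, A):
    """Expected value of A under the Gibbs distribution with multiplier p."""
    weights = [math.exp(-p * a) for a in A]
    Z = sum(weights)
    return sum(w * a for w, a in zip(weights, A)) / Z

def conjugate_variable(q, A, lo=-50.0, hi=50.0, tol=1e-12):
    """Find p with <A> = q by bisection on the interval [lo, hi].

    Assumes min(A) < q < max(A) and that the solution lies in [lo, hi].
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_of_A(mid, A) > q:
            lo = mid   # mean too large, so p must increase
        else:
            hi = mid   # mean too small (or exact), so p must not increase
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

For A = [0, 1, 2, 3] and q = 1.5 the answer should be p = 0, since the uniform distribution already has mean 1.5. Lowering q below 1.5 should give a positive p.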

Next time I’ll prove that this formula for \pi_q is true, and do a few things with it. All this material was discovered by Gibbs in the late 1800s, and it’s lurking in any good book on statistical mechanics, though not phrased in the language of statistical manifolds. The physics textbooks usually consider special cases, like a box of gas where:

q_1 is energy, p_1 is 1/temperature.

q_2 is volume, p_2 is pressure/temperature.

q_3 is the number of particles, p_3 is –chemical potential/temperature.

While these special cases are important and interesting, I’d rather be general!

Technical comments

I said “Any statistical manifold comes with a bunch of interesting geometrical structures”, but in fact some conditions are required. For example, the Fisher information metric is only well-defined and nondegenerate under some conditions on the map \pi. For instance, if \pi maps every point of Q to the same probability distribution, the Fisher information metric will vanish.

Similarly, the entropy function f is only smooth under some conditions on \pi.

Furthermore, the integral

\displaystyle{ \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

may not converge for all values of the numbers p_1, \dots, p_n. But in my discussion of Gibbsian statistical manifolds, I was assuming that an entropy-maximizing probability distribution \pi_q with

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

actually exists. In this case the probability distribution is also unique (almost everywhere).

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 19)

8 August, 2021

Last time I figured out the analogue of momentum in probability theory, but I didn’t say what it’s called. Now I will tell you—thanks to some help from Abel Jansma and Toby Bartels.


It’s called ‘surprisal’. This is a well-known concept in information theory. It’s also called ‘information content’.

Let’s see why. First, let’s remember the setup. We have a manifold

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

whose points q are nowhere vanishing probability distributions on the set \{1, \dots, n\}. We have a function

f \colon Q \to \mathbb{R}

called the Shannon entropy, defined by

\displaystyle{ f(q) = - \sum_{j = 1}^n q_j \ln q_j }

For each point q \in Q we define a cotangent vector p \in T^\ast_q Q by

p = (df)_q

As mentioned last time, this is the analogue of momentum in probability theory. In the second half of this post I’ll say more about exactly why. But first let’s compute it and see what it actually equals!

Let’s start with a naive calculation, acting as if the probabilities q_1, \dots, q_n were a coordinate system on the manifold Q. We get

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

so using the definition of the Shannon entropy we have

\begin{array}{ccl}  p_i &=& \displaystyle{ -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j  }\\ \\  &=& \displaystyle{ -\frac{\partial}{\partial q_i} \left( q_i \ln q_i \right) } \\ \\  &=& -\ln(q_i) - 1  \end{array}

Now, the quantity -\ln q_i is called the surprisal of the probability distribution at i. Intuitively, it’s a measure of how surprised you should be if an event of probability q_i occurs. For example, if you flip a fair coin and it lands heads up, your surprisal is ln 2. If you flip 100 fair coins and they all land heads up, your surprisal is 100 times ln 2.
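The coin examples are easy to check with a couple of lines of Python. This is just an illustrative sketch, and the function name is made up.

```python
import math

def surprisal(q):
    """Surprisal -ln q of an event of probability q, with 0 < q <= 1."""
    return -math.log(q)

one_head = surprisal(0.5)                # a fair coin landing heads up
hundred_heads = surprisal(0.5 ** 100)    # 100 fair coins all landing heads up
```

Here one_head equals ln 2 and hundred_heads equals 100 ln 2, up to rounding. The second fact is an instance of surprisals adding for independent events.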

Of course ‘surprise’ is a psychological term, not a term from math or physics, so we shouldn’t take it too seriously here. We can derive the concept of surprisal from three axioms:

  1. The surprisal of an event of probability q is some function of q, say F(q).
  2. The less probable an event is, the larger its surprisal is: q_1 \le q_2 \implies  F(q_1) \ge F(q_2).
  3. The surprisal of two independent events is the sum of their surprisals: F(q_1 q_2) = F(q_1) + F(q_2).

It follows from work on Cauchy’s functional equation that F must be of this form:

F(q) = - \log_b q

for some constant b > 1. We shall choose b, the base of our logarithms, to be e. We had a similar freedom of choice in defining the Shannon entropy, and we will use base e for both to be consistent. If we chose something else, it would change the surprisal and the Shannon entropy by the same constant factor.

So far, so good. But what about the irksome “-1” in our formula?

p_i = -\ln(q_i) - 1

Luckily it turns out we can just get rid of this! The reason is that the probabilities q_i are not really coordinates on the manifold Q. They’re not independent: they must sum to 1. So, when we change them a little, the sum of their changes must vanish. Putting it more technically, the tangent space T_q Q is not all of \mathbb{R}^n, but just the subspace consisting of vectors whose components sum to zero:

\displaystyle{ T_q Q = \{ v \in \mathbb{R}^n : \; \sum_{j = 1}^n v_j = 0 \} }

The cotangent space is the dual of the tangent space. The dual of a subspace

S \subseteq V

is the quotient space

V^\ast/\{ \ell \colon V \to \mathbb{R} : \; \forall v \in S \; \, \ell(v) = 0 \}

The cotangent space T_q^\ast Q thus consists of linear functionals \ell \colon \mathbb{R}^n \to \mathbb{R} modulo those that vanish on vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

Of course, we can identify the dual of \mathbb{R}^n with \mathbb{R}^n in the usual way, using the Euclidean inner product: a vector u \in \mathbb{R}^n corresponds to the linear functional

\displaystyle{ \ell(v) = \sum_{j = 1}^n u_j v_j }

From this, you can see that a linear functional \ell vanishes on all vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

if and only if its corresponding vector u has

u_1 = \cdots = u_n

So, we get

T^\ast_q Q \cong \mathbb{R}^n/\{ u \in \mathbb{R}^n : \; u_1 = \cdots = u_n \}

In words: we can describe cotangent vectors to Q as lists of n numbers if we want, but we have to remember that adding the same constant to each number in the list doesn’t change the cotangent vector!

This suggests that our naive formula

p_i = -\ln(q_i) - 1

is on the right track, but we’re free to get rid of the constant 1 if we want! And that’s true.

To check this rigorously, we need to show

\displaystyle{ p(v) = -\sum_{j=1}^n \ln(q_j) v_j}

for all v \in T_q Q. We compute:

\begin{array}{ccl}  p(v) &=& df(v) \\ \\  &=& v(f) \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j \, \frac{\partial f}{\partial q_j} } \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j (-\ln(q_j) - 1) } \\ \\  &=& \displaystyle{ -\sum_{j=1}^n \ln(q_j) v_j }  \end{array}

where in the second to last step we used our earlier calculation:

\displaystyle{ \frac{\partial f}{\partial q_i} = -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j = -\ln(q_i) - 1 }

and in the last step we used

\displaystyle{ \sum_j v_j = 0 }
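Here is a quick numerical illustration of why the constant drops out. It is a sketch with an arbitrarily chosen point q and tangent vector v, and all names in it are hypothetical. It pairs v with the naive components -\ln(q_j) - 1 and with the surprisals -\ln(q_j), and also compares against a finite-difference derivative of the Shannon entropy in the direction v.

```python
import math

def shannon_entropy(q):
    """Shannon entropy -sum q_j ln q_j of a nowhere vanishing distribution."""
    return -sum(qi * math.log(qi) for qi in q)

def pairing(u, v):
    """Apply the linear functional with components u to the vector v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def directional_derivative(q, v, t=1e-6):
    """Central-difference derivative of the entropy at q in direction v."""
    plus = [qi + t * vi for qi, vi in zip(q, v)]
    minus = [qi - t * vi for qi, vi in zip(q, v)]
    return (shannon_entropy(plus) - shannon_entropy(minus)) / (2.0 * t)

q = [0.1, 0.2, 0.3, 0.4]          # a point of Q (arbitrary example)
v = [0.05, -0.02, -0.04, 0.01]    # a tangent vector: components sum to zero

naive = [-math.log(qi) - 1.0 for qi in q]   # with the irksome -1
surprisals = [-math.log(qi) for qi in q]    # with the -1 dropped
```

Both pairing(naive, v) and pairing(surprisals, v) should match directional_derivative(q, v) up to numerical error, because the two lists differ by a constant and the components of v sum to zero.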

Back to the big picture

Now let’s take stock of where we are. We can fill in the question marks in the charts from last time, and combine those charts while we’re at it.

      Classical Mechanics   Thermodynamics        Probability Theory
 q    position              extensive variables   probabilities
 p    momentum              intensive variables   surprisals
 S    action                entropy               Shannon entropy

What’s going on here? In classical mechanics, action is minimized (or at least the system finds a critical point of the action). In thermodynamics, entropy is maximized. In the maximum entropy approach to probability, Shannon entropy is maximized. This leads to a mathematical analogy that’s quite precise. For classical mechanics and thermodynamics, I explained it here:

Classical mechanics versus thermodynamics (part 1).

Classical mechanics versus thermodynamics (part 2).

These posts may give a more approachable introduction to what I’m doing now: now I’m bringing probability theory into the analogy, with a big emphasis on symplectic and contact geometry.

Let me spell out a bit of the analogy more carefully:

Classical Mechanics. In classical mechanics, we have a manifold Q whose points are positions of a particle. There’s an important function on this manifold: Hamilton’s principal function

f \colon Q \to \mathbb{R}

What’s this? It’s basically action: f(q) is the action of the least-action path from the position q_0 at some earlier time t_0 to the position q at time 0. The Hamilton–Jacobi equations say the particle’s momentum p at time 0 is given by

p = (df)_q

Thermodynamics. In thermodynamics, we have a manifold Q whose points are equilibrium states of a system. The coordinates of a point q \in Q are called extensive variables. There’s an important function on this manifold: the entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the intensive variables corresponding to the extensive variables.

Probability Theory. In probability theory, we have a manifold Q whose points are nowhere vanishing probability distributions on a finite set. The coordinates of a point q \in Q are probabilities. There’s an important function on this manifold: the Shannon entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the surprisals corresponding to the probabilities.

In all three cases, T^\ast Q is a symplectic manifold and imposing the constraint p = (df)_q picks out a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q: \; p = (df)_q \}

There is also a contact manifold T^\ast Q \times \mathbb{R} where the extra dimension comes with an extra coordinate S that means

• action in classical mechanics,
• entropy in thermodynamics, and
• Shannon entropy in probability theory.

We can then decree that S = f(q) along with p = (df)_q, and these constraints pick out a Legendrian submanifold

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

There’s a lot more to do with these ideas, and I’ll continue next time.

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 18)

5 August, 2021

Last time I sketched how two related forms of geometry, symplectic and contact geometry, show up in thermodynamics. Today I want to explain how they show up in probability theory.

For some reason I haven’t seen much discussion of this! But people should have looked into this. After all, statistical mechanics explains thermodynamics in terms of probability theory, so if some mathematical structure shows up in thermodynamics it should appear in statistical mechanics… and thus ultimately in probability theory.

I just figured out how this works for symplectic and contact geometry.

Suppose a system has n possible states. We’ll call these microstates, following the tradition in statistical mechanics. If you don’t know what ‘microstate’ means, don’t worry about it! But the rough idea is that if you have a macroscopic system like a rock, the precise details of what its atoms are doing are described by a microstate, and many different microstates could be indistinguishable unless you look very carefully.

We’ll call the microstates 1, 2, \dots, n. So, if you don’t want to think about physics, when I say microstate I’ll just mean an integer from 1 to n.

Next, a probability distribution q assigns a real number q_i to each microstate, and these numbers must sum to 1 and be nonnegative. So, we have q \in \mathbb{R}^n, though not every vector in \mathbb{R}^n is a probability distribution.

I’m sure you’re wondering why I’m using q rather than p to stand for a probability distribution. Am I just trying to confuse you?

No: I’m trying to set up an analogy to physics!

Last time I introduced symplectic geometry using classical mechanics. The most important example of a symplectic manifold is the cotangent bundle T^\ast Q of a manifold Q. A point of T^\ast Q is a pair (q,p) consisting of a point q \in Q and a cotangent vector p \in T^\ast_q Q. In classical mechanics the point q describes the position of some physical system, while p describes its momentum.

So, I’m going to set up an analogy like this:

     Classical Mechanics   Probability Theory
q    position              probability distribution
p    momentum              ???

But what is to momentum as probability is to position?

A big clue is the appearance of symplectic geometry in thermodynamics, which I also outlined last time. We can use this to get some intuition about the analogue of momentum in probability theory.

In thermodynamics, a system has a manifold Q of states. (These are not the ‘microstates’ I mentioned before: we’ll see the relation later.) There is a function

f \colon Q \to \mathbb{R}

describing the entropy of the system as a function of its state. There is a law of thermodynamics saying that

p = (df)_q

This equation picks out a submanifold of T^\ast Q, namely

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

Moreover this submanifold is Lagrangian: the symplectic structure \omega vanishes when restricted to it:

\displaystyle{ \omega |_\Lambda = 0 }

This is very beautiful, but it goes by so fast you might almost miss it! So let’s clutter it up a bit with coordinates. We often use local coordinates on Q and describe a point q \in Q using these coordinates, getting a point

(q_1, \dots, q_n) \in \mathbb{R}^n

They give rise to local coordinates q_1, \dots, q_n, p_1, \dots, p_n on the cotangent bundle T^\ast Q. The q_i are called extensive variables, because they are typically things that you can measure only by totalling up something over the whole system, like the energy or volume of a cylinder of gas. The p_i are called intensive variables, because they are typically things that you can measure locally at any point, like temperature or pressure.

In these local coordinates, the symplectic structure on T^\ast Q is the 2-form given by

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

The equation

p = (df)_q

serves as a law of physics that determines the intensive variables given the extensive ones when our system is in thermodynamic equilibrium. Written out using coordinates, this law says

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

It looks pretty bland here, but in fact it gives formulas for the temperature and pressure of a gas, and many other useful formulas in thermodynamics.

Now we are ready to see how all this plays out in probability theory! We’ll get an analogy like this, which goes hand-in-hand with our earlier one:

     Thermodynamics        Probability Theory
q    extensive variables   probability distribution
p    intensive variables   ???

This analogy is clearer than the last, because statistical mechanics reveals that the extensive variables in thermodynamics are really just summaries of probability distributions on microstates. Furthermore, both thermodynamics and probability theory have a concept of entropy.

So, let’s take our manifold Q to consist of probability distributions on the set of microstates I was talking about before: the set \{1, \dots, n\}. Actually, let’s use nowhere vanishing probability distributions:

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

I’m requiring q_i > 0 to ensure Q is a manifold, and also to make sure f is differentiable: it ceases to be differentiable when one of the probabilities q_i hits zero.

Since Q is a manifold, its cotangent bundle is a symplectic manifold T^\ast Q. And here’s the good news: we have a god-given entropy function

f \colon Q \to \mathbb{R}

namely the Shannon entropy

\displaystyle{ f(q) = - \sum_{i = 1}^n q_i \ln q_i }

So, everything I just described about thermodynamics works in the setting of plain old probability theory! Starting from our manifold Q and the entropy function, we get all the rest, leading up to the Lagrangian submanifold

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

that describes the relation between extensive and intensive variables.

For computations it helps to pick coordinates on Q. Since the probabilities q_1, \dots, q_n sum to 1, they aren’t independent coordinates on Q. So, we can either pick all but one of them as coordinates, or learn how to deal with non-independent coordinates, which are already completely standard in projective geometry. Let’s do the former, just to keep things simple.

These coordinates on Q give rise in the usual way to coordinates q_i and p_i on the cotangent bundle T^\ast Q. These play the role of extensive and intensive variables, respectively, and it should be very interesting to impose the equation

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

where f is the Shannon entropy. This picks out a Lagrangian submanifold \Lambda \subseteq T^\ast Q.

So, the question becomes: what does this mean? If this formula gives the analogue of momentum for probability theory, what does this analogue of momentum mean?

Here’s a preliminary answer: p_i says how fast entropy increases as we increase the probability q_i that our system is in the ith microstate. So if we think of nature as ‘wanting’ to maximize entropy, the quantity p_i says how eager it is to increase the probability q_i.

Indeed, you can think of p_i as a bit like pressure—one of the most famous intensive quantities in thermodynamics. A gas ‘wants’ to expand, and its pressure says precisely how eager it is to expand. Similarly, a probability distribution ‘wants’ to flatten out, to maximize entropy, and p_i says how eager it is to increase the probability q_i in order to do this.
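To make this concrete, here is a small sketch under the coordinate choice made above: use q_1, \dots, q_{n-1} as coordinates and let q_n = 1 - (q_1 + \cdots + q_{n-1}). Differentiating the Shannon entropy in these coordinates gives p_i = \ln(q_n) - \ln(q_i). The code compares this closed form with a finite-difference derivative. The sample point and the step size are illustrative assumptions.

```python
import math

def entropy_reduced(x):
    """Shannon entropy in the coordinates x = (q_1, ..., q_{n-1}); q_n is determined."""
    q = list(x) + [1.0 - sum(x)]
    return -sum(q_i * math.log(q_i) for q_i in q)

def p_exact(x):
    """Closed form for the partial derivatives in these coordinates: ln(q_n / q_i)."""
    q_n = 1.0 - sum(x)
    return [math.log(q_n / x_i) for x_i in x]

def p_numeric(x, h=1e-6):
    """Central finite differences of the entropy, one coordinate at a time."""
    result = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        result.append((entropy_reduced(up) - entropy_reduced(down)) / (2 * h))
    return result

x = [0.5, 0.3]   # hypothetical point: q = (0.5, 0.3, 0.2)
max_error = max(abs(a - b) for a, b in zip(p_exact(x), p_numeric(x)))
print(max_error)   # small: the closed form matches the numerical derivative
```

In these coordinates p_i is zero exactly when q_i = q_n, so every p_i vanishes at the uniform distribution, where the entropy is largest and there is no remaining ‘eagerness’ to shift probability.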

But what can we do with this concept? And what does symplectic geometry do for probability theory?

I will start tackling these questions next time.

One thing I’ll show is that when we reduce thermodynamics to probability theory using the ideas of statistical mechanics, the appearance of symplectic geometry in thermodynamics follows from its appearance in probability theory.

Another thing I want to investigate is how other geometrical structures on the space of probability distributions, like the Fisher information metric, interact with the symplectic structure on its cotangent bundle. This will integrate symplectic geometry and information geometry.

I also want to bring contact geometry into the picture. It’s already easy to see from our work last time how this should go. We treat the entropy S as an independent variable, and replace T^\ast Q with a larger manifold T^\ast Q \times \mathbb{R} having S as an extra coordinate. This is a contact manifold with contact form

\alpha = -dS + p_1 dq_1 + \cdots + p_n dq_n

This contact manifold has a submanifold \Sigma where we remember that entropy is a function of the probability distribution q, and define p in terms of q as usual:

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

And as we saw last time, \Sigma is a Legendrian submanifold, meaning

\displaystyle{ \alpha|_{\Sigma} = 0 }

But again, we want to understand what these ideas from contact geometry really do for probability theory!

For all my old posts on information geometry, go here:

Information geometry.

Information Geometry (Part 17)

27 July, 2021

I’m getting back into information geometry, which is the geometry of the space of probability distributions, studied using tools from information theory. I’ve written a bunch about it already, which you can see here:

Information geometry.

Now I’m fascinated by something new: how symplectic geometry and contact geometry show up in information geometry. But before I say anything about this, let me say a bit about how they show up in thermodynamics. This is more widely discussed, and it’s a good starting point.

Symplectic geometry was born as the geometry of phase space in classical mechanics: that is, the space of possible positions and momenta of a classical system. The simplest example of a symplectic manifold is the vector space \mathbb{R}^{2n}, with n position coordinates q_i and n momentum coordinates p_i.

It turns out that symplectic manifolds are always even-dimensional, because we can always cover them with coordinate charts that look like \mathbb{R}^{2n}. When we change coordinates, it turns out that the splitting of coordinates into positions and momenta is somewhat arbitrary. For example, the position of a rock on a spring now may determine its momentum a while later, and vice versa. What’s not arbitrary? It’s the so-called ‘symplectic structure’:

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

While far from obvious at first, we know by now that the symplectic structure is exactly what needs to be preserved under valid changes of coordinates in classical mechanics! In fact, we can develop the whole formalism of classical mechanics starting from a manifold with a symplectic structure.

Symplectic geometry also shows up in thermodynamics. In thermodynamics we can start with a system in equilibrium whose state is described by some variables q_1, \dots, q_n. Its entropy will be a function of these variables, say

S = f(q_1, \dots, q_n)

We can then take the partial derivatives of entropy and call them something:

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

These new variables p_i are said to be ‘conjugate’ to the q_i, and they turn out to be very interesting. For example, if q_i is energy then p_i is ‘coolness’: the reciprocal of temperature. The coolness of a system is its change in entropy per change in energy.
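As an illustration, here is a sketch using the standard entropy of a monatomic ideal gas, which up to an additive constant depending only on the number of atoms N is N k (\ln V + \frac{3}{2} \ln E). This example is my own addition, and the temperature and volume below are made-up values. The code differentiates the entropy with respect to energy to get the coolness, then inverts it to recover the temperature.

```python
import math

K_B = 1.380649e-23   # Boltzmann constant in J/K

def entropy(E, V, N):
    """Monatomic ideal gas entropy, up to an additive constant depending only on N."""
    return N * K_B * (math.log(V) + 1.5 * math.log(E))

def coolness(E, V, N, h_rel=1e-6):
    """p = dS/dE, by a central finite difference with relative step h_rel."""
    h = E * h_rel
    return (entropy(E + h, V, N) - entropy(E - h, V, N)) / (2 * h)

N = 6.02214076e23        # one mole of atoms
T = 300.0                # assumed temperature in kelvin
E = 1.5 * N * K_B * T    # energy of a monatomic ideal gas at that temperature
V = 0.0248               # an illustrative volume in cubic metres

temperature = 1.0 / coolness(E, V, N)
print(temperature)       # close to 300
```

In the same convention, differentiating with respect to volume gives N k / V, which is pressure divided by temperature.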

Often the variables q_i are ‘extensive’: that is, you can measure them only by looking at your whole system and totaling up some quantity. Examples are energy and volume. Then the new variables p_i are ‘intensive’: that is, you can measure them at any one location in your system. Examples are coolness and pressure.

Now for a twist: sometimes we do not know the function f ahead of time. Then we cannot define the p_i as above. We’re forced into a different approach where we treat them as independent quantities, at least until someone tells us what f is.

In this approach, we start with a space \mathbb{R}^{2n} having n coordinates called q_i and n coordinates called p_i. This is a symplectic manifold, with the symplectic structure \omega described earlier!

But what about the entropy? We don’t yet know what it is as a function of the q_i, but we may still want to talk about it. So, we build a space \mathbb{R}^{2n+1} having one extra coordinate S in addition to the q_i and p_i. This new coordinate stands for entropy. And this new space has an important 1-form on it:

\alpha = -dS + p_1 dq_1 + \cdots + p_n dq_n

This is called the ‘contact 1-form’.

This makes \mathbb{R}^{2n+1} into an example of a ‘contact manifold’. Contact geometry is the odd-dimensional partner of symplectic geometry. Just as symplectic manifolds are always even-dimensional, contact manifolds are always odd-dimensional.

What is the point of the contact 1-form? Well, suppose someone tells us the function f relating entropy to the coordinates q_i. Now we know that we want

S = f

and also

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

So, we can impose these equations, which pick out a subset of \mathbb{R}^{2n+1}. You can check that this subset, say \Sigma, is an n-dimensional submanifold. But even better, the contact 1-form vanishes when restricted to this submanifold:

\left.\alpha\right|_\Sigma = 0

Let’s see why! Suppose x \in \Sigma and suppose v \in T_x \Sigma is a vector tangent to \Sigma at this point x. It suffices to show

\alpha(v) = 0

Using the definition of \alpha this equation says

\displaystyle{ -dS(v) + \sum_i p_i dq_i(v) = 0 }

But on the surface \Sigma we have

S = f, \qquad  \displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

So, the equation we’re trying to show can be written as

\displaystyle{ -df(v) + \sum_i \frac{\partial f}{\partial q_i} dq_i(v) = 0 }

But this follows from

\displaystyle{ df = \sum_i \frac{\partial f}{\partial q_i} dq_i }

which holds because f is a function only of the coordinates q_i.
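The argument above can be checked numerically. The sketch below uses a made-up function f of two variables and a made-up curve q(t), then lifts the curve to \Sigma by imposing S = f and p_i = \partial f / \partial q_i. It applies \alpha to the velocity of the lifted curve, with the derivatives of S and q taken by central differences.

```python
import math

def f(q):
    """A made-up entropy function of two variables, for illustration only."""
    return q[0] * math.sqrt(q[1]) + math.log(q[0])

def grad_f(q):
    """Exact partial derivatives of the function above."""
    return [math.sqrt(q[1]) + 1.0 / q[0], q[0] / (2.0 * math.sqrt(q[1]))]

def point_on_sigma(t):
    """A curve in Sigma: pick q(t), then impose S = f(q) and p_i = df/dq_i."""
    q = [1.0 + t, 2.0 + t * t]
    return q, grad_f(q), f(q)

def alpha_on_velocity(t, h=1e-6):
    """Apply alpha = -dS + sum_i p_i dq_i to the curve's velocity."""
    q_plus, _, S_plus = point_on_sigma(t + h)
    q_minus, _, S_minus = point_on_sigma(t - h)
    _, p, _ = point_on_sigma(t)
    dS = (S_plus - S_minus) / (2 * h)
    dq = [(a - b) / (2 * h) for a, b in zip(q_plus, q_minus)]
    return -dS + sum(p_i * dq_i for p_i, dq_i in zip(p, dq))

values = [alpha_on_velocity(t) for t in (0.0, 0.5, 1.0)]
print(values)   # each entry is zero up to finite-difference error
```

Only the velocities of S and q enter, since \alpha contains no dp_i terms. That is why the chain rule for f is all we need.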

So, any formula for entropy S = f(q_1, \dots, q_n) picks out a so-called ‘Legendrian submanifold’ of \mathbb{R}^{2n+1}: that is, an n-dimensional submanifold such that the contact 1-form vanishes when restricted to this submanifold. And the idea is that this submanifold tells you everything you need to know about a thermodynamic system.

Indeed, V. I. Arnol’d says this was implicitly known to the great founder of statistical mechanics, Josiah Willard Gibbs. Arnol’d calls \mathbb{R}^5 with coordinates energy, entropy, temperature, pressure and volume the ‘Gibbs manifold’, and he proclaims:

Gibbs’ thesis: substances are Legendrian submanifolds of the Gibbs manifold.

This is from here:

• V. I. Arnol’d, Contact geometry: the geometrical method of Gibbs’ thermodynamics, Proceedings of the Gibbs Symposium (New Haven, CT, 1989), AMS, Providence, Rhode Island, 1990.

A bit more detail

Now I want to say everything again, with a bit of extra detail, assuming more familiarity with manifolds. Above I was using \mathbb{R}^n with coordinates q_1, \dots, q_n to describe the ‘extensive’ variables of a thermodynamic system. But let’s be a bit more general and use any smooth n-dimensional manifold Q. Even if Q is a vector space, this viewpoint is nice because it’s manifestly coordinate-independent!

So: starting from Q we build the cotangent bundle T^\ast Q. A point in the cotangent bundle describes both extensive variables, namely q \in Q, and ‘intensive’ variables, namely a cotangent vector p \in T^\ast_q Q.

The manifold T^\ast Q has a 1-form \theta on it called the tautological 1-form. We can describe it as follows. Given a tangent vector v \in T_{(q,p)} T^\ast Q we have to say what \theta(v) is. Using the projection

\pi \colon T^\ast Q \to Q

we can project v down to a tangent vector d\pi(v) at the point q. But the 1-form p eats tangent vectors at q and spits out numbers! So, we set

\theta(v) = p(d\pi(v))

This is sort of mind-boggling at first, but it’s worth pondering until it makes sense. It helps to work out what \theta looks like in local coordinates. Starting with any local coordinates q_i on an open set of Q, we get local coordinates q_i, p_i on the cotangent bundle of this open set in the usual way. On this open set you then get

\theta = p_1 dq_1 + \cdots + p_n dq_n

This is a standard calculation, which is really worth doing!

It follows that we can define a symplectic structure \omega by

\omega = d \theta

and get this formula in local coordinates:

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

Now, suppose we choose a smooth function

f \colon Q \to \mathbb{R}

which describes the entropy. We get a 1-form df, which we can think of as a map

df \colon Q \to T^\ast Q

assigning to each choice q of extensive variables the pair (q,p) of extensive and intensive variables where

p = df_q

The image of the map df is a ‘Lagrangian submanifold’ of T^\ast Q: that is, an n-dimensional submanifold \Lambda such that

\left.\omega\right|_{\Lambda} = 0

Lagrangian submanifolds are to symplectic geometry as Legendrian submanifolds are to contact geometry! What we’re seeing here is that if Gibbs had preferred symplectic geometry, he could have described substances as Lagrangian submanifolds rather than Legendrian submanifolds. But this approach would only keep track of the derivatives of entropy, df, not the actual value of the entropy function f.
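One way to see why the image of df is Lagrangian: in local coordinates a tangent vector to it has the form (u, Hu), where H is the matrix of second partial derivatives of f. Evaluating \omega on two such vectors gives (Hu) \cdot w - (Hw) \cdot u, which vanishes because H is symmetric. Here is a minimal sketch of that computation for a sample function on \mathbb{R}^2. The function and the point are hypothetical choices.

```python
import math

def hessian(q):
    """Hessian of the sample function f(q1, q2) = q1**2 * q2 + sin(q2)."""
    q1, q2 = q
    return [[2 * q2, 2 * q1],
            [2 * q1, -math.sin(q2)]]

def push_forward(q, u):
    """Tangent vector to the image of df: the q-part is u, the p-part is H times u."""
    H = hessian(q)
    dp = [sum(H[i][j] * u[j] for j in range(2)) for i in range(2)]
    return u, dp

def omega(a, b):
    """omega = sum_i dp_i ^ dq_i evaluated on two tangent vectors given as (dq, dp)."""
    (dq_a, dp_a), (dq_b, dp_b) = a, b
    return sum(dp_a[i] * dq_b[i] - dp_b[i] * dq_a[i] for i in range(2))

q = [0.7, 1.3]                     # arbitrary point of Q = R^2
a = push_forward(q, [1.0, 0.0])
b = push_forward(q, [0.0, 1.0])
print(omega(a, b))                 # zero, because the Hessian is symmetric
```

By contrast, two tangent vectors to T^\ast Q that do not come from a single function f generally pair to something nonzero. For example, omega(([1.0, 0.0], [0.0, 0.0]), ([0.0, 0.0], [1.0, 0.0])) is -1.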

If we prefer to keep track of the actual value of f using contact geometry, we can do that. For this we add an extra dimension to T^\ast Q and form the manifold T^\ast Q \times \mathbb{R}. The extra dimension represents entropy, so we’ll use S as our name for the coordinate on \mathbb{R}.

We can make T^\ast Q \times \mathbb{R} into a contact manifold with contact 1-form

\alpha = -d S + \theta

In local coordinates we get

\alpha = -dS + p_1 dq_1 + \cdots + p_n dq_n

just as we had earlier. And just as before, if we choose a smooth function f \colon Q \to \mathbb{R} describing entropy, the subset

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = df_q \}

is a Legendrian submanifold of T^\ast Q \times \mathbb{R}.

Okay, this concludes my lightning review of symplectic and contact geometry in thermodynamics! Next time I’ll talk about something a bit less well understood: how they show up in statistical mechanics.

Fisher’s Fundamental Theorem (Part 4)

13 July, 2021

I wrote a paper that summarizes my work connecting natural selection to information theory:

• John Baez, The fundamental theorem of natural selection.

Check it out! If you have any questions or see any mistakes, please let me know.

Just for fun, here’s the abstract and introduction.

Abstract. Suppose we have n different types of self-replicating entity, with the population P_i of the ith type changing at a rate equal to P_i times the fitness f_i of that type. Suppose the fitness f_i is any continuous function of all the populations P_1, \dots, P_n. Let p_i be the fraction of replicators that are of the ith type. Then p = (p_1, \dots, p_n) is a time-dependent probability distribution, and we prove that its speed as measured by the Fisher information metric equals the variance in fitness. In rough terms, this says that the speed at which information is updated through natural selection equals the variance in fitness. This result can be seen as a modified version of Fisher’s fundamental theorem of natural selection. We compare it to Fisher’s original result as interpreted by Price, Ewens and Edwards.


In 1930, Fisher stated his “fundamental theorem of natural selection” as follows:

The rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time

Some tried to make this statement precise as follows:

The time derivative of the mean fitness of a population equals the variance of its fitness.

But this is only true under very restrictive conditions, so a controversy was ignited.

An interesting resolution was proposed by Price, and later amplified by Ewens and Edwards. We can formalize their idea as follows. Suppose we have n types of self-replicating entity, and idealize the population of the ith type as a real-valued function P_i(t). Suppose

\displaystyle{ \frac{d}{dt} P_i(t) = f_i(P_1(t), \dots, P_n(t)) \, P_i(t) }

where the fitness f_i is a differentiable function of the populations of every type of replicator. The mean fitness at time t is

\displaystyle{ \overline{f}(t) = \sum_{i=1}^n p_i(t) \, f_i(P_1(t), \dots, P_n(t)) }

where p_i(t) is the fraction of replicators of the ith type:

\displaystyle{ p_i(t) = \frac{P_i(t)}{\phantom{\Big|} \sum_{j = 1}^n P_j(t) } }

By the product rule, the rate of change of the mean fitness is the sum of two terms:

\displaystyle{ \frac{d}{dt} \overline{f}(t) = \sum_{i=1}^n \dot{p}_i(t) \, f_i(P_1(t), \dots, P_n(t)) \; + \; }

\displaystyle{ \sum_{i=1}^n p_i(t) \,\frac{d}{dt} f_i(P_1(t), \dots, P_n(t)) }

The first of these two terms equals the variance of the fitness at time t. We give the easy proof in Theorem 1. Unfortunately, the conceptual significance of this first term is much less clear than that of the total rate of change of mean fitness. Ewens concluded that “the theorem does not provide the substantial biological statement that Fisher claimed”.

But there is another way out, based on an idea Fisher himself introduced in 1922: Fisher information. Fisher information gives rise to a Riemannian metric on the space of probability distributions on a finite set, called the ‘Fisher information metric’—or in the context of evolutionary game theory, the ‘Shahshahani metric’. Using this metric we can define the speed at which a time-dependent probability distribution changes with time. We call this its ‘Fisher speed’. Under just the assumptions already stated, we prove in Theorem 2 that the Fisher speed of the probability distribution

p(t) = (p_1(t), \dots, p_n(t))

is the variance of the fitness at time t.

As explained by Harper, natural selection can be thought of as a learning process, and studied using ideas from information geometry—that is, the geometry of the space of probability distributions. As p(t) changes with time, the rate at which information is updated is closely connected to its Fisher speed. Thus, our revised version of the fundamental theorem of natural selection can be loosely stated as follows:

As a population changes with time, the rate at which information is updated equals the variance of fitness.

The precise statement, with all the hypotheses, is in Theorem 2. But one lesson is this: variance in fitness may not cause ‘progress’ in the sense of increased mean fitness, but it does cause change!

For more details in a user-friendly blog format, read the whole series:

Part 1: the obscurity of Fisher’s original paper.

Part 2: a precise statement of Fisher’s fundamental theorem of natural selection, and conditions under which it holds.

Part 3: a modified version of the fundamental theorem of natural selection, which holds much more generally.

Part 4: my paper on the fundamental theorem of natural selection.

Fisher’s Fundamental Theorem (Part 3)

8 October, 2020

Last time we stated and proved a simple version of Fisher’s fundamental theorem of natural selection, which says that under some conditions, the rate of increase of the mean fitness equals the variance of the fitness. But the conditions we gave were very restrictive: namely, that the fitness of each species of replicator is constant, not depending on how many of these replicators there are, or any other replicators.

To broaden the scope of Fisher’s fundamental theorem we need to do one of two things:

1) change the left side of the equation: talk about some quantity other than the rate of change of mean fitness.

2) change the right side of the equation: talk about some quantity other than the variance in fitness.

Or we could do both! People have spent a lot of time generalizing Fisher’s fundamental theorem. I don’t think there are, or should be, any hard rules on what counts as a generalization.

But today we’ll take alternative 1). We’ll show the square of something called the ‘Fisher speed’ always equals the variance in fitness. One nice thing about this result is that we can drop the restrictive condition I mentioned. Another nice thing is that the Fisher speed is a concept from information theory! It’s defined using the Fisher metric on the space of probability distributions.

And yes—that metric is named after the same guy who proved Fisher’s fundamental theorem! So, arguably, Fisher should have proved this generalization of Fisher’s fundamental theorem. But in fact it seems that I was the first to prove it, around February 1st, 2017. Some similar results were already known, and I will discuss those someday. But they’re a bit different.

A good way to think about the Fisher speed is that it’s ‘the rate at which information is being updated’. A population of replicators of different species gives a probability distribution. Like any probability distribution, this has information in it. As the populations of our replicators change, the Fisher speed says the rate at which this information is being updated. So, in simple terms, we’ll show

The square of the rate at which information is updated is equal to the variance in fitness.

This is quite a change from Fisher’s original idea, namely:

The rate of increase of mean fitness is equal to the variance in fitness.

But it has the advantage of always being true… as long as the population dynamics are described by the general framework we introduced last time. So let me remind you of the general setup, and then prove the result!

The setup

We start out with population functions P_i \colon \mathbb{R} \to (0,\infty), one for each species of replicator i = 1,\dots,n, obeying the Lotka–Volterra equation

\displaystyle{ \frac{d P_i}{d t} = f_i(P_1, \dots, P_n) P_i }

for some differentiable functions f_i \colon (0,\infty)^n \to \mathbb{R} called fitness functions. The probability of a replicator being in the ith species is

\displaystyle{  p_i(t) = \frac{P_i(t)}{\sum_j P_j(t)} }

Using the Lotka–Volterra equation we showed last time that these probabilities obey the replicator equation

\displaystyle{ \dot{p}_i = \left( f_i(P) - \overline f(P) \right)  p_i }

Here P is short for the whole list of populations (P_1(t), \dots, P_n(t)), and

\displaystyle{ \overline f(P) = \sum_j f_j(P) p_j  }

is the mean fitness.
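If you’d like to see the replicator equation emerge from the Lotka–Volterra equation without redoing the algebra, here is a numerical sketch. The fitness functions and populations are hypothetical, chosen only so that the fitness really depends on the populations. The code differentiates the fractions p_i along the Lotka–Volterra flow by central differences and compares the result with the right side of the replicator equation.

```python
def fitness(P):
    """Hypothetical fitness functions f_i(P_1, P_2, P_3), chosen only for illustration."""
    total = sum(P)
    return [1.0 - P[0] / total, 0.5, 0.2 + 0.01 * P[0]]

def fractions(P):
    total = sum(P)
    return [P_i / total for P_i in P]

def replicator_rhs(P):
    """Right side of the replicator equation: (f_i - mean fitness) * p_i."""
    f, p = fitness(P), fractions(P)
    f_bar = sum(f_j * p_j for f_j, p_j in zip(f, p))
    return [(f_i - f_bar) * p_i for f_i, p_i in zip(f, p)]

def p_dot_numeric(P, h=1e-6):
    """Differentiate p along the Lotka-Volterra flow, dP_i/dt = f_i(P) P_i."""
    P_dot = [f_i * P_i for f_i, P_i in zip(fitness(P), P)]
    ahead = fractions([P_i + h * d for P_i, d in zip(P, P_dot)])
    behind = fractions([P_i - h * d for P_i, d in zip(P, P_dot)])
    return [(a - b) / (2 * h) for a, b in zip(ahead, behind)]

P = [10.0, 30.0, 60.0]
gap = max(abs(a - b) for a, b in zip(replicator_rhs(P), p_dot_numeric(P)))
print(gap)   # small: the two computations agree up to finite-difference error
```

The components of either answer sum to zero, as they must for the p_i to keep summing to 1.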

The Fisher metric

The space of probability distributions on the set \{1, \dots, n\} is called the (n-1)-simplex

\Delta^{n-1} = \{ (x_1, \dots, x_n) : \; x_i \ge 0, \; \displaystyle{ \sum_{i=1}^n x_i = 1 } \}

It’s called \Delta^{n-1} because it’s (n-1)-dimensional. When n = 3 it looks like the letter \Delta:

The Fisher metric is a Riemannian metric on the interior of the (n-1)-simplex. That is, given a point p in the interior of \Delta^{n-1} and two tangent vectors v,w at this point the Fisher metric gives a number

g(v,w) = \displaystyle{ \sum_{i=1}^n \frac{v_i w_i}{p_i}  }

Here we are describing the tangent vectors v,w as vectors in \mathbb{R}^n with the property that the sum of their components is zero: that’s what makes them tangent to the (n-1)-simplex. And we’re demanding that p be in the interior of the simplex to avoid dividing by zero, since on the boundary of the simplex we have p_i = 0 for at least one choice of i.

If we have a probability distribution p(t) moving around in the interior of the (n-1)-simplex as a function of time, its Fisher speed is

\displaystyle{ \sqrt{g(\dot{p}(t), \dot{p}(t))} = \sqrt{\sum_{i=1}^n \frac{\dot{p}_i(t)^2}{p_i(t)}} }

if the derivative \dot{p}(t) exists. This is the usual formula for the speed of a curve moving in a Riemannian manifold, specialized to the case at hand.
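Here is a minimal sketch of these two formulas in code, tried out on a curve of my own choosing: p(t) = (\cos^2 t, \sin^2 t) in the 1-simplex. A direct computation with the formula above gives g(\dot{p}, \dot{p}) = 4 \sin^2 t + 4 \cos^2 t = 4, so this curve should move at constant Fisher speed 2.

```python
import math

def fisher_metric(p, v, w):
    """g(v, w) = sum_i v_i w_i / p_i at an interior point p of the simplex."""
    return sum(v_i * w_i / p_i for p_i, v_i, w_i in zip(p, v, w))

def fisher_speed(p, p_dot):
    """Length of the velocity vector p_dot at the point p, in the Fisher metric."""
    return math.sqrt(fisher_metric(p, p_dot, p_dot))

def curve(t):
    """Illustrative curve in the 1-simplex: p(t) = (cos^2 t, sin^2 t)."""
    return [math.cos(t) ** 2, math.sin(t) ** 2]

def velocity(t):
    """Exact derivative of the curve above."""
    s = 2 * math.sin(t) * math.cos(t)
    return [-s, s]

speeds = [fisher_speed(curve(t), velocity(t)) for t in (0.3, 0.7, 1.2)]
print(speeds)   # each is 2 up to rounding
```

The Euclidean speed of the same curve varies with t, so the constant value here is a feature of the Fisher metric.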

Now we’ve got all the formulas we’ll need to prove the result we want. But for those who don’t already know and love it, it’s worthwhile saying a bit more about the Fisher metric.

The factor of 1/p_i in the Fisher metric changes the geometry of the simplex so that it becomes round, like a portion of a sphere:

But the reason the Fisher metric is important, I think, is its connection to relative information. Given two probability distributions p, q \in \Delta^{n-1}, the information of q relative to p is

\displaystyle{ I(q,p) = \sum_{i = 1}^n q_i \ln\left(\frac{q_i}{p_i}\right)   }

You can show this is the expected amount of information gained if p was your prior distribution and you receive information that causes you to update your prior to q. So, sometimes it’s called the information gain. It’s also called relative entropy or—my least favorite, since it sounds so mysterious—the Kullback–Leibler divergence.

Suppose p(t) is a smooth curve in the interior of the (n-1)-simplex. We can ask the rate at which information is gained as time passes. Perhaps surprisingly, a calculation gives

\displaystyle{ \frac{d}{dt} I(p(t), p(t_0))\Big|_{t = t_0} = 0 }

That is, in some sense ‘to first order’ no information is being gained at any moment t_0 \in \mathbb{R}. However, we have

\displaystyle{  \frac{d^2}{dt^2} I(p(t), p(t_0))\Big|_{t = t_0} =  g(\dot{p}(t_0), \dot{p}(t_0))}

So, the square of the Fisher speed has a nice interpretation in terms of relative entropy!

For a derivation of these last two equations, see Part 7 of my posts on information geometry. For more on the meaning of relative entropy, see Part 6.
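These two equations are also easy to test numerically. The sketch below again uses an illustrative curve of my own choosing, p(t) = (\cos^2 t, \sin^2 t), for which the formula for the Fisher metric gives g(\dot{p}, \dot{p}) = 4 at every time. It estimates the first and second derivatives of the relative information at t = t_0 by finite differences. The step size is an assumption.

```python
import math

def curve(t):
    """Illustrative curve in the interior of the 1-simplex."""
    return [math.cos(t) ** 2, math.sin(t) ** 2]

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p))

def derivatives_at(t0, h=1e-3):
    """First and second derivatives of I(p(t), p(t0)) at t = t0, by central differences."""
    def info(t):
        return relative_information(curve(t), curve(t0))
    first = (info(t0 + h) - info(t0 - h)) / (2 * h)
    second = (info(t0 + h) - 2 * info(t0) + info(t0 - h)) / h ** 2
    return first, second

first, second = derivatives_at(0.7)
print(first, second)   # first is near 0, second is near 4
```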

The result

It’s now extremely easy to show what we want, but let me state it formally so all the assumptions are crystal clear.

Theorem. Suppose the functions P_i \colon \mathbb{R} \to (0,\infty) obey the Lotka–Volterra equations:

\displaystyle{ \dot P_i = f_i(P) P_i}

for some differentiable functions f_i \colon (0,\infty)^n \to \mathbb{R} called fitness functions. Define probabilities and the mean fitness as above, and define the variance of the fitness by

\displaystyle{ \mathrm{Var}(f(P)) =  \sum_j ( f_j(P) - \overline f(P))^2 \, p_j }

Then if none of the populations P_i are zero, the square of the Fisher speed of the probability distribution p(t) = (p_1(t), \dots , p_n(t)) is the variance of the fitness:

g(\dot{p}, \dot{p})  = \mathrm{Var}(f(P))

Proof. The proof is near-instantaneous. We take the square of the Fisher speed:

\displaystyle{ g(\dot{p}, \dot{p}) = \sum_{i=1}^n \frac{\dot{p}_i(t)^2}{p_i(t)} }

and plug in the replicator equation:

\displaystyle{ \dot{p}_i = (f_i(P) - \overline f(P)) p_i }

We obtain:

\begin{array}{ccl} \displaystyle{ g(\dot{p}, \dot{p})} &=&  \displaystyle{ \sum_{i=1}^n \left( f_i(P) - \overline f(P) \right)^2 p_i } \\ \\  &=& \mathrm{Var}(f(P))  \end{array}

as desired.   █

It’s hard to imagine anything simpler than this. We see that given the Lotka–Volterra equation, what causes information to be updated is nothing more and nothing less than variance in fitness!
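The theorem is also easy to check numerically. The sketch below assumes the definitions used earlier in this post, p_i = P_i / \sum_j P_j and \overline f(P) = \sum_j f_j(P) p_j. The particular fitness functions and populations are invented for illustration. To avoid assuming what we want to check, it computes \dot{p}_i from the Lotka–Volterra equations by the quotient rule instead of plugging in the replicator equation.

```python
def fitness(P):
    """Hypothetical fitness functions f_i(P) = r_i + sum_j A[i][j] * P[j]."""
    r = [1.0, 0.5, -0.2]
    A = [[-0.1, 0.02, 0.0],
         [0.03, -0.2, 0.01],
         [0.05, 0.04, -0.05]]
    return [ri + sum(a * Pj for a, Pj in zip(row, P)) for ri, row in zip(r, A)]

P = [2.0, 3.0, 5.0]                         # hypothetical populations, all positive
f = fitness(P)
P_dot = [fi * Pi for fi, Pi in zip(f, P)]   # Lotka-Volterra: dP_i/dt = f_i(P) P_i

total = sum(P)
total_dot = sum(P_dot)
p = [Pi / total for Pi in P]
# Quotient rule applied to p_i = P_i / total
p_dot = [(Pd * total - Pi * total_dot) / total ** 2 for Pd, Pi in zip(P_dot, P)]

# Square of the Fisher speed: sum_i p_i'^2 / p_i
fisher_speed_squared = sum(pd ** 2 / pi for pd, pi in zip(p_dot, p))

# Variance of the fitness with respect to the distribution p
mean_fitness = sum(fi * pi for fi, pi in zip(f, p))
variance = sum((fi - mean_fitness) ** 2 * pi for fi, pi in zip(f, p))

print(fisher_speed_squared, variance)       # the two numbers agree up to rounding
```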

The whole series:

Part 1: the obscurity of Fisher’s original paper.

Part 2: a precise statement of Fisher’s fundamental theorem of natural selection, and conditions under which it holds.

Part 3: a modified version of the fundamental theorem of natural selection, which holds much more generally.

Part 4: my paper on the fundamental theorem of natural selection.

Markov Decision Processes

6 October, 2020

The National Institute for Mathematical and Biological Synthesis (NIMBioS) is having an online seminar on ‘adaptive management’. It should be fun for people who want to understand Markov decision processes, like me!

NIMBioS Adaptive Management Webinar Series, 2020 October 26-29 (Monday-Thursday).

Adaptive management seeks to determine sound management strategies in the face of uncertainty concerning the behavior of the system being managed. Specifically, it attempts to find strategies for managing dynamic systems while learning the behavior of the system. These webinars review the key concept of a Markov Decision Process (MDP) and demonstrate how quantitative adaptive management strategies can be developed using MDPs. Additional conceptual, computational and application aspects will be discussed, including dynamic programming and Bayesian formalization of learning.

Here are the topics:

Session 1: Introduction to decision problems
Session 2: Introduction to Markov decision processes (MDPs)
Session 3: Solving Markov decision processes (MDPs)
Session 4: Modeling beliefs
Session 5: Conjugacy and discrete model adaptive management (AM)
Session 6: More on AM problems (Dirichlet/multinomial and Gaussian prior/likelihood)
Session 7: Partially observable Markov decision processes (POMDPs)
Session 8: Frontier topics (projection methods, approximate DP, communicating solutions)


Fock Space Techniques for Stochastic Physics

2 October, 2020

I’ve been fascinated for a long time by the relation between classical probability theory and quantum mechanics. This story took a strange new turn when people discovered that stochastic Petri nets, good for describing classical probabilistic models of interacting entities, can also be described using ideas from quantum field theory!

I’ll be talking about this at the online category theory seminar at UNAM, the National Autonomous University of Mexico, on Wednesday October 7th at 18:00 UTC (11 am Pacific Time):

Fock space techniques for stochastic physics

Abstract. Some ideas from quantum theory are beginning to percolate back to classical probability theory. For example, the master equation for a chemical reaction network—also known as a stochastic Petri net—describes particle interactions in a stochastic rather than quantum way. If we look at this equation from the perspective of quantum theory, this formalism turns out to involve creation and annihilation operators, coherent states and other well-known ideas—but with a few big differences.

You can watch the talk here:

You can also see the slides of this talk. Click on any picture in the slides, or any text in blue, and get more information!

My students Joe Moeller and Jade Master will also be giving talks in this seminar—on Petri nets and structured cospans.