Information Geometry (Part 20)

Last time we worked out an analogy between classical mechanics, thermodynamics and probability theory. The latter two look suspiciously similar:

      Classical Mechanics    Thermodynamics         Probability Theory
 q    position               extensive variables    probabilities
 p    momentum               intensive variables    surprisals
 S    action                 entropy                Shannon entropy

This is no coincidence. After all, in the subject of statistical mechanics we explain classical thermodynamics using probability theory—and entropy is revealed to be Shannon entropy (or its quantum analogue).

Now I want to make this precise.

To connect classical thermodynamics to probability theory, I’ll start by discussing ‘statistical manifolds’. I introduced the idea of a statistical manifold in Part 7: it’s a manifold Q equipped with a map sending each point q \in Q to a probability distribution \pi_q on some measure space \Omega. Now I’ll say how these fit into the second column of the above chart.

Then I’ll talk about statistical manifolds of a special sort used in thermodynamics, which I’ll call ‘Gibbsian’, since they really go back to Josiah Willard Gibbs.

In a Gibbsian statistical manifold, for each q \in Q the probability distribution \pi_q is a ‘Gibbs distribution’. Physically, these Gibbs distributions describe thermodynamic equilibria. For example, if you specify the volume, energy and number of particles in a box of gas, there will be a Gibbs distribution describing what the particles do in thermodynamic equilibrium under these conditions. Mathematically, Gibbs distributions maximize entropy subject to some constraints specified by the point q \in Q.

More precisely: in a Gibbsian statistical manifold we have a list of observables A_1, \dots , A_n whose expected values serve as coordinates q_1, \dots, q_n for points q \in Q, and \pi_q is the probability distribution that maximizes entropy subject to the constraint that the expected value of A_i is q_i. We can derive most of the interesting formulas of thermodynamics starting from this!

Statistical manifolds

Let’s fix a measure space \Omega with measure \mu. A statistical manifold is then a manifold Q equipped with a smooth map \pi assigning to each point q \in Q a probability distribution on \Omega, which I’ll call \pi_q. So, \pi_q is a function on \Omega with

\displaystyle{ \int_\Omega \pi_q \, d\mu = 1 }

and

\pi_q(x) \ge 0

for all x \in \Omega.

The idea here is that the space of all probability distributions on \Omega may be too huge to understand in as much detail as we’d like, so instead we describe some of these probability distributions—a family parametrized by points of some manifold Q—using the map \pi. This is the basic idea behind parametric statistics.
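To make this concrete, here is a minimal numerical sketch of a statistical manifold, using the family of Gaussians on \Omega = \mathbb{R} as the parametrized family (the choice of family, point q, and integration grid are just illustrative assumptions, not anything forced by the definition):

```python
import numpy as np

def pi_q(x, q):
    """Gaussian density on Omega = R for the point q = (mean, sigma)
    of the two-dimensional statistical manifold Q = R x (0, oo)."""
    mean, sigma = q
    return np.exp(-(x - mean)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Check the two defining conditions numerically on a grid:
# pi_q >= 0 everywhere, and pi_q integrates to 1 against Lebesgue measure.
x = np.linspace(-20.0, 20.0, 100001)
dx = x[1] - x[0]
q = (1.0, 2.0)
density = pi_q(x, q)

assert np.all(density >= 0)
print(density.sum() * dx)  # approximately 1.0
```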

Information geometry is the geometry of statistical manifolds. Any statistical manifold comes with a bunch of interesting geometrical structures. One is the ‘Fisher information metric’, a Riemannian metric I explained in Part 7. Another is a 1-parameter family of connections on the tangent bundle T Q, which is important in Amari’s approach to information geometry. You can read about this here:

• Hiroshi Matsuzoe, Statistical manifolds and affine differential geometry, in Advanced Studies in Pure Mathematics 57, pp. 303–321.

I don’t want to talk about it now—I just wanted to reassure you that I’m not completely ignorant of it!

I want to focus on the story I’ve been telling, which is about entropy. Our statistical manifold Q comes with a smooth entropy function

f \colon Q \to \mathbb{R}

namely

\displaystyle{  f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x)    }

We can use this entropy function to do many of the things we usually do in thermodynamics! For example, at any point q \in Q where this function is differentiable, its differential gives a cotangent vector

p = (df)_q

which has an important physical meaning. In coordinates we have

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and we call p_i the intensive variable conjugate to q_i. For example if q_i is energy, p_i will be ‘coolness’: the reciprocal of temperature.
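Here is a one-dimensional sketch of the recipe p = df/dq, using the Bernoulli family on \Omega = \{0,1\} with coordinate q = probability of outcome 1 (a toy example chosen because the conjugate variable has a closed form we can check against):

```python
import numpy as np

def entropy(q):
    """Shannon entropy f(q) of the Bernoulli distribution pi_q = (1-q, q)."""
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

def conjugate(q, h=1e-6):
    """The conjugate variable p = df/dq, via a central difference."""
    return (entropy(q + h) - entropy(q - h)) / (2 * h)

q = 0.3
# In this family df/dq = ln((1-q)/q), a difference of surprisals,
# so we can compare the numerical derivative to the exact answer.
print(conjugate(q), np.log((1 - q) / q))
```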

Defining p this way gives a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q : \; p =  (df)_q \}

of the cotangent bundle T^\ast Q. We can also get contact geometry into the game by defining a contact manifold T^\ast Q \times \mathbb{R} and a Legendrian submanifold

\Sigma = \{ (q,p,S) \in T^\ast Q \times \mathbb{R} : \; p =  (df)_q , \; S = f(q) \}

But I’ve been talking about these ideas for the last three episodes, so I won’t say more just now! Instead, I want to throw a new idea into the pot.

Gibbsian statistical manifolds

Thermodynamics and statistical mechanics spend a lot of time dealing with statistical manifolds of a special sort I’ll call ‘Gibbsian’. In these, each probability distribution \pi_q is a ‘Gibbs distribution’, meaning that it maximizes entropy subject to certain constraints specified by the point q \in Q.

How does this work? For starters, an integrable function

A \colon \Omega \to \mathbb{R}

is called a random variable, or in physics an observable. The expected value of an observable is a smooth real-valued function on our statistical manifold

\langle A \rangle \colon Q \to \mathbb{R}

given by

\displaystyle{ \langle A \rangle(q) = \int_\Omega A(x) \pi_q(x) \, d\mu(x) }

In other words, \langle A \rangle is a function whose value at any point q \in Q is the expected value of A with respect to the probability distribution \pi_q.

Now, suppose our statistical manifold is n-dimensional and we have n observables A_1, \dots, A_n. Their expected values will be smooth functions on our manifold—and sometimes these functions will be a coordinate system!

This may sound rather unlikely, but it’s really not so outlandish. Indeed, if there’s a point q such that the differentials of the functions \langle A_i \rangle are linearly independent at this point, these functions will be a coordinate system in some neighborhood of this point, by the inverse function theorem. So, we can take this neighborhood, use it as our statistical manifold, and the functions \langle A_i \rangle will be coordinates.

So, let’s assume the expected values of our observables give a coordinate system on Q. Let’s call these coordinates q_1, \dots, q_n, so that

\langle A_i \rangle(q) = q_i

Now for the kicker: we say our statistical manifold is Gibbsian if for each point q \in Q, \pi_q is the probability distribution that maximizes entropy subject to the above condition!

Which condition? The condition saying that

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

for all i. This is just the previous equation spelled out so that you can see it’s a condition on \pi_q.

This assumption that \pi_q maximizes entropy is very powerful, because it implies a useful and nontrivial formula for \pi_q. It’s called the Gibbs distribution:

\displaystyle{  \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }

for all x \in \Omega.

Here p_i is the intensive variable conjugate to q_i, while Z(q) is the partition function: the thing we must divide by to make sure \pi_q integrates to 1. In other words:

\displaystyle{ Z(q) = \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }
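On a finite measure space these two formulas are easy to compute with. Here is a sketch using \Omega = \{0,1,2,3\} with counting measure; the observable A and the value of the conjugate variable p are made-up numbers, chosen only to illustrate the recipe:

```python
import numpy as np

A = np.array([0.0, 1.0, 2.0, 3.0])   # observable A : Omega -> R
p = 0.7                              # conjugate (intensive) variable

Z = np.sum(np.exp(-p * A))           # partition function Z(q)
pi = np.exp(-p * A) / Z              # the Gibbs distribution pi_q

print(pi.sum())                      # sums to 1: normalized by construction
print(np.dot(A, pi))                 # the expected value q = <A> this p determines
```

Note how dividing by Z is exactly what makes \pi_q integrate (here, sum) to 1, and how larger values of A get exponentially suppressed when p > 0.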

By the way, this formula may look confusing at first, since the left side depends on the point q in our statistical manifold, while there’s no q visible in the right side! Do you see what’s going on?

I’ll tell you: the conjugate variable p_i, sitting on the right side of the above formula, depends on q. Remember, we got it by taking the partial derivative of the entropy in the q_i direction

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

and then evaluating this derivative at the point q.

But wait a minute! f here is the entropy—but the entropy of what?

The entropy of \pi_q, of course!

So there’s something circular about our formula for \pi_q. To know \pi_q, you need to know the conjugate variables p_i, but to compute these you need to know the entropy of \pi_q.

This is actually okay. While circular, the formula for \pi_q is still true. It’s harder to work with than you might hope, but it’s still extremely useful.
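One way to see that the circularity is harmless in practice: on a finite measure space, the map sending the conjugate variable p to the expected value \langle A \rangle under the Gibbs distribution is strictly decreasing, so we can invert it numerically to find the p belonging to a given coordinate value q. A sketch with one made-up observable (the values in A, the target q, and the bracket [-50, 50] are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([0.0, 1.0, 2.0, 3.0])   # a made-up observable on Omega = {0,1,2,3}

def expected_A(p):
    """Expected value <A> under the Gibbs distribution with conjugate variable p."""
    w = np.exp(-p * A)
    return np.dot(A, w) / w.sum()

def solve_p(q_target, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisect for the p with <A> = q_target.
    Requires min(A) < q_target < max(A), so a solution exists."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # expected_A is decreasing in p, so keep the half-interval
        # whose endpoints still bracket the target value
        if expected_A(mid) > q_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p = solve_p(1.0)
print(expected_A(p))  # ~ 1.0
```

So while the formula defines \pi_q only implicitly, recovering the conjugate variables from the coordinates q_i is a well-posed problem.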

Next time I’ll prove that this formula for \pi_q is true, and do a few things with it. All this material was discovered by Gibbs in the late 1800s, and it’s lurking in any good book on statistical mechanics—but not phrased in the language of statistical manifolds. The physics textbooks usually consider special cases, like a box of gas where:

q_1 is energy, p_1 is 1/temperature.

q_2 is volume, p_2 is –pressure/temperature.

q_3 is the number of particles, p_3 is –chemical potential/temperature.

While these special cases are important and interesting, I’d rather be general!

Technical comments

I said “Any statistical manifold comes with a bunch of interesting geometrical structures”, but in fact some conditions are required. For example, the Fisher information metric is only well-defined and nondegenerate under some conditions on the map \pi: if \pi maps every point of Q to the same probability distribution, the Fisher information metric will vanish.

Similarly, the entropy function f is only smooth under some conditions on \pi.

Furthermore, the integral

\displaystyle{ \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x)   }

may not converge for all values of the numbers p_1, \dots, p_n. But in my discussion of Gibbsian statistical manifolds, I was assuming that an entropy-maximizing probability distribution \pi_q with

\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

actually exists. In this case the probability distribution is also unique (almost everywhere).
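Existence aside, the maximizing property itself is easy to check numerically in the finite case. The sketch below (same made-up observable as before, on \Omega = \{0,1,2,3\}) perturbs the Gibbs distribution in random directions that preserve both normalization and the constraint \langle A \rangle = q, and confirms the entropy never increases:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([0.0, 1.0, 2.0, 3.0])
p = 0.7
pi = np.exp(-p * A) / np.sum(np.exp(-p * A))   # the Gibbs distribution
q = np.dot(A, pi)                               # its expected value of A

def entropy(d):
    return -np.sum(d * np.log(d))

# Perturbations preserving the normalization and the constraint <A> = q
# must be orthogonal to (1,1,1,1) and to A; the trailing rows of Vh from
# an SVD of those two row vectors span exactly that null space.
basis = np.linalg.svd(np.vstack([np.ones_like(A), A]))[2][2:]

for _ in range(100):
    v = rng.normal(size=len(basis)) @ basis
    d = pi + 1e-3 * v
    if np.all(d > 0):
        assert abs(np.dot(A, d) - q) < 1e-12       # constraint preserved
        assert entropy(d) <= entropy(pi) + 1e-12   # entropy never goes up
print("Gibbs distribution maximized entropy over all tested perturbations")
```

This is of course only a sanity check along random directions, not a proof; the actual proof, using Lagrange multipliers, is what’s promised for next time.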


For all my old posts on information geometry, go here:

Information geometry.

16 Responses to Information Geometry (Part 20)

  1. Toby Bartels says:

    Typos: You have a lot of unLaTeXed dollar signs in the last paragraph of the introduction and the first paragraph of the first section.

    • John Baez says:

      Whoops, and I was using some older notation I had for statistical manifolds. I fixed that. I’m trying to use Q for a statistical manifold now that I’m treating points q \in Q as analogous to points in a ‘configuration space’ in classical mechanics. This goes against the conventions in Part 3, but I’ll deal with that in the unlucky event that I turn this stuff into a book.

      • Toby Bartels says:

        I see, the introduction had all the wrong letters originally! There's still an unLaTeXed $q \in Q$ in the last paragraph of the introduction and an unLaTeXed $Omega,$ in the first paragraph of the first section.

  2. francisrlb says:

    I think there might be a typo at the end: “I was assuming that an entropy-minimizing probability distribution \pi_q “. Do you mean maximizing? In the integral coming right after, is A missing a subscript (this also occurs earlier)?

    Also, could you please go a bit more into the definition of Gibbsian in your next installment? I’m not understanding what’s assumed and what’s implied by the condition.
    Thanks!

    • John Baez says:

      Thanks for catching those mistakes. I tried to fix them.

What don’t you get about the definition of Gibbsian? I defined it precisely – or so I thought. You may not be familiar with Gibbs distributions, so maybe I need to explain something that seems obvious to me. But you have to tell me what you’re puzzled by! It’s hard to help without knowing what the problem is.

      What I mainly plan to do next time is prove that the Gibbs distribution

      \displaystyle{  \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }

      maximizes the entropy

      \displaystyle{  f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x)    }

      subject to the constraints that

      \displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

      for all i.

      • francisrlb says:

I guess I’m confused about what the q_{i} are supposed to be: you defined them starting from A_{i} and the distribution, but now you use them to define another distribution. So do we assume that we have coordinates q_{i} on Q and observables A_{i} on \Omega already, and then consider probability distributions for which the expectation of A_{i} is q_{i}?

    • John Baez says:

      Francis wrote:

      I guess I’m confused about what the q_{i} are supposed to be: you defined them starting from A_{i} and the distribution, but now you use them to define another distribution.

      No, there’s only one distribution in this story: or more precisely, one map \pi sending each point q of a manifold Q to a probability distribution \pi_q.

Starting from this and some observables A_1, \dots, A_n, I defined the function

      q_i \colon Q \to \mathbb{R}

      to be the expected value of A_i in the distribution \pi_q:

      \displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

      Then I assumed that the functions q_1, \dots, q_n are a coordinate system on Q (and I explained why this is easy to achieve).

      Then I made a big extra assumption: \pi_q is the probability distribution with the largest possible entropy such that

      \displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }

      for all i. In other words: if you tell me the values of the coordinates q_1, \dots, q_n, then \pi_q is not just any old probability distribution having these values as the expected values of the observables A_1, \dots, A_n. It’s the probability distribution with the largest possible entropy having these values as the expected values of the observables A_1, \dots, A_n.

      I’m not changing \pi_q here, I’m just making an extra assumption about it. This is a common assumption in thermodynamics.

      Without this extra assumption there’s nothing very exciting we can do until someone tells us a formula for \pi_q. But with this extra assumption I can show that \pi_q must obey this equation:

      \displaystyle{  \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }

      (I defined p_i and Z(q) in the post.)

      I will actually show this equation next time.

      I hope the logic is clearer now. If there’s something puzzling still, just ask.

  3. John Baez says:

    I added a bit about the traditional reasons for caring about Gibbsian statistical manifolds and the variables q_i and p_i:

Next time I’ll prove that this formula for \pi_q is true, and do a few things with it. All this material was discovered by Josiah Willard Gibbs in the late 1800s, and it’s lurking in any good book on statistical mechanics—but they don’t phrase it in the language of statistical manifolds. The physics textbooks usually consider special cases, like the case where:

    q_1 is energy, p_1 is 1/temperature.

    q_2 is volume, p_2 is –pressure/temperature

    While these special cases make the subject important, I’d rather be general!

  4. Great post! Question: what is the information analogue of FORCE?

    • John Baez says:

      Hi! That’s a good question. I will answer it someday, as I keep discussing the analogy between classical mechanics and information geometry. There are actually a few different possibilities, depending on what we take as the analogue of time.

  5. Toby Bartels says:

    It is pretty convoluted; I had to go through it a few times.

    I think that it would help to use different symbols for q (a point of the manifold Q) and q_i (a coordinate function on Q). Because even though q can be written as a tuple using the coordinates q_i, still q comes well before q_i in the development.

    • John Baez says:

      Everyone in differential geometry uses x_i as coordinates of a point called x, everyone in physics uses q_i as coordinates of a position called q, etc. This is confusing when you first see it, and you might feel tempted to write x_i(x) for the coordinate function x_i evaluated at the point x. But it’s good to get used to.

      I’m sorry my exposition seemed convoluted. Maybe it’s because I was trying to explain two subjects in one blog post: statistical manifolds and thermodynamics. They’re closely related. In a statistical manifold each point labels a probability distribution. In thermodynamics each point of a manifold labels a probability distribution that maximizes entropy subject to constraints on some observables, and the expected values of these observables serve as coordinates.

      I didn’t have a whole lot I wanted to say about statistical manifolds this time, except as a lead-in to thermodynamics. So I just laid out the basic formalism and then took a hard right turn into thermodynamics. If I ever turn this stuff into a book I’ll try to give the readers a more gentle ride.

      • Toby Bartels says:

        Yes, but they do this when they start with the coordinates. Here the coordinates are derived at the end of a series of steps; I think that's adding to the confusion. I see why you did it this way, but it is an unusual order to do things.

        Anyway, I'm really hoping that people who see these comments might think ‹Ah, that's why I'm confused, but now I understand.› rather than that you'll change the notation, especially since I don’t have a natural alternative to suggest.

      • John Baez says:

        Yeah. Socrates once complained that the problem with books is that you can’t ask a book questions. But on a blog you can ask questions — and other people can read the answers! This is what I like about blogs.
