Information Geometry (Part 20)

Last time we worked out an analogy between classical mechanics, thermodynamics and probability theory. The latter two look suspiciously similar:

        Classical Mechanics    Thermodynamics         Probability Theory
  q     position               extensive variables    probabilities
  p     momentum               intensive variables    surprisals
  S     action                 entropy                Shannon entropy

This is no coincidence. After all, in the subject of statistical mechanics we explain classical thermodynamics using probability theory—and entropy is revealed to be Shannon entropy (or its quantum analogue).

Now I want to make this precise.

To connect classical thermodynamics to probability theory, I’ll start by discussing ‘statistical manifolds’. I introduced the idea of a statistical manifold in Part 7: it’s a manifold $Q$ equipped with a map sending each point $q \in Q$ to a probability distribution $\pi_q$ on some measure space $\Omega.$ Now I’ll say how these fit into the second column of the above chart.

Then I’ll talk about statistical manifolds of a special sort used in thermodynamics, which I’ll call ‘Gibbsian’, since they really go back to Josiah Willard Gibbs.

In a Gibbsian statistical manifold, for each $q \in Q$ the probability distribution $\pi_q$ is a ‘Gibbs distribution’. Physically, these Gibbs distributions describe thermodynamic equilibria. For example, if you specify the volume, energy and number of particles in a box of gas, there will be a Gibbs distribution describing what the particles do in thermodynamic equilibrium under these conditions. Mathematically, Gibbs distributions maximize entropy subject to some constraints specified by the point $q \in Q.$

More precisely: in a Gibbsian statistical manifold we have a list of observables $A_1, \dots , A_n$ whose expected values serve as coordinates $q_1, \dots, q_n$ for points $q \in Q,$ and $\pi_q$ is the probability distribution that maximizes entropy subject to the constraint that the expected value of $A_i$ is $q_i.$ We can derive most of the interesting formulas of thermodynamics starting from this!

Statistical manifolds

Let’s fix a measure space $\Omega$ with measure $\mu.$ A statistical manifold is then a manifold $Q$ equipped with a smooth map $\pi$ assigning to each point $q \in Q$ a probability distribution on $\Omega,$ which I’ll call $\pi_q.$ So, $\pi_q$ is a function on $\Omega$ with

$\displaystyle{ \int_\Omega \pi_q \, d\mu = 1 }$

and

$\pi_q(x) \ge 0$

for all $x \in \Omega.$

The idea here is that the space of all probability distributions on $\Omega$ may be too huge to understand in as much detail as we’d like, so instead we describe some of these probability distributions—a family parametrized by points of some manifold $Q$—using the map $\pi.$ This is the basic idea behind parametric statistics.
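To make this concrete, here is a minimal sketch in Python of a one-dimensional statistical manifold; the particular family (exponential distributions with mean $q$) is my own illustrative choice, not something fixed by the post:

```python
import numpy as np

# A toy statistical manifold (an illustrative assumption): Q = (0, infinity),
# Omega = [0, infinity) with Lebesgue measure, and pi_q the exponential
# distribution with mean q.
def pi(q, x):
    """Density pi_q(x) = exp(-x/q)/q of the exponential distribution."""
    return np.exp(-x / q) / q

# Check numerically that pi_q integrates to 1 over Omega, truncating the
# integral at x = 100 (the tail beyond that is negligible for q = 2.5).
q = 2.5
x = np.linspace(0.0, 100.0, 200_001)
f = pi(q, x)
total = np.sum((f[:-1] + f[1:]) * 0.5 * (x[1] - x[0]))   # trapezoid rule
print(total)  # ≈ 1.0
```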

Information geometry is the geometry of statistical manifolds. Any statistical manifold comes with a bunch of interesting geometrical structures. One is the ‘Fisher information metric’, a Riemannian metric I explained in Part 7. Another is a 1-parameter family of connections on the tangent bundle $T Q,$ which is important in Amari’s approach to information geometry. You can read about this here:

• Hiroshi Matsuzoe, Statistical manifolds and affine differential geometry, in Advanced Studies in Pure Mathematics 57, pp. 303–321.

I don’t want to talk about it now—I just wanted to reassure you that I’m not completely ignorant of it!

I want to focus on the story I’ve been telling, which is about entropy. Our statistical manifold $Q$ comes with a smooth entropy function

$f \colon Q \to \mathbb{R}$

namely

$\displaystyle{ f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x) }$

We can use this entropy function to do many of the things we usually do in thermodynamics! For example, at any point $q \in Q$ where this function is differentiable, its differential gives a cotangent vector

$p = (df)_q$

which has an important physical meaning. In coordinates we have

$\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }$

and we call $p_i$ the intensive variable conjugate to $q_i.$ For example if $q_i$ is energy, $p_i$ will be ‘coolness’: the reciprocal of temperature.
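As a sanity check on this definition, here is a small numerical sketch. I again use exponential distributions with mean $q$ as an illustrative family; for these the entropy has the closed form $f(q) = 1 + \ln q,$ so the conjugate variable should come out to $p = 1/q$: 'coolness', if we think of $A(x) = x$ as energy.

```python
import numpy as np

# Toy example (an illustrative assumption): pi_q is the exponential
# distribution on [0, infinity) with mean q.  Its entropy is f(q) = 1 + ln q,
# so the conjugate variable should be p = f'(q) = 1/q.
def entropy(q, x_max=100.0, n=200_001):
    """Numerical entropy -integral of pi_q ln pi_q, truncated at x_max."""
    x = np.linspace(0.0, x_max, n)
    dens = np.exp(-x / q) / q
    integrand = -dens * np.log(dens)
    return np.sum((integrand[:-1] + integrand[1:]) * 0.5 * (x[1] - x[0]))

q = 2.5
h = 1e-4                                  # step for a central finite difference
p = (entropy(q + h) - entropy(q - h)) / (2 * h)
print(p, 1 / q)  # both ≈ 0.4
```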

Defining $p$ this way gives a Lagrangian submanifold

$\Lambda = \{ (q,p) \in T^\ast Q : \; p = (df)_q \}$

of the cotangent bundle $T^\ast Q.$ We can also get contact geometry into the game by defining a contact manifold $T^\ast Q \times \mathbb{R}$ and a Legendrian submanifold

$\Sigma = \{ (q,p,S) \in T^\ast Q \times \mathbb{R} : \; p = (df)_q, \; S = f(q) \}$

But I’ve been talking about these ideas for the last three episodes, so I won’t say more just now! Instead, I want to throw a new idea into the pot.

Gibbsian statistical manifolds

Thermodynamics and statistical mechanics spend a lot of time dealing with statistical manifolds of a special sort I’ll call ‘Gibbsian’. In these, each probability distribution $\pi_q$ is a ‘Gibbs distribution’, meaning that it maximizes entropy subject to certain constraints specified by the point $q \in Q.$

How does this work? For starters, an integrable function

$A \colon \Omega \to \mathbb{R}$

is called a random variable, or in physics an observable. The expected value of an observable is a smooth real-valued function on our statistical manifold

$\langle A \rangle \colon Q \to \mathbb{R}$

given by

$\displaystyle{ \langle A \rangle(q) = \int_\Omega A(x) \pi_q(x) \, d\mu(x) }$

In other words, $\langle A \rangle$ is a function whose value at any point $q \in Q$ is the expected value of $A$ with respect to the probability distribution $\pi_q.$

Now, suppose our statistical manifold is $n$-dimensional and we have $n$ observables $A_1, \dots, A_n.$ Their expected values will be smooth functions on our manifold—and sometimes these functions will be a coordinate system!

This may sound rather unlikely, but it’s really not so outlandish. Indeed, if there’s a point $q$ such that the differentials of the functions $\langle A_i \rangle$ are linearly independent at this point, these functions will be a coordinate system in some neighborhood of this point, by the inverse function theorem. So, we can take this neighborhood, use it as our statistical manifold, and the functions $\langle A_i \rangle$ will be coordinates.

So, let’s assume the expected values of our observables give a coordinate system on $Q.$ Let’s call these coordinates $q_1, \dots, q_n,$ so that

$\langle A_i \rangle(q) = q_i$

Now for the kicker: we say our statistical manifold is Gibbsian if for each point $q \in Q,$ $\pi_q$ is the probability distribution that maximizes entropy subject to the above condition!

Which condition? The condition saying that

$\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }$

for all $i$. This is just the previous equation spelled out so that you can see it’s a condition on $\pi_q.$

This assumption of the entropy-maximizing nature of $\pi_q$ is very powerful, because it implies a useful and nontrivial formula for $\pi_q.$ It’s called the Gibbs distribution:

$\displaystyle{ \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }$

for all $x \in \Omega.$

Here $p_i$ is the intensive variable conjugate to $q_i,$ while $Z(q)$ is the partition function: the thing we must divide by to make sure $\pi_q$ integrates to 1. In other words:

$\displaystyle{ Z(q) = \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x) }$
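Here is a small numerical sketch of the claim that the Gibbs distribution maximizes entropy among distributions with the given expected values. The finite sample space, the observable $A$, and the value of $p$ are my own illustrative choices:

```python
import numpy as np

# A finite toy model of a Gibbs distribution (A and p are illustrative
# assumptions): Omega = {0,...,5} with counting measure, one observable A.
A = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
p = 0.7

w = np.exp(-p * A)
Z = w.sum()            # partition function
pi = w / Z             # Gibbs distribution; sums to 1
q = (A * pi).sum()     # the constrained expected value <A> = q

def H(dist):
    """Shannon entropy of a distribution on the 6-point space."""
    return -(dist * np.log(dist)).sum()

def project(v):
    """Project v so that pi + eps*v still sums to 1 and keeps <A> fixed."""
    ones = np.ones_like(A)
    for b in (ones, A - A.mean()):   # orthogonal basis of the constraints
        v = v - (v @ b) / (b @ b) * b
    return v

# Perturb pi inside the constraint surface: every competitor with the
# same expected value of A should have lower entropy.
rng = np.random.default_rng(0)
ok = True
for _ in range(200):
    dist = pi + 1e-3 * project(rng.normal(size=A.size))
    if (dist > 0).all():
        ok = ok and H(dist) <= H(pi) + 1e-12
print(ok)  # True
```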

By the way, this formula may look confusing at first, since the left side depends on the point $q$ in our statistical manifold, while there’s no $q$ visible on the right side! Do you see what’s going on?

I’ll tell you: the conjugate variable $p_i,$ sitting on the right side of the above formula, depends on $q.$ Remember, we got it by taking the partial derivative of the entropy in the $q_i$ direction

$\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }$

and then evaluating this derivative at the point $q.$

But wait a minute! $f$ here is the entropy—but the entropy of what?

The entropy of $\pi_q,$ of course!

So there’s something circular about our formula for $\pi_q.$ To know $\pi_q,$ you need to know the conjugate variables $p_i,$ but to compute these you need to know the entropy of $\pi_q.$

This is actually okay. While circular, the formula for $\pi_q$ is still true. It’s harder to work with than you might hope. But it’s still extremely useful.
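In practice one often breaks the circle by working with the variables $p_i$ directly and solving the constraint equations for them. Here is a sketch on a finite sample space with one observable (the observable and the target value are illustrative assumptions), where the moment-matching equation can be solved by bisection:

```python
import numpy as np

# On a finite Omega the map p -> <A> under the Gibbs distribution is
# strictly decreasing (its derivative is minus the variance of A), so
# given a target value q we can solve for p by bisection.
A = np.array([0.0, 1.0, 2.0, 3.0])
target_q = 1.2           # must lie strictly between min(A) and max(A)

def mean_A(p):
    """Expected value of A under the Gibbs distribution with parameter p."""
    w = np.exp(-p * A)
    return (A * w).sum() / w.sum()

lo, hi = -50.0, 50.0     # mean_A(lo) is near max(A), mean_A(hi) near min(A)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_A(mid) > target_q:
        lo = mid         # mean still too big: raise p
    else:
        hi = mid
p = 0.5 * (lo + hi)
print(mean_A(p))  # ≈ 1.2
```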

Next time I’ll prove that this formula for $\pi_q$ is true, and do a few things with it. All this material was discovered by Gibbs in the late 1800s, and it’s lurking in any good book on statistical mechanics—but not phrased in the language of statistical manifolds. The physics textbooks usually consider special cases, like a box of gas where:

$q_1$ is energy, $p_1$ is 1/temperature.

$q_2$ is volume, $p_2$ is pressure/temperature.

$q_3$ is the number of particles, $p_3$ is –chemical potential/temperature.

While these special cases are important and interesting, I’d rather be general!

I said “Any statistical manifold comes with a bunch of interesting geometrical structures”, but in fact some conditions are required. For example, the Fisher information metric is only well-defined and nondegenerate under some conditions on the map $\pi.$ In particular, if $\pi$ maps every point of $Q$ to the same probability distribution, the Fisher information metric will vanish.

Similarly, the entropy function $f$ is only smooth under some conditions on $\pi.$

Furthermore, the integral

$\displaystyle{ \int_\Omega \exp\left(-\sum_{i = 1}^n p_i A_i(x) \right) \, d\mu(x) }$

may not converge for all values of the numbers $p_1, \dots, p_n.$ But in my discussion of Gibbsian statistical manifolds, I was assuming that an entropy-maximizing probability distribution $\pi_q$ with

$\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }$

actually exists. In this case the probability distribution is also unique (almost everywhere).
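For example, take $\Omega = [0, \infty)$ with Lebesgue measure and a single observable $A(x) = x.$ Then

$\displaystyle{ \int_0^\infty e^{-p x} \, dx = \frac{1}{p} }$

when $p > 0,$ but the integral diverges when $p \le 0$: the Gibbs distribution here exists only at positive coolness, that is, positive temperature.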

For all my old posts on information geometry, go here:

16 Responses to Information Geometry (Part 20)

1. Toby Bartels says:

Typos: You have a lot of unLaTeXed dollar signs in the last paragraph of the introduction and the first paragraph of the first section.

• John Baez says:

Whoops, and I was using some older notation I had for statistical manifolds. I fixed that. I’m trying to use $Q$ for a statistical manifold now that I’m treating points $q \in Q$ as analogous to points in a ‘configuration space’ in classical mechanics. This goes against the conventions in Part 3, but I’ll deal with that in the unlucky event that I turn this stuff into a book.

• Toby Bartels says:

I see, the introduction had all the wrong letters originally! There's still an unLaTeXed $q \in Q$ in the last paragraph of the introduction and an unLaTeXed $Omega,$ in the first paragraph of the first section.

2. francisrlb says:

I think there might be a typo at the end: “I was assuming that an entropy-minimizing probability distribution \pi_q “. Do you mean maximizing? In the integral coming right after, is A missing a subscript (this also occurs earlier)?

Also, could you please go a bit more into the definition of Gibbsian in your next installment? I’m not understanding what’s assumed and what’s implied by the condition.
Thanks!

• John Baez says:

Thanks for catching those mistakes. I tried to fix them.

What don’t you get about the definition of Gibbsian? I defined it precisely – or so I thought. You may not be familiar with Gibbs distributions, so maybe I need to explain something that seems obvious to me. But you have to tell me what you’re puzzled by! It’s hard to help without knowing what’s the problem.

What I mainly plan to do next time is prove that the Gibbs distribution

$\displaystyle{ \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }$

maximizes the entropy

$\displaystyle{ f(q) = -\int_\Omega \pi_q(x) \, \ln \pi_q(x) \, d\mu(x) }$

subject to the constraints that

$\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }$

for all $i$.

• francisrlb says:

I guess I’m confused about what the $q_i$ are supposed to be: you defined them starting from $A_i$ and the distribution, but now you use them to define another distribution. So do we assume that we have coordinates $q_i$ on $Q$ and observables $A_i$ on $\Omega$ already, and then consider probability distributions for which the expectation of $A_i$ is $q_i$?

• John Baez says:

Francis wrote:

I guess I’m confused about what the $q_{i}$ are supposed to be: you defined them starting from $A_{i}$ and the distribution, but now you use them to define another distribution.

No, there’s only one distribution in this story: or more precisely, one map $\pi$ sending each point $q$ of a manifold $Q$ to a probability distribution $\pi_q.$

Starting from this and some observables $A_1, \dots, A_n,$ I defined the function

$q_i \colon Q \to \mathbb{R}$

to be the expected value of $A_i$ in the distribution $\pi_q$:

$\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }$

Then I assumed that the functions $q_1, \dots, q_n$ are a coordinate system on $Q$ (and I explained why this is easy to achieve).

Then I made a big extra assumption: $\pi_q$ is the probability distribution with the largest possible entropy such that

$\displaystyle{ \int_\Omega A_i(x) \pi_q(x) \, d\mu(x) = q_i }$

for all i. In other words: if you tell me the values of the coordinates $q_1, \dots, q_n,$ then $\pi_q$ is not just any old probability distribution having these values as the expected values of the observables $A_1, \dots, A_n.$ It’s the probability distribution with the largest possible entropy having these values as the expected values of the observables $A_1, \dots, A_n.$

I’m not changing $\pi_q$ here, I’m just making an extra assumption about it. This is a common assumption in thermodynamics.

Without this extra assumption there’s nothing very exciting we can do until someone tells us a formula for $\pi_q$. But with this extra assumption I can show that $\pi_q$ must obey this equation:

$\displaystyle{ \pi_q(x) = \frac{1}{Z(q)} \exp\left(-\sum_{i = 1}^n p_i A_i(x)\right) }$

(I defined $p_i$ and $Z(q)$ in the post.)

I will actually show this equation next time.

I hope the logic is clearer now. If there’s something puzzling still, just ask.

3. John Baez says:

I added a bit about the traditional reasons for caring about Gibbsian statistical manifolds and the variables $q_i$ and $p_i$:

Next time I’ll prove that this formula for $\pi_q$ is true, and do a few things with it. All this material was discovered by Josiah Willard Gibbs in the late 1800s, and it’s lurking in any good book on statistical mechanics—but they don’t phrase it in the language of statistical manifolds. The physics textbooks usually consider special cases, like the case where:

$q_1$ is energy, $p_1$ is 1/temperature.

$q_2$ is volume, $p_2$ is pressure/temperature.

While these special cases make the subject important, I’d rather be general!

4. Great post! Question: what is the information analogue of FORCE?

• John Baez says:

Hi! That’s a good question. I will answer it someday, as I keep discussing the analogy between classical mechanics and information geometry. There are actually a few different possibilities, depending on what we take as the analogue of time.

5. Toby Bartels says:

It is pretty convoluted; I had to go through it a few times.

I think that it would help to use different symbols for $q$ (a point of the manifold $Q$) and $q_i$ (a coordinate function on $Q$). Because even though $q$ can be written as a tuple using the coordinates $q_i$, still $q$ comes well before $q_i$ in the development.

• John Baez says:

Everyone in differential geometry uses $x_i$ as coordinates of a point called $x,$ everyone in physics uses $q_i$ as coordinates of a position called $q,$ etc. This is confusing when you first see it, and you might feel tempted to write $x_i(x)$ for the coordinate function $x_i$ evaluated at the point $x.$ But it’s good to get used to.

I’m sorry my exposition seemed convoluted. Maybe it’s because I was trying to explain two subjects in one blog post: statistical manifolds and thermodynamics. They’re closely related. In a statistical manifold each point labels a probability distribution. In thermodynamics each point of a manifold labels a probability distribution that maximizes entropy subject to constraints on some observables, and the expected values of these observables serve as coordinates.

I didn’t have a whole lot I wanted to say about statistical manifolds this time, except as a lead-in to thermodynamics. So I just laid out the basic formalism and then took a hard right turn into thermodynamics. If I ever turn this stuff into a book I’ll try to give the readers a more gentle ride.

• Toby Bartels says:

Yes, but they do this when they start with the coordinates. Here the coordinates are derived at the end of a series of steps; I think that's adding to the confusion. I see why you did it this way, but it is an unusual order to do things.

Anyway, I'm really hoping that people who see these comments might think ‹Ah, that's why I'm confused, but now I understand.› rather than that you'll change the notation, especially since I don’t have a natural alternative to suggest.

• John Baez says:

Yeah. Socrates once complained that the problem with books is that you can’t ask a book questions. But on a blog you can ask questions — and other people can read the answers! This is what I like about blogs.

• John Baez says:

That’s interesting, but I’m suspicious of it for some reason.
