## Information Geometry (Part 18)

Last time I sketched how two related forms of geometry, symplectic and contact geometry, show up in thermodynamics. Today I want to explain how they show up in probability theory.

For some reason I haven’t seen much discussion of this! But people should have looked into this. After all, statistical mechanics explains thermodynamics in terms of probability theory, so if some mathematical structure shows up in thermodynamics it should appear in statistical mechanics… and thus ultimately in probability theory.

I just figured out how this works for symplectic and contact geometry.

Suppose a system has $n$ possible states. We’ll call these microstates, following the tradition in statistical mechanics. If you don’t know what ‘microstate’ means, don’t worry about it! But the rough idea is that if you have a macroscopic system like a rock, the precise details of what its atoms are doing are described by a microstate, and many different microstates could be indistinguishable unless you look very carefully.

We’ll call the microstates $1, 2, \dots, n.$ So, if you don’t want to think about physics, when I say microstate I’ll just mean an integer from $1$ to $n.$

Next, a probability distribution $q$ assigns a real number $q_i$ to each microstate, and these numbers must sum to 1 and be nonnegative. So, we have $q \in \mathbb{R}^n,$ though not every vector in $\mathbb{R}^n$ is a probability distribution.

I’m sure you’re wondering why I’m using $q$ rather than $p$ to stand for a probability distribution. Am I just trying to confuse you?

No: I’m trying to set up an analogy to physics!

Last time I introduced symplectic geometry using classical mechanics. The most important example of a symplectic manifold is the cotangent bundle $T^\ast Q$ of a manifold $Q.$ A point of $T^\ast Q$ is a pair $(q,p)$ consisting of a point $q \in Q$ and a cotangent vector $p \in T^\ast_q Q.$ In classical mechanics the point $q$ describes the position of some physical system, while $p$ describes its momentum.

So, I’m going to set up an analogy like this:

| | Classical Mechanics | Probability Theory |
|---|---|---|
| $q$ | position | probability distribution |
| $p$ | momentum | ??? |

But what is to probability as momentum is to position?

A big clue is the appearance of symplectic geometry in thermodynamics, which I also outlined last time. We can use this to get some intuition about the analogue of momentum in probability theory.

In thermodynamics, a system has a manifold $Q$ of states. (These are not the ‘microstates’ I mentioned before: we’ll see the relation later.) There is a function

$f \colon Q \to \mathbb{R}$

describing the entropy of the system as a function of its state. There is a law of thermodynamics saying that

$p = (df)_q$

This equation picks out a submanifold of $T^\ast Q,$ namely

$\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}$

Moreover this submanifold is Lagrangian: it has half the dimension of $T^\ast Q,$ and the symplectic structure $\omega$ vanishes when restricted to it:

$\displaystyle{ \omega |_\Lambda = 0 }$

This is very beautiful, but it goes by so fast you might almost miss it! So let’s clutter it up a bit with coordinates. We often use local coordinates on $Q$ and describe a point $q \in Q$ using these coordinates, getting a point

$(q_1, \dots, q_n) \in \mathbb{R}^n$

They give rise to local coordinates $q_1, \dots, q_n, p_1, \dots, p_n$ on the cotangent bundle $T^\ast Q.$ The $q_i$ are called extensive variables, because they are typically things that you can measure only by totalling up something over the whole system, like the energy or volume of a cylinder of gas. The $p_i$ are called intensive variables, because they are typically things that you can measure locally at any point, like temperature or pressure.

In these local coordinates, the symplectic structure on $T^\ast Q$ is the 2-form given by

$\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n$

The equation

$p = (df)_q$

serves as a law of physics that determines the intensive variables given the extensive ones when our system is in thermodynamic equilibrium. Written out using coordinates, this law says

$\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }$

It looks pretty bland here, but in fact it gives formulas for the temperature and pressure of a gas, and many other useful formulas in thermodynamics.
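Here is a short coordinate check that $\Lambda$ is Lagrangian, sketched under the assumption that $f$ is twice continuously differentiable. On $\Lambda$ we have $p_i = \partial f / \partial q_i,$ so

$\displaystyle{ dp_i = \sum_{j=1}^n \frac{\partial^2 f}{\partial q_j \partial q_i} \, dq_j }$

and therefore

$\displaystyle{ \omega|_\Lambda = \sum_{i,j=1}^n \frac{\partial^2 f}{\partial q_j \partial q_i} \, dq_j \wedge dq_i = 0 }$

The sum vanishes because the second partial derivatives are symmetric in $i$ and $j$ while $dq_j \wedge dq_i$ is antisymmetric. Since $\Lambda$ is the graph of $df,$ it has the same dimension as $Q,$ which is half the dimension of $T^\ast Q.$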

Now we are ready to see how all this plays out in probability theory! We’ll get an analogy like this, which goes hand-in-hand with our earlier one:

| | Thermodynamics | Probability Theory |
|---|---|---|
| $q$ | extensive variables | probability distribution |
| $p$ | intensive variables | ??? |

This analogy is clearer than the last, because statistical mechanics reveals that the extensive variables in thermodynamics are really just summaries of probability distributions on microstates. Furthermore, both thermodynamics and probability theory have a concept of entropy.

So, let’s take our manifold $Q$ to consist of probability distributions on the set of microstates I was talking about before: the set $\{1, \dots, n\}.$ Actually, let’s use nowhere vanishing probability distributions:

$\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }$

I’m requiring $q_i > 0$ to ensure $Q$ is a manifold, and also to make sure the entropy function $f$ we are about to introduce is differentiable: it ceases to be differentiable when one of the probabilities $q_i$ hits zero.

Since $Q$ is a manifold, its cotangent bundle is a symplectic manifold $T^\ast Q.$ And here’s the good news: we have a god-given entropy function

$f \colon Q \to \mathbb{R}$

namely the Shannon entropy

$\displaystyle{ f(q) = - \sum_{i = 1}^n q_i \ln q_i }$
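To make this formula concrete, here is a minimal Python sketch that evaluates the Shannon entropy of a nowhere vanishing distribution. The function name `shannon_entropy` and the two sample distributions are illustrative choices, not anything fixed by the discussion above.

```python
import math


def shannon_entropy(q):
    """Return f(q) = -sum_i q_i ln q_i for a nowhere vanishing distribution q."""
    # Points of Q must have strictly positive entries that sum to 1.
    if any(qi <= 0 for qi in q):
        raise ValueError("all probabilities must be strictly positive")
    if not math.isclose(sum(q), 1.0, rel_tol=0.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -sum(qi * math.log(qi) for qi in q)


# Two hypothetical distributions on n = 4 microstates.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(shannon_entropy(uniform), math.log(4))
print(shannon_entropy(skewed))
```

The uniform distribution should give $\ln 4,$ and the skewed one should give something smaller, in line with the idea that flattening a distribution increases its entropy.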

So, everything I just described about thermodynamics works in the setting of plain old probability theory! Starting from our manifold $Q$ and the entropy function, we get all the rest, leading up to the Lagrangian submanifold

$\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}$

that describes the relation between extensive and intensive variables.

For computations it helps to pick coordinates on $Q.$ Since the probabilities $q_1, \dots, q_n$ sum to 1, they aren’t independent coordinates on $Q.$ So, we can either pick all but one of them as coordinates, or learn how to deal with non-independent coordinates, which are already completely standard in projective geometry. Let’s do the former, just to keep things simple.

These coordinates on $Q$ give rise in the usual way to coordinates $q_i$ and $p_i$ on the cotangent bundle $T^\ast Q.$ These play the role of extensive and intensive variables, respectively, and it should be very interesting to impose the equation

$\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }$

where $f$ is the Shannon entropy. This picks out a Lagrangian submanifold $\Lambda \subseteq T^\ast Q.$
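To get a feel for these $p_i,$ here is a rough Python sketch that uses $q_1, \dots, q_{n-1}$ as coordinates, sets $q_n = 1 - \sum_{i < n} q_i,$ and estimates $\partial f / \partial q_i$ by central finite differences. It compares the estimate with the closed form $\ln q_n - \ln q_i$ that one gets by differentiating by hand. All function names, the step size `h` and the sample point are assumptions made for illustration.

```python
import math


def entropy_reduced(x):
    """Shannon entropy in the coordinates x = (q_1, ..., q_{n-1}); q_n = 1 - sum(x)."""
    q = list(x) + [1.0 - sum(x)]
    return -sum(qi * math.log(qi) for qi in q)


def momentum_numeric(x, i, h=1e-6):
    """Central finite-difference estimate of p_i = df/dq_i (i is 0-based)."""
    up = list(x)
    down = list(x)
    up[i] += h
    down[i] -= h
    return (entropy_reduced(up) - entropy_reduced(down)) / (2 * h)


def momentum_closed_form(x, i):
    """Hand-derived value p_i = ln q_n - ln q_i in the reduced coordinates."""
    q_n = 1.0 - sum(x)
    return math.log(q_n) - math.log(x[i])


# Hypothetical point of Q with n = 3: q_1 = 0.5, q_2 = 0.3, so q_3 is about 0.2.
x = [0.5, 0.3]
for i in range(len(x)):
    print(i + 1, momentum_numeric(x, i), momentum_closed_form(x, i))
```

In this sketch both $p_i$ come out negative, because $q_1$ and $q_2$ are larger than $q_3$: raising either of them at the expense of $q_3$ lowers the entropy. At the uniform distribution every $p_i$ should vanish, as expected at a maximum of $f.$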

So, the question becomes: what does this mean? If this formula gives the analogue of momentum for probability theory, what does this analogue of momentum mean?

Here’s a preliminary answer: $p_i$ says how fast entropy increases as we increase the probability $q_i$ that our system is in the $i$th microstate. So if we think of nature as ‘wanting’ to maximize entropy, the quantity $p_i$ says how eager it is to increase the probability $q_i.$

Indeed, you can think of $p_i$ as a bit like pressure—one of the most famous intensive quantities in thermodynamics. A gas ‘wants’ to expand, and its pressure says precisely how eager it is to expand. Similarly, a probability distribution ‘wants’ to flatten out, to maximize entropy, and $p_i$ says how eager it is to increase the probability $q_i$ in order to do this.

But what can we do with this concept? And what does symplectic geometry do for probability theory?

I will start tackling these questions next time.

One thing I’ll show is that when we reduce thermodynamics to probability theory using the ideas of statistical mechanics, the appearance of symplectic geometry in thermodynamics follows from its appearance in probability theory.

Another thing I want to investigate is how other geometrical structures on the space of probability distributions, like the Fisher information metric, interact with the symplectic structure on its cotangent bundle. This will integrate symplectic geometry and information geometry.

I also want to bring contact geometry into the picture. It’s already easy to see from our work last time how this should go. We treat the entropy $S$ as an independent variable, and replace $T^\ast Q$ with a larger manifold $T^\ast Q \times \mathbb{R}$ having $S$ as an extra coordinate. This is a contact manifold with contact form

$\alpha = -dS + p_1 dq_1 + \cdots + p_n dq_n$

This contact manifold has a submanifold $\Sigma$ where we remember that entropy is a function of the probability distribution $q,$ and define $p$ in terms of $q$ as usual:

$\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}$

And as we saw last time, $\Sigma$ is a Legendrian submanifold, meaning

$\displaystyle{ \alpha|_{\Sigma} = 0 }$
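As a quick coordinate check, assuming only that $f$ is differentiable: on $\Sigma$ we have $S = f(q)$ and $p_i = \partial f / \partial q_i,$ so $dS$ restricts to $df$ and

$\displaystyle{ \alpha|_\Sigma = -df + \sum_i \frac{\partial f}{\partial q_i} \, dq_i = 0 }$

because the sum is just $df$ written out in coordinates.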

But again, we want to understand what these ideas from contact geometry really do for probability theory!

For all my old posts on information geometry, go here:

### 2 Responses to Information Geometry (Part 18)

1. Toby Bartels says:

It seems to me that if we want to know what to call $p_i$, then we should calculate it:

$p_i := \partial S/\partial q_i = \partial(-\sum_j q_j \ln q_j)/\partial q_i =$
$-d(q_i \ln q_i)/d q_i = -\ln q_i - 1.$

Now, $-\ln q$ is often called the surprisal; it tells you how surprised you should be if an event of probability $q$ occurs (from no surprise if the event is certain to infinite surprise if the event is impossible). For example, the entropy is the expected surprisal. And so $p_i$ is basically the surprisal of microstate $i$, only we subtract $1$ (the surprisal associated with a probability of $1/e$) for some reason.

But actually, there’s a flaw in my calculation, because I forgot that there are only $n - 1$ independent variables, so I need to add on $\partial (-q_n \ln q_n)/\partial q_i$, where $q_n = 1 - \sum_{i < n} q_i$, so that $\partial q_n/\partial q_i = -1:$

$\partial(-q_n \ln q_n)/\partial q_i = (d(-q_n \ln q_n)/d q_n) (\partial q_n/\partial q_i) =$
$(-\ln q_n - 1) (-1) = \ln q_n + 1.$

Therefore, the correct value of $p_i$ is $\ln q_n - \ln q_i$, the relative surprisal of microstate $i$ relative to microstate $n$ (the state whose probability we arbitrarily chose not to include as an independent variable). At least the mysterious $1$s cancelled.

2. Jeremy Schmitt says:

Great post and a fascinating topic! I did my thesis on symplectic integrators (the same type that power Hamiltonian Monte Carlo), and the connection with information geometry is really intriguing. My advisor at UCSD had one related paper that attempted to connect symplectic and information geometry in a discrete setting by relating divergence functions to generating functions.
https://www.mdpi.com/1099-4300/19/10/518
