Information Geometry (Part 18)

Last time I sketched how two related forms of geometry, symplectic and contact geometry, show up in thermodynamics. Today I want to explain how they show up in probability theory.

For some reason I haven’t seen much discussion of this! But people should have looked into this. After all, statistical mechanics explains thermodynamics in terms of probability theory, so if some mathematical structure shows up in thermodynamics it should appear in statistical mechanics… and thus ultimately in probability theory.

I just figured out how this works for symplectic and contact geometry.

Suppose a system has n possible states. We’ll call these microstates, following the tradition in statistical mechanics. If you don’t know what ‘microstate’ means, don’t worry about it! But the rough idea is that if you have a macroscopic system like a rock, the precise details of what its atoms are doing are described by a microstate, and many different microstates could be indistinguishable unless you look very carefully.

We’ll call the microstates 1, 2, \dots, n. So, if you don’t want to think about physics, when I say microstate I’ll just mean an integer from 1 to n.

Next, a probability distribution q assigns a real number q_i to each microstate, and these numbers must sum to 1 and be nonnegative. So, we have q \in \mathbb{R}^n, though not every vector in \mathbb{R}^n is a probability distribution.

I’m sure you’re wondering why I’m using q rather than p to stand for a probability distribution. Am I just trying to confuse you?

No: I’m trying to set up an analogy to physics!

Last time I introduced symplectic geometry using classical mechanics. The most important example of a symplectic manifold is the cotangent bundle T^\ast Q of a manifold Q. A point of T^\ast Q is a pair (q,p) consisting of a point q \in Q and a cotangent vector p \in T^\ast_q Q. In classical mechanics the point q describes the position of some physical system, while p describes its momentum.

So, I’m going to set up an analogy like this:

       Classical Mechanics    Probability Theory
  q    position               probability distribution
  p    momentum               ???

But what is to momentum as probability is to position?

A big clue is the appearance of symplectic geometry in thermodynamics, which I also outlined last time. We can use this to get some intuition about the analogue of momentum in probability theory.

In thermodynamics, a system has a manifold Q of states. (These are not the ‘microstates’ I mentioned before: we’ll see the relation later.) There is a function

f \colon Q \to \mathbb{R}

describing the entropy of the system as a function of its state. There is a law of thermodynamics saying that

p = (df)_q

This equation picks out a submanifold of T^\ast Q, namely

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

Moreover this submanifold is Lagrangian: the symplectic structure \omega vanishes when restricted to it:

\displaystyle{ \omega |_\Lambda = 0 }

This is very beautiful, but it goes by so fast you might almost miss it! So let’s clutter it up a bit with coordinates. We often use local coordinates on Q and describe a point q \in Q using these coordinates, getting a point

(q_1, \dots, q_n) \in \mathbb{R}^n

They give rise to local coordinates q_1, \dots, q_n, p_1, \dots, p_n on the cotangent bundle T^\ast Q. The q_i are called extensive variables, because they are typically things that you can measure only by totalling up something over the whole system, like the energy or volume of a cylinder of gas. The p_i are called intensive variables, because they are typically things that you can measure locally at any point, like temperature or pressure.

In these local coordinates, the symplectic structure on T^\ast Q is the 2-form given by

\omega = dp_1 \wedge dq_1 + \cdots + dp_n \wedge dq_n

The equation

p = (df)_q

serves as a law of physics that determines the intensive variables given the extensive ones when our system is in thermodynamic equilibrium. Written out using coordinates, this law says

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

It looks pretty bland here, but in fact it gives formulas for the temperature and pressure of a gas, and many other useful formulas in thermodynamics.
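
To see why \Lambda is Lagrangian in these coordinates, here is a short check. It assumes only that f is smooth enough for its mixed partial derivatives to commute. Substituting p_i = \partial f / \partial q_i into \omega gives

\displaystyle{ \omega|_\Lambda = \sum_{i=1}^n d\left(\frac{\partial f}{\partial q_i}\right) \wedge dq_i = \sum_{i,j=1}^n \frac{\partial^2 f}{\partial q_j \partial q_i} \, dq_j \wedge dq_i = 0 }

The last step holds because the second partial derivatives are symmetric in i and j while dq_j \wedge dq_i is antisymmetric, so the terms cancel in pairs. Since \Lambda is the graph of df, it has dimension n, half the dimension of T^\ast Q.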

Now we are ready to see how all this plays out in probability theory! We’ll get an analogy like this, which goes hand-in-hand with our earlier one:

       Thermodynamics         Probability Theory
  q    extensive variables    probability distribution
  p    intensive variables    ???

This analogy is clearer than the last, because statistical mechanics reveals that the extensive variables in thermodynamics are really just summaries of probability distributions on microstates. Furthermore, both thermodynamics and probability theory have a concept of entropy.

So, let’s take our manifold Q to consist of probability distributions on the set of microstates I was talking about before: the set \{1, \dots, n\}. Actually, let’s use nowhere vanishing probability distributions:

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

I’m requiring q_i > 0 to ensure Q is a manifold, and also to make sure the entropy function f defined below is differentiable: it ceases to be differentiable when one of the probabilities q_i hits zero.

Since Q is a manifold, its cotangent bundle is a symplectic manifold T^\ast Q. And here’s the good news: we have a god-given entropy function

f \colon Q \to \mathbb{R}

namely the Shannon entropy

\displaystyle{ f(q) = - \sum_{i = 1}^n q_i \ln q_i }
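
As a concrete illustration, here is a minimal Python sketch of this formula. The function name shannon_entropy and the two sample distributions are hypothetical, chosen only for illustration; the check that every q_i > 0 mirrors the definition of Q.

```python
import math

def shannon_entropy(q):
    """Shannon entropy f(q) = -sum_i q_i ln q_i of a nowhere vanishing distribution."""
    # Points of Q have strictly positive entries that sum to 1.
    if any(x <= 0 for x in q):
        raise ValueError("all probabilities must be strictly positive")
    if not math.isclose(sum(q), 1.0, rel_tol=0.0, abs_tol=1e-12):
        raise ValueError("probabilities must sum to 1")
    return -sum(x * math.log(x) for x in q)

# Hypothetical sample distributions on n = 4 microstates.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

print(shannon_entropy(uniform))  # equals ln 4, the maximum for n = 4
print(shannon_entropy(skewed))   # smaller than ln 4
```

The uniform distribution gives the largest entropy, which is the sense in which a probability distribution can 'flatten out'.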

So, everything I just described about thermodynamics works in the setting of plain old probability theory! Starting from our manifold Q and the entropy function, we get all the rest, leading up to the Lagrangian submanifold

\Lambda = \{(q,p) \in T^\ast Q : \; p = (df)_q \}

that describes the relation between extensive and intensive variables.

For computations it helps to pick coordinates on Q. Since the probabilities q_1, \dots, q_n sum to 1, they aren’t independent coordinates on Q. So, we can either pick all but one of them as coordinates, or learn how to deal with non-independent coordinates, which are already completely standard in projective geometry. Let’s do the former, just to keep things simple.

These coordinates on Q give rise in the usual way to coordinates q_i and p_i on the cotangent bundle T^\ast Q. These play the role of extensive and intensive variables, respectively, and it should be very interesting to impose the equation

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

where f is the Shannon entropy. This picks out a Lagrangian submanifold \Lambda \subseteq T^\ast Q.

So, the question becomes: what does this mean? If this formula gives the analogue of momentum for probability theory, what does this analogue of momentum mean?

Here’s a preliminary answer: p_i says how fast entropy increases as we increase the probability q_i that our system is in the ith microstate. So if we think of nature as ‘wanting’ to maximize entropy, the quantity p_i says how eager it is to increase the probability q_i.

Indeed, you can think of p_i as a bit like pressure—one of the most famous intensive quantities in thermodynamics. A gas ‘wants’ to expand, and its pressure says precisely how eager it is to expand. Similarly, a probability distribution ‘wants’ to flatten out, to maximize entropy, and p_i says how eager it is to increase the probability q_i in order to do this.
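
To make this concrete, here is a small Python sketch. It assumes we take q_1, \dots, q_{n-1} as coordinates and set q_n = 1 - (q_1 + \cdots + q_{n-1}). Under that assumption the chain rule gives p_i = \ln q_n - \ln q_i, and the code compares this formula with a finite-difference estimate. The function names and the sample point are hypothetical.

```python
import math

def entropy_in_coords(x):
    """Shannon entropy as a function of the independent coordinates x = (q_1, ..., q_{n-1})."""
    q_n = 1.0 - sum(x)  # the probability we chose not to use as a coordinate
    return -sum(t * math.log(t) for t in x) - q_n * math.log(q_n)

def momentum(x):
    """Analytic p_i = df/dq_i = ln q_n - ln q_i for i = 1, ..., n-1."""
    q_n = 1.0 - sum(x)
    return [math.log(q_n) - math.log(t) for t in x]

def momentum_numeric(x, h=1e-6):
    """Central finite-difference estimate of the same partial derivatives."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grads.append((entropy_in_coords(up) - entropy_in_coords(down)) / (2 * h))
    return grads

x = [0.5, 0.3]  # hypothetical point of Q with n = 3, so q_3 = 0.2
print(momentum(x))
print(momentum_numeric(x))
```

At this sample point q_3 = 0.2 is the least likely microstate, so both p_1 and p_2 come out negative: shifting probability from microstate 3 to microstate 1 or 2 lowers the entropy. At the uniform distribution every p_i vanishes, as one would expect at the entropy maximum.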

But what can we do with this concept? And what does symplectic geometry do for probability theory?

I will start tackling these questions next time.

One thing I’ll show is that when we reduce thermodynamics to probability theory using the ideas of statistical mechanics, the appearance of symplectic geometry in thermodynamics follows from its appearance in probability theory.

Another thing I want to investigate is how other geometrical structures on the space of probability distributions, like the Fisher information metric, interact with the symplectic structure on its cotangent bundle. This will integrate symplectic geometry and information geometry.

I also want to bring contact geometry into the picture. It’s already easy to see from our work last time how this should go. We treat the entropy S as an independent variable, and replace T^\ast Q with a larger manifold T^\ast Q \times \mathbb{R} having S as an extra coordinate. This is a contact manifold with contact form

\alpha = -dS + p_1 dq_1 + \cdots + p_n dq_n

This contact manifold has a submanifold \Sigma where we remember that entropy is a function of the probability distribution q, and define p in terms of q as usual:

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}

And as we saw last time, \Sigma is a Legendrian submanifold, meaning

\displaystyle{ \alpha|_{\Sigma} = 0 }
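
A quick way to check this, sketched in the local coordinates used above: on \Sigma we have S = f(q), so dS = df, and p_i = \partial f/\partial q_i. Hence

\displaystyle{ \alpha|_\Sigma = -df + \sum_i \frac{\partial f}{\partial q_i} \, dq_i = 0 }

since the sum is just df written out in coordinates.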

But again, we want to understand what these ideas from contact geometry really do for probability theory!

For all my old posts on information geometry, go here:

Information geometry.

2 Responses to Information Geometry (Part 18)

  1. Toby Bartels says:

    It seems to me that if we want to know what to call p_i, then we should calculate it:

    p_i := \partial S/\partial q_i = \partial(-\sum_j q_j \ln q_j)/\partial q_i =
    -d(q_i \ln q_i)/d q_i = -\ln q_i - 1.

    Now, -\ln q is often called the surprisal; it tells you how surprised you should be if an event of probability q occurs (from no surprise if the event is certain to infinite surprise if the event is impossible). For example, the entropy is the expected surprisal. And so p_i is basically the surprisal of microstate i, only we subtract 1 (the surprisal associated with a probability of 1/e) for some reason.

    But actually, there’s a flaw in my calculation, because I forgot that there are only n - 1 independent variables, so I need to add on \partial (-q_n \ln q_n)/\partial q_i, where q_n = 1 - \sum_{j < n} q_j, so that \partial q_n/\partial q_i = -1:

    \partial(-q_n \ln q_n)/\partial q_i = (d(-q_n \ln q_n)/d q_n) (\partial q_n/\partial q_i) =
    (-\ln q_n - 1) (-1) = \ln q_n + 1.

    Therefore, the correct value of p_i is \ln q_n - \ln q_i, the relative surprisal of microstate i relative to microstate n (the state whose probability we arbitrarily chose not to include as an independent variable). At least the mysterious 1s cancelled.

  2. Jeremy Schmitt says:

    Great post and a fascinating topic! I did my thesis on symplectic integrators (the same type that powers Hamiltonian Monte Carlo), and the connection with information geometry is really intriguing. My advisor at UCSD had one related paper that attempted to connect symplectic and information geometry in a discrete setting by connecting divergence functions and a generating function.
