Information Geometry (Part 19)

Last time I figured out the analogue of momentum in probability theory, but I didn’t say what it’s called. Now I will tell you—thanks to some help from Abel Jansma and Toby Bartels.


It’s called surprisal! This is a well-known concept in information theory. It’s also called ‘information content’.

Let’s see why. First, let’s remember the setup. We have a manifold

\displaystyle{ Q = \{ q \in \mathbb{R}^n : \; q_i > 0, \; \sum_{i=1}^n q_i = 1 \} }

whose points q are nowhere vanishing probability distributions on the set \{1, \dots, n\}. We have a function

f \colon Q \to \mathbb{R}

called the Shannon entropy, defined by

\displaystyle{ f(q) = - \sum_{j = 1}^n q_j \ln q_j }

For each point q \in Q we define a cotangent vector p \in T^\ast_q Q by

p = (df)_q

As mentioned last time, this is the analogue of momentum in probability theory. In the second half of this post I’ll say more about exactly why. But first let’s compute it and see what it actually equals!

Let’s start with a naive calculation, acting as if the probabilities q_1, \dots, q_n were a coordinate system on the manifold Q. We get

\displaystyle{ p_i = \frac{\partial f}{\partial q_i} }

so using the definition of the Shannon entropy we have

\begin{array}{ccl}  p_i &=& \displaystyle{ -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j  }\\ \\  &=& \displaystyle{ -\frac{\partial}{\partial q_i} \left( q_i \ln q_i \right) } \\ \\  &=& -\ln(q_i) - 1  \end{array}
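As a sanity check on this naive calculation, here is a minimal Python sketch that compares central finite differences of the Shannon entropy with the formula -\ln(q_i) - 1. The names `shannon_entropy` and `naive_partials`, the sample point q and the step size h are illustrative choices, not anything from the post. Note that nudging a single q_i on its own moves us off the manifold Q, which is exactly what makes the calculation naive.

```python
import math

def shannon_entropy(q):
    """Shannon entropy f(q) = -sum_j q_j ln q_j, in nats."""
    return -sum(x * math.log(x) for x in q)

def naive_partials(q, h=1e-6):
    """Central finite differences of f, treating q_1..q_n as independent."""
    grads = []
    for i in range(len(q)):
        up = list(q)
        down = list(q)
        up[i] += h      # perturb only the i-th entry (this leaves Q)
        down[i] -= h
        grads.append((shannon_entropy(up) - shannon_entropy(down)) / (2 * h))
    return grads

q = [0.5, 0.3, 0.2]                        # an arbitrary point of Q
numeric = naive_partials(q)
exact = [-math.log(x) - 1 for x in q]      # the formula derived above

for n, e in zip(numeric, exact):
    print(f"finite difference: {n:.6f}   formula: {e:.6f}")
```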

Now, the quantity -\ln q_i is called the surprisal of the probability distribution at i. Intuitively, it’s a measure of how surprised you should be if an event of probability q_i occurs. For example, if you flip a fair coin and it lands heads up, your surprisal is ln 2. If you flip 100 fair coins and they all land heads up, your surprisal is 100 times ln 2.
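The coin example is easy to reproduce numerically. This is a small sketch in which `surprisal` is a hypothetical helper name; it uses natural logarithms, matching the choice of base made in this post.

```python
import math

def surprisal(prob):
    """Surprisal -ln(prob) of an event with probability prob, in nats."""
    return -math.log(prob)

one_head = surprisal(0.5)                 # one fair coin landing heads
hundred_heads = surprisal(0.5 ** 100)     # 100 independent fair coins, all heads

print(one_head, math.log(2))
print(hundred_heads, 100 * math.log(2))
```

The second line of output illustrates additivity: the probability of 100 independent heads is the product of 100 factors of 1/2, so its surprisal is the sum of 100 copies of ln 2.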

Of course ‘surprise’ is a psychological term, not a term from math or physics, so we shouldn’t take it too seriously here. We can derive the concept of surprisal from three axioms:

  1. The surprisal of an event of probability q is some function of q, say F(q).
  2. The less probable an event is, the larger its surprisal is: q_1 \le q_2 \implies  F(q_1) \ge F(q_2).
  3. The surprisal of two independent events is the sum of their surprisals: F(q_1 q_2) = F(q_1) + F(q_2).

It follows from work on Cauchy’s functional equation that F must be of this form:

F(q) = - \log_b q

for some constant b > 1. We shall choose b, the base of our logarithms, to be e. We had a similar freedom of choice in defining the Shannon entropy, and we will use base e for both to be consistent. If we chose something else, it would change the surprisal and the Shannon entropy by the same constant factor.
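To see the effect of the choice of base concretely, here is a hedged sketch. The functions `surprisal` and `entropy` and their `base` parameter are illustrative names; the sample distribution is arbitrary. Switching from base e to base 2 multiplies both quantities by the same factor 1/ln 2, and the additivity axiom holds in any base.

```python
import math

def surprisal(prob, base=math.e):
    """Surprisal -log_base(prob)."""
    return -math.log(prob) / math.log(base)

def entropy(q, base=math.e):
    """Shannon entropy: the expected surprisal under q."""
    return sum(x * surprisal(x, base) for x in q)

q = [0.5, 0.25, 0.25]
nats = entropy(q)              # base e
bits = entropy(q, base=2)      # base 2
factor = 1 / math.log(2)       # the constant relating the two bases

print(bits, nats * factor)

# Axiom 3 (additivity for independent events), checked in base 2:
lhs = surprisal(0.2 * 0.3, base=2)
rhs = surprisal(0.2, base=2) + surprisal(0.3, base=2)
print(lhs, rhs)
```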

So far, so good. But what about the irksome “-1” in our formula?

p_i = -\ln(q_i) - 1

Luckily it turns out we can just get rid of this! The reason is that the probabilities q_i are not really coordinates on the manifold Q. They’re not independent: they must sum to 1. So, when we change them a little, the sum of their changes must vanish. Putting it more technically, the tangent space T_q Q is not all of \mathbb{R}^n, but just the subspace consisting of vectors whose components sum to zero:

\displaystyle{ T_q Q = \{ v \in \mathbb{R}^n : \; \sum_{j = 1}^n v_j = 0 \} }

The cotangent space is the dual of the tangent space. The dual of a subspace

S \subseteq V

is the quotient space

V^\ast/\{ \ell \colon V \to \mathbb{R} : \; \forall v \in S \; \, \ell(v) = 0 \}

The cotangent space T_q^\ast Q thus consists of linear functionals \ell \colon \mathbb{R}^n \to \mathbb{R} modulo those that vanish on vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

Of course, we can identify the dual of \mathbb{R}^n with \mathbb{R}^n in the usual way, using the Euclidean inner product: a vector u \in \mathbb{R}^n corresponds to the linear functional

\displaystyle{ \ell(v) = \sum_{j = 1}^n u_j v_j }

From this, you can see that a linear functional \ell vanishes on all vectors v obeying the equation

\displaystyle{ \sum_{j = 1}^n v_j = 0 }

if and only if its corresponding vector u has

u_1 = \cdots = u_n

So, we get

T^\ast_q Q \cong \mathbb{R}^n/\{ u \in \mathbb{R}^n : \; u_1 = \cdots = u_n \}

In words: we can describe cotangent vectors to Q as lists of n numbers if we want, but we have to remember that adding the same constant to each number in the list doesn’t change the cotangent vector!
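Here is a tiny numerical illustration of that statement, under the identification of the dual of \mathbb{R}^n with \mathbb{R}^n used above. The helper `pair` and the particular vectors are made up for the example. Adding the same constant to every entry of u leaves its pairing with a tangent vector unchanged, because the entries of a tangent vector sum to zero.

```python
def pair(u, v):
    """Evaluate the linear functional given by u on the vector v."""
    return sum(a * b for a, b in zip(u, v))

v = [0.3, -0.1, -0.2]            # a tangent vector: entries sum to zero
u = [1.0, 2.0, 5.0]              # one list of numbers representing a covector
shifted = [x + 7.0 for x in u]   # same covector, different representative

print(pair(u, v), pair(shifted, v))
```

Both pairings agree up to floating-point rounding, so u and the shifted list define the same element of the cotangent space.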

This suggests that our naive formula

p_i = -\ln(q_i) - 1

is on the right track, but we’re free to get rid of the constant 1 if we want! And that’s true.

To check this rigorously, we need to show

\displaystyle{ p(v) = -\sum_{j=1}^n \ln(q_j) v_j }

for all v \in T_q Q. We compute:

\begin{array}{ccl}  p(v) &=& df(v) \\ \\  &=& v(f) \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j \, \frac{\partial f}{\partial q_j} } \\ \\  &=& \displaystyle{ \sum_{j=1}^n v_j (-\ln(q_j) - 1) } \\ \\  &=& \displaystyle{ -\sum_{j=1}^n \ln(q_j) v_j }  \end{array}

where in the second to last step we used our earlier calculation:

\displaystyle{ \frac{\partial f}{\partial q_i} = -\frac{\partial}{\partial q_i} \sum_{j = 1}^n q_j \ln q_j = -\ln(q_i) - 1 }

and in the last step we used

\displaystyle{ \sum_j v_j = 0 }
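The same check can be done numerically. This is a sketch under stated assumptions: `directional_derivative` is a hypothetical helper using a central difference, and q and v are arbitrary sample values with the entries of v summing to zero. It differentiates the entropy along a direction that stays inside Q and compares the result with the pairing computed with and without the constant.

```python
import math

def shannon_entropy(q):
    """Shannon entropy f(q) = -sum_j q_j ln q_j, in nats."""
    return -sum(x * math.log(x) for x in q)

def directional_derivative(q, v, h=1e-6):
    """Central-difference estimate of d/dt f(q + t v) at t = 0."""
    plus = [a + h * b for a, b in zip(q, v)]
    minus = [a - h * b for a, b in zip(q, v)]
    return (shannon_entropy(plus) - shannon_entropy(minus)) / (2 * h)

q = [0.5, 0.3, 0.2]              # a point of Q
v = [0.1, -0.04, -0.06]          # a tangent vector: entries sum to zero

numeric = directional_derivative(q, v)
with_constant = sum(b * (-math.log(a) - 1) for a, b in zip(q, v))
without_constant = -sum(b * math.log(a) for a, b in zip(q, v))

print(numeric, with_constant, without_constant)
```

All three numbers agree up to rounding and discretization error: on tangent vectors, the constant makes no difference.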

Back to the big picture

Now let’s take stock of where we are. We can fill in the question marks in the charts from last time, and combine those charts while we’re at it.

      Classical Mechanics   Thermodynamics        Probability Theory
 q    position              extensive variables   probabilities
 p    momentum              intensive variables   surprisals
 S    action                entropy               Shannon entropy

What’s going on here? In classical mechanics, action is minimized (or at least the system finds a critical point of the action). In thermodynamics, entropy is maximized. In the maximum entropy approach to probability, Shannon entropy is maximized. This leads to a mathematical analogy that’s quite precise. For classical mechanics and thermodynamics, I explained it here:

Classical mechanics versus thermodynamics (part 1).

Classical mechanics versus thermodynamics (part 2).

These posts may give a more approachable introduction to what I’m doing now: now I’m bringing probability theory into the analogy, with a big emphasis on symplectic and contact geometry.

Let me spell out a bit of the analogy more carefully:

Classical Mechanics. In classical mechanics, we have a manifold Q whose points are positions of a particle. There’s an important function on this manifold: Hamilton’s principal function

f \colon Q \to \mathbb{R}

What’s this? It’s basically action: f(q) is the action of the least-action path from the position q_0 at some earlier time t_0 to the position q at time 0. The Hamilton–Jacobi equations say the particle’s momentum p at time 0 is given by

p = (df)_q

Thermodynamics. In thermodynamics, we have a manifold Q whose points are equilibrium states of a system. The coordinates of a point q \in Q are called extensive variables. There’s an important function on this manifold: the entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the intensive variables corresponding to the extensive variables.

Probability Theory. In probability theory, we have a manifold Q whose points are nowhere vanishing probability distributions on a finite set. The coordinates of a point q \in Q are probabilities. There’s an important function on this manifold: the Shannon entropy

f \colon Q \to \mathbb{R}

There is a cotangent vector p at the point q given by

p = (df)_q

The components of this vector are the surprisals corresponding to the probabilities.
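One way to connect this with the maximum entropy principle mentioned above: at the uniform distribution all the surprisals are equal, so the cotangent vector p is zero once we remember that adding a constant to every component changes nothing. The sketch below picks the representative whose components sum to zero; that choice, and the names `surprisals` and `centered`, are my own illustrative assumptions rather than anything in the post.

```python
import math

def surprisals(q):
    """List of surprisals -ln q_i for a probability distribution q."""
    return [-math.log(x) for x in q]

def centered(u):
    """Representative of a covector whose components sum to zero."""
    mean = sum(u) / len(u)
    return [x - mean for x in u]

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.4, 0.3, 0.2, 0.1]

print(centered(surprisals(uniform)))   # zero up to rounding: a critical point of entropy
print(centered(surprisals(skewed)))    # nonzero: entropy can still be increased
```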

In all three cases, T^\ast Q is a symplectic manifold and imposing the constraint p = (df)_q picks out a Lagrangian submanifold

\Lambda = \{ (q,p) \in T^\ast Q: \; p = (df)_q \}

There is also a contact manifold T^\ast Q \times \mathbb{R} where the extra dimension comes with an extra coordinate S that means

• action in classical mechanics,
• entropy in thermodynamics, and
• Shannon entropy in probability theory.

We can then decree that S = f(q) along with p = (df)_q, and these constraints pick out a Legendrian submanifold

\Sigma = \{(q,p,S) \in T^\ast Q \times \mathbb{R} : \; S = f(q), \; p = (df)_q \}
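To see why these two constraints fit together, here is a short worked check. It assumes the common convention that the contact form on T^\ast Q \times \mathbb{R} is \alpha = dS - \sum_i p_i \, dq_i; the post does not spell out a convention, so treat this as one possible choice. Restricted to \Sigma, the contact form vanishes, and in the probability case the constant -1 drops out for the same reason as before.

```latex
% On \Sigma we have S = f(q) and p_i = \partial f / \partial q_i, so
\alpha|_\Sigma
  = df - \sum_{i=1}^n \frac{\partial f}{\partial q_i} \, dq_i
  = 0 .

% For the Shannon entropy, using \sum_i dq_i = 0 on Q:
dS = \sum_{i=1}^n \left( -\ln(q_i) - 1 \right) dq_i
   = -\sum_{i=1}^n \ln(q_i) \, dq_i .
```

So along \Sigma, a small change in Shannon entropy is the sum of the surprisals times the small changes in the probabilities.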

There’s a lot more to do with these ideas, and I’ll continue next time.

For all my old posts on information geometry, go here:

Information geometry.

31 Responses to Information Geometry (Part 19)

  1. Toby Bartels says:

    So is it just a coincidence that entropy and action are both traditionally denoted S, or did some 19th-century genius (maybe Hamilton) know this analogy all along?

    • name starts with G says:

      Probably a coincidence. “S for action” in Hamilton’s works is first used in On a General Method in Dynamics (1834). Maybe traces to Maupertuis and Euler (ds) though he gives no explanation for the choice of the letter S. We owe “S for entropy” to Clausius (1865), who also gives no explanation for his choice of S.

      Also remember that Boltzmann gave birth to “H for entropy”. Information theory (seminal Shannon 1948) took that gladly.

      I think such holistic views are mostly attainable once the frameworks are explored tightly. Even if Clausius for instance was tickled by such potential connection, not sure if he would label a concept by pure intuition.

    • John Baez says:

      Then there’s the idiot who named momentum rather than position “p”, which becomes even more annoying now that we see position is analogous to probability.

      Thanks for explaining the origin of the two things named S, “name starts with G”. I’ve always been curious about that. It’s sad that neither author gave a reason for their choice.

      • Doctor Nuu says:

        He might well be a descendant of the guy that once made q and p mirror images.

        Yeah, naming is hard, but quosition and pomentum might be one of the most annoying ones.

  2. Elias says:

    Hello! So if I’m understanding correctly, since the position and momentum are called complementary variables, that is, are the Fourier transforms of each other.. I suppose that extensive and intensive variables in thermodynamics are also Fourier transform pairs? And probabilities and surprisals are, too, related to each other by Fourier transforms?

    But the Fourier transform of the probability density function is the characteristic function.

    So maybe the surprisal and the characteristic function are related?

    • John Baez says:

      In quantum mechanics, position and momentum are related by a Fourier transform: the momentum operator is the position operator conjugated by the Fourier transform.

      Here I’m doing classical mechanics, so position is a point in a vector space V and momentum is a point in the dual vector space V*. Or, more generally, position is a point q in a manifold Q and momentum is a cotangent vector p at this point q. I’m taking the more general stance here, but this stance breaks the symmetry between position and momentum, since only momentum lives in a vector space.

      The relation between the classical and quantum approaches to position versus momentum is well understood. I could write an essay about it, but I won’t here. One clue is that the Fourier transform of a function on a vector space V is really a function on the vector space V*. When we generalize the heck out of this, we get Pontryagin duality.

      Your idea that extensive and intensive quantities in thermodynamics are related by a Fourier transform is extremely interesting, but for the reason I just mentioned it’s a bit tricky to develop. I am actually working on it, and I’ll blog about it here when I’m done!

      • Emmy B says:

        While this may be imprecise, it seems that in a way the coordinates which are the probabilities (q_1,\dots,q_n) specify a point in the manifold which could alternatively be specified using the Fourier transform of the list (q_1,\dots,q_n) \in \mathbb{R}^n. The transition map between the two charts is the Fourier transform. I’m curious what your thoughts are on this connection especially given your interest in category theory. In this set-up, it seems that the transition maps have some functor-like qualities in that they convey the same data within two different models.

        • John Baez says:

          Since the order of the list (q_1, \dots, q_n) is completely arbitrary and unphysical, yet it affects the Fourier transform of this list (thought of as a function on \mathbb{Z}/n\mathbb{Z}) in dramatic ways, I don’t think taking the Fourier transform of this list is a good idea.

        • The order of values in the list doesn’t matter if it’s a Mach-Zehnder-like modulation. One can take the Fourier periodogram and apply it to some practical applications as in this recent post:

        • John Baez says:

          If you arbitrarily permute the numbers in the list (q_1, \dots, q_n), their Fourier transform changes in a very messy way. If these numbers are time series data they come in a particular order, but I’m not assuming they’re time series data. They’re just a probability distribution on a finite set. I’m calling this finite set \{1,\dots,n\} to make my blog articles easier to read, but if I were talking to myself I’d just call it X, since I’m never using the linear ordering.

        • OK, I thought I understood what Emmy was asking but if not I will let her clarify if she ever sees this comment thread again.

        • Emmy Blumenthal says:

          I am happy to clarify what I meant; my question was a bit imprecise and more conceptual, and I am not too fixated on the Fourier transform here. In this framework, a probability distribution on some finite set is described by some given sequence of numbers, (q_1,q_2,\dots,q_n), but—as was noted in your lovely replies—the order of these particular numbers does not matter. That is to say that the same physical/statistical/mathematical information is expressed no matter the order. However, to compare two given permutations of lists of numbers supposedly specifying the same information/distribution, we must know the appropriate permutation to relate them and verify if they do indeed specify the same information. A similar argument can be made about many other ways of communicating and restructuring the list (q_1,q_2,\dots,q_n). As long as there is some sort of consistent and well-designed (very vague I know…) translation between the two ways of expressing the distribution, any two expressions of some distribution can appropriately be expressed in a myriad of forms. Philosophically, this reminds us that the list of numbers is not the distribution nor physical information in question but is a representation that is meaningful when put in mathematical context. In my original post, I tried to use the Fourier transform as an example of how this data may be expressed.

          I point this out as a naive undergraduate because my intuition tells me there is some connection between these concepts specifically the question of uniting different physical perspectives and category theory. Specifically, the idea of translating between individual representations of the same data reminds me vaguely of category theory, and I’m curious if there is any reading that you recommend in order to better understand and explore such a connection.
          Emmy Blumenthal (they/them)

        • John Baez says:

          One important thing you can compute from a probability distribution is its Rényi entropy: a generalization of Shannon entropy that depends on a real-valued parameter. I suspect that if you know the Rényi entropy of a probability distribution (q_1, \dots, q_n) for all values of this parameter, you can recover the numbers q_i and how many times each one appears in the list—but not the order of the list.

          Maybe this is an example of the kind of thing you’re curious about. Computing the Rényi entropy of a probability distribution is a bit like taking the Fourier or Laplace transform of a function, but different.

    • In wave models, yes, since momentum is the wavenumber and that is 1/length. We are discussing the relationship between the autocorrelation of a waveform and the Fourier power-spectrum of that providing a means of maximizing or minimizing the Shannon entropy at the Azimuth forum :

  3. Matt says:

    Is what’s emerging a general framework for predictability?

    • John Baez says:

      In the short run I’m just trying to fit thermodynamics, statistical mechanics and probability theory into the same framework as classical mechanics, using the fact that extremal principles (least action, maximum entropy) are important in all these subjects.

      I wrote about this idea back in Classical mechanics versus thermodynamics (part 1):

      The big picture

      Now let’s step back and think about what’s going on.

      Lately I’ve been trying to unify a bunch of ‘extremal principles’, including:

      1) the principle of least action
      2) the principle of least energy
      3) the principle of maximum entropy
      4) the principle of maximum simplicity, or Occam’s razor

      In my post on quantropy I explained how the first three principles fit into a single framework if we treat Planck’s constant as an imaginary temperature. The guiding principle of this framework is

      maximize entropy
      subject to the constraints imposed by what you believe

      And that’s nice, because E. T. Jaynes has made a powerful case for this principle.

      However, when the temperature is imaginary, entropy is so different that it may deserve a new name: say, ‘quantropy’. In particular, it’s complex-valued, so instead of maximizing it we have to look for stationary points: places where its first derivative is zero. But this isn’t so bad. Indeed, a lot of minimum and maximum principles are really ‘stationary principles’ if you examine them carefully.

      What about the fourth principle: Occam’s razor? We can formalize this using algorithmic probability theory. Occam’s razor then becomes yet another special case of

      maximize entropy
      subject to the constraints imposed by what you believe

      once we realize that algorithmic entropy is a special case of ordinary entropy.

      • Possibly dumb question: in the path integral method the amplitude for each path has the same magnitude. Is there any good justification for that? Seems similar to the idea in thermodynamics that all accessible micro-states are equiprobable over the long run.

        Dropping this requirement seems equivalent to letting the action be complex. I can’t think of any reason to discount it other than that it sounds weird.

      • John Baez says:

        That’s not a dumb question. Physicists act like the amplitude for each path has the same magnitude, and they rarely think about complex-valued Lagrangians.

        On the other hand, there are great subtleties involved in defining the correct measure when you integrate over all paths—since it’s rarely a finite or even countably infinite sum over all paths. Path integrals are often integrals over infinite-dimensional spaces, and we just barely understand what a measure means in this case: it’s one of the main problems in quantum field theory. Even in the finite-dimensional case, getting the ‘right’ measure is nontrivial.

        But changing the measure in your path integral is equivalent to changing the magnitude of the paths’ amplitudes! And this in turn is equivalent to making the Lagrangian complex.

  4. Frederic Barbaresco says:

    The symplectic model of thermodynamics developed by Jean-Marie Souriau is clearly explained in Charles-Michel Marle’s paper:
    On Gibbs states of mechanical systems with symmetries

    • John Baez says:

      Thanks for all these links, Frederic! I have been wanting to understand Souriau’s ideas on this topic, but finding it hard to find something to read—in part because it’s a lot of work for me to read French.

  5. Frederic Barbaresco says:

    In Souriau model of “Lie Groups Thermodynamics”, Entropy could be characterized as an invariant Casimir Function in coadjoint representation:

  6. Frederic Barbaresco says:

    Souriau has also extended the Koszul-Fisher metric with a link to the KKS 2-form in the case of non-null cohomology:

  7. Frederic Barbaresco says:

    I invite you to read slides of Charles-Michel Marle on Souriau Lie Groups Thermodynamics model given at Paris University for 60th birthday of Professor Marco (Sorbonne University). Slides are in French and in English:

    Marle_2021_JPMarco.pdf

  8. Frederic Barbaresco says:

    Souriau’s model is described in chapter IV of his book “Structure of Dynamical Systems” and in 3 papers that have been translated into English recently:

  9. Frederic Barbaresco says:

    I have presented fundamental equation of Souriau Lie Groups Thermodynamics at the Summer week SPIGL’20, I have organized at Ecole de Physique des Houches in July 2020:

    Proceedings have been published by SPRINGER:

    This topic was also developed during the GSI’21 conference at Sorbonne University:

    We will develop this model at the MaxEnt’22 conference that we will organize at Institut Henri Poincaré in July 2022.

  10. Andrea says:

    Just to make sure I’m following. To make the analogy in the table work, we need to do mechanics in the extended configuration space, right? As in, one of the qs is going to be the time t and one of the ps is going to be -H, and the symplectic form is

    \omega = \sum_i dp_i \wedge dq^i - dH \wedge dt.
