Information Geometry (Part 21)

Last time I ended with a formula for the ‘Gibbs distribution’: the probability distribution that maximizes entropy subject to constraints on the expected values of some observables.

This formula is well-known, but I’d like to derive it here. My argument won’t be up to the highest standards of rigor: I’ll do a bunch of computations, and it would take more work to state conditions under which these computations are justified. But even a nonrigorous approach is worthwhile, since the computations will give us more than the mere formula for the Gibbs distribution.

I’ll start by reminding you of what I claimed last time. I’ll state it in a way that removes all unnecessary distractions, so go back to Part 20 if you want more explanation.

The Gibbs distribution

Take a measure space \Omega with measure \mu. Suppose there is a probability distribution \pi on \Omega that maximizes the entropy

\displaystyle{ -\int_\Omega \pi(x) \ln \pi(x) \, d\mu(x) }

subject to the requirement that some integrable functions A^1, \dots, A^n on \Omega have expected values equal to some chosen list of numbers q^1, \dots, q^n.

(Unlike last time, now I’m writing A^i and q^i with superscripts rather than subscripts, because I’ll be using the Einstein summation convention: I’ll sum over any repeated index that appears once as a superscript and once as a subscript.)

Furthermore, suppose \pi depends smoothly on q \in \mathbb{R}^n. I’ll call it \pi_q to indicate its dependence on q. Then, I claim \pi_q is the so-called Gibbs distribution

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int_\Omega e^{-p_i A^i(x)} \, d\mu(x)}   }

where

\displaystyle{ p_i = \frac{\partial f(q)}{\partial q^i} }

and

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

is the entropy of \pi_q.

Let’s show this is true!

Finding the Gibbs distribution

So, we are trying to find a probability distribution \pi that maximizes entropy subject to these constraints:

\displaystyle{ \int_\Omega \pi(x) A^i(x) \, d\mu(x) = q^i }

We can solve this problem using Lagrange multipliers. We need one Lagrange multiplier, say \beta_i, for each of the above constraints. But it’s easiest if we start by letting \pi range over all of L^1(\Omega), that is, the space of all integrable functions on \Omega. Then, because we want \pi to be a probability distribution, we need to impose one extra constraint

\displaystyle{ \int_\Omega \pi(x) \, d\mu(x) = 1 }

To do this we need an extra Lagrange multiplier, say \gamma.

So, that’s what we’ll do! We’ll look for critical points of this function on L^1(\Omega):

\displaystyle{ - \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi\,  d\mu  }

Here I’m using some tricks to keep things short. First, I’m dropping the dummy variable x which appeared in all of the integrals we had: I’m leaving it implicit. Second, all my integrals are over \Omega so I won’t say that. And third, I’m using the Einstein summation convention, so there’s a sum over i implicit here.

Okay, now let’s do the variational derivative required to find a critical point of this function. When I was a math major taking physics classes, the way physicists did variational derivatives seemed like black magic to me. Then I spent months reading how mathematicians rigorously justified these techniques. I don’t feel like a massive digression into this right now, so I’ll just do the calculations—and if they seem like black magic, I’m sorry!

We need to find \pi obeying

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(- \int \pi \ln \pi \, d\mu - \beta_i  \int \pi A^i \, d\mu - \gamma \int \pi \, d\mu \right) = 0 }

or in other words

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left(\int \pi \ln \pi \, d\mu + \beta_i  \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) = 0 }

First we need to simplify this expression. The only part that takes any work, if you know how to do variational derivatives, is the first term. Since the derivative of z \ln z is 1 + \ln z, we have

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }
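
Here is a sketch of where this comes from, in the same nonrigorous spirit as the rest of the calculation. The test function \eta and the small parameter \epsilon are notation introduced only for this aside, and I am assuming we can differentiate under the integral sign. Perturbing \pi in the direction \eta and differentiating at \epsilon = 0 gives

\displaystyle{ \left. \frac{d}{d\epsilon} \int (\pi + \epsilon \eta) \ln (\pi + \epsilon \eta) \, d\mu \, \right|_{\epsilon = 0} = \int \left(1 + \ln \pi \right) \eta \, d\mu }

The variational derivative is the function multiplying \eta inside the integral on the right, namely 1 + \ln \pi(x).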

The second and third terms are easy, so we get

\displaystyle{ \frac{\delta}{\delta \pi(x)} \left( \int \pi \ln \pi \, d\mu + \beta_i \int \pi A^i \, d\mu + \gamma \int \pi \, d\mu \right) =}

        1 + \ln \pi(x) + \beta_i A^i(x) + \gamma

Thus, we need to solve this equation:

\displaystyle{ 1 + \ln \pi(x) + \beta_i A^i(x) + \gamma  = 0}

That’s easy to do:

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

Good! It’s starting to look like the Gibbs distribution!

We now need to choose the Lagrange multipliers \beta_i and \gamma to make the constraints hold. To satisfy this constraint

\displaystyle{ \int \pi \, d\mu = 1 }

we must choose \gamma so that

\displaystyle{ \int  e^{-1 - \gamma - \beta_i A^i } \, d\mu = 1 }

or in other words

\displaystyle{ e^{1 + \gamma} = \int e^{- \beta_i A^i} \, d\mu }

Plugging this into our earlier formula

\pi(x) = e^{-1 - \gamma - \beta_i A^i(x)}

we get this:

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Great! Even more like the Gibbs distribution!
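
To make this concrete, here is a minimal numerical sketch, assuming \Omega is a finite set so that integrals become weighted sums. The function names gibbs and solve_beta, the example observables and the target values are all hypothetical choices made for illustration. The normalization is handled exactly as above by dividing by the integral, while the \beta_i are found by a plain Newton iteration, which is only one of several ways one might solve for them.

```python
import numpy as np

def gibbs(beta, A, mu):
    """Return the Gibbs density (with respect to mu) and the normalizing integral."""
    w = np.exp(-beta @ A)          # e^{-beta_i A^i(x)} at each point x
    Z = np.sum(w * mu)             # integral of e^{-beta_i A^i} d mu
    return w / Z, Z

def solve_beta(q_target, A, mu, steps=50, tol=1e-12):
    """Newton iteration for beta making the expected values of A^i equal q^i."""
    beta = np.zeros(A.shape[0])
    for _ in range(steps):
        pi, _ = gibbs(beta, A, mu)
        p = pi * mu                            # probabilities of the points
        mean = A @ p                           # current expected values
        residual = mean - q_target
        if np.max(np.abs(residual)) < tol:
            break
        centered = A - mean[:, None]
        cov = (centered * p) @ centered.T      # covariance matrix of the A^i
        # d(mean)/d(beta) = -cov, so the Newton update is beta + cov^{-1} residual
        beta = beta + np.linalg.solve(cov, residual)
    return beta

# Hypothetical example: 4 points, counting measure, two observables.
A = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
mu = np.ones(4)
q_target = np.array([1.0, 0.6])       # assumed chosen expected values q^i

beta = solve_beta(q_target, A, mu)
pi, Z = gibbs(beta, A, mu)
total = np.sum(pi * mu)               # normalization constraint
q = A @ (pi * mu)                     # expected-value constraints
```

In this sketch total should come out as 1 and q should match q_target up to rounding, provided the target values can be attained by some strictly positive distribution.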

By the way, you must have noticed the “1” that showed up here:

\displaystyle{ \frac{\delta}{\delta \pi(x)} \int \pi \ln \pi \, d\mu  = 1 + \ln \pi(x) }

It buzzed around like an annoying fly in the otherwise beautiful calculation, but eventually went away. This is the same irksome “1” that showed up in Part 19. Someday I’d like to say a bit more about it.

Now, where were we? We were trying to show that

\displaystyle{  \pi_q(x) =\frac{e^{-p_i A^i(x)}}{\int e^{-p_i A^i} \, d\mu}   }

maximizes entropy subject to our constraints. So far we’ve shown

\displaystyle{ \pi(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

is a critical point. It’s clear that

\pi(x) \ge 0

so \pi really is a probability distribution. We should show it actually maximizes entropy subject to our constraints, but I will skip that. Given that, \pi will be our claimed Gibbs distribution \pi_q if we can show
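
For the skipped maximization step, here is a rough numerical sanity check, not a proof. It assumes a finite set with counting measure, and the observables and the value of \beta are hypothetical. The idea is to perturb the Gibbs distribution in random directions that leave the normalization and all expected values unchanged, and to see whether the entropy goes down each time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite example: 6 points, counting measure, two observables.
A = np.array([[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
              [1.0, 0.0, 2.0, 0.0, 1.0, 3.0]])
beta = np.array([0.4, -0.2])

w = np.exp(-beta @ A)
pi = w / w.sum()                       # Gibbs distribution for this beta

def entropy(p):
    return -np.sum(p * np.log(p))

# Rows: the normalization constraint and the two expected-value constraints.
C = np.vstack([np.ones(6), A])
# Right-singular vectors beyond rank(C) span the directions that keep
# every constraint fixed.
_, s, Vt = np.linalg.svd(C)
rank = int(np.sum(s > 1e-12))
null_basis = Vt[rank:]

drops = []
for _ in range(100):
    eta = rng.normal(size=null_basis.shape[0]) @ null_basis
    eta *= 0.5 * pi.min() / np.abs(eta).max()   # keep pi + eta positive
    drops.append(entropy(pi) - entropy(pi + eta))

min_drop = min(drops)   # positive if every perturbation lowered the entropy
```

In this finite setting the drop in entropy equals the relative entropy of the perturbed distribution with respect to \pi, so it should be positive for every nonzero perturbation.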

p_i = \beta_i

This is interesting! It’s saying our Lagrange multipliers \beta_i actually equal the so-called conjugate variables p_i given by

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

where f(q) is the entropy of \pi_q:

\displaystyle{ f(q) = - \int_\Omega \pi_q(x) \ln \pi_q(x) \, d\mu(x) }

There are two ways to show this: the easy way and the hard way. The easy way is to reflect on the meaning of Lagrange multipliers, and I’ll sketch that way first. The hard way is to use brute force: just compute p_i and show it equals \beta_i. This is a good test of our computational muscle—but more importantly, it will help us discover some interesting facts about the Gibbs distribution.

The easy way

Consider a simple Lagrange multiplier problem where you’re trying to find a critical point of a smooth function

f \colon \mathbb{R}^2 \to \mathbb{R}

subject to the constraint

g = c

for some smooth function

g \colon \mathbb{R}^2 \to \mathbb{R}

and constant c. (The function f here has nothing to do with the f in the previous sections.) To answer this we introduce a Lagrange multiplier \lambda and seek points where

\nabla ( f - \lambda g) = 0

This works because the above equation says

\nabla f = \lambda \nabla g

Geometrically this means we’re at a point where the gradient of f points at right angles to the level surface of g.

Thus, to first order we can’t change f by moving along the level surface of g.

But also, if we start at a point where

\nabla f = \lambda \nabla g

and we begin moving in any direction, the function f will change at a rate equal to \lambda times the rate of change of g. That’s just what the equation says! And this fact gives a conceptual meaning to the Lagrange multiplier \lambda.
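
Here is a small sketch of this meaning of \lambda in a toy problem. The choices f(x, y) = xy and g(x, y) = x + y are hypothetical, picked only because the critical point can be written down by hand. The code compares \lambda with the rate at which the critical value of f changes as we vary the constant c.

```python
def constrained_critical_point(c):
    """Critical point of f(x, y) = x*y on the line g(x, y) = x + y = c.

    Solving grad f = lambda * grad g, i.e. (y, x) = lambda * (1, 1),
    together with x + y = c gives x = y = lambda = c/2.
    """
    x = y = c / 2.0
    lam = c / 2.0
    return x, y, lam

def f(x, y):
    return x * y

c, h = 3.0, 1e-6
x, y, lam = constrained_critical_point(c)
xp, yp, _ = constrained_critical_point(c + h)
xm, ym, _ = constrained_critical_point(c - h)

# Central difference for the rate of change of the critical value of f
# as the constraint value c is varied.
rate = (f(xp, yp) - f(xm, ym)) / (2 * h)
```

Here the critical value is c^2/4, so its derivative with respect to c is c/2, which agrees with \lambda.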

Our situation is more complicated, since our functions are defined on the infinite-dimensional space L^1(\Omega), and we have an n-tuple of constraints with an n-tuple of Lagrange multipliers. But the same principle holds.

So, when we are at a solution \pi_q of our constrained entropy-maximization problem, and we start moving the point \pi_q by changing the value of the ith constraint, namely q^i, the rate at which the entropy changes will be \beta_i times the rate of change of q^i. So, we have

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

But this is just what we needed to show!

The hard way

Here’s another way to show

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

We start by solving our constrained entropy-maximization problem using Lagrange multipliers. As already shown, we get

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{\int e^{- \beta_i A^i} \, d\mu} }

Then we’ll compute the entropy

f(q) = - \int \pi_q \ln \pi_q \, d\mu

Then we’ll differentiate this with respect to q^i and show we get \beta_i.

Let’s try it! The calculation is a bit heavy, so let’s write Z(q) for the so-called partition function

\displaystyle{ Z(q) = \int e^{- \beta_i A^i} \, d\mu }

so that

\displaystyle{ \pi_q(x) = \frac{e^{- \beta_i A^i(x)}}{Z(q)} }

and the entropy is

\begin{array}{ccl}  f(q) &=& - \displaystyle{ \int  \pi_q  \ln \left( \frac{e^{- \beta_k A^k}}{Z(q)} \right)  \, d\mu }  \\ \\  &=& \displaystyle{ \int \pi_q \left(\beta_k A^k + \ln Z(q) \right)  \, d\mu } \\ \\  \end{array}

This is the sum of two terms. The first term

\displaystyle{ \int \pi_q \beta_k A^k   \, d\mu =  \beta_k \int \pi_q A^k   \, d\mu}

is \beta_k times the expected value of A^k with respect to the probability distribution \pi_q, all summed over k. But the expected value of A^k is q^k, so we get

\displaystyle{ \int  \pi_q \beta_k A^k \, d\mu =  \beta_k q^k }

The second term is easier:

\displaystyle{ \int_\Omega  \pi_q \ln Z(q) \, d\mu = \ln Z(q) }

since \pi_q(x) integrates to 1 and the partition function Z(q) doesn’t depend on x \in \Omega.

Putting together these two terms we get an interesting formula for the entropy:

f(q) = \beta_k q^k + \ln Z(q)

This formula is one reason this brute-force approach is actually worthwhile! I’ll say more about it later.
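
As a quick check of this formula, here is a sketch on a finite set with counting measure. The observables and the value of \beta are hypothetical. It computes the entropy directly from its definition and compares it with \beta_k q^k + \ln Z.

```python
import numpy as np

# Hypothetical example: 4 points, counting measure, two observables.
A = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
beta = np.array([0.5, -0.3])

w = np.exp(-beta @ A)
Z = w.sum()                                # partition function
pi = w / Z                                 # Gibbs distribution
q = A @ pi                                 # expected values q^k

entropy = -np.sum(pi * np.log(pi))         # entropy from its definition
formula = beta @ q + np.log(Z)             # beta_k q^k + ln Z
```

The two numbers entropy and formula should agree up to rounding error.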

But for now, let’s use this formula to show what we’re trying to show, namely

\displaystyle{\frac{\partial f}{\partial q^i} = \beta_i }

For starters,

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=& \displaystyle{\frac{\partial}{\partial q^i} \left(\beta_k q^k + \ln Z(q) \right) } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k   \frac{\partial q^k}{\partial q^i} + \frac{\partial}{\partial q^i} \ln Z(q)   }  \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_k  \delta^k_i + \frac{\partial}{\partial q^i} \ln Z(q)   } \\ \\  &=& \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  }  \end{array}

where we played a little Kronecker delta game with the second term.

Now we just need to compute the third term:

\begin{array}{ccl}  \displaystyle{ \frac{\partial}{\partial q^i} \ln Z(q) } &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i} Z(q) } \\  \\  &=&  \displaystyle{ \frac{1}{Z(q)} \frac{\partial}{\partial q^i}  \int e^{- \beta_j A^j}  \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i} \left(e^{- \beta_j A^j}\right) \, d\mu  }  \\ \\  &=&  \displaystyle{ \frac{1}{Z(q)} \int \frac{\partial}{\partial q^i}\left( - \beta_k A^k \right)  e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ -\frac{1}{Z(q)} \int \frac{\partial \beta_k}{\partial q^i}  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} \frac{1}{Z(q)} \int  A^k   e^{- \beta_j A^j} \, d\mu  }  \\ \\  &=&  \displaystyle{ - \frac{\partial \beta_k}{\partial q^i} q^k }  \end{array}
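
The heart of this calculation is that differentiating \ln Z with respect to \beta_k gives minus the expected value of A^k. Here is a finite-difference sketch of that fact, assuming a finite set with counting measure. The observables, the value of \beta and the step size h are hypothetical.

```python
import numpy as np

# Hypothetical example: 4 points, counting measure, two observables.
A = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])

def log_Z(beta):
    return np.log(np.sum(np.exp(-beta @ A)))

def expected_A(beta):
    w = np.exp(-beta @ A)
    return A @ (w / w.sum())

beta = np.array([0.5, -0.3])
h = 1e-5

# Central differences for the partial derivatives of ln Z with respect to beta_k.
grad = np.zeros(len(beta))
for k in range(len(beta)):
    step = np.zeros(len(beta))
    step[k] = h
    grad[k] = (log_Z(beta + step) - log_Z(beta - step)) / (2 * h)

q = expected_A(beta)      # grad should be close to -q
```

Combining this with the chain rule gives the last line of the calculation above.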

Ah, you don’t know how good it feels, after years of category theory, to be doing calculations like this again!

Now we can finish the job we started:

\begin{array}{ccl}  \displaystyle{\frac{\partial f}{\partial q^i}} &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i + \frac{\partial}{\partial q^i} \ln Z(q)  } \\ \\  &=&  \displaystyle{ \frac{\partial \beta_k}{\partial q^i} q^k + \beta_i - \frac{\partial \beta_k}{\partial q^i} q^k  } \\ \\  &=& \beta_i  \end{array}
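
Here is a numerical sketch of the same conclusion, assuming a finite set with counting measure and hypothetical observables. Rather than solving for \beta as a function of q, it treats both q and f as functions of \beta, estimates their derivatives by central differences, and uses the chain rule to recover the partial derivatives of f with respect to the q^i. This is just one convenient way to organize the check.

```python
import numpy as np

# Hypothetical example: 4 points, counting measure, two observables.
A = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])

def q_and_f(beta):
    """Expected values q^i and entropy f of the Gibbs distribution for beta."""
    w = np.exp(-beta @ A)
    pi = w / w.sum()
    return A @ pi, -np.sum(pi * np.log(pi))

beta = np.array([0.5, -0.3])
h = 1e-5
n = len(beta)
J = np.zeros((n, n))     # J[i, j] = d q^i / d beta_j
g = np.zeros(n)          # g[j]    = d f   / d beta_j
for j in range(n):
    step = np.zeros(n)
    step[j] = h
    q_plus, f_plus = q_and_f(beta + step)
    q_minus, f_minus = q_and_f(beta - step)
    J[:, j] = (q_plus - q_minus) / (2 * h)
    g[j] = (f_plus - f_minus) / (2 * h)

# Chain rule: g[j] = p_i J[i, j] where p_i = d f / d q^i, so p solves J^T p = g.
p = np.linalg.solve(J.T, g)
```

If the calculation above is right, p should agree with beta up to the finite-difference error.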

Conclusions

We’ve learned the formula for the probability distribution that maximizes entropy subject to some constraints on the expected values of observables. But more importantly, we’ve seen that the anonymous Lagrange multipliers \beta_i that show up in this problem are actually the partial derivatives of entropy! They equal

\displaystyle{ p_i = \frac{\partial f}{\partial q^i} }

Thus, they are rich in meaning. From what we’ve seen earlier, they are ‘surprisals’. They are analogous to momentum in classical mechanics and have the meaning of intensive variables in thermodynamics:

        Classical Mechanics    Thermodynamics         Probability Theory
 q      position               extensive variables    probabilities
 p      momentum               intensive variables    surprisals
 S      action                 entropy                Shannon entropy

Furthermore, by showing \beta_i = p_i the hard way we discovered an interesting fact. There’s a relation between the entropy and the logarithm of the partition function:

f(q) = p_i q^i + \ln Z(q)

(We proved this formula with \beta_i replacing p_i, but now we know those are equal.)

This formula suggests that the logarithm of the partition function is important—and it is! It’s closely related to the concept of free energy—even though ‘energy’, free or otherwise, doesn’t show up at the level of generality we’re working at now.

This formula should also remind you of the tautological 1-form on the cotangent bundle T^\ast Q, namely

\theta = p_i dq^i

It should remind you even more of the contact 1-form on the contact manifold T^\ast Q \times \mathbb{R}, namely

\alpha = -dS + p_i dq^i

Here S is a coordinate on the contact manifold that’s a kind of abstract stand-in for our entropy function f.

So, it’s clear there’s a lot more to say: we’re seeing hints of things here and there, but not yet the full picture.

For all my old posts on information geometry, go here:

Information geometry.

15 Responses to Information Geometry (Part 21)

  1. ab says:

    There’s a generalization of the KL-divergence to non-probability measures that goes

    D(A\mid B) = \int A \log\left(\frac{A}{B}\right) - A + B

    This solves a lot of annoying constant factors when varying. Since the KL-divergence with B=1 gives the entropy, presumably the generalization of entropy is:

    S(A) = \int A\log A - A + 1

    which would solve your extra one. But I can’t say I’ve really understood how to think of this generalization.

    • John Baez says:

      I’ve used this generalization of the Kullback–Leibler divergence myself—see equation (21) in this paper:

      • John Baez and Blake Pollard, Relative information in biological systems.

      (I call the Kullback–Leibler divergence ‘relative information’.)

      And yes, it helps! But this generalization is still somewhat mysterious to me. It’s sometimes used in mathematical chemistry where instead of normalized probability distributions we have ‘populations’, e.g. numbers of molecules—I believe it was introduced there by Horn and Jackson. Have you seen it somewhere else?

      • ab says:

        I first saw it referred to in some work by Csiszar. A little googling gives this as maybe the origin, but I haven’t read it yet.

      • ab says:

        It occurs to me that this form of the KL-divergence is also what arises when you calculate the Bregman divergence of the usual entropy formula, but I can’t say I really understand Bregman divergences either :).

      • John Baez says:

        Okay, thanks—that’s interesting. Horn and Jackson introduced it in chemistry in 1972 and called it the “pseudo-Helmholtz function”:

        • F. Horn and R. Jackson, General mass action kinetics,
        Arch. Ration. Mech. An. 47 (1972), 81–116.

        It plays an important role as a Lyapunov function in chemical reaction networks that are ‘complex balanced’.

  2. Steve Huntsman says:

    If you also have a timescale measuring something like the growth rate of orbits or a mixing time to go along with the probability setup (say, in the case that the distribution arises from a Markov process), then you do have energy at this level of generality: see section 3 of The caveat here is that the precise nature of the timescale to generically reproduce physics isn’t completely nailed down, but it’s very highly constrained. Moreover, a “minimum channel capacity” Ansatz in the case of a Markov process (unpublished, but I have a writeup) suggests a general principle for determining this timescale.

    • John Baez says:

      I feel like maybe I didn’t make myself clear.

      There’s no concept of “time” in what I’m talking about, and no Markov process. I’ve just got a probability distribution maximizing entropy subject to constraints on expected values of some finite list of random variables (aka observables).

      But this subsumes the case where we call one of those observables “energy”. In this case the conjugate intensive variable is 1/kT, where T is the temperature and k is Boltzmann’s constant, and then -k T \ln Z is free energy.

      So, I was trying to say we can easily specialize the framework described here to relate \ln Z to free energy, but that \ln Z deserves some other name when we’re working at the level of generality described here.

      I believe in quantum field theory people sometimes call \ln Z “free energy” even in contexts where it deserves some other name.

  3. Keith Harbaugh says:

    Re your comment that

    When I was a math major taking physics classes, the way physicists did variational derivatives seemed like black magic to me. Then I spent months reading how mathematicians rigorously justified these techniques.

    Is there a half-way decent textbook that explains that?


    • John Baez says:

      There must be a bunch, but I forget where I learned this stuff—probably here and there. It can’t hurt too much to start here:

      • Wikipedia, Functional derivative.

      They recommend a bunch of textbooks, starting with Courant and Hilbert and working on up to Gelfand and Fomin, which is a Dover book, so fairly cheap, I imagine. All four of these folks are famous.

      • Keith Harbaugh says:

        Very interesting.

        Courant and Hilbert? How classical can you get :-) Looks like the DWM haven’t been made obsolete quite yet.

        I’m kind of amazed this isn’t covered by a more modern, mainstream textbook from some place like Springer. Maybe it is. Hopefully someone can make a suggestion.

        • Keith Harbaugh says:

          Oh wait, I see it is … The text
          Giaquinta, Mariano; Hildebrandt, Stefan (1996), Calculus of Variations 1. The Lagrangian Formalism.

  4. Frederic Barbaresco says:

    Geometric Structures of Information Geometry are addressed in GSI conferences:
