Information Geometry (Part 1)

I’d like to provide a bit of background to this interesting paper:

• Gavin E. Crooks, Measuring thermodynamic length.

which was pointed out by John F in our discussion of entropy and uncertainty.

The idea here should work for either classical or quantum statistical mechanics. The paper describes the classical version, so just for a change of pace let me describe the quantum version.

First a lightning review of quantum statistical mechanics. Suppose you have a quantum system with some Hilbert space. When you know as much as possible about your system, then you describe it by a unit vector in this Hilbert space, and you say your system is in a pure state. Sometimes people just call a pure state a ‘state’. But that can be confusing, because in statistical mechanics you also need more general ‘mixed states’ where you don’t know as much as possible. A mixed state is described by a density matrix, meaning a positive operator \rho with trace equal to 1:

\mathrm{tr}(\rho) = 1

The idea is that any observable is described by a self-adjoint operator A, and the expected value of this observable in the mixed state \rho is

\langle A \rangle = \mathrm{tr}(\rho A)

The entropy of a mixed state is defined by

S(\rho) = -\mathrm{tr}(\rho \; \mathrm{ln} \, \rho)

where we take the logarithm of the density matrix just by taking the log of each of its eigenvalues, while keeping the same eigenvectors. This formula for entropy should remind you of the one that Gibbs and Shannon used — the one I explained a while back.
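If you like seeing formulas turned into arithmetic, here is a small numerical sketch (just an illustration, assuming NumPy and a 4-dimensional Hilbert space, with placeholder matrices) that builds an arbitrary density matrix and computes \langle A \rangle = \mathrm{tr}(\rho A) and S(\rho) from the eigenvalues of \rho:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build some density matrix: a positive matrix normalized to have trace 1.
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
rho = M @ M.conj().T            # positive (semi)definite
rho /= np.trace(rho).real       # now tr(rho) = 1

# Some observable: any self-adjoint matrix will do as a stand-in for A.
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
A = (A + A.conj().T) / 2

# Expected value <A> = tr(rho A); the tiny imaginary part is rounding error.
print(np.trace(rho @ A).real)

# Entropy S(rho) = -tr(rho ln rho), computed from the eigenvalues of rho.
p = np.linalg.eigvalsh(rho)
p = p[p > 1e-12]                # convention: 0 ln 0 = 0
print(-np.sum(p * np.log(p)))
```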

Back then I told you about the ‘Gibbs ensemble’: the mixed state that maximizes entropy subject to the constraint that some observable have a given value. We can do the same thing in quantum mechanics, and we can even do it for a bunch of observables at once. Suppose we have some observables X_1, \dots, X_n and we want to find the mixed state \rho that maximizes entropy subject to these constraints:

\langle X_i \rangle = x_i

for some numbers x_i. Then a little exercise in Lagrange multipliers shows that the answer is the Gibbs state:

\rho = \frac{1}{Z} \mathrm{exp}(-(\lambda_1 X_1 + \cdots + \lambda_n X_n))

Huh?

This answer needs some explanation. First of all, the numbers \lambda_1, \dots \lambda_n are called Lagrange multipliers. You have to choose them right to get

\langle X_i \rangle = x_i

So, in favorable cases, they will be functions of the numbers x_i. And when you’re really lucky, you can solve for the numbers x_i in terms of the numbers \lambda_i. We call \lambda_i the conjugate variable of the observable X_i. For example, the conjugate variable of energy is inverse temperature!

Second of all, we take the exponential of a self-adjoint operator just as we took the logarithm of a density matrix: just take the exponential of each eigenvalue.

(At least this works when our self-adjoint operator has only eigenvalues in its spectrum, not any continuous spectrum. Otherwise we need to get serious and use the functional calculus. Luckily, if your system’s Hilbert space is finite-dimensional, you can ignore this parenthetical remark!)

But third: what’s that number Z? It begins life as a humble normalizing factor. Its job is to make sure \rho has trace equal to 1:

Z = \mathrm{tr}(\mathrm{exp}(-(\lambda_1 X_1 + \cdots + \lambda_n X_n)))

However, once you get going, it becomes incredibly important! It’s called the partition function of your system.
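When the Hilbert space is finite-dimensional, you can build the Gibbs state and its partition function numerically with nothing more than a matrix exponential. Here is a minimal sketch, assuming NumPy and SciPy, with a couple of arbitrary placeholder observables standing in for the X_i:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)

def random_observable(d):
    """An arbitrary self-adjoint d x d matrix, standing in for an X_i."""
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

d = 4
X = [random_observable(d) for _ in range(2)]    # the observables X_1, X_2
lam = np.array([0.7, 0.3])                      # the Lagrange multipliers

H = sum(l * Xi for l, Xi in zip(lam, X))        # lambda_1 X_1 + lambda_2 X_2
unnormalized = expm(-H)
Z = np.trace(unnormalized).real                 # the partition function
rho = unnormalized / Z                          # the Gibbs state

print(np.trace(rho).real)                       # 1.0, as a density matrix requires
print([np.trace(rho @ Xi).real for Xi in X])    # the mean values x_i these lambdas produce
```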

As an example of what it’s good for, it turns out you can compute the numbers x_i as follows:

x_i = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

In other words, you can compute the expected values of the observables X_i by differentiating the log of the partition function:

\langle X_i \rangle = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

Or in still other words,

\langle X_i \rangle = - \frac{1}{Z} \; \frac{\partial Z}{\partial \lambda_i}

To believe this you just have to take the equations I’ve given you so far and mess around — there’s really no substitute for doing it yourself. I’ve done it fifty times, and every time I feel smarter.
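If you'd like the computer to do some of the messing around for you, here is one way to check the formula numerically: compute \mathrm{tr}(\rho X_i) directly, then compare it with a centered finite-difference derivative of \mathrm{ln} Z. A minimal sketch, again with placeholder observables:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

def random_observable(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

def log_Z(lam, X):
    """ln Z = ln tr exp(-(lambda_1 X_1 + ... + lambda_n X_n))."""
    H = sum(l * Xi for l, Xi in zip(lam, X))
    return np.log(np.trace(expm(-H)).real)

d, n = 4, 3
X = [random_observable(d) for _ in range(n)]
lam = rng.uniform(0.2, 1.0, size=n)

# <X_i> straight from the Gibbs state ...
H = sum(l * Xi for l, Xi in zip(lam, X))
rho = expm(-H)
rho /= np.trace(rho).real
direct = np.array([np.trace(rho @ Xi).real for Xi in X])

# ... and from -d(ln Z)/d(lambda_i), by centered finite differences.
eps = 1e-5
from_lnZ = np.empty(n)
for i in range(n):
    lp, lm = lam.copy(), lam.copy()
    lp[i] += eps
    lm[i] -= eps
    from_lnZ[i] = -(log_Z(lp, X) - log_Z(lm, X)) / (2 * eps)

print(np.max(np.abs(direct - from_lnZ)))        # should be very close to zero
```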

But we can go further: after the ‘expected value’ or ‘mean’ of an observable comes its variance, which is the square of its standard deviation:

(\Delta A)^2 = \langle A^2 \rangle - \langle A \rangle^2

This measures the size of fluctuations around the mean. And in the Gibbs state, we can compute the variance of the observable X_i as the second derivative of the log of the partition function:

\langle X_i^2 \rangle - \langle X_i \rangle^2 =  \frac{\partial^2}{\partial \lambda_i^2} \mathrm{ln} Z

Again: calculate and see.
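Here is that calculation done numerically for a single observable, where the exponent is just -\lambda X and nothing noncommutative can interfere; the observable itself is a placeholder:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)

d = 5
A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
X = (A + A.conj().T) / 2          # a single observable
lam = 0.8                         # its conjugate variable

def log_Z(l):
    return np.log(np.trace(expm(-l * X)).real)

# The variance straight from the Gibbs state rho = exp(-lam X) / Z ...
rho = expm(-lam * X)
rho /= np.trace(rho).real
mean = np.trace(rho @ X).real
variance = np.trace(rho @ X @ X).real - mean**2

# ... and as the second derivative of ln Z, by finite differences.
eps = 1e-4
second = (log_Z(lam + eps) - 2 * log_Z(lam) + log_Z(lam - eps)) / eps**2

print(variance, second)           # these agree to several decimal places
```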

But when we’ve got lots of observables, there’s something better than the variance of each one. There’s the covariance matrix of the whole lot of them! Each observable X_i fluctuates around its mean value x_i… but these fluctuations are not independent! They’re correlated, and the covariance matrix says how.

All this is very visual, at least for me. If you imagine the fluctuations as forming a blurry patch near the point (x_1, \dots, x_n), this patch will be ellipsoidal in shape, at least when all our random fluctuations are Gaussian. And then the shape of this ellipsoid is precisely captured by the covariance matrix! In particular, the eigenvectors of the covariance matrix will point along the principal axes of this ellipsoid, and the eigenvalues will say how stretched out the ellipsoid is in each direction!

To understand the covariance matrix, it may help to start by rewriting the variance of a single observable A as

(\Delta A)^2 = \langle (A - \langle A \rangle)^2 \rangle

That’s a lot of angle brackets, but the meaning should be clear. First we look at the difference between our observable and its mean value, namely

A - \langle A \rangle

Then we square this, to get something that’s big and positive whenever our observable is far from its mean. Then we take the mean value of that, to get an idea of how far our observable is from the mean on average.

We can use the same trick to define the covariance of a bunch of observables X_i. We get an n \times n matrix called the covariance matrix, whose entry in the ith row and jth column is

\langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

If you think about it, you can see that this will measure correlations in the fluctuations of your observables.
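Here is a sketch that computes these entries as \mathrm{tr}(\rho \, (X_i - \langle X_i \rangle)(X_j - \langle X_j \rangle)) for an arbitrary mixed state and placeholder observables; as the next paragraphs explain, the result need not be symmetric, or even real, when the observables fail to commute:

```python
import numpy as np

rng = np.random.default_rng(4)

def random_observable(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

d, n = 4, 3
X = [random_observable(d) for _ in range(n)]

# Any mixed state will do for this definition; it needn't be a Gibbs state.
M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = M @ M.conj().T
rho /= np.trace(rho).real

means = [np.trace(rho @ Xi).real for Xi in X]
deltas = [Xi - m * np.eye(d) for Xi, m in zip(X, means)]    # X_i - <X_i>

# C_ij = <(X_i - <X_i>)(X_j - <X_j>)> = tr(rho (X_i - <X_i>)(X_j - <X_j>))
C = np.array([[np.trace(rho @ deltas[i] @ deltas[j]) for j in range(n)]
              for i in range(n)])

print(np.round(C, 4))   # in general complex and not symmetric, as discussed next
```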

An interesting difference between classical and quantum mechanics shows up here. In classical mechanics the covariance matrix is always symmetric — but not in quantum mechanics! You see, in classical mechanics, whenever we have two observables A and B, we have

\langle A B \rangle = \langle B A \rangle

since observables commute. But in quantum mechanics this is not true! For example, consider the position q and momentum p of a particle. In units where \hbar = 1, we have

q p = p q + i

so taking expectation values we get

\langle q p \rangle = \langle p q \rangle + i

So, it’s easy to get a non-symmetric covariance matrix when our observables X_i don’t commute. However, the real part of the covariance matrix is symmetric, even in quantum mechanics. So let’s define

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

You can check that the matrix entries here are the second derivatives of the logarithm of the partition function:

g_{ij} = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z
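Numerically, the right-hand side is easy to get at: just take centered finite-difference second derivatives of \mathrm{ln} Z with respect to the \lambda_i. Here is a sketch of that computation, with placeholder observables:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)

def random_observable(d):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (A + A.conj().T) / 2

def log_Z(lam, X):
    H = sum(l * Xi for l, Xi in zip(lam, X))
    return np.log(np.trace(expm(-H)).real)

d, n = 4, 3
X = [random_observable(d) for _ in range(n)]
lam = rng.uniform(0.2, 1.0, size=n)
eps = 1e-4

def shifted_log_Z(i, si, j, sj):
    """ln Z with lambda_i shifted by si*eps and lambda_j by sj*eps."""
    l = lam.copy()
    l[i] += si * eps
    l[j] += sj * eps
    return log_Z(l, X)

# The matrix of second derivatives of ln Z, by centered finite differences.
g = np.empty((n, n))
for i in range(n):
    for j in range(n):
        g[i, j] = (shifted_log_Z(i, +1, j, +1) - shifted_log_Z(i, +1, j, -1)
                   - shifted_log_Z(i, -1, j, +1) + shifted_log_Z(i, -1, j, -1)) / (4 * eps**2)

print(np.round(g, 4))             # a real symmetric matrix
```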

And now for the cool part: this is where information geometry comes in! Suppose that for any choice of values x_i we have a Gibbs state with

\langle X_i \rangle = x_i

Then for each point

x = (x_1, \dots , x_n) \in \mathbb{R}^n

we have a matrix

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z

And this matrix is not only symmetric, it’s also positive. And when it’s positive definite we can think of it as an inner product on the tangent space of the point x. In other words, we get a Riemannian metric on \mathbb{R}^n. This is called the Fisher information metric.

I hope you can see through the jargon to the simple idea. We’ve got a space. Each point x in this space describes the maximum-entropy state of a quantum system for which our observables have specified mean values. But in each of these states, the observables are random variables. They don’t just sit at their mean value, they fluctuate! You can picture these fluctuations as forming a little smeared-out blob in our space. To a first approximation, this blob is an ellipsoid. And if we think of this ellipsoid as a ‘unit ball’, it gives us a standard for measuring the length of any little vector sticking out of our point. In other words, we’ve got a Riemannian metric: the Fisher information metric!
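In code, using the ellipsoid as a ‘unit ball’ just means measuring a little displacement with the inner product that the matrix g_{ij} defines. A tiny sketch, with a made-up 2 \times 2 metric standing in for the one computed above:

```python
import numpy as np

# A made-up positive definite g_ij at some point x (two observables, say),
# standing in for the matrix computed in the earlier sketch.
g = np.array([[2.0, 0.3],
              [0.3, 0.5]])

dx = np.array([0.01, -0.02])       # a little vector sticking out of the point x

# Its length in the Fisher information metric: sqrt(sum_ij g_ij dx_i dx_j)
print(np.sqrt(dx @ g @ dx))
```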

Now if you look at the Wikipedia article you’ll see a more general but to me somewhat scarier definition of the Fisher information metric. This applies whenever we’ve got a manifold whose points label arbitrary mixed states of a system. But Crooks shows this definition reduces to his — the one I just described — when our manifold is \mathbb{R}^n and it’s parametrizing Gibbs states in the way we’ve just seen.

More precisely: both Crooks and the Wikipedia article describe the classical story, but it parallels the quantum story I’ve been telling… and I think the quantum version is well-known. I believe the quantum version of the Fisher information metric is sometimes called the Bures metric, though I’m a bit confused about what the Bures metric actually is.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]

27 Responses to Information Geometry (Part 1)

  1. phorgyphynance says:

    Although a different application (mathematical finance), this reminds me a bit of what I wrote here:

    Einstein meets Markowitz: Relativity Theory of Risk-Return

    Interpreting covariance as an inner product provides a lot of insight into applied statistics, e.g.

    Visualizing Market Risk: A Physicist’s Perspective

    But I’ve never seen the added bit that makes it Lorentzian. Any stochastic process has both a stochastic component (the part that wiggles around) and a deterministic component. I wonder if there is an information-theoretic interpretation of the deterministic part?

    • John F says:

      I haven’t seen a Lorentzian application, but sometimes come across causal correlation (or at least temporal). No, that’s not a joke. It’s exactly the same g_{ij}, but now X_i is a function of t_i and X_j of t_j.

      • Gavin Crooks says:

        Do you have a reference? I think I’ve been needing to reinvent that idea for a problem I’m working on.

        • John F says:

          Sorry, nothing good in the way of nonequilibrium thermodynamics or anything. I walked out of a nonequilibrium class 25 years ago and haven’t looked back. The occasion was the lecturer pondering deriving that the temperature fluctuation was identically zero, after assuming it was nonzero a couple of equations earlier. Our text, which we didn’t follow, was Kubo, Toda, and Hashitsume II.

          I basically do applied analysis of other people’s data, and one main kind is change detection. In environmental contexts, an example I liked from a few years back was correlating the change in salinity in Gulf of Mexico marshes to later changes in pigment of neighboring areas to predict marshland losses. The presenters were from Tulane and Louisiana State University, but my memory fails besides. For theory of these sorts of causal investigations, delay difference equations are common. I think those are mostly used in network analyses, which I don’t do, but there are plenty of references.

          In other applications, which I won’t belabor, the spatiality/neighborhood aspect is extremely unimportant. We just look for changes in one signal preceding changes in another signal.

        • John F says:

          Onsager may have arrived firstest with the mostest. Eqs 1 etc in
          http://prola.aps.org/pdf/PR/v38/i12/p2265_1

  2. Tim van Beek says:

    …and I think the quantum version is well-known. The quantum version of the Fisher information metric is sometimes called the Bures metric.

    I’ll come out and admit that I have never ever heard of the Bures metric, although it would seem that it was introduced in 1969. Wikipedia does not offer a hint as to whether anyone has done anything interesting with it…

    A side note:

    …where we take the logarithm of a positive operator just by taking the log of each of its eigenvalues, while keeping the same eigenvectors.

    JB is deliberately simplifying matters. This is true in the case where we assume that the spectrum of the density matrix consists of eigenvalues only. This need not be the case of course. In the more general setting we’d need to use (holomorphic) functional calculus to define the value of a function of a (positive) operator :-)

    • John Baez says:

      John wrote:

      …where we take the logarithm of a positive operator just by taking the log of each of its eigenvalues, while keeping the same eigenvectors.

      Tim wrote:

      JB is deliberately simplifying matters. This is true in the case where we assume that the spectrum of the density matrix consists of eigenvalues only.

      Right. But note: this is always true for a density matrix: a density matrix has finite trace, so its spectrum must consist of eigenvalues only! So I should have said “density matrix” where I said “positive operator”.

      Later I screwed up a bit more: when I defined the exponential of an operator, I was using a definition that’s fine for self-adjoint operators whose spectrum consists of eigenvalues only, but we may want to apply these ideas more generally, and then we need the full-fledged functional calculus. I will correct my statement.

      For people reading this and feeling nervous: none of these issues matter if we’re dealing with a quantum system with a finite-dimensional Hilbert space. Then everything I’d said is fine.

  3. Blake Stacey says:

    Copy-editing note: $\lambda_1, \dots \lambda_n$ and $n \times n$ need the extra “latex” to render properly. Not that we all aren’t used to parsing TeX in our heads, but. . .

    • John Baez says:

      Thanks. Fixed!

      What we need are really good brain implants that render TeX in an ocular preprocessing unit, so I don’t need to type the bloody “latex” after each dollar sign to get it to work on this blog. (There could in theory be an easier solution, but apparently WordPress hasn’t found it.)

  4. John F says:

    This may be a gedanken idea, or virtual meme, or instance of Dave Berg’s pseudo-Cartesian formulation “I think I’m thinking”, but here’s a leading question anyway. What is the significance of the Berry’s phase of a quantum thermodynamic cycle?

  5. streamfortyseven says:

    I just saw this and thought someone here might be interested…

    http://news.slashdot.org/story/10/10/22/1236215/Astonishing-Speedup-In-Solving-Linear-SDD-Systems

    • Tim van Beek says:

      That’s interesting news indeed! If you like, you could join the azimuth forum and create a thread about numerical mathematics and computer models over there. In addition feel free to add this link e.g. to the climate models page over at the Azimuth project.

      In a distant future the numerical and implementation aspects of climate models may be moved to their own pages…but collecting the information that we learn now over there is an important step in that direction :-)

  6. […] Last time I provided some background to this paper: […]

  7. Vasileios Anagnostopoulos says:

    For the Gibbs state derivation please see http://videolectures.net/mlss05us_dasgupta_ig/

  8. Bananahead says:

    So stupid question maybe, but it is not clear to me why g_ij so defined is the real part of the metric.

    My problem is the following: by taking the double derivative of Z, you get a totally symmetrized (and thus real) product of operators:

    \mathrm{Tr}(X_i X_j (\lambda^k X_k)^m + X_i (\lambda^k X_k)X_j(\lambda^k X_k)^{m-1} + \cdots + X_i  (\lambda^k X_k)^m X_j)

    and it is really not clear to me that that is equal to the real part of the double moment which is

    \frac{m+1}{2} \, \mathrm{Tr}\big(X_i X_j (\lambda^k X_k)^m + X_j X_i (\lambda^k X_k)^m\big)

    • John Baez says:

      Sorry to take so long to reply to your comment! I got pretty nervous about exactly this point here, but Eric Forgy made me feel okay about it in the following comments. So, please look at those, if you haven’t already.

  9. I thought you were going to talk about that Fisher Information Geometry article from Nature a few months ago.

    What’s the connection between these Lagrange Multipliers and the ones from calc 3?

    And what’s the significance of this being a Riemannian metric, as opposed to any other metric or semi pseudo quasi metric?

    • John Baez says:

      Q. P. wrote:

      I thought you were going to talk about that Fisher Information Geometry article from Nature a few months ago.

      I don’t know about that. If you give me a link, I might say something about it sometime.

      What’s the connection between these Lagrange Multipliers and the ones from calc 3?

      They’re exactly the same! We use Lagrange multipliers whenever we want to maximize something subject to some constraints. Here we are trying to maximize entropy

      -\mathrm{tr}(\rho \ln \rho)

      subject to some constraints

      \mathrm{tr}(X_i \rho) = x_i

      Think of \rho as an n \times n self-adjoint matrix that you get to vary, and X_i as some fixed self-adjoint matrices. Since the matrix \rho has a lot of entries this is a multivariable calculus problem with a lot of variables. To solve it, you need to know a bit about the logarithm of a matrix (which I explained). So, most people taking Calculus 3 couldn’t do this problem, but it’s a great test of one’s skill, and it’s fundamental to statistical mechanics.
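      In case the variational step itself is the mysterious part, here is a quick sketch of it, sweeping the functional-analytic subtleties under the rug and using the fact that \delta \, \mathrm{tr}(f(\rho)) = \mathrm{tr}(f'(\rho) \, \delta \rho) for nice functions f. Setting the variation of the entropy minus the constraint terms to zero,

      \delta \Big[ -\mathrm{tr}(\rho \ln \rho) - \sum_i \lambda_i \big( \mathrm{tr}(X_i \rho) - x_i \big) - \mu \big( \mathrm{tr}(\rho) - 1 \big) \Big] = -\mathrm{tr}\Big[ \Big( \ln \rho + 1 + \sum_i \lambda_i X_i + \mu \Big) \, \delta \rho \Big] = 0

      Since this must hold for every self-adjoint \delta \rho, the operator in the round brackets has to vanish, so

      \ln \rho = -\Big( 1 + \mu + \sum_i \lambda_i X_i \Big), \qquad \rho = \frac{1}{Z} \exp\Big( -\sum_i \lambda_i X_i \Big)

      with Z = e^{1 + \mu} fixed by the normalization \mathrm{tr}(\rho) = 1.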

      And what’s the significance of this being a Riemannian metric, as opposed to any other metric or semi pseudo quasi metric?

      Well, Riemannian geometry is the math that Einstein used to study curved spacetimes, so a vast amount is known about it. Since I used to work on quantum gravity, I like this branch of math, and I’m eager to see how it applies to the study of information.

      For example, there’s a nice theory of geodesics, which are the ‘best approximations to a straight line in a curved space’. In information geometry, it turns out that if you have some data and you want to choose the hypothesis that best fits this data, you sometimes want to draw a geodesic from the probability distribution given by your data to the surface that includes all the hypotheses you’re considering.

      That was a pretty vague remark, but later in this series I’ll explain it! So far I’ve written 5 posts in this series, but I’m really just getting started. I quit for a while because I didn’t want it to overwhelm the main point of this blog: how scientists can help save the planet.

  10. […] Part 1     • Part 2     • Part 3     • Part 4     • Part […]

  11. Gavin Crooks’ paper on “thermodynamic length” helps us understand why relative entropy doesn’t obey the triangle inequality.

  12. Sukratu Barve says:

    Towards the end of the article there is this description of the Fisher metric as a metric on ‘x_i’ space, where these x_i are averages of the random variables. I always thought it was the parameter space (the lambda_i in the article) which received the metric on its tangent space. Now one may ‘reparametrize’ this space in terms of the averages of the X_i (which I am not sure one can always do). But what makes us want to see it this way?

  13. Sukratu Barve says:

    (Sorry about that part about being unsure. Take the Jacobian, which turns out to be made up of second derivatives of ln(Z), which is invertible.)

  14. […] is an excellent blog series on information geometry by the renowned mathematician John Baez. I learned a lot from that, but in […]
