Information Geometry (Part 5)

2 November, 2010

I’m trying to understand the Fisher information metric and how it’s related to Öttinger’s formalism for ‘dissipative mechanics’ — that is, mechanics including friction. They involve similar physics, and they involve similar math, but it’s not quite clear how they fit together.

I think it will help to do an example. The harmonic oscillator is a trusty workhorse throughout physics, so let’s do that.

So: suppose you have a rock hanging on a spring, and it can bounce up and down. Suppose it’s in thermal equilibrium with its environment. It will wiggle up and down ever so slightly, thanks to thermal fluctuations. The hotter it is, the more it wiggles. These vibrations are random, so its position and momentum at any given moment can be treated as random variables.

If we take quantum mechanics into account, there’s an extra source of randomness: quantum fluctuations. Now there will be fluctuations even at zero temperature. Ultimately this is due to the uncertainty principle. Indeed, if you know the position for sure, you can’t know the momentum at all!

Let’s see how the position, momentum and energy of our rock will fluctuate given that we know all three of these quantities on average. The fluctuations will form a little fuzzy blob, roughly ellipsoidal in shape, in the 3-dimensional space whose coordinates are position, momentum and energy:

Yeah, I know you’re sick of this picture, but this time it’s for real: I want to calculate what this ellipsoid actually looks like! I’m not promising I’ll do it — I may get stuck, or bored — but at least I’ll try.

Before I start the calculation, let’s guess the answer. A harmonic oscillator has a position q and momentum p, and its energy is

H = \frac{1}{2}(q^2 + p^2)

Here I’m working in units where lots of things equal 1, to keep things simple.

You’ll notice that this energy has rotational symmetry in the position-momentum plane. This is ultimately what makes the harmonic oscillator such a beloved physical system. So, we might naively guess that our little ellipsoid will have rotational symmetry as well, like this:

or this:

Here I’m using the x and y coordinates for position and momentum, while the z coordinate stands for energy. So in these examples the position and momentum fluctuations are the same size, while the energy fluctuations, drawn in the vertical direction, might be bigger or smaller.

Unfortunately, this guess really is naive. After all, there are lots of these ellipsoids, one centered at each point in position-momentum-energy space. Remember the rules of the game! You give me any point in this space. I take the coordinates of this point as the mean values of position, momentum and energy, and I find the maximum-entropy state with these mean values. Then I work out the fluctuations in this state, and draw them as an ellipsoid.

If you pick a point where position and momentum have mean value zero, you haven’t broken the rotational symmetry of the problem. So, my ellipsoid must be rotationally symmetric. But if you pick some other mean value for position and momentum, all bets are off!

Fortunately, this naive guess is actually right: all the ellipsoids are rotationally symmetric — even the ones centered at nonzero values of position and momentum! We’ll see why soon. And if you’ve been following this series of posts, you’ll know what this implies: the “Fisher information metric” g on position-momentum-energy space has rotational symmetry about any vertical axis. (Again, I’m using the vertical direction for energy.) So, if we slice this space with any horizontal plane, the metric on this plane must be the plane’s usual metric times a constant:

g = \mathrm{constant} \, (dq^2 + dp^2)

Why? Because only the usual metric on the plane, or any multiple of it, has ordinary rotations around every point as symmetries.

So, roughly speaking, we’re recovering the ‘obvious’ geometry of the position-momentum plane from the Fisher information metric. We’re recovering ‘ordinary’ geometry from information geometry!

But this should not be terribly surprising, since we used the harmonic oscillator Hamiltonian

H = \frac{1}{2}(q^2 + p^2)

as an input to our game. It’s mainly just a confirmation that things are working as we’d hope.

There’s more, though. Last time I realized that because observables in quantum mechanics don’t commute, the Fisher information metric has a curious skew-symmetric partner called \omega. So, we should also study this in our example. And when we do, we’ll see that restricted to any horizontal plane in position-momentum-energy space, we get

\omega = \mathrm{constant} \, (dq \, dp - dp \, dq)

This looks like a mutant version of the Fisher information metric

g = \mathrm{constant} \, (dq^2 + dp^2)

and if you know your geometry, you’ll know it’s the usual ‘symplectic structure’ on the position-momentum plane — at least, times some constant.

All this is very reminiscent of Öttinger’s work on dissipative mechanics. But we’ll also see something else: while the constant in g depends on the energy — that is, on which horizontal plane we take — the constant in \omega does not!

Why? It’s perfectly sensible. The metric g on our horizontal plane keeps track of fluctuations in position and momentum. Thermal fluctuations get bigger when it’s hotter — and to boost the average energy of our oscillator, we must heat it up. So, as we increase the energy, moving our horizontal plane further up in position-momentum-energy space, the metric on the plane gets bigger! In other words, our ellipsoids get a fat cross-section at high energies.

On the other hand, the symplectic structure \omega arises from the fact that position q and momentum p don’t commute in quantum mechanics. They obey Heisenberg’s ‘canonical commutation relation’:

q p - p q = i

This relation doesn’t involve energy, so \omega will be the same on every horizontal plane. And it turns out this relation implies

\omega  = \mathrm{constant} \, (dq \, dp - dp \, dq)

for some constant we’ll compute later.

Okay, that’s the basic idea. Now let’s actually do some computations. For starters, let’s see why all our ellipsoids have rotational symmetry!

To do this, we need to understand a bit about the mixed state \rho that maximizes entropy given certain mean values of position, momentum and energy. So, let’s choose the numbers we want for these mean values (also known as ‘expected values’ or ‘expectation values’):

\langle H \rangle = E

\langle q \rangle = q_0

\langle p \rangle = p_0

I hope this isn’t too confusing: H, p, q are our observables which are operators, while E, p_0, q_0 are the mean values we have chosen for them. The state \rho depends on E, p_0 and q_0.

We’re doing quantum mechanics, so position q and momentum p are both self-adjoint operators on the Hilbert space L^2(\mathbb{R}):

(q\psi)(x) = x \psi(x)

(p\psi)(x) = - i \frac{d \psi}{dx}(x)

Indeed all our observables, including the Hamiltonian

H = \frac{1}{2} (p^2 + q^2)

are self-adjoint operators on this Hilbert space, and the state \rho is a density matrix on this space, meaning a positive self-adjoint operator with trace 1.

Now: how do we compute \rho? It’s a Lagrange multiplier problem: maximizing some function given some constraints. And it’s well-known that when you solve this problem, you get

\rho = \frac{1}{Z} e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)}

where \lambda^1, \lambda^2, \lambda^3 are three numbers we have yet to find, and Z is a normalizing factor called the partition function:

Z = \mathrm{tr} (e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)} )

Now let’s look at a special case. If we choose \lambda^1 = \lambda^2 = 0, we’re back to a simpler and more famous problem, namely maximizing entropy subject to a constraint only on energy! The solution is then

\rho = \frac{1}{Z} e^{-\beta H} , \qquad Z = \mathrm{tr} (e^{- \beta H} )

Here I’m using the letter \beta instead of \lambda^3 because this is traditional. This quantity has an important physical meaning! It’s the reciprocal of temperature in units where Boltzmann’s constant is 1.

Anyway, back to our special case! In this special case it’s easy to explicitly calculate \rho and Z. Indeed, people have known how ever since Planck put the ‘quantum’ in quantum mechanics! He figured out how black-body radiation works. A box of hot radiation is just a big bunch of harmonic oscillators in thermal equilibrium. You can work out its partition function by multiplying the partition function of each one.

So, it would be great to reduce our general problem to this special case. To do this, let’s rewrite

Z = \mathrm{tr} (e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)} )

in terms of some new variables, like this:

\rho = \frac{1}{Z} e^{-\beta(H - f q - g p)}

where now

Z = \mathrm{tr} (e^{-\beta(H - f q - g p)} )

Think about it! Now our problem is just like an oscillator with a modified Hamiltonian

H' = H - f q - g p

What does this mean, physically? Well, if you push on something with a force f, its potential energy will pick up a term - f q. So, the first two terms are just the Hamiltonian for a harmonic oscillator with an extra force pushing on it!

I don’t know a nice interpretation for the - g p term. We could say that besides the extra force equal to f, we also have an extra ‘gorce’ equal to g. I don’t know what that means. Luckily, I don’t need to! Mathematically, our whole problem is invariant under rotations in the position-momentum plane, so whatever works for q must also work for p.

Now here’s the cool part. We can complete the square:

\begin{aligned} H' & = \frac{1}{2} (q^2 + p^2) -  f q - g p \\ &= \frac{1}{2}(q^2 - 2 q f + f^2) + \frac{1}{2}(p^2 - 2 p g + g^2) - \frac{1}{2}(g^2 + f^2)  \\ &= \frac{1}{2}((q - f)^2 + (p - g)^2)  - \frac{1}{2}(g^2 + f^2)  \end{aligned}

so if we define ‘translated’ position and momentum operators:

q' = q - f, \qquad p' = p - g

we have

H' = \frac{1}{2}({q'}^2 + {p'}^2) -  \frac{1}{2}(g^2 + f^2)

So: apart from a constant, H' is just the harmonic oscillator Hamiltonian in terms of ‘translated’ position and momentum operators!

In other words: we’re studying a strange variant of the harmonic oscillator, where we are pushing on it with an extra force and also an extra ‘gorce’. But this strange variant is exactly the same as the usual harmonic oscillator, except that we’re working in translated coordinates on position-momentum space, and subtracting a constant from the Hamiltonian.

These are pretty minor differences. So, we’ve succeeded in reducing our problem to the problem of a harmonic oscillator in thermal equilibrium at some temperature!

This makes it easy to calculate

Z = \mathrm{tr} (e^{-\beta(H - f q - g p)} ) = \mathrm{tr}(e^{-\beta H'})

By our formula for H', this is just

Z = e^{\frac{\beta}{2}(g^2 + f^2)} \; \mathrm{tr} (e^{-\frac{\beta}{2}({q'}^2 + {p'}^2)})

And the second factor here equals the partition function for the good old harmonic oscillator:

Z = e^{\frac{\beta}{2}(g^2 + f^2)} \; \mathrm{tr} (e^{-\beta H})

So now we’re back to a textbook problem. The eigenvalues of the harmonic oscillator Hamiltonian are

n + \frac{1}{2}

where

n = 0,1,2,3, \dots

So, the eigenvalues of e^{-\beta H} are just

e^{-\beta(n + \frac{1}{2})}

and to take the trace of this operator, we sum up these eigenvalues:

\mathrm{tr}(e^{-\beta H}) = \sum_{n = 0}^\infty e^{-\beta (n + \frac{1}{2})} = \frac{e^{-\beta/2}}{1 - e^{-\beta}}

So:

Z = e^{\frac{\beta}{2}(g^2 + f^2)} \; \frac{e^{-\beta/2}}{1 - e^{-\beta}}
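
If you’d like to check this without trusting my algebra, here’s a minimal numerical sketch: it builds q, p and H in a truncated harmonic-oscillator number basis and compares a brute-force trace with the closed-form answer. The cutoff and the sample values of \beta, f, g are arbitrary choices of mine, nothing canonical.

```python
import numpy as np
from scipy.linalg import expm

# Truncated number (Fock) basis for the oscillator; N, beta, f, g are sample choices.
N = 80
a = np.diag(np.sqrt(np.arange(1, N)), k=1)   # annihilation operator
ad = a.conj().T                               # creation operator
q = (a + ad) / np.sqrt(2)                     # position
p = -1j * (a - ad) / np.sqrt(2)               # momentum
H = 0.5 * (q @ q + p @ p)                     # harmonic oscillator Hamiltonian

beta, f, g = 1.0, 0.3, -0.2
Hprime = H - f * q - g * p                    # modified Hamiltonian H' = H - f q - g p

Z_brute = np.trace(expm(-beta * Hprime)).real
Z_closed = np.exp(0.5 * beta * (f**2 + g**2)) * np.exp(-beta / 2) / (1 - np.exp(-beta))
print(Z_brute, Z_closed)                      # agree up to truncation error
```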

We can now compute the Fisher information metric using this formula:

g_{ij} = \frac{\partial^2}{\partial \lambda^i \partial \lambda^j} \ln Z

if we remember how our new variables are related to the \lambda^i:

\lambda^1 = -\beta f , \qquad \lambda^2 = -\beta g, \qquad \lambda^3 = \beta

It’s just calculus! But I’m feeling a bit tired, so I’ll leave this pleasure to you.

For now, I’d rather go back to our basic intuition about how the Fisher information metric describes fluctuations of observables. Mathematically, this means it’s the real part of the covariance matrix

g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

where for us

X_1 = q, \qquad X_2 = p, \qquad X_3 = H

Here we are taking expected values using the mixed state \rho. We’ve seen this mixed state is just like the maximum-entropy state of a harmonic oscillator at fixed temperature — except for two caveats: we’re working in translated coordinates on position-momentum space, and subtracting a constant from the Hamiltonian. But neither of these two caveats affects the fluctuations (X_i - \langle X_i \rangle) or the covariance matrix.

So, as indeed we’ve already seen, g_{ij} has rotational symmetry in the 1-2 plane. Thus, we’ll completely know it once we know g_{11} = g_{22} and g_{33}; the other components are zero for symmetry reasons. g_{11} will equal the variance of position for a harmonic oscillator at a given temperature, while g_{33} will equal the variance of its energy. We can work these out or look them up.

I won’t do that now: I’m after insight, not formulas. For physical reasons, it’s obvious that g_{11} must diminish with diminishing energy — but not go to zero. Why? Well, as the temperature approaches zero, a harmonic oscillator in thermal equilibrium approaches its state of least energy: the so-called ‘ground state’. In its ground state, the standard deviations of position and momentum are as small as allowed by the Heisenberg uncertainty principle:

\Delta p \Delta q  \ge \frac{1}{2}

and they’re equal, so

g_{11} = (\Delta q)^2 = \frac{1}{2}.

That’s enough about the metric. Now, what about the metric’s skew-symmetric partner? This is:

\omega_{ij} = \mathrm{Im} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

Last time we saw that \omega is all about expected values of commutators:

\omega_{ij} = \frac{1}{2i} \langle [X_i, X_j] \rangle

and this makes it easy to compute. For example,

[X_1, X_2] = q p - p q = i

so

\omega_{12} = \frac{1}{2}

Of course

\omega_{11} = \omega_{22} = 0

by skew-symmetry, so we know the restriction of \omega to any horizontal plane. We can also work out other components, like \omega_{13}, but I don’t want to. I’d rather just state this:

Summary: Restricted to any horizontal plane in the position-momentum-energy space, the Fisher information metric for the harmonic oscillator is

g = \mathrm{constant} (dq_0^2 + dp_0^2)

with a constant depending on the temperature, equalling \frac{1}{2} in the zero-temperature limit, and increasing as the temperature rises. Restricted to the same plane, the Fisher information metric’s skew-symmetric partner is

\omega = \frac{1}{2} dq_0 \wedge dp_0

(Remember, the mean values q_0, p_0, E_0 are the coordinates on position-momentum-energy space. We could also use coordinates f, g, \beta or f, g and temperature. In the chatty intro to this article you saw formulas like those above but without the subscripts; that’s before I got serious about using q and p to mean operators.)
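
If you’d rather see these claims come out of a computer than a calculation, here’s a minimal numerical sketch along the same lines as the one above: it builds the Gibbs state in a truncated number basis and reads g_{11} and \omega_{12} straight off the covariance matrix. The cutoff, the inverse temperatures and the values of f and g are arbitrary sample choices of mine; the comparison value \frac{1}{2}\coth(\beta/2) is the textbook position variance for a thermal oscillator in our units.

```python
import numpy as np
from scipy.linalg import expm

N = 80
a = np.diag(np.sqrt(np.arange(1, N)), k=1)
ad = a.conj().T
q = (a + ad) / np.sqrt(2)
p = -1j * (a - ad) / np.sqrt(2)
H = 0.5 * (q @ q + p @ p)

def g11_and_omega12(beta, f=0.4, g=-0.3):
    # Gibbs state with nonzero mean position and momentum.
    rho = expm(-beta * (H - f * q - g * p))
    rho /= np.trace(rho).real
    ev = lambda A: np.trace(rho @ A)
    dq = q - ev(q) * np.eye(N)                # fluctuation of position
    dp = p - ev(p) * np.eye(N)                # fluctuation of momentum
    return ev(dq @ dq).real, ev(dq @ dp).imag

for beta in (0.5, 2.0, 10.0):
    g11, w12 = g11_and_omega12(beta)
    print(beta, g11, 0.5 / np.tanh(beta / 2), w12)
# g11 tracks (1/2) coth(beta/2): large when hot, approaching 1/2 as beta grows,
# while w12 comes out as 1/2 at every temperature.
```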

And now for the moral. Actually I have two: a physics moral and a math moral.

First, what is the physical meaning of g or \omega when restricted to a plane of constant E_0, or if you prefer, a plane of constant temperature?

Physics Moral: Restricted to a constant-temperature plane, g is the covariance matrix for our observables. It is temperature-dependent. In the zero-temperature limit, the thermal fluctuations go away and g depends only on quantum fluctuations in the ground state. On the other hand, \omega restricted to a constant-temperature plane describes Heisenberg uncertainty relations for noncommuting observables. In our example, it is temperature-independent.

Second, what does this have to do with Kähler geometry? Remember, the complex plane has a complex-valued metric on it, called a Kähler structure. Its real part is a Riemannian metric, and its imaginary part is a symplectic structure. We can think of the complex plane as the position-momentum plane for a point particle. Then the symplectic structure is the basic ingredient needed for Hamiltonian mechanics, while the Riemannian structure is the basic ingredient needed for the harmonic oscillator Hamiltonian.

Math Moral: In the example we considered, \omega restricted to a constant-temperature plane is equal to \frac{1}{2} the usual symplectic structure on the complex plane. On the other hand, g restricted to a constant-temperature plane is a multiple of the usual Riemannian metric on the complex plane — but this multiple is \frac{1}{2} only when the temperature is zero! So, only at temperature zero are g and \omega the real and imaginary parts of a Kähler structure.

It will be interesting to see how much of this stuff is true more generally. The harmonic oscillator is much nicer than your average physical system, so it can be misleading, but I think some of the morals we’ve seen here can be generalized.

Some other time I may say more about how all this is related to Öttinger’s formalism, but the quick point is that he too has mixed states, a symmetric g, and a skew-symmetric \omega. So it’s nice to see whether they match up in an example.

Finally, two footnotes on terminology:

β: In fact, this quantity \beta = 1/kT is so important it deserves a better name than ‘reciprocal of temperature’. How about ‘coolness’? An important lesson from statistical mechanics is that coolness is more fundamental than temperature. This makes some facts more plausible. For example, if you say “you can never reach absolute zero,” it sounds very odd, since you can get as close as you like, and it’s even possible to get negative temperatures — but temperature zero remains tantalizingly out of reach. But “you can never attain infinite coolness” — now that makes sense.

Gorce: I apologize to Richard Feynman for stealing the word ‘gorce’ and using it a different way. Does anyone have a good intuition for what’s going on when you apply my sort of ‘gorce’ to a point particle? You need to think about velocity-dependent potentials, of that I’m sure. In the presence of a velocity-dependent potential, momentum is not just mass times velocity. Which is good: if it were, we could never have a system where the mean value of both q and p stayed constant over time!


Information Geometry (Part 4)

29 October, 2010

Before moving on, I’d like to clear up a mistake I’d been making in all my previous posts on this subject.

(By now I’ve tried to fix those posts, because people often get information from the web in a hasty way, and I don’t want my mistake to spread. But you’ll still see traces of my mistake infecting the comments on those posts.)

So what’s the mistake? It’s embarrassingly simple, but also simple to fix. A Riemannian metric must be symmetric:

g_{ij} = g_{ji}

Now, I had defined the Fisher information metric to be the so-called ‘covariance matrix’:

g_{ij} = \langle (X_i - \langle X_i \rangle) \;(X_j- \langle X_j \rangle)\rangle

where X_i are some observable-valued functions on a manifold M, and the angle brackets mean “expectation value”, computed using a mixed state \rho that also depends on the point in M.

The covariance matrix is symmetric in classical mechanics, since then observables commute, so:

\langle AB \rangle = \langle BA \rangle

But it’s not symmetric in quantum mechanics! After all, suppose q is the position operator for a particle, and p is the momentum operator. Then according to Heisenberg

qp = pq + i

in units where Planck’s constant is 1. Taking expectation values, we get:

\langle qp \rangle = \langle pq \rangle + i

and in particular:

\langle qp \rangle \ne \langle pq \rangle

We can use this to get examples where g_{ij} is not symmetric.

However, it turns out that the real part of the covariance matrix is symmetric, even in quantum mechanics — and that’s what we should use as our Fisher information metric.

Why is the real part of the covariance matrix symmetric, even in quantum mechanics? Well, suppose \rho is any density matrix, and A and B are any observables. Then by definition

\langle AB \rangle = \mathrm{tr} (\rho AB)

so taking the complex conjugate of both sides

\langle AB\rangle^*  = \mathrm{tr}(\rho AB)^* = \mathrm{tr}((\rho A B)^*) = \mathrm{tr}(B^* A^* \rho^*)

where I’m using an asterisk both for the complex conjugate of a number and the adjoint of an operator. But our observables are self-adjoint, and so is our density matrix, so we get

\mathrm{tr}(B^* A^* \rho^*) = \mathrm{tr}(B A \rho) = \mathrm{tr}(\rho B A) = \langle B A \rangle

where in the second step we used the cyclic property of the trace. In short:

\langle AB\rangle^* = \langle BA \rangle

If we take real parts, we get something symmetric:

\mathrm{Re} \langle AB\rangle =  \mathrm{Re} \langle BA \rangle

So, if we redefine the Fisher information metric to be the real part of the covariance matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) \; (X_j- \langle X_j \rangle)\rangle

then it’s symmetric, as it should be.
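
Here’s a thirty-second numerical check of this symmetry, using a randomly generated density matrix and randomly generated observables. All the specific choices below are arbitrary; the point is just to watch the real parts agree and the imaginary parts flip sign.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hermitian(n):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (M + M.conj().T) / 2

n = 5
A, B = random_hermitian(n), random_hermitian(n)

M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
rho = M @ M.conj().T                  # positive operator
rho /= np.trace(rho).real             # ... with trace 1: a density matrix

AB = np.trace(rho @ A @ B)
BA = np.trace(rho @ B @ A)
print(AB.real - BA.real)              # ~ 0: real parts agree
print(AB.imag + BA.imag)              # ~ 0: imaginary parts are opposite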

Last time I mentioned a general setup using von Neumann algebras, that handles the classical and quantum situations simultaneously. That applies here! Taking the real part has no effect in classical mechanics, so we don’t need it there — but it doesn’t hurt, either.

Taking the real part never has any effect when i = j, either, since the expected value of the square of an observable is a nonnegative number:

\langle (X_i - \langle X_i \rangle)^2 \rangle \ge 0

This has two nice consequences.

First, we get

g_{ii} = \langle (X_i - \langle X_i \rangle)^2 \rangle  \ge 0

and since this is true in any coordinate system, our would-be metric g is indeed nonnegative. It’ll be an honest Riemannian metric whenever it’s positive definite.

Second, suppose we’re working in the special case discussed in Part 2, where our manifold is an open subset of \mathbb{R}^n, and \rho at the point x \in \mathbb{R}^n is the Gibbs state with \langle X_i \rangle = x_i. Then all the usual rules of statistical mechanics apply. So, we can compute the variance of the observable X_i using the partition function Z:

\langle (X_i - \langle X_i \rangle)^2 \rangle = \frac{\partial^2}{\partial \lambda_i^2} \ln Z

In other words,

g_{ii} =  \frac{\partial^2}{\partial \lambda_i^2} \ln Z

But since this is true in any coordinate system, we must have

g_{ij} =  \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \ln Z

(Here I’m using a little math trick: two symmetric bilinear forms whose diagonal entries agree in any basis must be equal. We’ve already seen that the left side is symmetric, and the right side is symmetric by a famous fact about mixed partial derivatives.)
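
Concretely, the trick is the polarization identity: for a symmetric bilinear form g,

g(v,w) = \frac{1}{2}\left( g(v+w, \, v+w) - g(v,v) - g(w,w) \right)

so knowing the diagonal values g(v,v) for every vector v determines g completely.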

However, I’m pretty sure this cute formula

g_{ij} =  \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \ln Z

only holds in the special case I’m talking about now, where points in \mathbb{R}^n are parametrizing Gibbs states in the obvious way. In general we must use

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)(X_j- \langle X_j \rangle)\rangle

or equivalently,

g_{ij} = \mathrm{Re} \, \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

Okay. So much for cleaning up Last Week’s Mess. Here’s something new. We’ve seen that whenever A and B are observables (that is, self-adjoint),

\langle AB\rangle^* = \langle BA \rangle

We got something symmetric by taking the real part:

\mathrm{Re} \langle AB\rangle =  \mathrm{Re} \langle BA \rangle

Indeed,

\mathrm{Re} \langle AB \rangle = \frac{1}{2} \langle AB + BA \rangle

But by the same reasoning, we get something antisymmetric by taking the imaginary part:

\mathrm{Im} \langle AB\rangle =  - \mathrm{Im} \langle BA \rangle

and indeed,

\mathrm{Im} \langle AB \rangle = \frac{1}{2i} \langle AB - BA \rangle

Commutators like AB-BA are important in quantum mechanics, so maybe we shouldn’t just throw out the imaginary part of the covariance matrix in our desperate search for a Riemannian metric! Besides the symmetric tensor on our manifold M:

g_{ij} = \mathrm{Re} \, \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

we can also define a skew-symmetric tensor:

\omega_{ij} = \mathrm{Im} \,  \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

This will vanish in the classical case, but not in the quantum case!

If you’ve studied enough geometry, you should now be reminded of things like ‘Kähler manifolds’ and ‘almost Kähler manifolds’. A Kähler manifold is a manifold that’s equipped with a symmetric tensor g and a skew-symmetric tensor \omega which fit together in the best possible way. An almost Kähler manifold is something similar, but not quite as nice. We should probably see examples of these arising in information geometry! And that could be pretty interesting.

But in general, if we start with any old manifold M together with a function \rho taking values in mixed states, we seem to be making M into something even less nice. It gets a symmetric bilinear form g on each tangent space, and a skew-symmetric bilinear form \omega, and they vary smoothly from point to point… but they might be degenerate, and I don’t see any reason for them to ‘fit together’ in the nice way we need for a Kähler or almost Kähler manifold.

However, I still think something interesting might be going on here. For one thing, there are other situations in physics where a space of states is equipped with a symmetric g and a skew-symmetric \omega. They show up in ‘dissipative mechanics’ — the study of systems whose entropy increases.

To conclude, let me remind you of some things I said in week295 of This Week’s Finds. This is a huge digression from information geometry, but I’d like to lay out the puzzle pieces in public view, in case it helps anyone get some good ideas.

I wrote:

• Hans Christian Öttinger, Beyond Equilibrium Thermodynamics, Wiley, 2005.

I thank Arnold Neumaier for pointing out this book! It considers a fascinating generalization of Hamiltonian mechanics that applies to systems with dissipation: for example, electrical circuits with resistors, or mechanical systems with friction.

In ordinary Hamiltonian mechanics the space of states is a manifold and time evolution is a flow on this manifold determined by a smooth function called the Hamiltonian, which describes the energy of any state. In this generalization the space of states is still a manifold, but now time evolution is determined by two smooth functions: the energy and the entropy! In ordinary Hamiltonian mechanics, energy is automatically conserved. In this generalization that’s also true, but energy can go into the form of heat… and entropy automatically increases!

Mathematically, the idea goes like this. We start with a Poisson manifold, but in addition to the skew-symmetric Poisson bracket {F,G} of smooth functions on some manifold, we also have a symmetric bilinear bracket [F,G] obeying the Leibniz law

[F,GH] = [F,G]H + G[F,H]

and this positivity condition:

[F,F] ≥ 0

The time evolution of any function is given by a generalization of Hamilton’s equations:

dF/dt = {H,F} + [S,F]

where H is a function called the "energy" or "Hamiltonian", and S is a function called the "entropy". The first term on the right is the usual one. The new second term describes dissipation: as we shall see, it pushes the state towards increasing entropy.

If we require that

[H,F] = {S,F} = 0

for every function F, then we get conservation of energy, as usual in Hamiltonian mechanics:

dH/dt = {H,H} + [S,H] = 0

But we also get the second law of thermodynamics:

dS/dt = {H,S} + [S,S] ≥ 0

Entropy always increases!

Öttinger calls this framework “GENERIC” – an annoying acronym for “General Equation for the NonEquilibrium Reversible-Irreversible Coupling”. There are lots of papers about it. But I’m wondering if any geometers have looked into it!

If we didn’t need the equations [H,F] = {S,F} = 0, we could easily get the necessary brackets starting with a Kähler manifold. The imaginary part of the Kähler structure is a symplectic structure, say ω, so we can define

{F,G} = ω(dF,dG)

as usual to get Poisson brackets. The real part of the Kähler structure is a Riemannian structure, say g, so we can define

[F,G] = g(dF,dG)

This satisfies

[F,GH] = [F,G]H + G[F,H]

and

[F,F] ≥ 0

Don’t be fooled: this stuff is not rocket science. In particular, the inequality above has a simple meaning: when we move in the direction of the gradient of F, the function F increases. So adding the second term to Hamilton’s equations has the effect of pushing the system towards increasing entropy.

Note that I’m being a tad unorthodox by letting ω and g eat cotangent vectors instead of tangent vectors – but that’s no big deal. The big deal is this: if we start with a Kähler manifold and define brackets this way, we don’t get [H,F] = 0 or {S,F} = 0 for all functions F unless H and S are constant! That’s no good for applications to physics. To get around this problem, we would need to consider some sort of degenerate Kähler structure – one where ω and g are degenerate bilinear forms on the cotangent space.

Has anyone thought about such things? They remind me a little of "Dirac structures" and "generalized complex geometry" – but I don’t know enough about those subjects to know if they’re relevant here.

This GENERIC framework suggests that energy and entropy should be viewed as two parts of a single entity – maybe even its real and imaginary parts! And that in turn reminds me of other strange things, like the idea of using complex-valued Hamiltonians to describe dissipative systems, or the idea of “inverse temperature as imaginary time”. I can’t tell yet if there’s a big idea lurking here, or just a mess….
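
Just to see these equations in motion, here is a minimal numerical sketch of a GENERIC-style system: a damped oscillator with state (q, p, s), where s is an internal entropy variable. The particular energy function, brackets (the matrices L and M below) and parameters are my own toy choices, not anything taken from Öttinger’s book; but they do satisfy the degeneracy conditions above ({S,F} = 0 and [H,F] = 0 for all F, which here amount to L∇S = 0 and M∇E = 0), so energy stays constant while entropy creeps up.

```python
import numpy as np

# Toy GENERIC system: a damped oscillator with state x = (q, p, s),
# where s is an internal entropy variable.
# Energy   E(x) = p^2/2 + q^2/2 + T0*s   (internal energy with constant temperature T0)
# Entropy  S(x) = s
# gamma (friction strength), T0 and the initial state are arbitrary sample choices.
gamma, T0 = 0.2, 1.0

def grad_E(x):
    q, p, s = x
    return np.array([q, p, T0])

def grad_S(x):
    return np.array([0.0, 0.0, 1.0])

def L(x):
    # Antisymmetric (Poisson) matrix; note L @ grad_S = 0.
    return np.array([[0.0, 1.0, 0.0],
                     [-1.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0]])

def M(x):
    # Symmetric, positive semidefinite (friction) matrix; note M @ grad_E = 0.
    q, p, s = x
    return gamma * np.array([[0.0, 0.0, 0.0],
                             [0.0, T0, -p],
                             [0.0, -p, p * p / T0]])

def velocity(x):
    # The GENERIC evolution equation: dx/dt = L dE + M dS.
    return L(x) @ grad_E(x) + M(x) @ grad_S(x)

x = np.array([1.0, 0.0, 0.0])
dt, steps = 0.001, 20000
for _ in range(steps):
    x = x + dt * velocity(x)          # crude explicit Euler integration

q, p, s = x
energy = 0.5 * (p * p + q * q) + T0 * s
print(energy, s)   # energy stays near its initial value 0.5; entropy s has grown
```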


Information Geometry (Part 3)

25 October, 2010

So far in this series of posts I’ve been explaining a paper by Gavin Crooks. Now I want to go ahead and explain a little research of my own.

I’m not claiming my results are new — indeed I have no idea whether they are, and I’d like to hear from any experts who might know. I’m just claiming that this is some work I did last weekend.

People sometimes worry that if they explain their ideas before publishing them, someone will ‘steal’ them. But I think this overestimates the value of ideas, at least in esoteric fields like mathematical physics. The problem is not people stealing your ideas: the hard part is giving them away. And let’s face it, people in love with math and physics will do research unless you actively stop them. I’m reminded of this scene from the Marx Brothers movie where Harpo and Chico, playing wandering musicians, walk into a hotel and offer to play:

Groucho: What do you fellows get an hour?

Chico: Oh, for playing we getta ten dollars an hour.

Groucho: I see…What do you get for not playing?

Chico: Twelve dollars an hour.

Groucho: Well, clip me off a piece of that.

Chico: Now, for rehearsing we make special rate. Thatsa fifteen dollars an hour.

Groucho: That’s for rehearsing?

Chico: Thatsa for rehearsing.

Groucho: And what do you get for not rehearsing?

Chico: You couldn’t afford it.

So, I’m just rehearsing in public here — but of course I hope to write a paper about this stuff someday, once I get enough material.

Remember where we were. We had considered a manifold — let’s finally give it a name, say M — that parametrizes Gibbs states of some physical system. By Gibbs state, I mean a state that maximizes entropy subject to constraints on the expected values of some observables. And we had seen that in favorable cases, we get a Riemannian metric on M! It looks like this:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

where X_i are our observables, and the angle bracket means ‘expected value’.

All this applies to both classical or quantum mechanics. Crooks wrote down a beautiful formula for this metric in the classical case. But since I’m at the Centre for Quantum Technologies, not the Centre for Classical Technologies, I redid his calculation in the quantum case. The big difference is that in quantum mechanics, observables don’t commute! But in the calculations I did, that didn’t seem to matter much — mainly because I took a lot of traces, which imposes a kind of commutativity:

\mathrm{tr}(AB) = \mathrm{tr}(BA)

In fact, if I’d wanted to show off, I could have done the classical and quantum cases simultaneously by replacing all operators by elements of any von Neumann algebra equipped with a trace. Don’t worry about this much: it’s just a general formalism for treating classical and quantum mechanics on an equal footing. One example is the algebra of bounded operators on a Hilbert space, with the usual concept of trace. Then we’re doing quantum mechanics as usual. But another example is the algebra of suitably nice functions on a suitably nice space, where taking the trace of a function means integrating it. And then we’re doing classical mechanics!

For example, I showed you how to derive a beautiful formula for the metric I wrote down a minute ago:

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

But if we want to do the classical version, we can say Hey, presto! and write it down like this:

g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega

What did I do just now? I changed the trace to an integral over some space \Omega. I rewrote \rho as p to make you think ‘probability distribution’. And I don’t need to take the real part anymore, since everything is already real when we’re doing classical mechanics. Now this metric is the Fisher information metric that statisticians know and love!

In what follows, I’ll keep talking about the quantum case, but in the back of my mind I’ll be using von Neumann algebras, so everything will apply to the classical case too.

So what am I going to do? I’m going to fix a big problem with the story I’ve told so far.

Here’s the problem: so far we’ve only studied a special case of the Fisher information metric. We’ve been assuming our states are Gibbs states, parametrized by the expectation values of some observables X_1, \dots, X_n. Our manifold M was really just some open subset of \mathbb{R}^n: a point in here was a list of expectation values.

But people like to work a lot more generally. We could look at any smooth function \rho from a smooth manifold M to the set of density matrices for some quantum system. We can still write down the metric

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

in this more general situation. Nobody can stop us! But it would be better if we could derive this formula, as before, starting from a formula like the one we had before:

g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

The challenge is that now we don’t have observables X_i to start with. All we have is a smooth function \rho from some manifold to some set of states. How can we pull observables out of thin air?

Well, you may remember that last time we had

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

where \lambda^i were some functions on our manifold and

Z = \mathrm{tr}(e^{-\lambda^i X_i})

was the partition function. Let’s copy this idea.

So, we’ll start with our density matrix \rho, but then write it as

\rho = \frac{1}{Z} e^{-A}

where A is some self-adjoint operator and

Z = \mathrm{tr} (e^{-A})

(Note that A, like \rho, is really an operator-valued function on M. So, I should write something like A(x) to denote its value at a particular point x \in M, but I won’t usually do that. As usual, I expect some intelligence on your part!)

Now we can repeat some calculations I did last time. As before, let’s take the logarithm of \rho:

\mathrm{ln} \, \rho = -A - \mathrm{ln}\,  Z

and then differentiate it. Suppose \lambda^i are local coordinates near some point of M. Then

\frac{\partial}{\partial \lambda^i} \mathrm{ln}\, \rho = - \frac{\partial}{\partial \lambda^i} A - \frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z

Last time we had nice formulas for both terms on the right-hand side above. To get similar formulas now, let’s define operators

X_i = \frac{\partial}{\partial \lambda^i} A

This gives a nice name to the first term on the right-hand side above. What about the second term? We can calculate it out:

\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = \frac{1}{Z}  \frac{\partial }{\partial \lambda^i} \mathrm{tr}(e^{-A}) = \frac{1}{Z}  \mathrm{tr}(\frac{\partial }{\partial \lambda^i} e^{-A}) = - \frac{1}{Z}  \mathrm{tr}(e^{-A} \frac{\partial}{\partial \lambda^i} A)

where in the last step we use the chain rule. Next, use the definition of \rho and X_i, and get:

\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = - \mathrm{tr}(\rho X_i) = - \langle X_i \rangle

This is just what we got last time! Ain’t it fun to calculate when it all works out so nicely?

So, putting both terms together, we see

\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho = - X_i + \langle X_i \rangle

or better:

X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho

This is a nice formula for the ‘fluctuation’ of the observables X_i, meaning how much they differ from their expected values. And it looks exactly like the formula we had last time! The difference is that last time we started out assuming we had a bunch of observables, X_i, and defined \rho to be the state maximizing the entropy subject to constraints on the expectation values of all these observables.
Now we’re starting with \rho and working backwards.

From here on out, it’s easy. As before, we can define g_{ij} to be the real part of the covariance matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

Using the formula

X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho

we get

g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j} \rangle

or

g_{ij} = \mathrm{Re}\,\mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j})

Voilà!

When this matrix is positive definite at every point, we get a Riemannian metric on M. Last time I said this is what people call the ‘Bures metric’ — though frankly, now that I examine the formulas, I’m not so sure. But in the classical case, it’s called the Fisher information metric.

Differential geometers like to use \partial_i as a shorthand for \frac{\partial}{\partial \lambda^i}, so they’d write down our metric in a prettier way:

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho) )

Differential geometers like coordinate-free formulas, so let’s also give a coordinate-free formula for our metric. Suppose x \in M is a point in our manifold, and suppose v,w are tangent vectors to this point. Then

g(v,w) = \mathrm{Re} \, \langle v(\mathrm{ln}\, \rho) \; w(\mathrm{ln} \,\rho) \rangle \; = \; \mathrm{Re} \,\mathrm{tr}(\rho \; v(\mathrm{ln}\, \rho) \; w(\mathrm{ln}\, \rho))

Here \mathrm{ln}\, \rho is a smooth operator-valued function on M, and v(\mathrm{ln}\, \rho) means the derivative of this function in the v direction at the point x.
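
Here’s a minimal numerical sketch of this general recipe, for a two-parameter family of qubit density matrices. The particular family, the finite-difference step and the sample point are all arbitrary choices of mine; the point is just to watch a symmetric, positive definite 2×2 matrix come out of the formula.

```python
import numpy as np
from scipy.linalg import expm, logm

# A two-parameter family of qubit density matrices (my own choice, just for
# illustration): rho(l) = exp(-(l[0]*sx + l[1]*sz)) / Z.
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def rho(l):
    R = expm(-(l[0] * sx + l[1] * sz))
    return R / np.trace(R).real

def metric(l, h=1e-5):
    # g_ij = Re tr(rho d_i(ln rho) d_j(ln rho)), with the derivatives of ln(rho)
    # approximated by central finite differences.
    r = rho(l)
    dlnrho = []
    for i in range(2):
        step = np.zeros(2)
        step[i] = h
        dlnrho.append((logm(rho(l + step)) - logm(rho(l - step))) / (2 * h))
    return np.array([[np.trace(r @ dlnrho[i] @ dlnrho[j]).real for j in range(2)]
                     for i in range(2)])

print(metric(np.array([0.3, -0.7])))   # a symmetric, positive definite 2x2 matrix
```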

So, this is all very nice. To conclude, two more points: a technical one, and a more important philosophical one.

First, the technical point. When I said \rho could be any smooth function from a smooth manifold to some set of states, I was actually lying. That’s an important pedagogical technique: the brazen lie.

We can’t really take the logarithm of every density matrix. Remember, we take the log of a density matrix by taking the log of all its eigenvalues. These eigenvalues are ≥ 0, but if one of them is zero, we’re in trouble! The logarithm of zero is undefined.

On the other hand, there’s no problem taking the logarithm of our density-matrix-valued function \rho when it’s positive definite at each point of M. You see, a density matrix is positive definite iff its eigenvalues are all > 0. In this case it has a unique self-adjoint logarithm.

So, we must assume \rho is positive definite. But what’s the physical significance of this ‘positive definiteness’ condition? Well, any density matrix can be diagonalized using some orthonormal basis. It can then be seen as a probabilistic mixture — not a quantum superposition! — of pure states taken from this basis. Its eigenvalues are the probabilities of finding the mixed state to be in one of these pure states. So, saying that its eigenvalues are all > 0 amounts to saying that all the pure states in this orthonormal basis show up with nonzero probability! Intuitively, this means our mixed state is ‘really mixed’. For example, it can’t be a pure state. In math jargon, it means our mixed state is in the interior of the convex set of mixed states.
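
In code, taking the log of a density matrix really does just mean taking the log of its eigenvalues, and you can watch the positivity requirement bite. A tiny sketch (the two sample density matrices are mine):

```python
import numpy as np

def log_density_matrix(rho):
    # Diagonalize the (self-adjoint) density matrix, take the log of each
    # eigenvalue, and reassemble in the same eigenbasis.
    w, V = np.linalg.eigh(rho)
    return V @ np.diag(np.log(w)) @ V.conj().T

really_mixed = np.array([[0.7, 0.0], [0.0, 0.3]])   # eigenvalues 0.7 and 0.3: fine
pure         = np.array([[1.0, 0.0], [0.0, 0.0]])   # eigenvalue 0: not positive definite

print(log_density_matrix(really_mixed))
print(log_density_matrix(pure))    # log 0 blows up: -inf (and nan) entries appear
```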

Second, the philosophical point. Instead of starting with the density matrix \rho, I took A as fundamental. But different choices of A give the same \rho. After all,

\rho = \frac{1}{Z} e^{-A}

where we cleverly divide by the normalization factor

Z = \mathrm{tr} (e^{-A})

to get \mathrm{tr} \, \rho = 1. So, if we multiply e^{-A} by any positive constant, or indeed any positive function on our manifold M, \rho will remain unchanged!

So we have added a little extra information when switching from \rho to A. You can think of this as ‘gauge freedom’, because I’m saying we can do any transformation like

A \mapsto A + f

where

f: M \to \mathbb{R}

is a smooth function. This doesn’t change \rho, so arguably it doesn’t change the ‘physics’ of what I’m doing. It does change Z. It also changes the observables

X_i = \frac{\partial}{\partial \lambda^i} A

But it doesn’t change their ‘fluctuations’

X_i - \langle X_i \rangle

so it doesn’t change the metric g_{ij}.

This gauge freedom is interesting, and I want to understand it better. It’s related to something very simple yet mysterious. In statistical mechanics the partition function Z begins life as ‘just a normalizing factor’. If you change the physics so that Z gets multiplied by some number, the Gibbs state doesn’t change. But then the partition function takes on an incredibly significant role as something whose logarithm you differentiate to get lots of physically interesting information! So in some sense the partition function doesn’t matter much… but changes in the partition function matter a lot.

This is just like the split personality of phases in quantum mechanics. On the one hand they ‘don’t matter’: you can multiply a unit vector by any phase and the pure state it defines doesn’t change. But on the other hand, changes in phase matter a lot.

Indeed the analogy here is quite deep: it’s the analogy between probabilities in statistical mechanics and amplitudes in quantum mechanics, the analogy between \mathrm{exp}(-\beta H) in statistical mechanics and \mathrm{exp}(-i t H / \hbar) in quantum mechanics, and so on. This is part of a bigger story about ‘rigs’ which I told back in the Winter 2007 quantum gravity seminar, especially in week13. So, it’s fun to see it showing up yet again… even though I don’t completely understand it here.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


Information Geometry (Part 2)

23 October, 2010

Last time I provided some background to this paper:

• Gavin E. Crooks, Measuring thermodynamic length.

Now I’ll tell you a bit about what it actually says!

Remember the story so far: we’ve got a physical system that’s in a state of maximum entropy. I didn’t emphasize this yet, but that happens whenever our system is in thermodynamic equilibrium. An example would be a box of gas inside a piston. Suppose you choose any number for the energy of the gas and any number for its volume. Then there’s a unique state of the gas that maximizes its entropy, given the constraint that on average, its energy and volume have the values you’ve chosen. And this describes what the gas will be like in equilibrium!

Remember, by ‘state’ I mean mixed state: it’s a probabilistic description. And I say the energy and volume have chosen values on average because there will be random fluctuations. Indeed, if you look carefully at the head of the piston, you’ll see it quivering: the volume of the gas only equals the volume you’ve specified on average. Same for the energy.

More generally: imagine picking any list of numbers, and finding the maximum entropy state where some chosen observables have these numbers as their average values. Then there will be fluctuations in the values of these observables — thermal fluctuations, but also possibly quantum fluctuations. So, you’ll get a probability distribution on the space of possible values of your chosen observables. You should visualize this probability distribution as a little fuzzy cloud centered at the average value!

To a first approximation, this cloud will be shaped like a little ellipsoid. And if you can pick the average value of your observables to be whatever you’ll like, you’ll get lots of little ellipsoids this way, one centered at each point. And the cool idea is to imagine the space of possible values of your observables as having a weirdly warped geometry, such that relative to this geometry, these ellipsoids are actually spheres.

This weirdly warped geometry is an example of an ‘information geometry’: a geometry that’s defined using the concept of information. This shouldn’t be surprising: after all, we’re talking about maximum entropy, and entropy is related to information. But I want to gradually make this idea more precise.

Bring on the math!

We’ve got a bunch of observables X_1, \dots , X_n, and we’re assuming that for any list of numbers x_1, \dots , x_n, the system has a unique maximal-entropy state \rho for which the expected value of the observable X_i is x_i:

\langle X_i \rangle = x_i

This state \rho is called the Gibbs state and I told you how to find it when it exists. In fact it may not exist for every list of numbers x_1, \dots , x_n, but we’ll be perfectly happy if it does for all choices of

x = (x_1, \dots, x_n)

lying in some open subset of \mathbb{R}^n

By the way, I should really call this Gibbs state \rho(x) or something to indicate how it depends on x, but I won’t usually do that. I expect some intelligence on your part!

Now at each point x we can define a covariance matrix

\langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

If we take its real part, we get a symmetric matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

It’s also nonnegative — that’s easy to see, since the variance of a probability distribution can’t be negative. When we’re lucky this matrix will be positive definite. When we’re even luckier, it will depend smoothly on x. In this case, g_{ij} will define a Riemannian metric on our open set.

So far this is all review of last time. Sorry: I seem to have reached the age where I can’t say anything interesting without warming up for about 15 minutes first. It’s like when my mom tells me about an exciting event that happened to her: she starts by saying “Well, I woke up, and it was cloudy out…”

But now I want to give you an explicit formula for the metric g_{ij}, and then rewrite it in a way that’ll work even when the state \rho is not a maximal-entropy state. And this formula will then be the general definition of the ‘Fisher information metric’ (if we’re doing classical mechanics), or a quantum version thereof (if we’re doing quantum mechanics).

Crooks does the classical case — so let’s do the quantum case, okay? Last time I claimed that in the quantum case, our maximum-entropy state is the Gibbs state

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

where \lambda^i are the ‘conjugate variables’ of the observables X_i, we’re using the Einstein summation convention to sum over repeated indices that show up once upstairs and once downstairs, and Z is the partition function

Z = \mathrm{tr} (e^{-\lambda^i X_i})

(To be honest: last time I wrote the indices on the conjugate variables \lambda^i as subscripts rather than superscripts, because I didn’t want some poor schlep out there to think that \lambda^1, \dots , \lambda^n were the powers of some number \lambda. But now I’m assuming you’re all grown up and ready to juggle indices! We’re doing Riemannian geometry, after all.)

Also last time I claimed that it’s tremendously fun and enlightening to take the derivative of the logarithm of Z. The reason is that it gives you the mean values of your observables:

\langle X_i \rangle = - \frac{\partial}{\partial \lambda^i} \ln Z

But now let’s take the derivative of the logarithm of \rho. Remember, \rho is an operator — in fact a density matrix. But we can take its logarithm as explained last time, and the usual rules apply, so starting from

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

we get

\mathrm{ln}\, \rho = - \lambda^i X_i - \mathrm{ln} \,Z

Next, let’s differentiate both sides with respect to \lambda^i. Why? Well, from what I just said, you should be itching to differentiate \mathrm{ln}\, Z. So let’s give in to that temptation:

\frac{\partial  }{\partial \lambda^i} \mathrm{ln}  \rho = -X_i + \langle X_i \rangle

Hey! Now we’ve got a formula for the ‘fluctuation’ of the observable X_i — that is, how much it differs from its mean value:

X_i - \langle X_i \rangle = - \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i}

This is incredibly cool! I should have learned this formula decades ago, but somehow I just bumped into it now. I knew of course that \mathrm{ln} \, \rho shows up in the formula for the entropy:

S(\rho) = -\mathrm{tr} ( \rho \; \mathrm{ln} \, \rho)

But I never had the brains to think about \mathrm{ln}\, \rho all by itself. So I’m really excited to discover that it’s an interesting entity in its own right — and fun to differentiate, just like \mathrm{ln}\, Z.

Now we get our cool formula for g_{ij}. Remember, it’s defined by

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

But now that we know

X_i - \langle X_i \rangle = -\frac{\partial \mathrm{ln} \rho }{\partial \lambda^i}

we get the formula we were looking for:

g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j}  \rangle

Beautiful, eh? And of course the expected value of any observable A in the state \rho is

\langle A \rangle = \mathrm{tr}(\rho A)

so we can also write the covariance matrix like this:

g_{ij} = \mathrm{Re}\, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

Lo and behold! This formula makes sense whenever \rho is any density matrix depending smoothly on some parameters \lambda^i. We don’t need it to be a Gibbs state! So, we can work more generally.

Indeed, whenever we have any smooth function from a manifold to the space of density matrices for some Hilbert space, we can define g_{ij} by the above formula! And when it’s positive definite, we get a Riemannian metric on our manifold: the Bures information metric.

The classical analogue is the somewhat more well-known ‘Fisher information metric’. When we go from quantum to classical, operators become functions and traces become integrals. There’s nothing complex anymore, so taking the real part becomes unnecessary. So the Fisher information metric looks like this:

g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega

Here I’m assuming we’ve got a smooth function p from some manifold M to the space of probability distributions on some measure space (\Omega, d\omega). Working in local coordinates \lambda^i on our manifold M, the above formula defines a Riemannian metric on M, at least when g_{ij} is positive definite. And that’s the Fisher information metric!
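
Here’s a minimal numerical sketch of this classical formula for a family most people know by heart: Gaussians with mean \mu and standard deviation \sigma, where the Fisher metric famously works out to \mathrm{diag}(1/\sigma^2, 2/\sigma^2). The quadrature grid and finite-difference step below are arbitrary choices of mine.

```python
import numpy as np

def fisher_metric_gaussian(mu, sigma, h=1e-5):
    # Numerically approximate g_ij = integral of p(w) d_i(ln p(w)) d_j(ln p(w)) dw
    # for the normal distribution with parameters (mu, sigma), on a finite grid.
    w = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 40001)
    dw = w[1] - w[0]

    def logp(m, s):
        return -0.5 * ((w - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))

    p = np.exp(logp(mu, sigma))
    dlogp = [(logp(mu + h, sigma) - logp(mu - h, sigma)) / (2 * h),   # d/d mu
             (logp(mu, sigma + h) - logp(mu, sigma - h)) / (2 * h)]   # d/d sigma
    return np.array([[np.sum(p * dlogp[i] * dlogp[j]) * dw for j in range(2)]
                     for i in range(2)])

print(fisher_metric_gaussian(0.0, 2.0))       # numerically computed metric
print(np.diag([1 / 2.0**2, 2 / 2.0**2]))      # textbook answer for sigma = 2
```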

Crooks says more: he describes an experiment that would let you measure the length of a path with respect to the Fisher information metric — at least in the case where the state \rho(x) is the Gibbs state with \langle X_i \rangle = x_i. And that explains why he calls it ‘thermodynamic length’.

There’s a lot more to say about this, and also about another question: What use is the Fisher information metric in the general case where the states \rho(x) aren’t Gibbs states?

But it’s dinnertime, so I’ll stop here.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


Information Geometry (Part 1)

22 October, 2010

I’d like to provide a bit of background to this interesting paper:

• Gavin E. Crooks, Measuring thermodynamic length.

which was pointed out by John F in our discussion of entropy and uncertainty.

The idea here should work for either classical or quantum statistical mechanics. The paper describes the classical version, so just for a change of pace let me describe the quantum version.

First a lightning review of quantum statistical mechanics. Suppose you have a quantum system with some Hilbert space. When you know as much as possible about your system, then you describe it by a unit vector in this Hilbert space, and you say your system is in a pure state. Sometimes people just call a pure state a ‘state’. But that can be confusing, because in statistical mechanics you also need more general ‘mixed states’ where you don’t know as much as possible. A mixed state is described by a density matrix, meaning a positive operator \rho with trace equal to 1:

\mathrm{tr}(\rho) = 1

The idea is that any observable is described by a self-adjoint operator A, and the expected value of this observable in the mixed state \rho is

\langle A \rangle = \mathrm{tr}(\rho A)

The entropy of a mixed state is defined by

S(\rho) = -\mathrm{tr}(\rho \; \mathrm{ln} \, \rho)

where we take the logarithm of the density matrix just by taking the log of each of its eigenvalues, while keeping the same eigenvectors. This formula for entropy should remind you of the one that Gibbs and Shannon used — the one I explained a while back.

Back then I told you about the ‘Gibbs ensemble’: the mixed state that maximizes entropy subject to the constraint that some observable have a given value. We can do the same thing in quantum mechanics, and we can even do it for a bunch of observables at once. Suppose we have some observables X_1, \dots, X_n and we want to find the mixed state \rho that maximizes entropy subject to these constraints:

\langle X_i \rangle = x_i

for some numbers x_i. Then a little exercise in Lagrange multipliers shows that the answer is the Gibbs state:

\rho = \frac{1}{Z} \mathrm{exp}(-(\lambda_1 X_1 + \cdots + \lambda_n X_n))

Huh?

This answer needs some explanation. First of all, the numbers \lambda_1, \dots \lambda_n are called Lagrange multipliers. You have to choose them right to get

\langle X_i \rangle = x_i

So, in favorable cases, they will be functions of the numbers x_i. And when you’re really lucky, you can solve for the numbers x_i in terms of the numbers \lambda_i. We call \lambda_i the conjugate variable of the observable X_i. For example, the conjugate variable of energy is inverse temperature!

Second of all, we take the exponential of a self-adjoint operator just as we took the logarithm of a density matrix: just take the exponential of each eigenvalue.

(At least this works when our self-adjoint operator has only eigenvalues in its spectrum, not any continuous spectrum. Otherwise we need to get serious and use the functional calculus. Luckily, if your system’s Hilbert space is finite-dimensional, you can ignore this parenthetical remark!)

But third: what’s that number Z? It begins life as a humble normalizing factor. Its job is to make sure \rho has trace equal to 1:

Z = \mathrm{tr}(\mathrm{exp}(-(\lambda_1 X_1 + \cdots + \lambda_n X_n)))

However, once you get going, it becomes incredibly important! It’s called the partition function of your system.
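
Since we’ll be playing with it, here’s a hedged sketch in Python with NumPy of how one might build a Gibbs state and its partition function numerically. The observables and Lagrange multipliers are arbitrary choices for illustration, and the matrix exponential is taken by exponentiating eigenvalues, just as described above:

import numpy as np

def gibbs_state(Xs, lambdas):
    """Gibbs state rho = exp(-(lambda_1 X_1 + ... + lambda_n X_n)) / Z."""
    H = sum(l * X for l, X in zip(lambdas, Xs))    # the exponent: a self-adjoint matrix
    w, V = np.linalg.eigh(H)                       # its eigenvalues and eigenvectors
    expH = V @ np.diag(np.exp(-w)) @ V.conj().T    # exponentiate each eigenvalue
    Z = np.trace(expH).real                        # the partition function
    return expH / Z, Z

# Two made-up observables on a 2-dimensional Hilbert space (Pauli matrices).
X1 = np.array([[0, 1], [1, 0]], dtype=complex)
X2 = np.array([[1, 0], [0, -1]], dtype=complex)

rho, Z = gibbs_state([X1, X2], [0.3, 0.7])
print(np.trace(rho).real)   # 1, up to rounding: rho really is a density matrix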

As an example of what it’s good for, it turns out you can compute the numbers x_i as follows:

x_i = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

In other words, you can compute the expected values of the observables X_i by differentiating the log of the partition function:

\langle X_i \rangle = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

Or in still other words,

\langle X_i \rangle = - \frac{1}{Z} \; \frac{\partial Z}{\partial \lambda_i}

To believe this you just have to take the equations I’ve given you so far and mess around — there’s really no substitute for doing it yourself. I’ve done it fifty times, and every time I feel smarter.
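
Or let the computer do the messing around. Here’s a sketch that checks \langle X_i \rangle = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z by finite differences, using the same made-up observables and eigenvalue-exponentiation recipe as the previous snippet:

import numpy as np

X1 = np.array([[0, 1], [1, 0]], dtype=complex)
X2 = np.array([[1, 0], [0, -1]], dtype=complex)
Xs, lambdas = [X1, X2], np.array([0.3, 0.7])

def lnZ(lams):
    H = sum(l * X for l, X in zip(lams, Xs))
    return np.log(np.sum(np.exp(-np.linalg.eigvalsh(H))))

# Build the Gibbs state directly, by exponentiating eigenvalues.
H = sum(l * X for l, X in zip(lambdas, Xs))
w, V = np.linalg.eigh(H)
rho = V @ np.diag(np.exp(-w)) @ V.conj().T
rho /= np.trace(rho).real

eps = 1e-6
for i in range(len(Xs)):
    lp, lm = lambdas.copy(), lambdas.copy()
    lp[i] += eps
    lm[i] -= eps
    deriv = (lnZ(lp) - lnZ(lm)) / (2 * eps)
    print(np.trace(rho @ Xs[i]).real, -deriv)   # these two numbers should agree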

But we can go further: after the ‘expected value’ or ‘mean’ of an observable comes its variance, which is the square of its standard deviation:

(\Delta A)^2 = \langle A^2 \rangle - \langle A \rangle^2

This measures the size of fluctuations around the mean. And in the Gibbs state, we can compute the variance of the observable X_i as the second derivative of the log of the partition function:

\langle X_i^2 \rangle - \langle X_i \rangle^2 =  \frac{\partial^2}{\partial \lambda_i^2} \mathrm{ln} Z

Again: calculate and see.
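
Here’s the same sort of numerical check for the variance formula. To keep it clean I’ll use commuting (diagonal) observables — essentially the classical case, so there are no operator-ordering subtleties — and compare against a finite-difference second derivative of \mathrm{ln} Z. The numbers are made up:

import numpy as np

# Diagonal entries of two made-up commuting observables.
x1 = np.array([1.0, -1.0, 0.5])
x2 = np.array([0.2, 0.4, -0.6])
lambdas = np.array([0.3, 0.7])

def lnZ(lams):
    return np.log(np.sum(np.exp(-(lams[0] * x1 + lams[1] * x2))))

# Gibbs probabilities and the variance of X_1 in that state.
w = lambdas[0] * x1 + lambdas[1] * x2
p = np.exp(-w) / np.sum(np.exp(-w))
variance = np.sum(p * x1**2) - np.sum(p * x1)**2

# Second derivative of ln Z in the lambda_1 direction, by finite differences.
eps = 1e-4
d2 = (lnZ(lambdas + [eps, 0.0]) - 2 * lnZ(lambdas) + lnZ(lambdas - [eps, 0.0])) / eps**2
print(variance, d2)   # should agree up to finite-difference error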

But when we’ve got lots of observables, there’s something better than the variance of each one. There’s the covariance matrix of the whole lot of them! Each observable X_i fluctuates around its mean value x_i… but these fluctuations are not independent! They’re correlated, and the covariance matrix says how.

All this is very visual, at least for me. If you imagine the fluctuations as forming a blurry patch near the point (x_1, \dots, x_n), this patch will be ellipsoidal in shape, at least when all our random fluctuations are Gaussian. And then the shape of this ellipsoid is precisely captured by the covariance matrix! In particular, the eigenvectors of the covariance matrix will point along the principal axes of this ellipsoid, and the eigenvalues will say how stretched out the ellipsoid is in each direction!
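
As a tiny illustration of that last sentence, here’s a sketch that takes a made-up 2×2 covariance matrix and reads off the principal axes of the corresponding ellipse from its eigenvectors and eigenvalues:

import numpy as np

# A made-up covariance matrix for two correlated fluctuating quantities.
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])

# Eigenvectors point along the principal axes of the ellipse of fluctuations;
# square roots of the eigenvalues give the lengths of the semi-axes.
eigvals, eigvecs = np.linalg.eigh(cov)
for length, axis in zip(np.sqrt(eigvals), eigvecs.T):
    print(length, axis)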

To understand the covariance matrix, it may help to start by rewriting the variance of a single observable A as

(\Delta A)^2 = \langle (A - \langle A \rangle)^2 \rangle

That’s a lot of angle brackets, but the meaning should be clear. First we look at the difference between our observable and its mean value, namely

A - \langle A \rangle

Then we square this, to get something that’s big and positive whenever our observable is far from its mean. Then we take the mean value of that, to get an idea of how far our observable is from its mean on average.

We can use the same trick to define the covariance of a bunch of observables X_i. We get an n \times n matrix called the covariance matrix, whose entry in the ith row and jth column is

\langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

If you think about it, you can see that this will measure correlations in the fluctuations of your observables.

An interesting difference between classical and quantum mechanics shows up here. In classical mechanics the covariance matrix is always symmetric — but not in quantum mechanics! You see, in classical mechanics, whenever we have two observables A and B, we have

\langle A B \rangle = \langle B A \rangle

since observables commute. But in quantum mechanics this is not true! For example, consider the position q and momentum p of a particle. In units where \hbar = 1, we have

q p = p q + i

so taking expectation values we get

\langle q p \rangle = \langle p q \rangle + i

So, it’s easy to get a non-symmetric covariance matrix when our observables X_i don’t commute. However, the real part of the covariance matrix is symmetric, even in quantum mechanics. So let’s define

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

You can check that the matrix entries here are the second derivatives of the logarithm of the partition function:

g_{ij} = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z

And now for the cool part: this is where information geometry comes in! Suppose that for any choice of values x_i we have a Gibbs state with

\langle X_i \rangle = x_i

Then for each point

x = (x_1, \dots , x_n) \in \mathbb{R}^n

we have a matrix

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z

And this matrix is not only symmetric, it’s also positive. And when it’s positive definite we can think of it as an inner product on the tangent space of the point x. In other words, we get a Riemannian metric on \mathbb{R}^n. This is called the Fisher information metric.
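
If you want to see the symmetry and positivity in a concrete case, here’s a sketch with two non-commuting observables (Pauli matrices) in a made-up mixed state: the full covariance matrix has complex, non-symmetric entries, but its real part comes out symmetric and positive semidefinite:

import numpy as np

# Two non-commuting observables and a made-up mixed state.
X = [np.array([[0, 1], [1, 0]], dtype=complex),      # sigma_x
     np.array([[0, -1j], [1j, 0]], dtype=complex)]   # sigma_y
rho = np.array([[0.8, 0.1], [0.1, 0.2]], dtype=complex)

def ev(A):
    """Expected value tr(rho A)."""
    return np.trace(rho @ A)

# Covariance matrix <(X_i - <X_i>)(X_j - <X_j>)>
n = len(X)
cov = np.zeros((n, n), dtype=complex)
for i in range(n):
    for j in range(n):
        Di = X[i] - ev(X[i]) * np.eye(2)
        Dj = X[j] - ev(X[j]) * np.eye(2)
        cov[i, j] = ev(Di @ Dj)

g = cov.real
print(cov)                    # complex and not symmetric
print(np.allclose(g, g.T))    # True: the real part is symmetric
print(np.linalg.eigvalsh(g))  # nonnegative eigenvalues: positive semidefinite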

I hope you can see through the jargon to the simple idea. We’ve got a space. Each point x in this space describes the maximum-entropy state of a quantum system for which our observables have specified mean values. But in each of these states, the observables are random variables. They don’t just sit at their mean value, they fluctuate! You can picture these fluctuations as forming a little smeared-out blob in our space. To a first approximation, this blob is an ellipsoid. And if we think of this ellipsoid as a ‘unit ball’, it gives us a standard for measuring the length of any little vector sticking out of our point. In other words, we’ve got a Riemannian metric: the Fisher information metric!

Now if you look at the Wikipedia article you’ll see a more general but to me somewhat scarier definition of the Fisher information metric. This applies whenever we’ve got a manifold whose points label arbitrary mixed states of a system. But Crooks shows this definition reduces to his — the one I just described — when our manifold is \mathbb{R}^n and it’s parametrizing Gibbs states in the way we’ve just seen.

More precisely: both Crooks and the Wikipedia article describe the classical story, but it parallels the quantum story I’ve been telling… and I think the quantum version is well-known. I believe the quantum version of the Fisher information metric is sometimes called the Bures metric, though I’m a bit confused about what the Bures metric actually is.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


The Art of Math

22 October, 2010

On this blog I’m focusing on quantum technology and “how scientists can save the planet”. But I have a previous life, when I worked on n-categories and quantum gravity — and I still have three grad students doing theses on those topics: John Huerta, Chris Rogers and Christopher Walker. They’re all finishing up next spring — you should hire them!

Unfortunately, it’s a bit hard to explain their work in simple terms.

Fortunately, Sophie Hebden has written an article that does a great job!

You can comment over there (that is, on the n-Café), if you want.


Entropy and Uncertainty

19 October, 2010

I was going to write about a talk at the CQT, but I found a preprint lying on a table in the lecture hall, and it was so cool I’ll write about that instead:

• Mario Berta, Matthias Christandl, Roger Colbeck, Joseph M. Renes, Renato Renner, The uncertainty principle in the presence of quantum memory, Nature Physics, July 25, 2010.

Actually I won’t talk about the paper per se, since it’s better if I tell you a more basic result that I first learned from reading this paper: the entropic uncertainty principle!

Everyone loves the concept of entropy, and everyone loves the uncertainty principle. Even folks who don’t understand ’em still love ’em. They just sound so mysterious and spooky and dark. I love ’em too. So, it’s nice to see a mathematical relation between them.

I explained entropy back here, so let me say a word about the uncertainty principle. It’s a limitation on how accurately you can measure two things at once in quantum mechanics. Sometimes you can only know a lot about one thing if you don’t know much about the other. This happens when those two things “fail to commute”.

Mathematically, the usual uncertainty principle says this:

\Delta A \cdot \Delta B \ge \frac{1}{2} |\langle [A,B] \rangle |

In plain English: the uncertainty in A times the uncertainty in B is at least half the absolute value of the expected value of their commutator

[A,B] = A B - B A

Whoops! That started off as plain English, but it degenerated into plain gibberish near the end… which is probably why most people don’t understand the uncertainty principle. I don’t think I’m gonna cure that today, but let me just nail down the math a bit.

Suppose A and B are observables — and to keep things really simple, by observable I’ll just mean a self-adjoint n \times n matrix. Suppose \psi is a state: that is, a unit vector in \mathbb{C}^n. Then the expected value of A in the state \psi is the average answer you get when you measure that observable in that state. Mathematically it’s equal to

\langle A \rangle = \langle \psi, A \psi \rangle

Sorry, there are a lot of angle brackets running around here: the ones at right stand for the inner product in \mathbb{C}^n, which I’m assuming you understand, while the ones at left are being defined by this equation. They’re just a shorthand.

Once we can compute averages, we can compute standard deviations, so we define the standard deviation of an observable A in the state \psi to be \Delta A where

(\Delta A)^2 = \langle A^2 \rangle - \langle A \rangle^2

Got it? Just like in probability theory. So now I hope you know what every symbol here means:

\Delta A \cdot \Delta B \ge \frac{1}{2} |\langle [A,B] \rangle |

and if you’re a certain sort of person you can have fun going home and proving this. Hint: it takes an inequality to prove an inequality. Other hint: what’s the most important inequality in the universe?
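
And if you’d rather check it numerically before proving it, here’s a sketch with a random state and two randomly chosen self-adjoint matrices — nothing special about them:

import numpy as np

rng = np.random.default_rng(0)

def random_hermitian(n):
    M = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (M + M.conj().T) / 2

# A random unit vector in C^3 and two random observables.
psi = rng.normal(size=3) + 1j * rng.normal(size=3)
psi /= np.linalg.norm(psi)
A, B = random_hermitian(3), random_hermitian(3)

def ev(M):
    """Expected value <psi, M psi>."""
    return np.vdot(psi, M @ psi)

def std(M):
    """Standard deviation Delta M in the state psi."""
    return np.sqrt(ev(M @ M).real - ev(M).real**2)

lhs = std(A) * std(B)
rhs = 0.5 * abs(ev(A @ B - B @ A))
print(lhs, rhs, lhs >= rhs)   # the inequality should hold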

But now for the fun part: entropy!

Whenever you have an observable A and a state \psi, you get a probability distribution: the distribution of outcomes when you measure that observable in that state. And this probability distribution has an entropy! Let’s call the entropy S(A). I’ll define it a bit more carefully later.

But the point is: this entropy is really a very nice way to think about our uncertainty, or ignorance, of the observable A. It’s better, in many ways, than the standard deviation. For example, it doesn’t change if we multiply A by 2. The standard deviation doubles, but we’re not twice as ignorant!

Entropy is invariant under lots of transformations of our observables. So we should want an uncertainty principle that only involves entropy. And here it is, the entropic uncertainty principle:

S(A) + S(B) \ge \mathrm{log} \, \frac{1}{c}

Here c is defined as follows. To keep things simple, suppose that A is nondegenerate, meaning that all its eigenvalues are distinct. If it’s not, we can tweak it a tiny bit and it will be. Let its eigenvectors be called \phi_i. Similarly, suppose B is nondegenerate and call its eigenvectors \chi_j. Then we let

c = \mathrm{max}_{i,j} |\langle \phi_i, \chi_j \rangle|^2

Note this becomes 1 when there’s an eigenvector of A that’s also an eigenvector of B. In this case it’s possible to find a state where we know both observables precisely, and in this case also

\mathrm{log}\, \frac{1}{c} = 0

And that makes sense: in this case S(A) + S(B), which measures our ignorance of both observables, is indeed zero.

But if there’s no eigenvector of A that’s also an eigenvector of B, then c is smaller than 1, so

\mathrm{log} \, \frac{1}{c} > 0

so the entropic uncertainty principle says we really must have some ignorance about either A or B (or both).

So the entropic uncertainty principle makes intuitive sense. But let me define the entropy S(A), to make the principle precise. If \phi_i are the eigenvectors of A, the probabilities of getting various outcomes when we measure A in the state \psi are

p_i = |\langle \phi_i, \psi \rangle|^2

So, we define the entropy by

S(A) = - \sum_i p_i \; \mathrm{log}\, p_i

Here you can use any base for your logarithm, as long as you’re consistent. Mathematicians and physicists use e, while computer scientists, who prefer integers, settle for the best known integer approximation: 2.

Just kidding! Darn — now I’ve insulted all the computer scientists. I hope none of them reads this.
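
If you want to play with this numerically, here’s a sketch that picks a random state and two made-up nondegenerate observables, computes S(A), S(B) and c as defined above (with natural logarithms), and checks the inequality:

import numpy as np

rng = np.random.default_rng(1)

# Two made-up self-adjoint observables: A is diagonal with distinct eigenvalues,
# B is a random Hermitian matrix (generically nondegenerate).
A = np.diag([1.0, 2.0, 3.0])
M = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
B = (M + M.conj().T) / 2

# A random unit vector as our state psi.
psi = rng.normal(size=3) + 1j * rng.normal(size=3)
psi /= np.linalg.norm(psi)

def entropy(obs):
    """Entropy of the outcome distribution when measuring obs in the state psi."""
    _, vecs = np.linalg.eigh(obs)
    p = np.abs(vecs.conj().T @ psi) ** 2
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

# c = max_{i,j} |<phi_i, chi_j>|^2 over eigenvectors of A and B.
_, phiA = np.linalg.eigh(A)
_, chiB = np.linalg.eigh(B)
c = np.max(np.abs(phiA.conj().T @ chiB) ** 2)

print(entropy(A) + entropy(B), np.log(1 / c))   # left side should be >= right side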

Who came up with this entropic uncertainty principle? I’m not an expert on this, so I’ll probably get this wrong, but I gather it came from an idea of Deutsch:

• David Deutsch, Uncertainty in quantum measurements, Phys. Rev. Lett. 50 (1983), 631-633.

Then it got improved and formulated as a conjecture by Kraus:

• K. Kraus, Complementary observables and uncertainty relations, Phys. Rev. D 35 (1987), 3070-3075.

and then that conjecture was proved here:

• H. Maassen and J. B. Uffink, Generalized entropic uncertainty relations, Phys. Rev. Lett. 60 (1988), 1103-1106.

The paper I found in the lecture hall proves a more refined version where the system being measured — let’s call it X — is entangled with the observer’s memory apparatus — let’s call it O. In this situation they show

S(A|O) + S(B|O) \ge S(X|O) + \mathrm{log} \, \frac{1}{c}

where I’m using a concept of “conditional entropy”: the entropy of something given something else. Here’s their abstract:

The uncertainty principle, originally formulated by Heisenberg, clearly illustrates the difference between classical and quantum mechanics. The principle bounds the uncertainties about the outcomes of two incompatible measurements, such as position and momentum, on a particle. It implies that one cannot predict the outcomes for both possible choices of measurement to arbitrary precision, even if information about the preparation of the particle is available in a classical memory. However, if the particle is prepared entangled with a quantum memory, a device that might be available in the not-too-distant future, it is possible to predict the outcomes for both measurement choices precisely. Here, we extend the uncertainty principle to incorporate this case, providing a lower bound on the uncertainties, which depends on the amount of entanglement between the particle and the quantum memory. We detail the application of our result to witnessing entanglement and to quantum key distribution.

By the way, on a really trivial note…

My wisecrack about 2 being the best known integer approximation to e made me wonder: since 3 is actually closer to e, are there some applications where ternary digits would theoretically be better than binary ones? I’ve heard of "trits" but I don’t actually know any applications where they’re optimal.

Oh — here’s one.

