Information Geometry (Part 7)

2 March, 2011

Today, I want to describe how the Fisher information metric is related to relative entropy. I’ve explained both these concepts separately (click the links for details); now I want to put them together.

But first, let me explain what this whole series of blog posts is about. Information geometry, obviously! But what’s that?

Information geometry is the geometry of ‘statistical manifolds’. Let me explain that concept twice: first vaguely, and then precisely.

Vaguely speaking, a statistical manifold is a manifold whose points are hypotheses about some situation. For example, suppose you have a coin. You could have various hypotheses about what happens when you flip it. For example: you could hypothesize that the coin will land heads up with probability x, where x is any number between 0 and 1. This makes the interval [0,1] into a statistical manifold. Technically this is a manifold with boundary, but that’s okay.

Or, you could have various hypotheses about the IQs of American politicians. For example: you could hypothesize that they’re distributed according to a Gaussian probability distribution with mean x and standard deviation y. This makes the space of pairs (x,y) into a statistical manifold. Of course we require y \ge 0, which gives us a manifold with boundary. We might also want to assume x \ge 0, which would give us a manifold with corners, but that’s okay too. We’re going to be pretty relaxed about what counts as a ‘manifold’ here.

If we have a manifold whose points are hypotheses about some situation, we say the manifold ‘parametrizes’ these hypotheses. So, the concept of statistical manifold is fundamental to the subject known as parametric statistics.

Parametric statistics is a huge subject! You could say that information geometry is the application of geometry to this subject.

But now let me go ahead and make the idea of ‘statistical manifold’ more precise. There’s a classical and a quantum version of this idea. I’m working at the Centre for Quantum Technologies, so I’m being paid to be quantum—but today I’m in a classical mood, so I’ll only describe the classical version. Let’s say a classical statistical manifold is a smooth function p from a manifold M to the space of probability distributions on some measure space \Omega.

We should think of \Omega as a space of events. In our first example, it’s just \{H, T\}: we flip a coin and it lands either heads up or tails up. In our second it’s \mathbb{R}: we measure the IQ of an American politician and get some real number.

We should think of M as a space of hypotheses. For each point x \in M, we have a probability distribution p_x on \Omega. This is a hypothesis about the events in question: for example “when I flip the coin, there’s a 55% chance that it will land heads up”, or “when I measure the IQ of an American politician, the answer will be distributed according to a Gaussian with mean 0 and standard deviation 100.”

Now, suppose someone hands you a classical statistical manifold (M,p). Each point in M is a hypothesis. Apparently some hypotheses are more similar than others. It would be nice to make this precise. So, you might like to define a metric on M that says how ‘far apart’ two hypotheses are. People know lots of ways to do this; the challenge is to find ways that have clear meanings.

Last time I explained the concept of relative entropy. Suppose we have two probability distributions on \Omega, say p and q. Then the entropy of p relative to q is the amount of information you gain when you start with the hypothesis q but then discover that you should switch to the new improved hypothesis p. It equals:

\int_\Omega \; \frac{p}{q} \; \ln(\frac{p}{q}) \; q d \omega

You could try to use this to define a distance between points x and y in our statistical manifold, like this:

S(x,y) =  \int_\Omega \; \frac{p_x}{p_y} \; \ln(\frac{p_x}{p_y}) \; p_y d \omega

This is definitely an important function. Unfortunately, as I explained last time, it doesn’t obey the axioms that a distance function should! Worst of all, it doesn’t obey the triangle inequality.

Can we ‘fix’ it? Yes, we can! And when we do, we get the Fisher information metric, which is actually a Riemannian metric on M. Suppose we put local coordinates on some patch of M containing the point x. Then the Fisher information metric is given by:

g_{ij}(x) = \int_\Omega  \partial_i (\ln p_x) \; \partial_j (\ln p_x) \; p_x d \omega

You can think of my whole series of articles so far as an attempt to understand this funny-looking formula. I’ve shown how to get it from a few different starting-points, most recently back in Part 3. But now let’s get it starting from relative entropy!
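Before diving in, here’s a quick numerical sanity check of that funny-looking formula (a minimal Python sketch; the function name and the finite-difference shortcut are mine). For the coin example, where p_x = (x, 1-x) on \Omega = \{H, T\}, the formula gives g(x) = 1/x + 1/(1-x) = 1/(x(1-x)):

```python
import numpy as np

def fisher_bernoulli(x, dx=1e-6):
    """Fisher information for the coin family p_x = (x, 1 - x), computed
    straight from g = sum over omega of (d/dx ln p_x)^2 p_x, with a central
    finite difference standing in for the derivative."""
    p = np.array([x, 1 - x])
    dlogp = (np.log([x + dx, 1 - (x + dx)]) -
             np.log([x - dx, 1 - (x - dx)])) / (2 * dx)
    return np.sum(dlogp**2 * p)

x = 0.3
print(fisher_bernoulli(x))   # ~ 4.7619
print(1 / (x * (1 - x)))     # exact: 1/(x(1-x))
```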

Fix any point in our statistical manifold and choose local coordinates for which this point is the origin, 0. The amount of information we gain if we move to some other point x is the relative entropy S(x,0). But what’s this like when x is really close to 0? We can imagine doing a Taylor series expansion of S(x,0) to answer this question.

Surprisingly, to first order the answer is always zero! Mathematically:

\partial_i S(x,0)|_{x = 0} = 0

In plain English: if you change your mind slightly, you learn a negligible amount — not an amount proportional to how much you changed your mind.

This must have some profound significance. I wish I knew what. Could it mean that people are reluctant to change their minds except in big jumps?

Anyway, if you think about it, this fact makes it obvious that S(x,y) can’t obey the triangle inequality. S(x,y) could be pretty big, but if we draw a curve from x to y, and mark n closely spaced points x_i on this curve, then S(x_{i+1}, x_i) is zero to first order in the spacing, so it must be of order 1/n^2, so if the triangle inequality were true we’d have

S(x,y) \le \sum_i S(x_{i+1},x_i) \le \mathrm{const} \, n \cdot \frac{1}{n^2}

for all n, which is a contradiction.
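Here’s a quick numerical illustration of this scaling (a Python sketch using the coin example; relative entropy is in nats). The one-jump quantity S(x,y) stays put, while the sum along a chain of n little steps shrinks like \mathrm{const}/n:

```python
import numpy as np

def S(a, b):
    """Relative entropy between coins landing heads with probability a and b."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

x, y = 0.9, 0.1
for n in [10, 100, 1000]:
    pts = np.linspace(x, y, n + 1)    # chop the path from x to y into n little steps
    chain = sum(S(pts[i + 1], pts[i]) for i in range(n))
    print(n, S(x, y), chain)          # S(x,y) ~ 1.76 throughout; chain ~ const/n
```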

In plain English: if you change your mind in one big jump, the amount of information you gain is more than the sum of the amounts you’d gain if you change your mind in lots of little steps! This seems pretty darn strange, but the paper I mentioned in part 1 helps:

• Gavin E. Crooks, Measuring thermodynamic length.

You’ll see he takes a curve and chops it into lots of little pieces as I just did, and explains what’s going on.

Okay, so what about second order? What’s

\partial_i \partial_j S(x,0)|_{x = 0} ?

Well, this is the punchline of this blog post: it’s the Fisher information metric:

\partial_i \partial_j S(x,0)|_{x = 0} = g_{ij}

And since the Fisher information metric is a Riemannian metric, we can then apply the usual recipe and define distances in a way that obeys the triangle inequality. Crooks calls this distance thermodynamic length in the special case that he considers, and he explains its physical meaning.

Now let me prove that

\partial_i S(x,0)|_{x = 0} = 0

and

\partial_i \partial_j S(x,0)|_{x = 0} = g_{ij}

This can be somewhat tedious if you do it by straightforwardly grinding it out—I know, I did it. So let me show you a better way, which requires more conceptual acrobatics but less brute force.

The trick is to work with the universal statistical manifold for the measure space \Omega. Namely, we take M to be the space of all probability distributions on \Omega! This is typically an infinite-dimensional manifold, but that’s okay: we’re being relaxed about what counts as a manifold here. In this case, we don’t need to write p_x for the probability distribution corresponding to the point x \in M. In this case, a point of M just is a probability distribution on \Omega, so we’ll just call it p.

If we can prove the formulas for this universal example, they’ll automatically follow for every other example, by abstract nonsense. Why? Because any statistical manifold with measure space \Omega is the same as a manifold with a smooth map to the universal statistical manifold! So, geometrical structures on the universal one ‘pull back’ to give structures on all the rest. The Fisher information metric and the function S can be defined as pullbacks in this way! So, to study them, we can just study the universal example.

(If you’re familiar with ‘classifying spaces for bundles’ or other sorts of ‘classifying spaces’, all this should seem awfully familiar. It’s a standard math trick.)

So, let’s prove that

\partial_i S(x,0)|_{x = 0} = 0

by proving it in the universal example. Given any probability distribution q, and taking a nearby probability distribution p, we can write

\frac{p}{q} = 1 + f

where f is some small function. We only need to show that S(p,q) is zero to first order in f. And this is pretty easy. By definition:

S(p,q) =  \int_\Omega \; \frac{p}{q} \, \ln(\frac{p}{q}) \; q d \omega

or in other words,

S(p,q) =  \int_\Omega \; (1 + f) \, \ln(1 + f) \; q d \omega

We can calculate this to first order in f and show we get zero. But let’s actually work it out to second order, since we’ll need that later:

\ln (1 + f) = f - \frac{1}{2} f^2 + \cdots

so

(1 + f) \, \ln (1+ f) = f + \frac{1}{2} f^2 + \cdots

so

\begin{aligned} S(p,q) &= \int_\Omega \; (1 + f) \; \ln(1 + f) \; q d \omega \\ &= \int_\Omega f \, q d \omega + \frac{1}{2} \int_\Omega f^2\, q d \omega + \cdots \end{aligned}

Why does this vanish to first order in f? It’s because p and q are both probability distributions and p/q = 1 + f, so

\int_\Omega (1 + f) \, q d\omega = \int_\Omega p d\omega = 1

but also

\int_\Omega q d\omega = 1

so subtracting we see

\int_\Omega f \, q d\omega = 0

So, S(p,q) vanishes to first order in f. Voilà!
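If you’d like to see this numerically, here’s a little sketch (Python; the three-outcome space and the particular f are arbitrary choices of mine). Pick f with \int f \, q d\omega = 0, set p = (1 + \epsilon f) q, and watch S(p,q) shrink quadratically in \epsilon, matching the second-order term \frac{1}{2} \int (\epsilon f)^2 q d\omega:

```python
import numpy as np

q = np.array([0.2, 0.3, 0.5])    # a probability distribution on three outcomes
f = np.array([1.0, 1.0, -1.0])
f = f - np.sum(f * q)            # enforce: the integral of f q vanishes

for eps in [1e-1, 1e-2, 1e-3]:
    p = (1 + eps * f) * q        # p/q = 1 + eps f, and p still sums to 1
    S = np.sum(p * np.log(p / q))
    print(eps, S, 0.5 * np.sum((eps * f)**2 * q))   # quadratic in eps, not linear
```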

Next let’s prove the more interesting formula:

\partial_i \partial_j S(x,0)|_{x = 0} = g_{ij}

which relates relative entropy to the Fisher information metric. Since both sides are symmetric matrices, it suffices to show their diagonal entries agree in any coordinate system:

\partial^2_i S(x,0)|_{x = 0} = g_{ii}

Devoted followers of this series of posts will note that I keep using this trick, which takes advantage of the polarization identity.

To prove

\partial^2_i S(x,0)|_{x = 0} = g_{ii}

it’s enough to consider the universal example. We take the origin to be some probability distribution q and take x to be a nearby probability distribution p which is pushed a tiny bit in the ith coordinate direction. As before we write p/q = 1 + f. We look at the second-order term in our formula for S(p,q):

\frac{1}{2} \int_\Omega f^2\, q d \omega

Using the usual second-order Taylor’s formula, which has a \frac{1}{2} built into it, we can say

\partial^2_i S(x,0)|_{x = 0} = \int_\Omega f^2\, q d \omega

On the other hand, our formula for the Fisher information metric gives

g_{ii} = \left. \int_\Omega  \partial_i \ln p \; \partial_i \ln p \; q d \omega \right|_{p=q}

The right hand sides of the last two formulas look awfully similar! And indeed they agree, because we can show that

\left. \partial_i \ln p \right|_{p = q} = f

How? Well, we assumed that p is what we get by taking q and pushing it a little bit in the ith coordinate direction; we have also written that little change as

p/q = 1 + f

for some small function f. So,

\partial_i (p/q) = f

and thus:

\partial_i p = f q

and thus:

\partial_i \ln p = \frac{\partial_i p}{p} = \frac{fq}{p}

so

\left. \partial_i \ln p \right|_{p=q} = f

as desired.

This argument may seem a little hand-wavy and nonrigorous, with words like ‘a little bit’. If you’re used to taking arguments involving infinitesimal changes and translating them into calculus (or differential geometry), it should make sense. If it doesn’t, I apologize. It’s easy to make it more rigorous, but only at the cost of more annoying notation, which doesn’t seem good in a blog post.
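Still, a numerical check might reassure you (a Python sketch for the coin family, whose Fisher metric is g_{11} = 1/(x_0(1 - x_0))). Take the second derivative of S(x, x_0) at x = x_0 by finite differences and compare:

```python
import numpy as np

def S(a, b):
    """Relative entropy between coins with heads-probabilities a and b."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

x0, h = 0.3, 1e-4
hess = (S(x0 + h, x0) - 2 * S(x0, x0) + S(x0 - h, x0)) / h**2
print(hess)                  # ~ 4.7619
print(1 / (x0 * (1 - x0)))   # the Fisher metric of the coin family at x0
```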

Boring technicalities

If you’re actually the kind of person who reads a section called ‘boring technicalities’, I’ll admit to you that my calculations don’t make sense if the integrals diverge, or we’re dividing by zero in the ratio p/q. To avoid these problems, here’s what we should do. Fix a \sigma-finite measure space (\Omega, d\omega). Then, define the universal statistical manifold to be the space P(\Omega,d \omega) consisting of all probability measures that are equivalent to d\omega, in the usual sense of measure theory. By the Radon–Nikodym theorem, we can write any such measure as q d \omega where q \in L^1(\Omega, d\omega). Moreover, given two of these guys, say p d \omega and q d\omega, they are absolutely continuous with respect to each other, so we can write

p d \omega = \frac{p}{q} \; q d \omega

where the ratio p/q is well-defined almost everywhere and lies in L^1(\Omega, q d\omega). This is enough to guarantee that we’re never dividing by zero, and I think it’s enough to make sure all my integrals converge.

We do still need to make P(\Omega,d \omega) into some sort of infinite-dimensional manifold, to justify all the derivatives. There are various ways to approach this issue, all of which start from the fact that L^1(\Omega, d\omega) is a Banach space, which is about the nicest sort of infinite-dimensional manifold one could imagine. Sitting in L^1(\Omega, d\omega) is the hyperplane consisting of functions q with

\int_\Omega q d\omega = 1

and this is a Banach manifold. To get P(\Omega,d \omega) we need to take a subspace of that hyperplane. If this subspace were open then P(\Omega,d \omega) would be a Banach manifold in its own right. I haven’t checked whether it is, for a couple of reasons.

For one thing, there’s a nice theory of ‘diffeological spaces’, which generalize manifolds. Every Banach manifold is a diffeological space, and every subset of a diffeological space is again a diffeological space. For many purposes we don’t need our ‘statistical manifolds’ to be manifolds: diffeological spaces will do just fine. This is one reason why I’m being pretty relaxed here about what counts as a ‘manifold’.

For another, I know that people have worked out a lot of this stuff, so I can just look things up when I need to. And so can you! This book is a good place to start:

• Paolo Gibilisco, Eva Riccomagno, Maria Piera Rogantin and Henry P. Wynn, Algebraic and Geometric Methods in Statistics, Cambridge U. Press, Cambridge, 2009.

I find the chapters by Raymond Streater especially congenial. For the technical issue I’m talking about now it’s worth reading section 14.2, “Manifolds modelled by Orlicz spaces”, which tackles the problem of constructing a universal statistical manifold in a more sophisticated way than I’ve just done. And in chapter 15, “The Banach manifold of quantum states”, he tackles the quantum version!


Information Geometry (Part 6)

21 January, 2011

So far, my thread on information geometry hasn’t said much about information. It’s time to remedy that.

I’ve been telling you about the Fisher information metric. In statistics this is a nice way to define a ‘distance’ between two probability distributions. But it also has a quantum version.

So far I’ve shown you how to define the Fisher information metric in three equivalent ways. I also showed that in the quantum case, the Fisher information metric is the real part of a complex-valued thing. The imaginary part is related to the uncertainty principle.

You can see it all here:

Part 1     • Part 2     • Part 3     • Part 4     • Part 5

But there’s yet another way to define the Fisher information metric, which really involves information.

To explain this, I need to start with the idea of ‘information gain’, or ‘relative entropy’. And it looks like I should do a whole post on this.

So:

Suppose that \Omega is a measure space — that is, a space you can do integrals over. By a probability distribution on \Omega, I’ll mean a nonnegative function

p : \Omega \to \mathbb{R}

whose integral is 1. Here d \omega is my name for the measure on \Omega. Physicists might call \Omega the ‘phase space’ of some classical system, but probability theorists might call it a space of ‘events’. Today I’ll use the probability theorist’s language. The idea here is that

\int_A \; p(\omega) \; d \omega

gives the probability that when an event happens, it’ll be one in the subset A \subseteq \Omega. That’s why we want

p \ge 0

Probabilities are supposed to be nonnegative. And that’s also why we want

\int_\Omega \; p(\omega) \; d \omega = 1

This says that the probability of some event happening is 1.

Now, suppose we have two probability distributions on \Omega, say p and q. The information gain as we go from q to p is

S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

We also call this the entropy of p relative to q. It says how much information you learn if you discover that the probability distribution of an event is p, if before you had thought it was q.

I like relative entropy because it’s related to the Bayesian interpretation of probability. The idea here is that you can’t really ‘observe’ probabilities as frequencies of events, except in some unattainable limit where you repeat an experiment over and over infinitely many times. Instead, you start with some hypothesis about how likely things are: a probability distribution called the prior. Then you update this using Bayes’ rule when you gain new information. The updated probability distribution — your new improved hypothesis — is called the posterior.

And if you don’t do the updating right, you need a swift kick in the posterior!

So, we can think of q as the prior probability distribution, and p as the posterior. Then S(p,q) measures the amount of information that caused you to change your views.

For example, suppose you’re flipping a coin, so your set of events is just

\Omega = \{ \mathrm{heads}, \mathrm{tails} \}

In this case all the integrals are just sums with two terms. Suppose your prior assumption is that the coin is fair. Then

q(\mathrm{heads}) = 1/2, \; q(\mathrm{tails}) = 1/2

But then suppose someone you trust comes up and says “Sorry, that’s a trick coin: it always comes up heads!” So you update your probability distribution and get this posterior:

p(\mathrm{heads}) = 1, \; p(\mathrm{tails}) = 0

How much information have you gained? Or in other words, what’s the relative entropy? It’s this:

S(p,q) = \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega = 1 \cdot \log(\frac{1}{1/2}) + 0 \cdot \log(\frac{0}{1/2}) = 1

Here I’m doing the logarithm in base 2, and you’re supposed to know that in this game 0 \log 0 = 0.

So: you’ve learned one bit of information!

That’s supposed to make perfect sense. On the other hand, the reverse scenario takes a bit more thought.

You start out feeling sure that the coin always lands heads up. Then someone you trust says “No, that’s a perfectly fair coin.” If you work out the amount of information you learned this time, you’ll see it’s infinite.

Why is that?

The reason is that something that you thought was impossible — the coin landing tails up — turned out to be possible. In this game, it counts as infinitely shocking to learn something like that, so the information gain is infinite. If you hadn’t been so darn sure of yourself — if you had just believed that the coin almost always landed heads up — your information gain would be large but finite.
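Here’s the arithmetic for all three scenarios in one go (a Python sketch; the convention 0 \log 0 = 0 is built in):

```python
import numpy as np

def S(p, q):
    """Relative entropy in bits, with the convention 0 log 0 = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(p > 0, p * np.log2(p / q), 0.0)
    return terms.sum()

fair, trick = [0.5, 0.5], [1.0, 0.0]
print(S(trick, fair))            # 1.0: exactly one bit, as computed above
print(S(fair, [0.999, 0.001]))   # ~ 3.98: large but finite, you weren't quite sure
print(S(fair, trick))            # inf: the 'impossible' turned out to be possible
```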

The Bayesian philosophy is built into the concept of information gain, because information gain depends on two things: the prior and the posterior. And that’s just as it should be: you can only say how much you learned if you know what you believed beforehand!

You might say that information gain depends on three things: p, q and the measure d \omega. And you’d be right! Unfortunately, the notation S(p,q) is a bit misleading. Information gain really does depend on just two things, but these things are not p and q: they’re p(\omega) d\omega and q(\omega) d\omega. These are called probability measures, and they’re ultimately more important than the probability distributions p and q.

To see this, take our information gain:

\int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

and juggle it ever so slightly to get this:

\int_\Omega \;  \log(\frac{p(\omega) d\omega}{q(\omega)d \omega}) \; p(\omega) d \omega

Clearly this depends only on p(\omega) d\omega and q(\omega) d\omega. Indeed, it’s good to work directly with these probability measures and give them short names, like

d\mu = p(\omega) d \omega

d\nu = q(\omega) d \omega

Then the formula for information gain looks more slick:

\int_\Omega \; \log(\frac{d\mu}{d\nu}) \; d\mu

And by the way, in case you’re wondering, the d here doesn’t actually mean much: we’re just so brainwashed into wanting a d x in our integrals that people often use d \mu for a measure even though the simpler notation \mu might be more logical. So, the function

\frac{d\mu}{d\nu}

is really just a ratio of probability measures, but people call it a Radon-Nikodym derivative, because it looks like a derivative (and in some important examples it actually is). So, if I were talking to myself, I could have shortened this blog entry immensely by working directly with probability measures, leaving out the d‘s, and saying:

Suppose \mu and \nu are probability measures; then the entropy of \mu relative to \nu, or information gain, is

S(\mu, \nu) =  \int_\Omega \; \log(\frac{\mu}{\nu}) \; \mu

But I’m under the impression that people are actually reading this stuff, and that most of you are happier with functions than measures. So, I decided to start with

S(p,q) =  \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

and then gradually work my way up to the more sophisticated way to think about relative entropy! But having gotten that off my chest, now I’ll revert to the original naive way.
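Before I do, here’s a numerical illustration of what I just claimed (a Python sketch; the particular Gaussians and the change of variables are arbitrary choices of mine). Change coordinates on the real line and the densities pick up Jacobian factors, but the information gain doesn’t budge, because it only depends on the probability measures:

```python
import numpy as np
from scipy.integrate import quad

# Two probability densities on the real line: a standard Gaussian, and a
# Gaussian with mean 1 and standard deviation 2.
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-(x - 1)**2 / 8) / np.sqrt(8 * np.pi)

def kl(f, g, a, b):
    """Integral of f log(f/g) over [a,b], skipping points where f underflows."""
    integrand = lambda t: f(t) * np.log(f(t) / g(t)) if f(t) > 0 else 0.0
    return quad(integrand, a, b, limit=200)[0]

S_x = kl(p, q, -np.inf, np.inf)

# Change coordinates: y = tanh(x), a smooth bijection from the line to (-1,1).
# The densities transform with the Jacobian dx/dy = 1/(1 - y^2)...
pt = lambda y: p(np.arctanh(y)) / (1 - y**2)
qt = lambda y: q(np.arctanh(y)) / (1 - y**2)
S_y = kl(pt, qt, -1, 1)

# ...but the relative entropy comes out the same.
print(S_x, S_y)   # both ~ 0.4431 nats
```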

As a warmup for next time, let me pose a question. How much is this quantity

S(p,q) =  \int_\Omega \; p(\omega) \log(\frac{p(\omega)}{q(\omega)}) \; d \omega

like a distance between probability distributions? A distance function, or metric, is supposed to satisfy some axioms. Alas, relative entropy satisfies some of these, but not the most interesting one!

• If you’ve got a metric, the distance between points should always be nonnegative. Indeed, this holds:

S(p,q) \ge 0

So, we never learn a negative amount when we update our prior, at least according to this definition. It’s a fun exercise to prove this inequality, at least if you know some tricks involving inequalities and convex functions — otherwise it might be hard.

• If you’ve got a metric, the distance between two points should only be zero if they’re really the same point. In fact,

S(p,q) = 0

if and only if

p d\omega = q d \omega

It’s possible to have p d\omega = q d \omega even if p \ne q, because d \omega can be zero somewhere. But this is just more evidence that we should really be talking about the probability measure p d \omega instead of the probability distribution p. If we do that, we’re okay so far!

• If you’ve got a metric, the distance from your first point to your second point is the same as the distance from the second to the first. Alas,

S(p,q) \ne S(q,p)

in general. We already saw this in our example of the flipped coin. This is a slight bummer, but I could live with it, since Lawvere has already shown that it’s wise to generalize the concept of metric by dropping this axiom.

• If you’ve got a metric, it obeys the triangle inequality. This is the really interesting axiom, and alas, this too fails. Later we’ll see why.

So, relative entropy does a fairly miserable job of acting like a distance function. People call it a divergence. In fact, they often call it the Kullback-Leibler divergence. I don’t like that, because ‘the Kullback-Leibler divergence’ doesn’t really explain the idea: it sounds more like the title of a bad spy novel. ‘Relative entropy’, on the other hand, makes a lot of sense if you understand entropy. And ‘information gain’ makes sense if you understand information.

Anyway: how can we save this miserable attempt to get a distance function on the space of probability distributions? Simple: take its matrix of second derivatives and use that to define a Riemannian metric g_{ij}. This Riemannian metric in turn defines a metric of the more elementary sort we’ve been discussing today.

And this Riemannian metric is the Fisher information metric I’ve been talking about all along!

More details later, I hope.


Information Geometry (Part 5)

2 November, 2010

I’m trying to understand the Fisher information metric and how it’s related to Öttinger’s formalism for ‘dissipative mechanics’ — that is, mechanics including friction. They involve similar physics, and they involve similar math, but it’s not quite clear how they fit together.

I think it will help to do an example. The harmonic oscillator is a trusty workhorse throughout physics, so let’s do that.

So: suppose you have a rock hanging on a spring, and it can bounce up and down. Suppose it’s in thermal equilibrium with its environment. It will wiggle up and down ever so slightly, thanks to thermal fluctuations. The hotter it is, the more it wiggles. These vibrations are random, so its position and momentum at any given moment can be treated as random variables.

If we take quantum mechanics into account, there’s an extra source of randomness: quantum fluctuations. Now there will be fluctuations even at zero temperature. Ultimately this is due to the uncertainty principle. Indeed, if you know the position for sure, you can’t know the momentum at all!

Let’s see how the position, momentum and energy of our rock will fluctuate given that we know all three of these quantities on average. The fluctuations will form a little fuzzy blob, roughly ellipsoidal in shape, in the 3-dimensional space whose coordinates are position, momentum and energy.

Yeah, I know you’re sick of this picture, but this time it’s for real: I want to calculate what this ellipsoid actually looks like! I’m not promising I’ll do it — I may get stuck, or bored — but at least I’ll try.

Before I start the calculation, let’s guess the answer. A harmonic oscillator has a position q and momentum p, and its energy is

H = \frac{1}{2}(q^2 + p^2)

Here I’m working in units where lots of things equal 1, to keep things simple.

You’ll notice that this energy has rotational symmetry in the position-momentum plane. This is ultimately what makes the harmonic oscillator such a beloved physical system. So, we might naively guess that our little ellipsoid will have rotational symmetry as well. Picture the x and y coordinates as position and momentum, while the z coordinate stands for energy: in such a rotationally symmetric ellipsoid, the position and momentum fluctuations are the same size, while the energy fluctuations, drawn in the vertical direction, might be bigger or smaller.

Unfortunately, this guess really is naive. After all, there are lots of these ellipsoids, one centered at each point in position-momentum-energy space. Remember the rules of the game! You give me any point in this space. I take the coordinates of this point as the mean values of position, momentum and energy, and I find the maximum-entropy state with these mean values. Then I work out the fluctuations in this state, and draw them as an ellipsoid.

If you pick a point where position and momentum have mean value zero, you haven’t broken the rotational symmetry of the problem. So, my ellipsoid must be rotationally symmetric. But if you pick some other mean value for position and momentum, all bets are off!

Fortunately, this naive guess is actually right: all the ellipsoids are rotationally symmetric — even the ones centered at nonzero values of position and momentum! We’ll see why soon. And if you’ve been following this series of posts, you’ll know what this implies: the “Fisher information metric” g on position-momentum-energy space has rotational symmetry about any vertical axis. (Again, I’m using the vertical direction for energy.) So, if we slice this space with any horizontal plane, the metric on this plane must be the plane’s usual metric times a constant:

g = \mathrm{constant} \, (dq^2 + dp^2)

Why? Because only the usual metric on the plane, or any multiple of it, has ordinary rotations around every point as symmetries.

So, roughly speaking, we’re recovering the ‘obvious’ geometry of the position-momentum plane from the Fisher information metric. We’re recovering ‘ordinary’ geometry from information geometry!

But this should not be terribly surprising, since we used the harmonic oscillator Hamiltonian

H = \frac{1}{2}(q^2 + p^2)

as an input to our game. It’s mainly just a confirmation that things are working as we’d hope.

There’s more, though. Last time I realized that because observables in quantum mechanics don’t commute, the Fisher information metric has a curious skew-symmetric partner called \omega. So, we should also study this in our example. And when we do, we’ll see that restricted to any horizontal plane in position-momentum-energy space, we get

\omega = \mathrm{constant} \, (dq \, dp - dp \, dq)

This looks like a mutant version of the Fisher information metric

g = \mathrm{constant} \, (dq^2 + dp^2)

and if you know your geometry, you’ll know it’s the usual ‘symplectic structure’ on the position-momentum plane — at least, times some constant.

All this is very reminiscent of Öttinger’s work on dissipative mechanics. But we’ll also see something else: while the constant in g depends on the energy — that is, on which horizontal plane we take — the constant in \omega does not!

Why? It’s perfectly sensible. The metric g on our horizontal plane keeps track of fluctuations in position and momentum. Thermal fluctuations get bigger when it’s hotter — and to boost the average energy of our oscillator, we must heat it up. So, as we increase the energy, moving our horizontal plane further up in position-momentum-energy space, the metric on the plane gets bigger! In other words, our ellipsoids get a fat cross-section at high energies.

On the other hand, the symplectic structure \omega arises from the fact that position q and momentum p don’t commute in quantum mechanics. They obey Heisenberg’s ‘canonical commutation relation’:

q p - p q = i

This relation doesn’t involve energy, so \omega will be the same on every horizontal plane. And it turns out this relation implies

\omega  = \mathrm{constant} \, (dq \, dp - dp \, dq)

for some constant we’ll compute later.

Okay, that’s the basic idea. Now let’s actually do some computations. For starters, let’s see why all our ellipsoids have rotational symmetry!

To do this, we need to understand a bit about the mixed state \rho that maximizes entropy given certain mean values of position, momentum and energy. So, let’s choose the numbers we want for these mean values (also known as ‘expected values’ or ‘expectation values’):

\langle H \rangle = E

\langle q \rangle = q_0

\langle p \rangle = p_0

I hope this isn’t too confusing: H, p, q are our observables which are operators, while E, p_0, q_0 are the mean values we have chosen for them. The state \rho depends on E, p_0 and q_0.

We’re doing quantum mechanics, so position q and momentum p are both self-adjoint operators on the Hilbert space L^2(\mathbb{R}):

(q\psi)(x) = x \psi(x)

(p\psi)(x) = - i \frac{d \psi}{dx}(x)

Indeed all our observables, including the Hamiltonian

H = \frac{1}{2} (p^2 + q^2)

are self-adjoint operators on this Hilbert space, and the state \rho is a density matrix on this space, meaning a positive self-adjoint operator with trace 1.

Now: how do we compute \rho? It’s a Lagrange multiplier problem: maximizing some function given some constraints. And it’s well-known that when you solve this problem, you get

\rho = \frac{1}{Z} e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)}

where \lambda^1, \lambda^2, \lambda^3 are three numbers we yet have to find, and Z is a normalizing factor called the partition function:

Z = \mathrm{tr} (e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)} )

Now let’s look at a special case. If we choose \lambda^1 = \lambda^2 = 0, we’re back to a simpler and more famous problem, namely maximizing entropy subject to a constraint only on energy! The solution is then

\rho = \frac{1}{Z} e^{-\beta H} , \qquad Z = \mathrm{tr} (e^{- \beta H} )

Here I’m using the letter \beta instead of \lambda^3 because this is traditional. This quantity has an important physical meaning! It’s the reciprocal of temperature in units where Boltzmann’s constant is 1.

Anyway, back to our special case! In this special case it’s easy to explicitly calculate \rho and Z. Indeed, people have known how ever since Planck put the ‘quantum’ in quantum mechanics! He figured out how black-body radiation works. A box of hot radiation is just a big bunch of harmonic oscillators in thermal equilibrium. You can work out its partition function by multiplying the partition function of each one.

So, it would be great to reduce our general problem to this special case. To do this, let’s rewrite

Z = \mathrm{tr} (e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)} )

in terms of some new variables, like this:

\rho = \frac{1}{Z} e^{-\beta(H - f q - g p)}

where now

Z = \mathrm{tr} (e^{-\beta(H - f q - g p)} )

Think about it! Now our problem is just like an oscillator with a modified Hamiltonian

H' = H - f q - g p

What does this mean, physically? Well, if you push on something with a force f, its potential energy will pick up a term - f q. So, the first two terms are just the Hamiltonian for a harmonic oscillator with an extra force pushing on it!

I don’t know a nice interpretation for the - g p term. We could say that besides the extra force equal to f, we also have an extra ‘gorce’ equal to g. I don’t know what that means. Luckily, I don’t need to! Mathematically, our whole problem is invariant under rotations in the position-momentum plane, so whatever works for q must also work for p.

Now here’s the cool part. We can complete the square:

\begin{aligned} H' & = \frac{1}{2} (q^2 + p^2) -  f q - g p \\ &= \frac{1}{2}(q^2 - 2 q f + f^2) + \frac{1}{2}(p^2 - 2 p g + g^2) - \frac{1}{2}(g^2 + f^2)  \\ &= \frac{1}{2}((q - f)^2 + (p - g)^2)  - \frac{1}{2}(g^2 + f^2)  \end{aligned}

so if we define ‘translated’ position and momentum operators:

q' = q - f, \qquad p' = p - g

we have

H' = \frac{1}{2}({q'}^2 + {p'}^2) -  \frac{1}{2}(g^2 + f^2)

So: apart from a constant, H' is just the harmonic oscillator Hamiltonian in terms of ‘translated’ position and momentum operators!

In other words: we’re studying a strange variant of the harmonic oscillator, where we are pushing on it with an extra force and also an extra ‘gorce’. But this strange variant is exactly the same as the usual harmonic oscillator, except that we’re working in translated coordinates on position-momentum space, and subtracting a constant from the Hamiltonian.

These are pretty minor differences. So, we’ve succeeded in reducing our problem to the problem of a harmonic oscillator in thermal equilibrium at some temperature!

This makes it easy to calculate

Z = \mathrm{tr} (e^{-\beta(H - f q - g p)} ) = \mathrm{tr}(e^{-\beta H'})

By our formula for H', this is just

Z = e^{\beta(g^2 + f^2)/2} \; \mathrm{tr} (e^{-\frac{\beta}{2}({q'}^2 + {p'}^2)})

And the second factor here equals the partition function for the good old harmonic oscillator:

Z = e^{\beta(g^2 + f^2)/2} \; \mathrm{tr} (e^{-\beta H})

So now we’re back to a textbook problem. The eigenvalues of the harmonic oscillator Hamiltonian are

n + \frac{1}{2}

where

n = 0,1,2,3, \dots

So, the eigenvalues of e^{-\beta H} are just

e^{-\beta(n + \frac{1}{2})}

and to take the trace of this operator, we sum up these eigenvalues:

\mathrm{tr}(e^{-\beta H}) = \sum_{n = 0}^\infty e^{-\beta (n + \frac{1}{2})} = \frac{e^{-\beta/2}}{1 - e^{-\beta}}

So:

Z = e^{\beta(g^2 + f^2)/2} \; \frac{e^{-\beta/2}}{1 - e^{-\beta}}

We can now compute the Fisher information metric using this formula:

g_{ij} = \frac{\partial^2}{\partial \lambda^i \partial \lambda^j} \ln Z

if we remember how our new variables are related to the \lambda^i:

\lambda^1 = -\beta f , \qquad \lambda^2 = -\beta g, \qquad \lambda^3 = \beta

It’s just calculus! But I’m feeling a bit tired, so I’ll leave this pleasure to you.
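If you’d like just a taste of that pleasure, here’s a symbolic sketch (Python with sympy, and only for the \lambda^3 = \beta direction at f = g = 0): differentiating \ln Z twice with respect to \beta gives the variance of the energy, which is the g_{33} component at points where the mean position and momentum vanish.

```python
import sympy as sp

beta = sp.symbols('beta', positive=True)

# Partition function of the plain harmonic oscillator (f = g = 0)
Z = sp.exp(-beta / 2) / (1 - sp.exp(-beta))
lnZ = sp.log(Z)

mean_E = sp.simplify(-sp.diff(lnZ, beta))   # <H> = -d(ln Z)/d(beta)
g33 = sp.simplify(sp.diff(lnZ, beta, 2))    # the variance of the energy
print(mean_E)   # 1/2 + 1/(exp(beta) - 1), up to how sympy arranges it
print(g33)      # exp(beta)/(exp(beta) - 1)^2, up to how sympy arranges it
```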

For now, I’d rather go back to our basic intuition about how the Fisher information metric describes fluctuations of observables. Mathematically, this means it’s the real part of the covariance matrix

g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

where for us

X_1 = q, \qquad X_2 = p, \qquad X_3 = H

Here we are taking expected values using the mixed state \rho. We’ve seen this mixed state is just like the maximum-entropy state of a harmonic oscillator at fixed temperature — except for two caveats: we’re working in translated coordinates on position-momentum space, and subtracting a constant from the Hamiltonian. But neither of these two caveats affects the fluctuations (X_i - \langle X_i \rangle) or the covariance matrix.

So, as indeed we’ve already seen, g_{ij} has rotational symmetry in the 1-2 plane. Thus, we’ll completely know it once we know g_{11} = g_{22} and g_{33}; the other components are zero for symmetry reasons. g_{11} will equal the variance of position for a harmonic oscillator at a given temperature, while g_{33} will equal the variance of its energy. We can work these out or look them up.

I won’t do that now: I’m after insight, not formulas. For physical reasons, it’s obvious that g_{11} must diminish with diminishing energy — but not go to zero. Why? Well, as the temperature approaches zero, a harmonic oscillator in thermal equilibrium approaches its state of least energy: the so-called ‘ground state’. In its ground state, the standard deviations of position and momentum are as small as allowed by the Heisenberg uncertainty principle:

\Delta p \Delta q  \ge \frac{1}{2}

and they’re equal, so

g_{11} = (\Delta q)^2 = \frac{1}{2}.
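You can check the whole temperature dependence numerically (a Python sketch in a truncated number basis; the cutoff N is just an artifact of the truncation). The variance of position in the thermal state is \frac{1}{2}\coth(\beta/2): equal to \frac{1}{2} in the zero-temperature limit \beta \to \infty, and growing as the temperature rises, just as claimed:

```python
import numpy as np

def var_q(beta, N=200):
    """Variance of position in the thermal state of a harmonic oscillator,
    computed in a truncated number basis: q = (a + a†)/sqrt(2), H = n + 1/2."""
    n = np.arange(N)
    a = np.diag(np.sqrt(n[1:]), k=1)    # annihilation operator, truncated
    q = (a + a.T) / np.sqrt(2)
    w = np.exp(-beta * (n + 0.5))
    rho = np.diag(w / w.sum())          # thermal density matrix
    return np.trace(rho @ q @ q) - np.trace(rho @ q)**2

for beta in [0.5, 1.0, 2.0, 10.0]:
    print(beta, var_q(beta), 0.5 / np.tanh(beta / 2))   # matches (1/2) coth(beta/2)
```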

That’s enough about the metric. Now, what about the metric’s skew-symmetric partner? This is:

\omega_{ij} = \mathrm{Im} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

Last time we saw that \omega is all about expected values of commutators:

\omega_{ij} = \frac{1}{2i} \langle [X_i, X_j] \rangle

and this makes it easy to compute. For example,

[X_1, X_2] = q p - p q = i

so

\omega_{12} = \frac{1}{2}

Of course

\omega_{11} = \omega_{22} = 0

by skew-symmetry, so we know the restriction of \omega to any horizontal plane. We can also work out other components, like \omega_{13}, but I don’t want to. I’d rather just state this:

Summary: Restricted to any horizontal plane in the position-momentum-energy space, the Fisher information metric for the harmonic oscillator is

g = \mathrm{constant} (dq_0^2 + dp_0^2)

with a constant depending on the temperature, equalling \frac{1}{2} in the zero-temperature limit, and increasing as the temperature rises. Restricted to the same plane, the Fisher information metric’s skew-symmetric partner is

\omega = \frac{1}{2} dq_0 \wedge dp_0

(Remember, the mean values q_0, p_0, E_0 are the coordinates on position-momentum-energy space. We could also use coordinates f, g, \beta or f, g and temperature. In the chatty intro to this article you saw formulas like those above but without the subscripts; that’s before I got serious about using q and p to mean operators.)

And now for the moral. Actually I have two: a physics moral and a math moral.

First, what is the physical meaning of g or \omega when restricted to a plane of constant E_0, or if you prefer, a plane of constant temperature?

Physics Moral: Restricted to a constant-temperature plane, g is the covariance matrix for our observables. It is temperature-dependent. In the zero-temperature limit, the thermal fluctuations go away and g depends only on quantum fluctuations in the ground state. On the other hand, \omega restricted to a constant-temperature plane describes Heisenberg uncertainty relations for noncommuting observables. In our example, it is temperature-independent.

Second, what does this have to do with Kähler geometry? Remember, the complex plane has a complex-valued metric on it, called a Kähler structure. Its real part is a Riemannian metric, and its imaginary part is a symplectic structure. We can think of the complex plane as the position-momentum plane for a point particle. Then the symplectic structure is the basic ingredient needed for Hamiltonian mechanics, while the Riemannian structure is the basic ingredient needed for the harmonic oscillator Hamiltonian.

Math Moral: In the example we considered, \omega restricted to a constant-temperature plane is equal to \frac{1}{2} the usual symplectic structure on the complex plane. On the other hand, g restricted to a constant-temperature plane is a multiple of the usual Riemannian metric on the complex plane — but this multiple is \frac{1}{2} only when the temperature is zero! So, only at temperature zero are g and \omega the real and imaginary parts of a Kähler structure.

It will be interesting to see how much of this stuff is true more generally. The harmonic oscillator is much nicer than your average physical system, so it can be misleading, but I think some of the morals we’ve seen here can be generalized.

Some other time I may say more about how all this is related to Öttinger’s formalism, but the quick point is that he too has mixed states, and a symmetric g, and a skew-symmetric \omega. So it’s nice to see if they match up in an example.

Finally, two footnotes on terminology:

β: In fact, this quantity \beta = 1/kT is so important it deserves a better name than ‘reciprocal of temperature’. How about ‘coolness’? An important lesson from statistical mechanics is that coolness is more fundamental than temperature. This makes some facts more plausible. For example, if you say “you can never reach absolute zero,” it sounds very odd, since you can get as close as you like, and it’s even possible to get negative temperatures — but temperature zero remains tantalizingly out of reach. But “you can never attain infinite coolness” — now that makes sense.

Gorce: I apologize to Richard Feynman for stealing the word ‘gorce’ and using it a different way. Does anyone have a good intuition for what’s going on when you apply my sort of ‘gorce’ to a point particle? You need to think about velocity-dependent potentials, of that I’m sure. In the presence of a velocity-dependent potential, momentum is not just mass times velocity. Which is good: if it were, we could never have a system where the mean value of both q and p stayed constant over time!


Information Geometry (Part 4)

29 October, 2010

Before moving on, I’d like to clear up a mistake I’d been making in all my previous posts on this subject.

(By now I’ve tried to fix those posts, because people often get information from the web in a hasty way, and I don’t want my mistake to spread. But you’ll still see traces of my mistake infecting the comments on those posts.)

So what’s the mistake? It’s embarrassingly simple, but also simple to fix. A Riemannian metric must be symmetric:

g_{ij} = g_{ji}

Now, I had defined the Fisher information metric to be the so-called ‘covariance matrix’:

g_{ij} = \langle (X_i - \langle X_i \rangle) \;(X_j- \langle X_j \rangle)\rangle

where X_i are some observable-valued functions on a manifold M, and the angle brackets mean “expectation value”, computed using a mixed state \rho that also depends on the point in M.

The covariance matrix is symmetric in classical mechanics, since then observables commute, so:

\langle AB \rangle = \langle BA \rangle

But it’s not symmetric in quantum mechanics! After all, suppose q is the position operator for a particle, and p is the momentum operator. Then according to Heisenberg

qp = pq + i

in units where Planck’s constant is 1. Taking expectation values, we get:

\langle qp \rangle = \langle pq \rangle + i

and in particular:

\langle qp \rangle \ne \langle pq \rangle

We can use this to get examples where g_{ij} is not symmetric.

However, it turns out that the real part of the covariance matrix is symmetric, even in quantum mechanics — and that’s what we should use as our Fisher information metric.

Why is the real part of the covariance matrix symmetric, even in quantum mechanics? Well, suppose \rho is any density matrix, and A and B are any observables. Then by definition

\langle AB \rangle = \mathrm{tr} (\rho AB)

so taking the complex conjugate of both sides

\langle AB\rangle^*  = \mathrm{tr}(\rho AB)^* = \mathrm{tr}((\rho A B)^*) = \mathrm{tr}(B^* A^* \rho^*)

where I’m using an asterisk both for the complex conjugate of a number and the adjoint of an operator. But our observables are self-adjoint, and so is our density matrix, so we get

\mathrm{tr}(B^* A^* \rho^*) = \mathrm{tr}(B A \rho) = \mathrm{tr}(\rho B A) = \langle B A \rangle

where in the second step we used the cyclic property of the trace. In short:

\langle AB\rangle^* = \langle BA \rangle

If we take real parts, we get something symmetric:

\mathrm{Re} \langle AB\rangle =  \mathrm{Re} \langle BA \rangle

So, if we redefine the Fisher information metric to be the real part of the covariance matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) \; (X_j- \langle X_j \rangle)\rangle

then it’s symmetric, as it should be.

Last time I mentioned a general setup using von Neumann algebras, that handles the classical and quantum situations simultaneously. That applies here! Taking the real part has no effect in classical mechanics, so we don’t need it there — but it doesn’t hurt, either.

Taking the real part never has any effect when i = j, either, since the expected value of the square of an observable is a nonnegative number:

\langle (X_i - \langle X_i \rangle)^2 \rangle \ge 0

This has two nice consequences.

First, we get

g_{ii} = \langle (X_i - \langle X_i \rangle)^2 \rangle  \ge 0

and since this is true in any coordinate system, our would-be metric g is indeed nonnegative. It’ll be an honest Riemannian metric whenever it’s positive definite.

Second, suppose we’re working in the special case discussed in Part 2, where our manifold is an open subset of \mathbb{R}^n, and \rho at the point x \in \mathbb{R}^n is the Gibbs state with \langle X_i \rangle = x_i. Then all the usual rules of statistical mechanics apply. So, we can compute the variance of the observable X_i using the partition function Z:

\langle (X_i - \langle X_i \rangle)^2 \rangle = \frac{\partial^2}{\partial \lambda_i^2} \ln Z

In other words,

g_{ii} =  \frac{\partial^2}{\partial \lambda_i^2} \ln Z

But since this is true in any coordinate system, we must have

g_{ij} =  \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \ln Z

(Here I’m using a little math trick: two symmetric bilinear forms whose diagonal entries agree in any basis must be equal. We’ve already seen that the left side is symmetric, and the right side is symmetric by a famous fact about mixed partial derivatives.)
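Here’s the cute formula at work in the simplest classical example I can think of (a Python sketch; the two-valued observable is my own choice): one observable X with values \{0, 1\}, and Gibbs states \rho = e^{-\lambda X}/Z.

```python
import numpy as np

X = np.array([0.0, 1.0])    # one observable on a two-point event space

def lnZ(lam):
    return np.log(np.sum(np.exp(-lam * X)))

lam, h = 0.7, 1e-5
second = (lnZ(lam + h) - 2 * lnZ(lam) + lnZ(lam - h)) / h**2   # d^2(ln Z)/d lam^2

p = np.exp(-lam * X) / np.sum(np.exp(-lam * X))                # the Gibbs state
variance = np.sum(p * X**2) - np.sum(p * X)**2
print(second, variance)     # both ~ 0.2217: the metric is the variance of X
```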

However, I’m pretty sure this cute formula

g_{ij} =  \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \ln Z

only holds in the special case I’m talking about now, where points in \mathbb{R}^n are parametrizing Gibbs states in the obvious way. In general we must use

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)(X_j- \langle X_j \rangle)\rangle

or equivalently,

g_{ij} = \mathrm{Re} \, \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

Okay. So much for cleaning up Last Week’s Mess. Here’s something new. We’ve seen that whenever A and B are observables (that is, self-adjoint),

\langle AB\rangle^* = \langle BA \rangle

We got something symmetric by taking the real part:

\mathrm{Re} \langle AB\rangle =  \mathrm{Re} \langle BA \rangle

Indeed,

\mathrm{Re} \langle AB \rangle = \frac{1}{2} \langle AB + BA \rangle

But by the same reasoning, we get something antisymmetric by taking the imaginary part:

\mathrm{Im} \langle AB\rangle =  - \mathrm{Im} \langle BA \rangle

and indeed,

\mathrm{Im} \langle AB \rangle = \frac{1}{2i} \langle AB - BA \rangle

Commutators like AB-BA are important in quantum mechanics, so maybe we shouldn’t just throw out the imaginary part of the covariance matrix in our desperate search for a Riemannian metric! Besides the symmetric tensor on our manifold M:

g_{ij} = \mathrm{Re} \, \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

we can also define a skew-symmetric tensor:

\omega_{ij} = \mathrm{Im} \,  \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

This will vanish in the classical case, but not in the quantum case!
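To see this concretely, here’s a small sketch with a qubit (Python with numpy; the particular state and observables are arbitrary picks of mine):

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
rho = np.array([[0.7, 0.2 - 0.1j],
                [0.2 + 0.1j, 0.3]])   # self-adjoint, positive, trace 1

A, B = sx, sx + sz                     # two non-commuting observables

def ev(M):
    """Expectation value tr(rho M)."""
    return np.trace(rho @ M)

print(ev(A @ B), ev(B @ A))    # complex conjugates: (1-0.2j) and (1+0.2j)
print(ev(A @ B - B @ A) / 2j)  # (1/2i)<[A,B]> = -0.2 ...
print(ev(A @ B).imag)          # ... which is exactly Im<AB>, and nonzero
```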

If you’ve studied enough geometry, you should now be reminded of things like ‘Kähler manifolds’ and ‘almost Kähler manifolds’. A Kähler manifold is a manifold that’s equipped with a symmetric tensor g and a skew-symmetric tensor \omega which fit together in the best possible way. An almost Kähler manifold is something similar, but not quite as nice. We should probably see examples of these arising in information geometry! And that could be pretty interesting.

But in general, if we start with any old manifold M together with a function \rho taking values in mixed states, we seem to be making M into something even less nice. It gets a symmetric bilinear form g on each tangent space, and a skew-symmetric bilinear form \omega, and they vary smoothly from point to point… but they might be degenerate, and I don’t see any reason for them to ‘fit together’ in the nice way we need for a Kähler or almost Kähler manifold.

However, I still think something interesting might be going on here. For one thing, there are other situations in physics where a space of states is equipped with a symmetric g and a skew-symmetric \omega. They show up in ‘dissipative mechanics’ — the study of systems whose entropy increases.

To conclude, let me remind you of some things I said in week295 of This Week’s Finds. This is a huge digression from information geometry, but I’d like to lay out the puzzle pieces in public view, in case it helps anyone get some good ideas.

I wrote:

• Hans Christian Öttinger, Beyond Equilibrium Thermodynamics, Wiley, 2005.

I thank Arnold Neumaier for pointing out this book! It considers a fascinating generalization of Hamiltonian mechanics that applies to systems with dissipation: for example, electrical circuits with resistors, or mechanical systems with friction.

In ordinary Hamiltonian mechanics the space of states is a manifold and time evolution is a flow on this manifold determined by a smooth function called the Hamiltonian, which describes the energy of any state. In this generalization the space of states is still a manifold, but now time evolution is determined by two smooth functions: the energy and the entropy! In ordinary Hamiltonian mechanics, energy is automatically conserved. In this generalization that’s also true, but energy can go into the form of heat… and entropy automatically increases!

Mathematically, the idea goes like this. We start with a Poisson manifold, but in addition to the skew-symmetric Poisson bracket {F,G} of smooth functions on some manifold, we also have a symmetric bilinear bracket [F,G] obeying the Leibniz law

[F,GH] = [F,G]H + G[F,H]

and this positivity condition:

[F,F] ≥ 0

The time evolution of any function is given by a generalization of Hamilton’s equations:

dF/dt = {H,F} + [S,F]

where H is a function called the "energy" or "Hamiltonian", and S is a function called the "entropy". The first term on the right is the usual one. The new second term describes dissipation: as we shall see, it pushes the state towards increasing entropy.

If we require that

[H,F] = {S,F} = 0

for every function F, then we get conservation of energy, as usual in Hamiltonian mechanics:

dH/dt = {H,H} + [S,H] = 0

But we also get the second law of thermodynamics:

dS/dt = {H,S} + [S,S] ≥ 0

Entropy always increases!
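Here’s a toy example of the framework in action (a Python sketch; the system, a damped oscillator (q,p) carrying an internal energy e with entropy S = log e, is my own minimal construction, not one of Öttinger’s examples):

```python
import numpy as np

# Toy GENERIC system: an oscillator (q,p) plus an internal energy e, with
#   H(q,p,e) = (q^2 + p^2)/2 + e     (total energy)
#   S(q,p,e) = log(e)                (entropy of the internal mode)
gamma = 0.3   # friction strength, an arbitrary choice

H = lambda x: 0.5 * (x[0]**2 + x[1]**2) + x[2]
S = lambda x: np.log(x[2])

def grad(F, x, h=1e-6):
    return np.array([(F(x + h*d) - F(x - h*d)) / (2*h) for d in np.eye(3)])

def poisson(F, G, x):     # {F,G}: the usual bracket on the (q,p) plane, with the
    dF, dG = grad(F, x), grad(G, x)   # sign chosen so dF/dt = {H,F} gives dq/dt = p
    return dF[1]*dG[0] - dF[0]*dG[1]

def dissip(F, G, x):      # [F,G]: symmetric, [F,F] >= 0, and degenerate in just
    dF, dG = grad(F, x), grad(G, x)   # the right way: [H,F] = 0 for every F
    p, e = x[1], x[2]
    return gamma * e * (dF[1] - p*dF[2]) * (dG[1] - p*dG[2])

def velocity(x):          # dF/dt = {H,F} + [S,F], applied to the three coordinates
    coords = [lambda x, i=i: x[i] for i in range(3)]
    return np.array([poisson(H, F, x) + dissip(S, F, x) for F in coords])

x, dt = np.array([1.0, 0.0, 1.0]), 0.001
for step in range(5001):
    if step % 1000 == 0:
        print(f"t={step*dt:4.1f}   H={H(x):.4f}   S={S(x):.4f}")
    x = x + dt * velocity(x)   # crude Euler stepping: H stays fixed (up to the
                               # integrator's O(dt) error) while S creeps upward
```

Unwinding the brackets by hand gives dq/dt = p, dp/dt = -q - \gamma p, de/dt = \gamma p^2: an oscillator losing energy to friction, with the lost energy reappearing as heat in e.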

Öttinger calls this framework “GENERIC” – an annoying acronym for “General Equation for the NonEquilibrium Reversible-Irreversible Coupling”. There are lots of papers about it. But I’m wondering if any geometers have looked into it!

If we didn’t need the equations [H,F] = {S,F} = 0, we could easily get the necessary brackets starting with a Kähler manifold. The imaginary part of the Kähler structure is a symplectic structure, say ω, so we can define

{F,G} = ω(dF,dG)

as usual to get Poisson brackets. The real part of the Kähler structure is a Riemannian structure, say g, so we can define

[F,G] = g(dF,dG)

This satisfies

[F,GH] = [F,G]H + G[F,H]

and

[F,F] ≥ 0

Don’t be fooled: this stuff is not rocket science. In particular, the inequality above has a simple meaning: when we move in the direction of the gradient of F, the function F increases. So adding the second term to Hamilton’s equations has the effect of pushing the system towards increasing entropy.

Note that I’m being a tad unorthodox by letting ω and g eat cotangent vectors instead of tangent vectors – but that’s no big deal. The big deal is this: if we start with a Kähler manifold and define brackets this way, we don’t get [H,F] = 0 or {S,F} = 0 for all functions F unless H and S are constant! That’s no good for applications to physics. To get around this problem, we would need to consider some sort of degenerate Kähler structure – one where ω and g are degenerate bilinear forms on the cotangent space.

Has anyone thought about such things? They remind me a little of "Dirac structures" and "generalized complex geometry" – but I don’t know enough about those subjects to know if they’re relevant here.

This GENERIC framework suggests that energy and entropy should be viewed as two parts of a single entity – maybe even its real and imaginary parts! And that in turn reminds me of other strange things, like the idea of using complex-valued Hamiltonians to describe dissipative systems, or the idea of “inverse temperature as imaginary time”. I can’t tell yet if there’s a big idea lurking here, or just a mess….


Information Geometry (Part 3)

25 October, 2010

So far in this series of posts I’ve been explaining a paper by Gavin Crooks. Now I want to go ahead and explain a little research of my own.

I’m not claiming my results are new — indeed I have no idea whether they are, and I’d like to hear from any experts who might know. I’m just claiming that this is some work I did last weekend.

People sometimes worry that if they explain their ideas before publishing them, someone will ‘steal’ them. But I think this overestimates the value of ideas, at least in esoteric fields like mathematical physics. The problem is not people stealing your ideas: the hard part is giving them away. And let’s face it, people in love with math and physics will do research unless you actively stop them. I’m reminded of this scene from the Marx Brothers movie where Harpo and Chico, playing wandering musicians, walk into a hotel and offer to play:

Groucho: What do you fellows get an hour?

Chico: Oh, for playing we getta ten dollars an hour.

Groucho: I see…What do you get for not playing?

Chico: Twelve dollars an hour.

Groucho: Well, clip me off a piece of that.

Chico: Now, for rehearsing we make special rate. Thatsa fifteen dollars an hour.

Groucho: That’s for rehearsing?

Chico: Thatsa for rehearsing.

Groucho: And what do you get for not rehearsing?

Chico: You couldn’t afford it.

So, I’m just rehearsing in public here — but of course I hope to write a paper about this stuff someday, once I get enough material.

Remember where we were. We had considered a manifold — let’s finally give it a name, say M — that parametrizes Gibbs states of some physical system. By Gibbs state, I mean a state that maximizes entropy subject to constraints on the expected values of some observables. And we had seen that in favorable cases, we get a Riemannian metric on M! It looks like this:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

where X_i are our observables, and the angle bracket means ‘expected value’.

All this applies to both classical and quantum mechanics. Crooks wrote down a beautiful formula for this metric in the classical case. But since I’m at the Centre for Quantum Technologies, not the Centre for Classical Technologies, I redid his calculation in the quantum case. The big difference is that in quantum mechanics, observables don’t commute! But in the calculations I did, that didn’t seem to matter much — mainly because I took a lot of traces, which imposes a kind of commutativity:

\mathrm{tr}(AB) = \mathrm{tr}(BA)

In fact, if I’d wanted to show off, I could have done the classical and quantum cases simultaneously by replacing all operators by elements of any von Neumann algebra equipped with a trace. Don’t worry about this much: it’s just a general formalism for treating classical and quantum mechanics on an equal footing. One example is the algebra of bounded operators on a Hilbert space, with the usual concept of trace. Then we’re doing quantum mechanics as usual. But another example is the algebra of suitably nice functions on a suitably nice space, where taking the trace of a function means integrating it. And then we’re doing classical mechanics!

For example, I showed you how to derive a beautiful formula for the metric I wrote down a minute ago:

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

But if we want to do the classical version, we can say Hey, presto! and write it down like this:

g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega

What did I do just now? I changed the trace to an integral over some space \Omega. I rewrote \rho as p to make you think ‘probability distribution’. And I don’t need to take the real part anymore, since everything is already real when we’re doing classical mechanics. Now this metric is the Fisher information metric that statisticians know and love!
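If you like checking formulas like this numerically, here’s a little Python sketch that does it for the family of Gaussians with mean \mu = \lambda^1 and standard deviation \sigma = \lambda^2. It works out the derivatives of \mathrm{ln}\, p by hand and approximates the integral above on a grid; the answer should approach the well-known result g = \mathrm{diag}(1/\sigma^2, 2/\sigma^2). The grid and the sample point (\mu, \sigma) = (0, 1.5) are arbitrary choices of mine:

import numpy as np

mu, sigma = 0.0, 1.5                                  # an arbitrary point in the manifold M

omega = np.linspace(mu - 12*sigma, mu + 12*sigma, 200001)
p = np.exp(-(omega - mu)**2 / (2*sigma**2)) / (sigma * np.sqrt(2*np.pi))

# the 'scores': derivatives of ln p with respect to lambda^1 = mu, lambda^2 = sigma
score = [(omega - mu) / sigma**2,
         ((omega - mu)**2 - sigma**2) / sigma**3]

g = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        g[i, j] = np.trapz(p * score[i] * score[j], omega)

print(np.round(g, 6))                 # should be close to diag(1/sigma^2, 2/sigma^2)
print(1/sigma**2, 2/sigma**2)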

In what follows, I’ll keep talking about the quantum case, but in the back of my mind I’ll be using von Neumann algebras, so everything will apply to the classical case too.

So what am I going to do? I’m going to fix a big problem with the story I’ve told so far.

Here’s the problem: so far we’ve only studied a special case of the Fisher information metric. We’ve been assuming our states are Gibbs states, parametrized by the expectation values of some observables X_1, \dots, X_n. Our manifold M was really just some open subset of \mathbb{R}^n: a point in here was a list of expectation values.

But people like to work a lot more generally. We could look at any smooth function \rho from a smooth manifold M to the set of density matrices for some quantum system. We can still write down the metric

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

in this more general situation. Nobody can stop us! But it would be better if we could derive this formula, as before, starting from a formula like the one we had last time:

g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

The challenge is that now we don’t have observables X_i to start with. All we have is a smooth function \rho from some manifold to some set of states. How can we pull observables out of thin air?

Well, you may remember that last time we had

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

where \lambda^i were some functions on our manifold and

Z = \mathrm{tr}(e^{-\lambda^i X_i})

was the partition function. Let’s copy this idea.

So, we’ll start with our density matrix \rho, but then write it as

\rho = \frac{1}{Z} e^{-A}

where A is some self-adjoint operator and

Z = \mathrm{tr} (e^{-A})

(Note that A, like \rho, is really an operator-valued function on M. So, I should write something like A(x) to denote its value at a particular point x \in M, but I won’t usually do that. As usual, I expect some intelligence on your part!)

Now we can repeat some calculations I did last time. As before, let’s take the logarithm of \rho:

\mathrm{ln} \, \rho = -A - \mathrm{ln}\,  Z

and then differentiate it. Suppose \lambda^i are local coordinates near some point of M. Then

\frac{\partial}{\partial \lambda^i} \mathrm{ln}\, \rho = - \frac{\partial}{\partial \lambda^i} A - \frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z

Last time we had nice formulas for both terms on the right-hand side above. To get similar formulas now, let’s define operators

X_i = \frac{\partial}{\partial \lambda^i} A

This gives a nice name to the first term on the right-hand side above. What about the second term? We can calculate it out:

\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = \frac{1}{Z}  \frac{\partial }{\partial \lambda^i} \mathrm{tr}(e^{-A}) = \frac{1}{Z}  \mathrm{tr}(\frac{\partial }{\partial \lambda^i} e^{-A}) = - \frac{1}{Z}  \mathrm{tr}(e^{-A} \frac{\partial}{\partial \lambda^i} A)

where in the last step we use the chain rule. (A word of caution: since A need not commute with its derivative, the naive chain rule for differentiating e^{-A} fails in general, but under the trace it’s valid, thanks to \mathrm{tr}(AB) = \mathrm{tr}(BA).) Next, use the definition of \rho and X_i, and get:

\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = - \mathrm{tr}(\rho X_i) = - \langle X_i \rangle

This is just what we got last time! Ain’t it fun to calculate when it all works out so nicely?

So, putting both terms together, we see

\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho = - X_i + \langle X_i \rangle

or better:

X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho

This is a nice formula for the ‘fluctuation’ of the observables X_i, meaning how much they differ from their expected values. And it looks exactly like the formula we had last time! The difference is that last time we started out assuming we had a bunch of observables, X_i, and defined \rho to be the state maximizing the entropy subject to constraints on the expectation values of all these observables. Now we’re starting with \rho and working backwards.

From here on out, it’s easy. As before, we can define g_{ij} to be the real part of the covariance matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

Using the formula

X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho

we get

g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j} \rangle

or

g_{ij} = \mathrm{Re}\,\mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j})

Voilà!

When this matrix is positive definite at every point, we get a Riemannian metric on M. Last time I said this is what people call the ‘Bures metric’ — though frankly, now that I examine the formulas, I’m not so sure. But in the classical case, it’s called the Fisher information metric.

Differential geometers like to use \partial_i as a shorthand for \frac{\partial}{\partial \lambda^i}, so they’d write down our metric in a prettier way:

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho) )

Differential geometers like coordinate-free formulas, so let’s also give a coordinate-free formula for our metric. Suppose x \in M is a point in our manifold, and suppose v,w are tangent vectors to this point. Then

g(v,w) = \mathrm{Re} \, \langle v(\mathrm{ln}\, \rho) \; w(\mathrm{ln} \,\rho) \rangle \; = \; \mathrm{Re} \,\mathrm{tr}(\rho \; v(\mathrm{ln}\, \rho) \; w(\mathrm{ln}\, \rho))

Here \mathrm{ln}\, \rho is a smooth operator-valued function on M, and v(\mathrm{ln}\, \rho) means the derivative of this function in the v direction at the point x.
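Before moving on, here’s a numerical sanity check of all this, sketched in Python. I take a two-parameter family of qubit states \rho(\lambda) = e^{-A(\lambda)}/Z with A(\lambda) = \lambda^1 \sigma_x + \lambda^2 \sigma_z + \lambda^1 \lambda^2 \sigma_y, a deliberately nonlinear choice so that A and its derivatives don’t commute, set X_i = \partial_i A, and compare the covariance formula for g_{ij} with the formula involving derivatives of \mathrm{ln}\, \rho, computed by central differences of the matrix logarithm. The particular A and the base point are my own arbitrary choices:

import numpy as np
from scipy.linalg import expm, logm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def A(lam):
    l1, l2 = lam
    return l1*sx + l2*sz + l1*l2*sy

def rho(lam):
    r = expm(-A(lam))
    return r / np.trace(r)

lam0 = np.array([0.3, 0.7])                # an arbitrary base point
r = rho(lam0)

# X_i = dA/dlambda^i at the base point, computed exactly:
X = [sx + lam0[1]*sy, sz + lam0[0]*sy]

def expect(op):
    return np.trace(r @ op)

# covariance formula: g_ij = Re <(X_i - <X_i>)(X_j - <X_j>)>
g_cov = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        dXi = X[i] - expect(X[i]) * np.eye(2)
        dXj = X[j] - expect(X[j]) * np.eye(2)
        g_cov[i, j] = np.real(np.trace(r @ dXi @ dXj))

# ln-rho formula, with d(ln rho)/dlambda^i from central differences:
eps = 1e-5
dlnr = []
for i in range(2):
    e = np.zeros(2); e[i] = eps
    dlnr.append((logm(rho(lam0 + e)) - logm(rho(lam0 - e))) / (2*eps))

g_log = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        g_log[i, j] = np.real(np.trace(r @ dlnr[i] @ dlnr[j]))

print(np.round(g_cov, 6))
print(np.round(g_log, 6))                  # the two should agree up to finite-difference error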

So, this is all very nice. To conclude, two more points: a technical one, and a more important philosophical one.

First, the technical point. When I said \rho could be any smooth function from a smooth manifold to some set of states, I was actually lying. That’s an important pedagogical technique: the brazen lie.

We can’t really take the logarithm of every density matrix. Remember, we take the log of a density matrix by taking the log of all its eigenvalues. These eigenvalues are ≥ 0, but if one of them is zero, we’re in trouble! The logarithm of zero is undefined.

On the other hand, there’s no problem taking the logarithm of our density-matrix-valued function \rho when it’s positive definite at each point of M. You see, a density matrix is positive definite iff its eigenvalues are all > 0. In this case it has a unique self-adjoint logarithm.

So, we must assume \rho is positive definite. But what’s the physical significance of this ‘positive definiteness’ condition? Well, any density matrix can be diagonalized using some orthonormal basis. It can then be seen as a probabilistic mixture — not a quantum superposition! — of pure states taken from this basis. Its eigenvalues are the probabilities of finding the mixed state to be in one of these pure states. So, saying that its eigenvalues are all > 0 amounts to saying that all the pure states in this orthonormal basis show up with nonzero probability! Intuitively, this means our mixed state is ‘really mixed’. For example, it can’t be a pure state. In math jargon, it means our mixed state is in the interior of the convex set of mixed states.

Second, the philosophical point. Instead of starting with the density matrix \rho, I took A as fundamental. But different choices of A give the same \rho. After all,

\rho = \frac{1}{Z} e^{-A}

where we cleverly divide by the normalization factor

Z = \mathrm{tr} (e^{-A})

to get \mathrm{tr} \, \rho = 1. So, if we multiply e^{-A} by any positive constant, or indeed any positive function on our manifold M, \rho will remain unchanged!

So we have added a little extra information when switching from \rho to A. You can think of this as ‘gauge freedom’, because I’m saying we can do any transformation like

A \mapsto A + f

where

f: M \to \mathbb{R}

is a smooth function. This doesn’t change \rho, so arguably it doesn’t change the ‘physics’ of what I’m doing. It does change Z. It also changes the observables

X_i = \frac{\partial}{\partial \lambda^i} A

But it doesn’t change their ‘fluctuations’

X_i - \langle X_i \rangle

so it doesn’t change the metric g_{ij}.

This gauge freedom is interesting, and I want to understand it better. It’s related to something very simple yet mysterious. In statistical mechanics the partition function Z begins life as ‘just a normalizing factor’. If you change the physics so that Z gets multiplied by some number, the Gibbs state doesn’t change. But then the partition function takes on an incredibly significant role as something whose logarithm you differentiate to get lots of physically interesting information! So in some sense the partition function doesn’t matter much… but changes in the partition function matter a lot.

This is just like the split personality of phases in quantum mechanics. On the one hand they ‘don’t matter’: you can multiply a unit vector by any phase and the pure state it defines doesn’t change. But on the other hand, changes in phase matter a lot.

Indeed the analogy here is quite deep: it’s the analogy between probabilities in statistical mechanics and amplitudes in quantum mechanics, the analogy between \mathrm{exp}(-\beta H) in statistical mechanics and \mathrm{exp}(-i t H / \hbar) in quantum mechanics, and so on. This is part of a bigger story about ‘rigs’ which I told back in the Winter 2007 quantum gravity seminar, especially in week13. So, it’s fun to see it showing up yet again… even though I don’t completely understand it here.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


Information Geometry (Part 2)

23 October, 2010

Last time I provided some background to this paper:

• Gavin E. Crooks, Measuring thermodynamic length.

Now I’ll tell you a bit about what it actually says!

Remember the story so far: we’ve got a physical system that’s in a state of maximum entropy. I didn’t emphasize this yet, but that happens whenever our system is in thermodynamic equilibrium. An example would be a box of gas inside a piston. Suppose you choose any number for the energy of the gas and any number for its volume. Then there’s a unique state of the gas that maximizes its entropy, given the constraint that on average, its energy and volume have the values you’ve chosen. And this describes what the gas will be like in equilibrium!

Remember, by ‘state’ I mean mixed state: it’s a probabilistic description. And I say the energy and volume have chosen values on average because there will be random fluctuations. Indeed, if you look carefully at the head of the piston, you’ll see it quivering: the volume of the gas only equals the volume you’ve specified on average. Same for the energy.

More generally: imagine picking any list of numbers, and finding the maximum entropy state where some chosen observables have these numbers as their average values. Then there will be fluctuations in the values of these observables — thermal fluctuations, but also possibly quantum fluctuations. So, you’ll get a probability distribution on the space of possible values of your chosen observables. You should visualize this probability distribution as a little fuzzy cloud centered at the average value!

To a first approximation, this cloud will be shaped like a little ellipsoid. And if you can pick the average value of your observables to be whatever you like, you’ll get lots of little ellipsoids this way, one centered at each point. And the cool idea is to imagine the space of possible values of your observables as having a weirdly warped geometry, such that relative to this geometry, these ellipsoids are actually spheres.

This weirdly warped geometry is an example of an ‘information geometry’: a geometry that’s defined using the concept of information. This shouldn’t be surprising: after all, we’re talking about maximum entropy, and entropy is related to information. But I want to gradually make this idea more precise.

Bring on the math!

We’ve got a bunch of observables X_1, \dots , X_n, and we’re assuming that for any list of numbers x_1, \dots , x_n, the system has a unique maximal-entropy state \rho for which the expected value of the observable X_i is x_i:

\langle X_i \rangle = x_i

This state \rho is called the Gibbs state and I told you how to find it when it exists. In fact it may not exist for every list of numbers x_1, \dots , x_n, but we’ll be perfectly happy if it does for all choices of

x = (x_1, \dots, x_n)

lying in some open subset of \mathbb{R}^n.

By the way, I should really call this Gibbs state \rho(x) or something to indicate how it depends on x, but I won’t usually do that. I expect some intelligence on your part!

Now at each point x we can define a covariance matrix

\langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

If we take its real part, we get a symmetric matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

It’s also nonnegative — that’s easy to see, since the variance of a probability distribution can’t be negative. When we’re lucky this matrix will be positive definite. When we’re even luckier, it will depend smoothly on x. In this case, g_{ij} will define a Riemannian metric on our open set.

So far this is all review of last time. Sorry: I seem to have reached the age where I can’t say anything interesting without warming up for about 15 minutes first. It’s like when my mom tells me about an exciting event that happened to her: she starts by saying “Well, I woke up, and it was cloudy out…”

But now I want to give you an explicit formula for the metric g_{ij}, and then rewrite it in a way that’ll work even when the state \rho is not a maximal-entropy state. And this formula will then be the general definition of the ‘Fisher information metric’ (if we’re doing classical mechanics), or a quantum version thereof (if we’re doing quantum mechanics).

Crooks does the classical case — so let’s do the quantum case, okay? Last time I claimed that in the quantum case, our maximum-entropy state is the Gibbs state

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

where \lambda^i are the ‘conjugate variables’ of the observables X_i, we’re using the Einstein summation convention to sum over repeated indices that show up once upstairs and once downstairs, and Z is the partition function

Z = \mathrm{tr} (e^{-\lambda^i X_i})

(To be honest: last time I wrote the indices on the conjugate variables \lambda^i as subscripts rather than superscripts, because I didn’t want some poor schlep out there to think that \lambda^1, \dots , \lambda^n were the powers of some number \lambda. But now I’m assuming you’re all grown up and ready to juggle indices! We’re doing Riemannian geometry, after all.)

Also last time I claimed that it’s tremendously fun and enlightening to take the derivative of the logarithm of Z. The reason is that it gives you the mean values of your observables:

\langle X_i \rangle = - \frac{\partial}{\partial \lambda^i} \ln Z

But now let’s take the derivative of the logarithm of \rho. Remember, \rho is an operator — in fact a density matrix. But we can take its logarithm as explained last time, and the usual rules apply, so starting from

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

we get

\mathrm{ln}\, \rho = - \lambda^i X_i - \mathrm{ln} \,Z

Next, let’s differentiate both sides with respect to \lambda^i. Why? Well, from what I just said, you should be itching to differentiate \mathrm{ln}\, Z. So let’s give in to that temptation:

\frac{\partial  }{\partial \lambda^i} \mathrm{ln}  \rho = -X_i + \langle X_i \rangle

Hey! Now we’ve got a formula for the ‘fluctuation’ of the observable X_i — that is, how much it differs from its mean value:

X_i - \langle X_i \rangle = - \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i}

This is incredibly cool! I should have learned this formula decades ago, but somehow I just bumped into it now. I knew of course that \mathrm{ln} \, \rho shows up in the formula for the entropy:

S(\rho) = -\mathrm{tr} ( \rho \; \mathrm{ln} \, \rho)

But I never had the brains to think about \mathrm{ln}\, \rho all by itself. So I’m really excited to discover that it’s an interesting entity in its own right — and fun to differentiate, just like \mathrm{ln}\, Z.
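In fact this formula is easy to test numerically. Here’s a sketch in Python: I take the two noncommuting observables \sigma_x and \sigma_y on \mathbb{C}^2, form the Gibbs state, and compare \partial \mathrm{ln}\, \rho / \partial \lambda^1, computed by a central difference of the matrix logarithm, with \langle X_1 \rangle - X_1. The base point (\lambda^1, \lambda^2) = (0.4, 0.9) is an arbitrary choice:

import numpy as np
from scipy.linalg import expm, logm

X1 = np.array([[0, 1], [1, 0]], dtype=complex)    # sigma_x
X2 = np.array([[0, -1j], [1j, 0]])                # sigma_y

def rho(l1, l2):
    r = expm(-(l1*X1 + l2*X2))
    return r / np.trace(r)

l1, l2, eps = 0.4, 0.9, 1e-5
r = rho(l1, l2)

dlnrho = (logm(rho(l1 + eps, l2)) - logm(rho(l1 - eps, l2))) / (2*eps)
fluct = np.trace(r @ X1) * np.eye(2) - X1         # <X_1> - X_1

print(np.round(dlnrho - fluct, 8))                # should be the zero matrix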

Now we get our cool formula for g_{ij}. Remember, it’s defined by

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

But now that we know

X_i - \langle X_i \rangle = -\frac{\partial \mathrm{ln} \rho }{\partial \lambda^i}

we get the formula we were looking for:

g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j}  \rangle

Beautiful, eh? And of course the expected value of any observable A in the state \rho is

\langle A \rangle = \mathrm{tr}(\rho A)

so we can also write the covariance matrix like this:

g_{ij} = \mathrm{Re}\, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

Lo and behold! This formula makes sense whenever \rho is any density matrix depending smoothly on some parameters \lambda^i. We don’t need it to be a Gibbs state! So, we can work more generally.

Indeed, whenever we have any smooth function from a manifold to the space of density matrices for some Hilbert space, we can define g_{ij} by the above formula! And when it’s positive definite, we get a Riemannian metric on our manifold: the Bures information metric.

The classical analogue is the somewhat more well-known ‘Fisher information metric’. When we go from quantum to classical, operators become functions and traces become integrals. There’s nothing complex anymore, so taking the real part becomes unnecessary. So the Fisher information metric looks like this:

g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega

Here I’m assuming we’ve got a smooth function p from some manifold M to the space of probability distributions on some measure space (\Omega, d\omega). Working in local coordinates \lambda^i on our manifold M, the above formula defines a Riemannian metric on M, at least when g_{ij} is positive definite. And that’s the Fisher information metric!
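A tiny example: take the coin family, where \Omega = \{H, T\} with counting measure and p(H) = \lambda, p(T) = 1 - \lambda for a single coordinate \lambda \in (0,1). The integral becomes a two-term sum, and the metric comes out to g = 1/(\lambda(1-\lambda)). Here’s a quick check in Python, with \lambda = 0.3 as an arbitrary choice:

import numpy as np

lam, eps = 0.3, 1e-6

def p(lam):                        # the two probabilities p(H), p(T)
    return np.array([lam, 1 - lam])

# d ln p / d lambda by central differences, then the Fisher 'integral' as a sum:
dlnp = (np.log(p(lam + eps)) - np.log(p(lam - eps))) / (2*eps)
g = np.sum(p(lam) * dlnp**2)

print(g, 1/(lam*(1 - lam)))        # the two numbers should agree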

Crooks says more: he describes an experiment that would let you measure the length of a path with respect to the Fisher information metric — at least in the case where the state \rho(x) is the Gibbs state with \langle X_i \rangle = x_i. And that explains why he calls it ‘thermodynamic length’.

There’s a lot more to say about this, and also about another question: What use is the Fisher information metric in the general case where the states \rho(x) aren’t Gibbs states?

But it’s dinnertime, so I’ll stop here.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


Information Geometry (Part 1)

22 October, 2010

I’d like to provide a bit of background to this interesting paper:

• Gavin E. Crooks, Measuring thermodynamic length.

which was pointed out by John F in our discussion of entropy and uncertainty.

The idea here should work for either classical or quantum statistical mechanics. The paper describes the classical version, so just for a change of pace let me describe the quantum version.

First a lightning review of quantum statistical mechanics. Suppose you have a quantum system with some Hilbert space. When you know as much as possible about your system, then you describe it by a unit vector in this Hilbert space, and you say your system is in a pure state. Sometimes people just call a pure state a ‘state’. But that can be confusing, because in statistical mechanics you also need more general ‘mixed states’ where you don’t know as much as possible. A mixed state is described by a density matrix, meaning a positive operator \rho with trace equal to 1:

\mathrm{tr}(\rho) = 1

The idea is that any observable is described by a self-adjoint operator A, and the expected value of this observable in the mixed state \rho is

\langle A \rangle = \mathrm{tr}(\rho A)

The entropy of a mixed state is defined by

S(\rho) = -\mathrm{tr}(\rho \; \mathrm{ln} \, \rho)

where we take the logarithm of the density matrix just by taking the log of each of its eigenvalues, while keeping the same eigenvectors. This formula for entropy should remind you of the one that Gibbs and Shannon used — the one I explained a while back.

Back then I told you about the ‘Gibbs ensemble’: the mixed state that maximizes entropy subject to the constraint that some observable have a given value. We can do the same thing in quantum mechanics, and we can even do it for a bunch of observables at once. Suppose we have some observables X_1, \dots, X_n and we want to find the mixed state \rho that maximizes entropy subject to these constraints:

\langle X_i \rangle = x_i

for some numbers x_i. Then a little exercise in Lagrange multipliers shows that the answer is the Gibbs state:

\rho = \frac{1}{Z} \mathrm{exp}(-\lambda_1 X_1 - \cdots - \lambda_n X_n)

Huh?

This answer needs some explanation. First of all, the numbers \lambda_1, \dots, \lambda_n are called Lagrange multipliers. You have to choose them right to get

\langle X_i \rangle = x_i

So, in favorable cases, they will be functions of the numbers x_i. And when you’re really lucky, you can solve for the numbers x_i in terms of the numbers \lambda_i. We call \lambda_i the conjugate variable of the observable X_i. For example, the conjugate variable of energy is inverse temperature!

Second of all, we take the exponential of a self-adjoint operator just as we took the logarithm of a density matrix: just take the exponential of each eigenvalue.

(At least this works when our self-adjoint operator has only eigenvalues in its spectrum, not any continuous spectrum. Otherwise we need to get serious and use the functional calculus. Luckily, if your system’s Hilbert space is finite-dimensional, you can ignore this parenthetical remark!)

But third: what’s that number Z? It begins life as a humble normalizing factor. Its job is to make sure \rho has trace equal to 1:

Z = \mathrm{tr}(\mathrm{exp}(-\lambda_1 X_1 - \cdots - \lambda_n X_n))

However, once you get going, it becomes incredibly important! It’s called the partition function of your system.

As an example of what it’s good for, it turns out you can compute the numbers x_i as follows:

x_i = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

In other words, you can compute the expected values of the observables X_i by differentiating the log of the partition function:

\langle X_i \rangle = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

Or in still other words,

\langle X_i \rangle = - \frac{1}{Z} \; \frac{\partial Z}{\partial \lambda_i}

To believe this you just have to take the equations I’ve given you so far and mess around — there’s really no substitute for doing it yourself. I’ve done it fifty times, and every time I feel smarter.

But we can go further: after the ‘expected value’ or ‘mean’ of an observable comes its variance, which is the square of its standard deviation:

(\Delta A)^2 = \langle A^2 \rangle - \langle A \rangle^2

This measures the size of fluctuations around the mean. And in the Gibbs state, we can compute the variance of the observable X_i as the second derivative of the log of the partition function:

\langle X_i^2 \rangle - \langle X_i \rangle^2 =  \frac{\partial^2}{\partial \lambda_i^2} \mathrm{ln} Z

Again: calculate and see.
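Or, if you’d rather let the computer do the messing around, here’s a sketch in Python. It builds the Gibbs state of a single randomly chosen self-adjoint 4 \times 4 observable X and checks both identities, the mean as minus the first derivative of \mathrm{ln}\, Z and the variance as the second derivative, using central differences. The dimension, the random seed and the point \lambda = 0.8 are arbitrary choices:

import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(42)
B = rng.standard_normal((4, 4)) + 1j*rng.standard_normal((4, 4))
X = (B + B.conj().T) / 2                  # a random self-adjoint observable

def lnZ(lam):
    return np.log(np.trace(expm(-lam*X)).real)

lam, eps = 0.8, 1e-4
rho = expm(-lam*X) / np.trace(expm(-lam*X))

mean = np.trace(rho @ X).real
var = np.trace(rho @ X @ X).real - mean**2

dlnZ = (lnZ(lam + eps) - lnZ(lam - eps)) / (2*eps)
d2lnZ = (lnZ(lam + eps) - 2*lnZ(lam) + lnZ(lam - eps)) / eps**2

print(mean, -dlnZ)       # these should agree
print(var, d2lnZ)        # and so should these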

But when we’ve got lots of observables, there’s something better than the variance of each one. There’s the covariance matrix of the whole lot of them! Each observable X_i fluctuates around its mean value x_i… but these fluctuations are not independent! They’re correlated, and the covariance matrix says how.

All this is very visual, at least for me. If you imagine the fluctuations as forming a blurry patch near the point (x_1, \dots, x_n), this patch will be ellipsoidal in shape, at least when all our random fluctuations are Gaussian. And then the shape of this ellipsoid is precisely captured by the covariance matrix! In particular, the eigenvectors of the covariance matrix will point along the principal axes of this ellipsoid, and the eigenvalues will say how stretched out the ellipsoid is in each direction!

To understand the covariance matrix, it may help to start by rewriting the variance of a single observable A as

(\Delta A)^2 = \langle (A - \langle A \rangle)^2 \rangle

That’s a lot of angle brackets, but the meaning should be clear. First we look at the difference between our observable and its mean value, namely

A - \langle A \rangle

Then we square this, to get something that’s big and positive whenever our observable is far from its mean. Then we take the mean value of that, to get an idea of how far our observable is from the mean on average.

We can use the same trick to define the covariance of a bunch of observables X_i. We get an n \times n matrix called the covariance matrix, whose entry in the ith row and jth column is

\langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

If you think about it, you can see that this will measure correlations in the fluctuations of your observables.

An interesting difference between classical and quantum mechanics shows up here. In classical mechanics the covariance matrix is always symmetric — but not in quantum mechanics! You see, in classical mechanics, whenever we have two observables A and B, we have

\langle A B \rangle = \langle B A \rangle

since observables commute. But in quantum mechanics this is not true! For example, consider the position q and momentum p of a particle. In units where \hbar = 1, we have

q p = p q + i

so taking expectation values we get

\langle q p \rangle = \langle p q \rangle + i

So, it’s easy to get a non-symmetric covariance matrix when our observables X_i don’t commute. However, the real part of the covariance matrix is symmetric, even in quantum mechanics. So let’s define

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

You can check that the matrix entries here are the second derivatives of the logarithm of the partition function:

g_{ij} = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z

(A caveat: when the observables don’t commute, checking this takes some care, and the second derivatives of \mathrm{ln}\, Z can actually differ from the real part of the covariance matrix. They agree whenever the observables commute, and in particular in classical mechanics.)

And now for the cool part: this is where information geometry comes in! Suppose that for any choice of values x_i we have a Gibbs state with

\langle X_i \rangle = x_i

Then for each point

x = (x_1, \dots , x_n) \in \mathbb{R}^n

we have a matrix

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z

And this matrix is not only symmetric, it’s also positive. And when it’s positive definite we can think of it as an inner product on the tangent space of the point x. In other words, we get a Riemannian metric on \mathbb{R}^n. This is called the Fisher information metric.

I hope you can see through the jargon to the simple idea. We’ve got a space. Each point x in this space describes the maximum-entropy state of a quantum system for which our observables have specified mean values. But in each of these states, the observables are random variables. They don’t just sit at their mean value, they fluctuate! You can picture these fluctuations as forming a little smeared-out blob in our space. To a first approximation, this blob is an ellipsoid. And if we think of this ellipsoid as a ‘unit ball’, it gives us a standard for measuring the length of any little vector sticking out of our point. In other words, we’ve got a Riemannian metric: the Fisher information metric!

Now if you look at the Wikipedia article you’ll see a more general but to me somewhat scarier definition of the Fisher information metric. This applies whenever we’ve got a manifold whose points label arbitrary mixed states of a system. But Crooks shows this definition reduces to his — the one I just described — when our manifold is \mathbb{R}^n and it’s parametrizing Gibbs states in the way we’ve just seen.

More precisely: both Crooks and the Wikipedia article describe the classical story, but it parallels the quantum story I’ve been telling… and I think the quantum version is well-known. I believe the quantum version of the Fisher information metric is sometimes called the Bures metric, though I’m a bit confused about what the Bures metric actually is.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


Relative Entropy in Evolutionary Dynamics

22 January, 2014

guest post by Marc Harper

In John’s information geometry series, he mentioned some of my work in evolutionary dynamics. Today I’m going to tell you about some exciting extensions!

The replicator equation

First a little refresher. For a population of n replicating types, such as individuals with different eye colors or a gene with n distinct alleles, the ‘replicator equation’ expresses the main idea of natural selection: the relative rate of growth of each type should be proportional to the difference between the fitness of the type and the mean fitness in the population.

To see why this equation should be true, let P_i be the population of individuals of the ith type, which we allow to be any nonnegative real number. We can list all these numbers and get a vector:

P = (P_1, \dots, P_n)

The Lotka–Volterra equation is a very general rule for how these numbers can change with time:

\displaystyle{ \frac{d P_i}{d t} = f_i(P) P_i }

Each population grows at a rate proportional to itself, where the ‘constant of proportionality’, f_i(P), is not necessarily constant: it can be any real-valued function of P. This function is called the fitness of the ith type. Taken all together, these functions f_i are called the fitness landscape.

Let p_i be the fraction of individuals who are of the ith type:

\displaystyle{ p_i = \frac{P_i}{\sum_{j =1}^n P_j } }

These numbers p_i are between 0 and 1, and they add up to 1. So, we can also think of them as probabilities: p_i is the probability that a randomly chosen individual is of the ith type. This is how probability theory, and eventually entropy, gets into the game.

Again, we can bundle these numbers into a vector:

p = (p_1, \dots, p_n)

which we call the population distribution. It turns out that the Lotka–Volterra equation implies the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(P) - \langle f(P) \rangle \right) \, p_i }

where

\displaystyle{ \langle f(P) \rangle = \sum_{i =1}^n  f_i(P)  p_i  }

is the mean fitness of all the individuals. You can see the proof in Part 9 of the information geometry series.

By the way: if each fitness f_i(P) only depends on the fraction of individuals of each type, not the total numbers, we can write the replicator equation in a simpler way:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }

From now on, when talking about this equation, that’s what I’ll do.

Anyway, the take-home message is this: the replicator equation says the fraction of individuals of any type changes at a rate proportional to the fitness of that type minus the mean fitness.

Now, it has been known since the late 1970s or early 1980s, thanks to the work of Akin, Bomze, Hofbauer, Shahshahani, and others, that the replicator equation has some very interesting properties. For one thing, it often makes ‘relative entropy’ decrease. For another, it’s often an example of ‘gradient flow’. Let’s look at both of these in turn, and then talk about some new generalizations of these facts.

Relative entropy as a Lyapunov function

I mentioned that we can think of a population distribution as a probability distribution. This lets us take ideas from probability theory and even information theory and apply them to evolutionary dynamics! For example, given two population distributions p and q, the information of q relative to p is

I(q,p) = \displaystyle{ \sum_i q_i \ln \left(\frac{q_i}{p_i }\right)}

This measures how much information you gain if you have a hypothesis about some state of affairs given by the probability distribution p, and then someone tells you “no, the best hypothesis is q!”

It may seem weird to treat a population distribution as a hypothesis, but this turns out to be a good idea. Evolution can then be seen as a learning process: a process of improving the hypothesis.

We can make this precise by seeing how the relative information changes with the passage of time. Suppose we have two population distributions q and p. Suppose q is fixed, while p evolves in time according to the replicator equation. Then

\displaystyle{  \frac{d}{d t} I(q,p)  =  \sum_i f_i(P) (p_i - q_i) }

For the proof, see Part 11 of the information geometry series.

So, the information of q relative to p will decrease as p evolves according to the replicator equation if

\displaystyle{  \sum_i f_i(P) (p_i - q_i) } \le 0

If q makes this true for all p, we say q is an evolutionarily stable state. For some reasons why, see Part 13.

What matters now is that when q is an evolutionarily stable state, I(q,p) says how much information the population has ‘left to learn’—and we’re seeing that this always decreases. Moreover, it turns out that we always have

I(q,p) \ge 0

and I(q,p) = 0 precisely when p = q.
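Here’s a quick numerical illustration of this, sketched in Python. I use a ‘good’ rock-paper-scissors game, where the payoff for a win (here 2) exceeds the cost of a loss (here 1); for such games the interior equilibrium q = (1/3, 1/3, 1/3) is an evolutionarily stable state. Integrating the replicator equation from an arbitrary starting point, the relative information I(q,p(t)) should decrease toward zero. The payoff matrix, starting point and output times are my own choices:

import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[ 0., -1.,  2.],
              [ 2.,  0., -1.],
              [-1.,  2.,  0.]])           # 'good' rock-paper-scissors
q = np.ones(3) / 3                        # the evolutionarily stable state

def replicator(t, p):
    f = A @ p                             # fitness landscape f_i(p)
    return p * (f - p @ f)                # dp_i/dt = p_i (f_i - <f>)

def relative_info(q, p):
    return np.sum(q * np.log(q / p))

sol = solve_ivp(replicator, [0, 40], [0.80, 0.15, 0.05],
                dense_output=True, rtol=1e-9)

for t in [0, 5, 10, 20, 40]:
    p = sol.sol(t)
    print(f"t = {t:4}   I(q,p) = {relative_info(q, p):.6f}")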

People summarize all this by saying that relative information is a ‘Lyapunov function’. Very roughly, a Lyapunov function is something that decreases with the passage of time, and is zero only at the unique stable state. To be a bit more precise, suppose we have a differential equation like

\displaystyle{  \frac{d}{d t} x(t) = v(x(t)) }

where x(t) \in \mathbb{R}^n and v is some smooth vector field on \mathbb{R}^n. Then a smooth function

V : \mathbb{R}^n \to \mathbb{R}

is a Lyapunov function if

V(x) \ge 0 for all x

V(x) = 0 iff x is some particular point x_0

and

\displaystyle{ \frac{d}{d t} V(x(t)) \le 0 } for every solution of our differential equation.

In this situation, the point x_0 is a stable equilibrium for our differential equation: this is Lyapunov’s theorem.

The replicator equation as a gradient flow equation

The basic idea of Lyapunov’s theorem is that when a ball likes to roll downhill and the landscape has just one bottom point, that point will be the unique stable equilibrium for the ball.

The idea of gradient flow is similar, but different: sometimes things like to roll downhill as efficiently as possible: they move in exactly the best direction to make some quantity smaller! Under certain conditions, the replicator equation is an example of this phenomenon.

Let’s fill in some details. For starters, suppose we have some function

V : \mathbb{R}^n \to \mathbb{R}

Think of V as ‘height’. Then the gradient flow equation says how a point x(t) \in \mathbb{R}^n will move if it’s always trying its very best to go downhill:

\displaystyle{ \frac{d}{d t} x(t) = - \nabla V(x(t)) }

Here \nabla is the usual gradient in Euclidean space:

\displaystyle{ \nabla V = \left(\partial_1 V, \dots, \partial_n V \right)  }

where \partial_i is short for the partial derivative with respect to the ith coordinate.

The interesting thing is that under certain conditions, the replicator equation is an example of a gradient flow equation… but typically not one where \nabla is the usual gradient in Euclidean space. Instead, it’s the gradient on some other space, the space of all population distributions, which has a non-Euclidean geometry!

The space of all population distributions is a simplex:

\{ p \in \mathbb{R}^n : \; p_i \ge 0, \; \sum_{i = 1}^n p_i = 1 \} .

For example, it’s an equilateral triangle when n = 3. The equilateral triangle looks flat, but if we measure distances another way it becomes round, exactly like a portion of a sphere, and that’s the non-Euclidean geometry we need!

In fact this trick works in any dimension. The idea is to give the simplex a special Riemannian metric, the ‘Fisher information metric’. The usual metric on Euclidean space is

\delta_{i j} = \left\{\begin{array}{ccl} 1 & \mathrm{ if } & i = j \\                                       0 &\mathrm{ if } & i \ne j \end{array} \right.

This simply says that two standard basis vectors like (0,1,0,0) and (0,0,1,0) have dot product zero if the 1’s are in different places, and one if they’re in the same place. The Fisher information metric is a bit more complicated:

\displaystyle{ g_{i j} = \frac{\delta_{i j}}{p_i} }

As before, g_{i j} is a formula for the dot product of the ith and jth standard basis vectors, but now it depends on where you are in the simplex of population distributions.

We saw how this formula arises from information theory back in Part 7. I won’t repeat the calculation, but the idea is this. Fix a population distribution p and consider the information of another one, say q, relative to this. We get I(q,p). If q = p this is zero:

\displaystyle{ \left. I(q,p)\right|_{q = p} = 0 }

and this point is a local minimum for the relative information. So, the first derivative of I(q,p) as we change q must be zero:

\displaystyle{ \left. \frac{\partial}{\partial q_i} I(q,p) \right|_{q = p} = 0 }

But the second derivatives are not zero. In fact, since we’re at a local minimum, it should not be surprising that we get a positive definite matrix of second derivatives:

\displaystyle{  g_{i j} = \left. \frac{\partial^2}{\partial q_i \partial q_j} I(q,p) \right|_{q = p} }

And, this is the Fisher information metric! So, the Fisher information metric is a way of taking dot products between vectors in the simplex of population distributions that’s based on the concept of relative information.
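If you’d like to see this without redoing the calculation, here’s a numerical sketch in Python. Treating the q_i as independent coordinates, it computes the matrix of second derivatives of I(q,p) at q = p by central differences; for any p in the interior of the simplex the answer should be the diagonal matrix with entries 1/p_i. The particular p is an arbitrary choice:

import numpy as np

p = np.array([0.2, 0.3, 0.5])              # an arbitrary interior point of the simplex

def I(q):                                  # relative information I(q, p)
    return np.sum(q * np.log(q / p))

n, eps = len(p), 1e-4
g = np.empty((n, n))
for i in range(n):
    for j in range(n):
        ei = np.eye(n)[i] * eps
        ej = np.eye(n)[j] * eps
        g[i, j] = (I(p + ei + ej) - I(p + ei - ej)
                   - I(p - ei + ej) + I(p - ei - ej)) / (4 * eps**2)

print(np.round(g, 4))                      # should be diag(1/p_i)
print(np.round(1 / p, 4))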

This is not the place to explain Riemannian geometry, but any metric gives a way to measure angles and distances, and thus a way to define the gradient of a function. After all, the gradient of a function should point at right angles to the level sets of that function, and its length should equal the slope of that function.

So, if we change our way of measuring angles and distances, we get a new concept of gradient! The ith component of this new gradient vector field turns out to be

(\nabla_g V)^i = g^{i j} \partial_j V

where g^{i j} is the inverse of the matrix g_{i j}, and we sum over the repeated index j. As a sanity check, make sure you see why this is the usual Euclidean gradient when g_{i j} = \delta_{i j}.

Now suppose the fitness landscape is the good old Euclidean gradient of some function. Then it turns out that the replicator equation is a special case of gradient flow on the space of population distributions… but where we use the Fisher information metric to define our concept of gradient!

To get a feel for this, it’s good to start with the Lotka–Volterra equation, which describes how the total number of individuals of each type changes. Suppose the fitness landscape is the Euclidean gradient of some function V:

\displaystyle{ f_i(P) = \frac{\partial V}{\partial P_i} }

Then the Lotka–Volterra equation becomes this:

\displaystyle{ \frac{d P_i}{d t} = \frac{\partial V}{\partial P_i} \, P_i }

This doesn’t look like the gradient flow equation, thanks to that annoying P_i on the right-hand side! It certainly ain’t the gradient flow coming from the function V and the usual Euclidean gradient. However, it is gradient flow coming from V and some other metric on the space

\{ P \in \mathbb{R}^n : \; P_i \ge 0 \}

For a proof, and the formula for this other metric, see Section 3.7 in this survey:

• Marc Harper, Information geometry and evolutionary game theory.

Now let’s turn to the replicator equation:

\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p)  - \langle f(p) \rangle \right) \, p_i }

Again, if the fitness landscape is a Euclidean gradient, we can rewrite the replicator equation as a gradient flow equation… but again, not with respect to the Euclidean metric. This time we need to use the Fisher information metric! I sketch a proof in my paper above.
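Here’s a numerical version of this fact, as a sketch in Python. I pick V(p) = \frac{1}{2}\sum_i c_i p_i^2, so the fitness landscape f_i = \partial V/\partial p_i is a Euclidean gradient. Then I compute the gradient of V with respect to the Fisher information metric on the simplex, by solving the defining linear system with a Lagrange multiplier that keeps the answer tangent to the simplex, and compare it with the right-hand side of the replicator equation. The coefficients c and the random point are arbitrary choices:

import numpy as np

rng = np.random.default_rng(1)
c = np.array([1.0, 2.0, 3.0])
p = rng.dirichlet(np.ones(3))              # a random interior point of the simplex

f = c * p                                  # f_i = dV/dp_i for V = (1/2) sum_i c_i p_i^2
replicator_rhs = p * (f - p @ f)

# Fisher-metric gradient: find w tangent to the simplex with g(w, v) = dV(v)
# for all tangent v, where g_ij = delta_ij / p_i.  This gives the linear
# system  w_i / p_i + mu = f_i,  sum_i w_i = 0,  with Lagrange multiplier mu.
n = len(p)
M = np.zeros((n + 1, n + 1))
M[:n, :n] = np.diag(1 / p)
M[:n, n] = 1
M[n, :n] = 1
rhs = np.concatenate([f, [0.0]])
w = np.linalg.solve(M, rhs)[:n]

print(np.round(replicator_rhs, 8))
print(np.round(w, 8))                      # the two vectors should agree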

In fact, both these results were first worked out by Shahshahani:

• Siavash Shahshahani, A New Mathematical Framework for the Study of Linkage and Selection, Memoirs of the AMS 17, 1979.

New directions

All this is just the beginning! The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

You can see some very nice versions of this story for maximum likelihood estimators and linear programming here:

• Akio Fujiwara and Shun-ichi Amari, Gradient systems in view of information geometry, Physica D: Nonlinear Phenomena 80 (1995), 317–327.

Indeed, this seems to be the first paper discussing the similarities between evolutionary game theory and information geometry.

Dash Fryer (at Pomona College) and I have generalized this story in several interesting ways.

First, there are two famous ways to generalize the usual formula for entropy: Tsallis entropy and Rényi entropy, both of which involve a parameter q. There are Tsallis and Rényi versions of relative entropy and the Fisher information metric as well. Everything I just explained about:

• conditions under which relative entropy is a Lyapunov function for the replicator equation, and

• conditions under which the replicator equation is a special case of gradient flow

generalize to these cases! However, these generalized entropies give modified versions of the replicator equation. When we set q=1 we get back the usual story. See

• Marc Harper, Escort evolutionary game theory.

My initial interest in these alternate entropies was mostly mathematical—what is so special about the corresponding geometries?—but now researchers are starting to find populations that evolve according to these kinds of modified population dynamics! For example:

• A. Hernando et al, The workings of the Maximum Entropy Principle in collective human behavior.

There’s an interesting special case worth some attention. Lots of people fret about the relative entropy not being a distance function obeying the axioms that mathematicians like: for example, it doesn’t obey the triangle inequality. Many describe the relative entropy as a distance-like function, and this is often a valid interpretation contextually. On the other hand, the q=0 relative entropy is one-half the Euclidean distance squared! In this case the modified version of the replicator equation looks like this:

\displaystyle{ \frac{d p_i}{d t} = f_i(p) - \frac{1}{n} \sum_{j = 1}^n f_j(p) }

This equation is called the projection dynamic.

Later, I showed that there is a reasonable definition of relative entropy for a much larger family of geometries that satisfies a similar distance minimization property.

In a different direction, Dash showed that you can change the way that selection acts by using a variety of alternative ‘incentives’, extending the story to some other well-known equations describing evolutionary dynamics. By replacing the terms p_i f_i(p) in the replicator equation with a variety of other functions, called incentives, we can generate many commonly studied models of evolutionary dynamics. For instance, if we exponentiate the fitness landscape (to make it always positive), we get what is commonly known as the logit dynamic. This amounts to changing the fitness landscape as follows:

\displaystyle{ f_i \mapsto \frac{p_i e^{\beta f_i}}{\sum_j{p_j e^{\beta f_j}}} }

where \beta is known as an inverse temperature in statistical thermodynamics and as an intensity of selection in evolutionary dynamics. There are lots of modified versions of the replicator equation, like the best-reply and projection dynamics, more common in economic applications of evolutionary game theory, and they can all be captured in this family. (There are also other ways to simultaneously capture such families, such as Bill Sandholm’s revision protocols, which were introduced earlier in his exploration of the foundations of game dynamics.)

Dash showed that there is a natural generalization of evolutionarily stable states to ‘incentive stable states’, and that for incentive stable states, the relative entropy is decreasing to zero when the trajectories get near the equilibrium. For the logit and projection dynamics, incentive stable states are simply evolutionarily stable states, and this happens frequently, but not always.

The third generalization is to look at different ‘time-scales’—that is, different ways of describing time! We can make up the symbol \mathbb{T} for a general choice of ‘time-scale’. So far I’ve been treating time as a real number, so

\mathbb{T} = \mathbb{R}

But we can also treat time as coming in discrete evenly spaced steps, which amounts to treating time as an integer:

\mathbb{T} = \mathbb{Z}

More generally, we can make the steps have duration h, where h is any positive real number:

\mathbb{T} = h\mathbb{Z}

There is a nice way to simultaneously describe the cases \mathbb{T} = \mathbb{R} and \mathbb{T} = h\mathbb{Z} using the time-scale calculus and time-scale derivatives. For the time-scale \mathbb{T} = \mathbb{R} the time-scale derivative is just the ordinary derivative. For the time-scale \mathbb{T} = h\mathbb{Z}, the time-scale derivative is given by the difference quotient from first year calculus:

\displaystyle{ f^{\Delta}(z) = \frac{f(z+h) - f(z)}{h} }

and using this as a substitute for the derivative gives difference equations like a discrete-time version of the replicator equation. There are many other choices of time-scale, such as the quantum time-scale given by \mathbb{T} = q^{\mathbb{Z}}, in which case the time-scale derivative is called the q-derivative, but that’s a tale for another time. In any case, the fact that the successive relative entropies are decreasing can be simply stated by saying they have negative \mathbb{T} = h\mathbb{Z} time-scale derivative. The continuous case we started with corresponds to \mathbb{T} = \mathbb{R}.
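To make this concrete, here’s a sketch in Python of the \mathbb{T} = h\mathbb{Z} replicator dynamic for a ‘good’ rock-paper-scissors game, with the payoff for a win (2) bigger than the cost of a loss (1), so the interior equilibrium is evolutionarily stable. Each step replaces the derivative by the difference quotient, so p(t+h) = p(t) + h \, p(t)(f - \langle f \rangle). For small enough h the successive relative entropies should decrease; the step size h = 0.05 and the starting point are arbitrary choices of mine, and I haven’t checked the precise hypotheses of the theorems here:

import numpy as np

A = np.array([[ 0., -1.,  2.],
              [ 2.,  0., -1.],
              [-1.,  2.,  0.]])            # 'good' rock-paper-scissors
q = np.ones(3) / 3
h = 0.05
p = np.array([0.70, 0.20, 0.10])

for step in range(201):
    if step % 40 == 0:
        print(f"t = {step*h:6.2f}   I(q,p) = {np.sum(q*np.log(q/p)):.6f}")
    f = A @ p
    p = p + h * p * (f - p @ f)            # the h-step time-scale replicator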

Remarkably, Dash and I were able to show that you can combine all three of these generalizations into one theorem, and even allow for multiple interacting populations! This produces some really neat population trajectories, such as the following two populations with three types, with fitness functions corresponding to the rock-paper-scissors game. On top we have the replicator equation, which goes along with the Fisher information metric; on the bottom we have the logit dynamic, which goes along with the Euclidean metric on the simplex:

From our theorem, it follows that the relative entropy (ordinary relative entropy on top, the q = 0 entropy on bottom) converges to zero along the population trajectories.

The final form of the theorem is loosely as follows. Pick a Riemannian geometry given by a metric g (obeying some mild conditions) and an incentive for each population, as well as a time scale (\mathbb{R} or h \mathbb{Z}) for every population. This gives an evolutionary dynamic with a natural generalization of evolutionarily stable states, and a suitable version of the relative entropy. Then, if there is an evolutionarily stable state in the interior of the simplex, the sum of the relative entropies for the populations has negative time-scale derivative, so it decreases as the trajectories converge to the stable state!

When there isn’t such a stable state, we still get some interesting population dynamics, like the following:


See this paper for details:

• Marc Harper and Dashiell E. A. Fryer, Stability of evolutionary dynamics on time scales.

Next time we’ll see how to make the main idea work in finite populations, without derivatives or deterministic trajectories!


The Geometry of Quantum Phase Transitions

13 August, 2010

Today at the CQT, Paolo Zanardi from the University of Southern California is giving a talk on “Quantum Fidelity and the Geometry of Quantum Criticality”. Here are my rough notes…

The motto from the early days of quantum information theory was “Information is physical.” You need to care about the physical medium in which information is encoded. But we can also turn it around: “Physics is informational”.

In a “classical phase transition”, thermal fluctuations play a crucial role. At zero temperature these go away, but there can still be different phases depending on other parameters. A transition between phases at zero temperature is called a quantum phase transition. One way to detect a quantum phase transition is simply to notice that the ground state depends very sensitively on the parameters near such a point. We can do this mathematically using a precise way of measuring distances between states: the Fubini-Study metric, which I’ll define below.

Suppose that M is a manifold parametrizing Hamiltonians for a quantum system, so each point x \in M gives a self-adjoint operator H(x) on some finite-dimensional Hilbert space, say \mathbb{C}^n. Of course in the thermodynamic limit (the limit of infinite volume) we expect our quantum system to be described by an infinite-dimensional Hilbert space, but let’s start out with a finite-dimensional one.

Furthermore, let’s suppose each Hamiltonian has a unique ground state, or at least a chosen ground state, say \psi(x). Here x does not indicate a point in space: it’s a point in M, our space of Hamiltonians!

This ground state \psi(x) is really defined only up to phase, so we should think of it as giving an element of the projective space \mathbb{C P}^{n-1}. There’s a god-given metric on projective space, called the Fubini-Study metric. Since we have a map from M to projective space, sending each point x to the state \psi(x) (modulo phase), we can pull back the Fubini-Study metric via this map to get a metric on M.

But, the resulting metric may not be smooth, because \psi(x) may not depend smoothly on x. The metric may have singularities at certain points, especially after we take the thermodynamic limit. We can think of these singular points as being ‘phase transitions’.

If what I said in the last two paragraphs makes no sense, perhaps a version in something more like plain English will be more useful. We’ve got a quantum system depending on some parameters, and there may be points where the ground state of this quantum system depends in a very drastic way on slight changes in the parameters.

But we can also make the math a bit more explicit. What’s the Fubini-Study metric? Given two unit vectors in a Hilbert space, say \psi and \psi', their Fubini-Study distance is just the angle between them:

d(\psi, \psi') = \cos^{-1}|\langle \psi, \psi' \rangle|

This is an honest Riemannian metric on the projective version of the Hilbert space. And in case you’re wondering about the term ‘quantum fidelity’ in the title of Zanardi’s talk, the quantity

|\langle \psi, \psi' \rangle|

is called the fidelity. The fidelity ranges between 0 and 1, and it’s 1 when two unit vectors are the same up to a phase. To convert this into a distance we take the arc-cosine.

When we pull the Fubini-Study metric back to M, we get a Riemannian metric away from the singular points, and in local coordinates this metric is given by the following cool formula:

g_{\mu \nu} = {\rm Re}\left(\langle \partial_\mu \psi, \partial_\nu \psi \rangle - \langle \partial_\mu \psi, \psi \rangle \langle \psi, \partial_\nu \psi \rangle \right)

where \partial_\mu \psi is the derivative of the ground state \psi(x) as we move x in the \muth coordinate direction.

But Michael Berry came up with an even cooler formula for g_{\mu \nu}. Let’s call the eigenstates of the Hamiltonian \psi_n(x), so that

H(x) \psi_n(x) = E_n(x) \, \psi_n(x)

And let’s rename the ground state \psi_0(x), so

\psi(x) = \psi_0(x)

and

H(x) \psi_0(x) = E_0(x) \, \psi_0(x)

Then a calculation like the ones you’d see in first-order perturbation theory shows that

g_{\mu \nu} = {\rm Re} \sum_{n \ne 0} \langle \psi_0 , \partial_\mu H \; \psi_n \rangle \langle \partial_\nu H \; \psi_n, \psi_0 \rangle / (E_n - E_0)^2

This is nice because it shows g_{\mu \nu} is likely to become singular at points where the ground state becomes degenerate, i.e. where two different states both have minimal energy, so some difference E_n - E_0 becomes zero.

To illustrate these ideas, Zanardi did an example: the XY model in an external magnetic field. This is a ‘spin chain’: a bunch of spin-1/2 particles in a row, each interacting with their nearest neighbors. So, for a chain of length L, the Hilbert space is a tensor product of L copies of \mathbb{C}^2:

\mathbb{C}^2 \otimes \cdots \otimes \mathbb{C}^2

The Hamiltonian of the XY model depends on two real parameters \lambda and \gamma. The parameter \lambda describes a magnetic field pointing in the z direction:

H(\lambda, \gamma) = \sum_i \left(\frac{1+\gamma}{2}\right) \, \sigma^x_i \sigma^x_{i+1} \; + \; \left(\frac{1-\gamma}{2}\right) \, \sigma^y_i \sigma^y_{i+1} \; + \; \lambda \sigma_i^z

where the \sigma‘s are the ever-popular Pauli matrices. The first term makes the x components of the spins of neighboring particles want to point in opposite directions when \gamma is big. The second term makes y components of neighboring spins want to point in the same direction when \gamma is big. And the third term makes all the spins want to point up (resp. down) in the z direction when \lambda is big and negative (resp. positive).

What’s our poor spin chain to do, faced with such competing directives? At zero temperature it seeks the state of lowest energy. When \lambda is less than -1 all the spins get polarized in the spin-up state; when it’s bigger than 1 they all get polarized in the spin-down state. For \lambda in between, there is also some sort of phase transition at \gamma = 0. What’s this like? Some sort of transition between ferromagnetic and antiferromagnetic?

We can use a transformation to express this as a fermionic system and solve it exactly. Physicists love exactly solvable systems, so there have been thousands of papers about the XY model. In the thermodynamic limit (L \to +\infty) the ground state can be computed explicitly, so we can explicitly work out the metric d on the parameter space that has \lambda, \gamma as coordinates!

I will not give the formulas (Zanardi did, but they're too scary for me). I'll skip straight to the punchline. Away from phase transitions, we see that for nearby values of the parameters, say

x = (\lambda, \gamma)

and

x' = (\lambda', \gamma')

the ground states have

|\langle \psi(x), \psi(x') \rangle| \sim \exp(-c L)

for some constant c. That's not surprising: the two ground states are locally very similar, but each of the L spins in our chain contributes a little mismatch, and these mismatches multiply, so the overall inner product goes like \exp(-c L).

But at phase transitions, the inner product |\langle \psi(x), \psi(x') \rangle| decays even faster with L:

|\langle \psi(x), \psi(x') \rangle| \sim \exp(-c' L^2)

for some other constant c'.

This is called enhanced orthogonalization since it means the ground states at slightly different values of our parameters get close to orthogonal even faster as L grows. Or in other words: their distance as measured by the metric g_{\mu \nu} grows even faster.
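You can peek at this scaling for very small chains using the xy_hamiltonian helper from the sketch above. Exact diagonalization caps L at around a dozen sites, nowhere near the thermodynamic limit, so treat this as a qualitative illustration only:

import numpy as np

def ground(lam, gamma, L):
    # Reuses xy_hamiltonian from the earlier sketch.
    _, V = np.linalg.eigh(xy_hamiltonian(lam, gamma, L))
    return V[:, 0]

for L in range(4, 11, 2):
    f = abs(np.vdot(ground(0.50, 0.5, L), ground(0.51, 0.5, L)))
    print(L, f, -np.log(f))   # away from criticality, -log f should grow
                              # roughly linearly in L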

This sort of phase transition is an example of a “quantum phase transition”. Note: we’re detecting this phase transition not by looking at the ground state expectation value of a given observable, but by how the ground state itself changes drastically as we change the parameters governing the Hamiltonian.

The exponent of L here — namely the 2 in L^2 — is ‘universal’: i.e., it’s robust with respect to changes in the parameters and even the detailed form of the Hamiltonian.

Zanardi concluded with an argument showing that not every quantum phase transition can be detected by enhanced orthogonalization. For more details, try:

• Silvano Garnerone, N. Tobias Jacobson, Stephan Haas and Paolo Zanardi, Fidelity approach to the disordered quantum XY model.

• Silvano Garnerone, N. Tobias Jacobson, Stephan Haas and Paolo Zanardi, Scaling of the fidelity susceptibility in a disordered quantum spin chain.

For more on the basic concepts, start here:

• Lorenzo Campos Venuti and Paolo Zanardi, Quantum critical scaling of the geometric tensors, Phys. Rev. Lett. 99 (2007), 095701. DOI: 10.1103/PhysRevLett.99.095701.

As a final little footnote, I should add that Paolo Zanardi said the metric g_{\mu \nu} defined above is analogous to the Fisher information metric. So, David Corfield should like this…


Network Theory for Economists

15 January, 2013

Tomorrow I’m giving a talk in the econometrics seminar at U.C. Riverside. I was invited to speak on my work on network theory, so I don’t feel too bad about the fact that I’ll be saying only a little about economics and practically nothing about econometrics. Still, I’ve tried to slant the talk in a way that emphasizes possible applications to economics and game theory. Here are the slides:

Network Theory.

For long-time readers here the fun comes near the end. I explain how reaction networks can be used to describe evolutionary games. I point out that in certain classes of evolutionary games, evolution tends to increase ‘fitness’, and/or lead the players to a ‘Nash equilibrium’. For precise theorems you’ll have to click the links in my talk and read the references!

I conclude with an example: a game with three strategies and seven Nash equilibria. Here evolution makes the proportions of these three strategies follow these flow lines, at least in the limit of large numbers of players:

This picture is from William Sandholm’s nice expository paper:

• William H. Sandholm, Evolutionary game theory, 2007.

I mentioned it before in Information Geometry (Part 12), en route to proving that some quantity always decreases in a class of evolutionary games. Sometime I want to tell the whole story linking:

reaction networks
evolutionary games
the 2nd law of thermodynamics

and

Fisher’s fundamental theorem of natural selection.

But not today! Think of these talk slides as a little appetizer.
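Still, if you want a taste of the dynamics in that picture, here's a minimal sketch of the replicator equation for a three-strategy coordination game. The identity payoff matrix below is my own stand-in, not necessarily the game in Sandholm's figure, though it too has seven Nash equilibria: three pure, three mixed over pairs of strategies, and one fully mixed:

import numpy as np

A = np.eye(3)   # pure coordination game: payoff 1 for matching, 0 otherwise

def replicator_step(x, dt=0.01):
    # One Euler step of the replicator equation
    # dx_i/dt = x_i ((Ax)_i - x.Ax).
    payoff = A @ x
    return x + dt * x * (payoff - x @ payoff)

x = np.array([0.34, 0.33, 0.33])   # start near the fully mixed equilibrium
for _ in range(2000):
    x = replicator_step(x)
print(x)   # close to (1, 0, 0): the slight initial bias toward strategy 1 wins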

