Quantum Information Processing 2011 (Day 1)

11 January, 2011

This year’s session of the big conference on quantum computation, quantum cryptography, and so on is being held in Singapore:

QIP 2011, the 14th Workshop on Quantum Information Processing, 8-14 January 2011, Singapore.

Because the battery on my laptop is old and not very energetic, and I can’t find any sockets in the huge ballroom where the talks are being delivered, I can’t live-blog the talks. So, my reports here will be quite incomplete.

Here are microscopic summaries of just three talks from Monday’s session. You can see arXiv references, slides, and videos of the talks here. I’ll just give links to the slides.

Christian Kurtsiefer gave a nice talk on how to exploit the physics of photodetectors to attack quantum key distribution systems! By cutting the optical fiber and shining a lot of light down both directions, evil Eve can ‘blind’ Alice and Bob’s photodetectors. Then, by shining a quick even brighter pulse of light, she can fool one of their photodetectors into thinking it’s seen a single photon. She can even fool them into thinking they’ve seen a violation of Bell’s inequality, by purely classical means, thanks to the fact that only a small percentage of photons make it down the cable in the first place. Christian and his collaborators have actually done this trick in an experiment here at the CQT!

Tzu-Chieh Wei and Akimasa Miyake gave a two-part joint talk on how the AKLT ground state is universal for measurement-based quantum computation. The AKLT ground state works like this: you’ve got a hexagonal lattice with three spin-1/2 particles at each vertex. Think of each particle as attached to one of the three edges coming out of that vertex. In the ground state, you start by putting the pair of particles at the ends of each edge in the spin-0 (also known as “singlet”, or antisymmetrized) state, and then you project the three particles at each vertex down to the spin-3/2 (completely symmetrized) state. This is indeed the ground state of a cleverly chosen antiferromagnetic Hamiltonian. But has anyone ever prepared this sort of system in the lab?
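
(If you want to see that local projection concretely, here’s a minimal numpy sketch — my own illustration, not something from the talk. It builds the symmetrizer on three spin-1/2’s by averaging over the six permutations, and checks that it projects onto a 4-dimensional subspace, which is exactly the spin-3/2 representation living at each vertex.)

```python
import numpy as np
from itertools import permutations

def perm_op(perm):
    """Permutation operator on (C^2)^{otimes 3}: shuffle the three tensor factors."""
    P = np.zeros((8, 8))
    for i in range(8):
        bits = [(i >> (2 - k)) & 1 for k in range(3)]      # bits of the basis state |b0 b1 b2>
        new_bits = [bits[perm[k]] for k in range(3)]
        j = sum(b << (2 - k) for k, b in enumerate(new_bits))
        P[j, i] = 1.0
    return P

# Averaging over S_3 gives the projector onto the completely symmetric subspace,
# i.e. the spin-3/2 representation sitting inside three spin-1/2's.
sym = sum(perm_op(p) for p in permutations(range(3))) / 6

print(np.allclose(sym @ sym, sym))    # True: it's a projector
print(np.linalg.matrix_rank(sym))     # 4 = dimension of the spin-3/2 representation
```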

David Poulin gave a talk on how to efficiently compute time evolution given a time-dependent quantum Hamiltonian. The trickiness arises from Hamiltonians that change very rapidly with time: in a naive, evenly spaced discretization of the time-ordered exponential, such rapid changes force you to use lots of tiny time steps to get a good approximation. But using a clever randomly chosen discretization you can avoid this problem, at least for uniformly bounded Hamiltonians — those obeying:

\| H(t) \| \le K

for all times t. The key is that the high-frequency part of a time-dependent Hamiltonian only couples faraway energy levels, but a uniformly bounded Hamiltonian doesn’t have faraway energy levels.

A couple more things — really just notes to myself:

Steve Flammia told me about this paper relating the Cramer-Rao bound (which involves Fisher information) to the time-energy uncertainty principle:

• Sergio Boixo, Steven T. Flammia, Carlton M. Caves, and J.M. Geremia, Generalized limits for single-parameter quantum estimation.

Markus Müller told me about a paper mentioning relations between Maxwell’s demon and algorithmic entropy. I need to get some references on this work — it might help me make progress on algorithmic thermodynamics. It’s probably one of these:

• Markus Müller, Quantum Kolmogorov complexity and the quantum Turing machine (PhD thesis).

• Markus Müller, On the quantum Kolmogorov complexity of classical strings, Int. J. Quant. Inf. 7 (2009), 701-711.

Hmm — the first one says:

A concrete proposal for an application of quantum Kolmogorov complexity is to analyze a quantum version of the thought experiment of Maxwell’s demon. In one of the versions of this thought experiment, some microscopic device tries to decrease the entropy of some gas in a box, without the expense of energy, by intelligently opening or closing some little door that separates both halves of the box. It is clear that a device like this cannot work as described, since its existence would violate the second law of thermodynamics. But then, the question is what prevents such a little device (or “demon”) from operating.

Roughly, the answer is that the demon has to make observations to decide whether to close or open the door, and these observations accumulate information. From time to time, the demon must erase this additional information, which is only possible at the expense of energy, due to Landauer’s principle. In Li and Vitanyi’s book An Introduction to Kolmogorov Complexity and Its Applications, this cost of energy is analyzed under very weak assumptions with the help of Kolmogorov complexity. Basically, the energy that the demon can extract from the gas is limited by the difference of the entropy of the gas, plus the difference of the Kolmogorov complexity of the demon’s memory before and after the demon’s actions. The power of this analysis is that it even encloses the case that the demon has a computer to do clever calculations, e.g. to compress the accumulated information before erasing it.

So, I guess I need to reread Li and Vitanyi’s book!


Algorithmic Thermodynamics (Part 2)

6 January, 2011

Here are some notes for a talk tomorrow at the Centre for Quantum Technologies. You can think of this as a kind of followup to my first post on this subject.

Introduction

The idea of entropy arose in thermodynamics, and its connection to probability theory is fundamental to statistical mechanics. Its connection to information was developed later by Shannon. We now think of entropy and information as two sides of the same coin.

But there’s another concept, called algorithmic information, which was developed in work on logic and computer science. This concept is related to Shannon’s earlier notion of information, but it also seems rather different.

For example, Shannon’s ideas let us compute the information of a probability distribution on bit strings. So if we detect a radio signal from an extraterrestrial life form, that sends us a string of a dozen 0’s and 1’s each day, and it seems the string is chosen randomly according to some probability distribution, we can calculate the information per message. But if I just show you a single bit string, like this:

011101100001

it makes no sense to ask what its information is. On the other hand, we can define its algorithmic information.

Nonetheless, my goal here is to show you that algorithmic information can really be seen as a special case of the old probabilistic concept of information. This lets us take all the familiar tools from statistical mechanics and thermodynamics, and apply them to algorithmic information theory. Mike Stay and I started to do this here:

• John Baez and Mike Stay, Algorithmic thermodynamics.

Unfortunately, I’ll only have time to sketch a bit of what we did!

Algorithmic information – first definition

Here’s a definition of algorithmic information. Later we’ll see a better one that’s almost equivalent.

Fix a programming language where a program is a finite bit string and its output, if it halts, is again a finite bit string. Then the algorithmic information of a finite bit string is the length of the shortest program that has that bit string as output.

So, for example, the algorithmic information of a string of a trillion 0’s is low, because you can write a short program that prints this out. On the other hand, the algorithmic information of a long “random” string of bits will be high, because the shortest program to print it out will be a program that says “print out this string: 01011100101100…” — so the program is approximately as long as the string itself: slightly longer, but only by a fixed amount.
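
(Just to make that concrete: in a language like Python, a one-line program names a string vastly longer than itself — a toy illustration, not the formal definition above.)

```python
print("0" * 10**12)   # a one-line program whose output is a trillion 0's -- don't actually run it!
```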

Of course you may ask what “random” means here. I haven’t defined it! But the point is, people have used these ideas to give a definition of what it means for a string of bits to be random. There are different ways to do it. For example: a bit string is ε-random if the shortest program that prints it out is at least ε times as long as the string itself.

You may also wonder: “Doesn’t the definition of algorithmic information depend on the details of our programming language?” And the answer is, “Yes, but not very much.” More precisely, for any two universal programming languages, there’s a constant K such that for any finite bit string, the algorithmic information of that string will differ by at most K depending on which language we use.

Ordinary information

Let’s see how algorithmic information compares to the usual concept of information introduced by Shannon. The usual concept works like this:

If an event of probability p occurs, we define the information gained by learning this event has occurred to be - \log p. We take the logarithm because probabilities of independent events multiply, and we want information to add. The minus sign makes the information positive. We can use any base we want for the logarithm: physicists like e, while computer scientists favor 2.

If there’s a set X of possible outcomes, where the outcome x \in X occurs with probability p_x, then the average or expected amount of information we gain by learning the outcome is

S(p) = - \sum_{x \in X} p_x \log p_x

We call this the entropy of the probability distribution p.
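
(In code, the definition is a one-liner. Here’s a tiny sketch, with the base of the logarithm left as a choice:)

```python
import math

def entropy(p, base=2):
    """Shannon entropy -sum_x p_x log p_x of a probability distribution given as a list."""
    return -sum(px * math.log(px, base) for px in p if px > 0)

print(entropy([0.5, 0.5]))      # 1 bit: a fair coin
print(entropy([1.0]))           # 0 bits: a certain outcome
print(entropy([0.25] * 4))      # 2 bits: four equally likely outcomes
```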

Now, ordinary information doesn’t look very much like algorithmic information! But there’s a second definition of algorithmic information, almost equivalent to the first one, that makes the relation clearer.

Algorithmic information – second definition

First, a minor shift in viewpoint. Before we defined algorithmic information for a finite bit string. Now let’s define it for a natural number. This change in viewpoint is no big deal, since we can set up a one-to-one correspondence between natural numbers and finite bit strings that’s easy to compute.

We’ll still think of our programs as finite bit strings. Not every such string gives a program that halts. We may also demand that bit strings have special features to count as programs. For example, we may demand that programs end in the string 11111, just like a Fortran program must end with the word END.

Let X be the set of programs that halt. Without loss of generality, we can assume that X is prefix-free. This means that if x \in X, no other bit string starting with the string x is also in X. For example, if the string 11111 marks the end of a program, and is only allowed to show up at the end of the program, X will be prefix-free.

Let V(x) be the length of the program x \in X. It’s easy to see that if X is prefix-free,

\sum_{x \in X} 2^{-V(x)} \le 1

Making this sum finite is the reason we want the prefix-free condition — you’ll see exactly why in a minute.
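
(Here’s a small check of that inequality — a sketch using a made-up prefix-free set of bit strings, nothing to do with any real programming language:)

```python
def is_prefix_free(codes):
    """No string in the set is a proper prefix of another string in the set."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

def kraft_sum(codes):
    """Sum of 2^(-length) over the set."""
    return sum(2.0 ** -len(c) for c in codes)

programs = ["0", "10", "110", "11100", "11101"]   # a made-up prefix-free set
print(is_prefix_free(programs))                   # True
print(kraft_sum(programs))                        # 0.9375 <= 1, as the inequality demands
```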

Let N(x) be the output of the program x \in X. Remember, we now think of the output as a natural number. So, we have functions

V, N : X \to \mathbb{N}

We define the algorithmic information of a number n \in \mathbb{N} to be

S(n) = - \log \sum_{x \in X, N(x) = n} 2^{-V(x)}

So: we do a sum over all programs x having the number n as output. We sum up 2^{-V(x)}, where V(x) is the length of the program x. The sum converges because

\sum_{x \in X} 2^{-V(x)} \le 1

That’s where the prefix-free condition comes into play.
Finally, we take minus the logarithm of this sum.

This seems a bit complicated! But suppose there’s just one program x having the number n as output. Then the sum consists of one term, the logarithm cancels out the exponential, and the minus signs cancel too, so we get

S(n) = V(x)

This matches our first definition: the algorithmic information of a number is the length of the shortest program having that number as output.

Of course in this case the shortest program is the only program having that output. If we have more than one program with n as output, the second definition of algorithmic entropy gives a smaller answer than the first definition:

S_{\mathrm{second}}(n) \le S_{\mathrm{first}}(n)

However, there’s a famous theorem, proved by Leonid Levin in 1974, that says the new definition is not very different from the old one. The difference is bounded by a constant. More precisely: for any universal prefix-free programming language, there’s a constant K such that for every n \in \mathbb{N},

S_{\mathrm{second}}(n) \ge S_{\mathrm{first}}(n) - K

Since the algorithmic information depends on our programming language, it was only well-defined within an additive constant in the first place. So, switching from the first definition to the second one doesn’t significantly change the concept. But, it makes the relation to ordinary information a lot easier to see!

From now on we’ll use the second definition.

Algorithmic information versus ordinary information

To relate algorithmic information to ordinary information, we need to get logarithms and probabilities into the game. We see a logarithm here:

S(n) = - \log \sum_{x \in X, N(x) = n} 2^{-V(x)}

but it’s in a funny place, outside the sum — and where are the probabilities?

To solve both these puzzles, let’s define a probability distribution p on the set of natural numbers by

p_n = \frac{1}{Z} \sum_{x \in X, N(x) = n} 2^{-V(x)}

Here Z is a normalizing constant, to make the probabilities sum to 1. Taking

Z = \sum_{x \in X} 2^{-V(x)}

ensures that

\sum_{n \in \mathbb{N}} p_n = 1

Now, notice that

S(n) = - \log p_n \; - \; \log Z

So, up to an additive constant, the algorithmic entropy of the number n is -\log p_n. But -\log p_n is just the information gained upon learning the number n, if we started out knowing only that n was chosen randomly according to the probability distribution p.

What’s the meaning of the probability distribution p? Simple: p_n is the probability that the number n is the output of a randomly chosen program… where ‘randomly chosen’ means chosen according to a certain specific rule. The rule is that increasing the length of a program by 1 makes it half as probable. In other words, the probability q_x of the program x is

q_x = \frac{1}{Z} 2^{-V(x)}

where Z is the normalization factor chosen to make these probabilities sum to 1.

So, the relation between algorithmic information and ordinary information is this:

The algorithmic information of a number is the information gained by learning that number, if beforehand we only knew that it was the output of a randomly chosen program.
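
(Here’s a toy numerical illustration of that relation. The ‘programs’ below are fake — just an invented table of lengths and outputs — but the arithmetic is exactly the arithmetic in the definitions above, and the two ways of computing S(n) agree.)

```python
import math

# hypothetical programs: (length V(x), output N(x)) -- an invented table, not a real language
programs = [(3, 5), (5, 5), (4, 7), (6, 7), (6, 7), (2, 1)]

Z = sum(2.0 ** -V for V, _ in programs)          # the normalizing constant

def p(n):
    """Probability that a randomly chosen program has output n."""
    return sum(2.0 ** -V for V, out in programs if out == n) / Z

def S(n):
    """Algorithmic information of n, second definition (logarithms base 2)."""
    return -math.log2(sum(2.0 ** -V for V, out in programs if out == n))

for n in [5, 7, 1]:
    print(n, S(n), -math.log2(p(n)) - math.log2(Z))   # the last two columns agree
```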

Algorithmic thermodynamics

Why should the probability of the program x \in X be defined as

q_x = \frac{1}{Z} 2^{-V(x)} ?

There is no reason we need to use this probability distribution. However, there’s something good about it. It’s an example of a Gibbs state, meaning a probability distribution that maximizes entropy subject to a constraint on the expected value of some observable. Nature likes to maximize entropy, so Gibbs states are fundamental to statistical mechanics. So is the quantity Z: it’s called the partition function.

The idea works like this: suppose we want to maximize the entropy of a probability distribution p on the set of programs, subject to a constraint on the expected value of the length of the program. Then the answer is

p_x = \frac{1}{Z} e^{-\gamma V(x)}

where \gamma is some number, and the partition function Z ensures that the probabilities sum to 1:

Z = \sum_{x \in X} e^{- \gamma V(x)}

So, p equals our previous probability distribution q when

\gamma = \ln 2

However, it is also interesting to consider other values of \gamma, and in fact Tadaki has already done so:

• K. Tadaki, A generalization of Chaitin’s halting probability and halting self-similar sets, Hokkaido Math. J. 31 (2002), 219–253.

The partition function converges for \gamma \ge \ln 2 but diverges otherwise.

More generally, we can look for the probability distribution that maximizes entropy subject to constraints on the expected value of several observables. For example, let:

E(x) = the logarithm of the runtime of the program x

V(x) = the length of the program x

N(x) = the output of the program x

Then the probability distribution that maximizes entropy subject to constraints on the expected values of these three quantities is

p_x = \frac{1}{Z} e^{-\beta E(x) - \gamma V(x) - \delta N(x)}

where now the partition function is

Z = \sum_{x \in X} e^{-\beta E(x) - \gamma V(x) - \delta N(x)}
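
(To make this concrete, here’s a toy sketch: an invented finite table of ‘programs’, each with a log-runtime E(x), a length V(x) and an output N(x), together with the resulting Gibbs distribution and partition function at some chosen values of \beta, \gamma, \delta. The real X is infinite, so the real subtleties are about convergence — none of that shows up here.)

```python
import numpy as np

# hypothetical programs: E(x) = log of runtime, V(x) = length, N(x) = output (all invented)
E = np.array([1.0, 2.0, 1.5, 3.0, 2.5])
V = np.array([3.0, 5.0, 4.0, 6.0, 2.0])
N = np.array([5.0, 5.0, 7.0, 7.0, 1.0])

beta, gamma, delta = 0.5, np.log(2), 0.1     # chosen values of the conjugate parameters

weights = np.exp(-beta * E - gamma * V - delta * N)
Z = weights.sum()                            # the partition function
p = weights / Z                              # the maximum-entropy (Gibbs) distribution

print("Z =", Z)
print("<E> =", p @ E, "  <V> =", p @ V, "  <N> =", p @ N)
print("entropy =", -(p * np.log(p)).sum())
```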

We’ve chosen our notation to remind experts on statistical mechanics that they’ve seen formulas like this before. The exact same formulas describe a piston full of gas in thermal equilibrium! From a formal, purely mathematical viewpoint:

E is analogous to the internal energy of the gas, and \beta is analogous to 1/T where T is the temperature in units where Boltzmann’s constant is 1.

V is analogous to the volume of the gas, and \gamma is analogous to P/T where P is the pressure.

N is analogous to the number of molecules in the gas, and \delta is analogous to -\mu/T where \mu is the chemical potential.

The analogy here is quite arbitrary. I’m not saying that the length of a program is profoundly similar to the volume of a cylinder of gas; we could have chosen the letter E to stand for the length of a program and everything would still work.

But the analogy works. In other words: now we can take a lot of familiar facts about thermodynamics and instantly translate them into analogous facts about algorithmic information theory! For example, define the entropy of the probability distribution p in the usual way:

S(p) = - \sum_{x \in X} p_x \ln p_x

We can think of this as a function of either T, P, and \mu or the expected values of E, V and N. In thermodynamics we learn lots of equations like this:

\left. \frac{\partial S}{\partial E} \right|_{V,N} = \frac{1}{T}

where by standard abuse of notation we’re using E, V, N to stand for their expected values. And, all these equations remain true in algorithmic thermodynamics!

The interesting part is figuring out what all these equations really mean in the context of algorithmic thermodynamics. For a start on that, try our paper.

For example, we call T algorithmic temperature. If you allow programs to run longer, more of them will halt and give an answer. Thanks to the equation

\left. \frac{\partial S}{\partial E} \right|_{V,N} = \frac{1}{T}

the algorithmic temperature is roughly the number of times you have to double the runtime in order to double the number of programs that satisfy the constraints on length and output.

We also consider the algorithmic analogues of thermodynamic cycles, familiar from old work on steam engines. Charles Babbage described a computer powered by a steam engine; we describe a heat engine powered by programs! The significance of this line of thinking remains fairly mysterious. However, I’m hoping that it will help point the way toward a further synthesis of algorithmic information theory and thermodynamics.


Information Geometry (Part 5)

2 November, 2010

I’m trying to understand the Fisher information metric and how it’s related to Öttinger’s formalism for ‘dissipative mechanics’ — that is, mechanics including friction. They involve similar physics, and they involve similar math, but it’s not quite clear how they fit together.

I think it will help to do an example. The harmonic oscillator is a trusty workhorse throughout physics, so let’s do that.

So: suppose you have a rock hanging on a spring, and it can bounce up and down. Suppose it’s in thermal equilibrium with its environment. It will wiggle up and down ever so slightly, thanks to thermal fluctuations. The hotter it is, the more it wiggles. These vibrations are random, so its position and momentum at any given moment can be treated as random variables.

If we take quantum mechanics into account, there’s an extra source of randomness: quantum fluctuations. Now there will be fluctuations even at zero temperature. Ultimately this is due to the uncertainty principle. Indeed, if you know the position for sure, you can’t know the momentum at all!

Let’s see how the position, momentum and energy of our rock will fluctuate given that we know all three of these quantities on average. The fluctuations will form a little fuzzy blob, roughly ellipsoidal in shape, in the 3-dimensional space whose coordinates are position, momentum and energy:

Yeah, I know you’re sick of this picture, but this time it’s for real: I want to calculate what this ellipsoid actually looks like! I’m not promising I’ll do it — I may get stuck, or bored — but at least I’ll try.

Before I start the calculation, let’s guess the answer. A harmonic oscillator has a position q and momentum p, and its energy is

H = \frac{1}{2}(q^2 + p^2)

Here I’m working in units where lots of things equal 1, to keep things simple.

You’ll notice that this energy has rotational symmetry in the position-momentum plane. This is ultimately what makes the harmonic oscillator such a beloved physical system. So, we might naively guess that our little ellipsoid will have rotational symmetry as well, like this:

or this:

Here I’m using the x and y coordinates for position and momentum, while the z coordinate stands for energy. So in these examples the position and momentum fluctuations are the same size, while the energy fluctuations, drawn in the vertical direction, might be bigger or smaller.

Unfortunately, this guess really is naive. After all, there are lots of these ellipsoids, one centered at each point in position-momentum-energy space. Remember the rules of the game! You give me any point in this space. I take the coordinates of this point as the mean values of position, momentum and energy, and I find the maximum-entropy state with these mean values. Then I work out the fluctuations in this state, and draw them as an ellipsoid.

If you pick a point where position and momentum have mean value zero, you haven’t broken the rotational symmetry of the problem. So, my ellipsoid must be rotationally symmetric. But if you pick some other mean value for position and momentum, all bets are off!

Fortunately, this naive guess is actually right: all the ellipsoids are rotationally symmetric — even the ones centered at nonzero values of position and momentum! We’ll see why soon. And if you’ve been following this series of posts, you’ll know what this implies: the “Fisher information metric” g on position-momentum-energy space has rotational symmetry about any vertical axis. (Again, I’m using the vertical direction for energy.) So, if we slice this space with any horizontal plane, the metric on this plane must be the plane’s usual metric times a constant:

g = \mathrm{constant} \, (dq^2 + dp^2)

Why? Because only the usual metric on the plane, or any multiple of it, has ordinary rotations around every point as symmetries.

So, roughly speaking, we’re recovering the ‘obvious’ geometry of the position-momentum plane from the Fisher information metric. We’re recovering ‘ordinary’ geometry from information geometry!

But this should not be terribly surprising, since we used the harmonic oscillator Hamiltonian

H = \frac{1}{2}(q^2 + p^2)

as an input to our game. It’s mainly just a confirmation that things are working as we’d hope.

There’s more, though. Last time I realized that because observables in quantum mechanics don’t commute, the Fisher information metric has a curious skew-symmetric partner called \omega. So, we should also study this in our example. And when we do, we’ll see that restricted to any horizontal plane in position-momentum-energy space, we get

\omega = \mathrm{constant} \, (dq \, dp - dp \, dq)

This looks like a mutant version of the Fisher information metric

g = \mathrm{constant} \, (dq^2 + dp^2)

and if you know your geometry, you’ll know it’s the usual ‘symplectic structure’ on the position-momentum plane — at least, times some constant.

All this is very reminiscent of Öttinger’s work on dissipative mechanics. But we’ll also see something else: while the constant in g depends on the energy — that is, on which horizontal plane we take — the constant in \omega does not!

Why? It’s perfectly sensible. The metric g on our horizontal plane keeps track of fluctuations in position and momentum. Thermal fluctuations get bigger when it’s hotter — and to boost the average energy of our oscillator, we must heat it up. So, as we increase the energy, moving our horizontal plane further up in position-momentum-energy space, the metric on the plane gets bigger! In other words, our ellipsoids get a fat cross-section at high energies.

On the other hand, the symplectic structure \omega arises from the fact that position q and momentum p don’t commute in quantum mechanics. They obey Heisenberg’s ‘canonical commutation relation’:

q p - p q = i

This relation doesn’t involve energy, so \omega will be the same on every horizontal plane. And it turns out this relation implies

\omega  = \mathrm{constant} \, (dq \, dp - dp \, dq)

for some constant we’ll compute later.

Okay, that’s the basic idea. Now let’s actually do some computations. For starters, let’s see why all our ellipsoids have rotational symmetry!

To do this, we need to understand a bit about the mixed state \rho that maximizes entropy given certain mean values of position, momentum and energy. So, let’s choose the numbers we want for these mean values (also known as ‘expected values’ or ‘expectation values’):

\langle H \rangle = E

\langle q \rangle = q_0

\langle p \rangle = p_0

I hope this isn’t too confusing: H, p, q are our observables which are operators, while E, p_0, q_0 are the mean values we have chosen for them. The state \rho depends on E, p_0 and q_0.

We’re doing quantum mechanics, so position q and momentum p are both self-adjoint operators on the Hilbert space L^2(\mathbb{R}):

(q\psi)(x) = x \psi(x)

(p\psi)(x) = - i \frac{d \psi}{dx}(x)

Indeed all our observables, including the Hamiltonian

H = \frac{1}{2} (p^2 + q^2)

are self-adjoint operators on this Hilbert space, and the state \rho is a density matrix on this space, meaning a positive self-adjoint operator with trace 1.

Now: how do we compute \rho? It’s a Lagrange multiplier problem: maximizing some function given some constraints. And it’s well-known that when you solve this problem, you get

\rho = \frac{1}{Z} e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)}

where \lambda^1, \lambda^2, \lambda^3 are three numbers we yet have to find, and Z is a normalizing factor called the partition function:

Z = \mathrm{tr} (e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)} )

Now let’s look at a special case. If we choose \lambda^1 = \lambda^2 = 0, we’re back to a simpler and more famous problem, namely maximizing entropy subject to a constraint only on energy! The solution is then

\rho = \frac{1}{Z} e^{-\beta H} , \qquad Z = \mathrm{tr} (e^{- \beta H} )

Here I’m using the letter \beta instead of \lambda^3 because this is traditional. This quantity has an important physical meaning! It’s the reciprocal of temperature in units where Boltzmann’s constant is 1.

Anyway, back to our special case! In this special case it’s easy to explicitly calculate \rho and Z. Indeed, people have known how ever since Planck put the ‘quantum’ in quantum mechanics! He figured out how black-body radiation works. A box of hot radiation is just a big bunch of harmonic oscillators in thermal equilibrium. You can work out its partition function by multiplying the partition function of each one.

So, it would be great to reduce our general problem to this special case. To do this, let’s rewrite

\rho = \frac{1}{Z} e^{-(\lambda^1 q + \lambda^2 p + \lambda^3 H)}

in terms of some new variables, like this:

\rho = \frac{1}{Z} e^{-\beta(H - f q - g p)}

where now

Z = \mathrm{tr} (e^{-\beta(H - f q - g p)} )

Think about it! Now our problem is just like an oscillator with a modified Hamiltonian

H' = H - f q - g p

What does this mean, physically? Well, if you push on something with a force f, its potential energy will pick up a term - f q. So, the first two terms are just the Hamiltonian for a harmonic oscillator with an extra force pushing on it!

I don’t know a nice interpretation for the - g p term. We could say that besides the extra force equal to f, we also have an extra ‘gorce’ equal to g. I don’t know what that means. Luckily, I don’t need to! Mathematically, our whole problem is invariant under rotations in the position-momentum plane, so whatever works for q must also work for p.

Now here’s the cool part. We can complete the square:

\begin{aligned} H' & = \frac{1}{2} (q^2 + p^2) -  f q - g p \\ &= \frac{1}{2}(q^2 - 2 q f + f^2) + \frac{1}{2}(p^2 - 2 p g + g^2) - \frac{1}{2}(g^2 + f^2)  \\ &= \frac{1}{2}((q - f)^2 + (p - g)^2)  - \frac{1}{2}(g^2 + f^2)  \end{aligned}

so if we define ‘translated’ position and momentum operators:

q' = q - f, \qquad p' = p - g

we have

H' = \frac{1}{2}({q'}^2 + {p'}^2) -  \frac{1}{2}(g^2 + f^2)

So: apart from a constant, H' is just the harmonic oscillator Hamiltonian in terms of ‘translated’ position and momentum operators!

In other words: we’re studying a strange variant of the harmonic oscillator, where we are pushing on it with an extra force and also an extra ‘gorce’. But this strange variant is exactly the same as the usual harmonic oscillator, except that we’re working in translated coordinates on position-momentum space, and subtracting a constant from the Hamiltonian.

These are pretty minor differences. So, we’ve succeeded in reducing our problem to the problem of a harmonic oscillator in thermal equilibrium at some temperature!

This makes it easy to calculate

Z = \mathrm{tr} (e^{-\beta(H - f q - g p)} ) = \mathrm{tr}(e^{-\beta H'})

By our formula for H', this is just

Z = e^{\frac{\beta}{2}(g^2 + f^2)} \; \mathrm{tr} (e^{-\frac{\beta}{2}({q'}^2 + {p'}^2)})

And the second factor here equals the partition function for the good old harmonic oscillator:

Z = e^{\frac{\beta}{2}(g^2 + f^2)} \; \mathrm{tr} (e^{-\beta H})

So now we’re back to a textbook problem. The eigenvalues of the harmonic oscillator Hamiltonian are

n + \frac{1}{2}

where

n = 0,1,2,3, \dots

So, the eigenvalues of e^{-\beta H} are just

e^{-\beta(n + \frac{1}{2})}

and to take the trace of this operator, we sum up these eigenvalues:

\mathrm{tr}(e^{-\beta H}) = \sum_{n = 0}^\infty e^{-\beta (n + \frac{1}{2})} = \frac{e^{-\beta/2}}{1 - e^{-\beta}}

So:

Z = e^{\frac{\beta}{2}(g^2 + f^2)} \; \frac{e^{-\beta/2}}{1 - e^{-\beta}}
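
(If you don’t trust the geometric series, here’s a two-line numerical check of the trace — a sketch with the Fock-space sum truncated at a large cutoff:)

```python
import numpy as np

beta = 0.7
n = np.arange(200)                              # Fock-space cutoff (plenty for this beta)
print(np.exp(-beta * (n + 0.5)).sum())          # truncated sum for tr(e^{-beta H})
print(np.exp(-beta / 2) / (1 - np.exp(-beta)))  # the closed form e^{-beta/2}/(1 - e^{-beta})
```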

We can now compute the Fisher information metric using this formula:

g_{ij} = \frac{\partial^2}{\partial \lambda^i \partial \lambda^j} \ln Z

if we remember how our new variables are related to the \lambda^i:

\lambda^1 = -\beta f , \qquad \lambda^2 = -\beta g, \qquad \lambda^3 = \beta

It’s just calculus! But I’m feeling a bit tired, so I’ll leave this pleasure to you.

For now, I’d rather go back to our basic intuition about how the Fisher information metric describes fluctuations of observables. Mathematically, this means it’s the real part of the covariance matrix

g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

where for us

X_1 = q, \qquad X_2 = p, \qquad X_3 = H

Here we are taking expected values using the mixed state \rho. We’ve seen this mixed state is just like the maximum-entropy state of a harmonic oscillator at fixed temperature — except for two caveats: we’re working in translated coordinates on position-momentum space, and subtracting a constant from the Hamiltonian. But neither of these two caveats affects the fluctuations (X_i - \langle X_i \rangle) or the covariance matrix.

So, as indeed we’ve already seen, g_{ij} has rotational symmetry in the 1-2 plane. Thus, we’ll completely know it once we know g_{11} = g_{22} and g_{33}; the other components are zero for symmetry reasons. g_{11} will equal the variance of position for a harmonic oscillator at a given temperature, while g_{33} will equal the variance of its energy. We can work these out or look them up.

I won’t do that now: I’m after insight, not formulas. For physical reasons, it’s obvious that g_{11} must diminish with diminishing energy — but not go to zero. Why? Well, as the temperature approaches zero, a harmonic oscillator in thermal equilibrium approaches its state of least energy: the so-called ‘ground state’. In its ground state, the standard deviations of position and momentum are as small as allowed by the Heisenberg uncertainty principle:

\Delta p \Delta q  \ge \frac{1}{2}

and they’re equal, so

g_{11} = (\Delta q)^2 = \frac{1}{2}.
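
(Here’s a numerical version of that statement — a sketch in a truncated Fock basis. The closed form \frac{1}{2}\coth(\beta/2) for the thermal position variance is the standard textbook result, not something derived above; note that it tends to \frac{1}{2} as \beta \to \infty.)

```python
import numpy as np

N = 80                                   # Fock-space truncation (an arbitrary large cutoff)
n = np.arange(N)
a = np.diag(np.sqrt(n[1:]), k=1)         # annihilation operator
q = (a + a.T) / np.sqrt(2)               # position operator in units where hbar = m = omega = 1

def g11(beta):
    """Variance of position in the thermal state at inverse temperature beta."""
    w = np.exp(-beta * (n + 0.5))
    rho = np.diag(w / w.sum())
    return np.trace(rho @ q @ q) - np.trace(rho @ q) ** 2

for beta in [0.5, 2.0, 20.0]:
    print(beta, g11(beta), 0.5 / np.tanh(beta / 2))   # matches (1/2)coth(beta/2) -> 1/2 as T -> 0
```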

That’s enough about the metric. Now, what about the metric’s skew-symmetric partner? This is:

\omega_{ij} = \mathrm{Im} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

Last time we saw that \omega is all about expected values of commutators:

\omega_{ij} = \frac{1}{2i} \langle [X_i, X_j] \rangle

and this makes it easy to compute. For example,

[X_1, X_2] = q p - p q = i

so

\omega_{12} = \frac{1}{2}

Of course

\omega_{11} = \omega_{22} = 0

by skew-symmetry, so we know the restriction of \omega to any horizontal plane. We can also work out other components, like \omega_{13}, but I don’t want to. I’d rather just state this:

Summary: Restricted to any horizontal plane in the position-momentum-energy space, the Fisher information metric for the harmonic oscillator is

g = \mathrm{constant} (dq_0^2 + dp_0^2)

with a constant depending on the temperature, equalling \frac{1}{2} in the zero-temperature limit, and increasing as the temperature rises. Restricted to the same plane, the Fisher information metric’s skew-symmetric partner is

\omega = \frac{1}{2} dq_0 \wedge dp_0

(Remember, the mean values q_0, p_0, E_0 are the coordinates on position-momentum-energy space. We could also use coordinates f, g, \beta or f, g and temperature. In the chatty intro to this article you saw formulas like those above but without the subscripts; that’s before I got serious about using q and p to mean operators.)

And now for the moral. Actually I have two: a physics moral and a math moral.

First, what is the physical meaning of g or \omega when restricted to a plane of constant E_0, or if you prefer, a plane of constant temperature?

Physics Moral: Restricted to a constant-temperature plane, g is the covariance matrix for our observables. It is temperature-dependent. In the zero-temperature limit, the thermal fluctuations go away and g depends only on quantum fluctuations in the ground state. On the other hand, \omega restricted to a constant-temperature plane describes Heisenberg uncertainty relations for noncommuting observables. In our example, it is temperature-independent.

Second, what does this have to do with Kähler geometry? Remember, the complex plane has a complex-valued metric on it, called a Kähler structure. Its real part is a Riemannian metric, and its imaginary part is a symplectic structure. We can think of the complex plane as the position-momentum plane for a point particle. Then the symplectic structure is the basic ingredient needed for Hamiltonian mechanics, while the Riemannian structure is the basic ingredient needed for the harmonic oscillator Hamiltonian.

Math Moral: In the example we considered, \omega restricted to a constant-temperature plane is equal to \frac{1}{2} the usual symplectic structure on the complex plane. On the other hand, g restricted to a constant-temperature plane is a multiple of the usual Riemannian metric on the complex plane — but this multiple is \frac{1}{2} only when the temperature is zero! So, only at temperature zero are g and \omega the real and imaginary parts of a Kähler structure.

It will be interesting to see how much of this stuff is true more generally. The harmonic oscillator is much nicer than your average physical system, so it can be misleading, but I think some of the morals we’ve seen here can be generalized.

Some other time I may say more about how all this is related to Öttinger’s formalism, but the quick point is that he too has mixed states, a symmetric g, and a skew-symmetric \omega. So it’s nice to see whether they match up in an example.

Finally, two footnotes on terminology:

β: In fact, this quantity \beta = 1/kT is so important it deserves a better name than ‘reciprocal of temperature’. How about ‘coolness’? An important lesson from statistical mechanics is that coolness is more fundamental than temperature. This makes some facts more plausible. For example, if you say “you can never reach absolute zero,” it sounds very odd, since you can get as close as you like, and it’s even possible to get negative temperatures — but temperature zero remains tantalizingly out of reach. But “you can never attain infinite coolness” — now that makes sense.

Gorce: I apologize to Richard Feynman for stealing the word ‘gorce’ and using it a different way. Does anyone have a good intuition for what’s going on when you apply my sort of ‘gorce’ to a point particle? You need to think about velocity-dependent potentials, of that I’m sure. In the presence of a velocity-dependent potential, momentum is not just mass times velocity. Which is good: if it were, we could never have a system where the mean value of both q and p stayed constant over time!


Information Geometry (Part 4)

29 October, 2010

Before moving on, I’d like to clear up a mistake I’d been making in all my previous posts on this subject.

(By now I’ve tried to fix those posts, because people often get information from the web in a hasty way, and I don’t want my mistake to spread. But you’ll still see traces of my mistake infecting the comments on those posts.)

So what’s the mistake? It’s embarrassingly simple, but also simple to fix. A Riemannian metric must be symmetric:

g_{ij} = g_{ji}

Now, I had defined the Fisher information metric to be the so-called ‘covariance matrix’:

g_{ij} = \langle (X_i - \langle X_i \rangle) \;(X_j- \langle X_j \rangle)\rangle

where X_i are some observable-valued functions on a manifold M, and the angle brackets mean “expectation value”, computed using a mixed state \rho that also depends on the point in M.

The covariance matrix is symmetric in classical mechanics, since then observables commute, so:

\langle AB \rangle = \langle BA \rangle

But it’s not symmetric in quantum mechanics! After all, suppose q is the position operator for a particle, and p is the momentum operator. Then according to Heisenberg

qp = pq + i

in units where Planck’s constant is 1. Taking expectation values, we get:

\langle qp \rangle = \langle pq \rangle + i

and in particular:

\langle qp \rangle \ne \langle pq \rangle

We can use this to get examples where g_{ij} is not symmetric.

However, it turns out that the real part of the covariance matrix is symmetric, even in quantum mechanics — and that’s what we should use as our Fisher information metric.

Why is the real part of the covariance matrix symmetric, even in quantum mechanics? Well, suppose \rho is any density matrix, and A and B are any observables. Then by definition

\langle AB \rangle = \mathrm{tr} (\rho AB)

so taking the complex conjugate of both sides

\langle AB\rangle^*  = \mathrm{tr}(\rho AB)^* = \mathrm{tr}((\rho A B)^*) = \mathrm{tr}(B^* A^* \rho^*)

where I’m using an asterisk both for the complex conjugate of a number and the adjoint of an operator. But our observables are self-adjoint, and so is our density matrix, so we get

\mathrm{tr}(B^* A^* \rho^*) = \mathrm{tr}(B A \rho) = \mathrm{tr}(\rho B A) = \langle B A \rangle

where in the second step we used the cyclic property of the trace. In short:

\langle AB\rangle^* = \langle BA \rangle

If we take real parts, we get something symmetric:

\mathrm{Re} \langle AB\rangle =  \mathrm{Re} \langle BA \rangle

So, if we redefine the Fisher information metric to be the real part of the covariance matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) \; (X_j- \langle X_j \rangle)\rangle

then it’s symmetric, as it should be.

Last time I mentioned a general setup using von Neumann algebras, that handles the classical and quantum situations simultaneously. That applies here! Taking the real part has no effect in classical mechanics, so we don’t need it there — but it doesn’t hurt, either.

Taking the real part never has any effect when i = j, either, since the expected value of the square of an observable is a nonnegative number:

\langle (X_i - \langle X_i \rangle)^2 \rangle \ge 0

This has two nice consequences.

First, we get

g_{ii} = \langle (X_i - \langle X_i \rangle)^2 \rangle  \ge 0

and since this is true in any coordinate system, our would-be metric g is indeed nonnegative. It’ll be an honest Riemannian metric whenever it’s positive definite.

Second, suppose we’re working in the special case discussed in Part 2, where our manifold is an open subset of \mathbb{R}^n, and \rho at the point x \in \mathbb{R}^n is the Gibbs state with \langle X_i \rangle = x_i. Then all the usual rules of statistical mechanics apply. So, we can compute the variance of the observable X_i using the partition function Z:

\langle (X_i - \langle X_i \rangle)^2 \rangle = \frac{\partial^2}{\partial \lambda_i^2} \ln Z

In other words,

g_{ii} =  \frac{\partial^2}{\partial \lambda_i^2} \ln Z

But since this is true in any coordinate system, we must have

g_{ij} =  \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \ln Z

(Here I’m using a little math trick: two symmetric bilinear forms whose diagonal entries agree in any basis must be equal. We’ve already seen that the left side is symmetric, and the right side is symmetric by a famous fact about mixed partial derivatives.)

However, I’m pretty sure this cute formula

g_{ij} =  \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \ln Z

only holds in the special case I’m talking about now, where points in \mathbb{R}^n are parametrizing Gibbs states in the obvious way. In general we must use

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)(X_j- \langle X_j \rangle)\rangle

or equivalently,

g_{ij} = \mathrm{Re} \, \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})
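
(In the classical special case the cute formula is easy to test numerically. Here’s a sketch with an invented finite sample space and two observables, comparing the covariance matrix with a finite-difference Hessian of \ln Z:)

```python
import numpy as np

rng = np.random.default_rng(1)

# a toy classical system: 10 outcomes, two observables X_1, X_2 with invented values
X = rng.normal(size=(2, 10))

def lnZ(lam):
    return np.log(np.sum(np.exp(-lam @ X)))

def gibbs(lam):
    w = np.exp(-lam @ X)
    return w / w.sum()

lam = np.array([0.3, -0.7])
p = gibbs(lam)
mean = X @ p
cov = (X * p) @ X.T - np.outer(mean, mean)      # covariance matrix of X_1, X_2

# finite-difference Hessian of ln Z with respect to lambda_1, lambda_2
h = 1e-4
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        H[i, j] = (lnZ(lam + ei + ej) - lnZ(lam + ei - ej)
                   - lnZ(lam - ei + ej) + lnZ(lam - ei - ej)) / (4 * h**2)

print(np.allclose(H, cov, atol=1e-4))           # True: the Hessian of ln Z is the covariance
```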

Okay. So much for cleaning up Last Week’s Mess. Here’s something new. We’ve seen that whenever A and B are observables (that is, self-adjoint),

\langle AB\rangle^* = \langle BA \rangle

We got something symmetric by taking the real part:

\mathrm{Re} \langle AB\rangle =  \mathrm{Re} \langle BA \rangle

Indeed,

\mathrm{Re} \langle AB \rangle = \frac{1}{2} \langle AB + BA \rangle

But by the same reasoning, we get something antisymmetric by taking the imaginary part:

\mathrm{Im} \langle AB\rangle =  - \mathrm{Im} \langle BA \rangle

and indeed,

\mathrm{Im} \langle AB \rangle = \frac{1}{2i} \langle AB - BA \rangle
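
(Both identities are easy to check numerically with random matrices — a quick sketch, just to make the bookkeeping vivid:)

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hermitian(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (M + M.conj().T) / 2

d = 4
A, B = random_hermitian(d), random_hermitian(d)

R = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = R @ R.conj().T                 # a random density matrix: positive, self-adjoint, ...
rho /= np.trace(rho)                 # ... and normalized to trace 1

expect = lambda O: np.trace(rho @ O)

print(np.isclose(expect(A @ B).conjugate(), expect(B @ A)))                   # <AB>* = <BA>
print(np.isclose(expect(A @ B).real, expect((A @ B + B @ A) / 2).real))       # Re<AB> = <AB+BA>/2
print(np.isclose(expect(A @ B).imag, (expect(A @ B - B @ A) / (2j)).real))    # Im<AB> = <AB-BA>/2i
```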

Commutators like AB-BA are important in quantum mechanics, so maybe we shouldn’t just throw out the imaginary part of the covariance matrix in our desperate search for a Riemannian metric! Besides the symmetric tensor on our manifold M:

g_{ij} = \mathrm{Re} \, \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

we can also define a skew-symmetric tensor:

\omega_{ij} = \mathrm{Im} \,  \mathrm{tr} (\rho \; \frac{\partial \ln \rho}{\partial \lambda_i} \frac{\partial \ln \rho}{\partial \lambda_j})

This will vanish in the classical case, but not in the quantum case!

If you’ve studied enough geometry, you should now be reminded of things like ‘Kähler manifolds’ and ‘almost Kähler manifolds’. A Kähler manifold is a manifold that’s equipped with a symmetric tensor g and a skew-symmetric tensor \omega which fit together in the best possible way. An almost Kähler manifold is something similar, but not quite as nice. We should probably see examples of these arising in information geometry! And that could be pretty interesting.

But in general, if we start with any old manifold M together with a function \rho taking values in mixed states, we seem to be making M into something even less nice. It gets a symmetric bilinear form g on each tangent space, and a skew-symmetric bilinear form \omega, and they vary smoothly from point to point… but they might be degenerate, and I don’t see any reason for them to ‘fit together’ in the nice way we need for a Kähler or almost Kähler manifold.

However, I still think something interesting might be going on here. For one thing, there are other situations in physics where a space of states is equipped with a symmetric g and a skew-symmetric \omega. They show up in ‘dissipative mechanics’ — the study of systems whose entropy increases.

To conclude, let me remind you of some things I said in week295 of This Week’s Finds. This is a huge digression from information geometry, but I’d like to lay out the puzzle pieces in public view, in case it helps anyone get some good ideas.

I wrote:

• Hans Christian Öttinger, Beyond Equilibrium Thermodynamics, Wiley, 2005.

I thank Arnold Neumaier for pointing out this book! It considers a fascinating generalization of Hamiltonian mechanics that applies to systems with dissipation: for example, electrical circuits with resistors, or mechanical systems with friction.

In ordinary Hamiltonian mechanics the space of states is a manifold and time evolution is a flow on this manifold determined by a smooth function called the Hamiltonian, which describes the energy of any state. In this generalization the space of states is still a manifold, but now time evolution is determined by two smooth functions: the energy and the entropy! In ordinary Hamiltonian mechanics, energy is automatically conserved. In this generalization that’s also true, but energy can go into the form of heat… and entropy automatically increases!

Mathematically, the idea goes like this. We start with a Poisson manifold, but in addition to the skew-symmetric Poisson bracket {F,G} of smooth functions on some manifold, we also have a symmetric bilinear bracket [F,G] obeying the Leibniz law

[F,GH] = [F,G]H + G[F,H]

and this positivity condition:

[F,F] ≥ 0

The time evolution of any function is given by a generalization of Hamilton’s equations:

dF/dt = {H,F} + [S,F]

where H is a function called the "energy" or "Hamiltonian", and S is a function called the "entropy". The first term on the right is the usual one. The new second term describes dissipation: as we shall see, it pushes the state towards increasing entropy.

If we require that

[H,F] = {S,F} = 0

for every function F, then we get conservation of energy, as usual in Hamiltonian mechanics:

dH/dt = {H,H} + [S,H] = 0

But we also get the second law of thermodynamics:

dS/dt = {H,S} + [S,S] ≥ 0

Entropy always increases!

Öttinger calls this framework “GENERIC” – an annoying acronym for “General Equation for the NonEquilibrium Reversible-Irreversible Coupling”. There are lots of papers about it. But I’m wondering if any geometers have looked into it!

If we didn’t need the equations [H,F] = {S,F} = 0, we could easily get the necessary brackets starting with a Kähler manifold. The imaginary part of the Kähler structure is a symplectic structure, say ω, so we can define

{F,G} = ω(dF,dG)

as usual to get Poisson brackets. The real part of the Kähler structure is a Riemannian structure, say g, so we can define

[F,G] = g(dF,dG)

This satisfies

[F,GH] = [F,G]H + G[F,H]

and

[F,F] ≥ 0

Don’t be fooled: this stuff is not rocket science. In particular, the inequality above has a simple meaning: when we move in the direction of the gradient of F, the function F increases. So adding the second term to Hamilton’s equations has the effect of pushing the system towards increasing entropy.

Note that I’m being a tad unorthodox by letting ω and g eat cotangent vectors instead of tangent vectors – but that’s no big deal. The big deal is this: if we start with a Kähler manifold and define brackets this way, we don’t get [H,F] = 0 or {S,F} = 0 for all functions F unless H and S are constant! That’s no good for applications to physics. To get around this problem, we would need to consider some sort of degenerate Kähler structure – one where ω and g are degenerate bilinear forms on the cotangent space.

Has anyone thought about such things? They remind me a little of "Dirac structures" and "generalized complex geometry" – but I don’t know enough about those subjects to know if they’re relevant here.

This GENERIC framework suggests that energy and entropy should be viewed as two parts of a single entity – maybe even its real and imaginary parts! And that in turn reminds me of other strange things, like the idea of using complex-valued Hamiltonians to describe dissipative systems, or the idea of “inverse temperature as imaginary time”. I can’t tell yet if there’s a big idea lurking here, or just a mess….


Information Geometry (Part 3)

25 October, 2010

So far in this series of posts I’ve been explaining a paper by Gavin Crooks. Now I want to go ahead and explain a little research of my own.

I’m not claiming my results are new — indeed I have no idea whether they are, and I’d like to hear from any experts who might know. I’m just claiming that this is some work I did last weekend.

People sometimes worry that if they explain their ideas before publishing them, someone will ‘steal’ them. But I think this overestimates the value of ideas, at least in esoteric fields like mathematical physics. The problem is not people stealing your ideas: the hard part is giving them away. And let’s face it, people in love with math and physics will do research unless you actively stop them. I’m reminded of this scene from the Marx Brothers movie where Harpo and Chico, playing wandering musicians, walk into a hotel and offer to play:

Groucho: What do you fellows get an hour?

Chico: Oh, for playing we getta ten dollars an hour.

Groucho: I see…What do you get for not playing?

Chico: Twelve dollars an hour.

Groucho: Well, clip me off a piece of that.

Chico: Now, for rehearsing we make special rate. Thatsa fifteen dollars an hour.

Groucho: That’s for rehearsing?

Chico: Thatsa for rehearsing.

Groucho: And what do you get for not rehearsing?

Chico: You couldn’t afford it.

So, I’m just rehearsing in public here — but of course I hope to write a paper about this stuff someday, once I get enough material.

Remember where we were. We had considered a manifold — let’s finally give it a name, say M — that parametrizes Gibbs states of some physical system. By Gibbs state, I mean a state that maximizes entropy subject to constraints on the expected values of some observables. And we had seen that in favorable cases, we get a Riemannian metric on M! It looks like this:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

where X_i are our observables, and the angle bracket means ‘expected value’.

All this applies to both classical or quantum mechanics. Crooks wrote down a beautiful formula for this metric in the classical case. But since I’m at the Centre for Quantum Technologies, not the Centre for Classical Technologies, I redid his calculation in the quantum case. The big difference is that in quantum mechanics, observables don’t commute! But in the calculations I did, that didn’t seem to matter much — mainly because I took a lot of traces, which imposes a kind of commutativity:

\mathrm{tr}(AB) = \mathrm{tr}(BA)

In fact, if I’d wanted to show off, I could have done the classical and quantum cases simultaneously by replacing all operators by elements of any von Neumann algebra equipped with a trace. Don’t worry about this much: it’s just a general formalism for treating classical and quantum mechanics on an equal footing. One example is the algebra of bounded operators on a Hilbert space, with the usual concept of trace. Then we’re doing quantum mechanics as usual. But another example is the algebra of suitably nice functions on a suitably nice space, where taking the trace of a function means integrating it. And then we’re doing classical mechanics!

For example, I showed you how to derive a beautiful formula for the metric I wrote down a minute ago:

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

But if we want to do the classical version, we can say Hey, presto! and write it down like this:

g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega

What did I do just now? I changed the trace to an integral over some space \Omega. I rewrote \rho as p to make you think ‘probability distribution’. And I don’t need to take the real part anymore, since everything is already real when we’re doing classical mechanics. Now this metric is the Fisher information metric that statisticians know and love!
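
(For instance, here’s the statisticians’ favourite baby example — the family of coin-flip distributions with bias \theta — where the integral over \Omega collapses to a two-term sum and the metric comes out to the familiar 1/\theta(1-\theta). This example isn’t discussed above; it’s just a sanity check of the classical formula.)

```python
import numpy as np

def fisher_info_bernoulli(theta):
    """Fisher information of the coin-flip family with bias theta, straight from the definition."""
    p = np.array([1 - theta, theta])                   # probabilities of outcomes 0 and 1
    dlnp = np.array([-1 / (1 - theta), 1 / theta])     # d(ln p)/d(theta) for each outcome
    return np.sum(p * dlnp ** 2)

theta = 0.3
print(fisher_info_bernoulli(theta), 1 / (theta * (1 - theta)))   # both 4.7619...
```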

In what follows, I’ll keep talking about the quantum case, but in the back of my mind I’ll be using von Neumann algebras, so everything will apply to the classical case too.

So what am I going to do? I’m going to fix a big problem with the story I’ve told so far.

Here’s the problem: so far we’ve only studied a special case of the Fisher information metric. We’ve been assuming our states are Gibbs states, parametrized by the expectation values of some observables X_1, \dots, X_n. Our manifold M was really just some open subset of \mathbb{R}^n: a point in here was a list of expectation values.

But people like to work a lot more generally. We could look at any smooth function \rho from a smooth manifold M to the set of density matrices for some quantum system. We can still write down the metric

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

in this more general situation. Nobody can stop us! But it would be better if we could derive this formula, as before, starting from a formula like the one we had before:

g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle)  \, \rangle

The challenge is that now we don’t have observables X_i to start with. All we have is a smooth function \rho from some manifold to some set of states. How can we pull observables out of thin air?

Well, you may remember that last time we had

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

where \lambda^i were some functions on our manifold and

Z = \mathrm{tr}(e^{-\lambda^i X_i})

was the partition function. Let’s copy this idea.

So, we’ll start with our density matrix \rho, but then write it as

\rho = \frac{1}{Z} e^{-A}

where A is some self-adjoint operator and

Z = \mathrm{tr} (e^{-A})

(Note that A, like \rho, is really an operator-valued function on M. So, I should write something like A(x) to denote its value at a particular point x \in M, but I won’t usually do that. As usual, I expect some intelligence on your part!)

Now we can repeat some calculations I did last time. As before, let’s take the logarithm of \rho:

\mathrm{ln} \, \rho = -A - \mathrm{ln}\,  Z

and then differentiate it. Suppose \lambda^i are local coordinates near some point of M. Then

\frac{\partial}{\partial \lambda^i} \mathrm{ln}\, \rho = - \frac{\partial}{\partial \lambda^i} A - \frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z

Last time we had nice formulas for both terms on the right-hand side above. To get similar formulas now, let’s define operators

X_i = \frac{\partial}{\partial \lambda^i} A

This gives a nice name to the first term on the right-hand side above. What about the second term? We can calculate it out:

\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = \frac{1}{Z}  \frac{\partial }{\partial \lambda^i} \mathrm{tr}(e^{-A}) = \frac{1}{Z}  \mathrm{tr}(\frac{\partial }{\partial \lambda^i} e^{-A}) = - \frac{1}{Z}  \mathrm{tr}(e^{-A} \frac{\partial}{\partial \lambda^i} A)

where in the last step we differentiate the exponential under the trace — the cyclic property of the trace is what lets us collect the result into this simple form, even though A need not commute with its derivative. Next, use the definition of \rho and X_i, and get:

\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = - \mathrm{tr}(\rho X_i) = - \langle X_i \rangle

This is just what we got last time! Ain’t it fun to calculate when it all works out so nicely?
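
(Since A need not commute with \frac{\partial A}{\partial \lambda^i}, it’s worth checking numerically that the trace really does let us collapse the derivative this way. Here’s a sketch with random Hermitian matrices and a linear family A(\lambda) = A_0 + \lambda A_1, all invented for the test:)

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)

def random_hermitian(d):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return (M + M.conj().T) / 2

d, h = 5, 1e-6
A0, A1 = random_hermitian(d), random_hermitian(d)     # A(lam) = A0 + lam*A1, so dA/dlam = A1

Z = lambda lam: np.trace(expm(-(A0 + lam * A1))).real

lhs = (Z(h) - Z(-h)) / (2 * h)                 # finite-difference dZ/dlam at lam = 0
rhs = -np.trace(expm(-A0) @ A1).real           # -tr(e^{-A} dA/dlam)

print(np.isclose(lhs, rhs, rtol=1e-4))         # True, even though A0 and A1 don't commute
```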

So, putting both terms together, we see

\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho = - X_i + \langle X_i \rangle

or better:

X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho

This is a nice formula for the ‘fluctuation’ of the observables X_i, meaning how much they differ from their expected values. And it looks exactly like the formula we had last time! The difference is that last time we started out assuming we had a bunch of observables, X_i, and defined \rho to be the state maximizing the entropy subject to constraints on the expectation values of all these observables.
Now we’re starting with \rho and working backwards.
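
If you like seeing such identities verified numerically, here is a minimal sketch (my own illustration, not from the post): I pick a hypothetical two-parameter family A(\lambda) = \lambda^1 \sigma_x + \lambda^2 \sigma_z of 2 \times 2 self-adjoint matrices, build \rho = e^{-A}/Z, and compare a finite-difference derivative of \mathrm{ln}\, \rho with -(X_i - \langle X_i \rangle).

```python
import numpy as np
from scipy.linalg import expm, logm

# Hypothetical operator-valued function A(lam) on a 2-parameter manifold:
# A(lam) = lam[0]*sigma_x + lam[1]*sigma_z, so X_i = dA/dlam^i are Pauli matrices.
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
X = [sx, sz]

def rho(lam):
    unnorm = expm(-(lam[0] * sx + lam[1] * sz))
    return unnorm / np.trace(unnorm).real

lam, eps = np.array([0.3, 0.7]), 1e-6
r = rho(lam)

for i in range(2):
    d = np.zeros(2); d[i] = eps
    # central-difference derivative of ln(rho) in the lam^i direction
    d_lnrho = (logm(rho(lam + d)) - logm(rho(lam - d))) / (2 * eps)
    fluctuation = X[i] - np.trace(r @ X[i]).real * np.eye(2)
    print(np.allclose(d_lnrho, -fluctuation, atol=1e-5))   # True, True
```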

From here on out, it’s easy. As before, we can define g_{ij} to be the real part of the covariance matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

Using the formula

X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho

we get

g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j} \rangle

or

g_{ij} = \mathrm{Re}\,\mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j})

Voilà!

When this matrix is positive definite at every point, we get a Riemannian metric on M. Last time I said this is what people call the ‘Bures metric’ — though frankly, now that I examine the formulas, I’m not so sure. But in the classical case, it’s called the Fisher information metric.

Differential geometers like to use \partial_i as a shorthand for \frac{\partial}{\partial \lambda^i}, so they’d write down our metric in a prettier way:

g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho) )

Differential geometers like coordinate-free formulas, so let’s also give a coordinate-free formula for our metric. Suppose x \in M is a point in our manifold, and suppose v,w are tangent vectors to this point. Then

g(v,w) = \mathrm{Re} \, \langle v(\mathrm{ln}\, \rho) \; w(\mathrm{ln} \,\rho) \rangle \; = \; \mathrm{Re} \,\mathrm{tr}(\rho \; v(\mathrm{ln}\, \rho) \; w(\mathrm{ln}\, \rho))

Here \mathrm{ln}\, \rho is a smooth operator-valued function on M, and v(\mathrm{ln}\, \rho) means the derivative of this function in the v direction at the point x.
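
In coordinates, the same finite-difference approach gives the metric components directly. Here is a self-contained sketch for the same hypothetical qubit family as above (again my own illustration, with made-up parameter values):

```python
import numpy as np
from scipy.linalg import expm, logm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def rho(lam):
    # hypothetical family rho(lam) = exp(-(lam^1 sigma_x + lam^2 sigma_z)) / Z
    unnorm = expm(-(lam[0] * sx + lam[1] * sz))
    return unnorm / np.trace(unnorm).real

def metric(lam, eps=1e-6):
    r = rho(lam)
    d_lnrho = []
    for i in range(2):                       # d(ln rho)/d lam^i by central differences
        d = np.zeros(2); d[i] = eps
        d_lnrho.append((logm(rho(lam + d)) - logm(rho(lam - d))) / (2 * eps))
    # g_ij = Re tr( rho  d_i(ln rho)  d_j(ln rho) )
    return np.array([[np.trace(r @ d_lnrho[i] @ d_lnrho[j]).real for j in range(2)]
                     for i in range(2)])

print(metric(np.array([0.3, 0.7])))          # a symmetric, positive definite 2x2 matrix
```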

So, this is all very nice. To conclude, two more points: a technical one, and a more important philosophical one.

First, the technical point. When I said \rho could be any smooth function from a smooth manifold to some set of states, I was actually lying. That’s an important pedagogical technique: the brazen lie.

We can’t really take the logarithm of every density matrix. Remember, we take the log of a density matrix by taking the log of all its eigenvalues. These eigenvalues are ≥ 0, but if one of them is zero, we’re in trouble! The logarithm of zero is undefined.

On the other hand, there’s no problem taking the logarithm of our density-matrix-valued function \rho when it’s positive definite at each point of M. You see, a density matrix is positive definite iff its eigenvalues are all > 0. In this case it has a unique self-adjoint logarithm.

So, we must assume \rho is positive definite. But what’s the physical significance of this ‘positive definiteness’ condition? Well, any density matrix can be diagonalized using some orthonormal basis. It can then be seen as a probabilistic mixture — not a quantum superposition! — of pure states taken from this basis. Its eigenvalues are the probabilities of finding the mixed state to be in one of these pure states. So, saying that its eigenvalues are all > 0 amounts to saying that all the pure states in this orthonormal basis show up with nonzero probability! Intuitively, this means our mixed state is ‘really mixed’. For example, it can’t be a pure state. In math jargon, it means our mixed state is in the interior of the convex set of mixed states.

Second, the philosophical point. Instead of starting with the density matrix \rho, I took A as fundamental. But different choices of A give the same \rho. After all,

\rho = \frac{1}{Z} e^{-A}

where we cleverly divide by the normalization factor

Z = \mathrm{tr} (e^{-A})

to get \mathrm{tr} \, \rho = 1. So, if we multiply e^{-A} by any positive constant, or indeed any positive function on our manifold M, \rho will remain unchanged!

So we have added a little extra information when switching from \rho to A. You can think of this as ‘gauge freedom’, because I’m saying we can do any transformation like

A \mapsto A + f

where

f: M \to \mathbb{R}

is a smooth function. This doesn’t change \rho, so arguably it doesn’t change the ‘physics’ of what I’m doing. It does change Z. It also changes the observables

X_i = \frac{\partial}{\partial \lambda^i} A

But it doesn’t change their ‘fluctuations’

X_i - \langle X_i \rangle

so it doesn’t change the metric g_{ij}.
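
Here is that gauge invariance in code, for the same hypothetical qubit family as before (with a constant gauge function f, just to keep it short; my own sketch, not from the post):

```python
import numpy as np
from scipy.linalg import expm

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)

def state_and_Z(A):
    unnorm = expm(-A)
    Z = np.trace(unnorm).real
    return unnorm / Z, Z

A = 0.3 * sx + 0.7 * sz          # hypothetical choice of A at some point of M
f = 1.7                          # gauge transformation A -> A + f (times the identity)

rho1, Z1 = state_and_Z(A)
rho2, Z2 = state_and_Z(A + f * np.eye(2))

print(np.allclose(rho1, rho2))            # True: rho (and hence g_ij) is unchanged
print(np.isclose(Z2, Z1 * np.exp(-f)))    # True: Z is rescaled, not unchanged
```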

This gauge freedom is interesting, and I want to understand it better. It’s related to something very simple yet mysterious. In statistical mechanics the partition function Z begins life as ‘just a normalizing factor’. If you change the physics so that Z gets multiplied by some number, the Gibbs state doesn’t change. But then the partition function takes on an incredibly significant role as something whose logarithm you differentiate to get lots of physically interesting information! So in some sense the partition function doesn’t matter much… but changes in the partition function matter a lot.

This is just like the split personality of phases in quantum mechanics. On the one hand they ‘don’t matter’: you can multiply a unit vector by any phase and the pure state it defines doesn’t change. But on the other hand, changes in phase matter a lot.

Indeed the analogy here is quite deep: it’s the analogy between probabilities in statistical mechanics and amplitudes in quantum mechanics, the analogy between \mathrm{exp}(-\beta H) in statistical mechanics and \mathrm{exp}(-i t H / \hbar) in quantum mechanics, and so on. This is part of a bigger story about ‘rigs’ which I told back in the Winter 2007 quantum gravity seminar, especially in week13. So, it’s fun to see it showing up yet again… even though I don’t completely understand it here.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


Information Geometry (Part 2)

23 October, 2010

Last time I provided some background to this paper:

• Gavin E. Crooks, Measuring thermodynamic length.

Now I’ll tell you a bit about what it actually says!

Remember the story so far: we’ve got a physical system that’s in a state of maximum entropy. I didn’t emphasize this yet, but that happens whenever our system is in thermodynamic equilibrium. An example would be a box of gas inside a piston. Suppose you choose any number for the energy of the gas and any number for its volume. Then there’s a unique state of the gas that maximizes its entropy, given the constraint that on average, its energy and volume have the values you’ve chosen. And this describes what the gas will be like in equilibrium!

Remember, by ‘state’ I mean mixed state: it’s a probabilistic description. And I say the energy and volume have chosen values on average because there will be random fluctuations. Indeed, if you look carefully at the head of the piston, you’ll see it quivering: the volume of the gas only equals the volume you’ve specified on average. Same for the energy.

More generally: imagine picking any list of numbers, and finding the maximum entropy state where some chosen observables have these numbers as their average values. Then there will be fluctuations in the values of these observables — thermal fluctuations, but also possibly quantum fluctuations. So, you’ll get a probability distribution on the space of possible values of your chosen observables. You should visualize this probability distribution as a little fuzzy cloud centered at the average value!

To a first approximation, this cloud will be shaped like a little ellipsoid. And if you can pick the average value of your observables to be whatever you like, you’ll get lots of little ellipsoids this way, one centered at each point. And the cool idea is to imagine the space of possible values of your observables as having a weirdly warped geometry, such that relative to this geometry, these ellipsoids are actually spheres.

This weirdly warped geometry is an example of an ‘information geometry’: a geometry that’s defined using the concept of information. This shouldn’t be surprising: after all, we’re talking about maximum entropy, and entropy is related to information. But I want to gradually make this idea more precise.

Bring on the math!

We’ve got a bunch of observables X_1, \dots , X_n, and we’re assuming that for any list of numbers x_1, \dots , x_n, the system has a unique maximal-entropy state \rho for which the expected value of the observable X_i is x_i:

\langle X_i \rangle = x_i

This state \rho is called the Gibbs state and I told you how to find it when it exists. In fact it may not exist for every list of numbers x_1, \dots , x_n, but we’ll be perfectly happy if it does for all choices of

x = (x_1, \dots, x_n)

lying in some open subset of \mathbb{R}^n.

By the way, I should really call this Gibbs state \rho(x) or something to indicate how it depends on x, but I won’t usually do that. I expect some intelligence on your part!

Now at each point x we can define a covariance matrix

\langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

If we take its real part, we get a symmetric matrix:

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

It’s also nonnegative — that’s easy to see, since for any real coefficients c^i the number c^i c^j g_{ij} is just the variance of the observable c^i X_i, which can’t be negative. When we’re lucky this matrix will be positive definite. When we’re even luckier, it will depend smoothly on x. In this case, g_{ij} will define a Riemannian metric on our open set.

So far this is all review of last time. Sorry: I seem to have reached the age where I can’t say anything interesting without warming up for about 15 minutes first. It’s like when my mom tells me about an exciting event that happened to her: she starts by saying “Well, I woke up, and it was cloudy out…”

But now I want to give you an explicit formula for the metric g_{ij}, and then rewrite it in a way that’ll work even when the state \rho is not a maximal-entropy state. And this formula will then be the general definition of the ‘Fisher information metric’ (if we’re doing classical mechanics), or a quantum version thereof (if we’re doing quantum mechanics).

Crooks does the classical case — so let’s do the quantum case, okay? Last time I claimed that in the quantum case, our maximum-entropy state is the Gibbs state

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

where \lambda^i are the ‘conjugate variables’ of the observables X_i, we’re using the Einstein summation convention to sum over repeated indices that show up once upstairs and once downstairs, and Z is the partition function

Z = \mathrm{tr} (e^{-\lambda^i X_i})

(To be honest: last time I wrote the indices on the conjugate variables \lambda^i as subscripts rather than superscripts, because I didn’t want some poor schlep out there to think that \lambda^1, \dots , \lambda^n were the powers of some number \lambda. But now I’m assuming you’re all grown up and ready to juggle indices! We’re doing Riemannian geometry, after all.)

Also last time I claimed that it’s tremendously fun and enlightening to take the derivative of the logarithm of Z. The reason is that it gives you the mean values of your observables:

\langle X_i \rangle = - \frac{\partial}{\partial \lambda^i} \ln Z
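
Incidentally, this particular identity holds exactly even when the observables don’t commute. Here’s a tiny numerical check of my own, with a made-up pair of Pauli observables (not part of the post):

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical noncommuting observables: X_1 = sigma_x, X_2 = sigma_z.
X = [np.array([[0, 1], [1, 0]], dtype=complex),
     np.array([[1, 0], [0, -1]], dtype=complex)]

def lnZ(lam):
    return np.log(np.trace(expm(-(lam[0] * X[0] + lam[1] * X[1]))).real)

lam, eps = np.array([0.3, 0.7]), 1e-6
rho = expm(-(lam[0] * X[0] + lam[1] * X[1]))
rho = rho / np.trace(rho).real

for i in range(2):
    d = np.zeros(2); d[i] = eps
    dlnZ = (lnZ(lam + d) - lnZ(lam - d)) / (2 * eps)   # d(ln Z)/d lam^i
    mean = np.trace(rho @ X[i]).real                   # <X_i> in the Gibbs state
    print(np.isclose(dlnZ, -mean))                     # True, True
```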

But now let’s take the derivative of the logarithm of \rho. Remember, \rho is an operator — in fact a density matrix. But we can take its logarithm as explained last time, and the usual rules apply, so starting from

\rho = \frac{1}{Z} e^{-\lambda^i X_i}

we get

\mathrm{ln}\, \rho = - \lambda^i X_i - \mathrm{ln} \,Z

Next, let’s differentiate both sides with respect to \lambda^i. Why? Well, from what I just said, you should be itching to differentiate \mathrm{ln}\, Z. So let’s give in to that temptation:

\frac{\partial  }{\partial \lambda^i} \mathrm{ln}  \rho = -X_i + \langle X_i \rangle

Hey! Now we’ve got a formula for the ‘fluctuation’ of the observable X_i — that is, how much it differs from its mean value:

X_i - \langle X_i \rangle = - \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i}

This is incredibly cool! I should have learned this formula decades ago, but somehow I just bumped into it now. I knew of course that \mathrm{ln} \, \rho shows up in the formula for the entropy:

S(\rho) = -\mathrm{tr} ( \rho \; \mathrm{ln} \, \rho)

But I never had the brains to think about \mathrm{ln}\, \rho all by itself. So I’m really excited to discover that it’s an interesting entity in its own right — and fun to differentiate, just like \mathrm{ln}\, Z.

Now we get our cool formula for g_{ij}. Remember, it’s defined by

g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

But now that we know

X_i - \langle X_i \rangle = -\frac{\partial \mathrm{ln} \rho }{\partial \lambda^i}

we get the formula we were looking for:

g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j}  \rangle

Beautiful, eh? And of course the expected value of any observable A in the state \rho is

\langle A \rangle = \mathrm{tr}(\rho A)

so we can also write the covariance matrix like this:

g_{ij} = \mathrm{Re}\, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )

Lo and behold! This formula makes sense whenever \rho is any density matrix depending smoothly on some parameters \lambda^i. We don’t need it to be a Gibbs state! So, we can work more generally.

Indeed, whenever we have any smooth function from a manifold to the space of density matrices for some Hilbert space, we can define g_{ij} by the above formula! And when it’s positive definite, we get a Riemannian metric on our manifold: the Bures information metric.

The classical analogue is the somewhat more well-known ‘Fisher information metric’. When we go from quantum to classical, operators become functions and traces become integrals. There’s nothing complex anymore, so taking the real part becomes unnecessary. So the Fisher information metric looks like this:

g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega

Here I’m assuming we’ve got a smooth function p from some manifold M to the space of probability distributions on some measure space (\Omega, d\omega). Working in local coordinates \lambda^i on our manifold M, the above formula defines a Riemannian metric on M, at least when g_{ij} is positive definite. And that’s the Fisher information metric!
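
As a concrete classical example (a standard one of my own choosing, not taken from Crooks’ paper): let p(\cdot \mid \lambda) be a Gaussian on \Omega = \mathbb{R} with mean \lambda and fixed width \sigma. The integral above can then be done numerically, and it reproduces the textbook Fisher information 1/\sigma^2:

```python
import numpy as np

# Classical Fisher information for a one-parameter family of probability densities:
# p(omega | lam) = Gaussian with mean lam and fixed standard deviation sigma.
sigma = 2.0
omega = np.linspace(-30.0, 30.0, 20001)      # a grid standing in for (Omega, d omega)
d_omega = omega[1] - omega[0]

def p(lam):
    return np.exp(-(omega - lam) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

lam, eps = 0.5, 1e-5
dlnp = (np.log(p(lam + eps)) - np.log(p(lam - eps))) / (2 * eps)   # d(ln p)/d lam

g = np.sum(p(lam) * dlnp * dlnp) * d_omega   # the integral defining g_{11}
print(g, 1 / sigma**2)                       # both approximately 0.25
```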

Crooks says more: he describes an experiment that would let you measure the length of a path with respect to the Fisher information metric — at least in the case where the state \rho(x) is the Gibbs state with \langle X_i \rangle = x_i. And that explains why he calls it ‘thermodynamic length’.

There’s a lot more to say about this, and also about another question: What use is the Fisher information metric in the general case where the states \rho(x) aren’t Gibbs states?

But it’s dinnertime, so I’ll stop here.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]


Information Geometry (Part 1)

22 October, 2010

I’d like to provide a bit of background to this interesting paper:

• Gavin E. Crooks, Measuring thermodynamic length.

which was pointed out by John F in our discussion of entropy and uncertainty.

The idea here should work for either classical or quantum statistical mechanics. The paper describes the classical version, so just for a change of pace let me describe the quantum version.

First a lightning review of quantum statistical mechanics. Suppose you have a quantum system with some Hilbert space. When you know as much as possible about your system, then you describe it by a unit vector in this Hilbert space, and you say your system is in a pure state. Sometimes people just call a pure state a ‘state’. But that can be confusing, because in statistical mechanics you also need more general ‘mixed states’ where you don’t know as much as possible. A mixed state is described by a density matrix, meaning a positive operator \rho with trace equal to 1:

\mathrm{tr}(\rho) = 1

The idea is that any observable is described by a self-adjoint operator A, and the expected value of this observable in the mixed state \rho is

\langle A \rangle = \mathrm{tr}(\rho A)

The entropy of a mixed state is defined by

S(\rho) = -\mathrm{tr}(\rho \; \mathrm{ln} \, \rho)

where we take the logarithm of the density matrix just by taking the log of each of its eigenvalues, while keeping the same eigenvectors. This formula for entropy should remind you of the one that Gibbs and Shannon used — the one I explained a while back.
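
In code, that recipe for the logarithm looks like this (a minimal sketch with a made-up qubit density matrix; scipy.linalg.logm does the same job and is used here only as a cross-check):

```python
import numpy as np
from scipy.linalg import logm

# Take the log of a density matrix by logging its eigenvalues, keeping its eigenvectors.
rho = np.array([[0.7, 0.2],
                [0.2, 0.3]])                  # a made-up qubit state: positive definite, trace 1
evals, evecs = np.linalg.eigh(rho)
assert np.all(evals > 0), "ln(rho) is undefined unless rho is positive definite"

ln_rho = evecs @ np.diag(np.log(evals)) @ evecs.conj().T
print(np.allclose(ln_rho, logm(rho)))         # True
print(-np.trace(rho @ ln_rho))                # the entropy S(rho), roughly 0.52
```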

Back then I told you about the ‘Gibbs ensemble’: the mixed state that maximizes entropy subject to the constraint that some observable have a given value. We can do the same thing in quantum mechanics, and we can even do it for a bunch of observables at once. Suppose we have some observables X_1, \dots, X_n and we want to find the mixed state \rho that maximizes entropy subject to these constraints:

\langle X_i \rangle = x_i

for some numbers x_i. Then a little exercise in Lagrange multipliers shows that the answer is the Gibbs state:

\rho = \frac{1}{Z} \mathrm{exp}(-(\lambda_1 X_1 + \cdots + \lambda_n X_n))

Huh?

This answer needs some explanation. First of all, the numbers \lambda_1, \dots, \lambda_n are called Lagrange multipliers. You have to choose them right to get

\langle X_i \rangle = x_i

So, in favorable cases, they will be functions of the numbers x_i. And when you’re really lucky, you can solve for the numbers x_i in terms of the numbers \lambda_i. We call \lambda_i the conjugate variable of the observable X_i. For example, the conjugate variable of energy is inverse temperature!

Second of all, we take the exponential of a self-adjoint operator just as we took the logarithm of a density matrix: just take the exponential of each eigenvalue.

(At least this works when our self-adjoint operator has only eigenvalues in its spectrum, not any continuous spectrum. Otherwise we need to get serious and use the functional calculus. Luckily, if your system’s Hilbert space is finite-dimensional, you can ignore this parenthetical remark!)

But third: what’s that number Z? It begins life as a humble normalizing factor. Its job is to make sure \rho has trace equal to 1:

Z = \mathrm{tr}(\mathrm{exp}(-(\lambda_1 X_1 + \cdots + \lambda_n X_n)))

However, once you get going, it becomes incredibly important! It’s called the partition function of your system.

As an example of what it’s good for, it turns out you can compute the numbers x_i as follows:

x_i = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

In other words, you can compute the expected values of the observables X_i by differentiating the log of the partition function:

\langle X_i \rangle = - \frac{\partial}{\partial \lambda_i} \mathrm{ln} Z

Or in still other words,

\langle X_i \rangle = - \frac{1}{Z} \; \frac{\partial Z}{\partial \lambda_i}

To believe this you just have to take the equations I’ve given you so far and mess around — there’s really no substitute for doing it yourself. I’ve done it fifty times, and every time I feel smarter.

But we can go further: after the ‘expected value’ or ‘mean’ of an observable comes its variance, which is the square of its standard deviation:

(\Delta A)^2 = \langle A^2 \rangle - \langle A \rangle^2

This measures the size of fluctuations around the mean. And in the Gibbs state, we can compute the variance of the observable X_i as the second derivative of the log of the partition function:

\langle X_i^2 \rangle - \langle X_i \rangle^2 =  \frac{\partial^2}{\partial \lambda_i^2} \mathrm{ln} Z

Again: calculate and see.
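
Or let a computer do the calculating. Here’s a minimal sketch of my own, with a single made-up observable on a three-level system: it compares \langle X \rangle and the variance of X against the first and second derivatives of \mathrm{ln}\, Z, computed by finite differences.

```python
import numpy as np
from scipy.linalg import expm

X = np.diag([1.0, 0.0, -1.0])      # a made-up observable; Gibbs state rho = exp(-lam*X)/Z

def lnZ(lam):
    return np.log(np.trace(expm(-lam * X)))

def gibbs(lam):
    unnorm = expm(-lam * X)
    return unnorm / np.trace(unnorm)

lam, eps = 0.4, 1e-4
r = gibbs(lam)
mean = np.trace(r @ X)
var = np.trace(r @ X @ X) - mean**2

dlnZ = (lnZ(lam + eps) - lnZ(lam - eps)) / (2 * eps)
d2lnZ = (lnZ(lam + eps) - 2 * lnZ(lam) + lnZ(lam - eps)) / eps**2

print(np.isclose(dlnZ, -mean))     # True:  <X> = -d(ln Z)/d lam
print(np.isclose(d2lnZ, var))      # True:  variance = d^2(ln Z)/d lam^2
```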

But when we’ve got lots of observables, there’s something better than the variance of each one. There’s the covariance matrix of the whole lot of them! Each observable X_i fluctuates around its mean value x_i… but these fluctuations are not independent! They’re correlated, and the covariance matrix says how.

All this is very visual, at least for me. If you imagine the fluctuations as forming a blurry patch near the point (x_1, \dots, x_n), this patch will be ellipsoidal in shape, at least when all our random fluctuations are Gaussian. And then the shape of this ellipsoid is precisely captured by the covariance matrix! In particular, the eigenvectors of the covariance matrix will point along the principal axes of this ellipsoid, and the eigenvalues will say how stretched out the ellipsoid is in each direction!
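
That ellipsoid picture is easy to play with numerically. Here is a small illustration of my own (nothing to do with Gibbs states in particular): sample a correlated 2d Gaussian, estimate its covariance matrix, and read off the principal axes from the eigenvectors and eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
cov_true = np.array([[2.0, 0.8],
                     [0.8, 0.5]])            # a made-up covariance matrix
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=100_000)

cov_est = np.cov(samples.T)                  # estimated covariance matrix
evals, evecs = np.linalg.eigh(cov_est)

print(cov_est)   # close to cov_true
print(evals)     # variances along the principal axes of the elliptical cloud
print(evecs)     # columns: directions of those principal axes
```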

To understand the covariance matrix, it may help to start by rewriting the variance of a single observable A as

(\Delta A)^2 = \langle (A - \langle A \rangle)^2 \rangle

That’s a lot of angle brackets, but the meaning should be clear. First we look at the difference between our observable and its mean value, namely

A - \langle A \rangle

Then we square this, to get something that’s big and positive whenever our observable is far from its mean. Then we take the mean value of that, to get an idea of how far our observable is from the mean on average.

We can use the same trick to define the covariance of a bunch of observables X_i. We get an n \times n matrix called the covariance matrix, whose entry in the ith row and jth column is

\langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

If you think about it, you can see that this will measure correlations in the fluctuations of your observables.

An interesting difference between classical and quantum mechanics shows up here. In classical mechanics the covariance matrix is always symmetric — but not in quantum mechanics! You see, in classical mechanics, whenever we have two observables A and B, we have

\langle A B \rangle = \langle B A \rangle

since observables commute. But in quantum mechanics this is not true! For example, consider the position q and momentum p of a particle. In units where \hbar = 1, we have

q p = p q + i

so taking expectation values we get

\langle q p \rangle = \langle p q \rangle + i

So, it’s easy to get a non-symmetric covariance matrix when our observables X_i don’t commute. However, the real part of the covariance matrix is symmetric, even in quantum mechanics. So let’s define

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle

You can check that the matrix entries here are the second derivatives of the logarithm of the partition function (exactly so when the observables X_i commute; when they don’t, differentiating \mathrm{ln}\, Z twice gives a closely related ‘Kubo-Mori’ version of the covariance):

g_{ij} = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z

And now for the cool part: this is where information geometry comes in! Suppose that for any choice of values x_i we have a Gibbs state with

\langle X_i \rangle = x_i

Then for each point

x = (x_1, \dots , x_n) \in \mathbb{R}^n

we have a matrix

g_{ij} =  \mathrm{Re}  \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle = \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} \mathrm{ln} Z

And this matrix is not only symmetric, it’s also positive. And when it’s positive definite we can think of it as an inner product on the tangent space of the point x. In other words, we get a Riemannian metric on \mathbb{R}^n. This is called the Fisher information metric.
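
Here is the whole package in a case simple enough to check everything at once: a sketch of my own with two made-up commuting (diagonal) observables on a three-level system, so that the identity between the covariance matrix and the Hessian of \mathrm{ln}\, Z holds on the nose (for noncommuting observables the comparison is more delicate, as noted above).

```python
import numpy as np
from scipy.linalg import expm

# Two made-up commuting (diagonal) observables on a 3-level system.
X = [np.diag([1.0, -1.0, 0.0]), np.diag([0.0, 1.0, -1.0])]

def lnZ(lam):
    return np.log(np.trace(expm(-(lam[0] * X[0] + lam[1] * X[1]))))

def gibbs(lam):
    unnorm = expm(-(lam[0] * X[0] + lam[1] * X[1]))
    return unnorm / np.trace(unnorm)

lam, eps = np.array([0.2, 0.5]), 1e-4
r = gibbs(lam)
means = [np.trace(r @ Xi) for Xi in X]

# g_ij as the covariance matrix of the fluctuations ...
g_cov = np.array([[np.trace(r @ (X[i] - means[i] * np.eye(3)) @ (X[j] - means[j] * np.eye(3)))
                   for j in range(2)] for i in range(2)])

# ... and g_ij as the Hessian of ln Z, by central differences.
def hess(i, j):
    ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
    return (lnZ(lam + ei + ej) - lnZ(lam + ei - ej)
            - lnZ(lam - ei + ej) + lnZ(lam - ei - ej)) / (4 * eps**2)

g_hess = np.array([[hess(i, j) for j in range(2)] for i in range(2)])
print(np.allclose(g_cov, g_hess, atol=1e-5))   # True
```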

I hope you can see through the jargon to the simple idea. We’ve got a space. Each point x in this space describes the maximum-entropy state of a quantum system for which our observables have specified mean values. But in each of these states, the observables are random variables. They don’t just sit at their mean value, they fluctuate! You can picture these fluctuations as forming a little smeared-out blob in our space. To a first approximation, this blob is an ellipsoid. And if we think of this ellipsoid as a ‘unit ball’, it gives us a standard for measuring the length of any little vector sticking out of our point. In other words, we’ve got a Riemannian metric: the Fisher information metric!

Now if you look at the Wikipedia article you’ll see a more general but to me somewhat scarier definition of the Fisher information metric. This applies whenever we’ve got a manifold whose points label arbitrary mixed states of a system. But Crooks shows this definition reduces to his — the one I just described — when our manifold is \mathbb{R}^n and it’s parametrizing Gibbs states in the way we’ve just seen.

More precisely: both Crooks and the Wikipedia article describe the classical story, but it parallels the quantum story I’ve been telling… and I think the quantum version is well-known. I believe the quantum version of the Fisher information metric is sometimes called the Bures metric, though I’m a bit confused about what the Bures metric actually is.

[Note: in the original version of this post, I omitted the real part in my definition g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle)  (X_j - \langle X_j \rangle)  \rangle , giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]

