## Information Geometry (Part 3)

So far in this series of posts I’ve been explaining a paper by Gavin Crooks. Now I want to go ahead and explain a little research of my own.

I’m not claiming my results are new — indeed I have no idea whether they are, and I’d like to hear from any experts who might know. I’m just claiming that this is some work I did last weekend.

People sometimes worry that if they explain their ideas before publishing them, someone will ‘steal’ them. But I think this overestimates the value of ideas, at least in esoteric fields like mathematical physics. The problem is not people stealing your ideas: the hard part is giving them away. And let’s face it, people in love with math and physics will do research unless you actively stop them. I’m reminded of this scene from the Marx Brothers movie where Harpo and Chico, playing wandering musicians, walk into a hotel and offer to play:

Groucho: What do you fellows get an hour?

Chico: Oh, for playing we getta ten dollars an hour.

Groucho: I see…What do you get for not playing?

Chico: Twelve dollars an hour.

Groucho: Well, clip me off a piece of that.

Chico: Now, for rehearsing we make special rate. Thatsa fifteen dollars an hour.

Groucho: That’s for rehearsing?

Chico: Thatsa for rehearsing.

Groucho: And what do you get for not rehearsing?

Chico: You couldn’t afford it.

So, I’m just rehearsing in public here, but of course I hope to write a paper about this stuff someday, once I get enough material.

Remember where we were. We had considered a manifold — let’s finally give it a name, say $M$ — that parametrizes Gibbs states of some physical system. By Gibbs state, I mean a state that maximizes entropy subject to constraints on the expected values of some observables. And we had seen that in favorable cases, we get a Riemannian metric on $M$! It looks like this:

$g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) (X_j - \langle X_j \rangle) \rangle$

where $X_i$ are our observables, and the angle bracket means ‘expected value’.

All this applies to both classical and quantum mechanics. Crooks wrote down a beautiful formula for this metric in the classical case. But since I’m at the Centre for Quantum Technologies, not the Centre for Classical Technologies, I redid his calculation in the quantum case. The big difference is that in quantum mechanics, observables don’t commute! But in the calculations I did, that didn’t seem to matter much, mainly because I took a lot of traces, which imposes a kind of commutativity:

$\mathrm{tr}(AB) = \mathrm{tr}(BA)$

In fact, if I’d wanted to show off, I could have done the classical and quantum cases simultaneously by replacing all operators by elements of any von Neumann algebra equipped with a trace. Don’t worry about this much: it’s just a general formalism for treating classical and quantum mechanics on an equal footing. One example is the algebra of bounded operators on a Hilbert space, with the usual concept of trace. Then we’re doing quantum mechanics as usual. But another example is the algebra of suitably nice functions on a suitably nice space, where taking the trace of a function means integrating it. And then we’re doing classical mechanics!

For example, I showed you how to derive a beautiful formula for the metric I wrote down a minute ago:

$g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )$

But if we want to do the classical version, we can say Hey, presto! and write it down like this:

$g_{ij} = \int_\Omega p(\omega) \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} p(\omega) }{\partial \lambda^j} \; d \omega$

What did I do just now? I changed the trace to an integral over some space $\Omega$. I rewrote $\rho$ as $p$ to make you think ‘probability distribution’. And I don’t need to take the real part anymore, since everything is already real when we’re doing classical mechanics. Now this metric is the Fisher information metric that statisticians know and love!
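To make the classical formula concrete, here is a minimal sketch for a finite sample space $\Omega$, where the integral becomes a plain sum. The function name `fisher_metric`, the example family `three_outcomes`, and the use of central finite differences for $\partial \ln p / \partial \lambda^i$ are all assumptions made for illustration, not anything taken from Crooks’ paper.

```python
import math

def fisher_metric(p, lam, h=1e-5):
    """Fisher information metric g_ij at the point lam.

    p(lam) returns the list of probabilities p(omega) over a finite Omega,
    so the integral over Omega becomes a plain sum.
    """
    n = len(lam)
    probs = p(lam)

    def dlog(i):
        # Central finite difference of ln p(omega) in the lambda^i direction.
        up = list(lam)
        dn = list(lam)
        up[i] += h
        dn[i] -= h
        return [(math.log(a) - math.log(b)) / (2 * h)
                for a, b in zip(p(up), p(dn))]

    d = [dlog(i) for i in range(n)]
    return [[sum(pw * di * dj for pw, di, dj in zip(probs, d[i], d[j]))
             for j in range(n)] for i in range(n)]

# Hypothetical example family: three outcomes with probabilities (a, b, 1 - a - b).
def three_outcomes(lam):
    a, b = lam
    return [a, b, 1.0 - a - b]

g = fisher_metric(three_outcomes, [0.2, 0.3])
for row in g:
    print(row)
```

For this family the sum can be done by hand: $g_{11} = 1/a + 1/c$, $g_{12} = 1/c$ and $g_{22} = 1/b + 1/c$ with $c = 1 - a - b$, so the numerical matrix should agree with these values up to finite-difference error.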

In what follows, I’ll keep talking about the quantum case, but in the back of my mind I’ll be using von Neumann algebras, so everything will apply to the classical case too.

So what am I going to do? I’m going to fix a big problem with the story I’ve told so far.

Here’s the problem: so far we’ve only studied a special case of the Fisher information metric. We’ve been assuming our states are Gibbs states, parametrized by the expectation values of some observables $X_1, \dots, X_n$. Our manifold $M$ was really just some open subset of $\mathbb{R}^n$: a point in here was a list of expectation values.

But people like to work a lot more generally. We could look at any smooth function $\rho$ from a smooth manifold $M$ to the set of density matrices for some quantum system. We can still write down the metric

$g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho }{\partial \lambda^j} )$

in this more general situation. Nobody can stop us! But it would be better if we could derive this formula, as before, starting from a formula like the one we had before:

$g_{ij} = \mathrm{Re} \langle \, (X_i - \langle X_i \rangle) \, (X_j - \langle X_j \rangle) \, \rangle$

The challenge is that now we don’t have observables $X_i$ to start with. All we have is a smooth function $\rho$ from some manifold to some set of states. How can we pull observables out of thin air?

Well, you may remember that last time we had

$\rho = \frac{1}{Z} e^{-\lambda^i X_i}$

where $\lambda^i$ were some functions on our manifold and

$Z = \mathrm{tr}(e^{-\lambda^i X_i})$

was the partition function. Let’s copy this idea.

So, we’ll start with our density matrix $\rho$, but then write it as

$\rho = \frac{1}{Z} e^{-A}$

where $A$ is some self-adjoint operator and

$Z = \mathrm{tr} (e^{-A})$

(Note that $A$, like $\rho$, is really an operator-valued function on $M$. So, I should write something like $A(x)$ to denote its value at a particular point $x \in M$, but I won’t usually do that. As usual, I expect some intelligence on your part!)

Now we can repeat some calculations I did last time. As before, let’s take the logarithm of $\rho$:

$\mathrm{ln} \, \rho = -A - \mathrm{ln}\, Z$

and then differentiate it. Suppose $\lambda^i$ are local coordinates near some point of $M$. Then

$\frac{\partial}{\partial \lambda^i} \mathrm{ln}\, \rho = - \frac{\partial}{\partial \lambda^i} A - \frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z$

Last time we had nice formulas for both terms on the right-hand side above. To get similar formulas now, let’s define operators

$X_i = \frac{\partial}{\partial \lambda^i} A$

This gives a nice name to the first term on the right-hand side above. What about the second term? We can calculate it out:

$\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = \frac{1}{Z} \frac{\partial }{\partial \lambda^i} \mathrm{tr}(e^{-A}) = \frac{1}{Z} \mathrm{tr}(\frac{\partial }{\partial \lambda^i} e^{-A}) = - \frac{1}{Z} \mathrm{tr}(e^{-A} \frac{\partial}{\partial \lambda^i} A)$

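where in the last step we use the chain rule together with the cyclic property of the trace: since $A$ need not commute with its own derivative, $\frac{\partial}{\partial \lambda^i} e^{-A}$ is not simply $-e^{-A} \frac{\partial}{\partial \lambda^i} A$, but the two have the same trace. Next, use the definition of $\rho$ and $X_i$, and get: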

$\frac{1}{Z} \frac{\partial }{\partial \lambda^i} Z = - \mathrm{tr}(\rho X_i) = - \langle X_i \rangle$

This is just what we got last time! Ain’t it fun to calculate when it all works out so nicely?
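If you want to see this identity hold even when $A$ fails to commute with its derivatives, here is a small numerical sketch using NumPy. Everything specific in it is an assumption chosen for illustration: the $2 \times 2$ operator-valued function `A`, the point `lam`, and the finite-difference helper `partial`.

```python
import numpy as np

# Pauli matrices, used only to build a hypothetical 2x2 example.
SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)

def A(lam):
    """A made-up self-adjoint operator depending nonlinearly on (l1, l2)."""
    l1, l2 = lam
    return l1 * SX + l2 * SZ + np.sin(l1 * l2) * SY

def exp_minus(H):
    """exp(-H) for a Hermitian matrix H, via its eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-w)) @ V.conj().T

def Z(lam):
    """Partition function Z = tr(exp(-A))."""
    return np.trace(exp_minus(A(lam))).real

def partial(fun, lam, i, h=1e-5):
    """Central finite difference of fun along the coordinate lambda^i."""
    up = np.array(lam, dtype=float)
    dn = np.array(lam, dtype=float)
    up[i] += h
    dn[i] -= h
    return (fun(up) - fun(dn)) / (2 * h)

lam = [0.4, -0.7]
rho = exp_minus(A(lam)) / Z(lam)

checks = []
for i in range(2):
    X_i = partial(A, lam, i)              # X_i = dA/dlambda^i
    lhs = partial(Z, lam, i) / Z(lam)     # (1/Z) dZ/dlambda^i
    rhs = -np.trace(rho @ X_i).real       # -<X_i>
    checks.append((lhs, rhs))
    print(i, lhs, rhs)
```

The two printed columns should agree up to finite-difference error, even though here $X_i$ does not commute with $A$.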

So, putting both terms together, we see

$\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho = - X_i + \langle X_i \rangle$

or better:

$X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho$

This is a nice formula for the ‘fluctuation’ of the observables $X_i$, meaning how much they differ from their expected values. And it looks exactly like the formula we had last time! The difference is that last time we started out assuming we had a bunch of observables, $X_i$, and defined $\rho$ to be the state maximizing the entropy subject to constraints on the expectation values of all these observables. Now we’re starting with $\rho$ and working backwards.

From here on out, it’s easy. As before, we can define $g_{ij}$ to be the real part of the covariance matrix:

$g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) (X_j - \langle X_j \rangle) \rangle$

Using the formula

$X_i - \langle X_i \rangle = -\frac{\partial}{\partial \lambda^i} \mathrm{ln} \rho$

we get

$g_{ij} = \mathrm{Re} \langle \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j} \rangle$

or

$g_{ij} = \mathrm{Re}\,\mathrm{tr}(\rho \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^i} \; \frac{\partial \mathrm{ln} \rho}{\partial \lambda^j})$

Voilà!
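As a sanity check on the derivation, the two formulas for $g_{ij}$ can be compared numerically. The sketch below assumes a made-up $2 \times 2$ family `A(lam)`, computes $\mathrm{ln}\, \rho$ independently from the eigenvalues of $\rho$ (rather than from $-A - \mathrm{ln}\, Z$), and uses finite differences for all derivatives. All function names are hypothetical.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2)

def A(lam):
    """A made-up self-adjoint operator-valued function on the manifold."""
    l1, l2 = lam
    return l1 * SX + l2 * SZ + np.sin(l1 * l2) * SY

def herm_fun(H, f):
    """Apply the scalar function f to a Hermitian matrix H."""
    w, V = np.linalg.eigh(H)
    return (V * f(w)) @ V.conj().T

def rho(lam):
    E = herm_fun(A(lam), lambda w: np.exp(-w))
    return E / np.trace(E).real

def log_rho(lam):
    # Logarithm taken directly from the eigenvalues of rho.
    return herm_fun(rho(lam), np.log)

def partial(fun, lam, i, h=1e-5):
    """Central finite difference of fun along the coordinate lambda^i."""
    up = np.array(lam, dtype=float)
    dn = np.array(lam, dtype=float)
    up[i] += h
    dn[i] -= h
    return (fun(up) - fun(dn)) / (2 * h)

def metric_from_covariance(lam):
    """g_ij = Re <(X_i - <X_i>)(X_j - <X_j>)> with X_i = dA/dlambda^i."""
    r = rho(lam)
    X = [partial(A, lam, i) for i in range(len(lam))]
    F = [Xi - np.trace(r @ Xi) * I2 for Xi in X]
    return np.array([[np.trace(r @ Fi @ Fj).real for Fj in F] for Fi in F])

def metric_from_log(lam):
    """g_ij = Re tr(rho d_i(ln rho) d_j(ln rho))."""
    r = rho(lam)
    L = [partial(log_rho, lam, i) for i in range(len(lam))]
    return np.array([[np.trace(r @ Li @ Lj).real for Lj in L] for Li in L])

lam = [0.4, -0.7]
g_cov = metric_from_covariance(lam)
g_log = metric_from_log(lam)
print(g_cov)
print(g_log)
```

The two matrices should agree up to finite-difference error, and taking the real part is what makes each of them symmetric.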

When this matrix is positive definite at every point, we get a Riemannian metric on $M$. Last time I said this is what people call the ‘Bures metric’, though frankly, now that I examine the formulas, I’m not so sure. But in the classical case, it’s called the Fisher information metric.

Differential geometers like to use $\partial_i$ as a shorthand for $\frac{\partial}{\partial \lambda^i}$, so they’d write down our metric in a prettier way:

$g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho) )$

Differential geometers like coordinate-free formulas, so let’s also give a coordinate-free formula for our metric. Suppose $x \in M$ is a point in our manifold, and suppose $v,w$ are tangent vectors to this point. Then

$g(v,w) = \mathrm{Re} \, \langle v(\mathrm{ln}\, \rho) \; w(\mathrm{ln} \,\rho) \rangle \; = \; \mathrm{Re} \,\mathrm{tr}(\rho \; v(\mathrm{ln}\, \rho) \; w(\mathrm{ln}\, \rho))$

Here $\mathrm{ln}\, \rho$ is a smooth operator-valued function on $M$, and $v(\mathrm{ln}\, \rho)$ means the derivative of this function in the $v$ direction at the point $x$.

So, this is all very nice. To conclude, two more points: a technical one, and a more important philosophical one.

First, the technical point. When I said $\rho$ could be any smooth function from a smooth manifold to some set of states, I was actually lying. That’s an important pedagogical technique: the brazen lie.

We can’t really take the logarithm of every density matrix. Remember, we take the log of a density matrix by taking the log of all its eigenvalues. These eigenvalues are ≥ 0, but if one of them is zero, we’re in trouble! The logarithm of zero is undefined.

On the other hand, there’s no problem taking the logarithm of our density-matrix-valued function $\rho$ when it’s positive definite at each point of $M$. You see, a density matrix is positive definite iff its eigenvalues are all > 0. In this case it has a unique self-adjoint logarithm.

So, we must assume $\rho$ is positive definite. But what’s the physical significance of this ‘positive definiteness’ condition? Well, any density matrix can be diagonalized using some orthonormal basis. It can then be seen as a probabilistic mixture (not a quantum superposition!) of pure states taken from this basis. Its eigenvalues are the probabilities of finding the mixed state to be in one of these pure states. So, saying that its eigenvalues are all > 0 amounts to saying that all the pure states in this orthonormal basis show up with nonzero probability! Intuitively, this means our mixed state is ‘really mixed’. For example, it can’t be a pure state. In math jargon, it means our mixed state is in the interior of the convex set of mixed states.
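The positive definiteness requirement is easy to see in a small sketch: take the logarithm eigenvalue by eigenvalue, and refuse when some eigenvalue is zero. The function name `log_density_matrix`, the tolerance `tol`, and the two example matrices are assumptions chosen for illustration.

```python
import numpy as np

def log_density_matrix(rho, tol=1e-12):
    """Self-adjoint logarithm of a positive definite density matrix.

    The logarithm is taken eigenvalue by eigenvalue; tol is an assumed
    numerical threshold below which an eigenvalue is treated as zero.
    """
    w, V = np.linalg.eigh(rho)
    if w.min() <= tol:
        raise ValueError("rho has a zero eigenvalue: ln(rho) is undefined")
    return (V * np.log(w)) @ V.conj().T

# A 'really mixed' state: both eigenvalues are strictly positive.
mixed = np.array([[0.75, 0.25],
                  [0.25, 0.25]])

# A pure state: eigenvalues 1 and 0, on the boundary of the set of mixed states.
pure = np.array([[0.5, 0.5],
                 [0.5, 0.5]])

L = log_density_matrix(mixed)
print(L)

try:
    log_density_matrix(pure)
    pure_has_log = True
except ValueError as err:
    pure_has_log = False
    print(err)
```

The mixed state has a perfectly good logarithm, while the pure state is rejected because one of its eigenvalues vanishes.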

Second, the philosophical point. Instead of starting with the density matrix $\rho$, I took $A$ as fundamental. But different choices of $A$ give the same $\rho$. After all,

$\rho = \frac{1}{Z} e^{-A}$

where we cleverly divide by the normalization factor

$Z = \mathrm{tr} (e^{-A})$

to get $\mathrm{tr} \, \rho = 1$. So, if we multiply $e^{-A}$ by any positive constant, or indeed any positive function on our manifold $M$, $\rho$ will remain unchanged!

So we have added a little extra information when switching from $\rho$ to $A$. You can think of this as ‘gauge freedom’, because I’m saying we can do any transformation like

$A \mapsto A + f$

where

$f: M \to \mathbb{R}$

is a smooth function. This doesn’t change $\rho$, so arguably it doesn’t change the ‘physics’ of what I’m doing. It does change $Z$. It also changes the observables

$X_i = \frac{\partial}{\partial \lambda^i} A$

But it doesn’t change their ‘fluctuations’

$X_i - \langle X_i \rangle$

so it doesn’t change the metric $g_{ij}$.
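Here is a numerical sketch of this gauge freedom, again with NumPy. The operator-valued function `A`, the gauge function `f`, and the helper names are all made up for illustration; the claim being checked is only that $A \mapsto A + f$ changes $Z$ and the $X_i$ but leaves $\rho$ and $g_{ij}$ alone.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SY = np.array([[0, -1j], [1j, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2)

def A(lam):
    """A made-up self-adjoint operator-valued function on the manifold."""
    l1, l2 = lam
    return l1 * SX + l2 * SZ + np.sin(l1 * l2) * SY

def f(lam):
    """A hypothetical smooth real function f: M -> R (the gauge)."""
    l1, l2 = lam
    return l1 ** 2 + np.cos(l2)

def A_gauged(lam):
    # The gauge transformation A -> A + f (f times the identity operator).
    return A(lam) + f(lam) * I2

def exp_minus(H):
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-w)) @ V.conj().T

def Z(A_fun, lam):
    return np.trace(exp_minus(A_fun(lam))).real

def rho(A_fun, lam):
    return exp_minus(A_fun(lam)) / Z(A_fun, lam)

def X(A_fun, lam, i, h=1e-5):
    """X_i = dA/dlambda^i by central finite differences."""
    up = np.array(lam, dtype=float)
    dn = np.array(lam, dtype=float)
    up[i] += h
    dn[i] -= h
    return (A_fun(up) - A_fun(dn)) / (2 * h)

def metric(A_fun, lam):
    """g_ij = Re <(X_i - <X_i>)(X_j - <X_j>)>."""
    r = rho(A_fun, lam)
    F = []
    for i in range(len(lam)):
        Xi = X(A_fun, lam, i)
        F.append(Xi - np.trace(r @ Xi) * I2)
    return np.array([[np.trace(r @ Fi @ Fj).real for Fj in F] for Fi in F])

lam = [0.4, -0.7]
print("Z:       ", Z(A, lam), "->", Z(A_gauged, lam))
print("rho same:", np.allclose(rho(A, lam), rho(A_gauged, lam)))
print("X_1 same:", np.allclose(X(A, lam, 0), X(A_gauged, lam, 0)))
print("g same:  ", np.allclose(metric(A, lam), metric(A_gauged, lam), atol=1e-8))
```

In this sketch $Z$ gets multiplied by $e^{-f}$ and each $X_i$ is shifted by $\partial_i f$ times the identity, but the shift cancels in $X_i - \langle X_i \rangle$.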

This gauge freedom is interesting, and I want to understand it better. It’s related to something very simple yet mysterious. In statistical mechanics the partition function $Z$ begins life as ‘just a normalizing factor’. If you change the physics so that $Z$ gets multiplied by some number, the Gibbs state doesn’t change. But then the partition function takes on an incredibly significant role as something whose logarithm you differentiate to get lots of physically interesting information! So in some sense the partition function doesn’t matter much… but changes in the partition function matter a lot.

This is just like the split personality of phases in quantum mechanics. On the one hand they ‘don’t matter’: you can multiply a unit vector by any phase and the pure state it defines doesn’t change. But on the other hand, changes in phase matter a lot.

Indeed the analogy here is quite deep: it’s the analogy between probabilities in statistical mechanics and amplitudes in quantum mechanics, the analogy between $\mathrm{exp}(-\beta H)$ in statistical mechanics and $\mathrm{exp}(-i t H / \hbar)$ in quantum mechanics, and so on. This is part of a bigger story about ‘rigs’ which I told back in the Winter 2007 quantum gravity seminar, especially in week13. So, it’s fun to see it showing up yet again… even though I don’t completely understand it here.

[Note: in the original version of this post, I omitted the real part in my definition $g_{ij} = \mathrm{Re} \langle (X_i - \langle X_i \rangle) (X_j - \langle X_j \rangle) \rangle$, giving a ‘Riemannian metric’ that was neither real nor symmetric in the quantum case. Most of the comments below are based on that original version, not the new fixed one.]

### 58 Responses to Information Geometry (Part 3)

1. phorgyphynance says:

I’m really enjoying this because of the obvious applications to finance.

Now, if only you could wave the wand of category theory and make this all pop out by magic :)

Can we dualize this and get something like information theoretic 1-forms? I’d like to see something like

$g^{i j} = g(d x^i,d x^j).$

This is what I was talking about earlier. A stochastic process has a Brownian motion term and a deterministic term

$d x = \sigma dW + \mu d t.$

This is probably related somehow.

• John Baez says:

There certainly will be a metric $g^{ij}$, at least when $g_{ij}$ is nondegenerate as I’m assuming. But what does it mean?

I’ve been doing a lot of local coordinate calculations using observables

$X_i = \partial_i (\mathrm{ln} \, \rho)$

But note that in a global approach, we get observable-valued functions by differentiating the operator-valued function $\mathrm{ln} \, \rho$ along any vector field $v$. Say:

$X(v) = v (\mathrm{ln} \, \rho)$

Then our metric is

$g(v, w) = \langle X(v) X(w) \rangle$

So: we get observables from vector fields… what do we get from 1-forms? Not sure what the best answer is.

By the way, you posted your comment before I was done writing my article! I accidentally hit ‘Post’ — it’s annoyingly easy to do. You might want to reread my post now that it’s done, especially to see the Marx Brothers reference, but also maybe a bit of extra physics.

• phorgyphynance says:

Wait a second. Which should REALLY be the variance? $g_{ii} = \sigma^2$ or $g^{ii} = \sigma^2$? I think it should be $g^{ii}$ in order to relate to stochastic processes.

• John Baez says:

I’m getting observables from tangent vectors, and the metric from the covariance matrix on observables, so

$g_{ij} = g(\partial_i, \partial_j)$

is the covariance of the i th observable and the j th one, in local coordinates.

• phorgyphynance says:

Sorry for being dense, but it’s not obvious to me you have tangent vectors $g(\partial_i,\partial_j)$ projecting out elements of the covariance matrix. It almost looks like you’ve got components of a 1-form

$d(\ln \rho) = \partial_i(\ln\rho) d\lambda^i = (X_i-\langle X_i \rangle) d\lambda^i.$

Then

$d \ln\rho(\partial_j) = \partial_j(\ln \rho) = X_j - \langle X_j\rangle.$

Maybe there is some Ito trick laying in here somewhere so that you get something like

$\langle d\lambda^i d\lambda^j\rangle = g^{ij} d t$

and

\begin{aligned} \langle d(\ln \rho) d(\ln \rho)\rangle &= \langle (X_i-\langle X_i \rangle)(X_j-\langle X_j \rangle) d\lambda^i \bullet d\lambda^j \rangle \\ &= \langle (X_i-\langle X_i \rangle)(X_j-\langle X_j \rangle)\rangle g^{ij} d t \\ &= g_{ij} g^{ij} dt \\ &= dt.\end{aligned}

Ok! A little latex in the morning is good for the soul.

Now that I write that out, some things have become clearer. I hope the above sparks some thoughts from those who know more than me about this stuff.

Anyway, I have found peace with the idea that

$\langle (X_i-\langle X_i \rangle)(X_j-\langle X_j \rangle) \rangle = g_{ij}.$

Progress! But in the process, I have also come to peace with $g^{ij}$. This is my current thought

$\langle (\lambda^i-\langle \lambda^i \rangle)(\lambda^j-\langle \lambda^j \rangle) \rangle = g^{ij}.$

These should (probably) be related by

$d\lambda^i = g^{ij} dX_j$ and $dX_j = g_{ij} d\lambda^i.$

And there is peace in the world (of my head) again.

• phorgyphynance says:

$\langle (\lambda^i - \langle \lambda^i\rangle)(\lambda^j - \langle \lambda^j\rangle) \rangle = g^{ij}$

but the Ito thing (if there is anything to it) is kind of cute and leads directly to

$\langle dX_i dX_j \rangle = g_{ij} dt.$

• John Baez says:

Phorgyphynance writes:

$\langle (\lambda^i - \langle \lambda^i\rangle)(\lambda^j - \langle \lambda^j\rangle) \rangle = g^{ij}$

Good, you shouldn’t be, because it doesn’t make sense!

First of all, $\lambda^i$ are just names for the local coordinates on our manifold $M$. They’re not observables, so it doesn’t make much sense to take their expectation values.

True, you can think of a number as an observable that always takes the same value: that is, an observable with zero variance. So a coordinate function can be seen as a special case of an observable-valued function… but one with the special property that

$\lambda^i = \langle \lambda^i \rangle$

So, by the only reasonable interpretation of the left-hand side of your equation, we get

$\langle (\lambda^i - \langle \lambda^i\rangle)(\lambda^j - \langle \lambda^j\rangle) \rangle = 0$

Certainly nothing to do with the metric!

The following formulas you wrote also do not make sense:

$d\lambda^i = g^{ij} dX_j$

$dX_j = g_{ij} d\lambda^i$

$\langle dX_i dX_j \rangle = g_{ij} dt$

Sorry to be rude, but I get the feeling that you’re making wild guesses in a rush, instead of waiting until everything is crystal clear. There’s not even anything called $dt$ in my formalism! It has nothing to do with the Ito calculus, at least not yet.

• phorgyphynance says:

Ouch.

You do have stochastic processes whether you choose to recognize them yet or not. Once you recognize the stochastic process, you must recognize $dt$.

Perhaps the notation needs some work, e.g. maybe we should write

$d\Lambda^i = g^{ij} dX_j$

with $\langle \Lambda^i\rangle = \lambda^i$ or something, but I’m confident the basic ideas I’m laying out can be made solid with some effort. And I’m also confident they are relevant to what you are doing, but if you don’t want me thinking out loud on your blog, that is understandable. I didn’t see the harm.

• John Baez says:

Sorry, I guess I have a low tolerance for equations that don’t parse: by avoiding these like the plague, I avoid certain kinds of mistakes. They make me grumpy. I have trained myself to be like that.

I wouldn’t be surprised if there’s a connection between this stuff and stochastic processes, though.

• phorgyphynance says:

And just for the record, when I say I’m thinking out loud, I don’t mean to imply that I haven’t put serious thought into this before. The first time I became aware of the practical implications of the fact that the covariance matrix was like a metric tensor on a manifold was in the first half of 2005 when I was on Wall Street and I’ve given several presentations on the subject since then. Believe me, people are very familiar with these connections you’re reinventing.

The thing I haven’t personally thought of before is the relation to partition functions. That is very cool. The mathematical connection between what I’m talking about and what you’re doing is obvious to me, but instead of working out the details in the comments here, I’ll try to sort things out and let them bake a while on my own blog as well as possibly on my personal web space on the nLab.

And I’ll try to forget the comment about parsing of equations :P You will eventually see that everything I’ve written will parse (typos and minor notational adjustments aside) and is in fact fairly standard material.

• John Baez says:

I’m sorry to have hurt your feelings. I hope you realize what I’m doing. I’m trying to understand the subject called information geometry. A lot is known about it; I doubt I’m doing anything really new yet, but I feel the need to explain it and redevelop it in a slightly different way to understand it. I’m not trying to make connections to stochastic calculus, though they probably exist and are probably very interesting. I already have my hands full just trying to understand some basic concepts like the Fisher information metric.

I felt the need to point out that some equations you wrote made no sense given the definitions I had laid down. I did so in a rather rude way, because they actually made my brain hurt. But if you change something or other, you may wind up stating some interesting and/or well-known facts.

• John Baez says:

Eric writes:

Sorry for being dense, but it’s not obvious to me you have tangent vectors $g(\partial_i,\partial_j)$ projecting out elements of the covariance matrix.

I’m not sure what you mean here: $g(\partial_i,\partial_j) = g_{ij}$ is not a tangent vector, but $\partial_i$ and $\partial_j$ are. Maybe that’s what you were trying to say.

Anyway, starting from the definition of the metric in terms of a covariance matrix

$g_{ij} = \langle (X_i-\langle X_i \rangle) (X_j-\langle X_j \rangle) \rangle$

I showed that

$g_{ij} = \mathrm{tr}(\rho \; \partial_i (\ln\, \rho) \; \partial_j (\ln\, \rho))$

and you can read the proof above.

It almost looks like you’ve got components of a 1-form

$d(\ln \, \rho) = \partial_i(\ln \,\rho) d\lambda^i = (X_i-\langle X_i \rangle) d\lambda^i.$

Sure, and that’s a nice way of looking at it. Then, by the usual yoga of 1-forms and tangent vectors, this implies

$\partial_i (\ln \,\rho) = X_i - \langle X_i \rangle .$

But beware: $d(\ln \, \rho)$ is an observable-valued 1-form. I.e., it’s an operator-valued 1-form in the case of quantum statistical mechanics, and a 1-form taking values in functions on the phase space $\Omega$ in the case of classical statistical mechanics.

(If you prefer the language of probability theory to that of classical statistical mechanics, say “random variable” instead of “function on phase space”, and emphasize that $\Omega$ is a measure space. It’s just different terminology for the same thing, at least for what I’m doing here.)

• Tim van Beek says:

Eric wrote:

This is what I was talking about earlier. A stochastic process has a Brownian motion term and a deterministic term

$d x = \sigma dW + \mu d t$

This is probably related somehow.

There is a source of confusion I’d like to get out of the way, but we’ll need some definitions first:

Let’s say we talk about one dimensional stochastic processes in a continuous variable t. Let $W_t$ be Brownian motion, that is the only stochastic process with stationary independent increments with mean 0, starting at 0 ($P\{W_0 = 0\} = 1$), and concentrated on continuous paths.

The problem is that one cannot make sense of $dW_t$ as a differential form, AFAIK. It makes sense only as a short hand notation of the integral equation

$X(T) = \int_0^T \mu dt + \int_0^T \sigma dW_t$

where we have yet to choose an appropriate integral definition, let’s choose the Itô integral for now. (Note that not all stochastic processes have such a representation, the martingale representation theorem tells us that exactly all adapted martingales have one).

The problem is that the paths of Brownian motion are a.s. not of bounded variation, therefore the integral may not exist pathwise as a Riemann-Stieltjes integral.

What we can do, however, is consider the stochastic processes that are solutions of a family of Itô stochastic integrals described by a finite set of parameters, e.g.

$X(T) = \int_0^T \mu(x) dt + \int_0^T \sigma dW_t$

where $\mu(x)$ is a real polynomial of a degree of at most 4, and $\sigma$ is any real number. Every such stochastic process defines a probability distribution on, say, $C[0, T_0]$, the continuous functions of the interval $[0, T_0]$. These probability distributions form a stochastic manifold where one could, in principle, calculate the Fisher metric (ugh, I don’t think that I can do that).

• Eric says:

The problem is that one cannot make sense of $dW_t$ as a differential form, AFAIK.

You’re absolutely correct when viewing things from the traditional perspective. However, there IS a way to view the stochastic process as a 1-form, but you need to consider it in the context of noncommutative geometry.

I’m pretty sure I can claim to be the first person to ever apply noncommutative geometry to finance :)

Have a look at this paper I wrote back in 2002:

Noncommutative Geometry and Stochastic Calculus: Applications in Mathematical Finance

There, we find that the stochastic processes are indeed 1-forms and the Ito formula follows from noncommutativity of 0-forms and 1-forms, i.e.

$[d x,x] = d t.$

This is reminiscent of the common heuristic used when defining Ito formula in elementary math finance texts

$d x\bullet d x = d t.$

Then I let the idea rest for 2 years while I was at MIT Lincoln Lab, but came back to it in 2004 (just prior to moving to Wall Street) with a finite version suitable for numerical modeling:

Financial Modelling Using Discrete Stochastic Calculus

I sometimes feel a little apologetic for bringing up math finance here, but I hope it is clear how these techniques (could possibly) apply more generally including this information geometry stuff.

• Tim van Beek says:

I see: Maybe I should be more careful when I talk about “differential forms”.

In classical differential geometry, given a real, smooth manifold $M$, the differential $df$ of a real smooth function $f$ lives in the cotangent bundle of $M$.

We can integrate $df$ along a path and get a real number.

Transforming this to Connes’ quantized calculus, $f$ becomes a selfadjoint operator, $df$ becomes the commutator of $f$ with $F$, where $F$ is a fixed selfadjoint operator of square one. The integral becomes the trace. We can still integrate a differential of order one and get a real number.

Now I don’t see a way to fit “$dW_t$” into Connes’ quantized calculus :-)

It is not a function, an operator, or a differential of order one and we cannot integrate it to get a real number (or a complex one) :-)

Of course you are free to define it to be a basis vector of some abstract vector space and introduce additional algebraic structures that mimic Itô calculus.

• phorgyphynance says:

Hi Tim,

Now I don’t see a way to fit “$dW_t$” into Connes’ quantized calculus :-)

If you have a look at the paper I wrote with Urs

Discrete differential geometry on causal graphs

you’ll find that there is a particular class of spaces we call “diamonds” for which $d f$ is the commutator of $f$ with the “graph operator” $G$.

There is a particularly nice “diamond” we examine in Section 2.9 (which I expanded in the “Discrete Stochastic Calculus” paper), i.e. the binary tree or 2-diamond. The continuum stochastic calculus is the continuum limit of this discrete calculus on a binary tree. This is the bridge you’re looking for.

• phorgyphynance says:

One more note:

Of course you are free to define it to be a basis vector of some abstract vector space and introduce additional algebraic structures that mimic Itô calculus.

This is not what happens though. For a given directed graph, there is a corresponding calculus. There is no choice in the matter at all.

The calculus that corresponds automatically to the binary tree is stochastic calculus. There is no arbitrariness.

The story how we came to this is kind of funny. Urs and I were having fun playing around with this stuff and we asked the inverse question. If I give you a graph, you can determine the corresponding calculus. What if I hand you a calculus, can you reverse engineer things and give me a graph that corresponds to this calculus? Just as Urs left for a bike ride across Europe we asked what graph corresponds to stochastic calculus. When he returned, we had both arrived at the answer: the binary tree is the graph that corresponds to stochastic calculus. It is obvious in retrospect.

• Tim van Beek says:

Eric pointed out this paper:

If you have a look at the paper I wrote with Urs

Discrete differential geometry on causal graphs

Thanks! I hope I’ll have time to read it next weekend (the Monday is a holiday), I’ll report back then :-)

• phorgyphynance says:

Hi Tim,

I’ve written a pretty simple explanation of this on my blog at

Discrete stochastic calculus and commutators

If you’re interested, please feel free to discuss it over there. I love discussing this stuff so feel free to lob over any questions and I’m happy to do my best to answer them.

2. Tim van Beek says:

I’m getting dizzy from all these interconnections.

I looked for additional reading material: Curiously I did not find a unified treatment using von Neumann algebras, although it seems to be pretty simple and elegant (then again everything looks simple and elegant if explained by John). Instead there is an introduction to quantum information theory using finite-dimensional Hilbert spaces in

Amari, Shun-ichi; Nagaoka, Hiroshi: Methods of information geometry (available on google books, too).

And to make the story even more interconnected, people use information geometry to study black hole thermodynamics, see e.g. Jan E. Aman, Narit Pidokrajt: Geometry of Higher-Dimensional Black Hole Thermodynamics.

• John Baez says:

Thanks for the references! I wanted to invent some of my own stuff before getting brainwashed by the existing literature, but I’ll look at these and see what they say.

I’m getting dizzy from all these interconnections.

Well, when it comes to this sort of thing, if it doesn’t make you dizzy it must not be done yet.

3. John F says:

Is there sufficient gauge freedom to fix problems in A and/or states, or are the problems where the physics is? Can a dynamics of the gauge enable modelling wavefunction collapse?

• John Baez says:

John F wrote:

Is there sufficient gauge freedom to fix problems in A and/or states, or are the problems where the physics is?

I don’t understand what ‘problems’ you’re referring to, but right now I’m happy that I can use the gauge freedom in A to choose A so that

$\mathrm{tr}(e^{-A}) = 1$

that is, $Z = 1$. This implies

$\rho = e^{-A}$

and it also implies that our observables

$X_i = \partial_i A$

have vanishing expectation values:

$\langle X_i \rangle = - \frac{1}{Z} \partial_i Z = 0$

This makes them boring if you like nonzero expectation values, but their covariances are still plenty interesting: they are the metric $g_{ij}$.

This is the simplest gauge choice.
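Here is a minimal numerical sketch of this gauge choice in the classical case, where $A$ is just a list of real numbers (one per outcome) instead of an operator. The particular function `A(lam)`, the three-outcome sample space and the value `lam = 0.7` are assumptions made up for illustration; they do not come from the post.

```python
import math

def A(lam):
    # Hypothetical A: one real number per outcome, depending on the coordinate lam.
    return [0.0, lam, 2.0 * lam + lam ** 2]

def A_gauged(lam):
    # Gauge transformation A -> A + ln Z, which makes Z = sum(exp(-A)) equal to 1.
    log_Z = math.log(sum(math.exp(-a) for a in A(lam)))
    return [a + log_Z for a in A(lam)]

def rho(lam):
    # With Z = 1 the state is simply rho = exp(-A).
    return [math.exp(-a) for a in A_gauged(lam)]

def X(lam, h=1e-5):
    # X = dA/dlam in the new gauge, estimated by a central finite difference.
    up, down = A_gauged(lam + h), A_gauged(lam - h)
    return [(u - d) / (2.0 * h) for u, d in zip(up, down)]

lam = 0.7
p, x = rho(lam), X(lam)
Z = sum(p)                                       # should be 1
mean_X = sum(pi * xi for pi, xi in zip(p, x))    # should be (numerically) 0
g = sum(pi * xi * xi for pi, xi in zip(p, x))    # covariance of X with itself

print(Z, mean_X, g)
```

In this gauge the mean of the observable vanishes up to finite-difference error, while its variance, which plays the role of the single metric component on this one-dimensional manifold, stays strictly positive.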

Can a dynamics of the gauge enable modelling wavefunction collapse?

Sorry, I can’t make sense of that.

• John F says:

By problems I mostly meant zeroes of the weights. I guess this may not require anomalies or singularities in A, but maybe at least caustics. FWIW Berry (again) did nice work in multispectral holograms, e.g.
http://www.pnas.org/content/93/6/2614.full.pdf

Sometimes it seems like everything is gauges, phases, conformal terms, etc; sometimes nothing.

• John Baez says:

Okay — yeah, I spent an hour yesterday trying to figure out what to do about the metric

$g_{ij} = \mathrm{tr}(\rho \; \partial_i(\mathrm{ln}\, \rho) \; \partial_j(\mathrm{ln}\, \rho))$

when the density matrix $\rho$ has some zero eigenvalues. This question is important because we can think of pure states as mixed states with a lot of zero eigenvalues! I don’t think changing the gauge on $A$ helps at all, since the above formula for the metric is explicitly gauge-independent.

Right now I’d guess the metric $g_{ij}$ becomes singular at pure states, so we need a different metric if we want to include pure states in our story. And right now my most promising lead comes from Uhlmann’s papers. He seems to be studying metrics on the space of density matrices in a very thorough, general way, synthesizing and extending other people’s work in a systematic framework.

• John F says:

Ok, Uhlmann’s 1995 paper is helping me understand his 1993 paper, after also reading his ’85 and ’86 (etc.) For some reason this reminds me of an old joke I’ve been wanting to repeat recently but haven’t found a venue for, even if it doesn’t quite fit. A preacher at my church was clowning and mentioned he was sure we’d agree that each of his sermons was better than the next.

Anyway two questions. 1) Do you agree with (Uhlmann 1995) “in deviating from the pure to the mixed states … coherence and correlations will not be destroyed suddenly but gradually, continuously.”? 2) He uses lots of positive square roots (one specifically in Equation 37). Do they *have* to be positive?

• phorgyphynance says:

Here is a thought…

You have discovered a gauge freedom. This gauge freedom simply shifts the mean. I am pretty confident this is a manifestation of the Girsanov theorem.

• phorgyphynance says:

Ack! I posted the Wikipedia link before reading it assuming the page was decent, but the page actually sucks.

Given a stochastic process

$dX = \sigma dW + \mu dt,$

Girsanov’s theorem says that changing the measure modifies the above to

$dX = \sigma d\tilde W + \tilde\mu dt,$

i.e. the covariance structure is unchanged, but the change of measure changes the mean $\mu$.

I strongly suspect your gauge freedom represents a change of measure.

• phorgyphynance says:

PS: The trick you used to gauge transform the mean to zero is the same trick used in finance to convert a stochastic process into a martingale. So it seems that renormalizing the partition function via the gauge freedom turns your observable into a martingale.

• John Baez says:

My gauge freedom does indeed shift the mean; doing the gauge transform

$A \mapsto A + f$

changes the observable-valued function $X_i = \partial_i A$ as follows:

$X_i \mapsto X_i + \partial_i f$

At each point $x \in M$, this just takes the observable $X_i(x)$ and adds to it the number $\partial_i f(x)$. So, its mean gets shifted by $\partial_i f(x)$, but ‘nothing else changes’.
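A small classical sketch of this statement, under made-up assumptions: three outcomes, $A = (0, \lambda, \lambda^2)$ and the gauge function $f(\lambda) = 3\lambda$. None of these choices come from the post; they only serve to check that the state is untouched, the mean shifts by $\partial f$, and the variance stays the same.

```python
import math

def state(A_vals):
    # rho = exp(-A) / Z for a classical system with finitely many outcomes.
    weights = [math.exp(-a) for a in A_vals]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean(p, x):
    return sum(pi * xi for pi, xi in zip(p, x))

def variance(p, x):
    m = mean(p, x)
    return sum(pi * (xi - m) ** 2 for pi, xi in zip(p, x))

lam = 0.4
A = [0.0, lam, lam ** 2]          # hypothetical A at the point lam
X = [0.0, 1.0, 2.0 * lam]         # X = dA/dlam, computed by hand

f, df = 3.0 * lam, 3.0            # hypothetical gauge function f and its derivative
A_new = [a + f for a in A]        # A -> A + f
X_new = [x + df for x in X]       # X -> X + df

p_old, p_new = state(A), state(A_new)
mean_shift = mean(p_new, X_new) - mean(p_old, X)
variance_change = variance(p_new, X_new) - variance(p_old, X)

print(mean_shift, variance_change)
```

Adding a number to $A$ cancels between $e^{-A}$ and $Z$, which is why the two states agree; the shift in the mean is exactly `df`.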

However, there is no time variable in my formalism, so all remarks about stochastic processes, martingales, Ito’s formula, etc. are irrelevant here — or, optimistically speaking, ‘premature’.

Maybe you can introduce a time-dependence into my formalism and make those remarks meaningful somehow. For example, maybe you can take

$M = \mathbb{R}$

and call the coordinate on this manifold $t$ for ‘time’. Then what I’ve been calling ‘observable-valued functions’ can be renamed ‘time-dependent random variables’, at least in the classical case.

4. Jamie says:

Something inside me aches when I see beautiful mathematical physics laid out in front of me, but know I just don’t have the time to learn about it. And I would really like to, because thermodynamics was the one part of undergraduate physics that never really sat well with me.

Wow, sometimes I miss being a PhD student — no crummy grant applications to deal with, no pressure of collaborators waiting for me to write things. Three years ago, I would have been all over this.

• John Baez says:

Hi, Jamie! That’s sad, I’ve almost always felt I’ve had plenty of time to learn new stuff. The only really bad stretch was a couple years ago when I had way too many papers to finish writing.

The problem is working with coauthors. When I work by myself I write up ideas as I go, so the paper is done when it’s done. With coauthors it’s different: it’s lots of fun when we’re together, dreaming up ideas and working out the details, but not fun at all later on when I’m slowly writing them up. The real problem is feeling guilty that my procrastination may be hurting someone else’s CV, especially when it’s a grad student or postdoc, desperate for a job.

I’m trying to avoid new collaborative papers for this reason, and so far I’m succeeding. This has freed up a lot of time for learning new stuff, and explaining it here. And overall, I think that’s a better use of my time.

• Tim van Beek says:

The first three posts on information geometry don’t assume much background, especially not much in statistical physics – although John used some physics mumbo jumbo that may scare mathematicians that see it for the first time.

But it’s possible to translate all of it to a pure mathematical language, which is easier to do of course in the presence of specific questions like “I don’t understand symbol X on line Z”.

• John Baez says:

I know Jamie; he’s good at mathematical physics, so I don’t think he’d have trouble understanding what I wrote. I think he’s just too busy.

5. phorgyphynance says:

Last night, I had some fun working out all the formulas appearing in this series. John was right. I do feel smarter after doing that :)

However, I still do not have a perfect handle on the nature of all variables appearing. For instance, I got stuck when writing down

$d(\ln \rho).$

Should this be expanded in terms of $d\lambda^i$, $dX_i$, $dx$, $dt$,…?

In Crooks’ paper, I see the $\lambda^i$ are functions of $t$ and $X_i$ are functions of $x$.

Should

$d(\ln \rho)$

be expanded using Newtonian calculus or stochastic calculus (via Ito)?

• John Baez says:

Which formalism are you asking about:

1) the formalism I’m presenting in this post,

2) the formalism in Gavin Crooks’ first paper, or

3) the formalism in Gavin Crooks’ second paper?

They’re different. I can’t tell you the rules of the game until you tell me which game you’re playing.

For example:

1) There are no variables $x$ or $t$ in anything I explained in this post. I use $\lambda^i$ to coordinatize an arbitrary manifold $M$. $\rho$ and $X_i$ are smooth operator-valued functions on this manifold. I’m using plain old derivatives, no Ito calculus.

2) In Crooks’ first paper and my first post explaining it, $x$ is used to denote a vector: a list of expectation values of a fixed set of observables $X_i$. The variables $x_i$ thus serve to coordinatize an open set in $\mathbb{R}^n$. The probability measure $p$ is a function of $x$, but the observables $X_i$ are not. The expectation values $\langle X_i \rangle$ equal the coordinate functions $x_i$ on this open set. The $\lambda^i$ are some other functions on this open set. The variable $t$ is used not for time — he’s doing equilibrium thermodynamics — but to parametrize a path in this open set. He’s using plain old derivatives, no Ito calculus.

3) I will not attempt to describe the rules of the game in Crooks’ second paper, since I haven’t read it carefully yet.

If you don’t mind, I will stick to formalism 1). In this formalism,

$d (\mathrm{ln} \,\rho) = \frac{\partial \mathrm{ln}\rho}{\partial \lambda^i} d\lambda^i = - (X_i - \langle X_i \rangle) d\lambda^i$

The second step is the main calculation I did in this post.
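As a sanity check on the formula for $d(\ln \rho)$, here is a rough numerical sketch in the classical case with a single coordinate $\lambda$. The choice $A = (0, \lambda, \lambda^2)$ on three outcomes is an assumption made up for this example. The left-hand side is estimated by a finite difference and compared with $-(X - \langle X \rangle)$.

```python
import math

def A(lam):
    # Hypothetical A: one real number per outcome.
    return [0.0, lam, lam ** 2]

def X(lam):
    # X = dA/dlam, the exact derivative of the list above.
    return [0.0, 1.0, 2.0 * lam]

def log_rho(lam):
    # ln rho = -A - ln Z, outcome by outcome.
    a = A(lam)
    log_Z = math.log(sum(math.exp(-ak) for ak in a))
    return [-ak - log_Z for ak in a]

lam, h = 0.4, 1e-5

# Left-hand side: d(ln rho)/dlam by a central finite difference.
lhs = [(u - d) / (2.0 * h) for u, d in zip(log_rho(lam + h), log_rho(lam - h))]

# Right-hand side: -(X - <X>).
p = [math.exp(v) for v in log_rho(lam)]
mean_X = sum(pk * xk for pk, xk in zip(p, X(lam)))
rhs = [-(xk - mean_X) for xk in X(lam)]

max_error = max(abs(l - r) for l, r in zip(lhs, rhs))
print(max_error)
```

The two sides agree up to finite-difference error. This only tests the commuting case; it says nothing about operator-ordering issues in the quantum case.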

• phorgyphynance says:

Thanks! This helps.

I haven’t looked at Crooks’ second paper yet and I thought 1) and 2) were intended to be the same (or at least consistent).

The variable $t$ is used not for time

Then why did he say

The configurational probability distribution is given by the Gibbs ensemble, [16]

$p(x|\lambda) = \frac{1}{Z} e^{-\beta H(x,\lambda)} = \frac{1}{Z} e^{-\lambda^i(t) X_i(x)}$

where $x$ is the configuration, $t$ is time,…

? :)

I see now that $\lambda^i$ are like spatial coordinates so 1-forms should have coordinate bases $d\lambda^i$, but I’m still not quite sure whether

$d(\ln\rho)$

should contain a $dt$ term.

This may not be relevant, but it is curious enough that I think I’ll share it. The Ito differential would look something like

$d(\ln\rho) =\left (\partial_\mu \ln\rho\right) d\lambda^\mu +\left[ (\partial_t + \frac{g^{\mu\nu}}2 \partial_\mu\partial_\nu) \ln\rho\right] dt.$

The spatial component is the same as you had it, but note the second term of the temporal component. Since

$\partial_\mu \partial_\nu \ln\rho = g_{\mu\nu}$

you have a term

$g^{\mu\nu} g_{\mu\nu},$

which seems curious. I need to learn the relation (if any) between operator-valued functions and stochastic processes, but I’ll work on that elsewhere.

6. John Baez says:

Phorgyphynance wrote:

I thought 1) and 2) were intended to be the same (or at least consistent).

The current blog entry generalizes Crooks’ first paper in ways that require a significant shift in viewpoint. I wrote:

Here’s the problem: so far [i.e., in our summary of Crooks’ work] we’ve only studied a special case of the Fisher information metric. We’ve been assuming our states are Gibbs states, parametrized by the expectation values of some observables $X_1, \dots, X_n$. Our manifold $M$ was really just some open subset of $\mathbb{R}^n$: a point $x = (x_1, \dots, x_n)$ in here was a list of expectation values.

But people like to work a lot more generally. We could look at any smooth function $\rho$ from a smooth manifold $M$ to the set of density matrices for some quantum system.

[…]

The challenge is that now we don’t have observables $X_i$ to start with. All we have is a smooth function $\rho$ from some manifold to some set of states. How can we pull observables out of thin air?

And the answer is to let $\lambda_i$ be arbitrary local coordinates on $M$, not Lagrange multipliers as Crooks was using. Then, define

$X_i = \partial_i A$

where $A$ is a certain observable-valued function on $M$, related to $\rho$ and the partition function $Z$ as follows:

$\rho = \frac{1}{Z} e^{-A}$

Note that these $X_i$ are not fixed observables, as they were in Crooks’ formalism! Now they are observable-valued functions on the manifold $M$.

We need these changes to think of the Fisher information metric as always coming from a covariance matrix.

So, you gotta be careful here.

• phorgyphynance says:

Thanks for explaining. I hope it is clear that I am very excited about what you are doing and trying my best to understand it. I’ve found that simple “geometry” from the covariance matrix is already very useful in applications and now seeing a deeper level coming from a density matrix is very very cool. I hope to incorporate it into my work.

7. […] post is in response to a question from Tim van Beek over at the Azimuth Project blog hosted by my friend John Baez regarding my […]

8. John Baez says:

John wrote:

The variable $t$ is used not for time

Phorgyphynance wrote:

Then why did he say

The configurational probability distribution is given by the Gibbs ensemble, [16]

$p(x|\lambda) = \frac{1}{Z} e^{-\beta H(x,\lambda)} = \frac{1}{Z} e^{-\lambda^i(t) X_i(x)}$

where $x$ is the configuration, $t$ is time,…

? :)

I think he uses $t$ to mean time only in this one paragraph. I hadn’t even noticed that.

Of course, Gavin Crooks can say for himself what he’s done, but here’s my take on it:

In the bulk of the paper he’s doing equilibrium thermodynamics, time plays no real role, and he uses $t$ as a parameter for a path in his manifold of thermodynamic states. The main point of the paper — which I didn’t even get around to discussing — is to provide an operational procedure for measuring the arclength of such a path. This involves changing the thermodynamic state, but so slowly that we may consider it as remaining in equilibrium all along.

So, I stand by what I said. But if you want to make some stuff time-dependent, to get some $d t$ terms to show up, you can do that.

9. Tim van Beek says:

John wrote in the main post:

We can’t really take the logarithm of every density matrix. Remember, we take the log of a density matrix by taking the log of all its eigenvalues. These eigenvalues are ≥ 0, but if one of them is zero, we’re in trouble! The logarithm of zero is undefined…

Two simple questions:

Isn’t the assumption that the density matrix is positive definite correct for the grand canonical ensemble in “physically realistic” situations? I mean, every state contributes, because we fix the mean energy and mean particle count only, so that every state of the system – regardless of the energy and particle count needed – has a nonzero probability…

On the other hand, do we really need to assume that the density matrix is positive definite to take the logarithm? Let’s say we have a point $x$ on our manifold such that $\rho_x$ has a basis of eigenvectors, and we keep all eigenvectors with eigenvalue $\neq 0$, let this set be $E := \{e_1,... \}$. Then we can write

$\rho_x = \sum_i \mu_i(x) \, | e_i \rangle \langle e_i |$

(I hope I get the latex correct, what I intend to write is the representation of the density matrix according to the spectral theorem of compact operators, as here on the nLab).

Now we can define $\log(\rho)$ eigenvector-wise, as you did in the first post on information geometry, setting $\log(0) := 0$.

We can further differentiate this logarithm iff we assume that $x$ has a neighborhood such that $E$ is invariant on this neighborhood, that is the set of eigenvectors with nonzero eigenvalue does not change.

Maybe there is some way to define the differential even in the case that $E$ does change, I don’t know…but is it possible to relax the condition on $\rho$ from being positive definite to having an invariant $E$?

• John Baez says:

Tim wrote:

Isn’t the assumption that the density matrix is positive definite correct for the grand canonical ensemble in “physically realistic” situations?

Great question! Yes, that’s true.

I mean, every state contributes, because we fix the mean energy and mean particle count only, so that every state of the system – regardless of the energy and particle count needed – has a nonzero probability…

Right. Or if we just consider the canonical ensemble, where each state with energy E shows up with probability proportional to exp(-E/kT), we’ll get a nonzero contribution from every state…

except in the unphysical but nonetheless incredibly interesting limiting case where T = 0. Then the system goes down to its ground state, or a mixture of its ground states if there’s more than one.

It would be very nice to be able to understand the metric I’m discussing in this limiting case. Why? Well, I explained a metric on pure states in my post on the geometry of quantum phase transitions. It’s called the ‘fidelity metric’ or ‘Fubini-Study metric’, and it’s very nice. It would be cool to relate that metric to the one I’m discussing here!

Now we can define $\log(\rho)$ eigenvector-wise, as you did in the first post on information geometry, setting $\log(0) := 0$.

Just so nobody gets the wrong idea: I did not set $\log(0) = 0$.

(I know you’re not saying I did: you’re just saying I defined $\log(\rho)$ eigenvector-wise. But people reading your sentence out of context might be confused. I don’t want ’em to think I’m even dumber than I actually am.)

I don’t think formally setting $\log(0)$ equal to zero will help me with the question I’m interested in. I want to understand what the metric on the interior of the set of mixed states does as we approach the boundary. I would like it to extend smoothly or at least continuously.

$g(v,w) = \mathrm{tr}(\rho \, v(\ln \rho) \, w(\ln \rho))$

Now I’m using this to define a Riemannian metric on the interior of the set of mixed states. (The pullback of this metric to my manifold $M$ is the metric I was talking about in my post.) Now $v$ and $w$ are tangent vectors to some density matrix $\rho$.

Everything is fine when none of the eigenvalues of $\rho$ are zero. What happens when some eigenvalues approach zero? We have one factor of $\rho$ to help us out — but we’ve got two factors involving a derivative of $\ln(\rho)$. So, I’d expect behavior like

$\lim_{x \downarrow 0} x \; \frac{1}{x}\; \frac{1}{x}$

which is singular.

But this is just a hand-waving argument.
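The heuristic can be made concrete in the simplest diagonal example. This is a sketch under assumptions of my own choosing: take $\rho = \mathrm{diag}(p, 1-p)$ and use the eigenvalue $p$ itself as the coordinate. Everything commutes, so the formula reduces to a sum over eigenvalues.

```python
def metric_pp(p):
    # rho = diag(p, 1 - p), so d(ln rho)/dp = diag(1/p, -1/(1 - p)).
    eigenvalues = [p, 1.0 - p]
    dlog = [1.0 / p, -1.0 / (1.0 - p)]
    # tr(rho * dlog * dlog): one factor of rho against two factors of the derivative.
    return sum(r * d * d for r, d in zip(eigenvalues, dlog))

for p in [0.5, 1e-2, 1e-4, 1e-6]:
    print(p, metric_pp(p))
```

The component equals $1/p + 1/(1-p)$ and grows without bound as $p \downarrow 0$, matching the $x \cdot \frac{1}{x} \cdot \frac{1}{x}$ estimate. Note that this only shows that the component in the coordinate $p$ blows up. A change of coordinates can alter that behavior, so the sketch does not by itself decide whether the metric is genuinely singular at the boundary.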

We can further differentiate this logarithm iff we assume that x has a neighborhood such that E is invariant on this neighborhood, that is the set of eigenvectors with nonzero eigenvalue does not change.

Yes, true. But that’s not very helpful for the kind of thing I’m interested in, e.g. extending the metric on the interior of the set of mixed states to the boundary. On the boundary we have pure states, and as we move around on that boundary, the eigenvectors you’re talking about change.

I think these papers should help a lot:

• Anna Jencova, Geodesic distances on density matrices.

• A. Uhlmann, Density operators as an arena for differential geometry, Rep. Math. Phys. 33 (1993), 253–263.

• A. Uhlmann, Geometric phases and related structures, Rep. Math. Phys. 36 (1995), 461–481.

I could also do some calculations. It ain’t rocket science, it’s basically just matrix multiplication and some calculus. But sometimes those things can be pretty tiring.

• Tim van Beek says:

I’ll have to at least skim those papers, but don’t have any time to do it now, so here is an unqualified response:

But this is just a hand-waving argument.

I’m very much convinced that the metric gets singular on the boundary, but I’m not sure about what that really means.

One interpretation, of course, is that it is possible to find out for sure if your system is in a mixed state instead of a pure state. In the classical case, points on a statistical manifold with finite Fisher distance cannot be distinguished for sure, using a finite set of measurements.

So, while one interesting question is “how do we get rid of singularities?”, another one may be “what statistical resp. physical interpretation does a singularity of the metric have?”.

Pure states may be described, in this context, as points with lower complexity in the sense that they are specified by fewer parameters than mixed states.

• Tim van Beek says:

I wrote:

In the classical case, points on a statistical manifold with finite Fisher distance cannot be distinguished for sure, using a finite set of measurements.

Ugh, that’s wrong, unless we assume the classical analogue of an invariant set of eigenvectors: Namely, that the probability distributions of our statistical manifold all have the same support.

• Aaron says:

Shouldn’t all this talk of setting $\log(0) = 0$ really be setting $0 \log 0 = 0$ (which really is the limit $\lim_{x \rightarrow 0} x \log x$)?

• John Baez says:

Setting $0 \ln 0 = 0$ is fine, and that’s what people normally do when defining the entropy of a mixed state to be

$S(\rho) = -\mathrm{tr}(\rho \ln \rho)$

When the density matrix $\rho$ has an eigenvalue equal to 0, they define the operator $\rho \ln \rho$ to equal 0 when applied to that eigenvector.
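In terms of eigenvalues, this convention can be sketched as follows. The function name `entropy` is a label chosen here, and the usual sign convention $S = -\sum_k p_k \ln p_k$ is assumed. The density matrix is represented only by its list of eigenvalues, which is enough to compute the trace.

```python
import math

def entropy(eigenvalues):
    # S = -sum p ln p over the eigenvalues of rho, with the convention 0 ln 0 = 0:
    # zero eigenvalues are simply skipped.
    return -sum(p * math.log(p) for p in eigenvalues if p > 0.0)

pure = entropy([1.0, 0.0])      # a pure state: one eigenvalue 1, the rest 0
mixed = entropy([0.5, 0.5])     # the maximally mixed state of a two-level system

print(pure, mixed)
```

With this convention a pure state has entropy zero and the maximally mixed two-level state has entropy $\ln 2$, so nothing goes wrong at zero eigenvalues for the entropy itself.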

However, here Tim and I are trying to interpret the expression

$\mathrm{tr} (\rho \frac{d \ln \rho}{d\lambda^i} \frac{d \ln \rho}{d\lambda^j})$

and this is trickier. Indeed it appears, in general, to be ill-defined when $\rho$ has zero eigenvalues.

• Tim van Beek says:

John wrote:

However, here Tim and I are trying to interpret the expression…

Yes, that’s the problem…we need to define the differential of $\log(\rho)$. We could try to define the logarithm of $\rho$ to be $\log^+(\rho)$ eigenvector-wise, so that we don’t need $\rho$ to be positive definite. Here $\log^+(x)$ is defined to be 0 for $x \leq 0$ and $= \log(x)$ otherwise.

But if we have a path $\gamma(t)$ in our manifold such that there is an eigenvector with an eigenvalue $\mu(t)$ with $\mu(t > 0) > 0$ and $\mu(t \leq 0) = 0$ I very much doubt we can find a way to define the differential of $\log^+(\rho)$ along this path at $\gamma(0)$.

(Gee, hope the latex works out.)

• John Baez says:

Tim: I fixed a bunch of typos in your latex, but the big problem was this: in systems that mix latex and html, like this blog and also the n-Category Café, you have to be incredibly careful about < and > signs, since these play an important role in html.

In the n-Café you have to be smart enough to use the latex codes \lt and \gt instead of < and > — otherwise you’ll get in trouble.

In this blog you can’t even use \lt and \gt — apparently they get translated into < and > and they then cause trouble!

What you have to do here is use the html codes &lt; and &gt;, even inside latex expressions!

Having told you this, I now expect that you and everyone reading this will remember it forever and never make that mistake again.

10. John Baez says:

I made a big mistake in this blog entry — and possibly the preceding one!

It can be seen most easily here:

$g_{ij} = \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho) )$

A Riemannian metric must be symmetric:

$g_{ij} = g_{ji}$

On the other hand, the expression on the right-hand side is not symmetric:

$\mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho) ) \ne \mathrm{tr}(\rho \; \partial_j (\mathrm{ln} \, \rho) \; \partial_i (\mathrm{ln} \, \rho) )$

except in the classical case where the observables all commute.

Remember, while I was using quantum notation, my setup was supposed to work for both classical and quantum mechanics. Alas, it only works in the classical case.

You might think the cyclic property of the trace saves the day:

$\mathrm{tr}(AB) = \mathrm{tr}(BA)$

But this property does not mean everything commutes inside a trace:

$\mathrm{tr}(ABC) \ne \mathrm{tr}(BAC)$
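A concrete instance, using the three Pauli matrices as an illustrative choice of $A$, $B$ and $C$ (the post does not single out any particular matrices). The helper functions `matmul` and `trace` are small stand-ins written for this sketch.

```python
def matmul(a, b):
    # Product of two square matrices stored as nested lists.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# Pauli matrices, used here as example choices of A, B, C.
A = [[0, 1], [1, 0]]
B = [[0, -1j], [1j, 0]]
C = [[1, 0], [0, -1]]

tr_ABC = trace(matmul(matmul(A, B), C))   # equals 2i
tr_BCA = trace(matmul(matmul(B, C), A))   # cyclic permutation of ABC: also 2i
tr_BAC = trace(matmul(matmul(B, A), C))   # A and B swapped: equals -2i

print(tr_ABC, tr_BCA, tr_BAC)
```

Cyclic permutations leave the trace alone, but swapping two neighboring factors flips the sign here, because $AB = -BA$ for these matrices.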

So what did I do wrong?

I think it was simply defining the ‘metric’ by

$g_{ij} = \langle (X_i - \langle X_i \rangle) (X_j - \langle X_j \rangle) \rangle$

This is symmetric in the classical case, but not the quantum case!

So, I need to go back to the drawing board, at least when it comes to the quantum case.

By the way, for a while I thought my mistake lay here:

$\frac{1}{Z} \mathrm{tr}(\frac{\partial }{\partial \lambda^i} e^{-A}) = - \frac{1}{Z} \mathrm{tr}(e^{-A} \frac{\partial}{\partial \lambda^i} A)$

where in the last step we use the chain rule.

In fact we don’t have

$\frac{\partial }{\partial \lambda^i} e^{-A} = -e^{-A} \frac{\partial}{\partial \lambda^i} A$

unless $A$ and $\frac{\partial}{\partial \lambda^i} A$ commute. However, I believe the cyclic property of the trace is enough to show

$\mathrm{tr}(\frac{\partial }{\partial \lambda^i} e^{-A}) = -\mathrm{tr}(e^{-A} \frac{\partial}{\partial \lambda^i} A)$

So, it’s possible that all my calculations are correct, and the only problem is that I’m working with a bizarre asymmetric (and possibly degenerate) version of a ‘Riemannian metric’.

But I need to think more.

• phorgyphynance says:

In your more general framework, could you still define

$g_{\mu\nu} = -\partial_\mu \partial_\nu \ln Z$

?

• phorgyphynance says:

In the original case of Fisher information metric, we had

$A = \lambda^\mu X_\mu$

so that

$\partial_\mu\partial_\nu A = 0$

and we have

$g_{\mu\nu} = -\partial_\mu\partial_\nu \ln Z = \partial_\mu\partial_\nu \ln \rho.$

In the more general case you’re considering, maybe you could have

$\partial_\mu\partial_\nu A \ne 0$

in which case, it may make more sense to define

$g_{\mu\nu} =\partial_\mu\partial_\nu \ln \rho.$

If we then use the gauge freedom to choose $A$ such that $Z = 1$, then this reduces to

$g_{\mu\nu} = -\partial_\mu\partial_\nu A,$

which is kind of neat and gives me a sense of deja vu.

• John Baez says:

$g_{ij} = - \partial_i \partial_j \ln Z$

This is a great idea! It’s automatically symmetric. To see when it’s positive definite, and understand what it means, I will compute it in terms of the observables $X_i$ that I defined.

$g_{ij} = - \partial_i \partial_j \ln \rho$

This is not a great idea! This doesn’t parse if $g_{ij}$ is supposed to be a tensor (and hopefully a Riemannian metric). Why? Because $\ln\rho$ is an observable-valued function on $M$. For example, in the quantum case — the only case I’m having problems with — it’s an operator-valued function.

So then $\partial_i \partial_j \ln \rho$ is an operator-valued rank 2 symmetric tensor on $M$, and while that’s probably good for something, it’s not what I’m looking for.

In the classical case $\partial_i \partial_j \ln \rho$ is a random-variable-valued rank 2 symmetric tensor, which is again not what I’m after.

So I prefer the first idea.

I’ve also figured out a bunch of stuff myself, but I’ll let it settle for a while before I write about it.

• phorgyphynance says:

Ok. You’re probably right, but

$\partial_\mu\partial_\nu \ln \rho$

“feels” better to me for some reason. The fact that it is operator-valued doesn’t bother me too much. For one reason, the metric in my paper with Urs was also a self-adjoint operator.

I’m looking forward to the next part of this series :)

11. John Baez says:

John wrote:

I made a big mistake in this blog entry — and possibly the preceding one!

In fact this mistake affected all three blog entries. Luckily, it’s easy to fix.

The problem was that this matrix is not symmetric in the quantum case:

$g_{ij} = \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho))$

The solution is to take the real part! So now I’ve redefined it:

$g_{ij} = \mathrm{Re} \, \mathrm{tr}(\rho \; \partial_i (\mathrm{ln} \, \rho) \; \partial_j (\mathrm{ln} \, \rho))$
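To see why the real part helps, here is a small numerical sketch. The matrices `P` and `Q` are arbitrary self-adjoint matrices standing in for $\partial_i (\ln \rho)$ and $\partial_j (\ln \rho)$, and `rho` is an arbitrary diagonal density matrix. All three are assumptions chosen for illustration and are not derived from any particular family of states.

```python
def matmul(a, b):
    # Product of two square matrices stored as nested lists.
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

rho = [[0.75, 0], [0, 0.25]]      # hypothetical density matrix
P = [[1, 1], [1, -1]]             # self-adjoint stand-in for d_i(ln rho)
Q = [[1, -1j], [1j, -1]]          # self-adjoint stand-in for d_j(ln rho)

t_ij = trace(matmul(matmul(rho, P), Q))   # 1 + 0.5i
t_ji = trace(matmul(matmul(rho, Q), P))   # 1 - 0.5i

print(t_ij, t_ji)
print(t_ij.real, t_ji.real)
```

For self-adjoint $\rho$, $P$ and $Q$ the two traces are complex conjugates of each other, since $\overline{\mathrm{tr}(\rho P Q)} = \mathrm{tr}((\rho P Q)^\dagger) = \mathrm{tr}(Q P \rho) = \mathrm{tr}(\rho Q P)$. So the imaginary parts are antisymmetric in the two indices and the real parts are symmetric.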

I’ve tried to fix this problem everywhere it shows up in my blog entries — but not in the comments.

I’ve explained this in more detail in part 4, so if you have any questions, that’s the place to ask ’em!

12. […] Part 1     • Part 2     • Part 3     • Part 4     • Part […]

13. Now I want to describe how the Fisher information metric is related to relative entropy […]

14. About zero eigenvalues in the density matrix…
— When there aren’t any, then it all seems to boil down to the statement that “all ensembles are gibbs ensembles”. In retrospect, this seems obvious, yet it also seems rather remarkable, because no one ever seems to come out and say it much (well, in physics, they do, but not in other contexts). So I think I learned something new here. (My definition of a “gibbs ensemble” here is “those ensembles which don’t have a zero in the density matrix”, which seems to be a correct definition, right? Or did I miss something?)

— When there are zero eigenvalues, then I have two knee-jerk reactions:
The first, naive one is “well, gee, you should leave those states out of your Hilbert space; it was an error to include them in the first place”.

That attitude fails because the density depends on lambda, and perhaps, as one moves around on the manifold, the density matrix goes from being positive-definite in “most” locations, to having zero eigenvalues in some. But if that’s the case, this too strikes me as being remarkable, at least from the physics point of view.

So, envisioning lambda as a proxy for the temperature, or the fugacity, or whatever, this is saying that, for some values of lambda, there is a state that mixes well into the ensemble, and for other lambdas, it fails to mix at all. I don’t know of any physical system that behaves like this (but maybe my experience is limited). It’s like saying that, while manipulating the temperature, there is one state that suddenly completely stops interacting with the rest of the system. Wow! This seems like one heck of a discontinuity, suggesting some phase-transition-like behavior. Call me naive, but describing a model of some kind that exhibits this behaviour seems to be publication-worthy, to me. Unless there’s a “well-known” one you know…

At any rate, it still suggests a way for proceeding forward: you carve up the manifold into pieces, the edges of which are those values of lambda where the density matrix has a zero eigenvalue. As one crosses those edges, one should discard (or add back in) the detached pure state(s) from your Hilbert space, and otherwise proceed with the usual Gibbs-state calculations.

It would be even more remarkable and amazing if one couldn’t carve up the manifold, say, for example, because the points (values of lambda) for which the density matrix had zeros and where it didn’t were “dense” in each other (in the general topology sense, like rationals dense in reals), or if there was an accumulation point. Of course, mathematically, I suppose this is possible, but as a physical situation, this would seem to be truly remarkable…