Today, I want to describe how the Fisher information metric is related to relative entropy. I’ve explained both these concepts separately (click the links for details); now I want to put them together.

But first, let me explain what this whole series of blog posts is about. Information geometry, obviously! But what’s that?

Information geometry is the geometry of ‘statistical manifolds’. Let me explain that concept twice: first vaguely, and then precisely.

Vaguely speaking, a statistical manifold is a manifold whose points are hypotheses about some situation. For example, suppose you have a coin. You could have various hypotheses about what happens when you flip it. For example: you could hypothesize that the coin will land heads up with probability $x$, where $x$ is any number between 0 and 1. This makes the interval $[0,1]$ into a statistical manifold. Technically this is a manifold *with boundary*, but that’s okay.

Or, you could have various hypotheses about the IQ’s of American politicians. For example: you could hypothesize that they’re distributed according to a Gaussian probability distribution with mean $\mu$ and standard deviation $\sigma$. This makes the space of pairs $(\mu, \sigma)$ into a statistical manifold. Of course we require $\sigma \ge 0$, which gives us a manifold with boundary. We might also want to assume $\mu \ge 0$, which would give us a manifold *with corners*, but that’s okay too. We’re going to be pretty relaxed about what counts as a ‘manifold’ here.

If we have a manifold whose points are hypotheses about some situation, we say the manifold ‘parametrizes’ these hypotheses. So, the concept of statistical manifold is fundamental to the subject known as parametric statistics.

Parametric statistics is a huge subject! You could say that information geometry is the application of geometry to this subject.

But now let me go ahead and make the idea of ‘statistical manifold’ more precise. There’s a classical and a quantum version of this idea. I’m working at the Centre of Quantum Technologies, so I’m being paid to be quantum, but today I’m in a classical mood, so I’ll only describe the classical version. Let’s say a **classical statistical manifold** is a smooth function $p$ from a manifold $M$ to the space of probability distributions on some measure space $\Omega$.

We should think of $\Omega$ as a space of **events**. In our first example, it’s just $\{H, T\}$: we flip a coin and it lands either heads up or tails up. In our second it’s $\mathbb{R}$: we measure the IQ of an American politician and get some real number.

We should think of $M$ as a space of **hypotheses**. For each point $x \in M$, we have a probability distribution $p_x$ on $\Omega$. This is a hypothesis about the events in question: for example “when I flip the coin, there’s a 55% chance that it will land heads up”, or “when I measure the IQ of an American politician, the answer will be distributed according to a Gaussian with mean 0 and standard deviation 100.”

Now, suppose someone hands you a classical statistical manifold $(M, p)$. Each point in $M$ is a hypothesis. Apparently some hypotheses are more similar than others. It would be nice to make this precise. So, you might like to define a metric on $M$ that says how ‘far apart’ two hypotheses are. People know lots of ways to do this; the challenge is to find ways that have clear meanings.

Last time I explained the concept of relative entropy. Suppose we have two probability distributions on $\Omega$, say $p$ and $q$. Then the **entropy of $p$ relative to $q$** is the amount of information you gain when you start with the hypothesis $q$ but then discover that you should switch to the new improved hypothesis $p$. It equals:

$$ \int_\Omega \frac{p}{q} \, \ln\left(\frac{p}{q}\right) \, q \, d\omega $$

You could try to use this to define a distance between points $x$ and $y$ in our statistical manifold, like this:

$$ S(x,y) = \int_\Omega \frac{p_x}{p_y} \, \ln\left(\frac{p_x}{p_y}\right) \, p_y \, d\omega $$

This is definitely an important function. Unfortunately, as I explained last time, it doesn’t obey the axioms that a distance function should! Worst of all, it doesn’t obey the triangle inequality.

Can we ‘fix’ it? Yes, we can! And when we do, we get the Fisher information metric, which is actually a *Riemannian* metric on $M$. Suppose we put local coordinates on some patch of $M$ containing the point $x$. Then the **Fisher information metric** is given by:

$$ g_{ij}(x) = \int_\Omega \partial_i (\ln p_x) \, \partial_j (\ln p_x) \, p_x \, d\omega $$

You can think of my whole series of articles so far as an attempt to understand this funny-looking formula. I’ve shown how to get it from a few different starting-points, most recently back in Part 3. But now let’s get it starting from relative entropy!
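To make the formula for the Fisher information metric less abstract, here is a minimal numerical sketch for the coin example, where the statistical manifold is the interval of probabilities $x$ and $\Omega = \{H, T\}$. The function names `coin` and `fisher_coin` and the step size `h` are illustrative choices, not anything from this series. Working the integral out by hand for this family gives $g(x) = 1/x + 1/(1-x) = 1/(x(1-x))$, and the code simply compares a finite-difference evaluation against that.

```python
import math

def coin(x):
    """Hypothesis 'heads with probability x' as a distribution on {H, T}."""
    return [x, 1.0 - x]

def fisher_coin(x, h=1e-5):
    """Fisher information g(x) for the coin family.

    Evaluates the sum over events of (d/dx ln p_x)^2 * p_x, with the
    derivative of ln p_x approximated by a central difference.
    """
    p = coin(x)
    p_plus, p_minus = coin(x + h), coin(x - h)
    g = 0.0
    for k in range(2):
        dlog = (math.log(p_plus[k]) - math.log(p_minus[k])) / (2 * h)
        g += dlog * dlog * p[k]
    return g

# Compare the numerical value with the hand-computed 1/(x(1-x)).
for x in (0.1, 0.5, 0.9):
    print(x, fisher_coin(x), 1.0 / (x * (1.0 - x)))
```

Note how the metric blows up near $x = 0$ and $x = 1$: nearly certain hypotheses are ‘far apart’ even when their probabilities differ only slightly.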

Fix any point in our statistical manifold and choose local coordinates for which this point is the origin, $0$. The amount of information we gain if we move to some other point $x$ is the relative entropy $S(x,0)$. But what’s this like when $x$ is really close to $0$? We can imagine doing a Taylor series expansion of $S(x,0)$ to answer this question.

Surprisingly, to first order the answer is always zero! Mathematically:

$$ \left. \partial_i S(x,0) \right|_{x=0} = 0 $$

In plain English: if you change your mind slightly, you learn a negligible amount — *not* an amount proportional to how much you changed your mind.

This must have some profound significance. I wish I knew what. Could it mean that people are reluctant to change their minds except in big jumps?

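Anyway, if you think about it, this fact makes it obvious that $S(x,y)$ can’t obey the triangle inequality. $S(x,y)$ could be pretty big, but if we draw a curve from $x$ to $y$, and mark $n$ closely spaced points $x_i$ on this curve, then $S(x_{i+1}, x_i)$ is zero to first order, so it must be of order $1/n^2$, so if the triangle inequality were true we’d have

$$ S(x,y) \le \sum_i S(x_{i+1}, x_i) \le \mathrm{const} \cdot n \cdot \frac{1}{n^2} $$

for all $n$, which is a contradiction: the right-hand side tends to zero as $n$ grows, while $S(x,y)$ is a fixed positive number.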

In plain English: if you change your mind in one big jump, the amount of information you gain is more than the sum of the amounts you’d gain if you change your mind in lots of little steps! This seems pretty darn strange, but the paper I mentioned in part 1 helps:

• Gavin E. Crooks, Measuring thermodynamic length.

You’ll see he takes a curve and chops it into lots of little pieces as I just did, and explains what’s going on.
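The chopping argument is easy to watch numerically. The sketch below uses the coin family again and a straight-line path between two biases; the helper names `rel_entropy`, `coin` and `chopped_sum`, and the endpoints 0.2 and 0.8, are arbitrary illustrative choices. It compares the information gained in one big jump with the total gained over $n$ small steps along the same path.

```python
import math

def rel_entropy(p, q):
    """Entropy of p relative to q on a finite set (all entries assumed > 0)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))

def coin(x):
    """Hypothesis 'heads with probability x' as a distribution on {H, T}."""
    return [x, 1.0 - x]

def chopped_sum(x, y, n):
    """Sum of S(x_{i+1}, x_i) over n equal steps from x to y."""
    pts = [x + (y - x) * i / n for i in range(n + 1)]
    return sum(rel_entropy(coin(pts[i + 1]), coin(pts[i])) for i in range(n))

x, y = 0.2, 0.8
one_jump = rel_entropy(coin(y), coin(x))
print("one jump:", one_jump)
for n in (1, 10, 100, 1000):
    # Each small step costs about const/n^2, so the total shrinks like 1/n.
    print(n, chopped_sum(x, y, n))
```

With $n = 1$ the two numbers agree by construction; as $n$ grows the sum of the small steps heads toward zero while the one-jump value stays put, which is exactly why the triangle inequality has to fail.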

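Okay, so what about second order? What’s

$$ \left. \partial_i \partial_j S(x,0) \right|_{x=0} \; ? $$

Well, this is the punchline of this blog post: *it’s the Fisher information metric:*

$$ \left. \partial_i \partial_j S(x,0) \right|_{x=0} = g_{ij} $$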

And since the Fisher information metric is a Riemannian metric, we can then apply the usual recipe and define distances in a way that obeys the triangle inequality. Crooks calls this distance **thermodynamic length** in the special case that he considers, and he explains its physical meaning.
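Here is a hedged numerical check of this claim for the Gaussian family with coordinates $(\mu, \sigma)$. It relies on two standard facts that are not derived in this post: the closed form for the relative entropy between two Gaussians, and the Fisher metric of this family, which is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$ in these coordinates. The function names, the base point and the step size are illustrative assumptions.

```python
import math

def kl_gauss(mu1, s1, mu0, s0):
    """Entropy of N(mu1, s1^2) relative to N(mu0, s0^2), standard closed form."""
    return math.log(s0 / s1) + (s1 ** 2 + (mu1 - mu0) ** 2) / (2 * s0 ** 2) - 0.5

def hessian_at_base(mu0, s0, h=1e-4):
    """Finite-difference Hessian of x -> S(x, base) at x = base, x = (mu, sigma)."""
    def S(dmu, ds):
        return kl_gauss(mu0 + dmu, s0 + ds, mu0, s0)
    e = [(h, 0.0), (0.0, h)]  # coordinate directions scaled by the step size
    H = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        for j in range(2):
            a = S(e[i][0] + e[j][0], e[i][1] + e[j][1])
            b = S(e[i][0] - e[j][0], e[i][1] - e[j][1])
            c = S(-e[i][0] + e[j][0], -e[i][1] + e[j][1])
            d = S(-e[i][0] - e[j][0], -e[i][1] - e[j][1])
            H[i][j] = (a - b - c + d) / (4 * h * h)
    return H

mu0, s0 = 0.0, 2.0
H = hessian_at_base(mu0, s0)
print("Hessian of S at the base point:", H)
print("expected Fisher metric:", [[1 / s0 ** 2, 0.0], [0.0, 2 / s0 ** 2]])
```

The same finite-difference recipe would work for any family where the relative entropy can be evaluated, though without a closed form the integrals would have to be done numerically too.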

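Now let me prove that

$$ \left. \partial_i S(x,0) \right|_{x=0} = 0 $$

and

$$ \left. \partial_i \partial_j S(x,0) \right|_{x=0} = g_{ij} $$

This can be somewhat tedious if you do it by straightforwardly grinding it out. I know, I did it. So let me show you a better way, which requires more conceptual acrobatics but less brute force.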

The trick is to work with the **universal** statistical manifold for the measure space $\Omega$. Namely, we take $M$ to be the space of *all* probability distributions on $\Omega$! This is typically an *infinite-dimensional* manifold, but that’s okay: we’re being relaxed about what counts as a manifold here. In this case, we don’t need to write $p_x$ for the probability distribution corresponding to the point $x \in M$. In this case, a point of $M$ just *is* a probability distribution on $\Omega$, so we’ll just call it $p$.

If we can prove the formulas for this universal example, they’ll automatically follow for every other example, by abstract nonsense. Why? Because *any* statistical manifold with measure space $\Omega$ is the same as a manifold with a smooth map to the *universal* statistical manifold! So, geometrical structures on the universal one ‘pull back’ to give structures on all the rest. The Fisher information metric and the function $S$ can be defined as pullbacks in this way! So, to study them, we can just study the universal example.

(If you’re familiar with ‘classifying spaces for bundles’ or other sorts of ‘classifying spaces’, all this should seem awfully familiar. It’s a standard math trick.)

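So, let’s prove that

$$ \left. \partial_i S(x,0) \right|_{x=0} = 0 $$

by proving it in the universal example. Given any probability distribution $q$, and taking a nearby probability distribution $p$, we can write

$$ \frac{p}{q} = 1 + f $$

where $f$ is some small function. We only need to show that $S(p,q)$ is zero to first order in $f$. And this is pretty easy. By definition:

$$ S(p,q) = \int_\Omega \frac{p}{q} \, \ln\left(\frac{p}{q}\right) \, q \, d\omega $$

or in other words,

$$ S(p,q) = \int_\Omega (1+f) \, \ln(1+f) \, q \, d\omega $$

We can calculate this to first order in $f$ and show we get zero. But let’s actually work it out to second order, since we’ll need that later:

$$ \ln(1+f) = f - \frac{1}{2} f^2 + \cdots $$

so

$$ (1+f) \, \ln(1+f) = f + \frac{1}{2} f^2 + \cdots $$

so

$$ S(p,q) = \int_\Omega f \, q \, d\omega + \frac{1}{2} \int_\Omega f^2 \, q \, d\omega + \cdots $$

Why does this vanish to first order in $f$? It’s because $p$ and $q$ are both probability distributions and $p = (1+f) q$, so

$$ \int_\Omega (1+f) \, q \, d\omega = \int_\Omega p \, d\omega = 1 $$

but also

$$ \int_\Omega q \, d\omega = 1 $$

so subtracting we see

$$ \int_\Omega f \, q \, d\omega = 0 $$

So, $S(p,q)$ vanishes to first order in $f$. *Voilà!*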
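The expansion can be sanity-checked on a finite set of events, where the integrals become sums. In the sketch below the base distribution `q`, the perturbation `direction` and the scale `eps` are made-up illustrative values; the perturbation is shifted so that $\sum f q = 0$, which is what keeps $p = (1+f)q$ normalized.

```python
import math

def rel_entropy(p, q):
    """Entropy of p relative to q on a finite set (all entries assumed > 0)."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))

q = [0.5, 0.3, 0.2]
direction = [1.0, -2.0, 1.5]  # hypothetical perturbation shape
mean = sum(d * qk for d, qk in zip(direction, q))
direction = [d - mean for d in direction]  # now sum(direction * q) = 0

def compare(eps):
    """Return (S(p,q), first-order term, second-order term) for f = eps * direction."""
    f = [eps * d for d in direction]
    p = [(1 + fk) * qk for fk, qk in zip(f, q)]
    first = sum(fk * qk for fk, qk in zip(f, q))
    second = 0.5 * sum(fk * fk * qk for fk, qk in zip(f, q))
    return rel_entropy(p, q), first, second

for eps in (0.1, 0.01, 0.001):
    S, first, second = compare(eps)
    print(eps, S, first, second)
```

The first-order term is zero up to rounding, and the gap between the exact relative entropy and the second-order term shrinks like `eps**3`, as the Taylor series suggests.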

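Next let’s prove the more interesting formula:

$$ \left. \partial_i \partial_j S(x,0) \right|_{x=0} = g_{ij} $$

which relates relative entropy to the Fisher information metric. Since both sides are symmetric matrices, it suffices to show their diagonal entries agree in any coordinate system:

$$ \left. \partial_i^2 S(x,0) \right|_{x=0} = g_{ii} $$

Devoted followers of this series of posts will note that I keep using this trick, which takes advantage of the polarization identity.

To prove

$$ \left. \partial_i^2 S(x,0) \right|_{x=0} = g_{ii} $$

it’s enough to consider the universal example. We take the origin to be some probability distribution $q$ and take $x$ to be a nearby probability distribution $p$ which is pushed a tiny bit in the $i$th coordinate direction. As before we write $p/q = 1 + f$. We look at the second-order term in our formula for $S(p,q)$:

$$ \frac{1}{2} \int_\Omega f^2 \, q \, d\omega $$

Using the usual second-order Taylor’s formula, which has a $\frac{1}{2}$ built into it, we can say

$$ \left. \partial_i^2 S(x,0) \right|_{x=0} = \int_\Omega f^2 \, q \, d\omega $$

On the other hand, our formula for the Fisher information metric gives

$$ g_{ii} = \left. \int_\Omega \partial_i \ln p \; \partial_i \ln p \; q \, d\omega \right|_{p=q} $$

The right hand sides of the last two formulas look awfully similar! And indeed they agree, because we can show that

$$ \left. \partial_i \ln p \right|_{p=q} = f $$

How? Well, we assumed that $p$ is what we get by taking $q$ and pushing it a little bit in the $i$th coordinate direction; we have also written that little change as

$$ \frac{p}{q} = 1 + f $$

for some small function $f$. So,

$$ \partial_i \left( \frac{p}{q} \right) = f $$

and thus:

$$ \partial_i p = f q $$

and thus:

$$ \partial_i \ln p = \frac{\partial_i p}{p} = \frac{f q}{p} $$

so

$$ \left. \partial_i \ln p \right|_{p=q} = f $$

as desired.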

This argument may seem a little hand-wavy and nonrigorous, with words like ‘a little bit’. If you’re used to taking arguments involving infinitesimal changes and translating them into calculus (or differential geometry), it should make sense. If it doesn’t, I apologize. It’s easy to make it more rigorous, but only at the cost of more annoying notation, which doesn’t seem good in a blog post.

#### Boring technicalities

If you’re actually the kind of person who reads a section called ‘boring technicalities’, I’ll admit to you that my calculations don’t make sense if the integrals diverge, or we’re dividing by zero in the ratio $p/q$. To avoid these problems, here’s what we should do. Fix a $\sigma$-finite measure space $(\Omega, d\omega)$. Then, define the **universal statistical manifold** to be the space $P(\Omega, d\omega)$ consisting of all probability measures that are equivalent to $d\omega$, in the usual sense of measure theory. By Radon-Nikodym, we can write any such measure as $q \, d\omega$ where $q \in L^1(\Omega, d\omega)$. Moreover, given two of these guys, say $p \, d\omega$ and $q \, d\omega$, they are absolutely continuous with respect to each other, so we can write

$$ p \, d\omega = \frac{p}{q} \, q \, d\omega $$

where the ratio $p/q$ is well-defined almost everywhere and lies in $L^1(\Omega, q \, d\omega)$. This is enough to guarantee that we’re never dividing by zero, and I think it’s enough to make sure all my integrals converge.

We do still need to make $P(\Omega, d\omega)$ into some sort of infinite-dimensional manifold, to justify all the derivatives. There are various ways to approach this issue, all of which start from the fact that $L^1(\Omega, d\omega)$ is a Banach space, which is about the nicest sort of infinite-dimensional manifold one could imagine. Sitting in $L^1(\Omega, d\omega)$ is the hyperplane consisting of functions $q$ with

$$ \int_\Omega q \, d\omega = 1 $$

and this is a Banach manifold. To get $P(\Omega, d\omega)$ we need to take a subspace of that hyperplane. If this subspace were open then $P(\Omega, d\omega)$ would be a Banach manifold in its own right. I haven’t checked this yet, for various reasons.

For one thing, there’s a nice theory of ‘diffeological spaces’, which generalize manifolds. Every Banach manifold is a diffeological space, and every subset of a diffeological space is again a diffeological space. For many purposes we don’t need our ‘statistical manifolds’ to be manifolds: diffeological spaces will do just fine. This is one reason why I’m being pretty relaxed here about what counts as a ‘manifold’.

For another, I know that people have worked out a lot of this stuff, so I can just look things up when I need to. And so can you! This book is a good place to start:

• Paolo Gibilisco, Eva Riccomagno, Maria Piera Rogantin and Henry P. Wynn, *Algebraic and Geometric Methods in Statistics*, Cambridge U. Press, Cambridge, 2009.

I find the chapters by Raymond Streater especially congenial. For the technical issue I’m talking about now it’s worth reading section 14.2, “Manifolds modelled by Orlicz spaces”, which tackles the problem of constructing a universal statistical manifold in a more sophisticated way than I’ve just done. And in chapter 15, “The Banach manifold of quantum states”, he tackles the quantum version!

Interesting stuff!

(On the copy-editing front: it’s missing a “latex” in $L^1(\Omega, d\omega)$.)

Thanks. I just did that to see if anyone would read that far.

I hope everyone got the nasty joke in paragraph 5.

And you can retrieve the Fisher information metric from other divergences, as discussed on p. 4 of Snoussi’s The geometry of prior selection.

Thanks! I’m guessing the ‘δ-divergence’ defined on top of this page (in equation 2) is the Rényi version of relative entropy.

This is a nice paper! I’m still not at the point of understanding all this stuff, but I’m getting a lot closer.

It would be interesting to know under which conditions Fisher’s metric defines a flat Levi-Civita connection.

Nice question!

We can make a bit of progress on this in the simple case where our space of events has a finite number of points, say

$$ \Omega = \{1, \dots, n\} $$

Then the universal statistical manifold is the space of all probability distributions on $\Omega$. This is the simplex:

$$ \{ (p_1, \dots, p_n) : \; p_i \ge 0, \; \sum_i p_i = 1 \} $$

So for example for $n = 3$ it’s a triangle.

But the Fisher metric does not make this simplex flat! Instead, it turns out to be isometric to a piece of a sphere:

$$ \{ (x_1, \dots, x_n) : \; x_i \ge 0, \; \sum_i x_i^2 = 4 \} $$

So, for example, when $n = 3$ it’s shaped like an eighth of a sphere.

Knowing this, we can begin to understand when the Fisher metric on a general statistical manifold is flat (at least when $\Omega$ is finite). We need to map that manifold into the universal statistical manifold so that its image is a flat submanifold sitting inside this piece of a sphere!

Mmm, I’ll have to think about this. It’s not straightforward to me that Fisher’s curvature should be that of a sphere…

(as I process information I elaborate it on a new blog, you might want to take a look at it from time to time)

One important concept for interpretation is that on any statistical manifold events shouldn’t be just points but rather collections of points of various densities. One reason is independent events are interpenetrating point clouds.

John F wrote:

Just to clear up any possible miscommunication: in my setup, I’m calling points in the statistical manifold ‘hypotheses’, not ‘events’.

There are two spaces floating around here: the statistical manifold $M$ whose points are *hypotheses*, and the measure space $\Omega$ whose points are *events*. Each point in $M$ gives rise to a probability distribution on $\Omega$. So, each hypothesis is a probability distribution of events.

E.g., when $\Omega = \{H, T\}$, a typical hypothesis is “the coin will land heads-up with probability 60%”.

Your setup is confusing if you are used to ordinary (say Bayesian) statistical inference! I think it translates like this.

A Bayesian prior is a distribution over hypotheses. You don’t jump from one hypothesis to another, but weight them differently as the likelihood changes as you get more data. Usually, as you get more data, the posterior becomes more like a multivariate gaussian. To estimate the (inverse of the) covariance matrix, you can take second derivatives of -log(posterior), evaluated at the mode of the posterior. The result is the Fisher information matrix (not metric).
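As a small illustration of this recipe, here is a sketch for the coin with a flat prior, which is an assumption made purely for the example. With `heads` heads and `tails` tails observed, the second derivative of $-\log(\text{posterior})$ at the mode works out by hand to $n / (\hat{x}(1-\hat{x}))$, where $n$ is the number of flips and $\hat{x}$ is the observed fraction of heads. That is $n$ times the Fisher information $1/(x(1-x))$ of a single flip, evaluated at the mode. The function names and the data are hypothetical.

```python
import math

def neg_log_posterior(x, heads, tails):
    """-log of the unnormalized posterior for a coin, assuming a flat prior."""
    return -(heads * math.log(x) + tails * math.log(1.0 - x))

def curvature_at_mode(heads, tails, h=1e-4):
    """Second derivative of -log(posterior) at its mode, by central differences."""
    mode = heads / (heads + tails)
    f = lambda x: neg_log_posterior(x, heads, tails)
    return (f(mode + h) - 2 * f(mode) + f(mode - h)) / (h * h)

heads, tails = 60, 40  # made-up data
n = heads + tails
mode = heads / n
print("curvature at mode:      ", curvature_at_mode(heads, tails))
print("n * Fisher info at mode:", n / (mode * (1 - mode)))
```

So in this toy case the curvature of the log posterior grows linearly with the amount of data, with the single-flip Fisher information as the rate.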

In statistical inference, you are therefore interested in how the curvature of log(posterior) changes at a single point as more data becomes available. You could say: data tells the log(posterior) how to curve.

It’s not miscommunication, just stupidity on my part. I thought we were discussing how far apart two distributions are, instead it is “how far apart two hypotheses are”?

I can kind of imagine smoothly mapping points in M to distributions in Ω. I have trouble imagining smoothly mapping independent event distributions to points. As Graham suggests I think in practice there is fuzziness with points in M being special cases of pointy distributions in M.

But even with points, what about inference as parameter estimation? Probably we need homework examples. Is this an example:

Suppose during the week I’m likely to have to stop by a store to get milk, and/or bread, and/or dog food, and/or Coke, depending on the day and time, say for my family more likely to need dog food later in the week. Then an event is me picking up some set of those, and a hypothesis is the set of probabilities of me picking each up? Then in this case different times have different hypotheses, so you can guess the time from what I got.

Tomate discusses Fisher information, starting from a rework of John Baez’s post Information Geometry (Part 7), but then moving on to new ideas.

In France, “Information Geometry” is included in a larger mathematical domain, “Geometric Science of Information”, that is debated in the Brillouin Seminar, launched in 2009:

http://repmus.ircam.fr/brillouin/past-events

http://repmus.ircam.fr/brillouin/home

You can register for Brillouin Seminar News :

http://listes.ircam.fr/wws/info/informationgeometry

Recently, Brillouin Seminar has organized :

– In 2011, a French-Indian Workshop on “Matrix Information Geometries” at Ecole Polytechnique. Proceedings will be published by Springer in 2012. Slides and abstracts are available on website :

http://www.informationgeometry.org/MIG/

– In 2012, a Symposium on “Information Geometry and Optimal Transport Theory” at Institut Henri Poincaré in Paris with GDR CNRS MSPC. All slides are available on the website :

http://www.ceremade.dauphine.fr/~peyre/mspc/mspc-thales-12/

You can find a very recent French PhD Dissertation in English on this subject by Yang Le and supervised by Marc Arnaudon :

Medians of probability measures in Riemannian manifolds

http://tel.archives-ouvertes.fr/tel-00664188

Treating the Fisher information metric as an infinitesimal Kullback-Leibler is interesting, but it begs the question: Given probability distributions P and Q (parameterized by some lambda), what is the geodesic distance between the two?

Well, I suppose this is a rhetorical question, it’s of course

$$ \int \sqrt{g_{jk} \, d\lambda^j \, d\lambda^k} $$

for some path connecting P and Q, we know this from textbooks. But what is this, intuitively?

The point is that since we have a metric, we then also have geodesics connecting any two points (assuming simply connected, convex manifold, etc). Since the KL doesn’t satisfy the triangle inequality, it clearly cannot be measuring the geodesic. So what is it measuring?

Well, the partial answer is that KL does not need to make an assumption of a parameter lambda, or of a “path” connecting P and Q, and so is “general” in this way. But still, I have no particular intuition here…
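For a one-parameter family the path between two hypotheses is unique, so the length can simply be computed. Below is a sketch for the coin family, whose Fisher metric works out by hand to $g(x) = 1/(x(1-x))$. The length of the segment from $a$ to $b$ is then $\int_a^b dx/\sqrt{x(1-x)} = 2(\arcsin\sqrt{b} - \arcsin\sqrt{a})$. The function names and the midpoint-rule integration are illustrative choices.

```python
import math

def fisher_length(a, b, steps=10000):
    """Length of [a, b] in the coin family under g(x) = 1/(x(1-x)), midpoint rule."""
    dx = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * dx
        total += math.sqrt(1.0 / (x * (1.0 - x))) * dx
    return total

def closed_form(a, b):
    """Antiderivative of 1/sqrt(x(1-x)) is 2*asin(sqrt(x))."""
    return 2.0 * (math.asin(math.sqrt(b)) - math.asin(math.sqrt(a)))

def rel_entropy_coin(x, y):
    """Entropy of the coin with bias x relative to the coin with bias y."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

a, b = 0.2, 0.8
print("numerical length:", fisher_length(a, b))
print("closed form:     ", closed_form(a, b))
print("relative entropy:", rel_entropy_coin(b, a))
```

Unlike the relative entropy, this length is additive along the path: the length from $a$ to $c$ equals the length from $a$ to $b$ plus the length from $b$ to $c$ for any $b$ in between.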

Err, well, clearly I mangled things a bit above, and can’t figure out how to edit to fix it. So, please “understand what I mean, not what I say”.

[Moderator's note: I fixed your comment. There is no ability for you to retroactively edit comments. On this blog, as on most WordPress blogs, you need to type

`$latex x = y$`

to get the equation

$x = y$

The ‘latex’ must occur directly after the dollar sign, with no space in between. There must be a space after it. And double dollar signs don’t work, at least on this blog.]

Oh, hmm, right. Silly me. The geodesic length is the square root of the Jensen-Shannon divergence, as pointed out by Gavin Crooks in the quoted article. I’m not masochistic enough to try to verify this explicitly, although it would be a good exercise…

Next question: in principle, one could calculate the Levi-Civita symbols, the curvature tensor, the Ricci tensor, “simply” by plugging in the expression of the Fisher metric, and turning the crank. My question then is: does one get lucky, do these expressions simplify, or do they remain a big mess?

Again, since Jensen-Shannon does follow the geodesic, and since it’s a symmetrized form of Kullback-Leibler, I guess this means that KL somehow follows a path that just starts in the wrong direction, and then curves around? Well, of course it does, but intuitively, it’s..?

Hi, Linas! Good to see you here! You wrote:

Since I forget what the Jensen–Shannon divergence is, I imagine many other readers will too, so let’s remind ourselves by peeking at the Wikipedia article…

Oh, okay. It’s a well-known way of making the relative entropy (also called the ‘Kullback-Leibler divergence’ by people who enjoy obscure technical terms) into an actual metric on the space of probability distributions.

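Given two probability distributions $p$ and $q$ on a measure space $(\Omega, d\omega)$ the *entropy of $p$ relative to $q$* is

$$ S(p,q) = \int_\Omega \frac{p}{q} \, \ln\left(\frac{p}{q}\right) \, q \, d\omega $$

To make this symmetrical, we define a new probability distribution

$$ m = \frac{1}{2}(p + q) $$

that’s the midpoint of $p$ and $q$ in some obvious naive sense, and define the *Jensen–Shannon distance* to be

$$ JS(p,q) = \frac{1}{2} S(p,m) + \frac{1}{2} S(q,m) $$

Unlike the relative entropy, this is obviously symmetric:

$$ JS(p,q) = JS(q,p) $$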

And unlike the relative entropy, but less obviously, its square root also obeys the triangle inequality!
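A quick numerical experiment helps pin down which quantity behaves like a distance. The sketch below computes the symmetrized quantity built from the midpoint distribution on a two-point set; the function names and the three test distributions are made-up illustrative choices. As far as is known from the literature on this divergence, it is the square root that satisfies the triangle inequality, and the example is chosen so that the un-rooted quantity visibly fails it.

```python
import math

def rel_entropy(p, q):
    """Entropy of p relative to q on a finite set; terms with p_k = 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def js(p, q):
    """Average relative entropy of p and q with respect to their midpoint m."""
    m = [0.5 * (pk + qk) for pk, qk in zip(p, q)]
    return 0.5 * rel_entropy(p, m) + 0.5 * rel_entropy(q, m)

p, q, r = [0.99, 0.01], [0.5, 0.5], [0.01, 0.99]

print("symmetry:", js(p, r), js(r, p))
print("direct:", js(p, r), " via q:", js(p, q) + js(q, r))
print("sqrt direct:", math.sqrt(js(p, r)),
      " sqrt via q:", math.sqrt(js(p, q)) + math.sqrt(js(q, r)))
```

In this example going directly from `p` to `r` costs more than going via `q` when the raw quantity is used, but not when square roots are taken.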

Nonetheless, I always thought this idea was a bit of a ‘trick’, and thus not worth studying.

For a second you made me think the Jensen–Shannon distance was just the square of the geodesic distance as measured by the Fisher information metric. And that would mean it’s *not* just a trick!

But unfortunately that’s not true… and probably not even what you were trying to say. The truth is here, and it’s less pretty.

Yellow caution alert: Over the last few days, I quadrupled the size of the wikipedia article on the Fisher information metric, synthesizing from the Crooks paper, some of your posts, and other bits and pieces culled via google. i.e. you’re reading what I wrote…

(Disclaimer: I learn things by expanding WP articles. They’re not peer-reviewed. Mistakes can creep in. They’re sometimes sloppy and disorganized; it doesn’t pay to be a perfectionist on WP).

Linas wrote:

In the end it’s all incredibly beautiful. I think the right path was sketched earlier here. First, observe that the space of probability distributions on an $n$-element set is an $(n-1)$-simplex. Then, show that the Fisher information metric makes this simplex isometric to a portion of a round $(n-1)$-sphere!

Once we know this, it’s clear that the curvature will be very nice and simple.

But last week Simon Willerton and I were trying to check that the simplex with its Fisher information metric really *is* isometric to a portion of a sphere. We got stuck, mainly because the ‘obvious’ map from the simplex to the sphere, namely the one that sends a point

$$ (p_1, \dots, p_n) $$

such that $p_i \ge 0$ and $\sum_i p_i = 1$ to the point

$$ \frac{(p_1, \dots, p_n)}{\sqrt{p_1^2 + \cdots + p_n^2}} $$

on the unit sphere, didn’t seem to work. We couldn’t guess the right one, and we couldn’t find a mistake in our calculation. I tried to cheat by looking up the answer, but I couldn’t quickly find it, and then I got distracted.

It’s all well-known stuff, to those who know it well. In a recent paper, Gromov hints that this relation between the Fisher information metric on the simplex and the round metric on the sphere is secretly a link between probability theory and quantum theory (where states lie on a sphere, but in a *complex* Hilbert space). But he doesn’t elaborate, and he might have been joking.

Perhaps Gromov is hinting at the random Gaussian unitary ensemble? I’ve only gotten 3-4 pages into his paper.

I imagine there should be all sorts of ‘secret links’, not just for simplexes and Hilbert spaces, but for generic homogeneous spaces. At the risk of stating ‘obvious’ things you understand better than I … Whenever one’s got some collection of operators acting on some space, you can ask what happens if you start picking out operators ‘at random’ (after specifying a measure that defines what ‘random’ means). By starting with something that’s homogeneous, you get gobs of symmetry, so measures and metrics are … um, ahem, cough cough ‘sphere-like’. Generalizations of Legendre polynomials and Clebsch-Gordan coefficients of good-ol su(2) to various wild hypergeometric series. There’s a vast ocean of neato results coming out of this: there’s also gobs of hyperbolic behaviour, lots of ergodic trajectories, and so analogs of the Riemann zeta that magically seem to obey the Riemann hypothesis (anything ergodic/hyperbolic seems to generalize something or other from number theory, this seems to be a general rule). I’m sure Gromov is quite aware of these things, given his history…

… wish there were more hours in the day, it feels like I’m in a candy store, sometimes…

Sometimes jokes turn out to be true.

I got motivated again to cheat and look up the answer, and it’s very fascinating! The trick is to think of an amplitude as being like *the square root* of a probability, as usual in quantum mechanics!

We can map a probability distribution

$$ (p_1, \dots, p_n) $$

to a point on the sphere of radius 1, namely

$$ (\sqrt{p_1}, \dots, \sqrt{p_n}) $$

And then the Fisher information metric on the simplex gets sent to the usual round metric on a portion of the sphere… *up to a constant factor*.

If you want the metrics to match exactly, you should map the simplex to the sphere of radius 2, via

$$ (p_1, \dots, p_n) \mapsto (2\sqrt{p_1}, \dots, 2\sqrt{p_n}) $$

I’m not sure how important that is: perhaps it simply means the Fisher information metric was defined slightly wrong, and should include a factor of 1/2 out front.
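The claim about the radius-2 sphere can be checked numerically without much effort. On the simplex, plugging a small displacement $dp$ into the Fisher metric formula gives the squared length $\sum_k dp_k^2 / p_k$. The sketch below compares that with the squared Euclidean distance between the images of $p$ and $p + dp$ under the square-root map. The base point, the displacement and the function names are illustrative assumptions.

```python
import math

def to_sphere(p):
    """Map a probability distribution to the sphere of radius 2."""
    return [2.0 * math.sqrt(pk) for pk in p]

def fisher_sq_length(p, dp):
    """Squared Fisher length of a small displacement dp at p: sum dp_k^2 / p_k."""
    return sum(d * d / pk for d, pk in zip(dp, p))

p = [0.5, 0.3, 0.2]
dp = [1e-4, -3e-4, 2e-4]  # sums to zero, so p + dp stays on the simplex
p2 = [pk + d for pk, d in zip(p, dp)]

x, x2 = to_sphere(p), to_sphere(p2)
radius_sq = sum(a * a for a in x)
euclid_sq = sum((a - b) ** 2 for a, b in zip(x, x2))

print("squared radius:       ", radius_sq)
print("Fisher squared length:", fisher_sq_length(p, dp))
print("sphere squared length:", euclid_sq)
```

The two squared lengths agree to leading order in the displacement, which is the infinitesimal statement that the map is an isometry. Dropping the factor of 2 in `to_sphere` would make them differ by a factor of 4.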

But what’s the deep inner meaning of this relation between probability distributions and points on the sphere? It seems to be saying that the ‘right’ way to measure distances between probability distributions is to treat them as coming from quantum states, and measure the distance between those!

Turns out that trick you provide above leads directly to the Fubini-Study metric. Set the phase part of the complex wave-function to zero, apply the Fubini-Study metric, and you get (four times) the Fisher information metric. The Bures metric is identical to the Fubini-Study metric, except that it’s normally written for mixed states, while FS is normally expressed w/ pure states. The differences in notation obscure a lot. Wikipedia knows all, including references.

The part that awes me is that log p and the wave-function phase alpha put together gives a symplectic form. This is surely somehow important, but I don’t know why.

Very intriguing stuff! I’m glad you’re trying to improve the Wikipedia article on Fisher information by adding some of these known but obscure facts. Someday some kid will read this stuff, put all the puzzle pieces together, and unify classical mechanics, quantum mechanics and probability theory in a new way. Unless we get there first, of course.

The other part that is weird is that log p materializes as if it were a vector in a tangent space. This suggests that the shannon entropy has some geometric interpretation, but I can’t tell what it is.

Huh. Will ponder. If I manage to clean this up, will insert into the WP article :-) A quickie google search for ‘fisher information metric’ and ‘curvature’ brings up papers complicated enough to suggest that few are aware of this: I guess a brute-force attack must get mired.