Today, I want to describe how the Fisher information metric is related to relative entropy. I’ve explained both these concepts separately (click the links for details); now I want to put them together.
But first, let me explain what this whole series of blog posts is about. Information geometry, obviously! But what’s that?
Information geometry is the geometry of ‘statistical manifolds’. Let me explain that concept twice: first vaguely, and then precisely.
Vaguely speaking, a statistical manifold is a manifold whose points are hypotheses about some situation. For example, suppose you have a coin. You could have various hypotheses about what happens when you flip it. For example: you could hypothesize that the coin will land heads up with probability $x$, where $x$ is any number between 0 and 1. This makes the interval $[0,1]$ into a statistical manifold. Technically this is a manifold with boundary, but that’s okay.
Or, you could have various hypotheses about the IQs of American politicians. For example: you could hypothesize that they’re distributed according to a Gaussian probability distribution with mean $\mu$ and standard deviation $\sigma$. This makes the space of pairs $(\mu, \sigma)$ into a statistical manifold. Of course we require $\sigma \ge 0$, which gives us a manifold with boundary. We might also want to assume $\mu \ge 0$, which would give us a manifold with corners, but that’s okay too. We’re going to be pretty relaxed about what counts as a ‘manifold’ here.
If we have a manifold whose points are hypotheses about some situation, we say the manifold ‘parametrizes’ these hypotheses. So, the concept of statistical manifold is fundamental to the subject known as parametric statistics.
Parametric statistics is a huge subject! You could say that information geometry is the application of geometry to this subject.
But now let me go ahead and make the idea of ‘statistical manifold’ more precise. There’s a classical and a quantum version of this idea. I’m working at the Centre of Quantum Technologies, so I’m being paid to be quantum—but today I’m in a classical mood, so I’ll only describe the classical version. Let’s say a classical statistical manifold is a smooth function $p$ from a manifold $M$ to the space of probability distributions on some measure space $\Omega$.
We should think of $\Omega$ as a space of events. In our first example, it’s just the two-element set $\{\mathrm{heads}, \mathrm{tails}\}$: we flip a coin and it lands either heads up or tails up. In our second it’s $\mathbb{R}$: we measure the IQ of an American politician and get some real number.
We should think of $M$ as a space of hypotheses. For each point $x \in M$, we have a probability distribution $p_x$ on $\Omega$. This is a hypothesis about the events in question: for example “when I flip the coin, there’s 55% chance that it will land heads up”, or “when I measure the IQ of an American politician, the answer will be distributed according to a Gaussian with mean 0 and standard deviation 100.”
Now, suppose someone hands you a classical statistical manifold $p \colon M \to \{\text{probability distributions on } \Omega\}$. Each point $x$ in $M$ is a hypothesis. Apparently some hypotheses are more similar than others. It would be nice to make this precise. So, you might like to define a metric on $M$ that says how ‘far apart’ two hypotheses are. People know lots of ways to do this; the challenge is to find ways that have clear meanings.
Last time I explained the concept of relative entropy. Suppose we have two probability distributions on $\Omega$, say $p$ and $q$. Then the entropy of $q$ relative to $p$ is the amount of information you gain when you start with the hypothesis $p$ but then discover that you should switch to the new improved hypothesis $q$. It equals:

$$S(q,p) = \int_\Omega q \ln\!\left(\frac{q}{p}\right) d\omega$$
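To make this concrete, here’s a quick numerical illustration using the coin example from earlier. This snippet is just my own sketch (the function name is mine); it uses natural logarithms, so information is measured in nats:

```python
import math

def rel_entropy(q, p):
    """S(q, p) = sum over outcomes of q * ln(q/p): the information gained
    on switching from hypothesis p to hypothesis q, in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

p = [0.5, 0.5]    # "the coin is fair"
q = [0.55, 0.45]  # "the coin lands heads 55% of the time"
print(rel_entropy(q, p))  # ≈ 0.0050 nats
print(rel_entropy(p, q))  # close, but not equal: S is not symmetric
```

Note the asymmetry: $S(q,p)$ and $S(p,q)$ differ, an early hint that $S$ is not a distance function.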
You could try to use this to define a distance between points $x$ and $y$ in our statistical manifold, like this:

$$S(x,y) = \int_\Omega p_x \ln\!\left(\frac{p_x}{p_y}\right) d\omega$$
This is definitely an important function. Unfortunately, as I explained last time, it doesn’t obey the axioms that a distance function should! Worst of all, it doesn’t obey the triangle inequality.
Can we ‘fix’ it? Yes, we can! And when we do, we get the Fisher information metric, which is actually a Riemannian metric on $M$. Suppose we put local coordinates $x^i$ on some patch of $M$ containing the point $x$. Then the Fisher information metric is given by:

$$g_{ij}(x) = \int_\Omega \partial_i(\ln p_x)\, \partial_j(\ln p_x)\; p_x \, d\omega$$

where $\partial_i$ is short for $\partial/\partial x^i$.
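To see the formula in action in the simplest case, here’s a sketch of mine computing it for the coin family $p_x = (x, 1-x)$, where the integral over $\Omega$ is just a two-term sum, and comparing with the well-known closed form $g(x) = 1/(x(1-x))$:

```python
def fisher_metric_coin(x):
    """g(x) for the coin family p_x = (x, 1-x):
    the sum over outcomes of (d/dx ln p)^2 * p."""
    outcomes = [(x, 1.0), (1.0 - x, -1.0)]  # (probability, d(probability)/dx)
    return sum((dp / p) ** 2 * p for p, dp in outcomes)

print(fisher_metric_coin(0.3))   # ≈ 4.7619
print(1 / (0.3 * (1 - 0.3)))     # the closed form 1/(x(1-x)): same number
```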
You can think of my whole series of articles so far as an attempt to understand this funny-looking formula. I’ve shown how to get it from a few different starting-points, most recently back in Part 3. But now let’s get it starting from relative entropy!
Fix any point in our statistical manifold and choose local coordinates $x^i$ for which this point is the origin, $x = 0$. The amount of information we gain if we move to some other point $x$ is the relative entropy $S(x,0)$. But what’s this like when $x$ is really close to $0$? We can imagine doing a Taylor series expansion of $S(x,0)$ to answer this question.
Surprisingly, to first order the answer is always zero! Mathematically:

$$\left. \frac{\partial}{\partial x^i} S(x,0) \right|_{x=0} = 0$$
In plain English: if you change your mind slightly, you learn a negligible amount — not an amount proportional to how much you changed your mind.
This must have some profound significance. I wish I knew what. Could it mean that people are reluctant to change their minds except in big jumps?
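We can watch this happen numerically in the coin family: halving the size of the change roughly quarters the information gained, i.e. $S$ is quadratic, not linear, in small changes of mind. (The snippet is my own illustration.)

```python
import math

def S(a, b):
    """Relative entropy between coin hypotheses: new heads-probability a, old b."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

x = 0.3
for eps in [0.1, 0.05, 0.025]:
    print(eps, S(x + eps, x))  # each halving of eps roughly quarters S
```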
Anyway, if you think about it, this fact makes it obvious that $S$ can’t obey the triangle inequality. $S(x,y)$ could be pretty big, but if we draw a curve from $y$ to $x$, and mark $N$ closely spaced points $x_0 = y, x_1, \dots, x_N = x$ on this curve, then each $S(x_{k+1}, x_k)$ is zero to first order in the spacing, so it must be of order $1/N^2$. So, if the triangle inequality were true, we’d have

$$S(x,y) \le \sum_{k=0}^{N-1} S(x_{k+1}, x_k) \sim N \cdot \frac{1}{N^2} = \frac{1}{N}$$

for all $N$, which is a contradiction.
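Here’s the subdivision argument run numerically for the coin family (my own sketch): chopping the jump from $x = 0.2$ to $x = 0.8$ into $N$ equal steps makes the summed information gain shrink roughly like $1/N$, far below the one-jump value.

```python
import math

def S(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

x0, x1 = 0.2, 0.8
direct = S(x1, x0)  # change your mind in one big jump
print("direct:", direct)
for N in [1, 2, 4, 8]:
    pts = [x0 + (x1 - x0) * k / N for k in range(N + 1)]
    chained = sum(S(pts[k + 1], pts[k]) for k in range(N))
    print(N, chained)  # already below `direct` for N >= 2
```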
In plain English: if you change your mind in one big jump, the amount of information you gain is more than the sum of the amounts you’d gain if you changed your mind in lots of little steps! This seems pretty darn strange, but the paper I mentioned in Part 1 helps:
• Gavin E. Crooks, Measuring thermodynamic length.
You’ll see he takes a curve and chops it into lots of little pieces as I just did, and explains what’s going on.
Okay, so what about second order? What’s

$$\left. \frac{\partial^2}{\partial x^i \, \partial x^j} S(x,0) \right|_{x=0} \; ?$$
Well, this is the punchline of this blog post: it’s the Fisher information metric:

$$\left. \frac{\partial^2}{\partial x^i \, \partial x^j} S(x,0) \right|_{x=0} = g_{ij}(0)$$
And since the Fisher information metric is a Riemannian metric, we can then apply the usual recipe and define distances in a way that obeys the triangle inequality. Crooks calls this distance thermodynamic length in the special case that he considers, and he explains its physical meaning.
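Before proving the punchline, here’s a quick one-dimensional spot-check of it (my own illustration): for the coin family, a finite-difference second derivative of $S$ agrees with the Fisher metric $g = 1/(x(1-x))$.

```python
import math

def S(a, b):
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

x, h = 0.3, 1e-4
# central second difference of S(y, x) in y, evaluated at y = x
second_deriv = (S(x + h, x) - 2 * S(x, x) + S(x - h, x)) / h**2
g = 1 / (x * (1 - x))  # Fisher metric of the coin family
print(second_deriv, g)  # both ≈ 4.7619
```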
Now let me prove that

$$\left. \frac{\partial}{\partial x^i} S(x,0) \right|_{x=0} = 0 \qquad \text{and} \qquad \left. \frac{\partial^2}{\partial x^i \, \partial x^j} S(x,0) \right|_{x=0} = g_{ij}(0)$$
This can be somewhat tedious if you do it by straightforwardly grinding it out—I know, I did it. So let me show you a better way, which requires more conceptual acrobatics but less brute force.
The trick is to work with the universal statistical manifold for the measure space $\Omega$. Namely, we take $M$ to be the space of all probability distributions on $\Omega$! This is typically an infinite-dimensional manifold, but that’s okay: we’re being relaxed about what counts as a manifold here. In this case, we don’t need to write $p_x$ for the probability distribution corresponding to the point $x \in M$: a point of $M$ just is a probability distribution on $\Omega$, so we’ll just call it $p$.
If we can prove the formulas for this universal example, they’ll automatically follow for every other example, by abstract nonsense. Why? Because any statistical manifold with measure space $\Omega$ is the same as a manifold with a smooth map to the universal statistical manifold! So, geometrical structures on the universal one ‘pull back’ to give structures on all the rest. The Fisher information metric and the function $S$ can be defined as pullbacks in this way! So, to study them, we can just study the universal example.
(If you’re familiar with ‘classifying spaces for bundles’ or other sorts of ‘classifying spaces’, all this should seem awfully familiar. It’s a standard math trick.)
So, let’s prove that

$$\left. \frac{\partial}{\partial x^i} S(x,0) \right|_{x=0} = 0$$

by proving it in the universal example. Given any probability distribution $p$, and taking a nearby probability distribution $q$, we can write

$$q = p + f$$

where $f$ is some small function. We only need to show that $S(q,p)$ is zero to first order in $f$. And this is pretty easy. By definition:

$$S(q,p) = \int_\Omega q \ln\!\left(\frac{q}{p}\right) d\omega$$

or in other words,

$$S(q,p) = \int_\Omega (p+f) \ln\!\left(1 + \frac{f}{p}\right) d\omega$$
We can calculate this to first order in $f$ and show we get zero. But let’s actually work it out to second order, since we’ll need that later:

$$\int_\Omega (p+f) \ln\!\left(1 + \frac{f}{p}\right) d\omega = \int_\Omega (p+f)\left(\frac{f}{p} - \frac{f^2}{2p^2} + \cdots\right) d\omega = \int_\Omega \left(f + \frac{f^2}{2p}\right) d\omega + \cdots$$
Why does this vanish to first order in $f$? It’s because $p$ and $q = p + f$ are both probability distributions, so

$$\int_\Omega p \, d\omega = 1 \qquad \text{and} \qquad \int_\Omega (p + f) \, d\omega = 1$$

so subtracting we see

$$\int_\Omega f \, d\omega = 0$$
So, $S(q,p)$ vanishes to first order in $f$. Voilà!
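On a finite probability space the whole calculation can be checked directly (again, my own sketch): with $\int f = 0$, the exact relative entropy matches the second-order term $\int f^2/2p$ up to corrections of order $f^3$.

```python
import math

# a three-outcome probability space; f sums to zero, so q = p + f
# is again a probability distribution
p = [0.2, 0.3, 0.5]
f = [0.01, -0.004, -0.006]
q = [pi + fi for pi, fi in zip(p, f)]

exact = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
second_order = sum(fi**2 / (2 * pi) for fi, pi in zip(f, p))
print(exact, second_order)  # agree to about 1e-5: the difference is O(f^3)
```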
Next let’s prove the more interesting formula:

$$\left. \frac{\partial^2}{\partial x^i \, \partial x^j} S(x,0) \right|_{x=0} = g_{ij}(0)$$

which relates relative entropy to the Fisher information metric. Since both sides are symmetric matrices, it suffices to show their diagonal entries agree in any coordinate system:

$$\left. \frac{\partial^2}{\partial x^i \, \partial x^i} S(x,0) \right|_{x=0} = g_{ii}(0)$$
Devoted followers of this series of posts will note that I keep using this trick, which takes advantage of the polarization identity.
Again, it’s enough to consider the universal example. We take the origin to be some probability distribution $p$, and take $q$ to be a nearby probability distribution obtained by pushing $p$ a tiny bit in the $i$th coordinate direction. As before we write $q = p + f$. We look at the second-order term in our formula for $S(q,p)$:

$$\int_\Omega \frac{f^2}{2p} \, d\omega$$
Using the usual second-order Taylor’s formula, which has a $\frac{1}{2}$ built into it, we can say

$$S(q,p) \sim \frac{1}{2} \left. \frac{\partial^2}{\partial x^i \, \partial x^i} S(x,0) \right|_{x=0}$$
On the other hand, our formula for the Fisher information metric gives

$$g_{ii} = \int_\Omega \partial_i(\ln p)\, \partial_i(\ln p)\; p \, d\omega = \int_\Omega \frac{(\partial_i p)^2}{p} \, d\omega$$
The right hand sides of the last two formulas look awfully similar! And indeed they agree, because we can show that

$$f = \partial_i p$$
How? Well, we assumed that $q$ is what we get by taking $p$ and pushing it a little bit in the $i$th coordinate direction; we have also written that little change as

$$q = p + f$$

for some small function $f$. So,

$$f = q - p = \partial_i p$$

and thus

$$\int_\Omega \frac{f^2}{p} \, d\omega = \int_\Omega \frac{(\partial_i p)^2}{p} \, d\omega = g_{ii}$$

matching the second derivative of $S$, as claimed.
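In the coin family this identification is easy to check numerically (my own sketch): nudging the parameter by $\epsilon$ gives $f = \epsilon \, \partial_x p$, and $\int f^2/p$, rescaled by $\epsilon^2$, recovers the Fisher metric.

```python
# coin family p_x = (x, 1-x); push the parameter a little bit
x, eps = 0.3, 1e-3
p = [x, 1 - x]
q = [x + eps, 1 - x - eps]
f = [qi - pi for qi, pi in zip(q, p)]  # the little change f = q - p
dp_dx = [1.0, -1.0]                    # derivative of each probability in x

print(f, [eps * d for d in dp_dx])     # f matches eps * dp/dx

# (1/eps^2) * sum of f^2/p recovers g_ii = 1/(x(1-x))
g_from_f = sum(fi**2 / pi for fi, pi in zip(f, p)) / eps**2
print(g_from_f, 1 / (x * (1 - x)))
```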
This argument may seem a little hand-wavy and nonrigorous, with words like ‘a little bit’. If you’re used to taking arguments involving infinitesimal changes and translating them into calculus (or differential geometry), it should make sense. If it doesn’t, I apologize. It’s easy to make it more rigorous, but only at the cost of more annoying notation, which doesn’t seem good in a blog post.
Boring technicalities

If you’re actually the kind of person who reads a section called ‘boring technicalities’, I’ll admit to you that my calculations don’t make sense if the integrals diverge, or if we’re dividing by zero in the ratio $q/p$. To avoid these problems, here’s what we should do. Fix a $\sigma$-finite measure space $(\Omega, d\omega)$. Then, define the universal statistical manifold $M$ to be the space consisting of all probability measures that are equivalent to $d\omega$, in the usual sense of measure theory. By the Radon–Nikodym theorem, we can write any such measure as $p \, d\omega$ where $p \in L^1(\Omega, d\omega)$. Moreover, given two of these guys, say $p \, d\omega$ and $q \, d\omega$, they are absolutely continuous with respect to each other, so we can write

$$q \, d\omega = \frac{q}{p} \; p \, d\omega$$

where the ratio $q/p$ is well-defined almost everywhere and lies in $L^1(\Omega, p \, d\omega)$. This is enough to guarantee that we’re never dividing by zero, and I think it’s enough to make sure all my integrals converge.
We do still need to make $M$ into some sort of infinite-dimensional manifold, to justify all the derivatives. There are various ways to approach this issue, all of which start from the fact that $L^1(\Omega, d\omega)$ is a Banach space, which is about the nicest sort of infinite-dimensional manifold one could imagine. Sitting in $L^1(\Omega, d\omega)$ is the hyperplane consisting of functions $p$ with

$$\int_\Omega p \, d\omega = 1$$

and this is a Banach manifold. To get $M$ we need to take the subset of this hyperplane consisting of functions with $p > 0$ almost everywhere. If this subset were open, $M$ would be a Banach manifold in its own right. I haven’t checked whether it is, but there are various reasons not to worry.
For one thing, there’s a nice theory of ‘diffeological spaces’, which generalize manifolds. Every Banach manifold is a diffeological space, and every subset of a diffeological space is again a diffeological space. For many purposes we don’t need our ‘statistical manifolds’ to be manifolds: diffeological spaces will do just fine. This is one reason why I’m being pretty relaxed here about what counts as a ‘manifold’.
For another, I know that people have worked out a lot of this stuff, so I can just look things up when I need to. And so can you! This book is a good place to start:
• Paolo Gibilisco, Eva Riccomagno, Maria Piera Rogantin and Henry P. Wynn, Algebraic and Geometric Methods in Statistics, Cambridge U. Press, Cambridge, 2009.
I find the chapters by Raymond Streater especially congenial. For the technical issue I’m talking about now it’s worth reading section 14.2, “Manifolds modelled by Orlicz spaces”, which tackles the problem of constructing a universal statistical manifold in a more sophisticated way than I’ve just done. And in chapter 15, “The Banach manifold of quantum states”, he tackles the quantum version!