Relative Entropy (Part 2)

In the first part of this mini-series, I describe how various ideas important in probability theory arise naturally when you start doing linear algebra using only the nonnegative real numbers.

But after writing it, I got an email from a rather famous physicist saying he got “lost at line two”. So, you’ll be happy to hear that the first part is not a prerequisite for the remaining parts! I wrote it just to intimidate that guy.

Tobias Fritz and I have proved a theorem characterizing the concept of relative entropy, which is also known as ‘relative information’, ‘information gain’ or—most terrifying and least helpful of all—‘Kullback-Leibler divergence’. In this second part I’ll introduce two key players in this theorem. The first, \mathrm{FinStat}, is a category where:

• an object consists of a system with finitely many states, and a probability distribution on those states

and

• a morphism consists of a deterministic ‘measurement process’ mapping states of one system to states of another, together with a ‘hypothesis’ that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

The second, \mathrm{FP}, is a subcategory of \mathrm{FinStat}. It has all the same objects, but only morphisms where the hypothesis is ‘optimal’. This means that if the observer measures the system many times, and uses the probability distribution of their observations together with their hypothesis to guess the probability distribution of states of the system, they get the correct answer (in the limit of many measurements).

In this part all I will really do is explain precisely what \mathrm{FinStat} and \mathrm{FP} are. But to whet your appetite, let me explain how we can use them to give a new characterization of relative entropy!

Suppose we have any morphism in \mathrm{FinStat}. In other words: suppose we have a deterministic measurement process, together with a hypothesis that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.

Then we have two probability distributions on the states of the system being measured! First, the ‘true’ probability distribution. Second, the probability that the observer will guess based on their observations.

Whenever we have two probability distributions on the same set, we can compute the entropy of the first relative to to the second. This describes how surprised you’ll be if you discover the probability distribution is really the first, when you thought it was the second.

So: any morphism in \mathrm{FinStat} will have a relative entropy. It will describe how surprised the observer will be when they discover the true probability distribution, given what they had guessed.

But this amount of surprise will be zero if their hypothesis was ‘optimal’ in the sense I described. So, the relative entropy will vanish on morphisms in \mathrm{FP}.

Our theorem says this fact almost characterizes the concept of relative entropy! More precisely, it says that any convex-linear lower semicontinuous functor

F : \mathrm{FinStat} \to [0,\infty]

that vanishes on the subcategory \mathrm{FP} must equal some constant times the relative entropy.

Don’t be scared! This should not make sense to you yet, since I haven’t said how I’m thinking of [0,+\infty] as a category, nor what a ‘convex-linear lower semicontinuous functor’ is, nor how relative entropy gives one. I will explain all that later. I just want you to get a vague idea of where I’m going.

Now let me explain the categories \mathrm{FinStat} and \mathrm{FP}. We need to warm up a bit first.

FinStoch

A stochastic map f : X \leadsto Y is different from an ordinary function, because instead of assigning a unique element of Y to each element of X, it assigns a probability distribution on Y to each element of X. So you should imagine it as being like a function ‘with random noise added’, so that f(x) is not a specific element of Y, but instead has a probability of taking on different values. This is why I’m using a weird wiggly arrow to denote a stochastic map.

More formally:

Definition. Given finite sets X and Y, a stochastic map f : X \leadsto Y assigns a real number f_{yx} to each pair x \in X, y \in Y, such that fixing any element x, the numbers f_{yx} form a probability distribution on Y. We call f_{yx} the probability of y given x.

In more detail:

f_{yx} \ge 0 for all x \in X, y \in Y.

and

\displaystyle{ \sum_{y \in Y} f_{yx} = 1} for all x \in X.

Note that we can think of f : X \leadsto Y as a Y \times X-shaped matrix of numbers. A matrix obeying the two properties above is called stochastic. This viewpoint is nice because it reduces the problem of composing stochastic maps to matrix multiplication. It’s easy to check that multiplying two stochastic matrices gives a stochastic matrix. So, composing stochastic maps gives a stochastic map.

We thus get a category:

Definition. Let \mathrm{FinStoch} be the category of finite sets and stochastic maps between them.

In case you’re wondering why I’m restricting attention to finite sets, it’s merely because I want to keep things simple. I don’t want to worry about whether sums or integrals converge.

FinProb

Now take your favorite 1-element set and call it 1. A function p : 1 \to X is just a point of X. But a stochastic map p : 1 \leadsto X is something more interesting: it’s a probability distribution on X.

Why? Because it gives a probability distribution on X for each element of 1, but that set has just one element.

Last time I introduced the rather long-winded phrase finite probability measure space to mean a finite set with a probability distribution on it. But now we’ve seen a very quick way to describe such a thing within \mathrm{FinStoch}:

And this gives a quick way to think about a measure-preserving function between finite probability measure spaces! It’s just a commutative triangle like this:

Note that the horizontal arrow f:  X \to Y is not wiggly. The straight arrow means it’s an honest function, not a stochastic map. But a function is a special case of a stochastic map! So it makes sense to compose a straight arrow with a wiggly arrow—and the result is, in general, a wiggly arrow. So, it makes sense to demand that this triangle commutes, and this says that the function f: X \to Y is measure-preserving.

Let me work through the details, in case they’re not clear.

First: how is a function a special case of a stochastic map? Here’s how. If we start with a function f: X \to Y, we get a matrix of numbers

f_{yx} = \delta_{y,f(x)}

where \delta is the Kronecker delta. So, each element x \in X gives a probability distribution that’s zero except at f(x).

Given this, we can work out what this commuting triangle really says:

If use p_x to stand for the probability distribution that p: 1 \leadsto X puts on X, and similarly for q_y, the commuting triangle says

\displaystyle{ q_y = \sum_{x \in X} \delta_{y,f(x)} p_x}

or in other words:

\displaystyle{ q_y = \sum_{x \in X : f(x) = y} p_x }

or if you like:

\displaystyle{ q_y = \sum_{x \in f^{-1}(y)} p_x }

In this situation people say q is p pushed forward along f, and they say f is a measure-preserving function.

So, we’ve used \mathrm{FinStoch} to describe another important category:

Definition. Let \mathrm{FinProb} be the category of finite probability measure spaces and measure-preserving functions between them.

I can’t resist mentioning another variation:

A commuting triangle like this is a measure-preserving stochastic map. In other words, p gives a probability measure on X, q gives a probability measure on Y, and f: X \leadsto Y is a stochastic map with

\displaystyle{ q_y = \sum_{x \in X} f_{yx} p_x }

FinStat

The category we really need for relative entropy is a bit more subtle. An object is a finite probability measure space:

but a morphism looks like this:

The whole diagram doesn’t commute, but the two equations I wrote down hold. The first equation says that f: X \to Y is a measure-preserving function. In other words, this triangle, which we’ve seen before, commutes:

The second equation says that f \circ s is the identity, or in math jargon, s is a section for f.

But what does that really mean?

The idea is that X is the set of ‘states’ of some system, while Y is a set of possible ‘observations’ you might make. The function f is a ‘measurement process’. You ‘measure’ the system using f, and if the system is in the the state x you get the observation f(x). The probability distribution p says the probability that the system is any given state, while q says the probability that you get any given observation when you do your measurement.

Note: are assuming for now that that there’s no random noise in the observation process! That’s why f is a function instead of a stochastic map.

But what about s? That’s the fun part: s describes your ‘hypothesis’ about the system’s state given a particular measurement! If you measure the system and get a result y \in Y, you guess it’s in the state x with probability s_{xy}.

And we don’t want this hypothesis to be really dumb: that’s what

f \circ s = 1_Y

says. You see, this equation says that

\sum_{x \in X} \delta_{y', f(x)} s_{xy} = \delta_{y' y}

or in other words:

\sum_{x \in f^{-1}(y')} s_{xy} = \delta_{y' y}

If you think about it, this implies s_{xy} = 0 unless f(x) = y.

So, if you make an observation y, you will guess the system is in state x with probability zero unless f(x) = y. In short, you won’t make a really dumb guess about the system’s state.

Here’s how we compose morphisms:

We get a measure-preserving function g \circ f : X \to Z and a stochastic map going back, s \circ t : Z \to Z. You can check that these obey the required equations:

g \circ f \circ p = r

g \circ f \circ s \circ t = 1_Z

So, we get a category:

Definition. Let \mathrm{FinStat} be the category where an object is a finite probability measure space:

a morphism is a diagram obeying these equations:

and composition is defined as above.

FP

As we’ve just seen, a morphism in \mathrm{FinStat} consists of a ‘measurement process’ f and a ‘hypothesis’ s:

But sometimes we’re lucky and our hypothesis is optimal, in the sense that

s \circ q = p

Conceptually, this says that if you take the probability distribution q on our observations and use it to guess a probability distribution for the system’s state using our hypothesis s, you get the correct answer: p.

Mathematically, it says that this diagram commutes:

In other words, s is a measure-preserving stochastic map.

There’s a subcategory of \mathrm{FinStat} with all the same objects, but only these ‘optimal’ morphisms. It’s important, but the name we have for it is not very exciting:

Definition. Let \mathrm{FP} be the subcategory of \mathrm{FinStat} where an object is a finite probability measure space

and a morphism is a diagram obeying these equations:

Why do we call this category \mathrm{FP}? Because it’s a close relative of \mathrm{FinProb}, where a morphism, you’ll remember, looks like this:

The point is that for a morphism in \mathrm{FP}, the conditions on s are so strong that they completely determine it unless there are observations that happen with probability zero—that is, unless there are y \in Y with q_y = 0. To see this, note that

s \circ q = p

actually says

\sum_{y \in Y} s_{xy} q_y = p_x

for any choice of x \in X. But we’ve already seen s_{xy} = 0 unless f(x) = y, so the sum has just one term, and the equation says

s_{x,f(x)} q_{f(x)} = p_x

We can solve this for s_{x,f(x)}, so s is completely determined… unless q_{f(x)} = 0.

This covers the case when y = f(x). We also can’t figure out s_{x,y} if y isn’t in the image of f.

So, to be utterly precise, s is determined by p, q and f unless there’s an element y \in Y that has q_y = 0. Except for this special case, a morphism in \mathrm{FP} is just a morphism in \mathrm{FinProb}. But in this special case, a morphism in \mathrm{FP} has a little extra information: an arbitrary probability distribution on the inverse image of each point y with this property.

In short, \mathrm{FP} is the same as \mathrm{FinProb} except that our observer’s ‘optimal hypothesis’ must provide a guess about the state of the system given an observation, even in cases of observations that occur with probability zero.

I’m going into these nitpicky details for two reasons. First, we’ll need \mathrm{FP} for our characterization of relative entropy. But second, Tom Leinster already ran into this category in his work on entropy and category theory! He discussed it here:

• Tom Leinster, An operadic introduction to entropy.

Despite the common theme of entropy, he arrived at it from a very different starting-point.

Conclusion

So, I hope that next time I can show you something like this:

and you’ll say “Oh, that’s a probability distribution on the states of some system!” Intuitively, you should think of the wiggly arrow p as picking out a ‘random element’ of the set X.

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, sending a probability distribution on the states of the measured system to a probability distribution on observations!”

I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with a hypothesis about the system’s state, given what is observed!”

And I hope I can show you this:

and you’ll say “Oh, that’s a deterministic measurement process, together with an optimal hypothesis about the system’s state, given what is observed!”

I don’t count on it… but I can hope.

Postscript

And speaking of unrealistic hopes, if I were really optimistic I would hope you noticed that \mathrm{FinStoch} and \mathrm{FinProb}, which underlie the more fancy categories I’ve discussed today, were themselves constructed starting from linear algebra over the nonnegative numbers [0,\infty) in Part 1. That ‘foundational’ work is not really needed for what we’re doing now. However, I like the fact that we’re ultimately getting the concept of relative entropy starting from very little: just linear algebra, using only nonnegative numbers!

11 Responses to Relative Entropy (Part 2)

  1. arch1 says:

    Thanks John, this was admirably clear & self contained; I think I got most of it using only an elementary understanding of sets, mappings, probability distributions, and summation and composition notation (though I might be deluding myself, as I’m still fuzzy on what a category is and on exactly what a morphism is; and I still don’t think I know what it means for a diagram to ‘commute':-).

    “unless there are observations that happen with probability zero—that is, unless there are y \in X“: Do you mean Y instead of X?

    • John Baez says:

      arch1 wrote:

      “unless there are observations that happen with probability zero—that is, unless there are y \in X”: Do you mean Y instead of X?

      Yes, thanks—I’ll fix that. Deliberately calling an element of Xy” would be heinous mathematical crime… here in Singapore I’d probably get caned for it.

      I’m really glad you found the exposition clear! I’m warming up to write a paper about this, so I need to know what works and what doesn’t.

      I’ll tell you what it means for a diagram to commute. Here’s a simple example:

      A diagram commutes whenever, given two ways to get from one object to another by following a chain of arrows, those two ways are equal. In this case we have two ways to get from 1 to Y. There's a direct way using just q and indirect way using first p and then f. But they are equal—that's what the equation below the diagram says. If I were talking to category theorists, I could skip the equation and just say "the diagram commutes".

      This pays off in situations where we have big complicated diagrams like this:

      (from the proof of the ‘zig-zag lemma’ on Wikipedia) or this:


      (a typical sort of thing you see on an algebraic topologist’s whiteboard, drawn by Patrick Orson.) Instead of writing down dozens of equations, you get a nice visual depiction of what’s going on, and you can learn to reason very quickly with these diagrams.

      Unfortunately my post is not a great introduction to commutative diagrams, for two reasons.

      First, I’m heavily using two kinds of arrows, straight ones and wiggly ones. This is a bit nonstandard; it means my diagrams involve not just one category but two: the category of finite sets and stochastic maps (wiggly arrows), and a subcategory of that, the category of finite sets and functions (straight arrows). If you’re just getting started at this game, you’d want to start with one kind of arrow.

      Second, lots of my diagrams don’t commute! They don’t completely commute, so I have to say which equations do hold, by writing them below the diagram, like here:

      As an exercise you can try to guess an equation not listed here that would hold if the diagram did commute.

      • Steve says:

        Thanks for the broader overview of those diagrams! As long as we’re correcting small typos, I’m thinking that by “It’s each to check that multiplying two stochastic matrices gives a stochastic matrix,” you meant “It’s easy to check that multiplying two stochastic matrices gives a stochastic matrix.”

  2. arch1 says:

    Thanks John. (The two arrow types don’t cause me problems if I remind myself that, as you said, functions are special cases of stochastic maps.)

    I think that the equation ‘q then s yields the same distribution on X as p does’ answers your exercise (in other words, as you imply above, if its diagram commutes a hypothesis is optimal).

    • John Baez says:

      Yes, that’s one of the equations that would hold if the whole diagram above commuted. There’s also another:

      s \circ f = 1_X

      And this is something we really don’t want, usually. Taken together with

      f \circ s = 1_Y

      that would imply a very strong condition on our function f.

      Can anyone see what this condition amounts to?

  3. John Baez says:

    I had written:

    This covers the case when y = f(x). We also can’t figure out s_{x,y} if y isn’t in the image of f.

    But now I see that second case never comes up! In this situation

    we can show that the function f is onto. The paper I’m writing with Tobias will explain this…

  4. In the last couple days I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper:

    Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

    Relative Entropy (Part 2): a category related to statistical inference, \mathrm{FinStat}, and how relative entropy defines a functor on this category.

    But then Tobias Fritz noticed a big problem.

  5. […] but I don’t have the knowledge and training to develop this idea — you’d need someone like John Baez for […]

  6. Tobias Fritz, a postdoc at the Perimeter Institute, is working with me on category-theoretic aspects of information theory. We published a paper on entropy with Tom Leinster, and we’ve got a followup on relative entropy that’s almost done. I should be working on it right this instant! But for now, read the series of posts here on Azimuth: Relative Entropy Part 1, Part 2 and Part 3. […]

  7. Now Tobias Fritz and I have finally finished our paper on this subject:

    A Bayesian characterization of relative entropy.

You can use HTML in your comments. You can also use LaTeX, like this: $latex E = m c^2 $. The word 'latex' comes right after the first dollar sign, with a space after it.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 2,818 other followers