Holidays are great. There’s nothing I need to do! Everybody is celebrating! So, I can finally get some real work done.

In the last couple of days I’ve finished a paper with Jamie Vicary on wormholes and entanglement… subject to his approval and corrections. More on that later. And now I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper:

• Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

• Relative Entropy (Part 2): a category related to statistical inference, and how relative entropy defines a functor on this category.

But then Tobias Fritz noticed a big problem. Our characterization of relative entropy was inspired by this paper:

• D. Petz, Characterization of the relative entropy of states of matrix algebras, *Acta Math. Hungar. * **59** (1992), 449–455.

Here Petz sought to characterize relative entropy both in the ‘classical’ case we are concerned with and in the more general ‘quantum’ setting. Our original goal was merely to express his results in a more category-theoretic framework! Unfortunately Petz’s proof contained a significant flaw. Tobias noticed this and spent a lot of time fixing it, with no help from me.

Our paper is now self-contained, and considerably longer. My job now is to polish it up and make it pretty. What follows is the introduction, which should explain the basic ideas.

### A Bayesian characterization of relative entropy

This paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or ‘Kullback-Leibler divergence’. Whenever we have two probability distributions and on the same finite set we can define the entropy of relative to :

Here we set

equal to when unless is also zero, in which case we set it equal to 0. Relative entropy thus takes values in

Intuitively speaking, is the expected amount of information gained when we discover the probability distribution is really when we had thought it was We should think of as a ‘prior’ and as a ‘posterior’. When we take to be the uniform distribution on relative entropy reduces to the ordinary Shannon entropy, up to an additive constant. The advantage of relative entropy is that it makes the role of the prior explicit.

Since Bayesian probability theory emphasizes the role of the prior, relative entropy naturally lends itself to a Bayesian interpretation: it measures how much information we gain *given a certain prior*. Our goal here is to make this precise in a mathematical characterization of relative entropy. We do this using a category where:

• an object consists of a finite set and a probability distribution on that set;

• a morphism consists of a measure-preserving function from to together with a probability distribution on $X$ for each element with the property that unless

We can think of an object of as a system with some finite set of **states** together with a probability distribution on its states. A morphism

then consists of two parts. First, there is a deterministic **measurement process** mapping states of the system being measured, to states of the measurement apparatus, The condition that be measure-preserving says that, after the measurement, the probability that the apparatus be in any state is the sum of the probabilities of all states of leading to that outcome:

Second, there is a **hypothesis** : an assumption about the probability that the system being measured is in the state given any measurement outcome

Suppose we have any morphism

in From this we obtain two probability distributions on the states of the system being measured. First, we have the probability distribution given by

This is our **prior**, given our hypothesis and the probability distribution of measurement results. Second we have the ‘true’ probability distribution which would be the **posterior** if we updated our prior using complete direct knowledge of the system being measured.

It follows that any morphism in has a relative entropy associated to it. This is the expected amount of information we gain when we update our prior to the posterior

In fact, this way of assigning relative entropies to morphisms defines a functor

where we use to denote the category with one object, the numbers as morphisms, and addition as composition. More precisely, if

is any morphism in we define

where the prior is defined as in the equation (1).

The fact that is a functor is nontrivial and rather interesting. It says that given any composable pair of measurement processes:

the relative entropy of their composite is the sum of the relative entropies of the two parts:

We prove that is a functor. However, we go much further: we *characterize* relative entropy by saying that up to a constant multiple, is the *unique* functor from to obeying three reasonable conditions.

The first condition is that vanishes on morphisms where the hypothesis is **optimal**. By this, we mean that Equation (1) gives a prior equal to the ‘true’ probability distribution on the states of the system being measured.

The second condition is that is lower semicontinuous. The set of probability distibutions on a finite set naturally has the topology of an -simplex when has elements. The set has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

The problem is that

and we define to be when and but when So, it turns out that is only **lower semicontinuous**, meaning that if are sequences of probability distributions on with and then

We give the set of morphisms in its most obvious topology, and show that with this topology, maps morphisms to morphisms in a lower semicontinuous way.

The third condition is that is convex linear. We describe how to take convex linear combinations of morphisms in and then the functor is convex linear in the sense that it maps any convex linear combination of morphisms in to the corresponding convex linear combination of numbers in Intuitively, this means that if we take a coin with probability of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is times the expected information gain of the first process plus times the expected information gain of the second process.

Here, then, is our main theorem:

**Theorem.** Any lower semicontinuous, convex-linear functor

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant such that

for any any morphism in

### Remarks

If you’re a maniacally thorough reader of this blog, with a photographic memory, you’ll recall that our theorem now says ‘lower semicontinuous’, where in Part 2 of this series I’d originally said ‘continuous’.

I’ve fixed that blog article now… but it was Tobias who noticed this mistake. In the process of fixing our proof to address this issue, he eventually noticed that the proof of Petz’s theorem, which we’d been planning to use in our work, was also flawed.

Now I just need to finish polishing the rest of the paper!

As long as you’re giving aliases, include “cross-entropy”.

Actually it looks like cross-entropy is different:

rather than

What do the new results tell us about Probability Learning experiments?

I don’t know! Not much unless someone understands our results, knows about those experiments, and works hard to make a connection. No such person exists yet.

Typo: missing period after “given a certain prior”.

Thanks, fixed! Happy forthcoming New Year!

It seems to me that you doing something like rephrasing Shannon’s original characterization of entropy in category theory terms. Is that fair, or am I totally off base?

There are lots of characterizations of entropy and relative entropy. The original Shannon–Khinchine characterization of Shannon entropy includes a condition that entropy is maximized when all outcomes are equally likely. This seems awkward to formulate in category-theoretic terms. It’s easier to use characterizations that involve only

equationstogether with acontinuityrequirement.A guy named Faddeev characterized entropy using only equations and a continuity requirement. In a previous paper written with a mutual friend, Tobias and I reformulated Faddeev’s characterization in category-theoretic terms:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.

In our new paper we started by trying to phrase Petz’s characterization of relative entropy in category-theoretic terms. Petz’s theorem turned out to be flawed, so Tobias had to fix it. But we still owe our inspiration to him, and

we got a characterization involving only equations and a lower semicontinuity requirement. (For what it’s worth, lower semicontinuity can be thought of as continuity with respect to a funny topology on the real line.)

For a comparison of Faddeev’s characterization of entropy and the Shannon–Khinchine characterization, try this:

• Shigeru Furuichi, On uniqueness theorems for Tsallis entropy and Tsallis relative entropy.

This covers something more general called ‘Tsallis entropy’, but if you take the limit you get back to Shannon entropy.

Our paper with Tom Leinster covered the more general Tsallis case, but now that we’re doing relative entropy we’re having enough trouble just handling the Shannon case!

Thanks! (Is the weird topology you’re referring to Scott topology or upper topology or…?)

It’s the upper topology, where the open sets are of the form .

I said ‘real line’, but that wasn’t quite right: relative entropy can be infinite, so we should really think of it as taking values in , and a function from a topological space taking values in is lower semicontinuous iff it’s continuous when is given the upper topology.

Tobias Fritz, a postdoc at the Perimeter Institute, is working with me on category-theoretic aspects of information theory. We published a paper on entropy with Tom Leinster, and we’ve got a followup on relative entropy that’s almost done. I should be working on it right this instant! But for now, read the series of posts here on Azimuth: Relative Entropy Part 1, Part 2 and Part 3. […]

Now Tobias Fritz and I have finally finished our paper on this subject:

• A Bayesian characterization of relative entropy.

Two comments: first, relative entropy does not reduce to ordinary entropy with a uniform prior, even up to a constant. It reduces to the negative of ordinary entropy up to a constant. (This is why I’m tempted to call “relative entropy” the negative of what you wrote.)

Second, it doesn’t follow from the definition that relative entropy is nonnegative. The problem is that you can’t a priori control the signs of the logarithms. It’s a nontrivial fact that this is true.

Qiaochu wrote:

Yeah, the missing minus sign annoys me too. Lots of people call this quantity relative entropy, but in a later paper I wrote:

Most of our results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We shall use the first term. Given two probability distributions and on a finite set their

relative information, or more precisely theinformation of relative toisWe use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that decreases with time under various conditions. The reason is that the Shannon entropy

contains a minus sign that is missing from the definition of relative information.

Qiaochu wrote:

I’m not sure what you’re asserting is true.

However, when and are probability distributions on a finite set, I believe that

though it may be infinite: we define but when , to handle cases involving the logarithm of zero.

I believe the same sort of inequality holds whenever and are probability measures and is absolutely continuous with respect to If you have a counterexample, I’d like to see it!

You might like the argument on page 16 here: it gives non-negativity for a generalization of relative information to ‘non-normalized probability distributions’ on finite sets.