## Relative Entropy (Part 3)

Holidays are great. There’s nothing I need to do! Everybody is celebrating! So, I can finally get some real work done.

In the last couple of days I’ve finished a paper with Jamie Vicary on wormholes and entanglement… subject to his approval and corrections. More on that later. And now I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper. But then Tobias Fritz noticed a big problem. Our characterization of relative entropy was inspired by this paper:

• D. Petz, Characterization of the relative entropy of states of matrix algebras, Acta Math. Hungar. 59 (1992), 449–455.

Here Petz sought to characterize relative entropy both in the ‘classical’ case we are concerned with and in the more general ‘quantum’ setting. Our original goal was merely to express his results in a more category-theoretic framework! Unfortunately Petz’s proof contained a significant flaw. Tobias noticed this and spent a lot of time fixing it, with no help from me.

Our paper is now self-contained, and considerably longer. My job now is to polish it up and make it pretty. What follows is the introduction, which should explain the basic ideas.

### A Bayesian characterization of relative entropy

This paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or ‘Kullback-Leibler divergence’. Whenever we have two probability distributions $p$ and $q$ on the same finite set $X,$ we can define the entropy of $q$ relative to $p$:

$\displaystyle{ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) }$

Here we set

$\displaystyle{ q_x \ln\left( \frac{q_x}{p_x} \right) }$

equal to $\infty$ when $p_x = 0,$ unless $q_x$ is also zero, in which case we set it equal to 0. Relative entropy thus takes values in $[0,\infty].$

Intuitively speaking, $S(q,p)$ is the expected amount of information gained when we discover the probability distribution is really $q,$ when we had thought it was $p.$ We should think of $p$ as a ‘prior’ and $q$ as a ‘posterior’. When we take $p$ to be the uniform distribution on $X,$ relative entropy reduces to the ordinary Shannon entropy, up to an additive constant. The advantage of relative entropy is that it makes the role of the prior explicit.

Since Bayesian probability theory emphasizes the role of the prior, relative entropy naturally lends itself to a Bayesian interpretation: it measures how much information we gain given a certain prior. Our goal here is to make this precise in a mathematical characterization of relative entropy. We do this using a category $\mathrm{FinStat}$ where:

• an object $(X,q)$ consists of a finite set $X$ and a probability distribution $x \mapsto q_x$ on that set;

• a morphism $(f,s) : (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y,$ together with a probability distribution $x \mapsto s_{x y}$ on $X$ for each element $y \in Y,$ with the property that $s_{xy} = 0$ unless $f(x) = y.$

We can think of an object of $\mathrm{FinStat}$ as a system with some finite set of states together with a probability distribution on its states. A morphism

$(f,s) : (X,q) \to (Y,r)$

then consists of two parts. First, there is a deterministic measurement process $f : X \to Y$ mapping states of the system being measured, $X,$ to states of the measurement apparatus, $Y.$ The condition that $f$ be measure-preserving says that, after the measurement, the probability that the apparatus be in any state $y \in Y$ is the sum of the probabilities of all states of $X$ leading to that outcome:

$\displaystyle{ r_y = \sum_{x \in f^{-1}(y)} q_x }$

Second, there is a hypothesis $s$: an assumption about the probability $s_{xy}$ that the system being measured is in the state $x \in X$ given any measurement outcome $y \in Y.$

Suppose we have any morphism

$(f,s) : (X,q) \to (Y,r)$

in $\mathrm{FinStat}.$ From this we obtain two probability distributions on the states of the system being measured. First, we have the probability distribution $p$ given by

$\displaystyle{ p_x = \sum_{y \in Y} s_{xy} r_y } \qquad \qquad (1)$

This is our prior, given our hypothesis and the probability distribution of measurement results. Second we have the ‘true’ probability distribution $q,$ which would be the posterior if we updated our prior using complete direct knowledge of the system being measured.

It follows that any morphism in $\mathrm{FinStat}$ has a relative entropy $S(q,p)$ associated to it. This is the expected amount of information we gain when we update our prior $p$ to the posterior $q.$

In fact, this way of assigning relative entropies to morphisms defines a functor

$F_0 : \mathrm{FinStat} \to [0,\infty]$

where we use $[0,\infty]$ to denote the category with one object, the numbers $0 \le x \le \infty$ as morphisms, and addition as composition. More precisely, if

$(f,s) : (X,q) \to (Y,r)$

is any morphism in $\mathrm{FinStat},$ we define

$F_0(f,s) = S(q,p)$

where the prior $p$ is defined as in the equation (1).

The fact that $F_0$ is a functor is nontrivial and rather interesting. It says that given any composable pair of measurement processes:

$(X,q) \stackrel{(f,s)}{\longrightarrow} (Y,r) \stackrel{(g,t)}{\longrightarrow} (Z,u)$

the relative entropy of their composite is the sum of the relative entropies of the two parts:

$F_0((g,t) \circ (f,s)) = F_0(g,t) + F_0(f,s) .$

We prove that $F_0$ is a functor. However, we go much further: we characterize relative entropy by saying that up to a constant multiple, $F_0$ is the unique functor from $\mathrm{FinStat}$ to $[0,\infty]$ obeying three reasonable conditions.

The first condition is that $F_0$ vanishes on morphisms $(f,s) : (X,q) \to (Y,r)$ where the hypothesis $s$ is optimal. By this, we mean that Equation (1) gives a prior $p$ equal to the ‘true’ probability distribution $q$ on the states of the system being measured.

The second condition is that $F_0$ is lower semicontinuous. The set $P(X)$ of probability distibutions on a finite set $X$ naturally has the topology of an $(n-1)$-simplex when $X$ has $n$ elements. The set $[0,\infty]$ has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

$\begin{array}{rcl} S : P(X) \times P(X) &\to& [0,\infty] \\ (q,p) &\mapsto & S(q,p) . \end{array}$

The problem is that

$\displaystyle{ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) }$

and we define $q_x \ln(q_x/p_x)$ to be $\infty$ when $p_x = 0$ and $q_x > 0$ but $0$ when $p_x = q_x = 0.$ So, it turns out that $S$ is only lower semicontinuous, meaning that if $p^i , q^i$ are sequences of probability distributions on $X$ with $p^i \to p$ and $q^i \to q$ then

$S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i)$

We give the set of morphisms in $\mathrm{FinStat}$ its most obvious topology, and show that with this topology, $F_0$ maps morphisms to morphisms in a lower semicontinuous way.

The third condition is that $F_0$ is convex linear. We describe how to take convex linear combinations of morphisms in $\mathrm{FinStat},$ and then the functor $F_0$ is convex linear in the sense that it maps any convex linear combination of morphisms in $\mathrm{FinStat}$ to the corresponding convex linear combination of numbers in $[0,\infty].$ Intuitively, this means that if we take a coin with probability $P$ of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is $P$ times the expected information gain of the first process plus $1-P$ times the expected information gain of the second process.

Here, then, is our main theorem:

Theorem. Any lower semicontinuous, convex-linear functor

$F : \mathrm{FinStat} \to [0,\infty]$

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant $c \in [0,\infty]$ such that

$F(f,s) = c F_0(f,s)$

for any any morphism $(f,s) : (X,p) \to (Y,q)$ in $\mathrm{FinStat}.$

If you’re a maniacally thorough reader of this blog, with a photographic memory, you’ll recall that our theorem now says ‘lower semicontinuous’, where in Part 2 of this series I’d originally said ‘continuous’.

I’ve fixed that blog article now… but it was Tobias who noticed this mistake. In the process of fixing our proof to address this issue, he eventually noticed that the proof of Petz’s theorem, which we’d been planning to use in our work, was also flawed.

Here’s the paper we eventually wrote:

• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy, Theory and Applications of Categories 29 (2014), 421–456.

And here’s my whole series of blog articles about it:

Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

Relative Entropy (Part 2): a category related to statistical inference, $\mathrm{FinStat},$ and how relative entropy defines a functor on this category.

Relative Entropy (Part 3): how to characterize relative entropy as a functor from $\mathrm{FinStat}$ to $[0,\infty].$

Relative Entropy (Part 4): wrap-up, and an invitation to read more about the underlying math at the n-Category Café.

### 15 Responses to Relative Entropy (Part 3)

1. As long as you’re giving aliases, include “cross-entropy”.

• John Baez says:

Actually it looks like cross-entropy is different:

$H(q,p) = - \displaystyle{ \sum_{x \in X} q_x \ln p_x }$

rather than

$S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right)$

• lee bloomquist says:

What do the new results tell us about Probability Learning experiments?

• John Baez says:

I don’t know! Not much unless someone understands our results, knows about those experiments, and works hard to make a connection. No such person exists yet.

2. Blake Stacey says:

Typo: missing period after “given a certain prior”.

3. It seems to me that you doing something like rephrasing Shannon’s original characterization of entropy in category theory terms. Is that fair, or am I totally off base?

• John Baez says:

There are lots of characterizations of entropy and relative entropy. The original Shannon–Khinchine characterization of Shannon entropy includes a condition that entropy is maximized when all outcomes are equally likely. This seems awkward to formulate in category-theoretic terms. It’s easier to use characterizations that involve only equations together with a continuity requirement.

A guy named Faddeev characterized entropy using only equations and a continuity requirement. In a previous paper written with a mutual friend, Tobias and I reformulated Faddeev’s characterization in category-theoretic terms:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.

In our new paper we started by trying to phrase Petz’s characterization of relative entropy in category-theoretic terms. Petz’s theorem turned out to be flawed, so Tobias had to fix it. But we still owe our inspiration to him, and
we got a characterization involving only equations and a lower semicontinuity requirement. (For what it’s worth, lower semicontinuity can be thought of as continuity with respect to a funny topology on the real line.)

For a comparison of Faddeev’s characterization of entropy and the Shannon–Khinchine characterization, try this:

• Shigeru Furuichi, On uniqueness theorems for Tsallis entropy and Tsallis relative entropy.

This covers something more general called ‘Tsallis entropy’, but if you take the $q \to 1$ limit you get back to Shannon entropy.

Our paper with Tom Leinster covered the more general Tsallis case, but now that we’re doing relative entropy we’re having enough trouble just handling the Shannon case!

• Thanks! (Is the weird topology you’re referring to Scott topology or upper topology or…?)

• John Baez says:

It’s the upper topology, where the open sets are of the form $(a,\infty]$.

I said ‘real line’, but that wasn’t quite right: relative entropy can be infinite, so we should really think of it as taking values in $[0,\infty]$, and a function from a topological space taking values in $[0,\infty]$ is lower semicontinuous iff it’s continuous when $[0,\infty]$ is given the upper topology.

4. Tobias Fritz, a postdoc at the Perimeter Institute, is working with me on category-theoretic aspects of information theory. We published a paper on entropy with Tom Leinster, and we’ve got a followup on relative entropy that’s almost done. I should be working on it right this instant! But for now, read the series of posts here on Azimuth: Relative Entropy Part 1, Part 2 and Part 3. […]

5. Now Tobias Fritz and I have finally finished our paper on this subject:

6. Qiaochu Yuan says:

Two comments: first, relative entropy does not reduce to ordinary entropy with a uniform prior, even up to a constant. It reduces to the negative of ordinary entropy up to a constant. (This is why I’m tempted to call “relative entropy” the negative of what you wrote.)

Second, it doesn’t follow from the definition that relative entropy is nonnegative. The problem is that you can’t a priori control the signs of the logarithms. It’s a nontrivial fact that this is true.

• John Baez says:

Qiaochu wrote:

[…] relative entropy does not reduce to ordinary entropy with a uniform prior, even up to a constant. It reduces to the negative of ordinary entropy up to a constant. (This is why I’m tempted to call “relative entropy” the negative of what you wrote.)

Yeah, the missing minus sign annoys me too. Lots of people call this quantity relative entropy, but in a later paper I wrote:

Most of our results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We shall use the first term. Given two probability distributions $p$ and $q$ on a finite set $X,$ their relative information, or more precisely the information of $p$ relative to $q,$ is

$\displaystyle{ I(p,q) = \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right) . }$

We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that $I(p,q)$ decreases with time under various conditions. The reason is that the Shannon entropy

$\displaystyle{ S(p) = -\sum_{i \in X} p_i \ln p_i }$

contains a minus sign that is missing from the definition of relative information.

• John Baez says:

Qiaochu wrote:

Second, it doesn’t follow from the definition that relative entropy is nonnegative. The problem is that you can’t a priori control the signs of the logarithms. It’s a nontrivial fact that this is true.

I’m not sure what you’re asserting is true.

However, when $p$ and $q$ are probability distributions on a finite set, I believe that

$\displaystyle{ \sum_{i \in X} p_i \ln\left(\frac{p_i}{q_i}\right) \ge 0 }$

though it may be infinite: we define $0 \ln 0 = 0$ but $p \ln 0 = +\infty$ when $p > 0$, to handle cases involving the logarithm of zero.

I believe the same sort of inequality holds whenever $p$ and $q$ are probability measures and $p$ is absolutely continuous with respect to $q.$ If you have a counterexample, I’d like to see it!

You might like the argument on page 16 here: it gives non-negativity for a generalization of relative information to ‘non-normalized probability distributions’ on finite sets.

This site uses Akismet to reduce spam. Learn how your comment data is processed.