## Relative Entropy (Part 3)

Holidays are great. There’s nothing I need to do! Everybody is celebrating! So, I can finally get some real work done.

In the last couple of days I’ve finished a paper with Jamie Vicary on wormholes and entanglement… subject to his approval and corrections. More on that later. And now I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper:

Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.

Relative Entropy (Part 2): a category related to statistical inference, $\mathrm{FinStat},$ and how relative entropy defines a functor on this category.

But then Tobias Fritz noticed a big problem. Our characterization of relative entropy was inspired by this paper:

• D. Petz, Characterization of the relative entropy of states of matrix algebras, Acta Math. Hungar. 59 (1992), 449–455.

Here Petz sought to characterize relative entropy both in the ‘classical’ case we are concerned with and in the more general ‘quantum’ setting. Our original goal was merely to express his results in a more category-theoretic framework! Unfortunately Petz’s proof contained a significant flaw. Tobias noticed this and spent a lot of time fixing it, with no help from me.

Our paper is now self-contained, and considerably longer. My job now is to polish it up and make it pretty. What follows is the introduction, which should explain the basic ideas.

### A Bayesian characterization of relative entropy

This paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or ‘Kullback-Leibler divergence’. Whenever we have two probability distributions $p$ and $q$ on the same finite set $X,$ we can define the entropy of $q$ relative to $p$:

$S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right)$

Here we set

$q_x \ln\left( \frac{q_x}{p_x} \right)$

equal to $\infty$ when $p_x = 0,$ unless $q_x$ is also zero, in which case we set it equal to 0. Relative entropy thus takes values in $[0,\infty].$
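These conventions are easy to get wrong in practice, so here is a minimal Python sketch of the formula (the function name and example distributions are my own, not from the paper):

```python
import math

def relative_entropy(q, p):
    """S(q, p) = sum_x q_x ln(q_x / p_x), taking values in [0, inf]."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue          # convention: 0 ln(0/p) = 0, even when p = 0
        if px == 0:
            return math.inf   # convention: q ln(q/0) = inf when q > 0
        total += qx * math.log(qx / px)
    return total
```

For example, `relative_entropy([0.5, 0.5], [0.5, 0.5])` is `0`, while `relative_entropy([1.0, 0.0], [0.5, 0.5])` is `ln 2`: discovering that a fair-looking coin always lands heads gains you one bit.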

Intuitively speaking, $S(q,p)$ is the expected amount of information gained when we discover the probability distribution is really $q,$ when we had thought it was $p.$ We should think of $p$ as a ‘prior’ and $q$ as a ‘posterior’. When we take $p$ to be the uniform distribution on $X,$ relative entropy reduces to the ordinary Shannon entropy, up to an additive constant. The advantage of relative entropy is that it makes the role of the prior explicit.

Since Bayesian probability theory emphasizes the role of the prior, relative entropy naturally lends itself to a Bayesian interpretation: it measures how much information we gain given a certain prior. Our goal here is to make this precise in a mathematical characterization of relative entropy. We do this using a category $\mathrm{FinStat}$ where:

• an object $(X,q)$ consists of a finite set $X$ and a probability distribution $x \mapsto q_x$ on that set;

• a morphism $(f,s) : (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y,$ together with a probability distribution $x \mapsto s_{xy}$ on $X$ for each element $y \in Y,$ with the property that $s_{xy} = 0$ unless $f(x) = y.$

We can think of an object of $\mathrm{FinStat}$ as a system with some finite set of states together with a probability distribution on its states. A morphism

$(f,s) : (X,q) \to (Y,r)$

then consists of two parts. First, there is a deterministic measurement process $f : X \to Y$ mapping states of the system being measured, $X,$ to states of the measurement apparatus, $Y.$ The condition that $f$ be measure-preserving says that, after the measurement, the probability that the apparatus be in any state $y \in Y$ is the sum of the probabilities of all states of $X$ leading to that outcome:

$\displaystyle{ r_y = \sum_{x \in f^{-1}(y)} q_x }$

Second, there is a hypothesis $s$: an assumption about the probability $s_{xy}$ that the system being measured is in the state $x \in X$ given any measurement outcome $y \in Y.$
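To make the two conditions on a morphism concrete, here is a Python sketch of a validity check. The encoding is my own: $X$ and $Y$ are ranges of indices, `f` is a list with `f[x]` in $Y,$ and `s[y]` is the hypothesis distribution on $X$ given outcome $y$:

```python
def is_finstat_morphism(f, s, q, r, tol=1e-9):
    """Check that (f, s): (X, q) -> (Y, r) is a morphism in FinStat."""
    X, Y = range(len(q)), range(len(r))
    # f must be measure-preserving: r_y = sum of q_x over the fiber f^{-1}(y)
    for y in Y:
        if abs(r[y] - sum(q[x] for x in X if f[x] == y)) > tol:
            return False
    # each s[y] must be a probability distribution supported on the fiber over y
    for y in Y:
        if abs(sum(s[y]) - 1) > tol:
            return False
        if any(s[y][x] > tol for x in X if f[x] != y):
            return False
    return True
```

For instance, with $X = \{0,1,2\},$ $Y = \{0,1\},$ `f = [0, 0, 1]`, `q = [0.2, 0.3, 0.5]`, `r = [0.5, 0.5]` and `s = [[0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]`, both conditions hold.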

Suppose we have any morphism

$(f,s) : (X,q) \to (Y,r)$

in $\mathrm{FinStat}.$ From this we obtain two probability distributions on the states of the system being measured. First, we have the probability distribution $p$ given by

$\displaystyle{ p_x = \sum_{y \in Y} s_{xy} r_y } \qquad \qquad (1)$

This is our prior, given our hypothesis and the probability distribution of measurement results. Second we have the ‘true’ probability distribution $q,$ which would be the posterior if we updated our prior using complete direct knowledge of the system being measured.

It follows that any morphism in $\mathrm{FinStat}$ has a relative entropy $S(q,p)$ associated to it. This is the expected amount of information we gain when we update our prior $p$ to the posterior $q.$
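Putting Equation (1) together with the formula for relative entropy, a morphism's relative entropy takes only a few lines to compute. A Python sketch, with illustrative numbers of my own (again `s[y]` is the hypothesis distribution on $X$ given outcome $y$):

```python
import math

def prior(s, r):
    """Equation (1): p_x = sum_y s_{xy} r_y."""
    return [sum(s[y][x] * r[y] for y in range(len(r)))
            for x in range(len(s[0]))]

def F0(s, r, q):
    """Relative entropy S(q, p) of a morphism, with p from Equation (1)."""
    p = prior(s, r)
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue          # 0 ln(0/p) = 0
        if px == 0:
            return math.inf   # q ln(q/0) = inf when q > 0
        total += qx * math.log(qx / px)
    return total
```

With `r = [0.5, 0.5]`, `q = [0.2, 0.3, 0.5]` and the hypothesis `s = [[0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]`, the prior works out to `[0.2, 0.3, 0.5]`, equal to `q`, so no information is gained: the hypothesis is optimal in the sense discussed below. Tweaking the hypothesis to `[[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]` gives a strictly positive relative entropy.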

In fact, this way of assigning relative entropies to morphisms defines a functor

$F_0 : \mathrm{FinStat} \to [0,\infty]$

where we use $[0,\infty]$ to denote the category with one object, the numbers $0 \le x \le \infty$ as morphisms, and addition as composition. More precisely, if

$(f,s) : (X,q) \to (Y,r)$

is any morphism in $\mathrm{FinStat},$ we define

$F_0(f,s) = S(q,p)$

where the prior $p$ is defined as in Equation (1).

The fact that $F_0$ is a functor is nontrivial and rather interesting. It says that given any composable pair of measurement processes:

$(X,q) \stackrel{(f,s)}{\longrightarrow} (Y,r) \stackrel{(g,t)}{\longrightarrow} (Z,u)$

the relative entropy of their composite is the sum of the relative entropies of the two parts:

$F_0((g,t) \circ (f,s)) = F_0(g,t) + F_0(f,s) .$
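The additivity above can be checked numerically on a small example. In the sketch below (my own numbers and encoding, with $Z$ a one-point set), the composite hypothesis is $(t \circ s)_{xz} = \sum_y s_{xy} t_{yz},$ and the relative entropies indeed add:

```python
import math

def S(q, p):
    """Relative entropy with the usual 0 and infinity conventions."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue
        if px == 0:
            return math.inf
        total += qx * math.log(qx / px)
    return total

# (f, s): (X, q) -> (Y, r) and (g, t): (Y, r) -> (Z, u), with Z a one-point set
q, r, u = [0.2, 0.3, 0.5], [0.5, 0.5], [1.0]
s = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]   # hypothesis on X for each y in Y
t = [[0.4, 0.6]]                          # hypothesis on Y for each z in Z

# priors from Equation (1)
p  = [sum(s[y][x] * r[y] for y in range(2)) for x in range(3)]
p2 = [sum(t[z][y] * u[z] for z in range(1)) for y in range(2)]

# composite hypothesis and its prior
c  = [[sum(s[y][x] * t[z][y] for y in range(2)) for x in range(3)]
      for z in range(1)]
pc = [sum(c[z][x] * u[z] for z in range(1)) for x in range(3)]

# functoriality: relative entropies add along composition
assert abs(S(q, pc) - (S(q, p) + S(r, p2))) < 1e-12
```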

We prove that $F_0$ is a functor. However, we go much further: we characterize relative entropy by saying that up to a constant multiple, $F_0$ is the unique functor from $\mathrm{FinStat}$ to $[0,\infty]$ obeying three reasonable conditions.

The first condition is that $F_0$ vanishes on morphisms $(f,s) : (X,q) \to (Y,r)$ where the hypothesis $s$ is optimal. By this, we mean that Equation (1) gives a prior $p$ equal to the ‘true’ probability distribution $q$ on the states of the system being measured.

The second condition is that $F_0$ is lower semicontinuous. The set $P(X)$ of probability distributions on a finite set $X$ naturally has the topology of an $(n-1)$-simplex when $X$ has $n$ elements. The set $[0,\infty]$ has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

$\begin{array}{rcl} S : P(X) \times P(X) &\to& [0,\infty] \\ (q,p) &\mapsto & S(q,p) . \end{array}$

The problem is that

$\displaystyle{ S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right) }$

and we define $q_x \ln(q_x/p_x)$ to be $\infty$ when $p_x = 0$ and $q_x > 0$ but $0$ when $p_x = q_x = 0.$ So, it turns out that $S$ is only lower semicontinuous, meaning that if $p^i , q^i$ are sequences of probability distributions on $X$ with $p^i \to p$ and $q^i \to q$ then

$S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i)$
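Here is a concrete failure of continuity: take $q^i = (1/i, 1 - 1/i) \to q = (0,1)$ while $p = (0,1)$ stays fixed. Each $S(q^i, p)$ is infinite, since $q^i$ puts weight where $p$ puts none, yet $S(q,p) = 0$ at the limit, so only the lower-semicontinuity inequality survives. A quick Python check (names mine):

```python
import math

def S(q, p):
    """Relative entropy with the usual 0 and infinity conventions."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue
        if px == 0:
            return math.inf
        total += qx * math.log(qx / px)
    return total

p = [0.0, 1.0]
# q^i -> q = (0, 1), but every S(q^i, p) is infinite while S(q, p) = 0:
# continuity fails, lower semicontinuity S(q,p) <= liminf holds
values = [S([1.0 / i, 1.0 - 1.0 / i], p) for i in range(1, 6)]
assert all(v == math.inf for v in values)
assert S([0.0, 1.0], p) == 0.0
```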

We give the set of morphisms in $\mathrm{FinStat}$ its most obvious topology, and show that with this topology, $F_0$ maps morphisms to morphisms in a lower semicontinuous way.

The third condition is that $F_0$ is convex linear. We describe how to take convex linear combinations of morphisms in $\mathrm{FinStat},$ and then the functor $F_0$ is convex linear in the sense that it maps any convex linear combination of morphisms in $\mathrm{FinStat}$ to the corresponding convex linear combination of numbers in $[0,\infty].$ Intuitively, this means that if we take a coin with probability $P$ of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is $P$ times the expected information gain of the first process plus $1-P$ times the expected information gain of the second process.
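For the classical formula this identity is easy to check directly: a convex combination puts the two systems side by side on a disjoint union, scaling both the posterior and the prior by the same weight, and the weight cancels inside the logarithm. A Python sketch with made-up numbers (here I simply treat each $p$ as the prior arising from some morphism):

```python
import math

def S(q, p):
    """Relative entropy with the usual 0 and infinity conventions."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue
        if px == 0:
            return math.inf
        total += qx * math.log(qx / px)
    return total

# two (posterior, prior) pairs and a coin with probability lam of heads
q1, p1 = [0.3, 0.7], [0.5, 0.5]
q2, p2 = [0.9, 0.1], [0.5, 0.5]
lam = 0.25

# convex combination: disjoint union of the two systems, weighted by lam
qc = [lam * x for x in q1] + [(1 - lam) * x for x in q2]
pc = [lam * x for x in p1] + [(1 - lam) * x for x in p2]

# convex linearity: the lam's cancel inside the logarithms
assert abs(S(qc, pc) - (lam * S(q1, p1) + (1 - lam) * S(q2, p2))) < 1e-12
```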

Here, then, is our main theorem:

Theorem. Any lower semicontinuous, convex-linear functor

$F : \mathrm{FinStat} \to [0,\infty]$

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant $c \in [0,\infty]$ such that

$F(f,s) = c F_0(f,s)$

for every morphism $(f,s) : (X,q) \to (Y,r)$ in $\mathrm{FinStat}.$

### Remarks

If you’re a maniacally thorough reader of this blog, with a photographic memory, you’ll recall that our theorem now says ‘lower semicontinuous’, where in Part 2 of this series I’d originally said ‘continuous’.

I’ve fixed that blog article now… but it was Tobias who noticed this mistake. In the process of fixing our proof to address this issue, he eventually noticed that the proof of Petz’s theorem, which we’d been planning to use in our work, was also flawed.

Now I just need to finish polishing the rest of the paper!

### 12 Responses to Relative Entropy (Part 3)

1. As long as you’re giving aliases, include “cross-entropy”.

• John Baez says:

Actually it looks like cross-entropy is different:

$H(q,p) = - \displaystyle{ \sum_{x \in X} q_x \ln p_x }$

rather than

$S(q,p) = \sum_{x\in X} q_x \ln\left( \frac{q_x}{p_x} \right)$

• lee bloomquist says:

What do the new results tell us about Probability Learning experiments?

• John Baez says:

I don’t know! Not much unless someone understands our results, knows about those experiments, and works hard to make a connection. No such person exists yet.

2. Blake Stacey says:

Typo: missing period after “given a certain prior”.

3. It seems to me that you’re doing something like rephrasing Shannon’s original characterization of entropy in category theory terms. Is that fair, or am I totally off base?

• John Baez says:

There are lots of characterizations of entropy and relative entropy. The original Shannon–Khinchine characterization of Shannon entropy includes a condition that entropy is maximized when all outcomes are equally likely. This seems awkward to formulate in category-theoretic terms. It’s easier to use characterizations that involve only equations together with a continuity requirement.

A guy named Faddeev characterized entropy using only equations and a continuity requirement. In a previous paper written with a mutual friend, Tobias and I reformulated Faddeev’s characterization in category-theoretic terms:

• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.

In our new paper we started by trying to phrase Petz’s characterization of relative entropy in category-theoretic terms. Petz’s theorem turned out to be flawed, so Tobias had to fix it. But we still owe our inspiration to him, and we got a characterization involving only equations and a lower semicontinuity requirement. (For what it’s worth, lower semicontinuity can be thought of as continuity with respect to a funny topology on the real line.)

For a comparison of Faddeev’s characterization of entropy and the Shannon–Khinchine characterization, try this:

• Shigeru Furuichi, On uniqueness theorems for Tsallis entropy and Tsallis relative entropy.

This covers something more general called ‘Tsallis entropy’, but if you take the $q \to 1$ limit you get back to Shannon entropy.

Our paper with Tom Leinster covered the more general Tsallis case, but now that we’re doing relative entropy we’re having enough trouble just handling the Shannon case!

• Thanks! (Is the weird topology you’re referring to Scott topology or upper topology or…?)

• John Baez says:

It’s the upper topology, where the open sets are of the form $(a,\infty]$.

I said ‘real line’, but that wasn’t quite right: relative entropy can be infinite, so we should really think of it as taking values in $[0,\infty]$, and a function from a topological space taking values in $[0,\infty]$ is lower semicontinuous iff it’s continuous when $[0,\infty]$ is given the upper topology.

4. Tobias Fritz, a postdoc at the Perimeter Institute, is working with me on category-theoretic aspects of information theory. We published a paper on entropy with Tom Leinster, and we’ve got a followup on relative entropy that’s almost done. I should be working on it right this instant! But for now, read the series of posts here on Azimuth: Relative Entropy Part 1, Part 2 and Part 3. […]

5. Now Tobias Fritz and I have finally finished our paper on this subject: