Holidays are great. There’s nothing I need to do! Everybody is celebrating! So, I can finally get some real work done.
In the last couple of days I’ve finished a paper with Jamie Vicary on wormholes and entanglement… subject to his approval and corrections. More on that later. And now I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper. But then Tobias Fritz noticed a big problem. Our characterization of relative entropy was inspired by this paper:
• D. Petz, Characterization of the relative entropy of states of matrix algebras, Acta Math. Hungar. 59 (1992), 449–455.
Here Petz sought to characterize relative entropy both in the ‘classical’ case we are concerned with and in the more general ‘quantum’ setting. Our original goal was merely to express his results in a more category-theoretic framework! Unfortunately Petz’s proof contained a significant flaw. Tobias noticed this and spent a lot of time fixing it, with no help from me.
Our paper is now self-contained, and considerably longer. My job now is to polish it up and make it pretty. What follows is the introduction, which should explain the basic ideas.
A Bayesian characterization of relative entropy
This paper gives a new characterization of the concept of relative entropy, also known as 'relative information', 'information gain' or 'Kullback–Leibler divergence'. Whenever we have two probability distributions $p$ and $q$ on the same finite set $X$, we can define the entropy of $q$ relative to $p$:

$$S(q,p) = \sum_{x \in X} q_x \ln\left(\frac{q_x}{p_x}\right)$$

Here we set $q_x \ln\left(\frac{q_x}{p_x}\right)$ equal to $\infty$ when $p_x = 0$, unless $q_x$ is also zero, in which case we set it equal to 0. Relative entropy thus takes values in $[0,\infty]$.
Intuitively speaking, $S(q,p)$ is the expected amount of information gained when we discover the probability distribution is really $q$, when we had thought it was $p$. We should think of $p$ as a 'prior' and $q$ as a 'posterior'. When we take $p$ to be the uniform distribution on $X$, relative entropy reduces to the ordinary Shannon entropy, up to a sign and an additive constant. The advantage of relative entropy is that it makes the role of the prior explicit.
Since Bayesian probability theory emphasizes the role of the prior, relative entropy naturally lends itself to a Bayesian interpretation: it measures how much information we gain given a certain prior. Our goal here is to make this precise in a mathematical characterization of relative entropy. We do this using a category $\mathrm{FinStat}$ where:
• an object $(X,q)$ consists of a finite set $X$ and a probability distribution $q$ on that set;
• a morphism $(f,s) \colon (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y$, together with a probability distribution $x \mapsto s_{xy}$ on $X$ for each element $y \in Y$, with the property that $s_{xy} = 0$ unless $f(x) = y$.
We can think of an object of $\mathrm{FinStat}$ as a system with some finite set of states together with a probability distribution on its states. A morphism $(f,s) \colon (X,q) \to (Y,r)$ then consists of two parts. First, there is a deterministic measurement process $f \colon X \to Y$ mapping states of the system being measured, $X$, to states of the measurement apparatus, $Y$. The condition that $f$ be measure-preserving says that, after the measurement, the probability that the apparatus is in any state $y \in Y$ is the sum of the probabilities of all states of $X$ leading to that outcome:

$$r_y = \sum_{x \,:\, f(x) = y} q_x .$$

Second, there is a hypothesis $s$: an assumption about the probability $s_{xy}$ that the system being measured is in the state $x \in X$ given any measurement outcome $y \in Y$.
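To see what the two conditions on a morphism amount to in practice, here is a small sketch (my own toy example; the names and numbers are made up) that checks that a pair $(f,s)$ really is a morphism in $\mathrm{FinStat}$: $f$ must be measure-preserving, and for each outcome $y$ the hypothesis must be a probability distribution on $X$ supported on the fiber $f^{-1}(y)$:

```python
# A toy FinStat morphism (f, s) : (X, q) -> (Y, r).  All names and data
# here are made up for illustration.
X = ['x1', 'x2', 'x3']
Y = ['y1', 'y2']

q = {'x1': 0.5, 'x2': 0.3, 'x3': 0.2}          # distribution on X
r = {'y1': 0.8, 'y2': 0.2}                      # distribution on Y
f = {'x1': 'y1', 'x2': 'y1', 'x3': 'y2'}        # measurement process f : X -> Y

# Hypothesis s[y][x] = s_{xy}: a distribution on X for each outcome y,
# supported on the fiber f^{-1}(y).
s = {'y1': {'x1': 0.6, 'x2': 0.4, 'x3': 0.0},
     'y2': {'x1': 0.0, 'x2': 0.0, 'x3': 1.0}}

def is_finstat_morphism(X, Y, q, r, f, s, tol=1e-12):
    # f is measure-preserving: r_y = sum of q_x over the fiber f^{-1}(y)
    for y in Y:
        if abs(r[y] - sum(q[x] for x in X if f[x] == y)) > tol:
            return False
    # each s_{.y} is a probability distribution on X, vanishing off the fiber
    for y in Y:
        if abs(sum(s[y][x] for x in X) - 1) > tol:
            return False
        if any(s[y][x] > 0 and f[x] != y for x in X):
            return False
    return True

print(is_finstat_morphism(X, Y, q, r, f, s))    # True for this example
```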
Suppose we have any morphism $(f,s) \colon (X,q) \to (Y,r)$ in $\mathrm{FinStat}$. From this we obtain two probability distributions on the states of the system being measured. First, we have the probability distribution $p$ given by

$$p_x = \sum_{y \in Y} s_{xy} r_y . \qquad (1)$$

This is our prior, given our hypothesis and the probability distribution of measurement results. Second, we have the 'true' probability distribution $q$, which would be the posterior if we updated our prior using complete direct knowledge of the system being measured.

It follows that any morphism in $\mathrm{FinStat}$ has a relative entropy $S(q,p)$ associated to it. This is the expected amount of information we gain when we update our prior $p$ to the posterior $q$.

In fact, this way of assigning relative entropies to morphisms defines a functor

$$RE \colon \mathrm{FinStat} \to [0,\infty]$$

where we use $[0,\infty]$ to denote the category with one object, the numbers in $[0,\infty]$ as morphisms, and addition as composition. More precisely, if $(f,s) \colon (X,q) \to (Y,r)$ is any morphism in $\mathrm{FinStat}$, we define

$$RE(f,s) = S(q,p)$$

where the prior $p$ is defined as in Equation (1).
The fact that $RE$ is a functor is nontrivial and rather interesting. It says that given any composable pair of measurement processes:

$$(X,q) \xrightarrow{(f,s)} (Y,r) \xrightarrow{(g,t)} (Z,u)$$

the relative entropy of their composite is the sum of the relative entropies of the two parts:

$$RE((g,t) \circ (f,s)) = RE(g,t) + RE(f,s) .$$
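On a toy example (again, the names and numbers below are made up, not taken from the paper), we can check Equation (1), the definition of $RE$, and this additivity numerically. The composite hypothesis assigns to each outcome $z$ the distribution $x \mapsto \sum_y s_{xy} t_{yz}$ on $X$:

```python
from math import log, inf

def rel_entropy(q, p):
    """S(q,p) with the usual 0 and infinity conventions."""
    total = 0.0
    for x in q:
        if q[x] == 0:
            continue
        if p[x] == 0:
            return inf
        total += q[x] * log(q[x] / p[x])
    return total

def prior(s, r, X):
    """Equation (1): p_x = sum_y s_{xy} r_y."""
    return {x: sum(s[y][x] * r[y] for y in r) for x in X}

def RE(q, r, s, X):
    """RE(f, s) = S(q, p) with p as in Equation (1)."""
    return rel_entropy(q, prior(s, r, X))

# Toy composable pair (X,q) --(f,s)--> (Y,r) --(g,t)--> (Z,u).
X, Y, Z = ['x1', 'x2', 'x3'], ['y1', 'y2'], ['z1']
q = {'x1': 0.5, 'x2': 0.3, 'x3': 0.2}
r = {'y1': 0.8, 'y2': 0.2}
u = {'z1': 1.0}
s = {'y1': {'x1': 0.6, 'x2': 0.4, 'x3': 0.0},
     'y2': {'x1': 0.0, 'x2': 0.0, 'x3': 1.0}}
t = {'z1': {'y1': 0.5, 'y2': 0.5}}

# Composite hypothesis: probability of x given z is sum_y s_{xy} t_{yz}.
st = {z: {x: sum(s[y][x] * t[z][y] for y in Y) for x in X} for z in Z}

lhs = RE(q, u, st, X)                  # relative entropy of the composite
rhs = RE(q, r, s, X) + RE(r, u, t, Y)  # sum of the two relative entropies
print(lhs, rhs)                        # equal up to rounding error
```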
We prove that $RE$ is a functor. However, we go much further: we characterize relative entropy by saying that, up to a constant multiple, $RE$ is the unique functor from $\mathrm{FinStat}$ to $[0,\infty]$ obeying three reasonable conditions.
The first condition is that $RE$ vanishes on morphisms $(f,s) \colon (X,q) \to (Y,r)$ where the hypothesis $s$ is optimal. By this, we mean that Equation (1) gives a prior $p$ equal to the 'true' probability distribution $q$ on the states of the system being measured.
The second condition is that $RE$ is lower semicontinuous. The set $P(X)$ of probability distributions on a finite set $X$ naturally has the topology of an $(n-1)$-simplex when $X$ has $n$ elements. The set $[0,\infty]$ has an obvious topology where it's homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

$$S \colon P(X) \times P(X) \to [0,\infty] .$$

The problem is that

$$S(q,p) = \sum_{x \in X} q_x \ln\left(\frac{q_x}{p_x}\right)$$

and we define $q_x \ln\left(\frac{q_x}{p_x}\right)$ to be $\infty$ when $p_x = 0$ and $q_x > 0$, but $0$ when $p_x = q_x = 0$. So, it turns out that $S$ is only lower semicontinuous, meaning that if $p^i$ and $q^i$ are sequences of probability distributions on $X$ with $p^i \to p$ and $q^i \to q$, then

$$S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i) .$$

We give the set of morphisms in $\mathrm{FinStat}$ its most obvious topology, and show that with this topology, $RE$ maps morphisms to morphisms in a lower semicontinuous way.
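Here is a tiny numerical illustration of the failure of continuity (my own example, not one from the paper): take a two-element set, fix $p = q = (1,0)$, and let $q^i = (1 - 1/i,\ 1/i) \to q$. Then $S(q^i,p) = \infty$ for every $i$, while $S(q,p) = 0$: the value can jump down in the limit, but lower semicontinuity says it can never jump up.

```python
from math import log, inf

def rel_entropy(q, p):
    """S(q,p) with the conventions 0 ln(0/p_x) = 0 and q_x ln(q_x/0) = inf for q_x > 0."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue
        if px == 0:
            return inf
        total += qx * log(qx / px)
    return total

p = (1.0, 0.0)                      # the limiting prior has a zero entry
q = (1.0, 0.0)                      # the limiting 'true' distribution

for i in (10, 100, 1000):
    qi = (1 - 1/i, 1/i)             # q^i -> q as i -> infinity
    print(i, rel_entropy(qi, p))    # always inf: q^i puts mass where p doesn't

print('limit:', rel_entropy(q, p))  # 0 <= liminf = inf: lower semicontinuity, not continuity
```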
The third condition is that $RE$ is convex linear. We describe how to take convex linear combinations of morphisms in $\mathrm{FinStat}$, and then the functor $RE$ is convex linear in the sense that it maps any convex linear combination of morphisms in $\mathrm{FinStat}$ to the corresponding convex linear combination of numbers in $[0,\infty]$. Intuitively, this means that if we take a coin with probability $\lambda$ of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is $\lambda$ times the expected information gain of the first process plus $1 - \lambda$ times the expected information gain of the second process.
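To spell out the coin-flip picture numerically (a sketch with made-up numbers, not taken from the paper): the combined process lives on the disjoint union of the two state spaces, with the prior and 'true' distributions of the first process scaled by $\lambda$ and those of the second by $1 - \lambda$. The relative entropies then combine convex-linearly, because the $\lambda$'s cancel inside each logarithm:

```python
from math import log, inf

def rel_entropy(q, p):
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue
        if px == 0:
            return inf
        total += qx * log(qx / px)
    return total

# Two measurement processes, each summarized by its 'true' distribution and
# prior on the system being measured (toy numbers).
q1, p1 = [0.5, 0.3, 0.2], [0.48, 0.32, 0.20]
q2, p2 = [0.9, 0.1],      [0.5, 0.5]

lam = 0.25   # probability that the coin lands heads up

# Flip the coin, then run process 1 or process 2: the combined system lives
# on the disjoint union, with probabilities scaled by lam and 1 - lam.
q_comb = [lam * x for x in q1] + [(1 - lam) * x for x in q2]
p_comb = [lam * x for x in p1] + [(1 - lam) * x for x in p2]

lhs = rel_entropy(q_comb, p_comb)
rhs = lam * rel_entropy(q1, p1) + (1 - lam) * rel_entropy(q2, p2)
print(lhs, rhs)   # equal: the lam's cancel inside each logarithm
```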
Here, then, is our main theorem:
Theorem. Any lower semicontinuous, convex-linear functor

$$F \colon \mathrm{FinStat} \to [0,\infty]$$

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant $c \in [0,\infty]$ such that

$$F(f,s) = c \, RE(f,s)$$

for any morphism $(f,s)$ in $\mathrm{FinStat}$.
Addendum
If you’re a maniacally thorough reader of this blog, with a photographic memory, you’ll recall that our theorem now says ‘lower semicontinuous’, where in Part 2 of this series I’d originally said ‘continuous’.
I’ve fixed that blog article now… but it was Tobias who noticed this mistake. In the process of fixing our proof to address this issue, he eventually noticed that the proof of Petz’s theorem, which we’d been planning to use in our work, was also flawed.
Here’s the paper we eventually wrote:
• John Baez and Tobias Fritz, A Bayesian characterization of relative entropy, Theory and Applications of Categories 29 (2014), 421–456.
And here’s my whole series of blog articles about it:
• Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.
• Relative Entropy (Part 2): a category related to statistical inference, and how relative entropy defines a functor on this category.
• Relative Entropy (Part 3): how to characterize relative entropy as a functor from $\mathrm{FinStat}$ to $[0,\infty]$.
• Relative Entropy (Part 4): wrap-up, and an invitation to read more about the underlying math at the n-Category Café.
As long as you’re giving aliases, include “cross-entropy”.
Actually it looks like cross-entropy is different:

$$-\sum_x q_x \ln p_x$$

rather than

$$\sum_x q_x \ln\left(\frac{q_x}{p_x}\right) .$$
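Concretely, with the conventions of the post, the two differ by the Shannon entropy of $q$:

$$-\sum_x q_x \ln p_x \;=\; \sum_x q_x \ln\left(\frac{q_x}{p_x}\right) \;-\; \sum_x q_x \ln q_x ,$$

so cross-entropy is relative entropy plus a quantity that depends only on $q$.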
What do the new results tell us about Probability Learning experiments?
I don’t know! Not much unless someone understands our results, knows about those experiments, and works hard to make a connection. No such person exists yet.
Typo: missing period after “given a certain prior”.
Thanks, fixed! Happy forthcoming New Year!
It seems to me that you're doing something like rephrasing Shannon's original characterization of entropy in category theory terms. Is that fair, or am I totally off base?
There are lots of characterizations of entropy and relative entropy. The original Shannon–Khinchine characterization of Shannon entropy includes a condition that entropy is maximized when all outcomes are equally likely. This seems awkward to formulate in category-theoretic terms. It’s easier to use characterizations that involve only equations together with a continuity requirement.
A guy named Faddeev characterized entropy using only equations and a continuity requirement. In a previous paper written with a mutual friend, Tobias and I reformulated Faddeev’s characterization in category-theoretic terms:
• John Baez, Tobias Fritz and Tom Leinster, A characterization of entropy in terms of information loss.
In our new paper we started by trying to phrase Petz’s characterization of relative entropy in category-theoretic terms. Petz’s theorem turned out to be flawed, so Tobias had to fix it. But we still owe our inspiration to him, and
we got a characterization involving only equations and a lower semicontinuity requirement. (For what it’s worth, lower semicontinuity can be thought of as continuity with respect to a funny topology on the real line.)
For a comparison of Faddeev’s characterization of entropy and the Shannon–Khinchine characterization, try this:
• Shigeru Furuichi, On uniqueness theorems for Tsallis entropy and Tsallis relative entropy.
This covers something more general called 'Tsallis entropy', but if you take the limit as the Tsallis parameter approaches 1, you get back to Shannon entropy.
Our paper with Tom Leinster covered the more general Tsallis case, but now that we’re doing relative entropy we’re having enough trouble just handling the Shannon case!
Thanks! (Is the weird topology you’re referring to Scott topology or upper topology or…?)
It’s the upper topology, where the open sets are those of the form $(c, \infty]$, together with the empty set and all of $[0,\infty]$. I said ‘real line’, but that wasn’t quite right: relative entropy can be infinite, so we should really think of it as taking values in $[0,\infty]$, and a function from a topological space taking values in $[0,\infty]$ is lower semicontinuous iff it’s continuous when $[0,\infty]$ is given the upper topology.
Tobias Fritz, a postdoc at the Perimeter Institute, is working with me on category-theoretic aspects of information theory. We published a paper on entropy with Tom Leinster, and we’ve got a followup on relative entropy that’s almost done. I should be working on it right this instant! But for now, read the series of posts here on Azimuth: Relative Entropy Part 1, Part 2 and Part 3. […]
Now Tobias Fritz and I have finally finished our paper on this subject:
• A Bayesian characterization of relative entropy.
Two comments: first, relative entropy does not reduce to ordinary entropy with a uniform prior, even up to a constant. It reduces to the negative of ordinary entropy up to a constant. (This is why I’m tempted to call “relative entropy” the negative of what you wrote.)
Second, it doesn’t follow from the definition that relative entropy is nonnegative. The problem is that you can’t a priori control the signs of the logarithms. It’s a nontrivial fact that this is true.
Qiaochu wrote:

"Relative entropy does not reduce to ordinary entropy with a uniform prior, even up to a constant. It reduces to the negative of ordinary entropy up to a constant."
Yeah, the missing minus sign annoys me too. Lots of people call this quantity relative entropy, but in a later paper I wrote:
Most of our results involve a quantity that is variously known as ‘relative information’, ‘relative entropy’, ‘information gain’ or the ‘Kullback–Leibler divergence’. We shall use the first term. Given two probability distributions $q$ and $p$ on a finite set $X$, their relative information, or more precisely the information of $q$ relative to $p$, is

$$I(q,p) = \sum_{x \in X} q_x \ln\left(\frac{q_x}{p_x}\right) .$$

We use the word ‘information’ instead of ‘entropy’ because one expects entropy to increase with time, and the theorems we present will say that $I(q,p)$ decreases with time under various conditions. The reason is that the Shannon entropy

$$S(q) = -\sum_{x \in X} q_x \ln q_x$$

contains a minus sign that is missing from the definition of relative information.
Qiaochu wrote:

"It doesn't follow from the definition that relative entropy is nonnegative. The problem is that you can't a priori control the signs of the logarithms. It's a nontrivial fact that this is true."
I’m not sure what you’re asserting is true.
However, when $q$ and $p$ are probability distributions on a finite set, I believe that

$$S(q,p) = \sum_x q_x \ln\left(\frac{q_x}{p_x}\right) \;\ge\; 0 ,$$

though it may be infinite: we define $0 \ln(0/p_x)$ to be $0$, but $q_x \ln(q_x/0) = \infty$ when $q_x > 0$, to handle cases involving the logarithm of zero.
I believe the same sort of inequality holds whenever $q$ and $p$ are probability measures and $q$ is absolutely continuous with respect to $p$.
If you have a counterexample, I’d like to see it!
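For what it's worth, the standard one-line argument uses $\ln x \le x - 1$: summing over the $x$ with $q_x > 0$ (terms with $p_x = 0$ contribute $+\infty$ and only help),

$$S(q,p) = -\sum_x q_x \ln\left(\frac{p_x}{q_x}\right) \;\ge\; -\sum_x q_x \left(\frac{p_x}{q_x} - 1\right) \;=\; \sum_x q_x - \sum_x p_x \;\ge\; 0 .$$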
You might like the argument on page 16 here: it gives non-negativity for a generalization of relative information to ‘non-normalized probability distributions’ on finite sets.