Holidays are great. There’s nothing I need to do! Everybody is celebrating! So, I can finally get some real work done.
In the last couple of days I’ve finished a paper with Jamie Vicary on wormholes and entanglement… subject to his approval and corrections. More on that later. And now I’ve returned to working on a paper with Tobias Fritz where we give a Bayesian characterization of the concept of ‘relative entropy’. This summer I wrote two blog articles about this paper:
• Relative Entropy (Part 1): how various structures important in probability theory arise naturally when you do linear algebra using only the nonnegative real numbers.
• Relative Entropy (Part 2): a category related to statistical inference, and how relative entropy defines a functor on this category.
But then Tobias Fritz noticed a big problem. Our characterization of relative entropy was inspired by this paper:
• D. Petz, Characterization of the relative entropy of states of matrix algebras, Acta Math. Hungar. 59 (1992), 449–455.
Here Petz sought to characterize relative entropy both in the ‘classical’ case we are concerned with and in the more general ‘quantum’ setting. Our original goal was merely to express his results in a more category-theoretic framework! Unfortunately Petz’s proof contained a significant flaw. Tobias noticed this and spent a lot of time fixing it, with no help from me.
Our paper is now self-contained, and considerably longer. My job now is to polish it up and make it pretty. What follows is the introduction, which should explain the basic ideas.
A Bayesian characterization of relative entropy
This paper gives a new characterization of the concept of relative entropy, also known as ‘relative information’, ‘information gain’ or ‘Kullback–Leibler divergence’. Whenever we have two probability distributions $p$ and $q$ on the same finite set $X$, we can define the entropy of $q$ relative to $p$:

$$ S(q,p) = \sum_{x \in X} q_x \ln\left(\frac{q_x}{p_x}\right) $$

Here we set

$$ q_x \ln\left(\frac{q_x}{p_x}\right) $$

equal to $\infty$ when $p_x = 0$, unless $q_x$ is also zero, in which case we set it equal to $0$. Relative entropy thus takes values in $[0,\infty]$.
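In code, these conventions amount to a short function. Here is a minimal sketch in Python; the name `rel_entropy` is mine, not anything from the paper:

```python
import math

def rel_entropy(q, p):
    """Entropy of q relative to p, in nats, with the conventions above:
    a term q_x ln(q_x/p_x) is 0 when q_x = 0, and infinite when
    p_x = 0 but q_x > 0."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0:
            continue          # 0 ln(0/p_x) = 0, even when p_x = 0
        if px == 0:
            return math.inf   # q_x ln(q_x/0) = infinity
        total += qx * math.log(qx / px)
    return total

print(rel_entropy([0.5, 0.5], [0.5, 0.5]))  # 0.0: zero exactly when q = p
print(rel_entropy([1.0, 0.0], [0.5, 0.5]))  # ln 2, approximately 0.693
print(rel_entropy([0.5, 0.5], [1.0, 0.0]))  # inf
```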
Intuitively speaking, $S(q,p)$ is the expected amount of information gained when we discover the probability distribution is really $q$, when we had thought it was $p$. We should think of $p$ as a ‘prior’ and $q$ as a ‘posterior’. When we take $p$ to be the uniform distribution on $X$, relative entropy reduces to the ordinary Shannon entropy, up to a sign and an additive constant. The advantage of relative entropy is that it makes the role of the prior explicit.
Since Bayesian probability theory emphasizes the role of the prior, relative entropy naturally lends itself to a Bayesian interpretation: it measures how much information we gain given a certain prior. Our goal here is to make this precise in a mathematical characterization of relative entropy. We do this using a category $\mathrm{FinStat}$ where:
• an object $(X,q)$ consists of a finite set $X$ together with a probability distribution $q$ on that set;
• a morphism $(f,s) \colon (X,q) \to (Y,r)$ consists of a measure-preserving function $f$ from $X$ to $Y$, together with a probability distribution $s_y$ on $X$ for each element $y \in Y$, with the property that $s_{xy} = 0$ unless $f(x) = y$.
We can think of an object of $\mathrm{FinStat}$ as a system with some finite set of states together with a probability distribution on its states. A morphism

$$ (f,s) \colon (X,q) \to (Y,r) $$

then consists of two parts. First, there is a deterministic measurement process $f \colon X \to Y$ mapping states of the system being measured, $X$, to states of the measurement apparatus, $Y$. The condition that $f$ be measure-preserving says that, after the measurement, the probability that the apparatus be in any state $y \in Y$ is the sum of the probabilities of all states of $X$ leading to that outcome:

$$ r_y = \sum_{x \in f^{-1}(y)} q_x $$
Second, there is a hypothesis $s$: an assumption about the probability $s_{xy}$ that the system being measured is in the state $x \in X$ given any measurement outcome $y \in Y$.
Suppose we have any morphism

$$ (f,s) \colon (X,q) \to (Y,r) $$

in $\mathrm{FinStat}$. From this we obtain two probability distributions on the states of the system being measured. First, we have the probability distribution $p$ given by

$$ p_x = s_{x f(x)} \, r_{f(x)} \qquad (1) $$

This is our ‘prior’, given our hypothesis and the probability distribution of measurement results. Second we have the ‘true’ probability distribution $q$, which would be the posterior if we updated our prior using complete direct knowledge of the system being measured.
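To make equation (1) concrete, here is a small Python sketch with a made-up morphism; all the numbers are illustrative, chosen only to show how the prior is computed from the hypothesis and the distribution of measurement results:

```python
import math

# A toy morphism (f,s): (X,q) -> (Y,r) with X = {0,1,2}, Y = {0,1}.
# All numbers here are illustrative, not from the paper.
q = [0.2, 0.3, 0.5]      # 'true' distribution on X
f = [0, 0, 1]            # the measurement: f(0) = f(1) = 0, f(2) = 1
r = [0.5, 0.5]           # distribution of measurement results on Y

# The hypothesis: s[y][x] is the guessed probability of state x given
# outcome y; it must vanish unless f(x) = y.
s = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0]]

# f is measure-preserving: r_y is the total q-probability of f^{-1}(y)
for y in range(len(r)):
    assert math.isclose(r[y], sum(q[x] for x in range(len(q)) if f[x] == y))

# the prior of equation (1): p_x = s_{x f(x)} r_{f(x)}
p = [s[f[x]][x] * r[f[x]] for x in range(len(q))]
print(p)  # [0.25, 0.25, 0.5] -- close to, but not equal to, the 'true' q
```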
It follows that any morphism $(f,s) \colon (X,q) \to (Y,r)$ in $\mathrm{FinStat}$ has a relative entropy $S(q,p)$ associated to it. This is the expected amount of information we gain when we update our prior $p$ to the posterior $q$.
In fact, this way of assigning relative entropies to morphisms defines a functor

$$ RE \colon \mathrm{FinStat} \to [0,\infty] $$

where we use $[0,\infty]$ to denote the category with one object, the numbers $0 \le x \le \infty$ as morphisms, and addition as composition. More precisely, if

$$ (f,s) \colon (X,q) \to (Y,r) $$

is any morphism in $\mathrm{FinStat}$, we define

$$ RE(f,s) = S(q,p) $$

where the prior $p$ is defined as in equation (1).
The fact that $RE$ is a functor is nontrivial and rather interesting. It says that given any composable pair of measurement processes:

$$ (X,q) \xrightarrow{(f,s)} (Y,r) \xrightarrow{(g,t)} (Z,u) $$

the relative entropy of their composite is the sum of the relative entropies of the two parts:

$$ RE((g,t) \circ (f,s)) = RE(g,t) + RE(f,s) $$
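One can check this additivity numerically. The sketch below composes two toy morphisms, composing the underlying functions and matrix-multiplying the hypotheses; all names and numbers are my own illustration:

```python
import math

def rel_entropy(q, p):
    # entropy of q relative to p, in nats (all terms finite here)
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

def RE(f, s, q, r):
    # RE(f,s) = S(q,p) with the prior p_x = s_{x f(x)} r_{f(x)} of equation (1)
    p = [s[f[x]][x] * r[f[x]] for x in range(len(q))]
    return rel_entropy(q, p)

# a composable pair (X,q) --(f,s)--> (Y,r) --(g,t)--> (Z,u), toy numbers
q = [0.1, 0.3, 0.4, 0.2]
f = [0, 0, 1, 1];  r = [0.4, 0.6]
s = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.3, 0.7]]
g = [0, 0];        u = [1.0]
t = [[0.5, 0.5]]

# the composite: compose the functions, matrix-multiply the hypotheses
gf = [g[y] for y in f]
st = [[sum(s[y][x] * t[z][y] for y in range(len(r))) for x in range(len(q))]
      for z in range(len(u))]

lhs = RE(gf, st, q, u)
rhs = RE(f, s, q, r) + RE(g, t, r, u)
print(math.isclose(lhs, rhs))  # True: relative entropy adds under composition
```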
We prove that $RE$ is a functor. However, we go much further: we characterize relative entropy by saying that up to a constant multiple, $RE$ is the unique functor from $\mathrm{FinStat}$ to $[0,\infty]$ obeying three reasonable conditions.
The first condition is that $RE$ vanishes on morphisms $(f,s)$ where the hypothesis $s$ is optimal. By this, we mean that equation (1) gives a prior $p$ equal to the ‘true’ probability distribution $q$ on the states of the system being measured.
The second condition is that $RE$ is lower semicontinuous. The set $P_X$ of probability distributions on a finite set $X$ naturally has the topology of an $(n-1)$-simplex when $X$ has $n$ elements. The set $[0,\infty]$ has an obvious topology where it’s homeomorphic to a closed interval. However, with these topologies, the relative entropy does not define a continuous function

$$ S \colon P_X \times P_X \to [0,\infty] $$

The problem is that

$$ S(q,p) = \sum_{x \in X} q_x \ln\left(\frac{q_x}{p_x}\right) $$

and we define $q_x \ln(q_x/p_x)$ to be $\infty$ when $p_x = 0$ and $q_x > 0$, but $0$ when $p_x = q_x = 0$. So, it turns out that $S$ is only lower semicontinuous, meaning that if $p^i, q^i$ are sequences of probability distributions on $X$ with $p^i \to p$ and $q^i \to q$, then

$$ S(q,p) \le \liminf_{i \to \infty} S(q^i, p^i) $$
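A standard example shows why continuity fails but lower semicontinuity survives. In the sketch below (my own toy numbers), $S(q^i, p)$ is infinite for every $i$, yet the limiting value $S(q,p)$ is zero, so only the inequality $S(q,p) \le \liminf S(q^i,p^i)$ holds:

```python
import math

def rel_entropy(q, p):
    # conventions as in the text: 0 ln 0 = 0, and q_x ln(q_x/0) = infinity
    total = 0.0
    for qx, px in zip(q, p):
        if qx > 0 and px == 0:
            return math.inf
        if qx > 0:
            total += qx * math.log(qx / px)
    return total

# q^i = (1/i, 1 - 1/i) -> q = (0, 1), with the fixed prior p = (0, 1)
p = [0.0, 1.0]
for i in [10, 100, 1000]:
    print(rel_entropy([1 / i, 1 - 1 / i], p))  # inf every time
print(rel_entropy([0.0, 1.0], p))              # 0.0 in the limit
```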
We give the set of morphisms in $\mathrm{FinStat}$ its most obvious topology, and show that with this topology, $RE$ maps morphisms to morphisms in a lower semicontinuous way.
The third condition is that $RE$ is convex linear. We describe how to take convex linear combinations of morphisms in $\mathrm{FinStat}$, and then the functor $RE$ is convex linear in the sense that it maps any convex linear combination of morphisms in $\mathrm{FinStat}$ to the corresponding convex linear combination of numbers in $[0,\infty]$. Intuitively, this means that if we take a coin with probability $\lambda$ of landing heads up, and flip it to decide whether to perform one measurement process or another, the expected information gained is $\lambda$ times the expected information gain of the first process plus $1 - \lambda$ times the expected information gain of the second process.
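The coin-flip intuition can be checked directly at the level of distributions: model the flip as a $\lambda$-weighted mixture on the disjoint union of the two state spaces. The factors of $\lambda$ inside the logarithms cancel, so the relative entropy of the mixture is the convex combination of the two relative entropies. A sketch, with illustrative numbers of my own:

```python
import math

def rel_entropy(q, p):
    # entropy of q relative to p, in nats (all terms finite here)
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)

lam = 0.3
q1, p1 = [0.2, 0.8], [0.5, 0.5]            # posterior/prior, process 1
q2, p2 = [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]  # posterior/prior, process 2

# distributions on the disjoint union of the two state spaces
q = [lam * x for x in q1] + [(1 - lam) * x for x in q2]
p = [lam * x for x in p1] + [(1 - lam) * x for x in p2]

lhs = rel_entropy(q, p)
rhs = lam * rel_entropy(q1, p1) + (1 - lam) * rel_entropy(q2, p2)
print(math.isclose(lhs, rhs))  # True
```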
Here, then, is our main theorem:
Theorem. Any lower semicontinuous, convex-linear functor

$$ F \colon \mathrm{FinStat} \to [0,\infty] $$

that vanishes on every morphism with an optimal hypothesis must equal some constant times the relative entropy. In other words, there exists some constant $c \in [0,\infty]$ such that

$$ F(f,s) = c \, RE(f,s) $$

for any morphism $(f,s) \colon (X,q) \to (Y,r)$ in $\mathrm{FinStat}$.
If you’re a maniacally thorough reader of this blog, with a photographic memory, you’ll recall that our theorem now says ‘lower semicontinuous’, where in Part 2 of this series I’d originally said ‘continuous’.
I’ve fixed that blog article now… but it was Tobias who noticed this mistake. In the process of fixing our proof to address this issue, he eventually noticed that the proof of Petz’s theorem, which we’d been planning to use in our work, was also flawed.
Now I just need to finish polishing the rest of the paper!