In the first part of this mini-series, I describe how various ideas important in probability theory arise naturally when you start doing linear algebra using only the nonnegative real numbers.
But after writing it, I got an email from a rather famous physicist saying he got “lost at line two”. So, you’ll be happy to hear that the first part is not a prerequisite for the remaining parts! I wrote it just to intimidate that guy.
Tobias Fritz and I have proved a theorem characterizing the concept of relative entropy, which is also known as ‘relative information’, ‘information gain’ or—most terrifying and least helpful of all—‘Kullback-Leibler divergence’. In this second part I’ll introduce two key players in this theorem. The first,
is a category where:
• an object consists of a system with finitely many states, and a probability distribution on those states
and
• a morphism consists of a deterministic ‘measurement process’ mapping states of one system to states of another, together with a ‘hypothesis’ that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.
The second,
is a subcategory of
It has all the same objects, but only morphisms where the hypothesis is ‘optimal’. This means that if the observer measures the system many times, and uses the probability distribution of their observations together with their hypothesis to guess the probability distribution of states of the system, they get the correct answer (in the limit of many measurements).
In this part all I will really do is explain precisely what
and
are. But to whet your appetite, let me explain how we can use them to give a new characterization of relative entropy!
Suppose we have any morphism in
In other words: suppose we have a deterministic measurement process, together with a hypothesis that lets the observer guess a probability distribution of states of the system being measured, based on what they observe.
Then we have two probability distributions on the states of the system being measured! First, the ‘true’ probability distribution. Second, the probability that the observer will guess based on their observations.
Whenever we have two probability distributions on the same set, we can compute the entropy of the first relative to to the second. This describes how surprised you’ll be if you discover the probability distribution is really the first, when you thought it was the second.
So: any morphism in
will have a relative entropy. It will describe how surprised the observer will be when they discover the true probability distribution, given what they had guessed.
But this amount of surprise will be zero if their hypothesis was ‘optimal’ in the sense I described. So, the relative entropy will vanish on morphisms in 
Our theorem says this fact almost characterizes the concept of relative entropy! More precisely, it says that any convex-linear lower semicontinuous functor
![F : \mathrm{FinStat} \to [0,\infty] F : \mathrm{FinStat} \to [0,\infty]](https://s0.wp.com/latex.php?latex=F+%3A+%5Cmathrm%7BFinStat%7D+%5Cto+%5B0%2C%5Cinfty%5D+&bg=ffffff&fg=333333&s=0)
that vanishes on the subcategory
must equal some constant times the relative entropy.
Don’t be scared! This should not make sense to you yet, since I haven’t said how I’m thinking of
as a category, nor what a ‘convex-linear lower semicontinuous functor’ is, nor how relative entropy gives one. I will explain all that later. I just want you to get a vague idea of where I’m going.
Now let me explain the categories
and
We need to warm up a bit first.
FinStoch
A stochastic map
is different from an ordinary function, because instead of assigning a unique element of
to each element of
it assigns a probability distribution on
to each element of
So you should imagine it as being like a function ‘with random noise added’, so that
is not a specific element of
but instead has a probability of taking on different values. This is why I’m using a weird wiggly arrow to denote a stochastic map.
More formally:
Definition. Given finite sets
and
a stochastic map
assigns a real number
to each pair
such that fixing any element
the numbers
form a probability distribution on
We call
the probability of
given 
In more detail:
•
for all

and
•
for all 
Note that we can think of
as a
-shaped matrix of numbers. A matrix obeying the two properties above is called stochastic. This viewpoint is nice because it reduces the problem of composing stochastic maps to matrix multiplication. It’s easy to check that multiplying two stochastic matrices gives a stochastic matrix. So, composing stochastic maps gives a stochastic map.
We thus get a category:
Definition. Let
be the category of finite sets and stochastic maps between them.
In case you’re wondering why I’m restricting attention to finite sets, it’s merely because I want to keep things simple. I don’t want to worry about whether sums or integrals converge.
FinProb
Now take your favorite 1-element set and call it
A function
is just a point of
But a stochastic map
is something more interesting: it’s a probability distribution on
Why? Because it gives a probability distribution on
for each element of
but that set has just one element.
Last time I introduced the rather long-winded phrase finite probability measure space to mean a finite set with a probability distribution on it. But now we’ve seen a very quick way to describe such a thing within 
And this gives a quick way to think about a measure-preserving function between finite probability measure spaces! It’s just a commutative triangle like this:
Note that the horizontal arrow
is not wiggly. The straight arrow means it’s an honest function, not a stochastic map. But a function is a special case of a stochastic map! So it makes sense to compose a straight arrow with a wiggly arrow—and the result is, in general, a wiggly arrow. So, it makes sense to demand that this triangle commutes, and this says that the function
is measure-preserving.
Let me work through the details, in case they’re not clear.
First: how is a function a special case of a stochastic map? Here’s how. If we start with a function
we get a matrix of numbers

where
is the Kronecker delta. So, each element
gives a probability distribution that’s zero except at 
Given this, we can work out what this commuting triangle really says:
If use
to stand for the probability distribution that
puts on
and similarly for
the commuting triangle says

or in other words:

or if you like:

In this situation people say
is
pushed forward along
, and they say
is a measure-preserving function.
So, we’ve used
to describe another important category:
Definition. Let
be the category of finite probability measure spaces and measure-preserving functions between them.
I can’t resist mentioning another variation:
A commuting triangle like this is a measure-preserving stochastic map. In other words,
gives a probability measure on
gives a probability measure on
and
is a stochastic map with

FinStat
The category we really need for relative entropy is a bit more subtle. An object is a finite probability measure space:
but a morphism looks like this:
The whole diagram doesn’t commute, but the two equations I wrote down hold. The first equation says that
is a measure-preserving function. In other words, this triangle, which we’ve seen before, commutes:
The second equation says that
is the identity, or in math jargon,
is a section for 
But what does that really mean?
The idea is that
is the set of ‘states’ of some system, while
is a set of possible ‘observations’ you might make. The function
is a ‘measurement process’. You ‘measure’ the system using
and if the system is in the the state
you get the observation
The probability distribution
says the probability that the system is any given state, while
says the probability that you get any given observation when you do your measurement.
Note: are assuming for now that that there’s no random noise in the observation process! That’s why
is a function instead of a stochastic map.
But what about
That’s the fun part:
describes your ‘hypothesis’ about the system’s state given a particular measurement! If you measure the system and get a result
you guess it’s in the state
with probability
And we don’t want this hypothesis to be really dumb: that’s what

says. You see, this equation says that

or in other words:

If you think about it, this implies
unless
So, if you make an observation
you will guess the system is in state
with probability zero unless
In short, you won’t make a really dumb guess about the system’s state.
Here’s how we compose morphisms:
We get a measure-preserving function
and a stochastic map going back,
You can check that these obey the required equations:


So, we get a category:
Definition. Let
be the category where an object is a finite probability measure space:
a morphism is a diagram obeying these equations:
and composition is defined as above.
FP
As we’ve just seen, a morphism in
consists of a ‘measurement process’
and a ‘hypothesis’ 
But sometimes we’re lucky and our hypothesis is optimal, in the sense that

Conceptually, this says that if you take the probability distribution
on our observations and use it to guess a probability distribution for the system’s state using our hypothesis
you get the correct answer:
Mathematically, it says that this diagram commutes:
In other words,
is a measure-preserving stochastic map.
There’s a subcategory of
with all the same objects, but only these ‘optimal’ morphisms. It’s important, but the name we have for it is not very exciting:
Definition. Let
be the subcategory of
where an object is a finite probability measure space
and a morphism is a diagram obeying these equations:
Why do we call this category
? Because it’s a close relative of
where a morphism, you’ll remember, looks like this:
The point is that for a morphism in
the conditions on
are so strong that they completely determine it unless there are observations that happen with probability zero—that is, unless there are
with
To see this, note that

actually says

for any choice of
But we’ve already seen
unless
so the sum has just one term, and the equation says
We can solve this for
so
is completely determined… unless
This covers the case when
We also can’t figure out
if
isn’t in the image of 
So, to be utterly precise,
is determined by
and
unless there’s an element
that has
Except for this special case, a morphism in
is just a morphism in
But in this special case, a morphism in
has a little extra information: an arbitrary probability distribution on the inverse image of each point
with this property.
In short,
is the same as
except that our observer’s ‘optimal hypothesis’ must provide a guess about the state of the system given an observation, even in cases of observations that occur with probability zero.
I’m going into these nitpicky details for two reasons. First, we’ll need
for our characterization of relative entropy. But second, Tom Leinster already ran into this category in his work on entropy and category theory! He discussed it here:
• Tom Leinster, An operadic introduction to entropy.
Despite the common theme of entropy, he arrived at it from a very different starting-point.
Conclusion
So, I hope that next time I can show you something like this:
and you’ll say “Oh, that’s a probability distribution on the states of some system!” Intuitively, you should think of the wiggly arrow
as picking out a ‘random element’ of the set 
I hope I can show you this:
and you’ll say “Oh, that’s a deterministic measurement process, sending a probability distribution on the states of the measured system to a probability distribution on observations!”
I hope I can show you this:
and you’ll say “Oh, that’s a deterministic measurement process, together with a hypothesis about the system’s state, given what is observed!”
And I hope I can show you this:
and you’ll say “Oh, that’s a deterministic measurement process, together with an optimal hypothesis about the system’s state, given what is observed!”
I don’t count on it… but I can hope.
Postscript
And speaking of unrealistic hopes, if I were really optimistic I would hope you noticed that
and
which underlie the more fancy categories I’ve discussed today, were themselves constructed starting from linear algebra over the nonnegative numbers
in Part 1. That ‘foundational’ work is not really needed for what we’re doing now. However, I like the fact that we’re ultimately getting the concept of relative entropy starting from very little: just linear algebra, using only nonnegative numbers!